docs: architectural analysis of top repos (CockroachDB, Prometheus, Ecto, Oban)
Four documents examining codebases at module and ecosystem levels:

- architectural-analysis.md: internal structure, dependency flow
- ecosystem-analysis.md: consumer extension points, deliberate absences
- crosscutting-analysis.md: logging, config, retry, lifecycle
- testing-evolution-analysis.md: proof models, API evolution strategies
@@ -0,0 +1,340 @@
# Architectural Patterns from Top Repos

## CockroachDB: How to Organize 20,000 Files

### The 116-Package Principle

CockroachDB has 116 packages under `pkg/util/` averaging
**4 files each**. This is deliberate:

**Force:** A 2M-line codebase where developers work on
different subsystems simultaneously. If `pkg/util` were
5 big packages, every PR would conflict.

**Pattern:** One concept = one package. `circuit/` is 3
files (breaker, options, signal). `quotapool/` is 5 files.
`stop/` is 2 files. The package boundary IS the API
boundary: no internal debates about what is exported.

**Naming:** Single-concept nouns. No `helpers`, no
`common`, no `shared`. Every package name tells you what
it does: `cancelchecker`, `ctxgroup`, `syncutil`.

### Dependency Layering

```
sql → kv → storage → util
 ↓    ↓       ↓
 ↓    ↓    roachpb (protobuf types)
 ↓    ↓       ↓
 ↓   keys ← util
 ↓
settings, config
```

**Critical insight:** `kv` depends on `sql` AND `sql`
depends on `kv`. Go rejects import cycles between packages
at compile time, so the mutual dependency lives at the
subsystem level. They handled it with interfaces +
callback registration rather than by removing the
relationship. The `internal/` package provides the bridge.

`storage` uses `kv` (for transaction types) while `kv`
also uses `storage`. Again, interface boundaries keep the
package import graph acyclic.

**Lesson:** Perfect layering is rarely achievable in
distributed databases. The real skill is knowing where to
put the interface that breaks the cycle.
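
The interface-plus-registration idea can be sketched in a single file. This is a minimal illustration, not CockroachDB code: `SchemaResolver`, `RegisterResolver`, `DescribeKey`, and `catalog` are hypothetical names, and the two "layers" are marked by comments because a runnable snippet has to live in one package.

```go
package main

import "fmt"

// --- lower layer (think "kv"): owns the interface, never imports the upper layer ---

// SchemaResolver is the hook the lower layer needs from above.
// Hypothetical name, for illustration only.
type SchemaResolver interface {
	TableName(id int) (string, bool)
}

// resolver is filled in at startup by whoever implements the interface.
var resolver SchemaResolver

// RegisterResolver is the callback-registration entry point.
func RegisterResolver(r SchemaResolver) { resolver = r }

// DescribeKey uses the registered implementation without importing it.
func DescribeKey(tableID int, key string) string {
	if resolver != nil {
		if name, ok := resolver.TableName(tableID); ok {
			return name + "/" + key
		}
	}
	return fmt.Sprintf("table-%d/%s", tableID, key)
}

// --- upper layer (think "sql"): imports the lower layer and implements the hook ---

type catalog map[int]string

func (c catalog) TableName(id int) (string, bool) {
	name, ok := c[id]
	return name, ok
}

func main() {
	fmt.Println(DescribeKey(7, "a")) // nothing registered: falls back to the numeric form
	RegisterResolver(catalog{7: "users"})
	fmt.Println(DescribeKey(7, "a")) // resolved through the registered callback
}
```

The import arrow still points one way (upper to lower); only the call at run time flows back up, which is why the compiler accepts it.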

### Error Handling at Scale

They use `github.com/cockroachdb/errors`, their own
library that extends stdlib `errors` with:

- **Error marks:** Tag errors with metadata without
  changing the error chain
- **Wrapping with causes:** `errors.Wrap(err, "context")`
- **Safe printing:** `redact.Sprint` for log-safe errors
- **Network encoding:** Errors serialize across RPC
  boundaries

**Pattern:** Errors are first-class data that flows through
the entire system, surviving serialization across nodes.
Not just strings: structured, typed, matchable.

### Circuit Breaker (not stdlib)

```go
type Breaker struct {
	mu struct {
		syncutil.RWMutex
		errAndCh *errAndCh // stable Signal() results
		probing  bool
	}
}
```

**Key design:** `Signal()` returns a channel + error getter
(like `context.Done()` + `context.Err()`). The channel is
stable: a caller that already holds a reference keeps a
valid channel after it is closed. New callers get a new
channel after reset.

**Force:** In a distributed DB, a broken replica should
fail-fast all pending requests, then probe for recovery.
Context cancellation isn't enough because you need to
distinguish "gave up waiting" from "system is broken."

### QuotaPool: Abstract Resource Allocation

```go
type Resource interface{}
type Request interface {
	Acquire(ctx context.Context, r Resource) (
		fulfilled bool, tryAgainAfter time.Duration)
	ShouldWait() bool
}
```

**Pattern:** The pool is generic over any resource type.
Concrete implementations include:
- `IntPool`: weighted semaphore with FIFO ordering
- Rate limiters (via `tryAgainAfter`)
- Token buckets

**Force:** Different subsystems need different quota types
but the same queueing/fairness semantics. Abstract once,
instantiate many.

---

## Prometheus: Interface-Driven Storage Architecture

### The Contract Layer

`storage/interface.go` defines **15+ interfaces** that
form the entire query/storage contract:

```
Storage (top level)
├── Appendable → Appender (write path)
├── Queryable → Querier (read path)
├── ChunkQueryable → ChunkQuerier (bulk read)
├── ExemplarStorage (exemplars)
└── Searcher (experimental)
```

**Force:** Prometheus must support:
- Local TSDB (the main implementation)
- Remote read/write (remote storage integrations)
- Recording rules (virtual series)
- Testing (mock implementations)

All through the same interface. The contract layer is
the single source of truth for "what does storage mean."

### Compile-Time Interface Verification

```go
var _ storage.GetRef = &headAppender{}
var _ storage.Searcher = &blockBaseQuerier{}
```

Prometheus uses this pattern **8 times** in tsdb/ alone.
Every concrete type that claims to satisfy a storage
interface proves it at compile time.

**Why this matters at scale:** Storage interfaces evolve.
When `Searcher` was added, every type that should
implement it needed updating. The `var _` pattern makes
the compiler tell you what you missed.
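
A self-contained sketch of the same assertion, using a made-up `Appender` interface and `memAppender` type as stand-ins for the Prometheus ones. Adding a method to `Appender` without updating `memAppender` would make the `var _` line fail to compile.

```go
package main

import "fmt"

// Appender is a stand-in for a storage interface (hypothetical, trimmed).
type Appender interface {
	Append(ref uint64, v float64) error
	Commit() error
}

// memAppender is a toy implementation that buffers samples until Commit.
type memAppender struct {
	pending   map[uint64]float64
	committed map[uint64]float64
}

// Compile-time proof: this line stops compiling if *memAppender ever falls
// out of sync with Appender. It allocates nothing at run time.
var _ Appender = (*memAppender)(nil)

func newMemAppender() *memAppender {
	return &memAppender{
		pending:   map[uint64]float64{},
		committed: map[uint64]float64{},
	}
}

func (a *memAppender) Append(ref uint64, v float64) error {
	a.pending[ref] = v
	return nil
}

func (a *memAppender) Commit() error {
	for ref, v := range a.pending {
		a.committed[ref] = v
	}
	a.pending = map[uint64]float64{}
	return nil
}

func main() {
	var app Appender = newMemAppender()
	_ = app.Append(1, 0.5)
	fmt.Println(app.Commit() == nil)
}
```

The `(*T)(nil)` form avoids constructing a value; `&T{}` as in the excerpt above works equally well for plain structs.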

### Plugin Discovery via Channel

```go
type Discoverer interface {
	Run(ctx context.Context, up chan<- []*targetgroup.Group)
}
```

**Brilliance:** The entire service discovery system is one
interface with one method. Consul, DNS, Kubernetes, AWS:
all implement `Run`. They push target groups through a
channel. The manager multiplexes.

**Force:** Prometheus supports 20+ discovery mechanisms.
Adding one should require zero changes to the core. The
channel-based push model means the manager never polls.

### Atomic File Operations

Block lifecycle uses filesystem conventions:
- `.tmp-for-creation`: incomplete write
- `.tmp-for-deletion`: incomplete delete

On startup, scan and clean up. No WAL needed for
block-level operations because rename is atomic on POSIX
(within a single filesystem).

**Force:** TSDB blocks are large (hours of data). A WAL
for block operations would be overkill. The suffix
convention gives crash consistency with near-zero overhead.

---

## Ecto: Composability Through Data

### Query as Accumulating Struct

```elixir
defstruct prefix: nil, sources: nil, from: nil,
          joins: [], wheres: [], select: nil,
          order_bys: [], limit: nil, offset: nil,
          group_bys: [], updates: [], havings: [],
          preloads: [], distinct: nil, lock: nil,
          windows: [], with_ctes: nil
```

**Every query operation appends to a list or sets a
field.** Nothing is executed. The struct accumulates intent
until `Repo.all/Repo.one` triggers planning + execution.

**Force:** Queries must be composable (build in one
module, filter in another, paginate in a third). If
operations executed immediately, composition would require
the entire DB context at every step.

### Macro → Builder → Planner Pipeline

```
User writes:       from(u in User, where: u.age > 18)
                       ↓
Macro expands:     Builder.Filter.build(query, expr, env)
                       ↓
Builder produces:  %Ecto.Query.BooleanExpr{...}
                       ↓
Planner resolves:  types, bindings, params
                       ↓
Adapter generates: SQL string
```

Each builder module handles one clause type. There are
**15 builder modules** (from, join, filter, select, etc.).
The planner doesn't know about SQL: it resolves the
query struct into a normalized form that any adapter can
consume.

**Force:** Support multiple databases (Postgres, MySQL,
SQLite) with the same query language. The adapter is the
only part that knows SQL dialect.

### Protocol for Extensibility

The `Ecto.Queryable` protocol lets you pass:
- A module atom (`User`) → resolved to schema query
- A string (`"users"`) → raw table
- A tuple (`{"filtered_users", User}`) → view + schema
- An `Ecto.Query` struct → identity

**Force:** `Repo.all(X)` should work with any "queryable
thing." New queryable types can be added without touching
Repo code.

---

## Oban: Architecture for Testability

### Engine Swap by Config

```elixir
def get_engine(%{engine: engine, testing: :disabled}), do: engine
def get_engine(%{testing: :inline}), do: Oban.Engines.Inline
# :manual keeps the configured engine, so `engine` must be bound here too
def get_engine(%{engine: engine, testing: :manual}), do: engine
```

Three modes:
- **disabled** (production): real engine
- **inline** (unit test): execute in caller process
- **manual** (integration): enqueue but don't execute

**Force:** Background jobs are hard to test
without process control. Rather than making tests async
(flaky), make the engine deterministic.

### Flat Supervision with Named Registry

```elixir
children = [
  {Notifier, conf: conf, name: Registry.via(name, Notifier)},
  {Nursery, conf: conf, name: Registry.via(name, Nursery)},
  {Peer, conf: conf, name: Registry.via(name, Peer)},
  {Sonar, conf: conf, name: Registry.via(name, Sonar)},
  {Harbor, conf: conf, name: Registry.via(name, Harbor)}
]
```

Every child gets its config via `conf:` and its identity
via `Registry.via`. This means:
- Multiple Oban instances can run in the same VM
- Tests can start isolated Oban supervisors
- No global state: everything is namespaced

**Force:** Libraries can't own global names. Enterprise
apps run multiple Oban instances (different repos,
different queues). The Registry pattern makes this
possible without process naming conflicts.

### Behaviour as Plugin Contract

```elixir
# Plugin must be a GenServer AND implement these:
@callback start_link([option()]) :: GenServer.on_start()
@callback validate([option()]) :: :ok | {:error, String.t()}
```

**Force:** Plugins need lifecycle management (start, stop,
crash recovery) AND configuration validation. By requiring
both a behaviour AND OTP compliance, Oban gets:
- Fault isolation (supervisor restarts crashed plugins)
- Config validation at startup (fail fast)
- No coupling (any GenServer works)

---

## Cross-Cutting Insights

### 1. Interfaces at Boundaries, Structs Internally

All four codebases define interfaces at system boundaries
(storage, engine, discovery) but use concrete types
internally. The interface is the published contract; the
struct is the implementation detail.

### 2. Config as Validated Struct, Not Map

Every system validates config at startup and stores it as
a typed struct. Never a raw map floating around.

### 3. Testing is an Architecture Decision

Oban's engine swap, CockroachDB's stopper tracking,
Prometheus's mock interfaces: testability isn't bolted on,
it's designed in from day one.

### 4. Composition via Data, Not Inheritance

Ecto queries accumulate as data. Prometheus discoverers
push through channels. CockroachDB quota requests are
data objects. Nobody uses class hierarchies.

### 5. The Cycle Problem is Solved with Interfaces

CockroachDB has mutual dependencies between sql↔kv↔
storage. They break cycles with interface packages that
both sides depend on, which keeps the import graph
acyclic as the codebase grows.

### 6. Small Packages > Large Packages

CockroachDB: 4 files average per package.
Oban: focused modules (engine, worker, plugin).
Ecto: one builder per clause type.
The package boundary forces you to define the API.

<!-- PATTERN_COMPLETE -->
@@ -0,0 +1,301 @@
# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts

Cross-cutting concerns are the things that touch everything
but belong nowhere. How a codebase handles logging,
telemetry, config, retry, and lifecycle management reveals
its architectural philosophy more than any feature code.

---

## 1. Logging: From Strings to Semantic Channels

### CockroachDB: Channel-Based Log Routing

CockroachDB doesn't just log at severity levels; it
routes logs to **semantic channels**:

```go
const DEV = logpb.Channel_DEV       // development noise
const OPS = logpb.Channel_OPS       // operator actions
const HEALTH = logpb.Channel_HEALTH // background health
const STORAGE = logpb.Channel_STORAGE
const SESSIONS = logpb.Channel_SESSIONS
const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
const USER_ADMIN = logpb.Channel_USER_ADMIN
```

Each channel can be routed to different sinks (file,
network, etc.) independently. Production deploys typically
disable DEV entirely and route HEALTH to monitoring.

**Force:** In a multi-tenant distributed database, "who
cares about this log?" is a different question than "how
bad is it?" An INFO-level schema change matters to DBAs
but not to SREs monitoring node health.

**Ecosystem insight:** The channel IS the audience. When
you write `log.Health.Warningf(...)`, you're declaring
"the person watching cluster health needs to see this."
Severity is orthogonal to audience.

### Prometheus: Self-Instrumentation

Prometheus instruments itself with its own metrics:

```go
type scrapeMetrics struct {
	targetScrapeSampleLimit       prometheus.Counter
	targetScrapeSampleOutOfOrder  prometheus.Counter
	targetIntervalLengthHistogram *prometheus.HistogramVec
	// ... 20+ metrics
}
```

Metrics are collected in a struct, constructed once via
`newScrapeMetrics(reg)`, and passed to subsystems. No
global registration: the registerer is injected.

**Force:** Prometheus IS the metrics system. If it used
a different metrics library to instrument itself, that
would be a design smell. Dogfooding proves the API works.

### Ecto + Oban: Telemetry as Standard

Both use Erlang's `:telemetry` library with predictable
naming:

```elixir
# Oban
:telemetry.execute([:oban, :job, :start], measurements, meta)
:telemetry.execute([:oban, :job, :stop], measurements, meta)
:telemetry.execute([:oban, :job, :exception], measurements, meta)

# Ecto (adapter-emitted)
[:my_app, :repo, :query]
```

**Force:** The BEAM ecosystem standardized on `:telemetry`
for observability. Libraries don't own their monitoring;
they emit events, and consumers attach handlers. This inverts
the logging relationship: the library doesn't decide what
to do with the information.

---

## 2. Config Propagation: Three Models

### CockroachDB: Cluster Settings (Distributed Config)

```go
settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"bulkio.ingest.flush_delay",
	"amount of time to wait before sending a file...",
	0, // default
)
```

Settings are:
- **Typed** (Duration, Bool, Int, String)
- **Leveled** (ApplicationLevel vs SystemVisible)
- **Validated** (NonNegativeInt, etc.)
- **Distributed** (propagated across all nodes)
- **Version-gated** (new settings require cluster version)

Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)`

**Force:** In a distributed database, config isn't a file;
it's consensus. Every node must agree on every setting,
and settings can only be enabled once all nodes support
them. The version gate is the safety mechanism.

### Prometheus: ApplyConfig (Hot Reload)

```go
func (m *Manager) ApplyConfig(cfg *config.Config) error {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	// rebuild scrape pools from new config
	// close old loggers, open new ones
	return nil // body elided in this excerpt
}
```

Config is a struct loaded from YAML. On SIGHUP (or API
call), the entire config is re-parsed and `ApplyConfig`
is called on each subsystem. Each subsystem holds a mutex
and swaps atomically.

**Force:** Prometheus runs as a single binary. Config
reload must be atomic per-subsystem but doesn't need
distributed consensus. The mutex-per-subsystem pattern
gives independent reload without global coordination.
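
The mutex-per-subsystem reload can be sketched as follows. `Config`, `Reloadable`, `scrapeManager`, and `reload` are hypothetical, trimmed stand-ins for the real Prometheus types; the point is only that each subsystem validates, builds its next state, and swaps it under its own lock.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a hypothetical, trimmed config struct.
type Config struct {
	Jobs []string
}

// Reloadable is what each subsystem exposes to the reload loop.
type Reloadable interface {
	ApplyConfig(cfg *Config) error
}

// scrapeManager guards its own state with its own mutex.
type scrapeManager struct {
	mtx  sync.Mutex
	jobs []string
}

func (m *scrapeManager) ApplyConfig(cfg *Config) error {
	if len(cfg.Jobs) == 0 {
		return errors.New("scrape: no jobs configured")
	}
	next := append([]string(nil), cfg.Jobs...) // build the new state outside the lock
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.jobs = next // swap in one step; readers never see a half-applied config
	return nil
}

// Jobs returns a copy of the current job list.
func (m *scrapeManager) Jobs() []string {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	return append([]string(nil), m.jobs...)
}

// reload applies one parsed config to every subsystem and stops at the
// first failure. Subsystems that already applied it keep the new state.
func reload(cfg *Config, subsystems ...Reloadable) error {
	for _, s := range subsystems {
		if err := s.ApplyConfig(cfg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	m := &scrapeManager{}
	fmt.Println(reload(&Config{Jobs: []string{"node", "api"}}, m), m.Jobs())
	fmt.Println(reload(&Config{}, m), m.Jobs()) // rejected: old state stays live
}
```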

### Ecto + Oban: Config at Init, Validated Once

```elixir
# Oban validates exhaustively at startup
Validation.validate_schema(opts,
  engine: {:behaviour, Oban.Engine},
  queues: {:custom, &validate_queues/1},
  repo: {:module, [config: 0]},
  ...
)
```

Config is validated once at startup and stored as an
immutable struct. No hot reload. If config is wrong,
you know immediately (fail fast).

**Force:** Elixir/OTP applications restart processes to
apply new config. Hot reload is handled by supervisor
restarts, not config mutation. The "config as immutable
struct" pattern rules out a whole class of runtime config
bugs: config either passes validation at startup or the
app doesn't start.

---

## 3. Retry and Resilience

### CockroachDB: Iterator-Based Retry

```go
opts := retry.Options{
	InitialBackoff: 100 * time.Millisecond,
	MaxBackoff:     2 * time.Second,
	Multiplier:     2,
	MaxRetries:     5,
}
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
	// attempt is a placeholder for the operation being retried
	if err := attempt(ctx); err == nil {
		break
	}
}
```

Retry is a **for-loop iterator**. `r.Next()` handles
backoff timing and returns false when exhausted. This
means retry logic reads like normal code: no callbacks,
no framework.

**Force:** CockroachDB has hundreds of retry sites. A
callback-based retry would create deeply nested code.
The iterator pattern keeps retry at the same indentation
level as the operation.

### Oban: Repo Dispatch with Built-In Retry

```elixir
defp dynamic_dispatch(conf, name, args, attempt) do
  with_dynamic_repo(conf, fn repo ->
    apply(repo, name, args)
  end)
rescue
  error in UndefinedFunctionError ->
    if attempt < @retry_opts[:retry] do
      jittery_sleep(attempt * @retry_opts[:delay])
      dynamic_dispatch(conf, name, args, attempt + 1)
    else
      reraise error, __STACKTRACE__
    end
end
```

Every Ecto operation dispatched through Oban's repo
wrapper gets automatic retry for transient failures.
The consumer never sees the retry; it's invisible
infrastructure.

**Key insight:** Oban retries `UndefinedFunctionError`
on the repo module itself, absorbing the window during
hot code reload when the module doesn't exist. This is
an ecosystem-level concern (BEAM hot code loading) handled
transparently.

---

## 4. Resource Lifecycle: The Stopper Pattern

### CockroachDB: Stopper as Universal Lifecycle

```go
type Stopper struct { ... }

// RunTask runs a synchronous task
func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error

// RunAsyncTask runs a goroutine tracked by the stopper
func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error

// ShouldQuiesce returns a channel closed when shutdown begins
func (s *Stopper) ShouldQuiesce() <-chan struct{}

// Stop initiates graceful shutdown
func (s *Stopper) Stop(ctx context.Context)
```

By convention, background goroutines in CockroachDB are
launched through a Stopper. This gives:
- **Tracking**: know exactly which goroutines are running
- **Graceful shutdown**: quiesce signal before hard stop
- **Leak detection**: `PrintLeakedStoppers` in tests
- **Throttling**: semaphore limits async tasks

```go
func init() {
	leaktest.PrintLeakedStoppers = PrintLeakedStoppers
}
```

**Force:** A database cannot afford goroutine leaks:
they hold locks, connections, and file handles. The
Stopper is the universal answer: every background task
is accounted for, every shutdown is graceful, every leak
is detected in tests.

### Oban: Registry-Based Lifecycle

```elixir
children = [
  {Notifier, conf: conf, name: Registry.via(name, Notifier)},
  {Nursery, conf: conf, name: Registry.via(name, Nursery)},
  ...
]
```

OTP already provides lifecycle management via supervisors.
Oban's addition is the Registry, namespacing processes
so multiple instances can coexist. Lifecycle is delegated
to the platform; naming is the library's concern.

---

## 5. What These Patterns Teach for Code Review

### Questions to Ask About Cross-Cutting Concerns:

1. **Logging:** Who is the audience for this log? Is there
   a routing mechanism, or does everything go to stdout?
   Does the log help the *operator*, not just the developer?

2. **Config:** How does config reach this code? Is it
   validated at startup or silently wrong at runtime? Can
   it be changed without restart? Should it be?

3. **Retry:** Is retry happening at the right layer? Is it
   invisible to the caller? Does it have backoff + jitter?
   Does it respect context cancellation?

4. **Lifecycle:** Are background tasks tracked? Will they
   shut down gracefully? Can you detect leaks in tests?

5. **Telemetry:** Are events emitted or is logging the only
   observability? Can consumers attach their own handlers?

### Red Flags:

- `log.Info("something happened")` with no channel/audience
- Config read from environment at point-of-use (not validated)
- Retry logic duplicated in 5 places with different backoff
- Goroutines launched with `go func()` and no tracking
- No telemetry events, only log lines for observability

<!-- PATTERN_COMPLETE -->
@@ -0,0 +1,371 @@
# Ecosystem-Level Patterns: How Codebases Present to Consumers

## The Three Questions

For each codebase, ask:
1. How do consumers **extend** it? (What interfaces/behaviours
   do they implement?)
2. How do consumers **compose** with it? (What does day-to-day
   usage look like?)
3. What does it deliberately **NOT do**? (What forces shaped
   those refusals?)

---

## CockroachDB: Errors as First-Class Distributed Data

### Extension Points

CockroachDB is not a library; it is a system. Consumers
extend it through:
- **SQL builtins** (function registration)
- **Storage engines** (via pebble interface)
- **Service discovery** (closed, not user-extensible)

The interesting pattern is how errors flow from storage
through KV through SQL to the client.

### Error Architecture (ecosystem-level idiom)

```
Storage error → encoded via cockroachdb/errors →
KV wraps with context → serialized across gRPC →
SQL decodes → maps to pgcode → wire protocol to client
```

**Key design decisions:**

1. **Errors have priority.** `ErrPriority()` ranks errors so
   the system knows which to surface when multiple things
   fail simultaneously. Transaction abort > restart >
   unambiguous error > non-retriable.

2. **Errors survive serialization.** `EncodeError` /
   `DecodeError` serialize errors across RPC boundaries.
   The error that originated on node 3 arrives at node 1
   with its full cause chain intact.

3. **Errors map to pg codes.** Every internal error maps to
   a Postgres error code that clients understand. This is
   the *ecosystem contract*: clients write
   `if pgcode == '40001' { retry }`.

**What this teaches:** In a distributed system, an error
isn't a string; it's a data object with identity,
priority, serializability, and a consumer-facing code.
Design your error types for the *consumer*, not the
*producer*.

### Deliberate Absences

- **No dependency injection framework.** Config structs
  passed explicitly. 1178-line `StoreConfig` struct, but
  it's all data, no framework magic.
- **No context.Background() on hot paths.** 144 uses in
  kvserver, but auditable, each justified in comments.
- **No functional options.** CockroachDB uses config
  structs universally. The Option interface in stopper is
  the exception, not the rule.

### Test Architecture

- **TestMain in every package.** Sets up security certs,
  random seeds, and test server factories.
- **Goroutine leak detection.** `leaktest.AfterTest(t)()`
  at the start of every test. Detects leaked goroutines
  by diffing goroutine stacks before/after.
- **Stopper leak detection.** Every Stopper is tracked
  globally; `PrintLeakedStoppers(t)` in TestMain catches
  forgot-to-stop bugs.
- **`//go:generate` for test setup.** Codegen tool
  (`add-leaktest.sh`) auto-adds leak checks to every
  test file.

**What this teaches:** At scale, the most important test
infrastructure isn't assertions; it's resource leak
detection. Every goroutine, every connection, every
Stopper is tracked and verified to be cleaned up.

---

## Prometheus: The One-Method Interface Contract

### Extension Points

Prometheus is extended through:
- **Service discovery** (30 implementations, 1 interface)
- **Storage** (remote read/write adapters)
- **Exporters** (client_golang metrics)

### The Discoverer Pattern (ecosystem-level idiom)

```go
type Discoverer interface {
	Run(ctx context.Context, up chan<- []*targetgroup.Group)
}
```

This is **one method**. Thirty implementations. The
channel-based push model means:
- The discoverer controls timing (not polled)
- The manager multiplexes without knowing implementations
- Adding a new discovery source = implement Run, register

**Registration via init():**
```go
func init() {
	discovery.RegisterConfig(&SDConfig{})
}
```

This is the classic Go plugin pattern. Import the package
→ init registers it → the system discovers it at startup.

**What this teaches:** The smallest possible interface
creates the largest possible ecosystem. One method + one
channel = 30 implementations without coordination.
|
||||||
|
|
||||||
|
### Storage Contract (15 interfaces, 1 file)
|
||||||
|
|
||||||
|
All of Prometheus's storage contract lives in
|
||||||
|
`storage/interface.go`. It covers:
|
||||||
|
- Read path: `Queryable → Querier → SeriesSet → Series`
|
||||||
|
- Write path: `Appendable → Appender`
|
||||||
|
- Extension: `ExemplarAppender`, `MetadataUpdater`
|
||||||
|
|
||||||
|
**Key:** Every implementation proves satisfaction at
|
||||||
|
compile time with `var _ storage.Searcher = &T{}` (`T` being the concrete type).
|
||||||
|
When the contract evolves, the compiler finds every
|
||||||
|
broken implementation.
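
A minimal sketch of the compile-time assertion idiom, using a hypothetical `Sink` interface and `memSink` type rather than real Prometheus types:

```go
package main

import "fmt"

// Sink is a hypothetical one-method contract used only for this sketch.
type Sink interface {
	Write(sample float64) error
}

type memSink struct{ samples []float64 }

func (s *memSink) Write(sample float64) error {
	s.samples = append(s.samples, sample)
	return nil
}

// Compile-time proof that *memSink satisfies Sink. If a method is added to
// Sink and memSink is not updated, this line stops compiling.
var _ Sink = (*memSink)(nil)

func main() {
	var s Sink = &memSink{}
	fmt.Println(s.Write(1.5) == nil)
}
```

The assertion costs nothing at runtime: the blank identifier discards the value, and only the type check remains.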
|
||||||
|
|
||||||
|
### Deliberate Absences
|
||||||
|
|
||||||
|
- **No generics in storage interfaces.** Despite Go 1.18+
|
||||||
|
support. The interfaces predate generics and adding them
|
||||||
|
would break all existing implementations.
|
||||||
|
- **No dependency injection.** Direct struct construction
|
||||||
|
everywhere. Testability through interface satisfaction,
|
||||||
|
not framework wiring.
|
||||||
|
- **Almost no functional options.** Only in leaf packages
|
||||||
|
(chunk writer, parser). Core APIs use config structs.
|
||||||
|
- **No goroutine leaks tolerated.** `goleak` in
|
||||||
|
tests, `TolerantVerifyLeak` with explicit allowlist for
|
||||||
|
known third-party leaks.
|
||||||
|
|
||||||
|
### Test Architecture
|
||||||
|
|
||||||
|
- **`TolerantVerifyLeak`** — goroutine leak detection with
|
||||||
|
allowlist for known third-party leaks (opencensus, klog)
|
||||||
|
- **Mock implementations of every interface** — defined
|
||||||
|
right in `storage/interface.go` next to the real ones
|
||||||
|
- **Golden file tests** in PromQL evaluation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ecto: Composability as Architectural Principle
|
||||||
|
|
||||||
|
### Extension Points
|
||||||
|
|
||||||
|
Consumers extend Ecto through:
|
||||||
|
- **Custom types** (7 callbacks: cast, load, dump, equal?,
|
||||||
|
embed_as, autogenerate, type)
|
||||||
|
- **Adapters** (Queryable, Schema, Transaction, Storage —
|
||||||
|
4 behaviour modules)
|
||||||
|
- **Protocols** (`Ecto.Queryable` — anything can become a
|
||||||
|
query)
|
||||||
|
|
||||||
|
### The NotLoaded Sentinel (ecosystem-level idiom)
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defmodule Ecto.Association.NotLoaded do
|
||||||
|
defstruct [:__field__, :__owner__, :__cardinality__]
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Ecto **refuses to lazy-load associations**. If you access
|
||||||
|
`user.posts` without preloading, you get a `NotLoaded`
|
||||||
|
struct — not nil, not an empty list, not a database query.
|
||||||
|
|
||||||
|
**Why this is an ecosystem decision:**
|
||||||
|
- Forces consumers to be explicit about data needs
|
||||||
|
- Prevents accidental N+1 queries: no implicit load can fire
|
||||||
|
- Makes the data boundary visible in code
|
||||||
|
|
||||||
|
This is a *consumer-hostile* decision that makes
|
||||||
|
*systems built on Ecto* dramatically better. The library
|
||||||
|
optimizes for the 1000th user, not the first-day
|
||||||
|
experience.
|
||||||
|
|
||||||
|
### Query Composition (ecosystem-level idiom)
|
||||||
|
|
||||||
|
Every query clause appends to a list in the Query struct.
|
||||||
|
Nothing executes. The Query is pure data that accumulates
|
||||||
|
intent.
|
||||||
|
|
||||||
|
**Consumer impact:** You can build queries across module
|
||||||
|
boundaries:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
# Module A builds the base
|
||||||
|
def active_users, do: from(u in User, where: u.active)
|
||||||
|
|
||||||
|
# Module B adds pagination
|
||||||
|
def paginate(query, page, size) do
|
||||||
|
query
|
||||||
|
|> limit(^size)
|
||||||
|
|> offset(^((page - 1) * size))
|
||||||
|
end
|
||||||
|
|
||||||
|
# Module C adds authorization
|
||||||
|
def visible_to(query, role) do
|
||||||
|
where(query, [u], u.role in ^roles_for(role))
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Each module is independent. They compose because queries
|
||||||
|
are data, not effects.
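
The same idea, translated to Go purely for illustration (this is not how Ecto is implemented, and the `Query` type and its methods are hypothetical): builders return new values, and nothing runs until a separate layer renders or executes the result.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is plain data: building one never touches a database.
type Query struct {
	Table  string
	Wheres []string
	Limit  int
	Offset int
}

// Where returns a copy with one more condition; the receiver is left untouched.
func (q Query) Where(cond string) Query {
	q.Wheres = append(append([]string(nil), q.Wheres...), cond)
	return q
}

// Paginate returns a copy with limit and offset set for a 1-based page.
func (q Query) Paginate(page, size int) Query {
	q.Limit, q.Offset = size, (page-1)*size
	return q
}

// SQL renders the accumulated intent; only an adapter-like layer would call it.
func (q Query) SQL() string {
	var b strings.Builder
	b.WriteString("SELECT * FROM " + q.Table)
	if len(q.Wheres) > 0 {
		b.WriteString(" WHERE " + strings.Join(q.Wheres, " AND "))
	}
	if q.Limit > 0 {
		fmt.Fprintf(&b, " LIMIT %d OFFSET %d", q.Limit, q.Offset)
	}
	return b.String()
}

func main() {
	base := Query{Table: "users"}.Where("active")
	page := base.Where("role = 'admin'").Paginate(2, 10)
	fmt.Println(base.SQL())
	fmt.Println(page.SQL())
}
```

Because `Where` copies the slice, `base` can be reused by several callers without one caller's conditions leaking into another's query.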
|
||||||
|
|
||||||
|
### Adapter Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
Ecto.Repo.all(query)
|
||||||
|
→ Planner resolves types, bindings
|
||||||
|
→ Adapter.prepare/2 produces {:cache, prepared} or {:nocache, prepared}
|
||||||
|
→ Adapter.execute/5 runs against DB
|
||||||
|
→ Adapter.loaders/2 converts back to Elixir types
|
||||||
|
```
|
||||||
|
|
||||||
|
The adapter is the ONLY part that knows SQL. Ecto core
|
||||||
|
is database-agnostic. This is why the same code works on
|
||||||
|
Postgres, MySQL, SQLite, and custom stores.
|
||||||
|
|
||||||
|
### Deliberate Absences
|
||||||
|
|
||||||
|
- **No lazy loading.** `NotLoaded` struct instead.
|
||||||
|
- **No global state.** Per-repo config, per-repo process.
|
||||||
|
- **No query caching at library level.** The adapter
|
||||||
|
caches prepared statements; Ecto doesn't.
|
||||||
|
- **No coupling between module and table names.** `schema "legacy_tbl"`
is independent of `defmodule NewUser`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Oban: Designing for Testability First
|
||||||
|
|
||||||
|
### Extension Points
|
||||||
|
|
||||||
|
Consumers extend Oban through:
|
||||||
|
- **Workers** (`perform/1` — the job logic)
|
||||||
|
- **Plugins** (GenServer + validate callback)
|
||||||
|
- **Engines** (entire backend swap)
|
||||||
|
- **Notifiers** (pub/sub mechanism)
|
||||||
|
- **Peers** (leader election)
|
||||||
|
|
||||||
|
### The Worker Result Type (ecosystem-level idiom)
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
@type result ::
|
||||||
|
:ok
|
||||||
|
| {:ok, ignored :: term()}
|
||||||
|
| {:error, reason :: term()}
|
||||||
|
| {:cancel, reason :: term()}
|
||||||
|
| {:snooze, period :: Period.t()}
|
||||||
|
```
|
||||||
|
|
||||||
|
Five possible outcomes, each with distinct semantics:
|
||||||
|
- `:ok` → success, job is marked completed
- `{:ok, value}` → success as well; the value is ignored
|
||||||
|
- `{:error, reason}` → retry (respects max_attempts)
|
||||||
|
- `{:cancel, reason}` → permanent failure, don't retry
|
||||||
|
- `{:snooze, period}` → reschedule for later
|
||||||
|
|
||||||
|
**Ecosystem impact:** Every worker author makes an
|
||||||
|
explicit decision about failure semantics. "What should
|
||||||
|
happen when this fails?" is answered in the type system,
|
||||||
|
not in configuration.
|
||||||
|
|
||||||
|
### Contextual Backoff (ecosystem-level idiom)
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def backoff(%Job{attempt: attempt, unsaved_error: err}) do
|
||||||
|
case err.reason do
|
||||||
|
%RateLimitError{retry_after: ms} -> div(ms, 1000) # backoff/1 returns seconds
|
||||||
|
_ -> trunc(:math.pow(attempt, 4) + jitter())
|
||||||
|
end
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The error that caused the failure is available to the
|
||||||
|
backoff calculation. Different errors → different retry
|
||||||
|
strategies. This is impossible in systems where backoff
|
||||||
|
is configured globally.
|
||||||
|
|
||||||
|
### Testing Design (ecosystem-level idiom)
|
||||||
|
|
||||||
|
Three testing modes via config:
|
||||||
|
- **`:inline`** — execute jobs synchronously in tests
|
||||||
|
- **`:manual`** — enqueue but don't execute
|
||||||
|
- **`:disabled`** — production behavior
|
||||||
|
|
||||||
|
Plus `use Oban.Testing` which provides:
|
||||||
|
- `assert_enqueued/1` — verify job was queued
|
||||||
|
- `refute_enqueued/1` — verify job was NOT queued
|
||||||
|
- `perform_job/2` — execute a job manually in tests
|
||||||
|
- `all_enqueued/1` — list all matching jobs
|
||||||
|
|
||||||
|
**Ecosystem impact:** Every Oban consumer gets
|
||||||
|
deterministic, fast, isolated tests for free. No sleep,
|
||||||
|
no polling, no flaky async assertions.
|
||||||
|
|
||||||
|
### Deliberate Absences
|
||||||
|
|
||||||
|
- **No global process names.** Registry.via everywhere —
|
||||||
|
multiple Oban instances can coexist.
|
||||||
|
- **No direct DB coupling in workers.** Workers receive a
|
||||||
|
Job struct; they don't import Repo.
|
||||||
|
- **No unbounded retries.** `max_attempts` is finite (20 by default)
and can be set per worker. No "retry forever" default.
|
||||||
|
- **No built-in rate limiting in OSS.** That is a Pro
|
||||||
|
feature — deliberate business boundary.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Cross-Cutting: What "Idiomatic" Means at Ecosystem Level
|
||||||
|
|
||||||
|
### 1. The Consumer Contract is the API
|
||||||
|
|
||||||
|
Not the functions you export — the *experience* of
|
||||||
|
building on your system:
|
||||||
|
- CockroachDB: "Your errors will be pg-codes, always"
|
||||||
|
- Prometheus: "Implement Run(), get discovery for free"
|
||||||
|
- Ecto: "Queries are data; loading is always explicit"
|
||||||
|
- Oban: "Return a result type; testing is built in"
|
||||||
|
|
||||||
|
### 2. Deliberate Absences Define Character
|
||||||
|
|
||||||
|
What a system refuses to do is as important as what it
|
||||||
|
does:
|
||||||
|
- Ecto refuses lazy loading → forces explicit data needs
|
||||||
|
- Oban refuses global names → enables multi-instance
|
||||||
|
- Prometheus refuses DI frameworks → keeps simplicity
|
||||||
|
- CockroachDB refuses context.Background on hot paths →
|
||||||
|
forces timeout discipline
|
||||||
|
|
||||||
|
### 3. Testability is Never Retrofitted
|
||||||
|
|
||||||
|
Every system that tests well designed testing in from the
|
||||||
|
start:
|
||||||
|
- CockroachDB: leak detection, stopper tracking
|
||||||
|
- Prometheus: goroutine leak verification, mock interfaces
|
||||||
|
- Ecto: adapter abstraction, embedded schemas for testing
|
||||||
|
- Oban: engine swap, testing modes, assertion helpers
|
||||||
|
|
||||||
|
### 4. Extension Points Define the Ecosystem Size
|
||||||
|
|
||||||
|
- Prometheus: 1 interface, 30 discoverers
|
||||||
|
- Ecto: 7 type callbacks, hundreds of custom types
|
||||||
|
- Oban: Worker behaviour + 5 engine callbacks
|
||||||
|
|
||||||
|
**Smaller interface → larger ecosystem.** The less you
|
||||||
|
demand from implementors, the more you get.
|
||||||
|
|
||||||
|
<!-- PATTERN_COMPLETE -->
|
||||||
@@ -0,0 +1,297 @@
|
|||||||
|
# Testing Philosophy & API Evolution
|
||||||
|
|
||||||
|
How codebases prove correctness and manage change over
|
||||||
|
time reveals their deepest architectural commitments.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing Philosophy: Four Models of Proof
|
||||||
|
|
||||||
|
### CockroachDB: Defense in Depth
|
||||||
|
|
||||||
|
**Levels of proof:**
|
||||||
|
1. **Unit tests** — co-located in same package
|
||||||
|
2. **Echotest/golden files** — snapshot expected output (209
|
||||||
|
testdata directories, auto-rewrite with -rewrite flag)
|
||||||
|
3. **Data-driven tests** — declarative test specs in txt files
|
||||||
|
4. **KVNemesis** — chaos/fuzzing that generates random KV
|
||||||
|
operations and checks them against serializable isolation
|
||||||
|
5. **Leak detection** — goroutines, stoppers tracked globally
|
||||||
|
|
||||||
|
**The echotest pattern:**
|
||||||
|
```go
|
||||||
|
echotest.Require(t, output, filepath.Join("testdata", name+".txt"))
|
||||||
|
```
|
||||||
|
|
||||||
|
Golden file says:
|
||||||
|
```
|
||||||
|
echo
|
||||||
|
----
|
||||||
|
result is ambiguous: boom with a secret
|
||||||
|
result is ambiguous: boom with a ‹secret›
|
||||||
|
```
|
||||||
|
|
||||||
|
The test produces output, compares against the golden file.
|
||||||
|
Run with `-rewrite` to update. This means:
|
||||||
|
- Tests are **self-documenting** (the golden file IS the spec)
|
||||||
|
- Regressions are **visible in diffs** (the golden file changes)
|
||||||
|
- No manual expected-value maintenance
|
||||||
|
|
||||||
|
**KVNemesis (chaos testing at ecosystem level):**
|
||||||
|
Generates random sequences of KV operations (puts, gets,
|
||||||
|
splits, merges, transfers) against a real cluster, then
|
||||||
|
validates that results satisfy serializable isolation.
|
||||||
|
|
||||||
|
This isn't unit testing. This is proving the *system* is
|
||||||
|
correct, not individual functions.
|
||||||
|
|
||||||
|
**Resource leak detection as CI gate:**
|
||||||
|
```go
|
||||||
|
// At the top of every test function
|
||||||
|
defer leaktest.AfterTest(t)()
|
||||||
|
|
||||||
|
// Registered once at package init, so leaked Stoppers are reported too
|
||||||
|
func init() {
|
||||||
|
leaktest.PrintLeakedStoppers = PrintLeakedStoppers
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
If a test leaks a goroutine or Stopper, it **fails**. Not
|
||||||
|
a warning. A failure. This means resource correctness is
|
||||||
|
as enforceable as logic correctness.
|
||||||
|
|
||||||
|
### Prometheus: Golden Files + Goroutine Verification
|
||||||
|
|
||||||
|
**Testing DSL for PromQL:**
|
||||||
|
```
|
||||||
|
load 5m
|
||||||
|
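http_requests{job="api-server", instance="0", group="production"} 0+10x10
http_requests{job="api-server", instance="1", group="production"} 0+20x10
http_requests{job="api-server", instance="0", group="canary"} 0+30x10
http_requests{job="api-server", instance="1", group="canary"} 0+40x10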
|
||||||
|
|
||||||
|
eval instant at 50m SUM BY (group) (http_requests)
|
||||||
|
{group="canary"} 700
|
||||||
|
{group="production"} 300
|
||||||
|
```
|
||||||
|
|
||||||
|
This is a custom test language. Load data, evaluate
|
||||||
|
expressions, assert results. Separately, `config/testdata/`
alone holds **205 test config files**.
|
||||||
|
|
||||||
|
**Force:** PromQL is complex enough that hand-written Go
assertions would be unwieldy. The DSL lets you write
|
||||||
|
hundreds of test cases concisely, covering edge cases
|
||||||
|
that would require dozens of Go test functions.
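
The series notation is compact arithmetic. As a simplified reading (an assumption about the notation, not the Prometheus parser), `0+10x10` means "start at 0, add 10, ten times", which yields eleven samples. With a 5m load interval, the sample at 50m is the last one. The helper names below are hypothetical.

```go
package main

import "fmt"

// expand returns the samples described by "start+step x count": the initial
// value followed by count increments.
func expand(start, step float64, count int) []float64 {
	out := make([]float64, 0, count+1)
	for i := 0; i <= count; i++ {
		out = append(out, start+step*float64(i))
	}
	return out
}

// valueAt returns the sample at the given minute for a series loaded at a
// fixed interval (both in minutes).
func valueAt(samples []float64, intervalMin, atMin int) float64 {
	return samples[atMin/intervalMin]
}

func main() {
	s := expand(0, 10, 10) // 0+10x10, loaded every 5m
	fmt.Println(len(s), valueAt(s, 5, 50))
}
```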
|
||||||
|
|
||||||
|
**Goroutine leak detection:**
|
||||||
|
```go
|
||||||
|
func TolerantVerifyLeak(m *testing.M) {
|
||||||
|
goleak.VerifyTestMain(m,
|
||||||
|
goleak.IgnoreTopFunction("go.opencensus.io/..."),
|
||||||
|
goleak.IgnoreTopFunction("k8s.io/klog/..."),
|
||||||
|
)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Explicit allowlist for known third-party leaks. Everything
|
||||||
|
else is a test failure. Zero-tolerance with escape hatches
|
||||||
|
for unfixable external dependencies.
|
||||||
|
|
||||||
|
### Ecto: Fake Adapter + Process Mailbox Assertions
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defmodule Ecto.TestAdapter do
|
||||||
|
@behaviour Ecto.Adapter
|
||||||
|
@behaviour Ecto.Adapter.Queryable
|
||||||
|
@behaviour Ecto.Adapter.Schema
|
||||||
|
@behaviour Ecto.Adapter.Transaction
|
||||||
|
|
||||||
|
def execute(_, _, {:nocache, {:all, query}}, _, _) do
|
||||||
|
send(self(), {:all, query})
|
||||||
|
Process.get(:test_repo_all_results) || results_for_all_query(query)
|
||||||
|
end
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
**Ecto tests the entire query pipeline without a database.**
|
||||||
|
The fake adapter:
|
||||||
|
- Sends messages to `self()` on every operation
|
||||||
|
- Tests assert on `receive {:insert, meta}` etc.
|
||||||
|
- No network, no state, pure message-passing verification
|
||||||
|
|
||||||
|
**48 test files, 43 with `async: true`.** The test suite
|
||||||
|
runs in parallel because there's no shared state — every
|
||||||
|
test talks to its own process mailbox.
|
||||||
|
|
||||||
|
**Force:** Ecto is a *library*, not a service. It can't
|
||||||
|
require Postgres in CI for every contributor. The fake
|
||||||
|
adapter makes the entire query compilation + planning
|
||||||
|
pipeline testable without external dependencies.
|
||||||
|
|
||||||
|
### Oban: Testing Modes as First-Class Feature
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
# In test config
|
||||||
|
config :my_app, Oban, testing: :manual # :inline runs jobs at insert, leaving nothing for assert_enqueued
|
||||||
|
|
||||||
|
# In test
|
||||||
|
use Oban.Testing, repo: MyApp.Repo
|
||||||
|
|
||||||
|
test "job was enqueued" do
|
||||||
|
assert_enqueued worker: MyWorker, args: %{id: 1}
|
||||||
|
end
|
||||||
|
|
||||||
|
test "job executes correctly" do
|
||||||
|
assert :ok = perform_job(MyWorker, %{id: 1})
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Three modes:
|
||||||
|
- **`:inline`** — jobs execute synchronously in the test
|
||||||
|
process. No GenServers, no queues, no async.
|
||||||
|
- **`:manual`** — jobs are enqueued but not executed.
|
||||||
|
Use `assert_enqueued` to verify they were created.
|
||||||
|
- **`:disabled`** — production behavior in tests.
|
||||||
|
|
||||||
|
**Force:** Background jobs are a common source of test
|
||||||
|
flakiness. Oban eliminates it by making the execution
|
||||||
|
model configurable. Tests never poll, never sleep, never
|
||||||
|
race.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## API Evolution: Three Strategies
|
||||||
|
|
||||||
|
### CockroachDB: Version Gates (Distributed Migration)
|
||||||
|
|
||||||
|
```go
|
||||||
|
const (
|
||||||
|
V26_2_AddStatementStatisticsComputedColumns Key = iota
|
||||||
|
V26_2_ChangefeedsStopReadingSpanLevelCheckpoints
|
||||||
|
V26_2_ChangefeedsStopWritingSpanLevelCheckpoints
|
||||||
|
)
|
||||||
|
|
||||||
|
// In code:
|
||||||
|
if settings.Version.IsActive(ctx, clusterversion.V26_2) {
|
||||||
|
// use new behavior
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**The pattern:** Every change to observable behavior gets
|
||||||
|
a version constant. The feature is only enabled when ALL
|
||||||
|
nodes in the cluster have been upgraded past that version.
|
||||||
|
|
||||||
|
**Multi-phase deprecation for distributed changes:**
|
||||||
|
```
|
||||||
|
V26_2_ChangefeedsStopReadingSpanLevelCheckpoints
|
||||||
|
V26_2_ChangefeedsStopWritingSpanLevelCheckpoints
|
||||||
|
V26_2_ChangefeedsNoLongerHaveSpanLevelCheckpoints
|
||||||
|
```
|
||||||
|
|
||||||
|
Three versions for one removal:
|
||||||
|
1. Stop reading (new code doesn't depend on old format)
|
||||||
|
2. Stop writing (old format no longer produced)
|
||||||
|
3. Clean up (safe to remove the old code)
|
||||||
|
|
||||||
|
**Force:** In a distributed database, you can't change
|
||||||
|
behavior atomically. Some nodes will be old, some new.
|
||||||
|
The version gate ensures new behavior only activates
|
||||||
|
when it's safe — when all nodes understand it.
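
A simplified model of the gate, assuming the active version is the lowest version any node runs. The real system persists and finalizes the cluster version rather than computing it on the fly; `Version`, `activeVersion`, and `isActive` are hypothetical names.

```go
package main

import "fmt"

// Version is a hypothetical ordered cluster version (major, minor).
type Version struct{ Major, Minor int }

// Less reports whether v sorts before o.
func (v Version) Less(o Version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// activeVersion is the lowest version among the nodes: a feature is safe only
// once every node understands it. It assumes at least one node.
func activeVersion(nodes []Version) Version {
	lowest := nodes[0]
	for _, v := range nodes[1:] {
		if v.Less(lowest) {
			lowest = v
		}
	}
	return lowest
}

// isActive reports whether behavior gated on gate may be used cluster-wide.
func isActive(nodes []Version, gate Version) bool {
	return !activeVersion(nodes).Less(gate)
}

func main() {
	gate := Version{26, 2}
	nodes := []Version{{26, 2}, {26, 1}, {26, 2}}
	fmt.Println(isActive(nodes, gate)) // one node is still on 26.1
	nodes[1] = Version{26, 2}
	fmt.Println(isActive(nodes, gate)) // every node has been upgraded
}
```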
|
||||||
|
|
||||||
|
**Pruning:** Once MinSupported advances past a version
constant, that gate is always active and its `IsActive`
check is dead code, so the constant and the check are
deleted. Regular pruning keeps gates from accumulating.
|
||||||
|
|
||||||
|
### Oban: Numbered Migrations (Schema Evolution)
|
||||||
|
|
||||||
|
```
|
||||||
|
lib/oban/migrations/postgres/
|
||||||
|
├── v01.ex # Initial schema (job table, state enum)
|
||||||
|
├── v02.ex # Add columns
|
||||||
|
├── v03.ex # Index optimization
|
||||||
|
...
|
||||||
|
└── v14.ex # Latest
|
||||||
|
```
|
||||||
|
|
||||||
|
Each migration is:
|
||||||
|
- **Idempotent** (safe to run twice)
|
||||||
|
- **Prefix-aware** (multi-tenant schemas)
|
||||||
|
- **Bidirectional** (up + down)
|
||||||
|
- **Database-specific** (postgres/, sqlite/, myxql/)
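
The numbered, bidirectional, idempotent shape can be modeled in a few lines. This sketch is not Oban's code: `step`, `steps`, and `migrate` are hypothetical, and a map stands in for the database schema.

```go
package main

import "fmt"

// step is one numbered, reversible schema change.
type step struct {
	up, down func(schema map[string]bool)
}

// steps is keyed by version number. Each function is written so that running
// it twice leaves the same schema (idempotent).
var steps = map[int]step{
	1: {
		up:   func(s map[string]bool) { s["jobs"] = true },
		down: func(s map[string]bool) { delete(s, "jobs") },
	},
	2: {
		up:   func(s map[string]bool) { s["jobs.meta"] = true },
		down: func(s map[string]bool) { delete(s, "jobs.meta") },
	},
}

// migrate moves the schema from version current to version target, applying
// ups in ascending order or downs in descending order, and returns target.
func migrate(schema map[string]bool, current, target int) int {
	for v := current + 1; v <= target; v++ {
		steps[v].up(schema)
	}
	for v := current; v > target; v-- {
		steps[v].down(schema)
	}
	return target
}

func main() {
	schema := map[string]bool{}
	v := migrate(schema, 0, 2)
	fmt.Println(v, len(schema))
	v = migrate(schema, v, 1)
	fmt.Println(v, len(schema))
}
```

The caller only names a target version; which steps run in between is the library's concern.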
|
||||||
|
|
||||||
|
**Consumer usage:**
|
||||||
|
```elixir
|
||||||
|
defmodule MyApp.Repo.Migrations.AddOban do
|
||||||
|
use Ecto.Migration
|
||||||
|
def up, do: Oban.Migrations.up(version: 14)
|
||||||
|
def down, do: Oban.Migrations.down(version: 1) # roll all the way back
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
**Force:** Oban owns a database table but lives inside
|
||||||
|
the consumer's migration system. Numbered versions let
|
||||||
|
consumers upgrade incrementally without knowing Oban
|
||||||
|
internals.
|
||||||
|
|
||||||
|
### Ecto: Compile-Time Deprecation + Semver
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
# In changeset.ex
|
||||||
|
IO.warn(
|
||||||
|
"passing a list of binaries to cast/3 is deprecated..."
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Ecto surfaces deprecations as warnings via `IO.warn`. Where
the deprecated form passes through a macro, the warning
appears while the caller compiles; otherwise it appears
when the code path runs. Either way, the old behavior
keeps working.
|
||||||
|
|
||||||
|
**CHANGELOG as contract:**
|
||||||
|
```
|
||||||
|
## v3.14.0-dev
|
||||||
|
### Enhancements
|
||||||
|
### Bug fixes
|
||||||
|
|
||||||
|
## v3.13.5 (2025-11-09)
|
||||||
|
### Enhancements
|
||||||
|
```
|
||||||
|
|
||||||
|
The changelog is the API evolution document. Breaking
|
||||||
|
changes require a major version bump (hasn't happened
|
||||||
|
in years because the adapter pattern provides
|
||||||
|
extensibility without breakage).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What This Teaches for Code Review
|
||||||
|
|
||||||
|
### Testing Questions:
|
||||||
|
1. Is this testable **without standing up the system**?
|
||||||
|
(Ecto's fake adapter, Oban's `:inline` testing mode)
|
||||||
|
2. Are resources **tracked and leak-detected**?
|
||||||
|
(CockroachDB's stopper/goroutine tracking)
|
||||||
|
3. Are test assertions **deterministic**? No sleep, no
|
||||||
|
poll, no "eventually consistent" in unit tests.
|
||||||
|
4. Could this be a **golden file test**? If the output
|
||||||
|
is deterministic, snapshot it. Regression = visible diff.
|
||||||
|
5. Is there **chaos/property testing** for invariants?
|
||||||
|
(KVNemesis for serializable isolation)
|
||||||
|
|
||||||
|
### Evolution Questions:
|
||||||
|
1. Can this change be deployed **gradually**? Or does it
|
||||||
|
require all consumers to upgrade atomically?
|
||||||
|
2. Is there a **phased** path? (Stop reading → stop
|
||||||
|
writing → remove)
|
||||||
|
3. Is the deprecation **visible at compile time**? Or
|
||||||
|
will consumers only discover it at runtime?
|
||||||
|
4. Is the migration **idempotent**? Can it be run twice
|
||||||
|
safely?
|
||||||
|
|
||||||
|
### Red Flags:
|
||||||
|
- Tests that require a running database for unit-level logic
|
||||||
|
- No resource leak detection in concurrent code
|
||||||
|
- `time.Sleep` / `Process.sleep` in tests instead of
|
||||||
|
deterministic signals
|
||||||
|
- Breaking changes without version gates or migration path
|
||||||
|
- Deprecation that only appears in docs, not in tooling
|
||||||
|
|
||||||
|
<!-- PATTERN_COMPLETE -->
|
||||||