diff --git a/analysis/architectural-analysis.md b/analysis/architectural-analysis.md new file mode 100644 index 0000000..bbd3881 --- /dev/null +++ b/analysis/architectural-analysis.md @@ -0,0 +1,340 @@ +# Architectural Patterns from Top Repos + +## CockroachDB: How to Organize 20,000 Files + +### The 116-Package Principle + +CockroachDB has 116 packages under `pkg/util/` averaging +**4 files each**. This is deliberate: + +**Force:** A 2M-line codebase where developers work on +different subsystems simultaneously. If `pkg/util` were +5 big packages, every PR would conflict. + +**Pattern:** One concept = one package. `circuit/` is 3 +files (breaker, options, signal). `quotapool/` is 5 files. +`stop/` is 2 files. The package boundary IS the API +boundary — no internal debates about what is exported. + +**Naming:** Single-concept nouns. No `helpers`, no +`common`, no `shared`. Every package name tells you what +it does: `cancelchecker`, `ctxgroup`, `syncutil`. + +### Dependency Layering + +``` +sql → kv → storage → util + ↓ ↓ ↓ + ↓ ↓ roachpb (protobuf types) + ↓ ↓ ↓ + ↓ keys ← util + ↓ + settings, config +``` + +**Critical insight:** `kv` imports from `sql` AND `sql` +imports from `kv`. They solved circular deps via +interfaces + callback registration — not by eliminating +the cycle. The `internal/` package provides the bridge. + +`storage` imports `kv` (for transaction types) but `kv` +also imports `storage`. Again, interface boundaries break +the cycle at compile time. + +**Lesson:** Perfect layering is impossible in distributed +databases. The real skill is knowing where to put the +interface that breaks the cycle. 
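A note on mechanics: Go rejects package-level import cycles at compile time, so a cycle like the one described above can only exist logically, never as two packages importing each other directly. The sketch below shows one common way to break such a cycle with an interface plus callback registration. It is an illustration under stated assumptions, not CockroachDB's code: `RangeDescriber`, `RegisterDescriber`, `Lookup`, and `tableDescriber` are hypothetical names, and the three "packages" share one file only so the example compiles on its own.

```go
package main

import "fmt"

// --- would live in a small bridge package that both sides import ---

// RangeDescriber is the interface that breaks the cycle: the low-level
// side depends on this abstraction, not on the high-level package.
type RangeDescriber interface {
	Describe(key string) string
}

// describer is the registration slot, filled in once at startup.
var describer RangeDescriber

// RegisterDescriber is called by the high-level side during initialization.
func RegisterDescriber(d RangeDescriber) { describer = d }

// --- would live in the low-level package (think "kv") ---

// Lookup needs high-level knowledge but only ever sees the interface.
func Lookup(key string) (string, error) {
	if describer == nil {
		return "", fmt.Errorf("no describer registered")
	}
	return describer.Describe(key), nil
}

// --- would live in the high-level package (think "sql") ---

// tableDescriber is the concrete implementation the low-level side never imports.
type tableDescriber struct{ table string }

func (t tableDescriber) Describe(key string) string {
	return fmt.Sprintf("%s/%s", t.table, key)
}
```

Calling `RegisterDescriber(tableDescriber{table: "users"})` at startup is the only point where the two sides meet. The trade-off is that a missing registration surfaces at runtime rather than at compile time, which is why `Lookup` returns an error instead of assuming the slot is filled.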
+ +### Error Handling at Scale + +They use `github.com/cockroachdb/errors` — their own +library that extends stdlib `errors` with: + +- **Error marks:** Tag errors with metadata without + changing the error chain +- **Wrapping with causes:** `errors.Wrap(err, "context")` +- **Safe printing:** `redact.Sprint` for log-safe errors +- **Network encoding:** Errors serialize across RPC + boundaries + +**Pattern:** Errors are first-class data that flows through +the entire system, surviving serialization across nodes. +Not just strings — structured, typed, matchable. + +### Circuit Breaker (not stdlib) + +```go +type Breaker struct { + mu struct { + syncutil.RWMutex + errAndCh *errAndCh // stable Signal() results + probing bool + } +} +``` + +**Key design:** `Signal()` returns a channel + error getter +(like `context.Done()` + `context.Err()`). The channel is +stable — closing it doesn't affect callers who already have +a reference. New callers get a new channel after reset. + +**Force:** In a distributed DB, a broken replica should +fail-fast all pending requests, then probe for recovery. +Context cancellation isn't enough because you need to +distinguish "gave up waiting" from "system is broken." + +### QuotaPool: Abstract Resource Allocation + +```go +type Resource interface{} +type Request interface { + Acquire(ctx context.Context, r Resource) ( + fulfilled bool, tryAgainAfter time.Duration) + ShouldWait() bool +} +``` + +**Pattern:** The pool is generic over any resource type. +Concrete implementations include: +- `IntPool` — weighted semaphore with FIFO ordering +- Rate limiters (via `tryAgainAfter`) +- Token buckets + +**Force:** Different subsystems need different quota types +but the same queueing/fairness semantics. Abstract once, +instantiate many. 
+ +--- + +## Prometheus: Interface-Driven Storage Architecture + +### The Contract Layer + +`storage/interface.go` defines **15+ interfaces** that +form the entire query/storage contract: + +``` +Storage (top level) +├── Appendable → Appender (write path) +├── Queryable → Querier (read path) +├── ChunkQueryable → ChunkQuerier (bulk read) +├── ExemplarStorage (exemplars) +└── Searcher (experimental) +``` + +**Force:** Prometheus must support: +- Local TSDB (the main implementation) +- Remote read/write (federation) +- Recording rules (virtual series) +- Testing (mock implementations) + +All through the same interface. The contract layer is +the single point of truth for "what does storage mean." + +### Compile-Time Interface Verification + +```go +var _ storage.GetRef = &headAppender{} +var _ storage.Searcher = &blockBaseQuerier{} +``` + +Prometheus uses this pattern **8 times** in tsdb/ alone. +Every concrete type that claims to satisfy a storage +interface proves it at compile time. + +**Why this matters at scale:** Storage interfaces evolve. +When `Searcher` was added, every type that should +implement it needed updating. The `var _` pattern makes +the compiler tell you what you missed. + +### Plugin Discovery via Channel + +```go +type Discoverer interface { + Run(ctx context.Context, up chan<- []*targetgroup.Group) +} +``` + +**Brilliance:** The entire service discovery system is one +interface with one method. Consul, DNS, Kubernetes, AWS — +all implement `Run`. They push target groups through a +channel. The manager multiplexes. + +**Force:** Prometheus supports 20+ discovery mechanisms. +Adding one should require zero changes to the core. The +channel-based push model means the manager never polls. + +### Atomic File Operations + +Block lifecycle uses filesystem conventions: +- `.tmp-for-creation` — incomplete write +- `.tmp-for-deletion` — incomplete delete + +On startup, scan and clean up. 
No WAL needed for +block-level operations because rename is atomic on POSIX. + +**Force:** TSDB blocks are large (hours of data). A WAL +for block operations would be overkill. The suffix +convention gives crash consistency with zero overhead. + +--- + +## Ecto: Composability Through Data + +### Query as Accumulating Struct + +```elixir +defstruct prefix: nil, sources: nil, from: nil, + joins: [], wheres: [], select: nil, + order_bys: [], limit: nil, offset: nil, + group_bys: [], updates: [], havings: [], + preloads: [], distinct: nil, lock: nil, + windows: [], with_ctes: nil +``` + +**Every query operation appends to a list or sets a +field.** Nothing is executed. The struct accumulates intent +until `Repo.all/Repo.one` triggers planning + execution. + +**Force:** Queries must be composable (build in one +module, filter in another, paginate in a third). If +operations executed immediately, composition would require +the entire DB context at every step. + +### Macro → Builder → Planner Pipeline + +``` +User writes: from(u in User, where: u.age > 18) + ↓ +Macro expands: Builder.Filter.build(query, expr, env) + ↓ +Builder produces: %Ecto.Query.BooleanExpr{...} + ↓ +Planner resolves: types, bindings, params + ↓ +Adapter generates: SQL string +``` + +Each builder module handles one clause type. There are +**15 builder modules** (from, join, filter, select, etc.). +The planner doesn't know about SQL — it resolves the +query struct into a normalized form that any adapter can +consume. + +**Force:** Support multiple databases (Postgres, MySQL, +SQLite) with the same query language. The adapter is the +only part that knows SQL dialect. + +### Protocol for Extensibility + +`Ecto.Queryable` protocol lets you pass: +- A module atom (`User`) → resolved to schema query +- A string (`"users"`) → raw table +- A tuple (`{"filtered_users", User}`) → view + schema +- An `Ecto.Query` struct → identity + +**Force:** `Repo.all(X)` should work with any "queryable +thing." 
New queryable types can be added without touching +Repo code. + +--- + +## Oban: Architecture for Testability + +### Engine Swap by Config + +```elixir +def get_engine(%{engine: engine, testing: :disabled}), do: engine +def get_engine(%{testing: :inline}), do: Oban.Engines.Inline +def get_engine(%{testing: :manual}), do: engine +``` + +Three modes: +- **disabled** (production) — real engine +- **inline** (unit test) — execute in caller process +- **manual** (integration) — enqueue but don't execute + +**Force:** Background jobs are inherently untestable +without process control. Rather than making tests async +(flaky), make the engine deterministic. + +### Flat Supervision with Named Registry + +```elixir +children = [ + {Notifier, conf: conf, name: Registry.via(name, Notifier)}, + {Nursery, conf: conf, name: Registry.via(name, Nursery)}, + {Peer, conf: conf, name: Registry.via(name, Peer)}, + {Sonar, conf: conf, name: Registry.via(name, Sonar)}, + {Harbor, conf: conf, name: Registry.via(name, Harbor)} +] +``` + +Every child gets its config via `conf:` and its identity +via `Registry.via`. This means: +- Multiple Oban instances can run in the same VM +- Tests can start isolated Oban supervisors +- No global state — everything is namespaced + +**Force:** Libraries can't own global names. Enterprise +apps run multiple Oban instances (different repos, +different queues). The Registry pattern makes this +possible without process naming conflicts. + +### Behaviour as Plugin Contract + +```elixir +# Plugin must be a GenServer AND implement these: +@callback start_link([option()]) :: GenServer.on_start() +@callback validate([option()]) :: :ok | {:error, String.t()} +``` + +**Force:** Plugins need lifecycle management (start, stop, +crash recovery) AND configuration validation. 
By requiring +both a behaviour AND OTP compliance, Oban gets: +- Fault isolation (supervisor restarts crashed plugins) +- Config validation at startup (fail fast) +- No coupling (any GenServer works) + +--- + +## Cross-Cutting Insights + +### 1. Interfaces at Boundaries, Structs Internally + +All four codebases define interfaces at system boundaries +(storage, engine, discovery) but use concrete types +internally. The interface is the published contract; the +struct is the implementation detail. + +### 2. Config as Validated Struct, Not Map + +Every system validates config at startup and stores it as +a typed struct. Never a raw map floating around. + +### 3. Testing is an Architecture Decision + +Oban's engine swap, CockroachDB's stopper tracking, +Prometheus's mock interfaces — testability isn't bolted on, +it's designed in from day one. + +### 4. Composition via Data, Not Inheritance + +Ecto queries accumulate as data. Prometheus discoverers +push through channels. CockroachDB quota requests are +data objects. Nobody uses class hierarchies. + +### 5. The Cycle Problem is Solved with Interfaces + +CockroachDB has circular dependencies between sql↔kv↔ +storage. They break cycles with interface packages that +both sides depend on. This is the only way at scale. + +### 6. Small Packages > Large Packages + +CockroachDB: 4 files average per package. +Oban: focused modules (engine, worker, plugin). +Ecto: one builder per clause type. +The package boundary forces you to define the API. + + diff --git a/analysis/crosscutting-analysis.md b/analysis/crosscutting-analysis.md new file mode 100644 index 0000000..7ad78bd --- /dev/null +++ b/analysis/crosscutting-analysis.md @@ -0,0 +1,301 @@ +# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts + +Cross-cutting concerns are the things that touch everything +but belong nowhere. 
How a codebase handles logging, +telemetry, config, retry, and lifecycle management reveals +its architectural philosophy more than any feature code. + +--- + +## 1. Logging: From Strings to Semantic Channels + +### CockroachDB: Channel-Based Log Routing + +CockroachDB doesn't just log at severity levels — it +routes logs to **semantic channels**: + +```go +const DEV = logpb.Channel_DEV // development noise +const OPS = logpb.Channel_OPS // operator actions +const HEALTH = logpb.Channel_HEALTH // background health +const STORAGE = logpb.Channel_STORAGE +const SESSIONS = logpb.Channel_SESSIONS +const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA +const USER_ADMIN = logpb.Channel_USER_ADMIN +``` + +Each channel can be routed to different sinks (file, +network, etc.) independently. Production deploys typically +disable DEV entirely and route HEALTH to monitoring. + +**Force:** In a multi-tenant distributed database, "who +cares about this log?" is a different question than "how +bad is it?" An INFO-level schema change matters to DBAs +but not to SREs monitoring node health. + +**Ecosystem insight:** The channel IS the audience. When +you write `log.Health.Warningf(...)`, you're declaring +"the person watching cluster health needs to see this." +Severity is orthogonal to audience. + +### Prometheus: Self-Instrumentation + +Prometheus instruments itself with its own metrics: + +```go +type scrapeMetrics struct { + targetScrapeSampleLimit prometheus.Counter + targetScrapeSampleOutOfOrder prometheus.Counter + targetIntervalLengthHistogram *prometheus.HistogramVec + // ... 20+ metrics +} +``` + +Metrics are collected in a struct, constructed once via +`newScrapeMetrics(reg)`, and passed to subsystems. No +global registration — the registerer is injected. + +**Force:** Prometheus IS the metrics system. If it used +a different metrics library to instrument itself, that +would be a design smell. Dogfooding proves the API works. 
+ +### Ecto + Oban: Telemetry as Standard + +Both use Erlang's `:telemetry` library with predictable +naming: + +```elixir +# Oban +:telemetry.execute([:oban, :job, :start], measurements, meta) +:telemetry.execute([:oban, :job, :stop], measurements, meta) +:telemetry.execute([:oban, :job, :exception], measurements, meta) + +# Ecto (adapter-emitted) +[:my_app, :repo, :query] +``` + +**Force:** The BEAM ecosystem standardized on `:telemetry` +for observability. Libraries don't own their monitoring — +they emit events; consumers attach handlers. This inverts +the logging relationship: the library doesn't decide what +to do with the information. + +--- + +## 2. Config Propagation: Three Models + +### CockroachDB: Cluster Settings (Distributed Config) + +```go +settings.RegisterDurationSetting( + settings.ApplicationLevel, + "bulkio.ingest.flush_delay", + "amount of time to wait before sending a file...", + 0, // default +) +``` + +Settings are: +- **Typed** (Duration, Bool, Int, String) +- **Leveled** (ApplicationLevel vs SystemVisible) +- **Validated** (NonNegativeInt, etc.) +- **Distributed** (propagated across all nodes) +- **Version-gated** (new settings require cluster version) + +Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)` + +**Force:** In a distributed database, config isn't a file +— it's consensus. Every node must agree on every setting, +and settings can only be enabled once all nodes support +them. The version gate is the safety mechanism. + +### Prometheus: ApplyConfig (Hot Reload) + +```go +func (m *Manager) ApplyConfig(cfg *config.Config) error { + m.mtxScrape.Lock() + defer m.mtxScrape.Unlock() + // rebuild scrape pools from new config + // close old loggers, open new ones +} +``` + +Config is a struct loaded from YAML. On SIGHUP (or API +call), the entire config is re-parsed and `ApplyConfig` +is called on each subsystem. Each subsystem holds a mutex +and swaps atomically. + +**Force:** Prometheus runs as a single binary. 
Config +reload must be atomic per-subsystem but doesn't need +distributed consensus. The mutex-per-subsystem pattern +gives independent reload without global coordination. + +### Ecto + Oban: Config at Init, Validated Once + +```elixir +# Oban validates exhaustively at startup +Validation.validate_schema(opts, + engine: {:behaviour, Oban.Engine}, + queues: {:custom, &validate_queues/1}, + repo: {:module, [config: 0]}, + ... +) +``` + +Config is validated once at startup and stored as an +immutable struct. No hot reload. If config is wrong, +you know immediately (fail fast). + +**Force:** Elixir/OTP applications restart processes to +apply new config. Hot reload is handled by supervisor +restarts, not config mutation. The "config as immutable +struct" pattern means no runtime config bugs — it either +passes validation at startup or the app doesn't start. + +--- + +## 3. Retry and Resilience + +### CockroachDB: Iterator-Based Retry + +```go +opts := retry.Options{ + InitialBackoff: 100 * time.Millisecond, + MaxBackoff: 2 * time.Second, + Multiplier: 2, + MaxRetries: 5, +} +for r := retry.StartWithCtx(ctx, opts); r.Next(); { + // attempt operation + if err == nil { break } +} +``` + +Retry is a **for-loop iterator**. `r.Next()` handles +backoff timing and returns false when exhausted. This +means retry logic reads like normal code — no callbacks, +no framework. + +**Force:** CockroachDB has hundreds of retry sites. A +callback-based retry would create deeply nested code. +The iterator pattern keeps retry at the same indentation +level as the operation. 
+ +### Oban: Repo Dispatch with Built-In Retry + +```elixir +defp dynamic_dispatch(conf, name, args, attempt) do + with_dynamic_repo(conf, fn repo -> + apply(repo, name, args) + end) +rescue + error in UndefinedFunctionError -> + if attempt < @retry_opts[:retry] do + jittery_sleep(attempt * @retry_opts[:delay]) + dynamic_dispatch(conf, name, args, attempt + 1) + else + reraise error, __STACKTRACE__ + end +end +``` + +Every Ecto operation dispatched through Oban's repo +wrapper gets automatic retry for transient failures. +The consumer never sees the retry — it's invisible +infrastructure. + +**Key insight:** Oban retries `UndefinedFunctionError` +on the repo module itself — absorbing the window during +hot code reload when the module doesn't exist. This is +an ecosystem-level concern (BEAM hot code loading) handled +transparently. + +--- + +## 4. Resource Lifecycle: The Stopper Pattern + +### CockroachDB: Stopper as Universal Lifecycle + +```go +type Stopper struct { ... } + +// RunTask runs a synchronous task +func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error + +// RunAsyncTask runs a goroutine tracked by the stopper +func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error + +// ShouldQuiesce returns a channel closed when shutdown begins +func (s *Stopper) ShouldQuiesce() <-chan struct{} + +// Stop initiates graceful shutdown +func (s *Stopper) Stop(ctx context.Context) +``` + +Every goroutine in CockroachDB is launched through a +Stopper. This gives: +- **Tracking**: know exactly which goroutines are running +- **Graceful shutdown**: quiesce signal before hard stop +- **Leak detection**: `PrintLeakedStoppers` in tests +- **Throttling**: semaphore limits async tasks + +```go +func init() { + leaktest.PrintLeakedStoppers = PrintLeakedStoppers +} +``` + +**Force:** A database cannot afford goroutine leaks — +they hold locks, connections, and file handles. 
The +Stopper is the universal answer: every background task +is accounted for, every shutdown is graceful, every leak +is detected in tests. + +### Oban: Registry-Based Lifecycle + +```elixir +children = [ + {Notifier, conf: conf, name: Registry.via(name, Notifier)}, + {Nursery, conf: conf, name: Registry.via(name, Nursery)}, + ... +] +``` + +OTP already provides lifecycle management via supervisors. +Oban's addition is the Registry — namespacing processes +so multiple instances can coexist. Lifecycle is delegated +to the platform; naming is the library's concern. + +--- + +## 5. What These Patterns Teach for Code Review + +### Questions to Ask About Cross-Cutting Concerns: + +1. **Logging:** Who is the audience for this log? Is there + a routing mechanism, or does everything go to stdout? + Does the log help the *operator*, not just the developer? + +2. **Config:** How does config reach this code? Is it + validated at startup or silently wrong at runtime? Can + it be changed without restart? Should it be? + +3. **Retry:** Is retry happening at the right layer? Is it + invisible to the caller? Does it have backoff + jitter? + Does it respect context cancellation? + +4. **Lifecycle:** Are background tasks tracked? Will they + shut down gracefully? Can you detect leaks in tests? + +5. **Telemetry:** Are events emitted or is logging the only + observability? Can consumers attach their own handlers? 
+ +### Red Flags: + +- `log.Info("something happened")` with no channel/audience +- Config read from environment at point-of-use (not validated) +- Retry logic duplicated in 5 places with different backoff +- Goroutines launched with `go func()` and no tracking +- No telemetry events — only log lines for observability + + diff --git a/analysis/ecosystem-analysis.md b/analysis/ecosystem-analysis.md new file mode 100644 index 0000000..e8ce749 --- /dev/null +++ b/analysis/ecosystem-analysis.md @@ -0,0 +1,371 @@ +# Ecosystem-Level Patterns: How Codebases Present to Consumers + +## The Three Questions + +For each codebase, ask: +1. How do consumers **extend** it? (What interfaces/behaviours + do they implement?) +2. How do consumers **compose** with it? (What does day-to-day + usage look like?) +3. What does it deliberately **NOT do**? (What forces shaped + those refusals?) + +--- + +## CockroachDB: Errors as First-Class Distributed Data + +### Extension Points + +CockroachDB is not a library — it is a system. Consumers +extend it through: +- **SQL builtins** (function registration) +- **Storage engines** (via pebble interface) +- **Service discovery** (not user-extensible — closed) + +The interesting pattern is how errors flow from storage +through KV through SQL to the client. + +### Error Architecture (ecosystem-level idiom) + +``` +Storage error → encoded via cockroachdb/errors → + KV wraps with context → serialized across gRPC → + SQL decodes → maps to pgcode → wire protocol to client +``` + +**Key design decisions:** + +1. **Errors have priority.** `ErrPriority()` ranks errors so + the system knows which to surface when multiple things + fail simultaneously. Transaction abort > restart > + unambiguous error > non-retriable. + +2. **Errors survive serialization.** `EncodeError` / + `DecodeError` serialize errors across RPC boundaries. + The error that originated on node 3 arrives at node 1 + with its full cause chain intact. + +3. 
**Errors map to pg codes.** Every internal error maps to + a Postgres error code that clients understand. This is + the *ecosystem contract* — clients write + `if pgcode == '40001' { retry }`. + +**What this teaches:** In a distributed system, an error +isn't a string — it's a data object with identity, +priority, serializability, and a consumer-facing code. +Design your error types for the *consumer*, not the +*producer*. + +### Deliberate Absences + +- **No dependency injection framework.** Config structs + passed explicitly. 1178-line `StoreConfig` struct, but + it's all data — no framework magic. +- **No context.Background() on hot paths.** 144 uses in + kvserver, but auditable — each justified in comments. +- **No functional options.** CockroachDB uses config + structs universally. The Option interface in stopper is + the exception, not the rule. + +### Test Architecture + +- **TestMain in every package.** Sets up security certs, + random seeds, and test server factories. +- **Goroutine leak detection.** `leaktest.AfterTest(t)()` + at the start of every test. Detects leaked goroutines + by diffing goroutine stacks before/after. +- **Stopper leak detection.** Every Stopper is tracked + globally; `PrintLeakedStoppers(t)` in TestMain catches + forgot-to-stop bugs. +- **`//go:generate` for test setup.** Codegen tool + (`add-leaktest.sh`) auto-adds leak checks to every + test file. + +**What this teaches:** At scale, the most important test +infrastructure isn't assertions — it's resource leak +detection. Every goroutine, every connection, every +Stopper is tracked and verified to be cleaned up. 
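The before/after idea behind leak detection can be sketched with the standard library alone. This is a deliberately simplified stand-in: it compares goroutine counts where the real tooling diffs goroutine stacks, and `CheckLeaks` is a hypothetical name, not the `leaktest` API.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// CheckLeaks records the current goroutine count and returns a function
// that waits briefly for goroutines to wind down, then reports an error
// if more are running than at the start.
func CheckLeaks() func() error {
	before := runtime.NumGoroutine()
	return func() error {
		deadline := time.Now().Add(500 * time.Millisecond)
		for {
			after := runtime.NumGoroutine()
			if after <= before {
				return nil
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("leaked %d goroutine(s)", after-before)
			}
			// Goroutines that are already exiting need a moment to disappear.
			time.Sleep(5 * time.Millisecond)
		}
	}
}
```

In a test, the returned function would typically run in a `defer` that fails the test on a non-nil error, which gives the same "capture at start, verify at end" shape as `defer leaktest.AfterTest(t)()`. Counting is cruder than stack diffing: it cannot say which goroutine leaked, and unrelated goroutines starting or stopping during the test can skew the result.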
+ +--- + +## Prometheus: The One-Method Interface Contract + +### Extension Points + +Prometheus is extended through: +- **Service discovery** (30 implementations, 1 interface) +- **Storage** (remote read/write adapters) +- **Exporters** (client_golang metrics) + +### The Discoverer Pattern (ecosystem-level idiom) + +```go +type Discoverer interface { + Run(ctx context.Context, up chan<- []*targetgroup.Group) +} +``` + +This is **one method**. Thirty implementations. The +channel-based push model means: +- The discoverer controls timing (not polled) +- The manager multiplexes without knowing implementations +- Adding a new discovery source = implement Run, register + +**Registration via init():** +```go +func init() { + discovery.RegisterConfig(&SDConfig{}) +} +``` + +This is the classic Go plugin pattern. Import the package +→ init registers it → the system discovers it at startup. + +**What this teaches:** The smallest possible interface +creates the largest possible ecosystem. One method + one +channel = 30 implementations without coordination. + +### Storage Contract (15 interfaces, 1 file) + +All of Prometheus's storage contract lives in +`storage/interface.go`. It covers the: +- Read path: `Queryable → Querier → SeriesSet → Series` +- Write path: `Appendable → Appender` +- Extension: `ExemplarAppender`, `MetadataUpdater` + +**Key:** Every implementation proves satisfaction at +compile time with `var _ storage.Searcher = &type{}`. +When the contract evolves, the compiler finds every +broken implementation. + +### Deliberate Absences + +- **No generics in storage interfaces.** Despite Go 1.18+ + support. The interfaces predate generics and adding them + would break all existing implementations. +- **No dependency injection.** Direct struct construction + everywhere. Testability through interface satisfaction, + not framework wiring. +- **Almost no functional options.** Only in leaf packages + (chunk writer, parser). Core APIs use config structs. 
+- **No goroutine leak in production code.** `goleak` in + tests, `TolerantVerifyLeak` with explicit allowlist for + known third-party leaks. + +### Test Architecture + +- **`TolerantVerifyLeak`** — goroutine leak detection with + allowlist for known third-party leaks (opencensus, klog) +- **Mock implementations of every interface** — defined + right in `storage/interface.go` next to the real ones +- **Golden file tests** in PromQL evaluation + +--- + +## Ecto: Composability as Architectural Principle + +### Extension Points + +Consumers extend Ecto through: +- **Custom types** (7 callbacks: cast, load, dump, equal?, + embed_as, autogenerate, type) +- **Adapters** (Queryable, Schema, Transaction, Storage — + 4 behaviour modules) +- **Protocols** (`Ecto.Queryable` — anything can become a + query) + +### The NotLoaded Sentinel (ecosystem-level idiom) + +```elixir +defmodule Ecto.Association.NotLoaded do + defstruct [:__field__, :__owner__, :__cardinality__] +end +``` + +Ecto **refuses to lazy-load associations**. If you access +`user.posts` without preloading, you get a `NotLoaded` +struct — not nil, not an empty list, not a database query. + +**Why this is an ecosystem decision:** +- Forces consumers to be explicit about data needs +- Prevents N+1 queries by making them impossible +- Makes the data boundary visible in code + +This is a *consumer-hostile* decision that makes +*systems built on Ecto* dramatically better. The library +optimizes for the 1000th user, not the first-day +experience. + +### Query Composition (ecosystem-level idiom) + +Every query clause appends to a list in the Query struct. +Nothing executes. The Query is pure data that accumulates +intent. 
+ +**Consumer impact:** You can build queries across module +boundaries: + +```elixir +# Module A builds the base +def active_users, do: from(u in User, where: u.active) + +# Module B adds pagination +def paginate(query, page, size) do + query + |> limit(^size) + |> offset(^((page - 1) * size)) +end + +# Module C adds authorization +def visible_to(query, role) do + where(query, [u], u.role in ^roles_for(role)) +end +``` + +Each module is independent. They compose because queries +are data, not effects. + +### Adapter Architecture + +``` +Ecto.Repo.all(query) + → Planner resolves types, bindings + → Adapter.prepare/2 produces {cache, prepared} + → Adapter.execute/5 runs against DB + → Adapter.loaders/2 converts back to Elixir types +``` + +The adapter is the ONLY part that knows SQL. Ecto core +is database-agnostic. This is why the same code works on +Postgres, MySQL, SQLite, and custom stores. + +### Deliberate Absences + +- **No lazy loading.** `NotLoaded` struct instead. +- **No global state.** Per-repo config, per-repo process. +- **No query caching at library level.** The adapter + caches prepared statements; Ecto doesn't. +- **No connection to schema naming.** `schema "legacy_tbl"` + is independent of `defmodule NewUser`. 
+ +--- + +## Oban: Designing for Testability First + +### Extension Points + +Consumers extend Oban through: +- **Workers** (`perform/1`, the job logic) +- **Plugins** (GenServer + validate callback) +- **Engines** (entire backend swap) +- **Notifiers** (pub/sub mechanism) +- **Peers** (leader election) + +### The Worker Result Type (ecosystem-level idiom) + +```elixir +@type result :: + :ok + | {:ok, ignored :: term()} + | {:error, reason :: term()} + | {:cancel, reason :: term()} + | {:snooze, period :: Period.t()} +``` + +Five possible outcomes, each with distinct semantics: +- `:ok` or `{:ok, value}` → success (the value is ignored), remove from queue +- `{:error, reason}` → retry (respects max_attempts) +- `{:cancel, reason}` → permanent failure, don't retry +- `{:snooze, period}` → reschedule for later + +**Ecosystem impact:** Every worker author makes an +explicit decision about failure semantics. "What should +happen when this fails?" is answered in the type system, +not in configuration. + +### Contextual Backoff (ecosystem-level idiom) + +```elixir +def backoff(%Job{attempt: attempt, unsaved_error: err}) do + case err.reason do + %RateLimitError{retry_after: ms} -> ms + _ -> trunc(:math.pow(attempt, 4) + jitter()) + end +end +``` + +The error that caused the failure is available to the +backoff calculation. Different errors → different retry +strategies. This is impossible in systems where backoff +is configured globally. + +### Testing Design (ecosystem-level idiom) + +Three testing modes via config: +- **`:inline`**: execute jobs synchronously in tests +- **`:manual`**: enqueue but don't execute +- **`:disabled`**: production behavior + +Plus `use Oban.Testing` which provides: +- `assert_enqueued/1`: verify job was queued +- `refute_enqueued/1`: verify job was NOT queued +- `perform_job/2`: execute a job manually in tests +- `all_enqueued/1`: list all matching jobs + +**Ecosystem impact:** Every Oban consumer gets +deterministic, fast, isolated tests for free. 
No sleep, +no polling, no flaky async assertions. + +### Deliberate Absences + +- **No global process names.** Registry.via everywhere — + multiple Oban instances can coexist. +- **No direct DB coupling in workers.** Workers receive a + Job struct; they don't import Repo. +- **No implicit retries.** max_attempts is explicit per + worker. No "retry forever" default. +- **No built-in rate limiting in OSS.** That is a Pro + feature — deliberate business boundary. + +--- + +## Cross-Cutting: What "Idiomatic" Means at Ecosystem Level + +### 1. The Consumer Contract is the API + +Not the functions you export — the *experience* of +building on your system: +- CockroachDB: "Your errors will be pg-codes, always" +- Prometheus: "Implement Run(), get discovery for free" +- Ecto: "Queries are data; loading is always explicit" +- Oban: "Return a result type; testing is built in" + +### 2. Deliberate Absences Define Character + +What a system refuses to do is as important as what it +does: +- Ecto refuses lazy loading → forces explicit data needs +- Oban refuses global names → enables multi-instance +- Prometheus refuses DI frameworks → keeps simplicity +- CockroachDB refuses context.Background on hot paths → + forces timeout discipline + +### 3. Testability is Never Retrofitted + +Every system that tests well designed testing in from the +start: +- CockroachDB: leak detection, stopper tracking +- Prometheus: goroutine leak verification, mock interfaces +- Ecto: adapter abstraction, embedded schemas for testing +- Oban: engine swap, testing modes, assertion helpers + +### 4. Extension Points Define the Ecosystem Size + +- Prometheus: 1 interface, 30 discoverers +- Ecto: 7 type callbacks, hundreds of custom types +- Oban: Worker behaviour + 5 engine callbacks + +**Smaller interface → larger ecosystem.** The less you +demand from implementors, the more you get. 
+ + diff --git a/analysis/testing-evolution-analysis.md b/analysis/testing-evolution-analysis.md new file mode 100644 index 0000000..f724cde --- /dev/null +++ b/analysis/testing-evolution-analysis.md @@ -0,0 +1,297 @@ +# Testing Philosophy & API Evolution + +How codebases prove correctness and manage change over +time reveals their deepest architectural commitments. + +--- + +## Testing Philosophy: Four Models of Proof + +### CockroachDB: Defense in Depth + +**Levels of proof:** +1. **Unit tests** — co-located in same package +2. **Echotest/golden files** — snapshot expected output (209 + testdata directories, auto-rewrite with -rewrite flag) +3. **Data-driven tests** — declarative test specs in txt files +4. **KVNemesis** — chaos/fuzzing that generates random KV + operations and checks serializability +5. **Leak detection** — goroutines, stoppers tracked globally + +**The echotest pattern:** +```go +echotest.Require(t, output, filepath.Join("testdata", name+".txt")) +``` + +Golden file says: +``` +echo +---- +result is ambiguous: boom with a secret +result is ambiguous: boom with a ‹secret› +``` + +The test produces output and compares it against the golden file. +Run with `-rewrite` to update. This means: +- Tests are **self-documenting** (the golden file IS the spec) +- Regressions are **visible in diffs** (the golden file changes) +- No manual expected-value maintenance + +**KVNemesis (chaos testing at ecosystem level):** +Generates random sequences of KV operations (puts, gets, +splits, merges, transfers) against a real cluster, then +validates that results satisfy serializable isolation. + +This isn't unit testing. This is proving the *system* is +correct, not individual functions. + +**Resource leak detection as CI gate:** +```go +// At the top of every test function +defer leaktest.AfterTest(t)() + +// Every TestMain +func init() { + leaktest.PrintLeakedStoppers = PrintLeakedStoppers +} +``` + +If a test leaks a goroutine or Stopper, it **fails**. Not +a warning. A failure. 
This means resource correctness is +as enforceable as logic correctness. + +### Prometheus: Golden Files + Goroutine Verification + +**Testing DSL for PromQL:** +``` +load 5m + http_requests{job="api-server"} 0+10x10 + +eval instant at 50m SUM BY (group) (http_requests) + {group="canary"} 700 + {group="production"} 300 +``` + +This is a custom test language. Load data, evaluate +expressions, assert results. **205 test config files** +in `config/testdata/` alone. + +**Force:** PromQL is complex enough that example-based +testing would be insufficient. The DSL lets you write +hundreds of test cases concisely, covering edge cases +that would require dozens of Go test functions. + +**Goroutine leak detection:** +```go +func TolerantVerifyLeak(m *testing.M) { + goleak.VerifyTestMain(m, + goleak.IgnoreTopFunction("go.opencensus.io/..."), + goleak.IgnoreTopFunction("k8s.io/klog/..."), + ) +} +``` + +Explicit allowlist for known third-party leaks. Everything +else is a test failure. Zero-tolerance with escape hatches +for unfixable external dependencies. + +### Ecto: Fake Adapter + Process Mailbox Assertions + +```elixir +defmodule Ecto.TestAdapter do + @behaviour Ecto.Adapter + @behaviour Ecto.Adapter.Queryable + @behaviour Ecto.Adapter.Schema + @behaviour Ecto.Adapter.Transaction + + def execute(_, _, {:nocache, {:all, query}}, _, _) do + send(self(), {:all, query}) + Process.get(:test_repo_all_results) || results_for_all_query(query) + end +end +``` + +**Ecto tests the entire query pipeline without a database.** +The fake adapter: +- Sends messages to `self()` on every operation +- Tests assert on `receive {:insert, meta}` etc. +- No network, no state, pure message-passing verification + +**48 test files, 43 with `async: true`.** The test suite +runs in parallel because there's no shared state — every +test talks to its own process mailbox. + +**Force:** Ecto is a *library*, not a service. It can't +require Postgres in CI for every contributor. 
The fake +adapter makes the entire query compilation + planning +pipeline testable without external dependencies. + +### Oban: Testing Modes as First-Class Feature + +```elixir +# In test config +config :my_app, Oban, testing: :inline + +# In test +use Oban.Testing, repo: MyApp.Repo + +test "job was enqueued" do + assert_enqueued worker: MyWorker, args: %{id: 1} +end + +test "job executes correctly" do + assert :ok = perform_job(MyWorker, %{id: 1}) +end +``` + +Three modes: +- **`:inline`** — jobs execute synchronously in the test + process. No GenServers, no queues, no async. +- **`:manual`** — jobs are enqueued but not executed. + Use `assert_enqueued` to verify they were created. +- **`:disabled`** — production behavior in tests. + +**Force:** Background jobs are a common source of test +flakiness. Oban removes that source by making the execution +model configurable. Tests never poll, never sleep, never +race. + +--- + +## API Evolution: Three Strategies + +### CockroachDB: Version Gates (Distributed Migration) + +```go +const ( + V26_2_AddStatementStatisticsComputedColumns Key = iota + V26_2_ChangefeedsStopReadingSpanLevelCheckpoints + V26_2_ChangefeedsStopWritingSpanLevelCheckpoints +) + +// In code: +if settings.Version.IsActive(ctx, clusterversion.V26_2) { + // use new behavior +} +``` + +**The pattern:** Every change to observable behavior gets +a version constant. The feature is only enabled when ALL +nodes in the cluster have been upgraded past that version. + +**Phased deprecation for distributed changes:** +``` +V26_2_ChangefeedsStopReadingSpanLevelCheckpoints +V26_2_ChangefeedsStopWritingSpanLevelCheckpoints +V26_2_ChangefeedsNoLongerHaveSpanLevelCheckpoints +``` + +Three versions for one removal: +1. Stop reading (new code doesn't depend on old format) +2. Stop writing (old format no longer produced) +3. Clean up (safe to remove the old code) + +**Force:** In a distributed database, you can't change +behavior atomically. Some nodes will be old, some new. 
+The version gate ensures new behavior only activates +when it's safe — when all nodes understand it. + +**Pruning:** Once MinSupported advances past a version +constant, it's deleted. The code path is always active +so the `IsActive` check becomes dead code. Regular +pruning keeps the codebase from accumulating gates. + +### Oban: Numbered Migrations (Schema Evolution) + +```elixir +lib/oban/migrations/postgres/ +├── v01.ex # Initial schema (job table, state enum) +├── v02.ex # Add columns +├── v03.ex # Index optimization +... +├── v14.ex # Latest +``` + +Each migration is: +- **Idempotent** (safe to run twice) +- **Prefix-aware** (multi-tenant schemas) +- **Bidirectional** (up + down) +- **Database-specific** (postgres/, sqlite/, myxql/) + +**Consumer usage:** +```elixir +defmodule MyApp.Repo.Migrations.AddOban do + use Ecto.Migration + def up, do: Oban.Migrations.up(version: 14) + def down, do: Oban.Migrations.down(version: 14) +end +``` + +**Force:** Oban owns a database table but lives inside +the consumer's migration system. Numbered versions let +consumers upgrade incrementally without knowing Oban +internals. + +### Ecto: Compile-Time Deprecation + Semver + +```elixir +# In changeset.ex +IO.warn( + "passing a list of binaries to cast/3 is deprecated..." +) +``` + +Ecto deprecates at **compile time**. When you compile +code that uses a deprecated API, you get a warning. +At runtime, everything still works. + +**CHANGELOG as contract:** +``` +## v3.14.0-dev +### Enhancements +### Bug fixes + +## v3.13.5 (2025-11-09) +### Enhancements +``` + +The changelog is the API evolution document. Breaking +changes require a major version bump (hasn't happened +in years because the adapter pattern provides +extensibility without breakage). + +--- + +## What This Teaches for Code Review + +### Testing Questions: +1. Is this testable **without standing up the system**? + (Ecto's fake adapter, Oban's inline engine) +2. Are resources **tracked and leak-detected**? 
+ (CockroachDB's stopper/goroutine tracking) +3. Are test assertions **deterministic**? No sleep, no + poll, no "eventually consistent" in unit tests. +4. Could this be a **golden file test**? If the output + is deterministic, snapshot it. Regression = visible diff. +5. Is there **chaos/property testing** for invariants? + (KVNemesis for serializability) + +### Evolution Questions: +1. Can this change be deployed **gradually**? Or does it + require all consumers to upgrade atomically? +2. Is there a **phased** path? (Stop reading → stop + writing → remove) +3. Is the deprecation **visible at compile time**? Or + will consumers only discover it at runtime? +4. Is the migration **idempotent**? Can it be run twice + safely? + +### Red Flags: +- Tests that require a running database for unit-level logic +- No resource leak detection in concurrent code +- `time.Sleep` / `Process.sleep` in tests instead of + deterministic signals +- Breaking changes without version gates or migration path +- Deprecation that only appears in docs, not in tooling + +