From 46fe9c23c95ec6ea73c29edacabb33aded136007 Mon Sep 17 00:00:00 2001 From: Rodin Date: Thu, 30 Apr 2026 10:50:37 -0700 Subject: [PATCH] chore: move cross-ecosystem analysis to patterns-vs-guidelines --- sources/architectural-analysis.md | 340 ----------------------- sources/crosscutting-analysis.md | 301 --------------------- sources/ecosystem-analysis.md | 371 -------------------------- sources/testing-evolution-analysis.md | 297 --------------------- 4 files changed, 1309 deletions(-) delete mode 100644 sources/architectural-analysis.md delete mode 100644 sources/crosscutting-analysis.md delete mode 100644 sources/ecosystem-analysis.md delete mode 100644 sources/testing-evolution-analysis.md diff --git a/sources/architectural-analysis.md b/sources/architectural-analysis.md deleted file mode 100644 index bbd3881..0000000 --- a/sources/architectural-analysis.md +++ /dev/null @@ -1,340 +0,0 @@ -# Architectural Patterns from Top Repos - -## CockroachDB: How to Organize 20,000 Files - -### The 116-Package Principle - -CockroachDB has 116 packages under `pkg/util/` averaging -**4 files each**. This is deliberate: - -**Force:** A 2M-line codebase where developers work on -different subsystems simultaneously. If `pkg/util` were -5 big packages, every PR would conflict. - -**Pattern:** One concept = one package. `circuit/` is 3 -files (breaker, options, signal). `quotapool/` is 5 files. -`stop/` is 2 files. The package boundary IS the API -boundary — no internal debates about what is exported. - -**Naming:** Single-concept nouns. No `helpers`, no -`common`, no `shared`. Every package name tells you what -it does: `cancelchecker`, `ctxgroup`, `syncutil`. - -### Dependency Layering - -``` -sql → kv → storage → util - ↓ ↓ ↓ - ↓ ↓ roachpb (protobuf types) - ↓ ↓ ↓ - ↓ keys ← util - ↓ - settings, config -``` - -**Critical insight:** `kv` imports from `sql` AND `sql` -imports from `kv`. 
They solved circular deps via -interfaces + callback registration — not by eliminating -the cycle. The `internal/` package provides the bridge. - -`storage` imports `kv` (for transaction types) but `kv` -also imports `storage`. Again, interface boundaries break -the cycle at compile time. - -**Lesson:** Perfect layering is impossible in distributed -databases. The real skill is knowing where to put the -interface that breaks the cycle. - -### Error Handling at Scale - -They use `github.com/cockroachdb/errors` — their own -library that extends stdlib `errors` with: - -- **Error marks:** Tag errors with metadata without - changing the error chain -- **Wrapping with causes:** `errors.Wrap(err, "context")` -- **Safe printing:** `redact.Sprint` for log-safe errors -- **Network encoding:** Errors serialize across RPC - boundaries - -**Pattern:** Errors are first-class data that flows through -the entire system, surviving serialization across nodes. -Not just strings — structured, typed, matchable. - -### Circuit Breaker (not stdlib) - -```go -type Breaker struct { - mu struct { - syncutil.RWMutex - errAndCh *errAndCh // stable Signal() results - probing bool - } -} -``` - -**Key design:** `Signal()` returns a channel + error getter -(like `context.Done()` + `context.Err()`). The channel is -stable — closing it doesn't affect callers who already have -a reference. New callers get a new channel after reset. - -**Force:** In a distributed DB, a broken replica should -fail-fast all pending requests, then probe for recovery. -Context cancellation isn't enough because you need to -distinguish "gave up waiting" from "system is broken." - -### QuotaPool: Abstract Resource Allocation - -```go -type Resource interface{} -type Request interface { - Acquire(ctx context.Context, r Resource) ( - fulfilled bool, tryAgainAfter time.Duration) - ShouldWait() bool -} -``` - -**Pattern:** The pool is generic over any resource type. 
-Concrete implementations include: -- `IntPool` — weighted semaphore with FIFO ordering -- Rate limiters (via `tryAgainAfter`) -- Token buckets - -**Force:** Different subsystems need different quota types -but the same queueing/fairness semantics. Abstract once, -instantiate many. - ---- - -## Prometheus: Interface-Driven Storage Architecture - -### The Contract Layer - -`storage/interface.go` defines **15+ interfaces** that -form the entire query/storage contract: - -``` -Storage (top level) -├── Appendable → Appender (write path) -├── Queryable → Querier (read path) -├── ChunkQueryable → ChunkQuerier (bulk read) -├── ExemplarStorage (exemplars) -└── Searcher (experimental) -``` - -**Force:** Prometheus must support: -- Local TSDB (the main implementation) -- Remote read/write (federation) -- Recording rules (virtual series) -- Testing (mock implementations) - -All through the same interface. The contract layer is -the single point of truth for "what does storage mean." - -### Compile-Time Interface Verification - -```go -var _ storage.GetRef = &headAppender{} -var _ storage.Searcher = &blockBaseQuerier{} -``` - -Prometheus uses this pattern **8 times** in tsdb/ alone. -Every concrete type that claims to satisfy a storage -interface proves it at compile time. - -**Why this matters at scale:** Storage interfaces evolve. -When `Searcher` was added, every type that should -implement it needed updating. The `var _` pattern makes -the compiler tell you what you missed. - -### Plugin Discovery via Channel - -```go -type Discoverer interface { - Run(ctx context.Context, up chan<- []*targetgroup.Group) -} -``` - -**Brilliance:** The entire service discovery system is one -interface with one method. Consul, DNS, Kubernetes, AWS — -all implement `Run`. They push target groups through a -channel. The manager multiplexes. - -**Force:** Prometheus supports 20+ discovery mechanisms. -Adding one should require zero changes to the core. 
The -channel-based push model means the manager never polls. - -### Atomic File Operations - -Block lifecycle uses filesystem conventions: -- `.tmp-for-creation` — incomplete write -- `.tmp-for-deletion` — incomplete delete - -On startup, scan and clean up. No WAL needed for -block-level operations because rename is atomic on POSIX. - -**Force:** TSDB blocks are large (hours of data). A WAL -for block operations would be overkill. The suffix -convention gives crash consistency with zero overhead. - ---- - -## Ecto: Composability Through Data - -### Query as Accumulating Struct - -```elixir -defstruct prefix: nil, sources: nil, from: nil, - joins: [], wheres: [], select: nil, - order_bys: [], limit: nil, offset: nil, - group_bys: [], updates: [], havings: [], - preloads: [], distinct: nil, lock: nil, - windows: [], with_ctes: nil -``` - -**Every query operation appends to a list or sets a -field.** Nothing is executed. The struct accumulates intent -until `Repo.all/Repo.one` triggers planning + execution. - -**Force:** Queries must be composable (build in one -module, filter in another, paginate in a third). If -operations executed immediately, composition would require -the entire DB context at every step. - -### Macro → Builder → Planner Pipeline - -``` -User writes: from(u in User, where: u.age > 18) - ↓ -Macro expands: Builder.Filter.build(query, expr, env) - ↓ -Builder produces: %Ecto.Query.BooleanExpr{...} - ↓ -Planner resolves: types, bindings, params - ↓ -Adapter generates: SQL string -``` - -Each builder module handles one clause type. There are -**15 builder modules** (from, join, filter, select, etc.). -The planner doesn't know about SQL — it resolves the -query struct into a normalized form that any adapter can -consume. - -**Force:** Support multiple databases (Postgres, MySQL, -SQLite) with the same query language. The adapter is the -only part that knows SQL dialect. 
- -### Protocol for Extensibility - -`Ecto.Queryable` protocol lets you pass: -- A module atom (`User`) → resolved to schema query -- A string (`"users"`) → raw table -- A tuple (`{"filtered_users", User}`) → view + schema -- An `Ecto.Query` struct → identity - -**Force:** `Repo.all(X)` should work with any "queryable -thing." New queryable types can be added without touching -Repo code. - ---- - -## Oban: Architecture for Testability - -### Engine Swap by Config - -```elixir -def get_engine(%{engine: engine, testing: :disabled}), do: engine -def get_engine(%{testing: :inline}), do: Oban.Engines.Inline -def get_engine(%{engine: engine, testing: :manual}), do: engine -``` - -Three modes: -- **disabled** (production): real engine -- **inline** (unit test): execute in caller process -- **manual** (integration): enqueue but don't execute - -**Force:** Background jobs are inherently untestable -without process control. Rather than making tests async -(flaky), make the engine deterministic. - -### Flat Supervision with Named Registry - -```elixir -children = [ - {Notifier, conf: conf, name: Registry.via(name, Notifier)}, - {Nursery, conf: conf, name: Registry.via(name, Nursery)}, - {Peer, conf: conf, name: Registry.via(name, Peer)}, - {Sonar, conf: conf, name: Registry.via(name, Sonar)}, - {Harbor, conf: conf, name: Registry.via(name, Harbor)} -] -``` - -Every child gets its config via `conf:` and its identity -via `Registry.via`. This means: -- Multiple Oban instances can run in the same VM -- Tests can start isolated Oban supervisors -- No global state; everything is namespaced - -**Force:** Libraries can't own global names. Enterprise -apps run multiple Oban instances (different repos, -different queues). The Registry pattern makes this -possible without process naming conflicts. 
- -### Behaviour as Plugin Contract - -```elixir -# Plugin must be a GenServer AND implement these: -@callback start_link([option()]) :: GenServer.on_start() -@callback validate([option()]) :: :ok | {:error, String.t()} -``` - -**Force:** Plugins need lifecycle management (start, stop, -crash recovery) AND configuration validation. By requiring -both a behaviour AND OTP compliance, Oban gets: -- Fault isolation (supervisor restarts crashed plugins) -- Config validation at startup (fail fast) -- No coupling (any GenServer works) - ---- - -## Cross-Cutting Insights - -### 1. Interfaces at Boundaries, Structs Internally - -All four codebases define interfaces at system boundaries -(storage, engine, discovery) but use concrete types -internally. The interface is the published contract; the -struct is the implementation detail. - -### 2. Config as Validated Struct, Not Map - -Every system validates config at startup and stores it as -a typed struct. Never a raw map floating around. - -### 3. Testing is an Architecture Decision - -Oban's engine swap, CockroachDB's stopper tracking, -Prometheus's mock interfaces — testability isn't bolted on, -it's designed in from day one. - -### 4. Composition via Data, Not Inheritance - -Ecto queries accumulate as data. Prometheus discoverers -push through channels. CockroachDB quota requests are -data objects. Nobody uses class hierarchies. - -### 5. The Cycle Problem is Solved with Interfaces - -CockroachDB has circular dependencies between sql↔kv↔ -storage. They break cycles with interface packages that -both sides depend on. This is the only way at scale. - -### 6. Small Packages > Large Packages - -CockroachDB: 4 files average per package. -Oban: focused modules (engine, worker, plugin). -Ecto: one builder per clause type. -The package boundary forces you to define the API. 
- - diff --git a/sources/crosscutting-analysis.md b/sources/crosscutting-analysis.md deleted file mode 100644 index 7ad78bd..0000000 --- a/sources/crosscutting-analysis.md +++ /dev/null @@ -1,301 +0,0 @@ -# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts - -Cross-cutting concerns are the things that touch everything -but belong nowhere. How a codebase handles logging, -telemetry, config, retry, and lifecycle management reveals -its architectural philosophy more than any feature code. - ---- - -## 1. Logging: From Strings to Semantic Channels - -### CockroachDB: Channel-Based Log Routing - -CockroachDB doesn't just log at severity levels — it -routes logs to **semantic channels**: - -```go -const DEV = logpb.Channel_DEV // development noise -const OPS = logpb.Channel_OPS // operator actions -const HEALTH = logpb.Channel_HEALTH // background health -const STORAGE = logpb.Channel_STORAGE -const SESSIONS = logpb.Channel_SESSIONS -const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA -const USER_ADMIN = logpb.Channel_USER_ADMIN -``` - -Each channel can be routed to different sinks (file, -network, etc.) independently. Production deploys typically -disable DEV entirely and route HEALTH to monitoring. - -**Force:** In a multi-tenant distributed database, "who -cares about this log?" is a different question than "how -bad is it?" An INFO-level schema change matters to DBAs -but not to SREs monitoring node health. - -**Ecosystem insight:** The channel IS the audience. When -you write `log.Health.Warningf(...)`, you're declaring -"the person watching cluster health needs to see this." -Severity is orthogonal to audience. - -### Prometheus: Self-Instrumentation - -Prometheus instruments itself with its own metrics: - -```go -type scrapeMetrics struct { - targetScrapeSampleLimit prometheus.Counter - targetScrapeSampleOutOfOrder prometheus.Counter - targetIntervalLengthHistogram *prometheus.HistogramVec - // ... 
20+ metrics -} -``` - -Metrics are collected in a struct, constructed once via -`newScrapeMetrics(reg)`, and passed to subsystems. No -global registration — the registerer is injected. - -**Force:** Prometheus IS the metrics system. If it used -a different metrics library to instrument itself, that -would be a design smell. Dogfooding proves the API works. - -### Ecto + Oban: Telemetry as Standard - -Both use Erlang's `:telemetry` library with predictable -naming: - -```elixir -# Oban -:telemetry.execute([:oban, :job, :start], measurements, meta) -:telemetry.execute([:oban, :job, :stop], measurements, meta) -:telemetry.execute([:oban, :job, :exception], measurements, meta) - -# Ecto (adapter-emitted) -[:my_app, :repo, :query] -``` - -**Force:** The BEAM ecosystem standardized on `:telemetry` -for observability. Libraries don't own their monitoring — -they emit events; consumers attach handlers. This inverts -the logging relationship: the library doesn't decide what -to do with the information. - ---- - -## 2. Config Propagation: Three Models - -### CockroachDB: Cluster Settings (Distributed Config) - -```go -settings.RegisterDurationSetting( - settings.ApplicationLevel, - "bulkio.ingest.flush_delay", - "amount of time to wait before sending a file...", - 0, // default -) -``` - -Settings are: -- **Typed** (Duration, Bool, Int, String) -- **Leveled** (ApplicationLevel vs SystemVisible) -- **Validated** (NonNegativeInt, etc.) -- **Distributed** (propagated across all nodes) -- **Version-gated** (new settings require cluster version) - -Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)` - -**Force:** In a distributed database, config isn't a file -— it's consensus. Every node must agree on every setting, -and settings can only be enabled once all nodes support -them. The version gate is the safety mechanism. 
- -### Prometheus: ApplyConfig (Hot Reload) - -```go -func (m *Manager) ApplyConfig(cfg *config.Config) error { - m.mtxScrape.Lock() - defer m.mtxScrape.Unlock() - // rebuild scrape pools from new config - // close old loggers, open new ones -} -``` - -Config is a struct loaded from YAML. On SIGHUP (or API -call), the entire config is re-parsed and `ApplyConfig` -is called on each subsystem. Each subsystem holds a mutex -and swaps atomically. - -**Force:** Prometheus runs as a single binary. Config -reload must be atomic per-subsystem but doesn't need -distributed consensus. The mutex-per-subsystem pattern -gives independent reload without global coordination. - -### Ecto + Oban: Config at Init, Validated Once - -```elixir -# Oban validates exhaustively at startup -Validation.validate_schema(opts, - engine: {:behaviour, Oban.Engine}, - queues: {:custom, &validate_queues/1}, - repo: {:module, [config: 0]}, - ... -) -``` - -Config is validated once at startup and stored as an -immutable struct. No hot reload. If config is wrong, -you know immediately (fail fast). - -**Force:** Elixir/OTP applications restart processes to -apply new config. Hot reload is handled by supervisor -restarts, not config mutation. The "config as immutable -struct" pattern means no runtime config bugs — it either -passes validation at startup or the app doesn't start. - ---- - -## 3. Retry and Resilience - -### CockroachDB: Iterator-Based Retry - -```go -opts := retry.Options{ - InitialBackoff: 100 * time.Millisecond, - MaxBackoff: 2 * time.Second, - Multiplier: 2, - MaxRetries: 5, -} -for r := retry.StartWithCtx(ctx, opts); r.Next(); { - // attempt operation - if err == nil { break } -} -``` - -Retry is a **for-loop iterator**. `r.Next()` handles -backoff timing and returns false when exhausted. This -means retry logic reads like normal code — no callbacks, -no framework. - -**Force:** CockroachDB has hundreds of retry sites. A -callback-based retry would create deeply nested code. 
-The iterator pattern keeps retry at the same indentation -level as the operation. - -### Oban: Repo Dispatch with Built-In Retry - -```elixir -defp dynamic_dispatch(conf, name, args, attempt) do - with_dynamic_repo(conf, fn repo -> - apply(repo, name, args) - end) -rescue - error in UndefinedFunctionError -> - if attempt < @retry_opts[:retry] do - jittery_sleep(attempt * @retry_opts[:delay]) - dynamic_dispatch(conf, name, args, attempt + 1) - else - reraise error, __STACKTRACE__ - end -end -``` - -Every Ecto operation dispatched through Oban's repo -wrapper gets automatic retry for transient failures. -The consumer never sees the retry — it's invisible -infrastructure. - -**Key insight:** Oban retries `UndefinedFunctionError` -on the repo module itself — absorbing the window during -hot code reload when the module doesn't exist. This is -an ecosystem-level concern (BEAM hot code loading) handled -transparently. - ---- - -## 4. Resource Lifecycle: The Stopper Pattern - -### CockroachDB: Stopper as Universal Lifecycle - -```go -type Stopper struct { ... } - -// RunTask runs a synchronous task -func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error - -// RunAsyncTask runs a goroutine tracked by the stopper -func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error - -// ShouldQuiesce returns a channel closed when shutdown begins -func (s *Stopper) ShouldQuiesce() <-chan struct{} - -// Stop initiates graceful shutdown -func (s *Stopper) Stop(ctx context.Context) -``` - -Every goroutine in CockroachDB is launched through a -Stopper. 
This gives: -- **Tracking**: know exactly which goroutines are running -- **Graceful shutdown**: quiesce signal before hard stop -- **Leak detection**: `PrintLeakedStoppers` in tests -- **Throttling**: semaphore limits async tasks - -```go -func init() { - leaktest.PrintLeakedStoppers = PrintLeakedStoppers -} -``` - -**Force:** A database cannot afford goroutine leaks — -they hold locks, connections, and file handles. The -Stopper is the universal answer: every background task -is accounted for, every shutdown is graceful, every leak -is detected in tests. - -### Oban: Registry-Based Lifecycle - -```elixir -children = [ - {Notifier, conf: conf, name: Registry.via(name, Notifier)}, - {Nursery, conf: conf, name: Registry.via(name, Nursery)}, - ... -] -``` - -OTP already provides lifecycle management via supervisors. -Oban's addition is the Registry — namespacing processes -so multiple instances can coexist. Lifecycle is delegated -to the platform; naming is the library's concern. - ---- - -## 5. What These Patterns Teach for Code Review - -### Questions to Ask About Cross-Cutting Concerns: - -1. **Logging:** Who is the audience for this log? Is there - a routing mechanism, or does everything go to stdout? - Does the log help the *operator*, not just the developer? - -2. **Config:** How does config reach this code? Is it - validated at startup or silently wrong at runtime? Can - it be changed without restart? Should it be? - -3. **Retry:** Is retry happening at the right layer? Is it - invisible to the caller? Does it have backoff + jitter? - Does it respect context cancellation? - -4. **Lifecycle:** Are background tasks tracked? Will they - shut down gracefully? Can you detect leaks in tests? - -5. **Telemetry:** Are events emitted or is logging the only - observability? Can consumers attach their own handlers? 
- -### Red Flags: - -- `log.Info("something happened")` with no channel/audience -- Config read from environment at point-of-use (not validated) -- Retry logic duplicated in 5 places with different backoff -- Goroutines launched with `go func()` and no tracking -- No telemetry events — only log lines for observability - - diff --git a/sources/ecosystem-analysis.md b/sources/ecosystem-analysis.md deleted file mode 100644 index e8ce749..0000000 --- a/sources/ecosystem-analysis.md +++ /dev/null @@ -1,371 +0,0 @@ -# Ecosystem-Level Patterns: How Codebases Present to Consumers - -## The Three Questions - -For each codebase, ask: -1. How do consumers **extend** it? (What interfaces/behaviours - do they implement?) -2. How do consumers **compose** with it? (What does day-to-day - usage look like?) -3. What does it deliberately **NOT do**? (What forces shaped - those refusals?) - ---- - -## CockroachDB: Errors as First-Class Distributed Data - -### Extension Points - -CockroachDB is not a library — it is a system. Consumers -extend it through: -- **SQL builtins** (function registration) -- **Storage engines** (via pebble interface) -- **Service discovery** (not user-extensible — closed) - -The interesting pattern is how errors flow from storage -through KV through SQL to the client. - -### Error Architecture (ecosystem-level idiom) - -``` -Storage error → encoded via cockroachdb/errors → - KV wraps with context → serialized across gRPC → - SQL decodes → maps to pgcode → wire protocol to client -``` - -**Key design decisions:** - -1. **Errors have priority.** `ErrPriority()` ranks errors so - the system knows which to surface when multiple things - fail simultaneously. Transaction abort > restart > - unambiguous error > non-retriable. - -2. **Errors survive serialization.** `EncodeError` / - `DecodeError` serialize errors across RPC boundaries. - The error that originated on node 3 arrives at node 1 - with its full cause chain intact. - -3. 
**Errors map to pg codes.** Every internal error maps to - a Postgres error code that clients understand. This is - the *ecosystem contract* — clients write - `if pgcode == '40001' { retry }`. - -**What this teaches:** In a distributed system, an error -isn't a string — it's a data object with identity, -priority, serializability, and a consumer-facing code. -Design your error types for the *consumer*, not the -*producer*. - -### Deliberate Absences - -- **No dependency injection framework.** Config structs - passed explicitly. 1178-line `StoreConfig` struct, but - it's all data — no framework magic. -- **No context.Background() on hot paths.** 144 uses in - kvserver, but auditable — each justified in comments. -- **No functional options.** CockroachDB uses config - structs universally. The Option interface in stopper is - the exception, not the rule. - -### Test Architecture - -- **TestMain in every package.** Sets up security certs, - random seeds, and test server factories. -- **Goroutine leak detection.** `leaktest.AfterTest(t)()` - at the start of every test. Detects leaked goroutines - by diffing goroutine stacks before/after. -- **Stopper leak detection.** Every Stopper is tracked - globally; `PrintLeakedStoppers(t)` in TestMain catches - forgot-to-stop bugs. -- **`//go:generate` for test setup.** Codegen tool - (`add-leaktest.sh`) auto-adds leak checks to every - test file. - -**What this teaches:** At scale, the most important test -infrastructure isn't assertions — it's resource leak -detection. Every goroutine, every connection, every -Stopper is tracked and verified to be cleaned up. 
- ---- - -## Prometheus: The One-Method Interface Contract - -### Extension Points - -Prometheus is extended through: -- **Service discovery** (30 implementations, 1 interface) -- **Storage** (remote read/write adapters) -- **Exporters** (client_golang metrics) - -### The Discoverer Pattern (ecosystem-level idiom) - -```go -type Discoverer interface { - Run(ctx context.Context, up chan<- []*targetgroup.Group) -} -``` - -This is **one method**. Thirty implementations. The -channel-based push model means: -- The discoverer controls timing (not polled) -- The manager multiplexes without knowing implementations -- Adding a new discovery source = implement Run, register - -**Registration via init():** -```go -func init() { - discovery.RegisterConfig(&SDConfig{}) -} -``` - -This is the classic Go plugin pattern. Import the package -→ init registers it → the system discovers it at startup. - -**What this teaches:** The smallest possible interface -creates the largest possible ecosystem. One method + one -channel = 30 implementations without coordination. - -### Storage Contract (15 interfaces, 1 file) - -All of Prometheus's storage contract lives in -`storage/interface.go`. It covers: -- Read path: `Queryable → Querier → SeriesSet → Series` -- Write path: `Appendable → Appender` -- Extension: `ExemplarAppender`, `MetadataUpdater` - -**Key:** Every implementation proves satisfaction at -compile time with `var _ storage.Searcher = &concreteType{}`. -When the contract evolves, the compiler finds every -broken implementation. - -### Deliberate Absences - -- **No generics in storage interfaces.** Despite generics - being available since Go 1.18. The interfaces predate generics and adding them - would break all existing implementations. -- **No dependency injection.** Direct struct construction - everywhere. Testability through interface satisfaction, - not framework wiring. -- **Almost no functional options.** Only in leaf packages - (chunk writer, parser). Core APIs use config structs. 
-- **No goroutine leak in production code.** `goleak` in - tests, `TolerantVerifyLeak` with explicit allowlist for - known third-party leaks. - -### Test Architecture - -- **`TolerantVerifyLeak`** — goroutine leak detection with - allowlist for known third-party leaks (opencensus, klog) -- **Mock implementations of every interface** — defined - right in `storage/interface.go` next to the real ones -- **Golden file tests** in PromQL evaluation - ---- - -## Ecto: Composability as Architectural Principle - -### Extension Points - -Consumers extend Ecto through: -- **Custom types** (7 callbacks: cast, load, dump, equal?, - embed_as, autogenerate, type) -- **Adapters** (Queryable, Schema, Transaction, Storage — - 4 behaviour modules) -- **Protocols** (`Ecto.Queryable` — anything can become a - query) - -### The NotLoaded Sentinel (ecosystem-level idiom) - -```elixir -defmodule Ecto.Association.NotLoaded do - defstruct [:__field__, :__owner__, :__cardinality__] -end -``` - -Ecto **refuses to lazy-load associations**. If you access -`user.posts` without preloading, you get a `NotLoaded` -struct — not nil, not an empty list, not a database query. - -**Why this is an ecosystem decision:** -- Forces consumers to be explicit about data needs -- Prevents N+1 queries by making them impossible -- Makes the data boundary visible in code - -This is a *consumer-hostile* decision that makes -*systems built on Ecto* dramatically better. The library -optimizes for the 1000th user, not the first-day -experience. - -### Query Composition (ecosystem-level idiom) - -Every query clause appends to a list in the Query struct. -Nothing executes. The Query is pure data that accumulates -intent. 
- -**Consumer impact:** You can build queries across module -boundaries: - -```elixir -# Module A builds the base -def active_users, do: from(u in User, where: u.active) - -# Module B adds pagination -def paginate(query, page, size) do - query - |> limit(^size) - |> offset(^((page - 1) * size)) -end - -# Module C adds authorization -def visible_to(query, role) do - where(query, [u], u.role in ^roles_for(role)) -end -``` - -Each module is independent. They compose because queries -are data, not effects. - -### Adapter Architecture - -``` -Ecto.Repo.all(query) - → Planner resolves types, bindings - → Adapter.prepare/2 produces {cache, prepared} - → Adapter.execute/5 runs against DB - → Adapter.loaders/2 converts back to Elixir types -``` - -The adapter is the ONLY part that knows SQL. Ecto core -is database-agnostic. This is why the same code works on -Postgres, MySQL, SQLite, and custom stores. - -### Deliberate Absences - -- **No lazy loading.** `NotLoaded` struct instead. -- **No global state.** Per-repo config, per-repo process. -- **No query caching at library level.** The adapter - caches prepared statements; Ecto doesn't. -- **No connection to schema naming.** `schema "legacy_tbl"` - is independent of `defmodule NewUser`. 
- ---- - -## Oban: Designing for Testability First - -### Extension Points - -Consumers extend Oban through: -- **Workers** (`perform/1` — the job logic) -- **Plugins** (GenServer + validate callback) -- **Engines** (entire backend swap) -- **Notifiers** (pub/sub mechanism) -- **Peers** (leader election) - -### The Worker Result Type (ecosystem-level idiom) - -```elixir -@type result :: - :ok - | {:ok, ignored :: term()} - | {:error, reason :: term()} - | {:cancel, reason :: term()} - | {:snooze, period :: Period.t()} -``` - -Five possible outcomes, each with distinct semantics: -- `:ok` → success, remove from queue -- `{:error, reason}` → retry (respects max_attempts) -- `{:cancel, reason}` → permanent failure, don't retry -- `{:snooze, period}` → reschedule for later - -**Ecosystem impact:** Every worker author makes an -explicit decision about failure semantics. "What should -happen when this fails?" is answered in the type system, -not in configuration. - -### Contextual Backoff (ecosystem-level idiom) - -```elixir -def backoff(%Job{attempt: attempt, unsaved_error: err}) do - case err.reason do - %RateLimitError{retry_after: ms} -> ms - _ -> trunc(:math.pow(attempt, 4) + jitter()) - end -end -``` - -The error that caused the failure is available to the -backoff calculation. Different errors → different retry -strategies. This is impossible in systems where backoff -is configured globally. - -### Testing Design (ecosystem-level idiom) - -Three testing modes via config: -- **`:inline`** — execute jobs synchronously in tests -- **`:manual`** — enqueue but don't execute -- **`:disabled`** — production behavior - -Plus `use Oban.Testing` which provides: -- `assert_enqueued/1` — verify job was queued -- `refute_enqueued/1` — verify job was NOT queued -- `perform_job/2` — execute a job manually in tests -- `all_enqueued/1` — list all matching jobs - -**Ecosystem impact:** Every Oban consumer gets -deterministic, fast, isolated tests for free. 
No sleep, -no polling, no flaky async assertions. - -### Deliberate Absences - -- **No global process names.** Registry.via everywhere — - multiple Oban instances can coexist. -- **No direct DB coupling in workers.** Workers receive a - Job struct; they don't import Repo. -- **No implicit retries.** max_attempts is explicit per - worker. No "retry forever" default. -- **No built-in rate limiting in OSS.** That is a Pro - feature — deliberate business boundary. - ---- - -## Cross-Cutting: What "Idiomatic" Means at Ecosystem Level - -### 1. The Consumer Contract is the API - -Not the functions you export — the *experience* of -building on your system: -- CockroachDB: "Your errors will be pg-codes, always" -- Prometheus: "Implement Run(), get discovery for free" -- Ecto: "Queries are data; loading is always explicit" -- Oban: "Return a result type; testing is built in" - -### 2. Deliberate Absences Define Character - -What a system refuses to do is as important as what it -does: -- Ecto refuses lazy loading → forces explicit data needs -- Oban refuses global names → enables multi-instance -- Prometheus refuses DI frameworks → keeps simplicity -- CockroachDB refuses context.Background on hot paths → - forces timeout discipline - -### 3. Testability is Never Retrofitted - -Every system that tests well designed testing in from the -start: -- CockroachDB: leak detection, stopper tracking -- Prometheus: goroutine leak verification, mock interfaces -- Ecto: adapter abstraction, embedded schemas for testing -- Oban: engine swap, testing modes, assertion helpers - -### 4. Extension Points Define the Ecosystem Size - -- Prometheus: 1 interface, 30 discoverers -- Ecto: 7 type callbacks, hundreds of custom types -- Oban: Worker behaviour + 5 engine callbacks - -**Smaller interface → larger ecosystem.** The less you -demand from implementors, the more you get. 
- - diff --git a/sources/testing-evolution-analysis.md b/sources/testing-evolution-analysis.md deleted file mode 100644 index f724cde..0000000 --- a/sources/testing-evolution-analysis.md +++ /dev/null @@ -1,297 +0,0 @@ -# Testing Philosophy & API Evolution - -How codebases prove correctness and manage change over -time reveals their deepest architectural commitments. - ---- - -## Testing Philosophy: Four Models of Proof - -### CockroachDB: Defense in Depth - -**Levels of proof:** -1. **Unit tests** — co-located in same package -2. **Echotest/golden files** — snapshot expected output (209 - testdata directories, auto-rewrite with -rewrite flag) -3. **Data-driven tests** — declarative test specs in txt files -4. **KVNemesis** — chaos/fuzzing that generates random KV - operations and checks linearizability -5. **Leak detection** — goroutines, stoppers tracked globally - -**The echotest pattern:** -```go -echotest.Require(t, output, filepath.Join("testdata", name+".txt")) -``` - -Golden file says: -``` -echo ----- -result is ambiguous: boom with a secret -result is ambiguous: boom with a ‹secret› -``` - -The test produces output, compares against the golden file. -Run with `-rewrite` to update. This means: -- Tests are **self-documenting** (the golden file IS the spec) -- Regressions are **visible in diffs** (the golden file changes) -- No manual expected-value maintenance - -**KVNemesis (chaos testing at ecosystem level):** -Generates random sequences of KV operations (puts, gets, -splits, merges, transfers) against a real cluster, then -validates that results satisfy serializable isolation. - -This isn't unit testing. This is proving the *system* is -correct, not individual functions. - -**Resource leak detection as CI gate:** -```go -// Every test file -defer leaktest.AfterTest(t)() - -// Every TestMain -func init() { - leaktest.PrintLeakedStoppers = PrintLeakedStoppers -} -``` - -If a test leaks a goroutine or Stopper, it **fails**. Not -a warning. A failure. 
This means resource correctness is -as enforceable as logic correctness. - -### Prometheus: Golden Files + Goroutine Verification - -**Testing DSL for PromQL:** -``` -load 5m - http_requests{job="api-server"} 0+10x10 - -eval instant at 50m SUM BY (group) (http_requests) - {group="canary"} 700 - {group="production"} 300 -``` - -This is a custom test language. Load data, evaluate -expressions, assert results. **205 test config files** -in `config/testdata/` alone. - -**Force:** PromQL is complex enough that example-based -testing would be insufficient. The DSL lets you write -hundreds of test cases concisely, covering edge cases -that would require dozens of Go test functions. - -**Goroutine leak detection:** -```go -func TolerantVerifyLeak(m *testing.M) { - goleak.VerifyTestMain(m, - goleak.IgnoreTopFunction("go.opencensus.io/..."), - goleak.IgnoreTopFunction("k8s.io/klog/..."), - ) -} -``` - -Explicit allowlist for known third-party leaks. Everything -else is a test failure. Zero-tolerance with escape hatches -for unfixable external dependencies. - -### Ecto: Fake Adapter + Process Mailbox Assertions - -```elixir -defmodule Ecto.TestAdapter do - @behaviour Ecto.Adapter - @behaviour Ecto.Adapter.Queryable - @behaviour Ecto.Adapter.Schema - @behaviour Ecto.Adapter.Transaction - - def execute(_, _, {:nocache, {:all, query}}, _, _) do - send(self(), {:all, query}) - Process.get(:test_repo_all_results) || results_for_all_query(query) - end -end -``` - -**Ecto tests the entire query pipeline without a database.** -The fake adapter: -- Sends messages to `self()` on every operation -- Tests assert on `receive {:insert, meta}` etc. -- No network, no state, pure message-passing verification - -**48 test files, 43 with `async: true`.** The test suite -runs in parallel because there's no shared state — every -test talks to its own process mailbox. - -**Force:** Ecto is a *library*, not a service. It can't -require Postgres in CI for every contributor. 
The fake -adapter makes the entire query compilation + planning -pipeline testable without external dependencies. - -### Oban: Testing Modes as First-Class Feature - -```elixir -# In test config -config :my_app, Oban, testing: :inline - -# In test -use Oban.Testing, repo: MyApp.Repo - -test "job was enqueued" do - assert_enqueued worker: MyWorker, args: %{id: 1} -end - -test "job executes correctly" do - assert :ok = perform_job(MyWorker, %{id: 1}) -end -``` - -Three modes: -- **`:inline`** — jobs execute synchronously in the test - process. No GenServers, no queues, no async. -- **`:manual`** — jobs are enqueued but not executed. - Use `assert_enqueued` to verify they were created. -- **`:disabled`** — production behavior in tests. - -**Force:** Background jobs are the #1 source of test -flakiness. Oban eliminates it by making the execution -model configurable. Tests never poll, never sleep, never -race. - ---- - -## API Evolution: Three Strategies - -### CockroachDB: Version Gates (Distributed Migration) - -```go -const ( - V26_2_AddStatementStatisticsComputedColumns Key = iota - V26_2_ChangefeedsStopReadingSpanLevelCheckpoints - V26_2_ChangefeedsStopWritingSpanLevelCheckpoints -) - -// In code: -if settings.Version.IsActive(ctx, clusterversion.V26_2) { - // use new behavior -} -``` - -**The pattern:** Every change to observable behavior gets -a version constant. The feature is only enabled when ALL -nodes in the cluster have been upgraded past that version. - -**Two-phase deprecation for distributed changes:** -``` -V26_2_ChangefeedsStopReadingSpanLevelCheckpoints -V26_2_ChangefeedsStopWritingSpanLevelCheckpoints -V26_2_ChangefeedsNoLongerHaveSpanLevelCheckpoints -``` - -Three versions for one removal: -1. Stop reading (new code doesn't depend on old format) -2. Stop writing (old format no longer produced) -3. Clean up (safe to remove the old code) - -**Force:** In a distributed database, you can't change -behavior atomically. Some nodes will be old, some new. 
-The version gate ensures new behavior only activates -when it's safe — when all nodes understand it. - -**Pruning:** Once MinSupported advances past a version -constant, the gated code path is always active, so the -`IsActive` check is dead code and the constant is -deleted. Regular -pruning keeps the codebase from accumulating gates. - -### Oban: Numbered Migrations (Schema Evolution) - -```elixir -lib/oban/migrations/postgres/ -├── v01.ex # Initial schema (job table, state enum) -├── v02.ex # Add columns -├── v03.ex # Index optimization -... -├── v14.ex # Latest -``` - -Each migration is: -- **Idempotent** (safe to run twice) -- **Prefix-aware** (multi-tenant schemas) -- **Bidirectional** (up + down) -- **Database-specific** (postgres/, sqlite/, myxql/) - -**Consumer usage:** -```elixir -defmodule MyApp.Repo.Migrations.AddOban do - use Ecto.Migration - def up, do: Oban.Migrations.up(version: 14) - def down, do: Oban.Migrations.down(version: 14) -end -``` - -**Force:** Oban owns a database table but lives inside -the consumer's migration system. Numbered versions let -consumers upgrade incrementally without knowing Oban -internals. - -### Ecto: Compile-Time Deprecation + Semver - -```elixir -# In changeset.ex -IO.warn( - "passing a list of binaries to cast/3 is deprecated..." -) -``` - -Ecto deprecates at **compile time**. When you compile -code that uses a deprecated API, you get a warning. -At runtime, everything still works. - -**CHANGELOG as contract:** -``` -## v3.14.0-dev -### Enhancements -### Bug fixes - -## v3.13.5 (2025-11-09) -### Enhancements -``` - -The changelog is the API evolution document. Breaking -changes require a major version bump (hasn't happened -in years because the adapter pattern provides -extensibility without breakage). - ---- - -## What This Teaches for Code Review - -### Testing Questions: -1. Is this testable **without standing up the system**? - (Ecto's fake adapter, Oban's inline engine) -2. Are resources **tracked and leak-detected**?
- (CockroachDB's stopper/goroutine tracking) -3. Are test assertions **deterministic**? No sleep, no - poll, no "eventually consistent" in unit tests. -4. Could this be a **golden file test**? If the output - is deterministic, snapshot it. Regression = visible diff. -5. Is there **chaos/property testing** for invariants? - (KVNemesis for linearizability) - -### Evolution Questions: -1. Can this change be deployed **gradually**? Or does it - require all consumers to upgrade atomically? -2. Is there a **two-phase** path? (Stop reading → stop - writing → remove) -3. Is the deprecation **visible at compile time**? Or - will consumers only discover it at runtime? -4. Is the migration **idempotent**? Can it be run twice - safely? - -### Red Flags: -- Tests that require a running database for unit-level logic -- No resource leak detection in concurrent code -- `time.Sleep` / `Process.sleep` in tests instead of - deterministic signals -- Breaking changes without version gates or migration path -- Deprecation that only appears in docs, not in tooling - -
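
The `time.Sleep` red flag can be made concrete with a small Go sketch. It assumes a hypothetical `Worker` type (not taken from any of the codebases discussed) that runs jobs in the background. Instead of sleeping and hoping the work has finished, the caller blocks on a channel that the worker closes on completion.

```go
package main

import (
	"fmt"
	"sync"
)

// Worker is a hypothetical background job runner used only
// to illustrate waiting on a deterministic signal.
type Worker struct {
	mu   sync.Mutex
	seen []int
	done chan struct{} // closed when all jobs are processed
}

func NewWorker() *Worker {
	return &Worker{done: make(chan struct{})}
}

// Process handles the jobs in a goroutine and closes done
// when the last one has been recorded.
func (w *Worker) Process(jobs []int) {
	go func() {
		defer close(w.done)
		for _, j := range jobs {
			w.mu.Lock()
			w.seen = append(w.seen, j*2)
			w.mu.Unlock()
		}
	}()
}

// Wait blocks until processing has finished, then returns a
// copy of the results. No sleeping or polling is involved.
func (w *Worker) Wait() []int {
	<-w.done
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]int(nil), w.seen...)
}

func main() {
	w := NewWorker()
	w.Process([]int{1, 2, 3})
	fmt.Println(w.Wait())
}
```

A test built on `Wait` finishes as soon as the work does and cannot observe a half-finished state, which is the property the red-flag list asks reviewers to look for.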