docs: cross-cutting concerns analysis (logging, config, retry, lifecycle)

How CockroachDB, Prometheus, Ecto, and Oban handle the things that touch everything but belong nowhere. Includes red flags and review questions for each concern.
2026-04-30 10:31:19 -07:00
parent cee58e85a4
commit d6f36b67c8
1 changed files with 301 additions and 0 deletions
@@ -0,0 +1,301 @@
 # Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts
 Cross-cutting concerns are the things that touch everything
 but belong nowhere. How a codebase handles logging,
 telemetry, config, retry, and lifecycle management reveals
 its architectural philosophy more than any feature code.
 ---
 ## 1. Logging: From Strings to Semantic Channels
 ### CockroachDB: Channel-Based Log Routing
 CockroachDB doesn't just log at severity levels — it
 routes logs to **semantic channels**:
 ```go
 const DEV = logpb.Channel_DEV       // development noise
 const OPS = logpb.Channel_OPS       // operator actions
 const HEALTH = logpb.Channel_HEALTH // background health
 const STORAGE = logpb.Channel_STORAGE
 const SESSIONS = logpb.Channel_SESSIONS
 const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
 const USER_ADMIN = logpb.Channel_USER_ADMIN
 ```
 Each channel can be routed to different sinks (file,
 network, etc.) independently. Production deploys typically
 disable DEV entirely and route HEALTH to monitoring.
 **Force:** In a multi-tenant distributed database, "who
 cares about this log?" is a different question than "how
 bad is it?" An INFO-level schema change matters to DBAs
 but not to SREs monitoring node health.
 **Ecosystem insight:** The channel IS the audience. When
 you write `log.Health.Warningf(...)`, you're declaring
 "the person watching cluster health needs to see this."
 Severity is orthogonal to audience.
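 A minimal sketch of the idea, using only the Go standard
 library. The `Router`, `Sink`, and `MemorySink` names are
 hypothetical and are not CockroachDB's API; the point is
 only that channel picks the destination while severity
 rides along as data.

 ```go
 package main

 import "fmt"

 // Channel names the audience of a log line; Severity says how bad it is.
 type Channel int
 type Severity int

 const (
     DEV Channel = iota
     OPS
     HEALTH
 )

 const (
     INFO Severity = iota
     WARNING
     ERROR
 )

 // Sink is any destination for formatted log lines.
 type Sink interface{ Write(line string) }

 // MemorySink stands in for a file or network sink.
 type MemorySink struct{ Lines []string }

 func (m *MemorySink) Write(line string) { m.Lines = append(m.Lines, line) }

 // Router maps each channel to zero or more sinks (hypothetical type).
 type Router struct{ routes map[Channel][]Sink }

 func NewRouter() *Router { return &Router{routes: map[Channel][]Sink{}} }

 func (r *Router) Route(ch Channel, s Sink) { r.routes[ch] = append(r.routes[ch], s) }

 // Log sends the line to every sink routed for ch. A channel with no
 // sinks drops the line, whatever its severity.
 func (r *Router) Log(ch Channel, sev Severity, format string, args ...any) {
     line := fmt.Sprintf("[ch=%d sev=%d] ", ch, sev) + fmt.Sprintf(format, args...)
     for _, s := range r.routes[ch] {
         s.Write(line)
     }
 }

 func main() {
     monitoring := &MemorySink{}
     r := NewRouter()
     r.Route(HEALTH, monitoring) // DEV has no sink, so it is effectively disabled
     r.Log(DEV, ERROR, "debug dump %d", 42)
     r.Log(HEALTH, WARNING, "disk %s is slow", "sda")
     for _, line := range monitoring.Lines {
         fmt.Println(line)
     }
 }
 ```

 Note that the ERROR-level DEV line is dropped while the
 WARNING-level HEALTH line is delivered: routing depends on
 the channel alone.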
 ### Prometheus: Self-Instrumentation
 Prometheus instruments itself with its own metrics:
 ```go
 type scrapeMetrics struct {
    targetScrapeSampleLimit        prometheus.Counter
    targetScrapeSampleOutOfOrder   prometheus.Counter
    targetIntervalLengthHistogram  *prometheus.HistogramVec
    // ... 20+ metrics
 }
 ```
 Metrics are collected in a struct, constructed once via
 `newScrapeMetrics(reg)`, and passed to subsystems. No
 global registration — the registerer is injected.
 **Force:** Prometheus IS the metrics system. If it used
 a different metrics library to instrument itself, that
 would be a design smell. Dogfooding proves the API works.
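 The struct-plus-injected-registerer shape can be sketched
 without the Prometheus client library. Everything below
 (`Registerer`, `Registry`, `Counter`, the metric names) is a
 hypothetical stand-in, and the counter is deliberately not
 goroutine-safe to keep the sketch short.

 ```go
 package main

 import (
     "errors"
     "fmt"
 )

 // Registerer is a hypothetical stand-in for an injected metrics registry.
 type Registerer interface {
     Register(name string, c *Counter) error
 }

 // Counter is a toy counter; it is not safe for concurrent use.
 type Counter struct{ n uint64 }

 func (c *Counter) Inc()          { c.n++ }
 func (c *Counter) Value() uint64 { return c.n }

 // Registry is a simple Registerer that rejects duplicate names.
 type Registry struct{ counters map[string]*Counter }

 func NewRegistry() *Registry { return &Registry{counters: map[string]*Counter{}} }

 func (r *Registry) Register(name string, c *Counter) error {
     if _, dup := r.counters[name]; dup {
         return errors.New("duplicate metric: " + name)
     }
     r.counters[name] = c
     return nil
 }

 // scrapeMetrics groups one subsystem's metrics in a single struct.
 type scrapeMetrics struct {
     sampleLimit *Counter
     outOfOrder  *Counter
 }

 // newScrapeMetrics builds the struct once and registers every metric
 // with the registerer it was handed; there is no package-level registry.
 func newScrapeMetrics(reg Registerer) (*scrapeMetrics, error) {
     m := &scrapeMetrics{sampleLimit: &Counter{}, outOfOrder: &Counter{}}
     if err := reg.Register("scrape_sample_limit_total", m.sampleLimit); err != nil {
         return nil, err
     }
     if err := reg.Register("scrape_out_of_order_total", m.outOfOrder); err != nil {
         return nil, err
     }
     return m, nil
 }

 func main() {
     reg := NewRegistry() // each test can build its own isolated registry
     m, err := newScrapeMetrics(reg)
     if err != nil {
         panic(err)
     }
     m.outOfOrder.Inc()
     fmt.Println(m.outOfOrder.Value(), len(reg.counters))
 }
 ```

 Because the registry is a parameter, two subsystems (or two
 tests) can each get a fresh one and never collide on metric
 names.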
 ### Ecto + Oban: Telemetry as Standard
 Both use Erlang's `:telemetry` library with predictable
 naming:
 ```elixir
 # Oban
 :telemetry.execute([:oban, :job, :start], measurements, meta)
 :telemetry.execute([:oban, :job, :stop], measurements, meta)
 :telemetry.execute([:oban, :job, :exception], measurements, meta)
 # Ecto (adapter-emitted)
 [:my_app, :repo, :query]
 ```
 **Force:** The BEAM ecosystem standardized on `:telemetry`
 for observability. Libraries don't own their monitoring —
 they emit events; consumers attach handlers. This inverts
 the logging relationship: the library doesn't decide what
 to do with the information.
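 For readers more at home in Go, the emit/attach inversion
 can be sketched roughly as follows. The `Bus` type and its
 methods are hypothetical and only loosely mirror
 `:telemetry`; the sketch is single-goroutine and ignores
 handler detachment.

 ```go
 package main

 import (
     "fmt"
     "strings"
 )

 // Handler receives the event name plus its measurements and metadata.
 type Handler func(event string, measurements map[string]float64, meta map[string]string)

 // Bus is a hypothetical in-process event dispatcher.
 type Bus struct{ handlers map[string][]Handler }

 func NewBus() *Bus { return &Bus{handlers: map[string][]Handler{}} }

 func key(event []string) string { return strings.Join(event, ".") }

 // Attach is called by the consumer, which decides what to do with events.
 func (b *Bus) Attach(event []string, h Handler) {
     k := key(event)
     b.handlers[k] = append(b.handlers[k], h)
 }

 // Execute is called by the library; it knows nothing about the handlers.
 func (b *Bus) Execute(event []string, measurements map[string]float64, meta map[string]string) {
     k := key(event)
     for _, h := range b.handlers[k] {
         h(k, measurements, meta)
     }
 }

 func main() {
     bus := NewBus()
     stop := []string{"oban", "job", "stop"}
     bus.Attach(stop, func(ev string, m map[string]float64, meta map[string]string) {
         fmt.Printf("%s queue=%s duration_ms=%.0f\n", ev, meta["queue"], m["duration_ms"])
     })
     // No handler is attached for "start", so this event is simply dropped.
     bus.Execute([]string{"oban", "job", "start"}, nil, nil)
     bus.Execute(stop, map[string]float64{"duration_ms": 12}, map[string]string{"queue": "default"})
 }
 ```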
 ---
 ## 2. Config Propagation: Three Models
 ### CockroachDB: Cluster Settings (Distributed Config)
 ```go
 settings.RegisterDurationSetting(
    settings.ApplicationLevel,
    "bulkio.ingest.flush_delay",
    "amount of time to wait before sending a file...",
    0,  // default
 )
 ```
 Settings are:
 - **Typed** (Duration, Bool, Int, String)
 - **Leveled** (ApplicationLevel vs SystemVisible)
 - **Validated** (NonNegativeInt, etc.)
 - **Distributed** (propagated across all nodes)
 - **Version-gated** (new settings require cluster version)
 Version gates are checked at the call site:
 `settings.Version.IsActive(ctx, clusterversion.V26_2)`
 **Force:** In a distributed database, config isn't a file
 — it's consensus. Every node must agree on every setting,
 and settings can only be enabled once all nodes support
 them. The version gate is the safety mechanism.
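 A rough, single-process sketch of a typed, validated,
 version-gated setting. `DurationSetting`, `MinVersion`, and
 the integer version numbers are assumptions made for
 illustration; distribution across nodes is left out
 entirely, and none of this is CockroachDB's actual
 `settings` package.

 ```go
 package main

 import (
     "errors"
     "fmt"
     "time"
 )

 // DurationSetting is a hypothetical typed setting with a validator
 // and a version gate.
 type DurationSetting struct {
     Key        string
     Default    time.Duration
     MinVersion int // lowest active cluster version that may change the setting
     Validate   func(time.Duration) error
     value      time.Duration
     isSet      bool
 }

 // NonNegative is a reusable validator.
 func NonNegative(d time.Duration) error {
     if d < 0 {
         return errors.New("duration must be non-negative")
     }
     return nil
 }

 // Set applies v only if the active version has reached MinVersion
 // and v passes validation.
 func (s *DurationSetting) Set(activeVersion int, v time.Duration) error {
     if activeVersion < s.MinVersion {
         return fmt.Errorf("%s requires cluster version %d, active is %d",
             s.Key, s.MinVersion, activeVersion)
     }
     if err := s.Validate(v); err != nil {
         return fmt.Errorf("%s: %w", s.Key, err)
     }
     s.value, s.isSet = v, true
     return nil
 }

 // Get returns the configured value, or the default if none was set.
 func (s *DurationSetting) Get() time.Duration {
     if s.isSet {
         return s.value
     }
     return s.Default
 }

 func main() {
     flushDelay := &DurationSetting{
         Key: "bulkio.ingest.flush_delay", MinVersion: 2, Validate: NonNegative,
     }
     fmt.Println(flushDelay.Set(1, time.Second))  // rejected: cluster too old
     fmt.Println(flushDelay.Set(2, -time.Second)) // rejected: fails validation
     fmt.Println(flushDelay.Set(2, time.Second))  // accepted
     fmt.Println(flushDelay.Get())
 }
 ```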
 ### Prometheus: ApplyConfig (Hot Reload)
 ```go
 func (m *Manager) ApplyConfig(cfg *config.Config) error {
    m.mtxScrape.Lock()
    defer m.mtxScrape.Unlock()
    // rebuild scrape pools from new config
    // close old loggers, open new ones
    return nil
 }
 ```
 Config is a struct loaded from YAML. On SIGHUP (or API
 call), the entire config is re-parsed and `ApplyConfig`
 is called on each subsystem. Each subsystem takes its own
 mutex and swaps in the new state while holding it.
 **Force:** Prometheus runs as a single binary. Config
 reload must be atomic per-subsystem but doesn't need
 distributed consensus. The mutex-per-subsystem pattern
 gives independent reload without global coordination.
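 The mutex-per-subsystem swap can be sketched as below. The
 `Config` field and the validation rule are assumptions for
 illustration; this is not Prometheus's scrape manager.

 ```go
 package main

 import (
     "errors"
     "fmt"
     "sync"
     "time"
 )

 // Config is a hypothetical, already-parsed configuration.
 type Config struct {
     ScrapeInterval time.Duration
 }

 // Manager is one subsystem; it guards its own copy of the config.
 type Manager struct {
     mtx sync.RWMutex
     cfg Config
 }

 // ApplyConfig validates first, then swaps under the lock, so readers
 // see either the old config or the new one, never a mix.
 func (m *Manager) ApplyConfig(cfg Config) error {
     if cfg.ScrapeInterval <= 0 {
         return errors.New("scrape interval must be positive")
     }
     m.mtx.Lock()
     defer m.mtx.Unlock()
     m.cfg = cfg
     return nil
 }

 // Interval reads the current value under a read lock.
 func (m *Manager) Interval() time.Duration {
     m.mtx.RLock()
     defer m.mtx.RUnlock()
     return m.cfg.ScrapeInterval
 }

 func main() {
     m := &Manager{}
     err := m.ApplyConfig(Config{ScrapeInterval: 15 * time.Second})
     fmt.Println(err, m.Interval())

     // A bad reload is rejected and the previous config stays in force.
     err = m.ApplyConfig(Config{ScrapeInterval: 0})
     fmt.Println(err, m.Interval())
 }
 ```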
 ### Ecto + Oban: Config at Init, Validated Once
 ```elixir
 # Oban validates exhaustively at startup
 Validation.validate_schema(opts,
    engine: {:behaviour, Oban.Engine},
    queues: {:custom, &validate_queues/1},
    repo: {:module, [config: 0]},
    ...
 )
 ```
 Config is validated once at startup and stored as an
 immutable struct. No hot reload. If config is wrong,
 you know immediately (fail fast).
 **Force:** Elixir/OTP applications restart processes to
 apply new config. Config changes are handled by supervisor
 restarts, not by mutating config in place. The "config as
 immutable struct" pattern rules out config that goes
 invalid at runtime: it either passes validation at startup
 or the app doesn't start.
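 The same validate-once shape, sketched in Go for
 comparison. `Options`, `Config`, and the specific rules are
 hypothetical; in a real package the unexported fields would
 keep outside code from altering a `Config` after `New`
 returns.

 ```go
 package main

 import (
     "errors"
     "fmt"
 )

 // Options is the raw, unvalidated input (hypothetical fields).
 type Options struct {
     Repo   string
     Queues map[string]int // queue name -> concurrency limit
 }

 // Config holds validated settings. Its fields are unexported and it
 // has no setters, so it is read-only once built.
 type Config struct {
     repo   string
     queues map[string]int
 }

 // New validates everything up front and fails fast on the first problem.
 func New(opts Options) (Config, error) {
     if opts.Repo == "" {
         return Config{}, errors.New("repo is required")
     }
     queues := make(map[string]int, len(opts.Queues))
     for name, limit := range opts.Queues {
         if limit < 1 {
             return Config{}, fmt.Errorf("queue %q: limit must be at least 1, got %d", name, limit)
         }
         queues[name] = limit // copy so later changes to opts cannot leak in
     }
     return Config{repo: opts.Repo, queues: queues}, nil
 }

 // Limit returns the concurrency limit for a queue (0 if unknown).
 func (c Config) Limit(queue string) int { return c.queues[queue] }

 func main() {
     _, err := New(Options{Repo: "MyApp.Repo", Queues: map[string]int{"default": 0}})
     fmt.Println("startup error:", err)

     opts := Options{Repo: "MyApp.Repo", Queues: map[string]int{"default": 10}}
     conf, err := New(opts)
     if err != nil {
         panic(err)
     }
     opts.Queues["default"] = 0 // mutating the input does not affect conf
     fmt.Println("default limit:", conf.Limit("default"))
 }
 ```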
 ---
 ## 3. Retry and Resilience
 ### CockroachDB: Iterator-Based Retry
 ```go
 opts := retry.Options{
    InitialBackoff: 100 * time.Millisecond,
    MaxBackoff:     2 * time.Second,
    Multiplier:     2,
    MaxRetries:     5,
 }
 for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    // attempt is a placeholder for the operation being retried
    if err := attempt(ctx); err == nil {
        break // success: stop retrying
    }
 }
 ```
 Retry is a **for-loop iterator**. `r.Next()` handles
 backoff timing and returns false when exhausted. This
 means retry logic reads like normal code — no callbacks,
 no framework.
 **Force:** CockroachDB has hundreds of retry sites. A
 callback-based retry would create deeply nested code.
 The iterator pattern keeps retry at the same indentation
 level as the operation.
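 To show why the loop reads so flatly, here is a
 standard-library sketch of such an iterator. It assumes
 `MaxRetries` counts retries after the first attempt and
 omits jitter; `Retry`, `Start`, and `flaky` are hypothetical
 names, and this is not CockroachDB's implementation.

 ```go
 package main

 import (
     "context"
     "errors"
     "fmt"
     "time"
 )

 // Options mirrors the field names used above.
 type Options struct {
     InitialBackoff time.Duration
     MaxBackoff     time.Duration
     Multiplier     float64
     MaxRetries     int
 }

 // Retry is a hypothetical iterator that owns all backoff state.
 type Retry struct {
     ctx     context.Context
     opts    Options
     attempt int
     backoff time.Duration
 }

 func Start(ctx context.Context, opts Options) *Retry {
     return &Retry{ctx: ctx, opts: opts, backoff: opts.InitialBackoff}
 }

 // Next returns true at once for the first attempt, sleeps with
 // exponential backoff before each retry, and returns false when
 // retries are exhausted or the context is done.
 func (r *Retry) Next() bool {
     if r.attempt == 0 {
         r.attempt++
         return r.ctx.Err() == nil
     }
     if r.attempt > r.opts.MaxRetries {
         return false
     }
     select {
     case <-time.After(r.backoff):
     case <-r.ctx.Done():
         return false
     }
     r.attempt++
     r.backoff = time.Duration(float64(r.backoff) * r.opts.Multiplier)
     if r.backoff > r.opts.MaxBackoff {
         r.backoff = r.opts.MaxBackoff
     }
     return true
 }

 // flaky is a placeholder operation that fails on its first two calls.
 func flaky(call int) error {
     if call < 3 {
         return errors.New("transient failure")
     }
     return nil
 }

 func main() {
     opts := Options{
         InitialBackoff: time.Millisecond, MaxBackoff: 4 * time.Millisecond,
         Multiplier: 2, MaxRetries: 5,
     }
     attempts := 0
     var err error
     for r := Start(context.Background(), opts); r.Next(); {
         attempts++
         if err = flaky(attempts); err == nil {
             break
         }
     }
     fmt.Println("attempts:", attempts, "err:", err)
 }
 ```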
 ### Oban: Repo Dispatch with Built-In Retry
 ```elixir
 defp dynamic_dispatch(conf, name, args, attempt) do
    with_dynamic_repo(conf, fn repo ->
        apply(repo, name, args)
    end)
 rescue
    error in UndefinedFunctionError ->
        if attempt < @retry_opts[:retry] do
            jittery_sleep(attempt * @retry_opts[:delay])
            dynamic_dispatch(conf, name, args, attempt + 1)
        else
            reraise error, __STACKTRACE__
        end
 end
 ```
 Every Ecto operation dispatched through Oban's repo
 wrapper gets automatic retry for transient failures, with
 a jittered delay that grows linearly with the attempt
 number. The consumer never sees the retry; it's invisible
 infrastructure.
 **Key insight:** Oban retries `UndefinedFunctionError`
 on the repo module itself — absorbing the window during
 hot code reload when the module doesn't exist. This is
 an ecosystem-level concern (BEAM hot code loading) handled
 transparently.
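 The shape of that wrapper (retry only one specific error,
 linear delay, jitter, re-raise when attempts run out) can
 be sketched in Go. `ErrUnavailable`, `dispatch`, and the
 plus-or-minus 10% jitter range are assumptions for
 illustration, not Oban's values.

 ```go
 package main

 import (
     "errors"
     "fmt"
     "math/rand"
     "time"
 )

 // ErrUnavailable is a hypothetical transient error, standing in for
 // the "module briefly missing" case described above.
 var ErrUnavailable = errors.New("temporarily unavailable")

 // jitter scales d by a random factor in [0.9, 1.1) (assumed range).
 func jitter(d time.Duration) time.Duration {
     return time.Duration(float64(d) * (0.9 + 0.2*rand.Float64()))
 }

 // dispatch calls fn and retries only when it returns ErrUnavailable.
 // The delay grows linearly: attempt * delay, with jitter. Any other
 // error, or the final ErrUnavailable, is returned to the caller.
 func dispatch(fn func() error, maxAttempts int, delay time.Duration) error {
     for attempt := 1; ; attempt++ {
         err := fn()
         if err == nil || !errors.Is(err, ErrUnavailable) || attempt >= maxAttempts {
             return err
         }
         time.Sleep(jitter(time.Duration(attempt) * delay))
     }
 }

 func main() {
     calls := 0
     err := dispatch(func() error {
         calls++
         if calls < 3 {
             return ErrUnavailable // first two calls hit the transient window
         }
         return nil
     }, 5, time.Millisecond)
     fmt.Println("calls:", calls, "err:", err)
 }
 ```

 Restricting the retry to one error type is the important
 design choice: a genuine failure surfaces on the first
 call instead of being masked by the wrapper.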
 ---
 ## 4. Resource Lifecycle: The Stopper Pattern
 ### CockroachDB: Stopper as Universal Lifecycle
 ```go
 type Stopper struct { ... }
 // RunTask runs a synchronous task
 func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error
 // RunAsyncTask runs a goroutine tracked by the stopper
 func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error
 // ShouldQuiesce returns a channel closed when shutdown begins
 func (s *Stopper) ShouldQuiesce() <-chan struct{}
 // Stop initiates graceful shutdown
 func (s *Stopper) Stop(ctx context.Context)
 ```
 By convention, goroutines in CockroachDB are launched
 through a Stopper. This gives:
 - **Tracking**: know exactly which goroutines are running
 - **Graceful shutdown**: quiesce signal before hard stop
 - **Leak detection**: `PrintLeakedStoppers` in tests
 - **Throttling**: semaphore limits async tasks
 ```go
 func init() {
    leaktest.PrintLeakedStoppers = PrintLeakedStoppers
 }
 ```
 **Force:** A database cannot afford goroutine leaks —
 they hold locks, connections, and file handles. The
 Stopper is the universal answer: every background task
 is accounted for, every shutdown is graceful, every leak
 is detected in tests.
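 A stripped-down sketch of the pattern, assuming a
 `sync.WaitGroup` for tracking and a closed channel as the
 quiesce signal. `NewStopper`, `NumTasks`, and
 `ErrQuiescing` are hypothetical, and CockroachDB's real
 Stopper does considerably more (closers, semaphores,
 tracing).

 ```go
 package main

 import (
     "context"
     "errors"
     "fmt"
     "sync"
 )

 // ErrQuiescing is returned when a task is submitted after Stop began.
 var ErrQuiescing = errors.New("stopper is quiescing")

 // Stopper tracks every goroutine it launches (simplified sketch).
 type Stopper struct {
     mu       sync.Mutex
     stopping bool
     quiescer chan struct{}
     tasks    sync.WaitGroup
     running  map[string]int // task name -> live goroutine count
 }

 func NewStopper() *Stopper {
     return &Stopper{quiescer: make(chan struct{}), running: map[string]int{}}
 }

 // RunAsyncTask starts f in a tracked goroutine, or refuses if
 // shutdown has already begun.
 func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error {
     s.mu.Lock()
     if s.stopping {
         s.mu.Unlock()
         return ErrQuiescing
     }
     s.tasks.Add(1)
     s.running[taskName]++
     s.mu.Unlock()
     go func() {
         defer func() {
             s.mu.Lock()
             s.running[taskName]--
             s.mu.Unlock()
             s.tasks.Done()
         }()
         f(ctx)
     }()
     return nil
 }

 // ShouldQuiesce returns a channel that is closed when shutdown begins.
 func (s *Stopper) ShouldQuiesce() <-chan struct{} { return s.quiescer }

 // NumTasks reports how many tracked goroutines are still running.
 func (s *Stopper) NumTasks() int {
     s.mu.Lock()
     defer s.mu.Unlock()
     n := 0
     for _, c := range s.running {
         n += c
     }
     return n
 }

 // Stop signals quiescence, then blocks until every task has returned.
 func (s *Stopper) Stop(ctx context.Context) {
     s.mu.Lock()
     if !s.stopping {
         s.stopping = true
         close(s.quiescer)
     }
     s.mu.Unlock()
     s.tasks.Wait()
 }

 func main() {
     s := NewStopper()
     ticks := make(chan int)
     err := s.RunAsyncTask(context.Background(), "ticker", func(ctx context.Context) {
         for i := 0; ; i++ {
             select {
             case ticks <- i:
             case <-s.ShouldQuiesce():
                 return // cooperative shutdown
             }
         }
     })
     if err != nil {
         panic(err)
     }
     fmt.Println("first tick:", <-ticks)
     s.Stop(context.Background())
     fmt.Println("tasks still running:", s.NumTasks())
 }
 ```

 A test helper in this style could assert `NumTasks() == 0`
 after `Stop` to flag leaked goroutines.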
 ### Oban: Registry-Based Lifecycle
 ```elixir
 children = [
    {Notifier, conf: conf, name: Registry.via(name, Notifier)},
    {Nursery, conf: conf, name: Registry.via(name, Nursery)},
    ...
 ]
 ```
 OTP already provides lifecycle management via supervisors.
 Oban's addition is the Registry — namespacing processes
 so multiple instances can coexist. Lifecycle is delegated
 to the platform; naming is the library's concern.
 ---
 ## 5. What These Patterns Teach for Code Review
 ### Questions to Ask About Cross-Cutting Concerns:
 1. **Logging:** Who is the audience for this log? Is there
   a routing mechanism, or does everything go to stdout?
   Does the log help the *operator*, not just the developer?
 2. **Config:** How does config reach this code? Is it
   validated at startup or silently wrong at runtime? Can
   it be changed without restart? Should it be?
 3. **Retry:** Is retry happening at the right layer? Is it
   invisible to the caller? Does it have backoff + jitter?
   Does it respect context cancellation?
 4. **Lifecycle:** Are background tasks tracked? Will they
   shut down gracefully? Can you detect leaks in tests?
 5. **Telemetry:** Are events emitted or is logging the only
   observability? Can consumers attach their own handlers?
 ### Red Flags:
 - `log.Info("something happened")` with no channel/audience
 - Config read from environment at point-of-use (not validated)
 - Retry logic duplicated in 5 places with different backoff
 - Goroutines launched with `go func()` and no tracking
 - No telemetry events — only log lines for observability
 <!-- PATTERN_COMPLETE -->