docs: cross-cutting concerns analysis (cross-referenced)

2026-04-30 10:31:19 -07:00
parent f5007e22e9
commit be7eeb0d63
1 changed files with 301 additions and 0 deletions
@@ -0,0 +1,301 @@
 # Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts
 Cross-cutting concerns are the things that touch everything
 but belong nowhere. How a codebase handles logging,
 telemetry, config, retry, and lifecycle management reveals
 its architectural philosophy more than any feature code.
 ---
 ## 1. Logging: From Strings to Semantic Channels
 ### CockroachDB: Channel-Based Log Routing
 CockroachDB doesn't just log at severity levels — it
 routes logs to **semantic channels**:
 ```go
 const DEV = logpb.Channel_DEV       // development noise
 const OPS = logpb.Channel_OPS       // operator actions
 const HEALTH = logpb.Channel_HEALTH // background health
 const STORAGE = logpb.Channel_STORAGE
 const SESSIONS = logpb.Channel_SESSIONS
 const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
 const USER_ADMIN = logpb.Channel_USER_ADMIN
 ```
 Each channel can be routed to different sinks (file,
 network, etc.) independently. Production deploys typically
 disable DEV entirely and route HEALTH to monitoring.
 **Force:** In a multi-tenant distributed database, "who
 cares about this log?" is a different question than "how
 bad is it?" An INFO-level schema change matters to DBAs
 but not to SREs monitoring node health.
 **Ecosystem insight:** The channel IS the audience. When
 you write `log.Health.Warningf(...)`, you're declaring
 "the person watching cluster health needs to see this."
 Severity is orthogonal to audience.
 ### Prometheus: Self-Instrumentation
 Prometheus instruments itself with its own metrics:
 ```go
 type scrapeMetrics struct {
    targetScrapeSampleLimit        prometheus.Counter
    targetScrapeSampleOutOfOrder   prometheus.Counter
    targetIntervalLengthHistogram  *prometheus.HistogramVec
    // ... 20+ metrics
 }
 ```
 Metrics are collected in a struct, constructed once via
 `newScrapeMetrics(reg)`, and passed to subsystems. No
 global registration — the registerer is injected.
 **Force:** Prometheus IS the metrics system. If it used
 a different metrics library to instrument itself, that
 would be a design smell. Dogfooding proves the API works.
 ### Ecto + Oban: Telemetry as Standard
 Both use Erlang's `:telemetry` library with predictable
 naming:
 ```elixir
 # Oban
 :telemetry.execute([:oban, :job, :start], measurements, meta)
 :telemetry.execute([:oban, :job, :stop], measurements, meta)
 :telemetry.execute([:oban, :job, :exception], measurements, meta)
 # Ecto (adapter-emitted)
 [:my_app, :repo, :query]
 ```
 **Force:** The BEAM ecosystem standardized on `:telemetry`
 for observability. Libraries don't own their monitoring —
 they emit events; consumers attach handlers. This inverts
 the logging relationship: the library doesn't decide what
 to do with the information.
 ---
 ## 2. Config Propagation: Three Models
 ### CockroachDB: Cluster Settings (Distributed Config)
 ```go
 settings.RegisterDurationSetting(
    settings.ApplicationLevel,
    "bulkio.ingest.flush_delay",
    "amount of time to wait before sending a file...",
    0,  // default
 )
 ```
 Settings are:
 - **Typed** (Duration, Bool, Int, String)
 - **Leveled** (ApplicationLevel vs SystemVisible)
 - **Validated** (NonNegativeInt, etc.)
 - **Distributed** (propagated across all nodes)
 - **Version-gated** (new settings require cluster version)
 Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)`
 **Force:** In a distributed database, config isn't a file
 — it's consensus. Every node must agree on every setting,
 and settings can only be enabled once all nodes support
 them. The version gate is the safety mechanism.
 ### Prometheus: ApplyConfig (Hot Reload)
 ```go
 func (m *Manager) ApplyConfig(cfg *config.Config) error {
    m.mtxScrape.Lock()
    defer m.mtxScrape.Unlock()
    // rebuild scrape pools from new config
    // close old loggers, open new ones
 }
 ```
 Config is a struct loaded from YAML. On SIGHUP (or API
 call), the entire config is re-parsed and `ApplyConfig`
 is called on each subsystem. Each subsystem holds a mutex
 and swaps atomically.
 **Force:** Prometheus runs as a single binary. Config
 reload must be atomic per-subsystem but doesn't need
 distributed consensus. The mutex-per-subsystem pattern
 gives independent reload without global coordination.
 ### Ecto + Oban: Config at Init, Validated Once
 ```elixir
 # Oban validates exhaustively at startup
 Validation.validate_schema(opts,
    engine: {:behaviour, Oban.Engine},
    queues: {:custom, &validate_queues/1},
    repo: {:module, [config: 0]},
    ...
 )
 ```
 Config is validated once at startup and stored as an
 immutable struct. No hot reload. If config is wrong,
 you know immediately (fail fast).
 **Force:** Elixir/OTP applications restart processes to
 apply new config. Hot reload is handled by supervisor
 restarts, not config mutation. The "config as immutable
 struct" pattern means no runtime config bugs — it either
 passes validation at startup or the app doesn't start.
 ---
 ## 3. Retry and Resilience
 ### CockroachDB: Iterator-Based Retry
 ```go
 opts := retry.Options{
    InitialBackoff: 100 * time.Millisecond,
    MaxBackoff:     2 * time.Second,
    Multiplier:     2,
    MaxRetries:     5,
 }
 for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    // attempt operation
    if err == nil { break }
 }
 ```
 Retry is a **for-loop iterator**. `r.Next()` handles
 backoff timing and returns false when exhausted. This
 means retry logic reads like normal code — no callbacks,
 no framework.
 **Force:** CockroachDB has hundreds of retry sites. A
 callback-based retry would create deeply nested code.
 The iterator pattern keeps retry at the same indentation
 level as the operation.
 ### Oban: Repo Dispatch with Built-In Retry
 ```elixir
 defp dynamic_dispatch(conf, name, args, attempt) do
    with_dynamic_repo(conf, fn repo ->
        apply(repo, name, args)
    end)
 rescue
    error in UndefinedFunctionError ->
        if attempt < @retry_opts[:retry] do
            jittery_sleep(attempt * @retry_opts[:delay])
            dynamic_dispatch(conf, name, args, attempt + 1)
        else
            reraise error, __STACKTRACE__
        end
 end
 ```
 Every Ecto operation dispatched through Oban's repo
 wrapper gets automatic retry for transient failures.
 The consumer never sees the retry — it's invisible
 infrastructure.
 **Key insight:** Oban retries `UndefinedFunctionError`
 on the repo module itself — absorbing the window during
 hot code reload when the module doesn't exist. This is
 an ecosystem-level concern (BEAM hot code loading) handled
 transparently.
 ---
 ## 4. Resource Lifecycle: The Stopper Pattern
 ### CockroachDB: Stopper as Universal Lifecycle
 ```go
 type Stopper struct { ... }
 // RunTask runs a synchronous task
 func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error
 // RunAsyncTask runs a goroutine tracked by the stopper
 func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error
 // ShouldQuiesce returns a channel closed when shutdown begins
 func (s *Stopper) ShouldQuiesce() <-chan struct{}
 // Stop initiates graceful shutdown
 func (s *Stopper) Stop(ctx context.Context)
 ```
 Every goroutine in CockroachDB is launched through a
 Stopper. This gives:
 - **Tracking**: know exactly which goroutines are running
 - **Graceful shutdown**: quiesce signal before hard stop
 - **Leak detection**: `PrintLeakedStoppers` in tests
 - **Throttling**: semaphore limits async tasks
 ```go
 func init() {
    leaktest.PrintLeakedStoppers = PrintLeakedStoppers
 }
 ```
 **Force:** A database cannot afford goroutine leaks —
 they hold locks, connections, and file handles. The
 Stopper is the universal answer: every background task
 is accounted for, every shutdown is graceful, every leak
 is detected in tests.
 ### Oban: Registry-Based Lifecycle
 ```elixir
 children = [
    {Notifier, conf: conf, name: Registry.via(name, Notifier)},
    {Nursery, conf: conf, name: Registry.via(name, Nursery)},
    ...
 ]
 ```
 OTP already provides lifecycle management via supervisors.
 Oban's addition is the Registry — namespacing processes
 so multiple instances can coexist. Lifecycle is delegated
 to the platform; naming is the library's concern.
 ---
 ## 5. What These Patterns Teach for Code Review
 ### Questions to Ask About Cross-Cutting Concerns:
 1. **Logging:** Who is the audience for this log? Is there
   a routing mechanism, or does everything go to stdout?
   Does the log help the *operator*, not just the developer?
 2. **Config:** How does config reach this code? Is it
   validated at startup or silently wrong at runtime? Can
   it be changed without restart? Should it be?
 3. **Retry:** Is retry happening at the right layer? Is it
   invisible to the caller? Does it have backoff + jitter?
   Does it respect context cancellation?
 4. **Lifecycle:** Are background tasks tracked? Will they
   shut down gracefully? Can you detect leaks in tests?
 5. **Telemetry:** Are events emitted or is logging the only
   observability? Can consumers attach their own handlers?
 ### Red Flags:
 - `log.Info("something happened")` with no channel/audience
 - Config read from environment at point-of-use (not validated)
 - Retry logic duplicated in 5 places with different backoff
 - Goroutines launched with `go func()` and no tracking
 - No telemetry events — only log lines for observability
 <!-- PATTERN_COMPLETE -->