From be7eeb0d637fa198c3f2d18b0ded1ed866eb0697 Mon Sep 17 00:00:00 2001 From: Rodin Date: Thu, 30 Apr 2026 10:31:19 -0700 Subject: [PATCH] docs: cross-cutting concerns analysis (cross-referenced) --- sources/crosscutting-analysis.md | 301 +++++++++++++++++++++++++++++++ 1 file changed, 301 insertions(+) create mode 100644 sources/crosscutting-analysis.md diff --git a/sources/crosscutting-analysis.md b/sources/crosscutting-analysis.md new file mode 100644 index 0000000..7ad78bd --- /dev/null +++ b/sources/crosscutting-analysis.md @@ -0,0 +1,301 @@ +# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts + +Cross-cutting concerns are the things that touch everything +but belong nowhere. How a codebase handles logging, +telemetry, config, retry, and lifecycle management reveals +its architectural philosophy more than any feature code. + +--- + +## 1. Logging: From Strings to Semantic Channels + +### CockroachDB: Channel-Based Log Routing + +CockroachDB doesn't just log at severity levels — it +routes logs to **semantic channels**: + +```go +const DEV = logpb.Channel_DEV // development noise +const OPS = logpb.Channel_OPS // operator actions +const HEALTH = logpb.Channel_HEALTH // background health +const STORAGE = logpb.Channel_STORAGE +const SESSIONS = logpb.Channel_SESSIONS +const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA +const USER_ADMIN = logpb.Channel_USER_ADMIN +``` + +Each channel can be routed to different sinks (file, +network, etc.) independently. Production deploys typically +disable DEV entirely and route HEALTH to monitoring. + +**Force:** In a multi-tenant distributed database, "who +cares about this log?" is a different question than "how +bad is it?" An INFO-level schema change matters to DBAs +but not to SREs monitoring node health. + +**Ecosystem insight:** The channel IS the audience. When +you write `log.Health.Warningf(...)`, you're declaring +"the person watching cluster health needs to see this." +Severity is orthogonal to audience. + +### Prometheus: Self-Instrumentation + +Prometheus instruments itself with its own metrics: + +```go +type scrapeMetrics struct { + targetScrapeSampleLimit prometheus.Counter + targetScrapeSampleOutOfOrder prometheus.Counter + targetIntervalLengthHistogram *prometheus.HistogramVec + // ... 20+ metrics +} +``` + +Metrics are collected in a struct, constructed once via +`newScrapeMetrics(reg)`, and passed to subsystems. No +global registration — the registerer is injected. + +**Force:** Prometheus IS the metrics system. If it used +a different metrics library to instrument itself, that +would be a design smell. Dogfooding proves the API works. + +### Ecto + Oban: Telemetry as Standard + +Both use Erlang's `:telemetry` library with predictable +naming: + +```elixir +# Oban +:telemetry.execute([:oban, :job, :start], measurements, meta) +:telemetry.execute([:oban, :job, :stop], measurements, meta) +:telemetry.execute([:oban, :job, :exception], measurements, meta) + +# Ecto (adapter-emitted) +[:my_app, :repo, :query] +``` + +**Force:** The BEAM ecosystem standardized on `:telemetry` +for observability. Libraries don't own their monitoring — +they emit events; consumers attach handlers. This inverts +the logging relationship: the library doesn't decide what +to do with the information. + +--- + +## 2. Config Propagation: Three Models + +### CockroachDB: Cluster Settings (Distributed Config) + +```go +settings.RegisterDurationSetting( + settings.ApplicationLevel, + "bulkio.ingest.flush_delay", + "amount of time to wait before sending a file...", + 0, // default +) +``` + +Settings are: +- **Typed** (Duration, Bool, Int, String) +- **Leveled** (ApplicationLevel vs SystemVisible) +- **Validated** (NonNegativeInt, etc.) +- **Distributed** (propagated across all nodes) +- **Version-gated** (new settings require cluster version) + +Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)` + +**Force:** In a distributed database, config isn't a file +— it's consensus. Every node must agree on every setting, +and settings can only be enabled once all nodes support +them. The version gate is the safety mechanism. + +### Prometheus: ApplyConfig (Hot Reload) + +```go +func (m *Manager) ApplyConfig(cfg *config.Config) error { + m.mtxScrape.Lock() + defer m.mtxScrape.Unlock() + // rebuild scrape pools from new config + // close old loggers, open new ones +} +``` + +Config is a struct loaded from YAML. On SIGHUP (or API +call), the entire config is re-parsed and `ApplyConfig` +is called on each subsystem. Each subsystem holds a mutex +and swaps atomically. + +**Force:** Prometheus runs as a single binary. Config +reload must be atomic per-subsystem but doesn't need +distributed consensus. The mutex-per-subsystem pattern +gives independent reload without global coordination. + +### Ecto + Oban: Config at Init, Validated Once + +```elixir +# Oban validates exhaustively at startup +Validation.validate_schema(opts, + engine: {:behaviour, Oban.Engine}, + queues: {:custom, &validate_queues/1}, + repo: {:module, [config: 0]}, + ... +) +``` + +Config is validated once at startup and stored as an +immutable struct. No hot reload. If config is wrong, +you know immediately (fail fast). + +**Force:** Elixir/OTP applications restart processes to +apply new config. Hot reload is handled by supervisor +restarts, not config mutation. The "config as immutable +struct" pattern means no runtime config bugs — it either +passes validation at startup or the app doesn't start. + +--- + +## 3. Retry and Resilience + +### CockroachDB: Iterator-Based Retry + +```go +opts := retry.Options{ + InitialBackoff: 100 * time.Millisecond, + MaxBackoff: 2 * time.Second, + Multiplier: 2, + MaxRetries: 5, +} +for r := retry.StartWithCtx(ctx, opts); r.Next(); { + // attempt operation + if err == nil { break } +} +``` + +Retry is a **for-loop iterator**. `r.Next()` handles +backoff timing and returns false when exhausted. This +means retry logic reads like normal code — no callbacks, +no framework. + +**Force:** CockroachDB has hundreds of retry sites. A +callback-based retry would create deeply nested code. +The iterator pattern keeps retry at the same indentation +level as the operation. + +### Oban: Repo Dispatch with Built-In Retry + +```elixir +defp dynamic_dispatch(conf, name, args, attempt) do + with_dynamic_repo(conf, fn repo -> + apply(repo, name, args) + end) +rescue + error in UndefinedFunctionError -> + if attempt < @retry_opts[:retry] do + jittery_sleep(attempt * @retry_opts[:delay]) + dynamic_dispatch(conf, name, args, attempt + 1) + else + reraise error, __STACKTRACE__ + end +end +``` + +Every Ecto operation dispatched through Oban's repo +wrapper gets automatic retry for transient failures. +The consumer never sees the retry — it's invisible +infrastructure. + +**Key insight:** Oban retries `UndefinedFunctionError` +on the repo module itself — absorbing the window during +hot code reload when the module doesn't exist. This is +an ecosystem-level concern (BEAM hot code loading) handled +transparently. + +--- + +## 4. Resource Lifecycle: The Stopper Pattern + +### CockroachDB: Stopper as Universal Lifecycle + +```go +type Stopper struct { ... } + +// RunTask runs a synchronous task +func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error + +// RunAsyncTask runs a goroutine tracked by the stopper +func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error + +// ShouldQuiesce returns a channel closed when shutdown begins +func (s *Stopper) ShouldQuiesce() <-chan struct{} + +// Stop initiates graceful shutdown +func (s *Stopper) Stop(ctx context.Context) +``` + +Every goroutine in CockroachDB is launched through a +Stopper. This gives: +- **Tracking**: know exactly which goroutines are running +- **Graceful shutdown**: quiesce signal before hard stop +- **Leak detection**: `PrintLeakedStoppers` in tests +- **Throttling**: semaphore limits async tasks + +```go +func init() { + leaktest.PrintLeakedStoppers = PrintLeakedStoppers +} +``` + +**Force:** A database cannot afford goroutine leaks — +they hold locks, connections, and file handles. The +Stopper is the universal answer: every background task +is accounted for, every shutdown is graceful, every leak +is detected in tests. + +### Oban: Registry-Based Lifecycle + +```elixir +children = [ + {Notifier, conf: conf, name: Registry.via(name, Notifier)}, + {Nursery, conf: conf, name: Registry.via(name, Nursery)}, + ... +] +``` + +OTP already provides lifecycle management via supervisors. +Oban's addition is the Registry — namespacing processes +so multiple instances can coexist. Lifecycle is delegated +to the platform; naming is the library's concern. + +--- + +## 5. What These Patterns Teach for Code Review + +### Questions to Ask About Cross-Cutting Concerns: + +1. **Logging:** Who is the audience for this log? Is there + a routing mechanism, or does everything go to stdout? + Does the log help the *operator*, not just the developer? + +2. **Config:** How does config reach this code? Is it + validated at startup or silently wrong at runtime? Can + it be changed without restart? Should it be? + +3. **Retry:** Is retry happening at the right layer? Is it + invisible to the caller? Does it have backoff + jitter? + Does it respect context cancellation? + +4. **Lifecycle:** Are background tasks tracked? Will they + shut down gracefully? Can you detect leaks in tests? + +5. **Telemetry:** Are events emitted or is logging the only + observability? Can consumers attach their own handlers? + +### Red Flags: + +- `log.Info("something happened")` with no channel/audience +- Config read from environment at point-of-use (not validated) +- Retry logic duplicated in 5 places with different backoff +- Goroutines launched with `go func()` and no tracking +- No telemetry events — only log lines for observability + +