Files
go-patterns/sources/crosscutting-analysis.md
T
Rodin d6f36b67c8 docs: cross-cutting concerns analysis (logging, config, retry, lifecycle)
How CockroachDB, Prometheus, Ecto, and Oban handle the
things that touch everything but belong nowhere. Includes
red flags and review questions for each concern.
2026-04-30 10:31:19 -07:00

302 lines
9.1 KiB
Markdown

# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts
Cross-cutting concerns are the things that touch everything
but belong nowhere. How a codebase handles logging,
telemetry, config, retry, and lifecycle management reveals
its architectural philosophy more than any feature code.
---
## 1. Logging: From Strings to Semantic Channels
### CockroachDB: Channel-Based Log Routing
CockroachDB doesn't just log at severity levels — it
routes logs to **semantic channels**:
```go
const DEV = logpb.Channel_DEV // development noise
const OPS = logpb.Channel_OPS // operator actions
const HEALTH = logpb.Channel_HEALTH // background health
const STORAGE = logpb.Channel_STORAGE
const SESSIONS = logpb.Channel_SESSIONS
const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
const USER_ADMIN = logpb.Channel_USER_ADMIN
```
Each channel can be routed to different sinks (file,
network, etc.) independently. Production deploys typically
disable DEV entirely and route HEALTH to monitoring.
**Force:** In a multi-tenant distributed database, "who
cares about this log?" is a different question than "how
bad is it?" An INFO-level schema change matters to DBAs
but not to SREs monitoring node health.
**Ecosystem insight:** The channel IS the audience. When
you write `log.Health.Warningf(...)`, you're declaring
"the person watching cluster health needs to see this."
Severity is orthogonal to audience.
### Prometheus: Self-Instrumentation
Prometheus instruments itself with its own metrics:
```go
type scrapeMetrics struct {
targetScrapeSampleLimit prometheus.Counter
targetScrapeSampleOutOfOrder prometheus.Counter
targetIntervalLengthHistogram *prometheus.HistogramVec
// ... 20+ metrics
}
```
Metrics are collected in a struct, constructed once via
`newScrapeMetrics(reg)`, and passed to subsystems. No
global registration — the registerer is injected.
**Force:** Prometheus IS the metrics system. If it used
a different metrics library to instrument itself, that
would be a design smell. Dogfooding proves the API works.
### Ecto + Oban: Telemetry as Standard
Both use Erlang's `:telemetry` library with predictable
naming:
```elixir
# Oban
:telemetry.execute([:oban, :job, :start], measurements, meta)
:telemetry.execute([:oban, :job, :stop], measurements, meta)
:telemetry.execute([:oban, :job, :exception], measurements, meta)
# Ecto (adapter-emitted)
[:my_app, :repo, :query]
```
**Force:** The BEAM ecosystem standardized on `:telemetry`
for observability. Libraries don't own their monitoring —
they emit events; consumers attach handlers. This inverts
the logging relationship: the library doesn't decide what
to do with the information.
---
## 2. Config Propagation: Three Models
### CockroachDB: Cluster Settings (Distributed Config)
```go
settings.RegisterDurationSetting(
settings.ApplicationLevel,
"bulkio.ingest.flush_delay",
"amount of time to wait before sending a file...",
0, // default
)
```
Settings are:
- **Typed** (Duration, Bool, Int, String)
- **Leveled** (ApplicationLevel vs SystemVisible)
- **Validated** (NonNegativeInt, etc.)
- **Distributed** (propagated across all nodes)
- **Version-gated** (new settings require cluster version)
Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)`
**Force:** In a distributed database, config isn't a file
— it's consensus. Every node must agree on every setting,
and settings can only be enabled once all nodes support
them. The version gate is the safety mechanism.
### Prometheus: ApplyConfig (Hot Reload)
```go
func (m *Manager) ApplyConfig(cfg *config.Config) error {
m.mtxScrape.Lock()
defer m.mtxScrape.Unlock()
// rebuild scrape pools from new config
// close old loggers, open new ones
}
```
Config is a struct loaded from YAML. On SIGHUP (or API
call), the entire config is re-parsed and `ApplyConfig`
is called on each subsystem. Each subsystem holds a mutex
and swaps atomically.
**Force:** Prometheus runs as a single binary. Config
reload must be atomic per-subsystem but doesn't need
distributed consensus. The mutex-per-subsystem pattern
gives independent reload without global coordination.
### Ecto + Oban: Config at Init, Validated Once
```elixir
# Oban validates exhaustively at startup
Validation.validate_schema(opts,
engine: {:behaviour, Oban.Engine},
queues: {:custom, &validate_queues/1},
repo: {:module, [config: 0]},
...
)
```
Config is validated once at startup and stored as an
immutable struct. No hot reload. If config is wrong,
you know immediately (fail fast).
**Force:** Elixir/OTP applications restart processes to
apply new config. Hot reload is handled by supervisor
restarts, not config mutation. The "config as immutable
struct" pattern means no runtime config bugs — it either
passes validation at startup or the app doesn't start.
---
## 3. Retry and Resilience
### CockroachDB: Iterator-Based Retry
```go
opts := retry.Options{
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: 2 * time.Second,
Multiplier: 2,
MaxRetries: 5,
}
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
// attempt operation
if err == nil { break }
}
```
Retry is a **for-loop iterator**. `r.Next()` handles
backoff timing and returns false when exhausted. This
means retry logic reads like normal code — no callbacks,
no framework.
**Force:** CockroachDB has hundreds of retry sites. A
callback-based retry would create deeply nested code.
The iterator pattern keeps retry at the same indentation
level as the operation.
### Oban: Repo Dispatch with Built-In Retry
```elixir
defp dynamic_dispatch(conf, name, args, attempt) do
with_dynamic_repo(conf, fn repo ->
apply(repo, name, args)
end)
rescue
error in UndefinedFunctionError ->
if attempt < @retry_opts[:retry] do
jittery_sleep(attempt * @retry_opts[:delay])
dynamic_dispatch(conf, name, args, attempt + 1)
else
reraise error, __STACKTRACE__
end
end
```
Every Ecto operation dispatched through Oban's repo
wrapper gets automatic retry for transient failures.
The consumer never sees the retry — it's invisible
infrastructure.
**Key insight:** Oban retries `UndefinedFunctionError`
on the repo module itself — absorbing the window during
hot code reload when the module doesn't exist. This is
an ecosystem-level concern (BEAM hot code loading) handled
transparently.
---
## 4. Resource Lifecycle: The Stopper Pattern
### CockroachDB: Stopper as Universal Lifecycle
```go
type Stopper struct { ... }
// RunTask runs a synchronous task
func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error
// RunAsyncTask runs a goroutine tracked by the stopper
func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error
// ShouldQuiesce returns a channel closed when shutdown begins
func (s *Stopper) ShouldQuiesce() <-chan struct{}
// Stop initiates graceful shutdown
func (s *Stopper) Stop(ctx context.Context)
```
Every goroutine in CockroachDB is launched through a
Stopper. This gives:
- **Tracking**: know exactly which goroutines are running
- **Graceful shutdown**: quiesce signal before hard stop
- **Leak detection**: `PrintLeakedStoppers` in tests
- **Throttling**: semaphore limits async tasks
```go
func init() {
leaktest.PrintLeakedStoppers = PrintLeakedStoppers
}
```
**Force:** A database cannot afford goroutine leaks —
they hold locks, connections, and file handles. The
Stopper is the universal answer: every background task
is accounted for, every shutdown is graceful, every leak
is detected in tests.
### Oban: Registry-Based Lifecycle
```elixir
children = [
{Notifier, conf: conf, name: Registry.via(name, Notifier)},
{Nursery, conf: conf, name: Registry.via(name, Nursery)},
...
]
```
OTP already provides lifecycle management via supervisors.
Oban's addition is the Registry — namespacing processes
so multiple instances can coexist. Lifecycle is delegated
to the platform; naming is the library's concern.
---
## 5. What These Patterns Teach for Code Review
### Questions to Ask About Cross-Cutting Concerns:
1. **Logging:** Who is the audience for this log? Is there
a routing mechanism, or does everything go to stdout?
Does the log help the *operator*, not just the developer?
2. **Config:** How does config reach this code? Is it
validated at startup or silently wrong at runtime? Can
it be changed without restart? Should it be?
3. **Retry:** Is retry happening at the right layer? Is it
invisible to the caller? Does it have backoff + jitter?
Does it respect context cancellation?
4. **Lifecycle:** Are background tasks tracked? Will they
shut down gracefully? Can you detect leaks in tests?
5. **Telemetry:** Are events emitted or is logging the only
observability? Can consumers attach their own handlers?
### Red Flags:
- `log.Info("something happened")` with no channel/audience
- Config read from environment at point-of-use (not validated)
- Retry logic duplicated in 5 places with different backoff
- Goroutines launched with `go func()` and no tracking
- No telemetry events — only log lines for observability
<!-- PATTERN_COMPLETE -->