docs: cross-cutting concerns analysis (logging, config, retry, lifecycle)
How CockroachDB, Prometheus, Ecto, and Oban handle the things that touch everything but belong nowhere. Includes red flags and review questions for each concern.
@@ -0,0 +1,301 @@
# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts

Cross-cutting concerns are the things that touch everything
but belong nowhere. How a codebase handles logging,
telemetry, config, retry, and lifecycle management reveals
its architectural philosophy more than any feature code.

---

## 1. Logging: From Strings to Semantic Channels

### CockroachDB: Channel-Based Log Routing

CockroachDB doesn't just log at severity levels; it also
routes logs to **semantic channels**:

```go
const DEV = logpb.Channel_DEV       // development noise
const OPS = logpb.Channel_OPS       // operator actions
const HEALTH = logpb.Channel_HEALTH // background health
const STORAGE = logpb.Channel_STORAGE
const SESSIONS = logpb.Channel_SESSIONS
const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
const USER_ADMIN = logpb.Channel_USER_ADMIN
```

Each channel can be routed to different sinks (file,
network, etc.) independently. Production deploys typically
disable DEV entirely and route HEALTH to monitoring.

**Force:** In a multi-tenant distributed database, "who
cares about this log?" is a different question than "how
bad is it?" An INFO-level schema change matters to DBAs
but not to SREs monitoring node health.

**Ecosystem insight:** The channel IS the audience. When
you write `log.Health.Warningf(...)`, you're declaring
"the person watching cluster health needs to see this."
Severity is orthogonal to audience.
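
To make "severity is orthogonal to audience" concrete, here is a minimal stdlib-only sketch of per-channel routing. The `Channel`, `Severity`, `Sink`, and `Router` types are illustrative assumptions, not CockroachDB's `logpb` or `log` APIs; the point is only that the channel picks the sinks while the severity travels along as data.

```go
package main

import (
	"fmt"
	"strings"
)

// Channel names an audience; Severity says how bad the event is.
// Both are illustrative stand-ins, not CockroachDB's logpb types.
type Channel string
type Severity int

const (
	Info Severity = iota
	Warning
	Error
)

const (
	Dev    Channel = "DEV"
	Health Channel = "HEALTH"
)

// Sink receives formatted log lines; here it just buffers them in memory.
type Sink struct{ Lines []string }

// Router maps each channel to zero or more sinks.
type Router struct{ routes map[Channel][]*Sink }

func NewRouter() *Router { return &Router{routes: map[Channel][]*Sink{}} }

// Route attaches a sink to a channel. A channel with no sinks is dropped.
func (r *Router) Route(ch Channel, s *Sink) { r.routes[ch] = append(r.routes[ch], s) }

// Log formats the entry once and fans it out to the channel's sinks.
func (r *Router) Log(ch Channel, sev Severity, format string, args ...any) {
	line := fmt.Sprintf("[%s/%d] ", ch, sev) + fmt.Sprintf(format, args...)
	for _, s := range r.routes[ch] {
		s.Lines = append(s.Lines, line)
	}
}

func main() {
	monitoring := &Sink{}
	r := NewRouter()
	r.Route(Health, monitoring) // DEV is intentionally left unrouted

	r.Log(Dev, Info, "cache warmed in %dms", 12)
	r.Log(Health, Warning, "node %d liveness lagging", 3)
	fmt.Println(strings.Join(monitoring.Lines, "\n"))
}
```

Only the HEALTH entry reaches the monitoring sink; the DEV entry is discarded because nothing is routed for that channel, regardless of its severity.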

### Prometheus: Self-Instrumentation

Prometheus instruments itself with its own metrics:

```go
type scrapeMetrics struct {
	targetScrapeSampleLimit       prometheus.Counter
	targetScrapeSampleOutOfOrder  prometheus.Counter
	targetIntervalLengthHistogram *prometheus.HistogramVec
	// ... 20+ metrics
}
```

Metrics are collected in a struct, constructed once via
`newScrapeMetrics(reg)`, and passed to subsystems. There is
no global registration: the registerer is injected.

**Force:** Prometheus IS the metrics system. If it used
a different metrics library to instrument itself, that
would be a design smell. Dogfooding proves the API works.
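
The "struct of metrics plus injected registerer" shape can be sketched without any dependency. `Counter`, `Registerer`, and `Registry` below are minimal stand-ins invented for this example (the real types live in the Prometheus Go client), and the metric names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Counter and Registerer are minimal stand-ins for the client library's
// types; they exist only to keep this sketch stdlib-only.
type Counter struct {
	Name  string
	value float64
}

func (c *Counter) Inc()           { c.value++ }
func (c *Counter) Value() float64 { return c.value }

type Registerer interface{ Register(c *Counter) error }

// Registry is a per-instance registerer that rejects duplicate names.
type Registry struct{ byName map[string]*Counter }

func NewRegistry() *Registry { return &Registry{byName: map[string]*Counter{}} }

func (r *Registry) Register(c *Counter) error {
	if _, dup := r.byName[c.Name]; dup {
		return errors.New("duplicate metric: " + c.Name)
	}
	r.byName[c.Name] = c
	return nil
}

// scrapeMetrics groups one subsystem's metrics in a single struct.
type scrapeMetrics struct {
	sampleLimit *Counter
	outOfOrder  *Counter
}

// newScrapeMetrics builds the struct once and registers every metric with
// the injected registerer, so tests can pass a fresh Registry each time.
func newScrapeMetrics(reg Registerer) (*scrapeMetrics, error) {
	m := &scrapeMetrics{
		sampleLimit: &Counter{Name: "scrapes_exceeded_sample_limit_total"},
		outOfOrder:  &Counter{Name: "scrapes_sample_out_of_order_total"},
	}
	for _, c := range []*Counter{m.sampleLimit, m.outOfOrder} {
		if err := reg.Register(c); err != nil {
			return nil, err
		}
	}
	return m, nil
}

func main() {
	m, err := newScrapeMetrics(NewRegistry())
	if err != nil {
		panic(err)
	}
	m.outOfOrder.Inc()
	fmt.Println(m.outOfOrder.Name, m.outOfOrder.Value())
}
```

Because nothing is registered globally, two managers (or two tests) can each build their own metrics against separate registries without colliding.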

### Ecto + Oban: Telemetry as Standard

Both use Erlang's `:telemetry` library with predictable
naming:

```elixir
# Oban
:telemetry.execute([:oban, :job, :start], measurements, meta)
:telemetry.execute([:oban, :job, :stop], measurements, meta)
:telemetry.execute([:oban, :job, :exception], measurements, meta)

# Ecto (adapter-emitted)
[:my_app, :repo, :query]
```

**Force:** The BEAM ecosystem standardized on `:telemetry`
for observability. Libraries don't own their monitoring:
they emit events, and consumers attach handlers. This inverts
the logging relationship: the library doesn't decide what
to do with the information.
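
The emit/attach split is not specific to the BEAM. As a rough illustration, written in Go to match the other sketches in this document, the `Bus` type and the `jobs.job.*` event names below are hypothetical and do not mirror the `:telemetry` API.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Handler receives an event name, measurements, and metadata.
type Handler func(event string, measurements map[string]float64, meta map[string]string)

// Bus is a hypothetical stand-in for a telemetry dispatcher.
type Bus struct{ handlers map[string][]Handler }

func NewBus() *Bus { return &Bus{handlers: map[string][]Handler{}} }

func eventKey(parts ...string) string { return strings.Join(parts, ".") }

// Attach subscribes a handler to one event name.
func (b *Bus) Attach(event []string, h Handler) {
	k := eventKey(event...)
	b.handlers[k] = append(b.handlers[k], h)
}

// Execute emits an event; with no handlers attached it does nothing.
func (b *Bus) Execute(event []string, measurements map[string]float64, meta map[string]string) {
	k := eventKey(event...)
	for _, h := range b.handlers[k] {
		h(k, measurements, meta)
	}
}

// runJob is the "library" side: it emits events and never logs.
func runJob(b *Bus, worker string, job func()) {
	meta := map[string]string{"worker": worker}
	b.Execute([]string{"jobs", "job", "start"}, nil, meta)
	start := time.Now()
	job()
	elapsed := time.Since(start).Seconds()
	b.Execute([]string{"jobs", "job", "stop"}, map[string]float64{"duration_s": elapsed}, meta)
}

func main() {
	bus := NewBus()
	// The "consumer" side decides what to do with the information.
	bus.Attach([]string{"jobs", "job", "stop"}, func(e string, m map[string]float64, meta map[string]string) {
		fmt.Printf("%s worker=%s took %.6fs\n", e, meta["worker"], m["duration_s"])
	})
	runJob(bus, "Mailer", func() {})
}
```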

---

## 2. Config Propagation: Three Models

### CockroachDB: Cluster Settings (Distributed Config)

```go
settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"bulkio.ingest.flush_delay",
	"amount of time to wait before sending a file...",
	0, // default
)
```

Settings are:
- **Typed** (Duration, Bool, Int, String)
- **Leveled** (ApplicationLevel vs SystemVisible)
- **Validated** (NonNegativeInt, etc.)
- **Distributed** (propagated across all nodes)
- **Version-gated** (new settings require cluster version)

Version gates are checked at the call site, for example:
`settings.Version.IsActive(ctx, clusterversion.V26_2)`

**Force:** In a distributed database, config isn't a file;
it's consensus. Every node must agree on every setting,
and settings can only be enabled once all nodes support
them. The version gate is the safety mechanism.

### Prometheus: ApplyConfig (Hot Reload)

```go
func (m *Manager) ApplyConfig(cfg *config.Config) error {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	// rebuild scrape pools from new config
	// close old loggers, open new ones
	return nil // body and error handling elided in this excerpt
}
```

Config is a struct loaded from YAML. On SIGHUP (or API
call), the entire config is re-parsed and `ApplyConfig`
is called on each subsystem. Each subsystem holds a mutex
and swaps atomically.

**Force:** Prometheus runs as a single binary. Config
reload must be atomic per-subsystem but doesn't need
distributed consensus. The mutex-per-subsystem pattern
gives independent reload without global coordination.
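
A minimal sketch of the mutex-per-subsystem reload, assuming a simplified `Config` and a single hypothetical `ScrapeManager` subsystem (neither matches Prometheus's real types): each subsystem validates the new config, builds its next state, and swaps it in under its own lock.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a hypothetical parsed config; a real one is far richer.
type Config struct {
	ScrapeJobs []string
}

// Reloader is what each subsystem implements.
type Reloader interface{ ApplyConfig(cfg *Config) error }

// ScrapeManager guards its own state with its own mutex.
type ScrapeManager struct {
	mtx  sync.Mutex
	jobs []string
}

func (m *ScrapeManager) ApplyConfig(cfg *Config) error {
	for _, j := range cfg.ScrapeJobs {
		if j == "" {
			return errors.New("scrape: empty job name")
		}
	}
	// Build the new state first, then swap it in under the lock.
	next := append([]string(nil), cfg.ScrapeJobs...)
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.jobs = next
	return nil
}

// Jobs returns a copy of the currently applied job list.
func (m *ScrapeManager) Jobs() []string {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	return append([]string(nil), m.jobs...)
}

// reload applies one parsed config to every subsystem and collects failures.
func reload(cfg *Config, subsystems ...Reloader) error {
	var errs []error
	for _, s := range subsystems {
		if err := s.ApplyConfig(cfg); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func main() {
	scrape := &ScrapeManager{}
	err := reload(&Config{ScrapeJobs: []string{"node", "api"}}, scrape)
	fmt.Println(err, scrape.Jobs())
}
```

A subsystem that rejects the new config keeps its previous state, so one bad section does not leave that subsystem half-applied.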

### Ecto + Oban: Config at Init, Validated Once

```elixir
# Oban validates exhaustively at startup
Validation.validate_schema(opts,
  engine: {:behaviour, Oban.Engine},
  queues: {:custom, &validate_queues/1},
  repo: {:module, [config: 0]},
  ...
)
```

Config is validated once at startup and stored as an
immutable struct. No hot reload. If config is wrong,
you know immediately (fail fast).

**Force:** Elixir/OTP applications restart processes to
apply new config. Hot reload is handled by supervisor
restarts, not config mutation. The "config as immutable
struct" pattern rules out a whole class of runtime config
bugs: config either passes validation at startup or the
app doesn't start.

---

## 3. Retry and Resilience

### CockroachDB: Iterator-Based Retry

```go
opts := retry.Options{
	InitialBackoff: 100 * time.Millisecond,
	MaxBackoff:     2 * time.Second,
	Multiplier:     2,
	MaxRetries:     5,
}
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
	// attempt operation; doOperation is a placeholder for the real call
	if err := doOperation(ctx); err == nil {
		break
	}
}
```

Retry is a **for-loop iterator**. `r.Next()` handles
backoff timing and returns false when exhausted. This
means retry logic reads like normal code: no callbacks,
no framework.

**Force:** CockroachDB has hundreds of retry sites. A
callback-based retry would create deeply nested code.
The iterator pattern keeps retry at the same indentation
level as the operation.

### Oban: Repo Dispatch with Built-In Retry

```elixir
defp dynamic_dispatch(conf, name, args, attempt) do
  with_dynamic_repo(conf, fn repo ->
    apply(repo, name, args)
  end)
rescue
  error in UndefinedFunctionError ->
    if attempt < @retry_opts[:retry] do
      jittery_sleep(attempt * @retry_opts[:delay])
      dynamic_dispatch(conf, name, args, attempt + 1)
    else
      reraise error, __STACKTRACE__
    end
end
```

Every Ecto operation dispatched through Oban's repo
wrapper gets automatic retry for transient failures.
The delay grows linearly with the attempt number and is
jittered. The consumer never sees the retry; it's
invisible infrastructure.

**Key insight:** Oban retries `UndefinedFunctionError`
on the repo module itself, absorbing the window during
hot code reload when the module doesn't exist. This is
an ecosystem-level concern (BEAM hot code loading) handled
transparently.

---

## 4. Resource Lifecycle: The Stopper Pattern

### CockroachDB: Stopper as Universal Lifecycle

```go
type Stopper struct { ... }

// RunTask runs a synchronous task
func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error

// RunAsyncTask runs a goroutine tracked by the stopper
func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error

// ShouldQuiesce returns a channel closed when shutdown begins
func (s *Stopper) ShouldQuiesce() <-chan struct{}

// Stop initiates graceful shutdown
func (s *Stopper) Stop(ctx context.Context)
```

Background goroutines in CockroachDB are expected to be
launched through a Stopper. This gives:
- **Tracking**: know exactly which goroutines are running
- **Graceful shutdown**: quiesce signal before hard stop
- **Leak detection**: `PrintLeakedStoppers` in tests
- **Throttling**: semaphore limits async tasks

```go
func init() {
	leaktest.PrintLeakedStoppers = PrintLeakedStoppers
}
```

**Force:** A database cannot afford goroutine leaks,
because leaked goroutines hold locks, connections, and
file handles. The Stopper is the universal answer: every
background task is accounted for, every shutdown is
graceful, every leak is detected in tests.

### Oban: Registry-Based Lifecycle

```elixir
children = [
  {Notifier, conf: conf, name: Registry.via(name, Notifier)},
  {Nursery, conf: conf, name: Registry.via(name, Nursery)},
  ...
]
```

OTP already provides lifecycle management via supervisors.
Oban's addition is the Registry, which namespaces processes
so multiple instances can coexist. Lifecycle is delegated
to the platform; naming is the library's concern.

---

## 5. What These Patterns Teach for Code Review

### Questions to Ask About Cross-Cutting Concerns:

1. **Logging:** Who is the audience for this log? Is there
   a routing mechanism, or does everything go to stdout?
   Does the log help the *operator*, not just the developer?

2. **Config:** How does config reach this code? Is it
   validated at startup or silently wrong at runtime? Can
   it be changed without restart? Should it be?

3. **Retry:** Is retry happening at the right layer? Is it
   invisible to the caller? Does it have backoff + jitter?
   Does it respect context cancellation?

4. **Lifecycle:** Are background tasks tracked? Will they
   shut down gracefully? Can you detect leaks in tests?

5. **Telemetry:** Are events emitted or is logging the only
   observability? Can consumers attach their own handlers?

### Red Flags:

- `log.Info("something happened")` with no channel/audience
- Config read from environment at point-of-use (not validated)
- Retry logic duplicated in 5 places with different backoff
- Goroutines launched with `go func()` and no tracking
- No telemetry events, only log lines for observability

<!-- PATTERN_COMPLETE -->