docs: cross-cutting concerns analysis (cross-referenced)
This commit is contained in:
@@ -0,0 +1,301 @@
|
|||||||
|
# Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts
|
||||||
|
|
||||||
|
Cross-cutting concerns are the things that touch everything
|
||||||
|
but belong nowhere. How a codebase handles logging,
|
||||||
|
telemetry, config, retry, and lifecycle management reveals
|
||||||
|
its architectural philosophy more than any feature code.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Logging: From Strings to Semantic Channels
|
||||||
|
|
||||||
|
### CockroachDB: Channel-Based Log Routing
|
||||||
|
|
||||||
|
CockroachDB doesn't just log at severity levels — it
|
||||||
|
routes logs to **semantic channels**:
|
||||||
|
|
||||||
|
```go
|
||||||
|
const DEV = logpb.Channel_DEV // development noise
|
||||||
|
const OPS = logpb.Channel_OPS // operator actions
|
||||||
|
const HEALTH = logpb.Channel_HEALTH // background health
|
||||||
|
const STORAGE = logpb.Channel_STORAGE
|
||||||
|
const SESSIONS = logpb.Channel_SESSIONS
|
||||||
|
const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
|
||||||
|
const USER_ADMIN = logpb.Channel_USER_ADMIN
|
||||||
|
```
|
||||||
|
|
||||||
|
Each channel can be routed to different sinks (file,
|
||||||
|
network, etc.) independently. Production deploys typically
|
||||||
|
disable DEV entirely and route HEALTH to monitoring.
|
||||||
|
|
||||||
|
**Force:** In a multi-tenant distributed database, "who
|
||||||
|
cares about this log?" is a different question than "how
|
||||||
|
bad is it?" An INFO-level schema change matters to DBAs
|
||||||
|
but not to SREs monitoring node health.
|
||||||
|
|
||||||
|
**Ecosystem insight:** The channel IS the audience. When
|
||||||
|
you write `log.Health.Warningf(...)`, you're declaring
|
||||||
|
"the person watching cluster health needs to see this."
|
||||||
|
Severity is orthogonal to audience.
|
||||||
|
|
||||||
|
### Prometheus: Self-Instrumentation
|
||||||
|
|
||||||
|
Prometheus instruments itself with its own metrics:
|
||||||
|
|
||||||
|
```go
|
||||||
|
type scrapeMetrics struct {
|
||||||
|
targetScrapeSampleLimit prometheus.Counter
|
||||||
|
targetScrapeSampleOutOfOrder prometheus.Counter
|
||||||
|
targetIntervalLengthHistogram *prometheus.HistogramVec
|
||||||
|
// ... 20+ metrics
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Metrics are collected in a struct, constructed once via
|
||||||
|
`newScrapeMetrics(reg)`, and passed to subsystems. No
|
||||||
|
global registration — the registerer is injected.
|
||||||
|
|
||||||
|
**Force:** Prometheus IS the metrics system. If it used
|
||||||
|
a different metrics library to instrument itself, that
|
||||||
|
would be a design smell. Dogfooding proves the API works.
|
||||||
|
|
||||||
|
### Ecto + Oban: Telemetry as Standard
|
||||||
|
|
||||||
|
Both use Erlang's `:telemetry` library with predictable
|
||||||
|
naming:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
# Oban
|
||||||
|
:telemetry.execute([:oban, :job, :start], measurements, meta)
|
||||||
|
:telemetry.execute([:oban, :job, :stop], measurements, meta)
|
||||||
|
:telemetry.execute([:oban, :job, :exception], measurements, meta)
|
||||||
|
|
||||||
|
# Ecto (adapter-emitted)
|
||||||
|
[:my_app, :repo, :query]
|
||||||
|
```
|
||||||
|
|
||||||
|
**Force:** The BEAM ecosystem standardized on `:telemetry`
|
||||||
|
for observability. Libraries don't own their monitoring —
|
||||||
|
they emit events; consumers attach handlers. This inverts
|
||||||
|
the logging relationship: the library doesn't decide what
|
||||||
|
to do with the information.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Config Propagation: Three Models
|
||||||
|
|
||||||
|
### CockroachDB: Cluster Settings (Distributed Config)
|
||||||
|
|
||||||
|
```go
|
||||||
|
settings.RegisterDurationSetting(
|
||||||
|
settings.ApplicationLevel,
|
||||||
|
"bulkio.ingest.flush_delay",
|
||||||
|
"amount of time to wait before sending a file...",
|
||||||
|
0, // default
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Settings are:
|
||||||
|
- **Typed** (Duration, Bool, Int, String)
|
||||||
|
- **Leveled** (ApplicationLevel vs SystemVisible)
|
||||||
|
- **Validated** (NonNegativeInt, etc.)
|
||||||
|
- **Distributed** (propagated across all nodes)
|
||||||
|
- **Version-gated** (new settings require cluster version)
|
||||||
|
|
||||||
|
Usage: `settings.Version.IsActive(ctx, clusterversion.V26_2)`
|
||||||
|
|
||||||
|
**Force:** In a distributed database, config isn't a file
|
||||||
|
— it's consensus. Every node must agree on every setting,
|
||||||
|
and settings can only be enabled once all nodes support
|
||||||
|
them. The version gate is the safety mechanism.
|
||||||
|
|
||||||
|
### Prometheus: ApplyConfig (Hot Reload)
|
||||||
|
|
||||||
|
```go
|
||||||
|
func (m *Manager) ApplyConfig(cfg *config.Config) error {
|
||||||
|
m.mtxScrape.Lock()
|
||||||
|
defer m.mtxScrape.Unlock()
|
||||||
|
// rebuild scrape pools from new config
|
||||||
|
// close old loggers, open new ones
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Config is a struct loaded from YAML. On SIGHUP (or API
|
||||||
|
call), the entire config is re-parsed and `ApplyConfig`
|
||||||
|
is called on each subsystem. Each subsystem holds a mutex
|
||||||
|
and swaps atomically.
|
||||||
|
|
||||||
|
**Force:** Prometheus runs as a single binary. Config
|
||||||
|
reload must be atomic per-subsystem but doesn't need
|
||||||
|
distributed consensus. The mutex-per-subsystem pattern
|
||||||
|
gives independent reload without global coordination.
|
||||||
|
|
||||||
|
### Ecto + Oban: Config at Init, Validated Once
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
# Oban validates exhaustively at startup
|
||||||
|
Validation.validate_schema(opts,
|
||||||
|
engine: {:behaviour, Oban.Engine},
|
||||||
|
queues: {:custom, &validate_queues/1},
|
||||||
|
repo: {:module, [config: 0]},
|
||||||
|
...
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Config is validated once at startup and stored as an
|
||||||
|
immutable struct. No hot reload. If config is wrong,
|
||||||
|
you know immediately (fail fast).
|
||||||
|
|
||||||
|
**Force:** Elixir/OTP applications restart processes to
|
||||||
|
apply new config. Hot reload is handled by supervisor
|
||||||
|
restarts, not config mutation. The "config as immutable
|
||||||
|
struct" pattern means no runtime config bugs — it either
|
||||||
|
passes validation at startup or the app doesn't start.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Retry and Resilience
|
||||||
|
|
||||||
|
### CockroachDB: Iterator-Based Retry
|
||||||
|
|
||||||
|
```go
|
||||||
|
opts := retry.Options{
|
||||||
|
InitialBackoff: 100 * time.Millisecond,
|
||||||
|
MaxBackoff: 2 * time.Second,
|
||||||
|
Multiplier: 2,
|
||||||
|
MaxRetries: 5,
|
||||||
|
}
|
||||||
|
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
|
||||||
|
// attempt operation
|
||||||
|
if err == nil { break }
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Retry is a **for-loop iterator**. `r.Next()` handles
|
||||||
|
backoff timing and returns false when exhausted. This
|
||||||
|
means retry logic reads like normal code — no callbacks,
|
||||||
|
no framework.
|
||||||
|
|
||||||
|
**Force:** CockroachDB has hundreds of retry sites. A
|
||||||
|
callback-based retry would create deeply nested code.
|
||||||
|
The iterator pattern keeps retry at the same indentation
|
||||||
|
level as the operation.
|
||||||
|
|
||||||
|
### Oban: Repo Dispatch with Built-In Retry
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defp dynamic_dispatch(conf, name, args, attempt) do
|
||||||
|
with_dynamic_repo(conf, fn repo ->
|
||||||
|
apply(repo, name, args)
|
||||||
|
end)
|
||||||
|
rescue
|
||||||
|
error in UndefinedFunctionError ->
|
||||||
|
if attempt < @retry_opts[:retry] do
|
||||||
|
jittery_sleep(attempt * @retry_opts[:delay])
|
||||||
|
dynamic_dispatch(conf, name, args, attempt + 1)
|
||||||
|
else
|
||||||
|
reraise error, __STACKTRACE__
|
||||||
|
end
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Every Ecto operation dispatched through Oban's repo
|
||||||
|
wrapper gets automatic retry for transient failures.
|
||||||
|
The consumer never sees the retry — it's invisible
|
||||||
|
infrastructure.
|
||||||
|
|
||||||
|
**Key insight:** Oban retries `UndefinedFunctionError`
|
||||||
|
on the repo module itself — absorbing the window during
|
||||||
|
hot code reload when the module doesn't exist. This is
|
||||||
|
an ecosystem-level concern (BEAM hot code loading) handled
|
||||||
|
transparently.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Resource Lifecycle: The Stopper Pattern
|
||||||
|
|
||||||
|
### CockroachDB: Stopper as Universal Lifecycle
|
||||||
|
|
||||||
|
```go
|
||||||
|
type Stopper struct { ... }
|
||||||
|
|
||||||
|
// RunTask runs a synchronous task
|
||||||
|
func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error
|
||||||
|
|
||||||
|
// RunAsyncTask runs a goroutine tracked by the stopper
|
||||||
|
func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error
|
||||||
|
|
||||||
|
// ShouldQuiesce returns a channel closed when shutdown begins
|
||||||
|
func (s *Stopper) ShouldQuiesce() <-chan struct{}
|
||||||
|
|
||||||
|
// Stop initiates graceful shutdown
|
||||||
|
func (s *Stopper) Stop(ctx context.Context)
|
||||||
|
```
|
||||||
|
|
||||||
|
Every goroutine in CockroachDB is launched through a
|
||||||
|
Stopper. This gives:
|
||||||
|
- **Tracking**: know exactly which goroutines are running
|
||||||
|
- **Graceful shutdown**: quiesce signal before hard stop
|
||||||
|
- **Leak detection**: `PrintLeakedStoppers` in tests
|
||||||
|
- **Throttling**: semaphore limits async tasks
|
||||||
|
|
||||||
|
```go
|
||||||
|
func init() {
|
||||||
|
leaktest.PrintLeakedStoppers = PrintLeakedStoppers
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Force:** A database cannot afford goroutine leaks —
|
||||||
|
they hold locks, connections, and file handles. The
|
||||||
|
Stopper is the universal answer: every background task
|
||||||
|
is accounted for, every shutdown is graceful, every leak
|
||||||
|
is detected in tests.
|
||||||
|
|
||||||
|
### Oban: Registry-Based Lifecycle
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
children = [
|
||||||
|
{Notifier, conf: conf, name: Registry.via(name, Notifier)},
|
||||||
|
{Nursery, conf: conf, name: Registry.via(name, Nursery)},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
OTP already provides lifecycle management via supervisors.
|
||||||
|
Oban's addition is the Registry — namespacing processes
|
||||||
|
so multiple instances can coexist. Lifecycle is delegated
|
||||||
|
to the platform; naming is the library's concern.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. What These Patterns Teach for Code Review
|
||||||
|
|
||||||
|
### Questions to Ask About Cross-Cutting Concerns:
|
||||||
|
|
||||||
|
1. **Logging:** Who is the audience for this log? Is there
|
||||||
|
a routing mechanism, or does everything go to stdout?
|
||||||
|
Does the log help the *operator*, not just the developer?
|
||||||
|
|
||||||
|
2. **Config:** How does config reach this code? Is it
|
||||||
|
validated at startup or silently wrong at runtime? Can
|
||||||
|
it be changed without restart? Should it be?
|
||||||
|
|
||||||
|
3. **Retry:** Is retry happening at the right layer? Is it
|
||||||
|
invisible to the caller? Does it have backoff + jitter?
|
||||||
|
Does it respect context cancellation?
|
||||||
|
|
||||||
|
4. **Lifecycle:** Are background tasks tracked? Will they
|
||||||
|
shut down gracefully? Can you detect leaks in tests?
|
||||||
|
|
||||||
|
5. **Telemetry:** Are events emitted or is logging the only
|
||||||
|
observability? Can consumers attach their own handlers?
|
||||||
|
|
||||||
|
### Red Flags:
|
||||||
|
|
||||||
|
- `log.Info("something happened")` with no channel/audience
|
||||||
|
- Config read from environment at point-of-use (not validated)
|
||||||
|
- Retry logic duplicated in 5 places with different backoff
|
||||||
|
- Goroutines launched with `go func()` and no tracking
|
||||||
|
- No telemetry events — only log lines for observability
|
||||||
|
|
||||||
|
<!-- PATTERN_COMPLETE -->
|
||||||
Reference in New Issue
Block a user