Rodin 76f4bcc33e docs: architectural analysis of top repos (CockroachDB, Prometheus, Ecto, Oban)

Four documents examining codebases at module and ecosystem levels:
- architectural-analysis.md — internal structure, dependency flow
- ecosystem-analysis.md — consumer extension points, deliberate absences
- crosscutting-analysis.md — logging, config, retry, lifecycle
- testing-evolution-analysis.md — proof models, API evolution strategies

2026-04-30 10:50:54 -07:00


Cross-Cutting Concerns: How Mature Codebases Handle the Hard Parts

Cross-cutting concerns are the things that touch everything but belong nowhere. How a codebase handles logging, telemetry, config, retry, and lifecycle management reveals its architectural philosophy more than any feature code.

1. Logging: From Strings to Semantic Channels

CockroachDB: Channel-Based Log Routing

CockroachDB doesn't just log at severity levels — it routes logs to semantic channels:

const DEV = logpb.Channel_DEV       // development noise
const OPS = logpb.Channel_OPS       // operator actions
const HEALTH = logpb.Channel_HEALTH // background health
const STORAGE = logpb.Channel_STORAGE
const SESSIONS = logpb.Channel_SESSIONS
const SQL_SCHEMA = logpb.Channel_SQL_SCHEMA
const USER_ADMIN = logpb.Channel_USER_ADMIN

Each channel can be routed to different sinks (file, network, etc.) independently. Production deploys typically disable DEV entirely and route HEALTH to monitoring.

Force: In a multi-tenant distributed database, "who cares about this log?" is a different question than "how bad is it?" An INFO-level schema change matters to DBAs but not to SREs monitoring node health.

Ecosystem insight: The channel IS the audience. When you write log.Health.Warningf(...), you're declaring "the person watching cluster health needs to see this." Severity is orthogonal to audience.
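To make the "channel is the audience" idea concrete, here is a minimal Go sketch of routing by channel independently of severity. The `Router`, `Sink`, and `MemorySink` types are hypothetical names for illustration and are not CockroachDB's log package; the point is only that a channel with no sink (DEV here) is dropped no matter how severe the entry is.

```go
package main

import (
	"fmt"
	"strings"
)

// Channel identifies the audience of a log entry (hypothetical, not logpb.Channel).
type Channel int

const (
	DEV Channel = iota
	OPS
	HEALTH
)

// Severity is independent of Channel.
type Severity int

const (
	INFO Severity = iota
	WARNING
	ERROR
)

// Sink receives formatted log lines.
type Sink interface{ Write(line string) }

// MemorySink collects lines in memory so routing can be inspected.
type MemorySink struct{ Lines []string }

func (m *MemorySink) Write(line string) { m.Lines = append(m.Lines, line) }

// Router maps each channel to zero or more sinks.
type Router struct{ routes map[Channel][]Sink }

func NewRouter() *Router { return &Router{routes: make(map[Channel][]Sink)} }

// Route attaches a sink to a channel.
func (r *Router) Route(ch Channel, s Sink) { r.routes[ch] = append(r.routes[ch], s) }

// Log delivers the entry to every sink routed for ch; unrouted channels are dropped.
func (r *Router) Log(ch Channel, sev Severity, format string, args ...any) {
	line := fmt.Sprintf("[ch=%d sev=%d] ", ch, sev) + fmt.Sprintf(format, args...)
	for _, s := range r.routes[ch] {
		s.Write(line)
	}
}

func main() {
	monitoring := &MemorySink{}
	r := NewRouter()
	r.Route(HEALTH, monitoring) // DEV deliberately has no sink
	r.Log(HEALTH, WARNING, "node %d liveness slow", 3)
	r.Log(DEV, ERROR, "debug detail")
	fmt.Println(strings.Join(monitoring.Lines, "\n"))
}
```

The ERROR-level DEV entry never reaches the monitoring sink, while the WARNING-level HEALTH entry does.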

Prometheus: Self-Instrumentation

Prometheus instruments itself with its own metrics:

type scrapeMetrics struct {
    targetScrapeSampleLimit        prometheus.Counter
    targetScrapeSampleOutOfOrder   prometheus.Counter
    targetIntervalLengthHistogram  *prometheus.HistogramVec
    // ... 20+ metrics
}

Metrics are collected in a struct, constructed once via newScrapeMetrics(reg), and passed to subsystems. No global registration — the registerer is injected.

Force: Prometheus IS the metrics system. If it used a different metrics library to instrument itself, that would be a design smell. Dogfooding proves the API works.
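The injected-registerer pattern can be sketched without any dependency. The example below is an assumption-laden stand-in: `Counter`, `Registerer`, and `Registry` are simplified hypothetical types, not the client_golang API, and the metric names are made up. It shows why injection matters: each caller can pass a fresh registry, so nothing collides through global state.

```go
package main

import (
	"errors"
	"fmt"
)

// Counter is a stand-in for a metrics counter (hypothetical type).
type Counter struct {
	name  string
	value float64
}

func (c *Counter) Inc() { c.value++ }

// Registerer is the injected dependency: anything that can accept a metric.
type Registerer interface {
	Register(c *Counter) error
}

// Registry is a simple Registerer that rejects duplicate names.
type Registry struct{ byName map[string]*Counter }

func NewRegistry() *Registry { return &Registry{byName: make(map[string]*Counter)} }

func (r *Registry) Register(c *Counter) error {
	if _, dup := r.byName[c.name]; dup {
		return errors.New("duplicate metric: " + c.name)
	}
	r.byName[c.name] = c
	return nil
}

// scrapeMetrics groups the counters one subsystem needs.
type scrapeMetrics struct {
	sampleLimit *Counter
	outOfOrder  *Counter
}

// newScrapeMetrics builds the struct once and registers every metric with reg.
func newScrapeMetrics(reg Registerer) (*scrapeMetrics, error) {
	m := &scrapeMetrics{
		sampleLimit: &Counter{name: "example_sample_limit_total"},
		outOfOrder:  &Counter{name: "example_out_of_order_total"},
	}
	for _, c := range []*Counter{m.sampleLimit, m.outOfOrder} {
		if err := reg.Register(c); err != nil {
			return nil, err
		}
	}
	return m, nil
}

func main() {
	// Each registry is independent, so tests can build their own.
	m, err := newScrapeMetrics(NewRegistry())
	if err != nil {
		panic(err)
	}
	m.outOfOrder.Inc()
	fmt.Println(m.outOfOrder.value)
}
```

Registering the same metric set twice against one registry fails, while a second registry accepts it, which is the property a global registration scheme cannot offer.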

Ecto + Oban: Telemetry as Standard

Both use Erlang's :telemetry library with predictable naming:

# Oban
:telemetry.execute([:oban, :job, :start], measurements, meta)
:telemetry.execute([:oban, :job, :stop], measurements, meta)
:telemetry.execute([:oban, :job, :exception], measurements, meta)

# Ecto (adapter-emitted)
[:my_app, :repo, :query]

Force: The BEAM ecosystem standardized on :telemetry for observability. Libraries don't own their monitoring — they emit events; consumers attach handlers. This inverts the logging relationship: the library doesn't decide what to do with the information.
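The emit/attach split is not specific to the BEAM. As a rough Go analogue (not the :telemetry API; `Bus`, `Attach`, and `Execute` are hypothetical names chosen to echo it), the library side only calls `Execute`, and whatever happens next is decided by handlers the consumer attached.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler receives an event name plus measurements and metadata.
type Handler func(event []string, measurements map[string]int64, meta map[string]string)

// Bus is a hypothetical stand-in for a telemetry dispatcher.
type Bus struct{ handlers map[string][]Handler }

func NewBus() *Bus { return &Bus{handlers: make(map[string][]Handler)} }

// key flattens an event path such as [oban job stop] into "oban.job.stop".
func key(event []string) string { return strings.Join(event, ".") }

// Attach is called by the consumer to subscribe to one event name.
func (b *Bus) Attach(event []string, h Handler) {
	k := key(event)
	b.handlers[k] = append(b.handlers[k], h)
}

// Execute is called by the library; with no handlers attached it does nothing.
func (b *Bus) Execute(event []string, measurements map[string]int64, meta map[string]string) {
	for _, h := range b.handlers[key(event)] {
		h(event, measurements, meta)
	}
}

func main() {
	bus := NewBus()
	// Consumer side: decide what to do with job completions.
	bus.Attach([]string{"oban", "job", "stop"}, func(ev []string, ms map[string]int64, meta map[string]string) {
		fmt.Printf("%s worker=%s duration=%dms\n", key(ev), meta["worker"], ms["duration_ms"])
	})
	// Library side: emit and move on.
	bus.Execute([]string{"oban", "job", "start"}, nil, map[string]string{"worker": "Mailer"})
	bus.Execute([]string{"oban", "job", "stop"}, map[string]int64{"duration_ms": 42}, map[string]string{"worker": "Mailer"})
}
```

The start event has no handler and is silently ignored; only the consumer's choice of attachments determines what is observed.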

2. Config Propagation: Three Models

CockroachDB: Cluster Settings (Distributed Config)

settings.RegisterDurationSetting(
    settings.ApplicationLevel,
    "bulkio.ingest.flush_delay",
    "amount of time to wait before sending a file...",
    0,  // default
)

Settings are:

Typed (Duration, Bool, Int, String)
Leveled (ApplicationLevel vs SystemVisible)
Validated (NonNegativeInt, etc.)
Distributed (propagated across all nodes)
Version-gated (new settings require cluster version)

Version gates are checked at the call site: settings.Version.IsActive(ctx, clusterversion.V26_2)

Force: In a distributed database, config isn't a file — it's consensus. Every node must agree on every setting, and settings can only be enabled once all nodes support them. The version gate is the safety mechanism.

Prometheus: ApplyConfig (Hot Reload)

func (m *Manager) ApplyConfig(cfg *config.Config) error {
    m.mtxScrape.Lock()
    defer m.mtxScrape.Unlock()
    // rebuild scrape pools from new config
    // close old loggers, open new ones
    return nil // error handling elided in this excerpt
}

Config is a struct loaded from YAML. On SIGHUP (or a reload API call), the entire config is re-parsed and ApplyConfig is called on each subsystem. Each subsystem takes its own mutex and swaps in the new state while holding it, so readers never observe a half-applied config.

Force: Prometheus runs as a single binary. Config reload must be atomic per-subsystem but doesn't need distributed consensus. The mutex-per-subsystem pattern gives independent reload without global coordination.
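A minimal sketch of the mutex-per-subsystem reload, assuming a hypothetical `Config`, `Reloadable` interface, and `ScrapeManager` (none of these are Prometheus's own types). The subsystem validates and builds its next state first, then swaps it in under its own lock, so a rejected config leaves the previous state untouched.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a hypothetical parsed configuration shared by all subsystems.
type Config struct {
	ScrapeJobs []string
}

// Reloadable is implemented by every subsystem that supports hot reload.
type Reloadable interface {
	ApplyConfig(cfg *Config) error
}

// ScrapeManager guards its own state with its own mutex.
type ScrapeManager struct {
	mtx  sync.Mutex
	jobs []string
}

func (m *ScrapeManager) ApplyConfig(cfg *Config) error {
	if len(cfg.ScrapeJobs) == 0 {
		return errors.New("scrape: at least one job required")
	}
	// Build the new state before taking the lock, then swap under it.
	next := append([]string(nil), cfg.ScrapeJobs...)
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.jobs = next
	return nil
}

// Jobs returns a copy of the current job list.
func (m *ScrapeManager) Jobs() []string {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	return append([]string(nil), m.jobs...)
}

// Reload applies cfg to each subsystem in turn and reports the first failure.
func Reload(cfg *Config, subsystems ...Reloadable) error {
	for _, s := range subsystems {
		if err := s.ApplyConfig(cfg); err != nil {
			return fmt.Errorf("reload failed: %w", err)
		}
	}
	return nil
}

func main() {
	sm := &ScrapeManager{}
	fmt.Println(Reload(&Config{ScrapeJobs: []string{"node", "api"}}, sm), sm.Jobs())
	fmt.Println(Reload(&Config{}, sm), sm.Jobs()) // invalid config leaves old state in place
}
```

One design consequence worth noting: because subsystems reload independently, a failure partway through leaves earlier subsystems on the new config and later ones on the old. This sketch stops at the first error and does not attempt rollback.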

Ecto + Oban: Config at Init, Validated Once

# Oban validates exhaustively at startup
Validation.validate_schema(opts,
    engine: {:behaviour, Oban.Engine},
    queues: {:custom, &validate_queues/1},
    repo: {:module, [config: 0]},
    ...
)

Config is validated once at startup and stored as an immutable struct. No hot reload. If config is wrong, you know immediately (fail fast).

Force: Elixir/OTP applications apply new config by restarting processes. A config change goes through a supervisor restart, not in-place mutation. The "config as immutable struct" pattern rules out config that goes invalid or drifts at runtime: it either passes validation at startup or the app doesn't start.
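The validate-once idea carries over to other languages. Below is a Go sketch with hypothetical `Options` and `Config` types (the field names only loosely mirror the Oban options above): a constructor checks everything up front, copies its inputs, and exposes the result through read-only accessors.

```go
package main

import (
	"errors"
	"fmt"
)

// Options is raw, unvalidated input (hypothetical field names).
type Options struct {
	Repo   string
	Queues map[string]int // queue name -> concurrency limit
}

// Config is the validated form; fields are unexported so callers cannot mutate it.
type Config struct {
	repo   string
	queues map[string]int
}

// New validates every option once and returns an error instead of a half-valid Config.
func New(opts Options) (*Config, error) {
	if opts.Repo == "" {
		return nil, errors.New("repo: must be set")
	}
	queues := make(map[string]int, len(opts.Queues))
	for name, limit := range opts.Queues {
		if limit <= 0 {
			return nil, fmt.Errorf("queues: %q needs a positive limit, got %d", name, limit)
		}
		queues[name] = limit // copy so later changes to opts.Queues have no effect
	}
	return &Config{repo: opts.Repo, queues: queues}, nil
}

// QueueLimit is a read-only accessor.
func (c *Config) QueueLimit(name string) (int, bool) {
	limit, ok := c.queues[name]
	return limit, ok
}

func main() {
	_, err := New(Options{Repo: "MyApp.Repo", Queues: map[string]int{"mailers": 0}})
	fmt.Println(err) // startup would abort here rather than fail later at runtime
}
```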

3. Retry and Resilience

CockroachDB: Iterator-Based Retry

opts := retry.Options{
    InitialBackoff: 100 * time.Millisecond,
    MaxBackoff:     2 * time.Second,
    Multiplier:     2,
    MaxRetries:     5,
}
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    // attempt operation; doWork is a placeholder for the real call
    if err := doWork(ctx); err == nil {
        break
    }
}

Retry is a for-loop iterator. r.Next() handles backoff timing and returns false when retries are exhausted or the context is canceled. This means retry logic reads like normal code: no callbacks, no framework.

Force: CockroachDB has hundreds of retry sites. A callback-based retry would create deeply nested code. The iterator pattern keeps retry at the same indentation level as the operation.

Oban: Repo Dispatch with Built-In Retry

defp dynamic_dispatch(conf, name, args, attempt) do
    with_dynamic_repo(conf, fn repo ->
        apply(repo, name, args)
    end)
rescue
    error in UndefinedFunctionError ->
        if attempt < @retry_opts[:retry] do
            jittery_sleep(attempt * @retry_opts[:delay])
            dynamic_dispatch(conf, name, args, attempt + 1)
        else
            reraise error, __STACKTRACE__
        end
end

Every Ecto operation dispatched through Oban's repo wrapper gets automatic retry for transient failures. The consumer never sees the retry — it's invisible infrastructure.

Key insight: Oban retries UndefinedFunctionError on the repo module itself — absorbing the window during hot code reload when the module doesn't exist. This is an ecosystem-level concern (BEAM hot code loading) handled transparently.

4. Resource Lifecycle: The Stopper Pattern

CockroachDB: Stopper as Universal Lifecycle

type Stopper struct { ... }

// RunTask runs a synchronous task
func (s *Stopper) RunTask(ctx context.Context, taskName string, f func(context.Context)) error

// RunAsyncTask runs a goroutine tracked by the stopper
func (s *Stopper) RunAsyncTask(ctx context.Context, taskName string, f func(context.Context)) error

// ShouldQuiesce returns a channel closed when shutdown begins
func (s *Stopper) ShouldQuiesce() <-chan struct{}

// Stop initiates graceful shutdown
func (s *Stopper) Stop(ctx context.Context)

Every goroutine in CockroachDB is launched through a Stopper. This gives:

Tracking: know exactly which goroutines are running
Graceful shutdown: quiesce signal before hard stop
Leak detection: PrintLeakedStoppers in tests
Throttling: semaphore limits async tasks

func init() {
    leaktest.PrintLeakedStoppers = PrintLeakedStoppers
}

Force: A database cannot afford goroutine leaks — they hold locks, connections, and file handles. The Stopper is the universal answer: every background task is accounted for, every shutdown is graceful, every leak is detected in tests.

Oban: Registry-Based Lifecycle

children = [
    {Notifier, conf: conf, name: Registry.via(name, Notifier)},
    {Nursery, conf: conf, name: Registry.via(name, Nursery)},
    ...
]

OTP already provides lifecycle management via supervisors. Oban's addition is the Registry — namespacing processes so multiple instances can coexist. Lifecycle is delegated to the platform; naming is the library's concern.

5. What These Patterns Teach for Code Review

Questions to Ask About Cross-Cutting Concerns:

Logging: Who is the audience for this log? Is there a routing mechanism, or does everything go to stdout? Does the log help the operator, not just the developer?
Config: How does config reach this code? Is it validated at startup or silently wrong at runtime? Can it be changed without restart? Should it be?
Retry: Is retry happening at the right layer? Is it invisible to the caller? Does it have backoff + jitter? Does it respect context cancellation?
Lifecycle: Are background tasks tracked? Will they shut down gracefully? Can you detect leaks in tests?
Telemetry: Are events emitted or is logging the only observability? Can consumers attach their own handlers?

Red Flags:

log.Info("something happened") with no channel/audience
Config read from environment at point-of-use (not validated)
Retry logic duplicated in 5 places with different backoff
Goroutines launched with go func() and no tracking
No telemetry events — only log lines for observability
