docs: pattern break archaeology — WHY (not just WHAT) is impure
Git blame + commit messages reveal four categories: 1. Ship behavior, fix plumbing later (time pressure) 2. Better tooling exposed limitations (observability) 3. Removal cost > carrying cost (zero-interest debt) 4. Context needs different patterns (not actually a break) Specific findings: - CockroachDB Handle: introduced for profiling, not correctness - Stale 22.2 TODO: struct field costs 0, removal touches many files - Prometheus globals: startup constants read in hot path = correct call - AppenderV2: unifies 4 type-asserted interfaces into 1 - Ecto zero TODOs: version-gated, cleaned on version bumps by José
This commit is contained in:
@@ -0,0 +1,234 @@
|
||||
# Addendum: The Archaeology — WHY Patterns Break
|
||||
|
||||
The previous doc showed WHAT is impure. This digs into
|
||||
WHY — based on git blame, commit messages, and PR
|
||||
discussions.
|
||||
|
||||
---
|
||||
|
||||
## CockroachDB: The Bare Goroutine in kvadmission
|
||||
|
||||
**The code:**
|
||||
```go
|
||||
// TODO(irfansharif): Use a stopper here instead.
|
||||
go func() {
|
||||
ticker := time.NewTicker(10 * time.Minute)
|
||||
for { select { case <-ticker.C: ... } }
|
||||
}()
|
||||
```
|
||||
|
||||
**The history:**
|
||||
- **2022-08-22:** Irfan Sharif introduces the elastic CPU
|
||||
limiter (`admission,kvserver: introduce an elastic CPU
|
||||
limiter`). This was a complex new system for preventing
|
||||
long-running elastic work (backups) from starving
|
||||
foreground traffic.
|
||||
- The goroutine was part of a weight calculation loop.
|
||||
Irfan marked it with a TODO *in the same commit*.
|
||||
|
||||
**Why the shortcut:** The commit was already touching
|
||||
20+ files introducing a novel admission control system.
|
||||
Getting the Stopper integration right would have been
|
||||
one more thing to debug in an already complex change.
|
||||
Ship the behavior first, fix the plumbing later.
|
||||
|
||||
**Why it survived 3+ years:** The function *receives* a
|
||||
stopper as a parameter. It could be fixed in 5 lines.
|
||||
But nobody touched this function for any other reason,
|
||||
and the goroutine leaks nowhere (it lives for the
|
||||
process lifetime). The TODO is correct but not urgent.
|
||||
|
||||
---
|
||||
|
||||
## CockroachDB: The Handle Pattern's Origin
|
||||
|
||||
**The code (new way):**
|
||||
```go
|
||||
ctx, hdl, err := s.GetHandle(ctx, opts)
|
||||
go func(ctx context.Context) {
|
||||
defer hdl.Activate(ctx).Release(ctx)
|
||||
taskFn(ctx)
|
||||
}(ctx)
|
||||
```
|
||||
|
||||
**Why it replaced RunAsyncTask (Feb 2025):**
|
||||
|
||||
Tobias Grieger's commit message explains it perfectly:
|
||||
|
||||
> Prior to this change, the only option to launch async
|
||||
> tasks was the stopper's `RunAsyncTask{,Ex}` method.
|
||||
> This method would accept a closure and invoke it in a
|
||||
> goroutine.
|
||||
>
|
||||
> This has the undesirable property of recording a
|
||||
> location in `Stopper.RunAsyncTaskEx` as the async
|
||||
> task's creating goroutine plus its lowermost call
|
||||
> frame. Anyone who's ever looked at any profile will be
|
||||
> familiar with this fact.
|
||||
>
|
||||
> In other words, it becomes an ordeal to search for a
|
||||
> specific goroutine, since all goroutines started by the
|
||||
> Stopper get assigned the same searchable identity.
|
||||
|
||||
**The force:** Profiling. When every goroutine has the
|
||||
same root frame (`RunAsyncTaskEx`), CPU profiles and
|
||||
execution traces become unreadable. The new API makes
|
||||
callers launch goroutines directly so each has a unique
|
||||
stack identity in profiling tools.
|
||||
|
||||
**The migration strategy:** Old API stays. New code uses
|
||||
Handle. Conversion happens "as follow-up work" (the
|
||||
commit's words). Today: 295 Handle vs 63 RunAsyncTask.
|
||||
The ratio will shift naturally as old code is touched.
|
||||
|
||||
---
|
||||
|
||||
## CockroachDB: The Stale "Remove in 22.2" TODO
|
||||
|
||||
**The code:**
|
||||
```go
|
||||
// TODO(ajwerner): Remove in 22.2.
|
||||
SystemConfigProvider config.SystemConfigProvider
|
||||
```
|
||||
|
||||
**The history:**
|
||||
- **2022-02-09:** Andrew Werner lands "stop gossiping the
|
||||
system config." The span configuration infrastructure
|
||||
replaces the old gossip-based system config.
|
||||
- `SystemConfigProvider` was the bridge for mixed-version
|
||||
clusters (some nodes on old gossip, some on new spans).
|
||||
- The TODO expected this to be removable in 22.2 (the
|
||||
next major release), when all clusters would be past
|
||||
the migration.
|
||||
|
||||
**Why it survived:** The field is in a config struct.
|
||||
It's set but only used in the "mixed-version state."
|
||||
Once all clusters upgraded, it became dead weight — but
|
||||
it doesn't *break* anything. It's a struct field that
|
||||
consumes one pointer of memory. The cost of carrying it
|
||||
is near zero. The cost of removing it is touching
|
||||
`StoreConfig` initialization across many call sites.
|
||||
|
||||
**Lesson:** TODOs survive when removal cost > carrying
|
||||
cost. The more a struct is embedded in initialization
|
||||
paths, the harder it is to remove fields.
|
||||
|
||||
---
|
||||
|
||||
## Prometheus: Why Global Vars Instead of Config
|
||||
|
||||
**The code:**
|
||||
```go
|
||||
var ScrapeTimestampTolerance = 2 * time.Millisecond
|
||||
var AlignScrapeTimestamps = true
|
||||
```
|
||||
|
||||
**The history:**
|
||||
- **2020-10-07:** Julien Pivotto adds "experimental,
|
||||
hidden flag" for scrape timestamp tolerance.
|
||||
- **2021-09-08:** Made public after production data
|
||||
showed 10ms tolerance saves significant disk space.
|
||||
"For users who can afford a 10ms difference, there is
|
||||
a lot of resources and disk space to win."
|
||||
|
||||
**Why a global var:**
|
||||
1. Set once at startup from a command-line flag
|
||||
2. Read thousands of times per second (every scrape)
|
||||
3. Never changes after startup
|
||||
4. Threading a config struct would require adding a
|
||||
parameter to `scrapeLoop`, `scrapeLoopAppender`, and
|
||||
every function in the hot path — for a value that's
|
||||
effectively a constant.
|
||||
|
||||
**When this IS the right call:** A startup-configured
|
||||
value that never changes and is read in a hot path is
|
||||
semantically a constant. Making it a config struct field
|
||||
adds indirection for no benefit. The global is honest
|
||||
about what this value IS: a process-lifetime constant
|
||||
that happens to be configurable at init.
|
||||
|
||||
---
|
||||
|
||||
## Prometheus: WHY AppenderV2
|
||||
|
||||
**The problem:** V1 Appender has separate interfaces for
|
||||
different data types:
|
||||
- `Append()` — float64 samples
|
||||
- `AppendExemplar()` — exemplar data
|
||||
- `AppendHistogram()` — native histograms
|
||||
|
||||
Each requires type-assertion to check if the appender
|
||||
supports it. The scrape loop has 4 different code paths
|
||||
depending on which interfaces the underlying storage
|
||||
implements.
|
||||
|
||||
**V2 unifies them:** Single `Append()` method that
|
||||
accepts all types. No type assertions needed. The
|
||||
scrape loop becomes one code path.
|
||||
|
||||
**The migration pattern:** An adapter interface
|
||||
(`scrapeLoopAppendAdapter`) makes both V1 and V2
|
||||
callable through the same code. The selection happens
|
||||
once at construction time. ETA for V1 removal: Q2 2026
|
||||
(it's in the source comments).
|
||||
|
||||
---
|
||||
|
||||
## Ecto: Zero TODOs Is Discipline, Not Accident
|
||||
|
||||
**Evidence from git history:**
|
||||
- TODOs ARE added. Example: "TODO: Remove
|
||||
changeset.empty_values field" (added 2023-10-01)
|
||||
- They include *deadlines*: "in Ecto v3.14"
|
||||
- They get cleaned in dedicated "Update TODOs" commits
|
||||
by José Valim, triggered by version bumps
|
||||
|
||||
**The pattern:** TODOs are conditional on Elixir/Erlang
|
||||
version requirements. When Ecto bumps its minimum
|
||||
supported Elixir version, all TODOs gated on that
|
||||
version become actionable. José runs through them in a
|
||||
single commit.
|
||||
|
||||
**Oban (same pattern):** Parker Selbert removes TODOs
|
||||
within ~2 weeks of adding them. The solo-maintainer
|
||||
model means one person's standards apply uniformly.
|
||||
|
||||
---
|
||||
|
||||
## The Meta-Lesson: WHY Patterns Break
|
||||
|
||||
After archaeology, pattern breaks fall into exactly
|
||||
four categories:
|
||||
|
||||
1. **"Ship the behavior, fix the plumbing later"**
|
||||
(kvadmission goroutine). Time pressure on complex
|
||||
features. The author KNOWS it's wrong. Tagged with
|
||||
TODO immediately. Survives because removal isn't
|
||||
urgent.
|
||||
|
||||
2. **"Better tooling exposed the limitation"**
|
||||
(Handle replacing RunAsyncTask). The old pattern
|
||||
worked correctly but was invisible to profiling. The
|
||||
new pattern was motivated by observability, not
|
||||
correctness.
|
||||
|
||||
3. **"The removal cost exceeds the carrying cost"**
|
||||
(stale 22.2 TODO, global vars). Technically debt, but
|
||||
the interest rate is zero. Nobody will fix what
|
||||
doesn't hurt.
|
||||
|
||||
4. **"Different context needs different patterns"**
|
||||
(functional options in chunk writer, fmt.Errorf in CLI
|
||||
tools). Not actually a break — it's the right pattern
|
||||
for that specific context.
|
||||
|
||||
### For Code Review:
|
||||
|
||||
When you see a pattern break, ask: *which category?*
|
||||
|
||||
- Category 1 → worth filing a TODO, not blocking the PR
|
||||
- Category 2 → this is how progress happens; track it
|
||||
- Category 3 → leave it alone
|
||||
- Category 4 → approve it; context matters
|
||||
|
||||
<!-- PATTERN_COMPLETE -->
|
||||
Reference in New Issue
Block a user