docs: pattern break archaeology — WHY (not just WHAT) is impure
Git blame + commit messages reveal four categories: 1. Ship behavior, fix plumbing later (time pressure) 2. Better tooling exposed limitations (observability) 3. Removal cost > carrying cost (zero-interest debt) 4. Context needs different patterns (not actually a break) Specific findings: - CockroachDB Handle: introduced for profiling, not correctness - Stale 22.2 TODO: struct field costs 0, removal touches many files - Prometheus globals: startup constants read in hot path = correct call - AppenderV2: unifies 4 type-asserted interfaces into 1 - Ecto zero TODOs: version-gated, cleaned on version bumps by José
This commit is contained in:
@@ -0,0 +1,234 @@
|
|||||||
|
# Addendum: The Archaeology — WHY Patterns Break
|
||||||
|
|
||||||
|
The previous doc showed WHAT is impure. This digs into
|
||||||
|
WHY — based on git blame, commit messages, and PR
|
||||||
|
discussions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CockroachDB: The Bare Goroutine in kvadmission
|
||||||
|
|
||||||
|
**The code:**
|
||||||
|
```go
|
||||||
|
// TODO(irfansharif): Use a stopper here instead.
|
||||||
|
go func() {
|
||||||
|
ticker := time.NewTicker(10 * time.Minute)
|
||||||
|
for { select { case <-ticker.C: ... } }
|
||||||
|
}()
|
||||||
|
```
|
||||||
|
|
||||||
|
**The history:**
|
||||||
|
- **2022-08-22:** Irfan Sharif introduces the elastic CPU
|
||||||
|
limiter (`admission,kvserver: introduce an elastic CPU
|
||||||
|
limiter`). This was a complex new system for preventing
|
||||||
|
long-running elastic work (backups) from starving
|
||||||
|
foreground traffic.
|
||||||
|
- The goroutine was part of a weight calculation loop.
|
||||||
|
Irfan marked it with a TODO *in the same commit*.
|
||||||
|
|
||||||
|
**Why the shortcut:** The commit was already touching
|
||||||
|
20+ files introducing a novel admission control system.
|
||||||
|
Getting the Stopper integration right would have been
|
||||||
|
one more thing to debug in an already complex change.
|
||||||
|
Ship the behavior first, fix the plumbing later.
|
||||||
|
|
||||||
|
**Why it survived 3+ years:** The function *receives* a
|
||||||
|
stopper as a parameter. It could be fixed in 5 lines.
|
||||||
|
But nobody touched this function for any other reason,
|
||||||
|
and the goroutine leaks nowhere (it lives for the
|
||||||
|
process lifetime). The TODO is correct but not urgent.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CockroachDB: The Handle Pattern's Origin
|
||||||
|
|
||||||
|
**The code (new way):**
|
||||||
|
```go
|
||||||
|
ctx, hdl, err := s.GetHandle(ctx, opts)
|
||||||
|
go func(ctx context.Context) {
|
||||||
|
defer hdl.Activate(ctx).Release(ctx)
|
||||||
|
taskFn(ctx)
|
||||||
|
}(ctx)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why it replaced RunAsyncTask (Feb 2025):**
|
||||||
|
|
||||||
|
Tobias Grieger's commit message explains it perfectly:
|
||||||
|
|
||||||
|
> Prior to this change, the only option to launch async
|
||||||
|
> tasks was the stopper's `RunAsyncTask{,Ex}` method.
|
||||||
|
> This method would accept a closure and invoke it in a
|
||||||
|
> goroutine.
|
||||||
|
>
|
||||||
|
> This has the undesirable property of recording a
|
||||||
|
> location in `Stopper.RunAsyncTaskEx` as the async
|
||||||
|
> task's creating goroutine plus its lowermost call
|
||||||
|
> frame. Anyone who's ever looked at any profile will be
|
||||||
|
> familiar with this fact.
|
||||||
|
>
|
||||||
|
> In other words, it becomes an ordeal to search for a
|
||||||
|
> specific goroutine, since all goroutines started by the
|
||||||
|
> Stopper get assigned the same searchable identity.
|
||||||
|
|
||||||
|
**The force:** Profiling. When every goroutine has the
|
||||||
|
same root frame (`RunAsyncTaskEx`), CPU profiles and
|
||||||
|
execution traces become unreadable. The new API makes
|
||||||
|
callers launch goroutines directly so each has a unique
|
||||||
|
stack identity in profiling tools.
|
||||||
|
|
||||||
|
**The migration strategy:** Old API stays. New code uses
|
||||||
|
Handle. Conversion happens "as follow-up work" (the
|
||||||
|
commit's words). Today: 295 Handle vs 63 RunAsyncTask.
|
||||||
|
The ratio will shift naturally as old code is touched.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CockroachDB: The Stale "Remove in 22.2" TODO
|
||||||
|
|
||||||
|
**The code:**
|
||||||
|
```go
|
||||||
|
// TODO(ajwerner): Remove in 22.2.
|
||||||
|
SystemConfigProvider config.SystemConfigProvider
|
||||||
|
```
|
||||||
|
|
||||||
|
**The history:**
|
||||||
|
- **2022-02-09:** Andrew Werner lands "stop gossiping the
|
||||||
|
system config." The span configuration infrastructure
|
||||||
|
replaces the old gossip-based system config.
|
||||||
|
- `SystemConfigProvider` was the bridge for mixed-version
|
||||||
|
clusters (some nodes on old gossip, some on new spans).
|
||||||
|
- The TODO expected this to be removable in 22.2 (the
|
||||||
|
next major release), when all clusters would be past
|
||||||
|
the migration.
|
||||||
|
|
||||||
|
**Why it survived:** The field is in a config struct.
|
||||||
|
It's set but only used in the "mixed-version state."
|
||||||
|
Once all clusters upgraded, it became dead weight — but
|
||||||
|
it doesn't *break* anything. It's a struct field that
|
||||||
|
consumes one pointer of memory. The cost of carrying it
|
||||||
|
is near zero. The cost of removing it is touching
|
||||||
|
`StoreConfig` initialization across many call sites.
|
||||||
|
|
||||||
|
**Lesson:** TODOs survive when removal cost > carrying
|
||||||
|
cost. The more a struct is embedded in initialization
|
||||||
|
paths, the harder it is to remove fields.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prometheus: Why Global Vars Instead of Config
|
||||||
|
|
||||||
|
**The code:**
|
||||||
|
```go
|
||||||
|
var ScrapeTimestampTolerance = 2 * time.Millisecond
|
||||||
|
var AlignScrapeTimestamps = true
|
||||||
|
```
|
||||||
|
|
||||||
|
**The history:**
|
||||||
|
- **2020-10-07:** Julien Pivotto adds "experimental,
|
||||||
|
hidden flag" for scrape timestamp tolerance.
|
||||||
|
- **2021-09-08:** Made public after production data
|
||||||
|
showed 10ms tolerance saves significant disk space.
|
||||||
|
"For users who can afford a 10ms difference, there is
|
||||||
|
a lot of resources and disk space to win."
|
||||||
|
|
||||||
|
**Why a global var:**
|
||||||
|
1. Set once at startup from a command-line flag
|
||||||
|
2. Read thousands of times per second (every scrape)
|
||||||
|
3. Never changes after startup
|
||||||
|
4. Threading a config struct would require adding a
|
||||||
|
parameter to `scrapeLoop`, `scrapeLoopAppender`, and
|
||||||
|
every function in the hot path — for a value that's
|
||||||
|
effectively a constant.
|
||||||
|
|
||||||
|
**When this IS the right call:** A startup-configured
|
||||||
|
value that never changes and is read in a hot path is
|
||||||
|
semantically a constant. Making it a config struct field
|
||||||
|
adds indirection for no benefit. The global is honest
|
||||||
|
about what this value IS: a process-lifetime constant
|
||||||
|
that happens to be configurable at init.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prometheus: WHY AppenderV2
|
||||||
|
|
||||||
|
**The problem:** V1 Appender has separate interfaces for
|
||||||
|
different data types:
|
||||||
|
- `Append()` — float64 samples
|
||||||
|
- `AppendExemplar()` — exemplar data
|
||||||
|
- `AppendHistogram()` — native histograms
|
||||||
|
|
||||||
|
Each requires type-assertion to check if the appender
|
||||||
|
supports it. The scrape loop has 4 different code paths
|
||||||
|
depending on which interfaces the underlying storage
|
||||||
|
implements.
|
||||||
|
|
||||||
|
**V2 unifies them:** Single `Append()` method that
|
||||||
|
accepts all types. No type assertions needed. The
|
||||||
|
scrape loop becomes one code path.
|
||||||
|
|
||||||
|
**The migration pattern:** An adapter interface
|
||||||
|
(`scrapeLoopAppendAdapter`) makes both V1 and V2
|
||||||
|
callable through the same code. The selection happens
|
||||||
|
once at construction time. ETA for V1 removal: Q2 2026
|
||||||
|
(it's in the source comments).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ecto: Zero TODOs Is Discipline, Not Accident
|
||||||
|
|
||||||
|
**Evidence from git history:**
|
||||||
|
- TODOs ARE added. Example: "TODO: Remove
|
||||||
|
changeset.empty_values field" (added 2023-10-01)
|
||||||
|
- They include *deadlines*: "in Ecto v3.14"
|
||||||
|
- They get cleaned in dedicated "Update TODOs" commits
|
||||||
|
by José Valim, triggered by version bumps
|
||||||
|
|
||||||
|
**The pattern:** TODOs are conditional on Elixir/Erlang
|
||||||
|
version requirements. When Ecto bumps its minimum
|
||||||
|
supported Elixir version, all TODOs gated on that
|
||||||
|
version become actionable. José runs through them in a
|
||||||
|
single commit.
|
||||||
|
|
||||||
|
**Oban (same pattern):** Parker Selbert removes TODOs
|
||||||
|
within ~2 weeks of adding them. The solo-maintainer
|
||||||
|
model means one person's standards apply uniformly.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Meta-Lesson: WHY Patterns Break
|
||||||
|
|
||||||
|
After archaeology, pattern breaks fall into exactly
|
||||||
|
four categories:
|
||||||
|
|
||||||
|
1. **"Ship the behavior, fix the plumbing later"**
|
||||||
|
(kvadmission goroutine). Time pressure on complex
|
||||||
|
features. The author KNOWS it's wrong. Tagged with
|
||||||
|
TODO immediately. Survives because removal isn't
|
||||||
|
urgent.
|
||||||
|
|
||||||
|
2. **"Better tooling exposed the limitation"**
|
||||||
|
(Handle replacing RunAsyncTask). The old pattern
|
||||||
|
worked correctly but was invisible to profiling. The
|
||||||
|
new pattern was motivated by observability, not
|
||||||
|
correctness.
|
||||||
|
|
||||||
|
3. **"The removal cost exceeds the carrying cost"**
|
||||||
|
(stale 22.2 TODO, global vars). Technically debt, but
|
||||||
|
the interest rate is zero. Nobody will fix what
|
||||||
|
doesn't hurt.
|
||||||
|
|
||||||
|
4. **"Different context needs different patterns"**
|
||||||
|
(functional options in chunk writer, fmt.Errorf in CLI
|
||||||
|
tools). Not actually a break — it's the right pattern
|
||||||
|
for that specific context.
|
||||||
|
|
||||||
|
### For Code Review:
|
||||||
|
|
||||||
|
When you see a pattern break, ask: *which category?*
|
||||||
|
|
||||||
|
- Category 1 → worth filing a TODO, not blocking the PR
|
||||||
|
- Category 2 → this is how progress happens; track it
|
||||||
|
- Category 3 → leave it alone
|
||||||
|
- Category 4 → approve it; context matters
|
||||||
|
|
||||||
|
<!-- PATTERN_COMPLETE -->
|
||||||
Reference in New Issue
Block a user