6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
78 lines
4.1 KiB
Markdown
78 lines
4.1 KiB
Markdown
# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
|
|
|
|
**Date:** 2026-05-02
|
|
**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
|
|
**How we used them:** Same document (full text, no truncation) + same focused
|
|
analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
|
|
No tools, no project context beyond the document itself. Single prompt, no
|
|
conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
|
|
(required by the model).
|
|
|
|
| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|
|
|---|---|---|---|---|
|
|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
|
|
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
|
|
| GPT-5 | 45s | 8,565 | 6,656 | 14 |
|
|
|
|
**What they found — common ground (all 3 identified):**
|
|
- ETS table corruption/loss affecting gates
|
|
- BEAM scheduler starvation / GC pauses
|
|
- WebSocket message duplication/reordering
|
|
- Postgres connection pool exhaustion / deadlocks
|
|
- Clock skew / time drift
|
|
- Process registry inconsistency
|
|
|
|
**GPT-5 unique findings (not in either other model):**
|
|
- Broker rate limiting (429s) — not "connection lost" so existing logic
|
|
doesn't trigger, but can't flatten during kill switch
|
|
- Broker auth failure / credential rotation — distinct from connection loss
|
|
- Corporate actions (splits, symbol changes) — position drift without
|
|
triggering staleness detection
|
|
- Duplicate pipeline instances for same user (DynamicSupervisor race)
|
|
- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
|
|
at Postgres but client times out → retry → unique constraint → crash loop)
|
|
- Cross-symbol strategies with partial staleness — multi-leg signals
|
|
computed from mix of fresh and stale data
|
|
- Partial cancel_all during kill switch masked by process restarts
|
|
|
|
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
|
|
- Zombie processes after halt (supervisor misconfiguration)
|
|
- Unsupervised Task crashes going unnoticed
|
|
- Audit log writes failing silently (not in same transaction as state change)
|
|
- ClOrdID unique constraint violation from race in sequence generation
|
|
- Broker API semantic changes (silent breaking changes)
|
|
|
|
**GPT-4.1 Mini unique findings:**
|
|
- Race between kill switch engagement and reconciliation completion
|
|
(timing coordination gap) — this was more explicitly called out than
|
|
in the other models, though GPT-5 touches it implicitly
|
|
- Strategy.Worker / Aggregator partial crash inconsistency
|
|
|
|
**Quality assessment:**
|
|
- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
|
|
rate limiting, auth failures, corporate actions, and the DB commit
|
|
unknown-outcome scenario are all realistic production issues specific
|
|
to THIS system. The cross-symbol partial staleness finding shows
|
|
deeper architectural reasoning about component interactions.
|
|
- **GPT-4.1** was thorough and well-structured but more generic/defensive.
|
|
Many of its unique findings (zombie processes, unsupervised Tasks,
|
|
audit log loss) are general Elixir concerns rather than specific to
|
|
the document's architecture. Good for a completeness checklist.
|
|
- **GPT-4.1 Mini** was formulaic — each finding followed the same template
|
|
and several were somewhat surface-level or restated things the document
|
|
partially covers. Still found the most scenarios per dollar.
|
|
|
|
**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
|
|
tokens pay off. It doesn't just list "things that could go wrong" — it
|
|
identifies *specific interactions* that the document's existing mechanisms
|
|
don't cover (e.g., rate limiting bypasses the "connection lost" detection,
|
|
corporate actions bypass staleness detection). GPT-4.1 is a solid
|
|
middle-ground: more thorough than Mini, less insightful than GPT-5.
|
|
Mini is fine for a quick sanity check but won't find the subtle gaps.
|
|
|
|
**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
|
|
GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
|
|
~13.5K tokens (including 6.6K reasoning). For architecture review where
|
|
missing a gap could mean financial loss, the GPT-5 cost is justified.
|
|
For routine doc review, Mini + human judgment is probably sufficient.
|