Files

T

Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

4.1 KiB

Raw Blame History

Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic

Date: 2026-05-02 Task: Identify missing failure scenarios in gargoyle's failure-modes.md (383 lines) How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5 (required by the model).

Model	Time	Output tokens	Reasoning tokens	Scenarios found
GPT-4.1 Mini	16s	2,003	0	10
GPT-4.1	24s	2,575	0	15
GPT-5	45s	8,565	6,656	14

What they found — common ground (all 3 identified):

ETS table corruption/loss affecting gates
BEAM scheduler starvation / GC pauses
WebSocket message duplication/reordering
Postgres connection pool exhaustion / deadlocks
Clock skew / time drift
Process registry inconsistency

GPT-5 unique findings (not in either other model):

Broker rate limiting (429s) — not "connection lost" so existing logic doesn't trigger, but can't flatten during kill switch
Broker auth failure / credential rotation — distinct from connection loss
Corporate actions (splits, symbol changes) — position drift without triggering staleness detection
Duplicate pipeline instances for same user (DynamicSupervisor race)
DB "commit unknown outcome" causing restart loops (Ecto commit succeeds at Postgres but client times out → retry → unique constraint → crash loop)
Cross-symbol strategies with partial staleness — multi-leg signals computed from mix of fresh and stale data
Partial cancel_all during kill switch masked by process restarts

GPT-4.1 unique findings (not in GPT-5 or Mini):

Zombie processes after halt (supervisor misconfiguration)
Unsupervised Task crashes going unnoticed
Audit log writes failing silently (not in same transaction as state change)
ClOrdID unique constraint violation from race in sequence generation
Broker API semantic changes (silent breaking changes)

GPT-4.1 Mini unique findings:

Race between kill switch engagement and reconciliation completion (timing coordination gap) — this was more explicitly called out than in the other models, though GPT-5 touches it implicitly
Strategy.Worker / Aggregator partial crash inconsistency

Quality assessment:

GPT-5 had the most domain-relevant and actionable gaps. Broker rate limiting, auth failures, corporate actions, and the DB commit unknown-outcome scenario are all realistic production issues specific to THIS system. The cross-symbol partial staleness finding shows deeper architectural reasoning about component interactions.
GPT-4.1 was thorough and well-structured but more generic/defensive. Many of its unique findings (zombie processes, unsupervised Tasks, audit log loss) are general Elixir concerns rather than specific to the document's architecture. Good for a completeness checklist.
GPT-4.1 Mini was formulaic — each finding followed the same template and several were somewhat surface-level or restated things the document partially covers. Still found the most scenarios per dollar.

Takeaway: For gap-finding in architecture documents, GPT-5's reasoning tokens pay off. It doesn't just list "things that could go wrong" — it identifies specific interactions that the document's existing mechanisms don't cover (e.g., rate limiting bypasses the "connection lost" detection, corporate actions bypass staleness detection). GPT-4.1 is a solid middle-ground: more thorough than Mini, less insightful than GPT-5. Mini is fine for a quick sanity check but won't find the subtle gaps.

Cost-effectiveness: Mini found 10 scenarios in 16s for ~7K tokens. GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for ~13.5K tokens (including 6.6K reasoning). For architecture review where missing a gap could mean financial loss, the GPT-5 cost is justified. For routine doc review, Mini + human judgment is probably sufficient.

4.1 KiB Raw Blame History

Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic

4.1 KiB

Raw Blame History