Files
model-research/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md
T
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

4.1 KiB

Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic

Date: 2026-05-02 Task: Identify missing failure scenarios in gargoyle's failure-modes.md (383 lines) How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5 (required by the model).

Model Time Output tokens Reasoning tokens Scenarios found
GPT-4.1 Mini 16s 2,003 0 10
GPT-4.1 24s 2,575 0 15
GPT-5 45s 8,565 6,656 14

What they found — common ground (all 3 identified):

  • ETS table corruption/loss affecting gates
  • BEAM scheduler starvation / GC pauses
  • WebSocket message duplication/reordering
  • Postgres connection pool exhaustion / deadlocks
  • Clock skew / time drift
  • Process registry inconsistency

GPT-5 unique findings (not in either other model):

  • Broker rate limiting (429s) — not "connection lost" so existing logic doesn't trigger, but can't flatten during kill switch
  • Broker auth failure / credential rotation — distinct from connection loss
  • Corporate actions (splits, symbol changes) — position drift without triggering staleness detection
  • Duplicate pipeline instances for same user (DynamicSupervisor race)
  • DB "commit unknown outcome" causing restart loops (Ecto commit succeeds at Postgres but client times out → retry → unique constraint → crash loop)
  • Cross-symbol strategies with partial staleness — multi-leg signals computed from mix of fresh and stale data
  • Partial cancel_all during kill switch masked by process restarts

GPT-4.1 unique findings (not in GPT-5 or Mini):

  • Zombie processes after halt (supervisor misconfiguration)
  • Unsupervised Task crashes going unnoticed
  • Audit log writes failing silently (not in same transaction as state change)
  • ClOrdID unique constraint violation from race in sequence generation
  • Broker API semantic changes (silent breaking changes)

GPT-4.1 Mini unique findings:

  • Race between kill switch engagement and reconciliation completion (timing coordination gap) — this was more explicitly called out than in the other models, though GPT-5 touches it implicitly
  • Strategy.Worker / Aggregator partial crash inconsistency

Quality assessment:

  • GPT-5 had the most domain-relevant and actionable gaps. Broker rate limiting, auth failures, corporate actions, and the DB commit unknown-outcome scenario are all realistic production issues specific to THIS system. The cross-symbol partial staleness finding shows deeper architectural reasoning about component interactions.
  • GPT-4.1 was thorough and well-structured but more generic/defensive. Many of its unique findings (zombie processes, unsupervised Tasks, audit log loss) are general Elixir concerns rather than specific to the document's architecture. Good for a completeness checklist.
  • GPT-4.1 Mini was formulaic — each finding followed the same template and several were somewhat surface-level or restated things the document partially covers. Still found the most scenarios per dollar.

Takeaway: For gap-finding in architecture documents, GPT-5's reasoning tokens pay off. It doesn't just list "things that could go wrong" — it identifies specific interactions that the document's existing mechanisms don't cover (e.g., rate limiting bypasses the "connection lost" detection, corporate actions bypass staleness detection). GPT-4.1 is a solid middle-ground: more thorough than Mini, less insightful than GPT-5. Mini is fine for a quick sanity check but won't find the subtle gaps.

Cost-effectiveness: Mini found 10 scenarios in 16s for ~7K tokens. GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for ~13.5K tokens (including 6.6K reasoning). For architecture review where missing a gap could mean financial loss, the GPT-5 cost is justified. For routine doc review, Mini + human judgment is probably sufficient.