New analytical lens (concurrency analysis) applied to signal-lifecycle.md (111 lines).
All three models (GPT-5, Opus, Sonnet) produced 7-9 findings each, with
70% at Critical/High severity. Key insight: concurrency analysis
rewards compositional temporal reasoning over enumeration breadth,
narrowing the gap between models compared to other lenses.
Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash
survival contradiction), Sonnet (Signal Risk audit gap after dispatch).

New analytical lens (adversarial/offensive security) tested on gargoyle's
signal audit log spec. GPT-5 most exhaustive (25), Opus deepest individual
attack narratives (14), Sonnet most creative meta-attacks (11).
Adversarial lens is ~2.5x more productive than defensive lenses on
comparable docs. All three models converged on the same root cause (trust model).

New analytical lens: where systems rely on single mechanisms rather than
layered defenses. GPT-5 finds exploitable SSRF; Opus identifies trust-root
collapse (session+sudo share SECRET_KEY_BASE); Sonnet is surface-level.

Tests a novel analytical lens on aggregation.md (239 lines): 'what happens
when many correct instances operate simultaneously in a correlated environment?'
Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback
loops. Opus (8 findings, 93s) finds the most consequential single findings
(stop-loss defeated by temporal composition, crash-opportunity correlation).
Sonnet 4.0 (6 findings, 32s) too abstract for this task.
Key insight: This lens finds DEPLOYMENT bugs invisible at design time -
the gap between 'correct by construction' and 'correct in production'.

Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).
Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis — requires reasoning about
negation (what ISN'T instrumented vs what production needs).

Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.
Key insight: Sonnet can detect explicit contradictions (Finding #28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.
Refines Finding #28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.

Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?
Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).
Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6

New analytical lens: failure propagation chains. Opus matched GPT-5's count
(10 findings each) while using 2.2x fewer tokens. Overview docs are ideal
for this lens. Sonnet produced zero unique insights.

New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.

New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.
Results:
- GPT-5: 12 findings (137s, 10,688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding
Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.
Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).

Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?
Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. With narrow framing,
Sonnet found 3 contradictions, but only 1 was genuine (the other 2 were
analytical errors or misreads). GPT-5 found 4 findings, all genuine, with
precise reasoning.
Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.
Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
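
The genuine-versus-reported counts above amount to a precision comparison.
A minimal sketch of that arithmetic, assuming precision is simply genuine
findings divided by reported findings; the function name `precision` is
hypothetical and not part of the experiment tooling:

```python
def precision(genuine: int, reported: int) -> float:
    """Fraction of reported findings that were judged genuine."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return genuine / reported


# Counts taken from the narrow-framing experiment described above.
sonnet_narrow = precision(genuine=1, reported=3)
gpt5 = precision(genuine=4, reported=4)

print(f"Sonnet (narrow): {sonnet_narrow:.0%}")
print(f"GPT-5:           {gpt5:.0%}")
```

On these counts Sonnet lands at roughly one in three genuine, while every
GPT-5 finding held up.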
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.

Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.
Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.
Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).

New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.
Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.
Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.

Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md.
Key results:
- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements — GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value
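
The coverage figure in the results above follows directly from the finding
counts. A minimal sketch of the arithmetic, using only the counts reported
here; the helper `relative_gain` is a hypothetical name introduced for
illustration:

```python
def relative_gain(combined: int, baseline: int) -> float:
    """Percentage increase of `combined` over `baseline`."""
    return (combined - baseline) / baseline * 100


# Unique-finding counts reported for dtbp-margin-call.md.
gpt5_alone = 43
opus_alone = 28
ensemble = 56

gain_vs_gpt5 = relative_gain(ensemble, gpt5_alone)
gain_vs_opus = relative_gain(ensemble, opus_alone)

print(f"Ensemble vs GPT-5 alone: +{gain_vs_gpt5:.1f}%")
print(f"Ensemble vs Opus alone:  +{gain_vs_opus:.1f}%")
```

The "30% more coverage" figure is the gain over GPT-5 alone (13 extra
findings on a base of 43); measured against Opus alone the ensemble doubles
the count.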
New analytical lens: where data propagation creates stale, contradictory,
or misleading views for different consumers.
Key result: highest model convergence (45% common ground) due to the document's
explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus
identifies strategy attribution dimension. Sonnet adds zero unique value.
Two-model stack (GPT-5 + Opus) optimal.

New analytical lens: observability gap analysis, asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'
Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value
Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
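
The per-model "unique" counts above are the findings no other model
reported. A minimal sketch of that bookkeeping, assuming findings have
already been normalized to shared labels (in practice that matching is a
judgment call); the labels and the helper `unique_per_model` are
hypothetical and do not come from the experiment data:

```python
def unique_per_model(findings: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each model, keep only the findings no other model reported."""
    unique = {}
    for model, own in findings.items():
        others = set().union(
            *(found for name, found in findings.items() if name != model)
        )
        unique[model] = own - others
    return unique


# Hypothetical, already-deduplicated finding labels for illustration only.
example = {
    "gpt5": {"missing-trace-id", "no-queue-depth-metric", "stale-cache-unflagged"},
    "opus": {"stale-cache-unflagged", "alert-points-to-wrong-service"},
    "sonnet": {"stale-cache-unflagged"},
}

unique = unique_per_model(example)
for model, items in unique.items():
    print(model, len(items), sorted(items))
```

A model whose unique set is empty, as with Sonnet in this toy input, adds
nothing to the combined finding set.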
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.

New task type: specification gap/completeness analysis (vs adversarial gaming).
GPT-5 dominates count (25 findings), Opus produces best single insight
(realized P&L non-reversibility violates de-escalation model assumption).
Sonnet adds no unique value for this task type; skip it for completeness audits.

REPORT.md: full analysis of 29 experiments, covering model strengths, task-type
mappings, meta-findings, cost-effectiveness, and open questions.
LESSONS.md: distilled operational playbook covering which model for which task,
anti-patterns, decision framework, and the three core rules.

Break the monolithic 3,249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.
No content changes — purely structural reorganization.
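
The naming scheme above can be sketched as a small helper. This is an
illustration only: it assumes NN is a zero-padded two-digit sequence number
and that the slug is a lowercased, hyphen-separated form of the experiment
title. The function name `finding_filename`, the dates, and the titles are
hypothetical:

```python
import re
from datetime import date


def finding_filename(day: date, seq: int, title: str) -> str:
    """Build a YYYY-MM-DD-NN-slug.md name from a date, sequence number, and title."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{day.isoformat()}-{seq:02d}-{slug}.md"


# Hypothetical dates and titles, for illustration only.
names = [
    finding_filename(date(2025, 1, 16), 1, "Adversarial lens"),
    finding_filename(date(2025, 1, 15), 3, "Race Conditions: signal-lifecycle"),
]

# ISO dates plus zero-padded counters sort chronologically as plain strings.
for name in sorted(names):
    print(name)
```

Because the date comes first in ISO order, a plain directory listing
already shows the experiments in chronological order.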
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).
Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table
Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: signal-to-noise ratio matters more than model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different
Signed-off-by: Rodin