Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?
Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).
Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
New analytical lens: failure propagation chains. Opus matched GPT-5's count
(10 findings each) while using 2.2x fewer tokens. Overview docs are ideal
for this lens. Sonnet produced zero unique insights.
New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.
Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding
Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.
Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?
Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.
Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.
Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.
Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.
Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.
Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.
Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.
Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md.
Key results:
- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements — GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value
New analytical lens: where data propagation creates stale, contradictory,
or misleading views for different consumers.
Key result: highest model convergence (45% common ground) due to document's
explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus
identifies strategy attribution dimension. Sonnet adds zero unique value.
Two-model stack (GPT-5 + Opus) optimal.
New analytical lens: observability gap analysis — asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'
Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value
Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.
New task type: specification gap/completeness analysis (vs adversarial gaming).
GPT-5 dominates count (25 findings), Opus produces best single insight
(realized P&L non-reversibility violates de-escalation model assumption).
Sonnet adds no unique value for this task type — skip for completeness audits.
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.
No content changes — purely structural reorganization.
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).
Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table
Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different
Signed-off-by: Rodin