Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00


Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific

Date: 2026-05-05

Task: Identify internal design incoherences in gargoyle's risk-controls.md (277 lines), a pre-trade risk control specification covering two evaluation stages, reduction semantics, ordering rationale, fail-closed claims, and audit logging.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence (safety properties not enforced, ordering/sequencing contradictions, reduction semantics conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required each finding to reference specific contradictory parts. No tools, no project context beyond the document itself.

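| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |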
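
The raw counts can be normalized for a rough efficiency comparison. The sketch below uses only the figures from the table. The two metrics (findings per 1K output tokens and findings per minute) are illustrative choices, not numbers the experiment reported, and the table does not say whether GPT-5's output-token figure includes its reasoning tokens.

```python
# Figures copied from the results table: (seconds, output tokens, findings).
runs = {
    "GPT-5": (112, 8231, 6),
    "Claude Opus 4.6": (41, 1858, 5),
    "Claude Sonnet 4.6": (15, 699, 4),
}


def efficiency(seconds, output_tokens, findings):
    """Return (findings per 1K output tokens, findings per minute)."""
    per_1k_tokens = findings / output_tokens * 1000
    per_minute = findings / seconds * 60
    return per_1k_tokens, per_minute


metrics = {model: efficiency(*row) for model, row in runs.items()}

for model, (per_1k, per_min) in metrics.items():
    print(f"{model}: {per_1k:.2f} findings/1K tokens, {per_min:.1f} findings/min")
```

These ratios say nothing about finding quality or uniqueness, so they complement the per-model breakdown rather than replace it.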

What they found — common ground (all 3 identified):

Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter earlier controls" (all three flagged this as the most obvious contradiction — Concentration at position 5 reduces, re-enters at BuyingPower at position 4, which IS an earlier control)
Ordering rationale's categorization of buying power/concentration is internally confused (the doc labels both as "quantity-sensitive checks" that run after reducing controls, but concentration IS a reducing control at position 5 while buying power at position 4 sits between the two reducing controls)
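
The re-entry contradiction is mechanical enough to check in code. The sketch below encodes only what this finding states about the decision stage (Position Size at 3, Buying Power at 4 and non-reducing, Concentration at 5 re-entering at 4). The `Control` type and the `reenter_at` field are hypothetical names for illustration and are not taken from risk-controls.md. Treating Position Size as a reducing control is inferred from the "two reducing controls" wording, and its re-entry point is left unset because the finding does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    name: str
    position: int                     # order in the decision-stage pipeline
    can_reduce: bool
    reenter_at: Optional[int] = None  # where evaluation resumes after a reduction


# Only the facts quoted in this finding; other controls are omitted.
controls = [
    Control("PositionSize", 3, can_reduce=True),
    Control("BuyingPower", 4, can_reduce=False),
    Control("Concentration", 5, can_reduce=True, reenter_at=4),
]


def reentry_violations(controls):
    """Reducing controls that resume at an earlier position than their own.

    Each one contradicts "reducing controls never re-enter earlier controls".
    """
    return [
        c.name
        for c in controls
        if c.can_reduce and c.reenter_at is not None and c.reenter_at < c.position
    ]


violations = reentry_violations(controls)
print(violations)
```

With the positions above, Concentration is the only control flagged, which matches the contradiction all three models reported.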

GPT-5 unique findings (not in either Claude model):

Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge of current positions. The doc explicitly states signals are evaluated "in isolation" with "no portfolio context — only the signal itself and user settings" — but checking whether the user holds a position IS portfolio context. This is a genuine design tension: either SignalRisk has hidden portfolio access (violating isolation) or NoShortSales can't actually work as specified.
Settings "fall through to system defaults" vs "Settings cache miss → reject." Two incompatible instructions for the same condition (missing settings).
"Universal fail-closed" with "only exception is order rate window" contradicted by Failure Modes table showing buying power as another exception ("Conservative estimate; may over-reject" is NOT rejection — it's a different failure mode than either fail-closed or the documented single exception).
Audit model says "every control evaluation produces an audit entry regardless of outcome" but the signal-stage write point only describes writing on rejection. Passing signals produce no documented audit entry at the signal stage.

Claude Opus unique findings (not in either other model):

Signal flow diagram swaps control order vs table: table shows (1) MarketHours, (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations. (VERIFIED: this is correct — the diagram does show a different order.)
Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and Fat Finger entirely during intermediate iterations. Also: Position Size at order 3 is never re-checked against Concentration-reduced quantity because re-entry starts at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented differently than the linear model described in Reduction Semantics.
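
The diagram/table reversal is the kind of error a simple comparison can catch once both orderings are transcribed. A minimal sketch, assuming the two orders exactly as quoted above; the control names are normalized by hand here because the finding writes "PerTradeStop" for the table and "PerTradeStopLoss" for the diagram, and `order_mismatches` is a hypothetical helper.

```python
# Signal-stage control order as quoted in the finding, names normalized.
table_order = ["MarketHours", "PerTradeStopLoss", "NoShortSales"]
diagram_order = ["MarketHours", "NoShortSales", "PerTradeStopLoss"]


def order_mismatches(table, diagram):
    """Return (position, table_name, diagram_name) wherever the two orders differ."""
    if sorted(table) != sorted(diagram):
        raise ValueError("table and diagram list different controls")
    return [
        (pos, t, d)
        for pos, (t, d) in enumerate(zip(table, diagram), start=1)
        if t != d
    ]


mismatches = order_mismatches(table_order, diagram_order)
for pos, t, d in mismatches:
    print(f"position {pos}: table says {t}, diagram says {d}")
```

Positions 2 and 3 are reported as mismatched, consistent with the verified reversal.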

Claude Sonnet unique findings (not in either other model):

Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still exceeds buying power, the system can only reject entirely (no mechanism to further optimize), defeating the purpose of the reduction system for capital-limited users. (NOTE: this is more of a design limitation than a self-contradiction, but the framing — that the reduction system's purpose is undermined by buying power's inability to reduce — is a legitimate coherence observation.)

Quality assessment:

GPT-5 produced the most findings (6) with the broadest coverage across the prompt's 5 categories. The NoShortSales/portfolio-context finding is the most genuinely insightful — it's a fundamental design-level contradiction (a signal-level control that REQUIRES decision-level context). The settings contradiction and audit logging inconsistency are also solid. Every finding points to two specific textual statements that are incompatible. Severity ratings were calibrated (1 Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
Claude Opus was remarkably fast (41s, 1,858 tokens) and found one thing neither other model caught: the diagram/table order reversal for signal controls. This is a concrete, verifiable error (not a design tension — a literal mistake in the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's version of the same core issue, exploring the implications for "smaller quantity wins" semantics. However, Opus found fewer total issues and missed the settings contradiction and audit logging inconsistency.
Claude Sonnet was the fastest (15s, 699 tokens) and found 4 issues. The buying power dead-end observation is unique and shows genuine reasoning about the reduction system's limitations. However, it is more a "this design can't achieve its stated goal" observation than a strict self-contradiction. Sonnet's other findings overlap with the common ground. Quality is solid, but the scope is narrower.

Key insight — Finding #15's Opus > GPT-5 result was document-specific: In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal suggests that the relative performance on coherence checking depends on the DOCUMENT'S structure, not on a fixed model advantage:

failure-modes.md (383 lines): A complex multi-process system with many stated invariants across failure states, supervision trees, and recovery paths. Rich in design TENSIONS where one subsystem's safety mechanism undermines another. This plays to Opus's strength (finding design tensions between subsystems).
risk-controls.md (277 lines): A more focused specification with explicit rules, ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS where one statement directly conflicts with another. This plays to GPT-5's strength (systematic verification of claims against stated mechanisms).

The difference: Opus excels when contradictions are EMERGENT (arise from composing multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two statements in the document say incompatible things). Risk-controls.md has more explicit contradictions (the settings fallback vs fail-closed, the "no portfolio context" vs NoShortSales, the audit "always" vs write point "only on reject").

Model performance depends on CONTRADICTION TYPE:

| Contradiction type | Best model | Example |
| --- | --- | --- |
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |

Revised conclusion on Finding #15's open question: "Does Opus > GPT-5 ordering for coherence checking hold across other documents?" No. The ordering depends on the document's contradiction density and type. Documents rich in emergent design tensions favor Opus. Documents with explicit specification errors favor GPT-5. The task type (coherence checking) doesn't have a fixed model winner — it depends on what KIND of incoherences the document contains.

Practical implication: Continue running both models for coherence checking. Their strengths are complementary even within the same task type. GPT-5 catches things you can point to in the spec and say "these two sentences conflict." Opus catches things where you need to reason about the implications of multiple mechanisms interacting.
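
Running both models means merging two finding lists. One way to do that is sketched below, assuming each finding has already been assigned a short key by a reviewer. The keys are hypothetical labels for findings described in this section, and deciding that two differently worded findings are the same issue remains a manual step.

```python
def merge_findings(by_model):
    """Union findings across models, recording which models reported each key."""
    merged = {}
    for model, keys in by_model.items():
        for key in keys:
            merged.setdefault(key, []).append(model)
    return merged


# Hypothetical keys; a subset of the findings discussed above.
by_model = {
    "GPT-5": [
        "reentry-at-buying-power",
        "no-portfolio-context-vs-noshortsales",
        "settings-fallback-vs-reject",
    ],
    "Opus": [
        "reentry-at-buying-power",
        "table-vs-diagram-order",
    ],
}

merged = merge_findings(by_model)

# Findings only one model reported are where the second run paid for itself.
unique = {key: models[0] for key, models in merged.items() if len(models) == 1}
for key, model in unique.items():
    print(f"{key}: only {model}")
```

Keeping the reporting models alongside each key makes it easy to see which findings justified the second run.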

Open Questions

Does GPT's advantage in finding inconsistencies extend to logical inconsistencies in arguments? One data point (verdict mismatches) — need more.
What's the optimal task granularity for GPT analytical review? "Whole PR" is too big. Is "one hypothesis" right, or can we batch?
~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well-structured task that any model would ace?~~ ANSWERED (Finding #8): Any model aces it when the biased text is presented without noise. The original result was about noise elimination, not model capability.
NEW: Does adding a narrow bias-check question to a rich PR review context recover the detection that broad review misses? (Signal-to-noise confirmation test)
~~How does reasoning_effort affect analytical quality? Only tested default so far.~~ ANSWERED (Finding #21): Negligible effect on GPT-5 for open-ended analytical tasks. Low/medium/high produced 33/30/30 findings with nearly identical reasoning tokens (~4K) and per-finding depth. The parameter may primarily affect verifiable-answer tasks, not exploration. Task framing remains the dominant quality lever.
Can we design a systematic "analytical review checklist" that leverages each model's strengths?
What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus excels at design-tension identification. How does Sonnet compare on the same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?) ANSWERED (Finding #12): Sonnet 4.6 significantly outperforms GPT-4.1 (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with genuine component-interaction reasoning. Opus still wins on design-tension identification specifically.
How do the models compare on research synthesis tasks (our #381 rewrite)? We'll find out during the actual rewrite.
~~Does the reasoning-token advantage scale with document complexity? Test with a simpler doc to see if the gap narrows.~~ ANSWERED (Finding #11): The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings of GPT-4.1 regardless of document complexity. Reasoning tokens enable exhaustive exploration independent of input difficulty.
~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding performance, or does it have different blind spots?~~ ANSWERED (Finding #11): Different blind spots, different strengths. GPT-5 reasons deeper into implementation mechanics (breadth + technical depth). Opus reasons wider about system context and design tensions (insight density). They're complementary, not competing. Run both on important architecture docs.
Does Sonnet 4.6's strong showing hold across other analytical tasks (bias detection, gap-finding) or is it specific to assumption-finding on complex documents? Need to test Sonnet on simpler docs and different question types. PARTIALLY ANSWERED (Finding #13): Sonnet's strength does NOT transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption-finding) to ~58% (race condition identification). Task type matters more than we thought. Still untested: gap-finding, bias detection for Sonnet.
NEW: What other analytical tasks require sequential/temporal reasoning (like race condition identification) vs pattern-matching reasoning (like assumption-finding)? Building a task taxonomy would help assign models correctly.
NEW: What explains Sonnet taking slightly longer than Opus here (106s vs 105s) despite normally being the faster model? Is it the document length, or does Sonnet's internal reasoning scale with complexity similarly to Opus?
~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable cheaper substitute?~~ ANSWERED (Finding #14): GPT-5 Mini is a viable middle option. Finds fewer issues (6 vs 10) but with genuine reasoning depth at ~50% cost/time. Better than non-reasoning models, not as exhaustive as GPT-5.
NEW: How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now exposes both; worth testing whether the newer versions regress on analytical tasks.
~~Would running GPT-5 Mini + Sonnet together (different axes) approach GPT-5's coverage at lower combined cost?~~ ANSWERED (Finding #19): 71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for high-stakes due to unique domain-knowledge findings in the missing 29%.
NEW (Finding #15): Does the Opus > GPT-5 ordering for coherence checking hold across other documents? The inversion (Opus finding more than GPT-5) was striking — need to confirm it wasn't document-specific. ANSWERED (Finding #27): No — it was document-specific. On risk-controls.md, GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus excels at emergent/compositional contradictions, GPT-5 at explicit/definitional ones. No fixed ordering for this task type.
NEW (Finding #15): Is the two-pass approach (Opus generates → GPT-5 validates) worth the extra cost vs just running Opus alone? Need to test whether GPT-5 actually catches Opus false-positives or just agrees.
~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~ ANSWERED (Finding #16): 4.5 is more exhaustive (2x findings), 4.6 is more precise (higher signal-to-noise). Genuine tradeoff, not a regression. 4.5 for coverage, 4.6 for actionability.
NEW (Finding #16): Does the 4.5 vs 4.6 pattern hold across other task types? Spec completeness may favor exhaustiveness; would coherence checking or race condition analysis show the same pattern?
NEW (Finding #16): Is running both Sonnet versions (4.5 + 4.6) cost-effective vs just running GPT-5? Need to compare the UNION of their findings against GPT-5's output for overlap analysis.
NEW (Finding #18): Does Opus's "predictable exploit window" detection transfer to other policy documents? It uniquely identified that the cooldown mechanism creates a GUARANTEED safe window that strategies could systematically exploit — this is a higher-order security insight. Worth testing whether Opus consistently finds "adversarial opportunity" framings that other models miss.
NEW (Finding #20): Does GPT-5's extreme verification behavior (15:1 reasoning-to-output ratio, 3 findings from 12K reasoning) persist across other documents with this prompt? Or was user-pipeline-lifecycle.md particularly verification-heavy? Test invariant violation paths on a simpler document.
NEW (Finding #20): Would giving GPT-5 a "minimum 8 findings" instruction reduce its selectivity and produce MORE invariant violations at lower precision? Or would it just pad with non-violations? The extreme selectivity may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify findings.
NEW (Finding #20): Opus's self-correction behavior is now confirmed across Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models to "show your reasoning and withdraw findings you cannot fully verify"?
NEW (Finding #22): The "silent correctness" lens revealed three distinct analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness, Sonnet → composition failures. Does this three-way differentiation hold on other documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
NEW (Finding #22): Does the "silent correctness" lens work on non-financial documents? The financial/regulatory domain has a large gap between syntactic and semantic correctness. Would the same prompt on an infrastructure/systems doc produce equally differentiated findings, or would it collapse into assumption-finding?
NEW (Finding #22): Opus's "missing feature identification" mode (wash sales, commissions) — is this promptable on other models? Could we explicitly ask GPT-5 "what should this system compute but doesn't" and get similar results? ANSWERED (Finding #26): YES — all three models find regulatory gaps and missing features when explicitly prompted. Opus's unique behavior in #22 was an emergent DEFAULT tendency, not a capability. Prompt framing dominates model personality.
NEW (Finding #28): Cross-document consistency found real bugs in gargoyle docs (fills vs events, position ownership, signal persistence). Does running this analysis across MORE document pairs (e.g., domain readmes vs implementation docs, design docs vs plan docs) yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
NEW (Finding #28): Opus was 2.4x faster AND found more issues than GPT-5 on cross-document consistency. Is this because cross-doc contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?

Methodology Notes

Internet opinions about models are overwhelmingly about coding. Don't extrapolate to analytical work without testing.
"Just because someone says it on the internet doesn't make it right." — Aaron, 2026-04-26. Opinions need context. Track our own evidence.
Absence of published methodology for a use case is itself a finding.
Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway. No unsupported generalizations.
Context dimensions to track:
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, research notes, project conventions, nothing)
- Whether the model had access to tools or just text
- Whether the task was explicit step-by-step or open-ended
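
The required fields and context dimensions above could be captured as a small record type so that incomplete findings are easy to spot. This is a sketch under assumptions: `FindingRecord`, `ContextShape`, and the enum names are hypothetical and not part of the existing notes. The example values for Finding 27 are taken from its write-up, except `step_by_step`, which is a judgment call.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Breadth(Enum):
    BROAD = "broad"      # "review this"
    FOCUSED = "focused"  # "answer this specific question"


@dataclass
class ContextShape:
    richness: Richness
    breadth: Breadth
    context_kinds: list[str]  # diff, full files, issue text, research notes, ...
    had_tools: bool
    step_by_step: bool        # False means open-ended


@dataclass
class FindingRecord:
    date: str
    task: str
    how_used: ContextShape
    what_happened: str
    takeaway: str

    def missing_fields(self) -> list[str]:
        """Names of required text fields that are empty or whitespace-only."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


finding_27 = FindingRecord(
    date="2026-05-05",
    task="Identify internal design incoherences in risk-controls.md",
    how_used=ContextShape(
        richness=Richness.MINIMAL,   # no project context beyond the document
        breadth=Breadth.FOCUSED,
        context_kinds=["full files"],
        had_tools=False,
        step_by_step=False,          # judgment call: structured by category, not by steps
    ),
    what_happened="GPT-5 found 6 incoherences, Opus 5, Sonnet 4.",
    takeaway="The coherence-checking winner depends on contradiction type.",
)

print(finding_27.missing_fields())
```

A record with a blank takeaway or date would show up in `missing_fields()`, which is one way to enforce the "no unsupported generalizations" rule before a finding is filed.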
