model-research/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific
**Date:** 2026-05-05
**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
ordering rationale, fail-closed claims, and audit logging.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
each finding to reference specific contradictory parts. No tools, no project context beyond
the document itself.
| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
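
The table's raw figures can be normalized per finding to compare the runs on a common footing. The snippet below is illustrative arithmetic only: the `runs` dictionary and the `per_finding` helper are names introduced here, not part of the experiment tooling. Because the Claude rows list reasoning tokens as internal, output-token counts may not be directly comparable across the three rows.

```python
# Figures copied from the results table above.
runs = {
    "GPT-5": {"seconds": 112, "output_tokens": 8231, "findings": 6},
    "Claude Opus 4.6": {"seconds": 41, "output_tokens": 1858, "findings": 5},
    "Claude Sonnet 4.6": {"seconds": 15, "output_tokens": 699, "findings": 4},
}


def per_finding(run):
    """Normalize wall-clock time and output tokens by the number of findings."""
    return {
        "seconds_per_finding": round(run["seconds"] / run["findings"], 1),
        "tokens_per_finding": round(run["output_tokens"] / run["findings"]),
    }


for model, run in runs.items():
    print(model, per_finding(run))
```

A per-finding view says nothing about finding quality; it only makes the speed and verbosity differences easier to read.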
**What they found — common ground (all 3 identified):**
- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
earlier controls" (all three flagged this as the most obvious contradiction —
Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
which IS an earlier control)
- Ordering rationale's categorization of buying power/concentration is internally
confused (the doc labels both as "quantity-sensitive checks" that run after
reducing controls, but concentration IS a reducing control at position 5 while
buying power at position 4 sits between the two reducing controls)
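
The re-entry contradiction above can be stated as a mechanical check. This is a minimal sketch, assuming only the positions reported in this finding (Position Size at 3, Buying Power at 4, Concentration at 5); the `Control` type and `reentry_violations` function are hypothetical names, and treating Position Size as the other reducing control is an inference from this finding, not a quote from `risk-controls.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    name: str
    position: int     # order in the decision-stage pipeline
    can_reduce: bool  # whether the control may shrink the order quantity


# Only the controls whose positions are stated in this finding; others omitted.
CONTROLS = [
    Control("PositionSize", 3, True),    # assumed reducing (inferred)
    Control("BuyingPower", 4, False),
    Control("Concentration", 5, True),
]


def reentry_violations(controls, reentry_target):
    """List reducing controls for which re-entry at `reentry_target` goes backwards."""
    target = next(c for c in controls if c.name == reentry_target)
    return [
        (c.name, target.name)
        for c in controls
        if c.can_reduce and target.position < c.position
    ]


print(reentry_violations(CONTROLS, "BuyingPower"))
```

Under these assumptions the check reports the Concentration to BuyingPower re-entry as the one case that conflicts with "reducing controls never re-enter earlier controls".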
**GPT-5 unique findings (not in either Claude model):**
- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
of current positions. The doc explicitly states signals are evaluated "in isolation"
with "no portfolio context — only the signal itself and user settings" — but checking
whether the user holds a position IS portfolio context. This is a genuine design
tension: either SignalRisk has hidden portfolio access (violating isolation) or
NoShortSales can't actually work as specified.
- Settings "fall through to system defaults" vs "Settings cache miss → reject."
Two incompatible instructions for the same condition (missing settings).
- "Universal fail-closed" with "only exception is order rate window" contradicted
by Failure Modes table showing buying power as another exception ("Conservative
estimate; may over-reject" is NOT rejection — it's a different failure mode than
either fail-closed or the documented single exception).
- Audit model says "every control evaluation produces an audit entry regardless of
outcome" but the signal-stage write point only describes writing on rejection.
Passing signals produce no documented audit entry at the signal stage.
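
The first GPT-5 finding above can be made concrete with a small sketch. It is not taken from gargoyle: the `Signal` type, the `no_short_sales` function, and the `held_quantity` parameter are all hypothetical, chosen only to show that the check has no answer unless some portfolio information is supplied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signal:
    symbol: str
    side: str      # "buy" or "sell"
    quantity: int


def no_short_sales(signal: Signal, held_quantity: Optional[int]) -> str:
    """Hypothetical short-sale check; `held_quantity` is the portfolio context."""
    if signal.side != "sell":
        return "pass"
    if held_quantity is None:
        # Signal evaluated "in isolation": nothing to compare the sell against.
        return "undecidable"
    return "pass" if signal.quantity <= held_quantity else "reject"


sell = Signal("XYZ", "sell", 10)
print(no_short_sales(sell, None))  # no portfolio context
print(no_short_sales(sell, 10))    # position covers the sell
print(no_short_sales(sell, 5))     # sell exceeds the position
```

The middle branch is the tension GPT-5 identified: either the signal stage receives position data, or the control cannot decide.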
**Claude Opus unique findings (not in either other model):**
- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
(2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
→ PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
(VERIFIED: this is correct — the diagram does show a different order.)
- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
is never re-checked against Concentration-reduced quantity because re-entry starts
at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
differently than the linear model described in Reduction Semantics.
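
Opus's second point can be illustrated by tracing which controls run when Concentration reduces and the pipeline re-enters at BuyingPower. The model below is a deliberately simplified assumption, covering only positions 3 to 5 as described in this finding; the function name, parameters, and cap values are hypothetical and do not come from `risk-controls.md`.

```python
def run_decision_stage(quantity, position_size_cap, concentration_cap, affordable_quantity):
    """Return (final quantity or None on reject, trace of controls evaluated)."""
    trace = ["PositionSize"]
    quantity = min(quantity, position_size_cap)      # position 3: may reduce
    while True:
        trace.append("BuyingPower")                  # position 4: cannot reduce
        if quantity > affordable_quantity:
            return None, trace                       # reject outright
        trace.append("Concentration")                # position 5: may reduce
        reduced = min(quantity, concentration_cap)
        if reduced == quantity:
            return quantity, trace                   # nothing left to reduce
        quantity = reduced                           # re-enter at BuyingPower


final_quantity, trace = run_decision_stage(
    quantity=500, position_size_cap=300, concentration_cap=200, affordable_quantity=1000
)
print(final_quantity, trace)
```

In this simplified model PositionSize appears once in the trace while BuyingPower and Concentration each appear twice, which is the asymmetry Opus flagged. Whether that asymmetry matters in practice depends on details of the real specification that this sketch does not capture.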
**Claude Sonnet unique findings (not in either other model):**
- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
exceeds buying power, the system can only reject entirely (no mechanism to further
optimize), defeating the purpose of the reduction system for capital-limited users.
(NOTE: this is more of a design limitation than a self-contradiction, but the
framing — that the reduction system's purpose is undermined by buying power's
inability to reduce — is a legitimate coherence observation.)
**Quality assessment:**
- **GPT-5** produced the most findings (6) with the broadest coverage across the
prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
genuinely insightful — it's a fundamental design-level contradiction (a signal-level
control that REQUIRES decision-level context). The settings contradiction and
audit logging inconsistency are also solid. Every finding points to two specific
textual statements that are incompatible. Severity ratings appeared better
calibrated than Opus's (1 Critical, 3 High, 2 Medium, versus Opus's 2 Critical for similar findings).
- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
neither other model caught: the diagram/table order reversal for signal controls.
This is a concrete, verifiable error (not a design tension — a literal mistake in
the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
version of the same core issue, exploring the implications for "smaller quantity
wins" semantics. However, Opus found fewer total issues and missed the
settings contradiction and audit logging inconsistency.
- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
power dead-end observation is unique and shows genuine reasoning about the reduction
system's limitations. However, it is closer to an observation that the design
cannot achieve its stated goal than to a strict self-contradiction. Sonnet's other
findings overlap with the common ground. Quality is solid, but the scope is narrower.
**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
suggests that the relative performance on coherence checking depends on the
DOCUMENT'S structure, not on a fixed model advantage:
- **failure-modes.md** (383 lines): A complex multi-process system with many
stated invariants across failure states, supervision trees, and recovery paths.
Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
This plays to Opus's strength (finding design tensions between subsystems).
- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
where one statement directly conflicts with another. This plays to GPT-5's
strength (systematic verification of claims against stated mechanisms).
The difference: Opus excels when contradictions are EMERGENT (arise from composing
multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
statements in the document say incompatible things). Risk-controls.md has more
explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
context" vs NoShortSales, the audit "always" vs write point "only on reject").
**Model performance depends on CONTRADICTION TYPE:**
| Contradiction type | Best model | Example |
|---|---|---|
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |
**Revised conclusion on Finding #15's open question:**
"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
**No.** The ordering depends on the document's contradiction density and type.
Documents rich in emergent design tensions favor Opus. Documents with explicit
specification errors favor GPT-5. The task type (coherence checking) doesn't have
a fixed model winner — it depends on what KIND of incoherences the document contains.
**Practical implication:** Continue running both models for coherence checking. Their
strengths are complementary even within the same task type. GPT-5 catches things you
can point to in the spec and say "these two sentences conflict." Opus catches things
where you need to reason about the implications of multiple mechanisms interacting.
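
Running both models implies a merge step afterwards. The sketch below shows one way to combine two result sets and see which findings only one model reported. It assumes findings have already been mapped to shared labels by a human reviewer; the labels are shorthand for findings described in this document, and `merge_findings` is a hypothetical helper.

```python
from collections import defaultdict


def merge_findings(findings_by_model):
    """Map each finding label to the set of models that reported it."""
    merged = defaultdict(set)
    for model, labels in findings_by_model.items():
        for label in labels:
            merged[label].add(model)
    return dict(merged)


# Shorthand labels for three findings discussed above (illustrative subset).
reported = {
    "GPT-5": {"reentry-vs-ordering-claim", "noshortsales-needs-portfolio"},
    "Opus": {"reentry-vs-ordering-claim", "diagram-vs-table-order"},
}

merged = merge_findings(reported)
unique = {label: models for label, models in merged.items() if len(models) == 1}
print(sorted(unique))
```

The labelling step is where the real effort lies: two models rarely phrase the same contradiction the same way, so matching findings is a judgment call rather than a string comparison.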
## Open Questions
- Does GPT's advantage in finding inconsistencies extend to logical
inconsistencies in arguments? One data point (verdict mismatches) — need more.
- What's the optimal task granularity for GPT analytical review? "Whole PR" is
too big. Is "one hypothesis" right, or can we batch?
- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a
well-structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
model aces it when the biased text is presented without noise. The original
result was about noise elimination, not model capability.
- **NEW:** Does adding a narrow bias-check question to a rich PR review
context recover the detection that broad review misses? (Signal-to-noise
confirmation test)
- ~~How does reasoning_effort affect analytical quality? Only tested default so
far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
identical reasoning tokens (~4K) and per-finding depth. The parameter
may primarily affect verifiable-answer tasks, not exploration. Task framing
remains the dominant quality lever.
- Can we design a systematic "analytical review checklist" that leverages each
model's strengths?
- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
excels at design-tension identification. How does Sonnet compare on the
same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
**ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
(17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
genuine component-interaction reasoning. Opus still wins on design-tension
identification specifically.
- How do the models compare on research synthesis tasks (our #381 rewrite)?
We'll find out during the actual rewrite.
- ~~Does the reasoning-token advantage scale with document complexity? Test
with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
of GPT-4.1 regardless of document complexity. Reasoning tokens enable
exhaustive exploration independent of input difficulty.
- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
Different blind spots, different strengths. GPT-5 reasons deeper into
implementation mechanics (breadth + technical depth). Opus reasons wider
about system context and design tensions (insight density). They're
complementary, not competing. Run both on important architecture docs.
- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
(bias detection, gap-finding) or is it specific to assumption-finding on
complex documents? Need to test Sonnet on simpler docs and different question
types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
transfer to concurrency reasoning. It dropped from 85% of GPT-5
(assumption-finding) to ~58% (race condition identification). Task type matters more
than we thought. Still untested: gap-finding, bias detection for Sonnet.
- **NEW:** What other analytical tasks require sequential/temporal reasoning
(like race condition identification) vs pattern-matching reasoning (like
assumption-finding)? Building a task taxonomy would help assign models
correctly.
- **NEW:** What explains Sonnet taking slightly longer than Opus (106s vs 105s) in the
experiment that raised this question, despite normally being the faster model (as in
Finding #27: 15s vs 41s)? Is it the document length, or does Sonnet's internal
reasoning scale with complexity similarly to Opus?
- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
depth at ~50% cost/time. Better than non-reasoning models, not as
exhaustive as GPT-5.
- ~~**NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
exposes both; worth testing whether the newer versions regress on
analytical tasks.~~ **ANSWERED (Finding #16):** see the 4.5 vs 4.6 entry below.
- ~~Would running GPT-5 Mini + Sonnet together (different axes)
approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
high-stakes due to unique domain-knowledge findings in the missing 29%.
- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
hold across other documents? The inversion (Opus finding more than GPT-5)
was striking — need to confirm it wasn't document-specific.~~
**ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
ones. No fixed ordering for this task type.
- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
validates) worth the extra cost vs just running Opus alone? Need to test
whether GPT-5 actually catches Opus false-positives or just agrees.
- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
**ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
4.5 for coverage, 4.6 for actionability.
- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
types? Spec completeness may favor exhaustiveness; would coherence checking
or race condition analysis show the same pattern?
- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6)
cost-effective vs just running GPT-5? Need to compare the UNION of their findings
against GPT-5's output for overlap analysis.
- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
transfer to other policy documents? It uniquely identified that the cooldown
mechanism creates a GUARANTEED safe window that strategies could systematically
exploit — this is a higher-order security insight. Worth testing whether Opus
consistently finds "adversarial opportunity" framings that other models miss.
- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
other documents with this prompt? Or was user-pipeline-lifecycle.md
particularly verification-heavy? Test invariant violation paths on a simpler
document.
- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
reduce its selectivity and produce MORE invariant violations at lower
precision? Or would it just pad with non-violations? The extreme selectivity
may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
findings.
- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
Findings #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
to "show your reasoning and withdraw findings you cannot fully verify"?
- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
Sonnet → composition failures. Does this three-way differentiation hold on other
documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
documents? The financial/regulatory domain has a large gap between syntactic and
semantic correctness. Would the same prompt on an infrastructure/systems doc produce
equally differentiated findings, or would it collapse into assumption-finding?
- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
commissions) — is this promptable on other models? Could we explicitly ask GPT-5
"what should this system compute but doesn't" and get similar results?~~
**ANSWERED (Finding #26):** Yes. All three models find regulatory gaps and
missing features when explicitly prompted. Opus's unique behavior in #22 was
an emergent DEFAULT tendency, not a capability the other models lack. Prompt
framing dominates model personality.
- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
docs (fills vs events, position ownership, signal persistence). Does running
this analysis across MORE document pairs (e.g., domain readmes vs implementation
docs, design docs vs plan docs) yield additional real inconsistencies? Could
become a systematic documentation maintenance tool.
- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
on cross-document consistency. Is this because cross-doc contradictions are
easy to verify once spotted (reducing GPT-5's verification advantage)? Or
because boundary reasoning (Opus's strength) is the primary skill needed?
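
Several of the questions above (combined coverage of cheaper models, union-versus-GPT-5 overlap) reduce to the same set arithmetic. The following is a minimal sketch of that calculation. The finding labels and cost numbers are placeholders, not data from any experiment, and `coverage` and `relative_cost` are hypothetical helpers.

```python
def coverage(candidate_findings, reference_findings):
    """Fraction of the reference model's findings that the candidate set also contains."""
    if not reference_findings:
        raise ValueError("reference set must not be empty")
    return len(candidate_findings & reference_findings) / len(reference_findings)


def relative_cost(candidate_costs, reference_cost):
    """Combined cost of the candidate runs as a fraction of the reference run's cost."""
    return sum(candidate_costs) / reference_cost


# Placeholder data only: labels and costs are invented for illustration.
reference = {"a", "b", "c", "d"}
model_x = {"a", "b", "e"}
model_y = {"b", "c"}

union = model_x | model_y
print(coverage(union, reference))        # share of reference findings recovered
print(relative_cost([1.0, 2.0], 10.0))   # combined cost relative to the reference
```

Coverage measured this way ignores findings the cheaper models report that the reference model missed (label `e` here), so it is worth tracking that set separately.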
## Methodology Notes
- Internet opinions about models are overwhelmingly about coding. Don't
extrapolate to analytical work without testing.
- "Just because someone says it on the internet doesn't make it right." —
Aaron, 2026-04-26. Opinions need context. Track our own evidence.
- Absence of published methodology for a use case is itself a finding.
- Each finding needs: date, task, **how we used it** (context shape, task
framing, what info the model had/didn't have), what happened, takeaway.
No unsupported generalizations.
- **Context dimensions to track:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, research notes,
project conventions, nothing)
- Whether the model had access to tools or just text
- Whether the task was explicit step-by-step or open-ended
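
The required fields and context dimensions listed above could be captured in a small record type so that every finding is logged with the same shape. This is a sketch of one possible schema, not an existing tool; all class and field names are assumptions, and the example values are one reading of Finding #27's description.

```python
import datetime
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Focus(Enum):
    BROAD = "broad"        # "review this"
    FOCUSED = "focused"    # "answer this specific question"


@dataclass
class ContextShape:
    richness: Richness
    focus: Focus
    context_kinds: list[str]   # e.g. diff, full files, issue text, research notes
    had_tools: bool
    step_by_step: bool         # explicit step-by-step vs open-ended


@dataclass
class Finding:
    number: int
    run_date: datetime.date
    task: str
    how_used: ContextShape
    what_happened: str
    takeaway: str


# Example values are an interpretation of Finding #27, not authoritative metadata.
finding_27 = Finding(
    number=27,
    run_date=datetime.date(2026, 5, 5),
    task="Identify internal design incoherences in risk-controls.md",
    how_used=ContextShape(
        richness=Richness.MINIMAL,      # no project context beyond the document
        focus=Focus.FOCUSED,
        context_kinds=["full files"],
        had_tools=False,
        step_by_step=False,             # structured categories, but open-ended search
    ),
    what_happened="GPT-5 found 6 incoherences, Opus 5, Sonnet 4.",
    takeaway="Coherence-checking winner depends on contradiction type.",
)
print(finding_27.how_used)
```

Making the context fields mandatory means a finding cannot be recorded without stating how the model was used, which is the point of the methodology note.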