model-research/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md

# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific

**Date:** 2026-05-05
**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
ordering rationale, fail-closed claims, and audit logging.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
each finding to reference specific contradictory parts. No tools, no project context beyond
the document itself.

| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
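
The raw totals in the table can be normalized per finding to compare cost rather than volume. The sketch below only re-derives numbers already in the table; the dictionary layout and the `per_finding` helper are our own illustration, and it assumes the "Output tokens" column is the right denominator (for GPT-5 that column appears to include reasoning tokens).

```python
# Figures copied from the results table above.
runs = {
    "GPT-5": {"seconds": 112, "output_tokens": 8231, "findings": 6},
    "Claude Opus 4.6": {"seconds": 41, "output_tokens": 1858, "findings": 5},
    "Claude Sonnet 4.6": {"seconds": 15, "output_tokens": 699, "findings": 4},
}


def per_finding(run):
    """Return (seconds per finding, output tokens per finding) for one run."""
    return (
        run["seconds"] / run["findings"],
        run["output_tokens"] / run["findings"],
    )


for model, run in runs.items():
    secs, toks = per_finding(run)
    print(f"{model}: {secs:.1f}s and {toks:.0f} output tokens per finding")
```

Per finding, GPT-5 comes out at roughly 18.7s and 1,372 tokens, Opus at 8.2s and 372 tokens, and Sonnet at about 3.8s and 175 tokens. This says nothing about finding quality, which the assessment below covers separately.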

**What they found — common ground (all 3 identified):**
- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
  earlier controls" (all three flagged this as the most obvious contradiction —
  Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
  which IS an earlier control)
- Ordering rationale's categorization of buying power/concentration is internally
  confused (the doc labels both as "quantity-sensitive checks" that run after
  reducing controls, but concentration IS a reducing control at position 5 while
  buying power at position 4 sits between the two reducing controls)
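
The first common-ground contradiction can be stated as a mechanical check. This is a minimal sketch, not gargoyle code: the `Control` type and `reentry_violations` helper are hypothetical, and only the two positions cited above (BuyingPower at 4, Concentration at 5) are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    name: str
    position: int      # order in the decision-stage pipeline, as cited above
    can_reduce: bool


# Only the controls whose positions this finding cites; the rest are omitted.
BUYING_POWER = Control("BuyingPower", 4, can_reduce=False)
CONCENTRATION = Control("Concentration", 5, can_reduce=True)


def reentry_violations(reentries):
    """Return (reducer, target) name pairs where a reducing control
    re-enters the pipeline at a control with a lower position."""
    return [
        (reducer.name, target.name)
        for reducer, target in reentries
        if reducer.can_reduce and target.position < reducer.position
    ]


# The spec's stated re-entry: Concentration reduces, then re-enters at BuyingPower.
print(reentry_violations([(CONCENTRATION, BUYING_POWER)]))
```

A non-empty result means the stated invariant ("reducing controls never re-enter earlier controls") does not hold for the re-entry path the spec itself describes.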

**GPT-5 unique findings (not in either Claude model):**
- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
  of current positions. The doc explicitly states signals are evaluated "in isolation"
  with "no portfolio context — only the signal itself and user settings" — but checking
  whether the user holds a position IS portfolio context. This is a genuine design
  tension: either SignalRisk has hidden portfolio access (violating isolation) or
  NoShortSales can't actually work as specified.
- Settings "fall through to system defaults" vs "Settings cache miss → reject."
  Two incompatible instructions for the same condition (missing settings).
- "Universal fail-closed" with "only exception is order rate window" contradicted
  by the Failure Modes table, which shows buying power as a second exception.
  "Conservative estimate; may over-reject" is NOT rejection: it is a different
  failure mode from both fail-closed and the single documented exception.
- Audit model says "every control evaluation produces an audit entry regardless of
  outcome" but the signal-stage write point only describes writing on rejection.
  Passing signals produce no documented audit entry at the signal stage.
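
The NoShortSales tension is easiest to see as a function signature. The sketch below is an illustration under assumptions: `Signal`, `UserSettings`, the `allow_short_sales` flag, and both function names are hypothetical and do not come from risk-controls.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    symbol: str
    side: str          # "buy" or "sell"
    quantity: int


@dataclass(frozen=True)
class UserSettings:
    allow_short_sales: bool   # hypothetical setting name


def no_short_sales_isolated(signal, settings):
    """Signal-stage shape per the spec: only the signal and user settings."""
    if settings.allow_short_sales or signal.side != "sell":
        return "pass"
    # Without the held quantity, a closing sell and a short sale look identical.
    return "undecidable"


def no_short_sales_with_positions(signal, settings, held_quantity):
    """The same check once portfolio context (held quantity) is supplied."""
    if settings.allow_short_sales or signal.side != "sell":
        return "pass"
    return "pass" if signal.quantity <= held_quantity else "reject"
```

Any working version of the check needs something like `held_quantity`, which is exactly the portfolio context the spec says the signal stage does not have.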

**Claude Opus unique findings (not in either other model):**
- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
  (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
  → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
  (VERIFIED: this is correct — the diagram does show a different order.)
- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
  Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
  is never re-checked against Concentration-reduced quantity because re-entry starts
  at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
  differently than the linear model described in Reduction Semantics.
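
The diagram/table reversal is the kind of error a trivial diff catches. A minimal sketch, assuming the table's "PerTradeStop" is shorthand for the diagram's "PerTradeStopLoss" (the names are normalized here for comparison); `order_mismatches` is a hypothetical helper.

```python
# Signal-stage control order as reported in Opus's finding.
table_order = ["MarketHours", "PerTradeStopLoss", "NoShortSales"]
diagram_order = ["MarketHours", "NoShortSales", "PerTradeStopLoss"]


def order_mismatches(table, diagram):
    """Return (position, table_name, diagram_name) for each 1-based
    position where the two representations disagree."""
    return [
        (i, t, d)
        for i, (t, d) in enumerate(zip(table, diagram), start=1)
        if t != d
    ]


# Same set of controls, different sequence: positions 2 and 3 are swapped.
assert sorted(table_order) == sorted(diagram_order)
print(order_mismatches(table_order, diagram_order))
```

A check like this could run over any doc that states the same ordering twice, which would make this class of finding cheap to catch without a model.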

**Claude Sonnet unique findings (not in either other model):**
- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
  exceeds buying power, the system can only reject entirely (no mechanism to further
  optimize), defeating the purpose of the reduction system for capital-limited users.
  (NOTE: this is more of a design limitation than a self-contradiction, but the
  framing — that the reduction system's purpose is undermined by buying power's
  inability to reduce — is a legitimate coherence observation.)

**Quality assessment:**
- **GPT-5** produced the most findings (6) with the broadest coverage across the
  prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
  insightful: it's a fundamental design-level contradiction (a signal-level
  control that REQUIRES decision-level context). The settings contradiction and
  audit logging inconsistency are also solid. Every finding points to two specific
  textual statements that are incompatible. Severity ratings were more conservative
  than Opus's (1 Critical, 3 High, 2 Medium vs Opus's 2 Critical for similar findings).
- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
  neither other model caught: the diagram/table order reversal for signal controls.
  This is a concrete, verifiable error (not a design tension — a literal mistake in
  the document). The re-entry loop analysis (the fifth item in Opus's output) goes deeper than GPT-5's
  version of the same core issue, exploring the implications for "smaller quantity
  wins" semantics. However, Opus found fewer total issues and missed the
  settings contradiction and audit logging inconsistency.
- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
  power dead-end observation is unique and shows genuine reasoning about the reduction
  system's limitations. However, it's more a "this design can't achieve its stated
  goal" observation than a strict self-contradiction. Sonnet's other findings overlap
  with the common ground. Quality is solid, but the scope is narrower.

**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
suggests that the relative performance on coherence checking depends on the
DOCUMENT'S structure, not on a fixed model advantage:

- **failure-modes.md** (383 lines): A complex multi-process system with many
  stated invariants across failure states, supervision trees, and recovery paths.
  Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
  This plays to Opus's strength (finding design tensions between subsystems).
- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
  ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
  where one statement directly conflicts with another. This plays to GPT-5's
  strength (systematic verification of claims against stated mechanisms).

The difference: Opus excels when contradictions are EMERGENT (arise from composing
multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
statements in the document say incompatible things). Risk-controls.md has more
explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
context" vs NoShortSales, the audit "always" vs write point "only on reject").

**Model performance depends on CONTRADICTION TYPE:**
| Contradiction type | Best model | Example |
|---|---|---|
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |

**Revised conclusion on Finding #15's open question:**
"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
**No.** On the two documents tested so far, the ordering depends on the document's
contradiction density and type.
Documents rich in emergent design tensions favor Opus. Documents with explicit
specification errors favor GPT-5. The task type (coherence checking) doesn't have
a fixed model winner — it depends on what KIND of incoherences the document contains.

**Practical implication:** Continue running both models for coherence checking. Their
strengths are complementary even within the same task type. GPT-5 catches things you
can point to in the spec and say "these two sentences conflict." Opus catches things
where you need to reason about the implications of multiple mechanisms interacting.

## Open Questions

- Does GPT's advantage in finding inconsistencies extend to logical
  inconsistencies in arguments? One data point (verdict mismatches) — need more.
- What's the optimal task granularity for GPT analytical review? "Whole PR" is
  too big. Is "one hypothesis" right, or can we batch?
- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a
  well-structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
  model aces it when the biased text is presented without noise. The original
  result was about noise elimination, not model capability.
- **NEW:** Does adding a narrow bias-check question to a rich PR review
  context recover the detection that broad review misses? (Signal-to-noise
  confirmation test)
- ~~How does reasoning_effort affect analytical quality? Only tested default so
  far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
  analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
  identical reasoning tokens (~4K) and per-finding depth. The parameter
  may primarily affect verifiable-answer tasks, not exploration. Task framing
  remains the dominant quality lever.
- Can we design a systematic "analytical review checklist" that leverages each
  model's strengths?
- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
  excels at design-tension identification. How does Sonnet compare on the
  same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
  **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
  (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
  non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
  genuine component-interaction reasoning. Opus still wins on design-tension
  identification specifically.
- How do the models compare on research synthesis tasks (our #381 rewrite)?
  We'll find out during the actual rewrite.
- ~~Does the reasoning-token advantage scale with document complexity? Test
  with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
  The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
  of GPT-4.1 regardless of document complexity. Reasoning tokens enable
  exhaustive exploration independent of input difficulty.
- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
  performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
  Different blind spots, different strengths. GPT-5 reasons deeper into
  implementation mechanics (breadth + technical depth). Opus reasons wider
  about system context and design tensions (insight density). They're
  complementary, not competing. Run both on important architecture docs.
- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
  (bias detection, gap-finding) or is it specific to assumption-finding on
  complex documents? Need to test Sonnet on simpler docs and different question
  types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
  transfer to concurrency reasoning. It dropped from 85% of GPT-5
  (assumption-finding) to ~58% (race condition identification). Task type matters more
  than we thought. Still untested: gap-finding, bias detection for Sonnet.
- **NEW:** What other analytical tasks require sequential/temporal reasoning
  (like race condition identification) vs pattern-matching reasoning (like
  assumption-finding)? Building a task taxonomy would help assign models
  correctly.
- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
  105s) despite normally being the faster model? Is it the document length, or
  does Sonnet's internal reasoning scale with complexity similarly to Opus?
- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
  cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
  middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
  depth at ~50% cost/time. Better than non-reasoning models, not as
  exhaustive as GPT-5.
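- ~~How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
  exposes both; worth testing whether the newer versions regress on
  analytical tasks.~~ **ANSWERED (Finding #16):** Genuine tradeoff, not a
  regression; see the 4.5 vs 4.6 entry later in this list.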
- ~~Would running GPT-5 Mini + Sonnet together (different axes)
  approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
  71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
  high-stakes due to unique domain-knowledge findings in the missing 29%.
- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
  hold across other documents? The inversion (Opus finding more than GPT-5)
  was striking — need to confirm it wasn't document-specific.~~
  **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
  GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
  excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
  ones. No fixed ordering for this task type.
- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
  validates) worth the extra cost vs just running Opus alone? Need to test
  whether GPT-5 actually catches Opus false-positives or just agrees.
- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
  **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
  more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
  4.5 for coverage, 4.6 for actionability.
- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
  types? Spec completeness may favor exhaustiveness; would coherence checking
  or race condition analysis show the same pattern?
- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6)
  cost-effective vs just running GPT-5? Need to compare the UNION of their findings
  against GPT-5's output for overlap analysis.
- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
  transfer to other policy documents? It uniquely identified that the cooldown
  mechanism creates a GUARANTEED safe window that strategies could systematically
  exploit — this is a higher-order security insight. Worth testing whether Opus
  consistently finds "adversarial opportunity" framings that other models miss.
- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
  reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
  other documents with this prompt? Or was user-pipeline-lifecycle.md
  particularly verification-heavy? Test invariant violation paths on a simpler
  document.
- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
  reduce its selectivity and produce MORE invariant violations at lower
  precision? Or would it just pad with non-violations? The extreme selectivity
  may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
  findings.
- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
  Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
  to "show your reasoning and withdraw findings you cannot fully verify"?
- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
  analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
  Sonnet → composition failures. Does this three-way differentiation hold on other
  documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
  documents? The financial/regulatory domain has a large gap between syntactic and
  semantic correctness. Would the same prompt on an infrastructure/systems doc produce
  equally differentiated findings, or would it collapse into assumption-finding?
- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
  commissions) — is this promptable on other models? Could we explicitly ask GPT-5
  "what should this system compute but doesn't" and get similar results?~~
  **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
  missing features when explicitly prompted. Opus's unique behavior in #22 was
  an emergent DEFAULT tendency, not a capability. Prompt framing dominates
  model personality.

- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
  docs (fills vs events, position ownership, signal persistence). Does running
  this analysis across MORE document pairs (e.g., domain readmes vs implementation
  docs, design docs vs plan docs) yield additional real inconsistencies? Could
  become a systematic documentation maintenance tool.
- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
  on cross-document consistency. Is this because cross-doc contradictions are
  easy to verify once spotted (reducing GPT-5's verification advantage)? Or
  because boundary reasoning (Opus's strength) is the primary skill needed?

## Methodology Notes

- Internet opinions about models are overwhelmingly about coding. Don't
  extrapolate to analytical work without testing.
- "Just because someone says it on the internet doesn't make it right." —
  Aaron, 2026-04-26. Opinions need context. Track our own evidence.
- Absence of published methodology for a use case is itself a finding.
- Each finding needs: date, task, **how we used it** (context shape, task
  framing, what info the model had/didn't have), what happened, takeaway.
  No unsupported generalizations.
- **Context dimensions to track:**
  - Rich vs minimal (how much background info)
  - Broad vs focused ("review this" vs "answer this specific question")
  - What kind of context (diff, full files, issue text, research notes,
    project conventions, nothing)
  - Whether the model had access to tools or just text
  - Whether the task was explicit step-by-step or open-ended