# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific

**Date:** 2026-05-05
**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines),
a pre-trade risk control specification covering two evaluation stages, reduction semantics,
ordering rationale, fail-closed claims, and audit logging.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
each finding to reference specific contradictory parts. No tools, no project context beyond
the document itself.

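For anyone re-running this setup, the prompt shape described above can be sketched as follows. The five category names come from this write-up; the surrounding wording and the `build_prompt` helper are hypothetical and are not the prompt that was actually sent.

```python
# Hypothetical sketch: only the five category names are taken from the write-up.
CATEGORIES = [
    "safety properties not enforced",
    "ordering/sequencing contradictions",
    "reduction semantics conflicts",
    "fail-closed claims vs actual behavior",
    "cross-stage inconsistencies",
]


def build_prompt(document_text: str) -> str:
    """Combine the focused question, the categories, and the full document text."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1))
    return (
        "Identify internal design incoherences in the document below.\n"
        f"Check these categories:\n{numbered}\n"
        "For each finding, reference the specific parts that contradict each other.\n\n"
        "--- DOCUMENT ---\n"
        f"{document_text}"
    )


prompt = build_prompt("(full text of risk-controls.md goes here)")
```

The same string would go to every model unchanged, so differences in output can be attributed to the model and not to the framing.
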
| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |

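As a quick sanity check on the table, the sketch below recomputes a couple of per-finding ratios and confirms that each row's severity counts add up to its total. The numbers are copied from the table; the field names and the choice of ratios are illustrative assumptions, and "output tokens" is used exactly as reported.

```python
# Values copied from the results table; field names are hypothetical.
runs = {
    "GPT-5": {"seconds": 112, "output_tokens": 8231, "found": 6,
              "critical": 1, "high": 3, "medium": 2},
    "Claude Opus 4.6": {"seconds": 41, "output_tokens": 1858, "found": 5,
                        "critical": 2, "high": 2, "medium": 1},
    "Claude Sonnet 4.6": {"seconds": 15, "output_tokens": 699, "found": 4,
                          "critical": 1, "high": 2, "medium": 1},
}


def derived(row):
    """Per-finding ratios, after checking the severity breakdown matches the total."""
    if row["critical"] + row["high"] + row["medium"] != row["found"]:
        raise ValueError("severity counts do not sum to the findings total")
    return {
        "seconds_per_finding": round(row["seconds"] / row["found"], 1),
        "tokens_per_finding": round(row["output_tokens"] / row["found"]),
    }


metrics = {model: derived(row) for model, row in runs.items()}
for model, m in metrics.items():
    print(f"{model}: {m['seconds_per_finding']}s, "
          f"{m['tokens_per_finding']} output tokens per finding")
```

These ratios say nothing about finding quality; they only make the speed and verbosity gap between the models easier to compare across experiments.
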
**What they found: common ground (all 3 identified):**
- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
  earlier controls". All three flagged this as the most obvious contradiction:
  Concentration at position 5 reduces, then re-enters at BuyingPower at position 4,
  which IS an earlier control.
- Ordering rationale's categorization of buying power/concentration is internally
  confused (the doc labels both as "quantity-sensitive checks" that run after
  reducing controls, but concentration IS a reducing control at position 5 while
  buying power at position 4 sits between the two reducing controls)

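The re-entry contradiction is mechanical enough to check in code. The sketch below uses only the two positions reported above (BuyingPower at 4, Concentration at 5); the dictionary layout and the `reentry_violations` helper are assumptions made for illustration, not part of gargoyle.

```python
# Positions as reported in the findings; the other controls are omitted.
CONTROL_POSITION = {"BuyingPower": 4, "Concentration": 5}

# Which control each reducing control re-enters at after reducing quantity.
REENTRY_POINTS = {"Concentration": "BuyingPower"}


def reentry_violations(reentry_points, positions):
    """Return (reducer, target) pairs where re-entry lands on an earlier control."""
    return [
        (reducer, target)
        for reducer, target in reentry_points.items()
        if positions[target] < positions[reducer]
    ]


violations = reentry_violations(REENTRY_POINTS, CONTROL_POSITION)
print(violations)
```

Any non-empty result means the stated invariant ("reducing controls never re-enter earlier controls") and the documented re-entry point cannot both hold.
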
**GPT-5 unique findings (not in either Claude model):**
- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
  of current positions. The doc explicitly states signals are evaluated "in isolation"
  with "no portfolio context — only the signal itself and user settings", but checking
  whether the user holds a position IS portfolio context. This is a genuine design
  tension: either SignalRisk has hidden portfolio access (violating isolation) or
  NoShortSales can't actually work as specified.
- Settings "fall through to system defaults" vs "Settings cache miss → reject."
  Two incompatible instructions for the same condition (missing settings).
- "Universal fail-closed" with "only exception is order rate window" is contradicted
  by the Failure Modes table, which shows buying power as another exception
  ("Conservative estimate; may over-reject" is NOT rejection; it is a different
  failure mode from both fail-closed and the single documented exception).
- Audit model says "every control evaluation produces an audit entry regardless of
  outcome" but the signal-stage write point only describes writing on rejection.
  Passing signals produce no documented audit entry at the signal stage.

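The NoShortSales finding can be made concrete with a small type-level sketch. `SignalContext` models what the spec says a signal-stage control may see (the signal and user settings). The field names and the exact short-sale rule are hypothetical; the only point taken from the finding is that the check needs a held quantity that is absent from that context.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SignalContext:
    """Signal-stage inputs per the spec: the signal itself plus user settings."""
    symbol: str
    side: str  # "buy" or "sell"
    quantity: int
    settings: dict = field(default_factory=dict)


def no_short_sales(ctx: SignalContext, held_quantity: int) -> bool:
    """Pass unless a sell exceeds what the user holds (hypothetical rule).

    held_quantity is portfolio state: it cannot be derived from ctx alone,
    which is the contradiction GPT-5 pointed out.
    """
    if ctx.side != "sell":
        return True
    return ctx.quantity <= held_quantity


signal = SignalContext(symbol="ACME", side="sell", quantity=100)
print(no_short_sales(signal, held_quantity=0))
print(no_short_sales(signal, held_quantity=100))
```

The same signal passes or fails depending on an input outside `SignalContext`, so the control cannot be evaluated "in isolation" as the document claims.
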
**Claude Opus unique findings (not in either other model):**
- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
  (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
  → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
  (VERIFIED: this is correct; the diagram does show a different order.)
- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
  Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
  is never re-checked against Concentration-reduced quantity because re-entry starts
  at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
  differently than the linear model described in Reduction Semantics.

**Claude Sonnet unique findings (not in either other model):**
- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
  exceeds buying power, the system can only reject entirely (no mechanism to further
  optimize), defeating the purpose of the reduction system for capital-limited users.
  (NOTE: this is more of a design limitation than a self-contradiction, but the
  framing, that buying power's inability to reduce undermines the reduction system's
  purpose, is a legitimate coherence observation.)

**Quality assessment:**
- **GPT-5** produced the most findings (6) with the broadest coverage across the
  prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
  insightful: it's a fundamental design-level contradiction (a signal-level
  control that REQUIRES decision-level context). The settings contradiction and
  audit logging inconsistency are also solid. Every finding points to two specific
  textual statements that are incompatible. Severity ratings were more conservative
  (1 Critical, 3 High, 2 Medium, compared to Opus's 2 Critical for similar findings).
- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
  neither other model caught: the diagram/table order reversal for signal controls.
  This is a concrete, verifiable error (not a design tension but a literal mistake in
  the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
  version of the same core issue, exploring the implications for "smaller quantity
  wins" semantics. However, Opus found fewer total issues and missed the
  settings contradiction and audit logging inconsistency.
- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
  power dead-end observation is unique and shows genuine reasoning about the reduction
  system's limitations. However, it's more of a "this design can't achieve its stated
  goal" than a strict self-contradiction. Sonnet's other findings overlap with the
  common ground. Quality is solid but narrower in scope.

**Key insight: Finding #15's Opus > GPT-5 result was document-specific:**
In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
suggests that the relative performance on coherence checking depends on the
DOCUMENT'S structure, not on a fixed model advantage:

- **failure-modes.md** (383 lines): A complex multi-process system with many
  stated invariants across failure states, supervision trees, and recovery paths.
  Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
  This plays to Opus's strength (finding design tensions between subsystems).
- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
  ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
  where one statement directly conflicts with another. This plays to GPT-5's
  strength (systematic verification of claims against stated mechanisms).

The difference: Opus excels when contradictions are EMERGENT (arise from composing
multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
statements in the document say incompatible things). Risk-controls.md has more
explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
context" vs NoShortSales, the audit "always" vs write point "only on reject").

**Model performance depends on CONTRADICTION TYPE:**

| Contradiction type | Best model | Example |
|---|---|---|
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |

**Revised conclusion on Finding #15's open question:**
"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
**No.** The ordering depends on the document's contradiction density and type.
Documents rich in emergent design tensions favor Opus. Documents with explicit
specification errors favor GPT-5. The task type (coherence checking) doesn't have
a fixed model winner; it depends on what KIND of incoherences the document contains.

**Practical implication:** Continue running both models for coherence checking. Their
strengths are complementary even within the same task type. GPT-5 catches things you
can point to in the spec and say "these two sentences conflict." Opus catches things
where you need to reason about the implications of multiple mechanisms interacting.

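Running both models means merging two finding lists afterwards. One minimal way to do that is sketched below, assuming each finding has already been reduced by hand to a short key; the keys and the `merge_findings` helper are hypothetical labels for findings described in this write-up, not an existing tool.

```python
def merge_findings(by_model):
    """Union findings across models, recording which models reported each one."""
    merged = {}
    for model, findings in by_model.items():
        for key in findings:
            merged.setdefault(key, []).append(model)
    return merged


# Hand-assigned keys for a few of the findings discussed above.
by_model = {
    "GPT-5": [
        "reentry-at-buying-power",
        "no-portfolio-context-vs-no-short-sales",
        "settings-fallback-vs-reject",
    ],
    "Claude Opus 4.6": [
        "reentry-at-buying-power",
        "diagram-vs-table-order",
    ],
}

merged = merge_findings(by_model)
# Findings that only one model reported are the reason to run both.
unique = {key: models[0] for key, models in merged.items() if len(models) == 1}
print(unique)
```

The hard part in practice is deciding when two differently worded findings are the same issue; this sketch assumes that judgment has already been made when the keys were assigned.
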
## Open Questions

- Does GPT's advantage in finding inconsistencies extend to logical
  inconsistencies in arguments? One data point (verdict mismatches); need more.
- What's the optimal task granularity for GPT analytical review? "Whole PR" is
  too big. Is "one hypothesis" right, or can we batch?
- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a
  well-structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
  model aces it when the biased text is presented without noise. The original
  result was about noise elimination, not model capability.
- **NEW:** Does adding a narrow bias-check question to a rich PR review
  context recover the detection that broad review misses? (Signal-to-noise
  confirmation test)
- ~~How does reasoning_effort affect analytical quality? Only tested default so
  far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
  analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
  identical reasoning tokens (~4K) and per-finding depth. The parameter
  may primarily affect verifiable-answer tasks, not exploration. Task framing
  remains the dominant quality lever.
- Can we design a systematic "analytical review checklist" that leverages each
  model's strengths?
- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
  excels at design-tension identification. How does Sonnet compare on the
  same task? (Sonnet is non-reasoning but fast; would it match GPT-4.1?)~~
  **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
  (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
  non-reasoning model in the GPT-4.1 sense; it occupies a middle tier with
  genuine component-interaction reasoning. Opus still wins on design-tension
  identification specifically.
- How do the models compare on research synthesis tasks (our #381 rewrite)?
  We'll find out during the actual rewrite.
- ~~Does the reasoning-token advantage scale with document complexity? Test
  with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
  The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
  of GPT-4.1 regardless of document complexity. Reasoning tokens enable
  exhaustive exploration independent of input difficulty.
- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
  performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
  Different blind spots, different strengths. GPT-5 reasons deeper into
  implementation mechanics (breadth + technical depth). Opus reasons wider
  about system context and design tensions (insight density). They're
  complementary, not competing. Run both on important architecture docs.
- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
  (bias detection, gap-finding) or is it specific to assumption-finding on
  complex documents? Need to test Sonnet on simpler docs and different question
  types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
  transfer to concurrency reasoning. It dropped from 85% of GPT-5
  (assumption-finding) to ~58% (race condition identification). Task type matters
  more than we thought. Still untested: gap-finding, bias detection for Sonnet.
- **NEW:** What other analytical tasks require sequential/temporal reasoning
  (like race condition identification) vs pattern-matching reasoning (like
  assumption-finding)? Building a task taxonomy would help assign models
  correctly.
- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
  105s) despite normally being the faster model? Is it the document length, or
  does Sonnet's internal reasoning scale with complexity similarly to Opus?
- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
  cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
  middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
  depth at ~50% cost/time. Better than non-reasoning models, not as
  exhaustive as GPT-5.
- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
  exposes both; worth testing whether the newer versions regress on
  analytical tasks.
- ~~Would running GPT-5 Mini + Sonnet together (different axes)
  approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
  71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
  high-stakes due to unique domain-knowledge findings in the missing 29%.
- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
  hold across other documents? The inversion (Opus finding more than GPT-5)
  was striking; need to confirm it wasn't document-specific.~~
  **ANSWERED (Finding #27):** No, it was document-specific. On risk-controls.md,
  GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
  excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
  ones. No fixed ordering for this task type.
- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
  validates) worth the extra cost vs just running Opus alone? Need to test
  whether GPT-5 actually catches Opus false-positives or just agrees.
- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
  **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
  more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
  4.5 for coverage, 4.6 for actionability.
- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
  types? Spec completeness may favor exhaustiveness; would coherence checking
  or race condition analysis show the same pattern?
- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6)
  cost-effective vs just running GPT-5? Need to compare the UNION of their findings
  against GPT-5's output for overlap analysis.
- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
  transfer to other policy documents? It uniquely identified that the cooldown
  mechanism creates a GUARANTEED safe window that strategies could systematically
  exploit; this is a higher-order security insight. Worth testing whether Opus
  consistently finds "adversarial opportunity" framings that other models miss.
- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
  reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
  other documents with this prompt? Or was user-pipeline-lifecycle.md
  particularly verification-heavy? Test invariant violation paths on a simpler
  document.
- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
  reduce its selectivity and produce MORE invariant violations at lower
  precision? Or would it just pad with non-violations? The extreme selectivity
  may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
  findings.
- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
  Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
  to "show your reasoning and withdraw findings you cannot fully verify"?
- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
  analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
  Sonnet → composition failures. Does this three-way differentiation hold on other
  documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
  documents? The financial/regulatory domain has a large gap between syntactic and
  semantic correctness. Would the same prompt on an infrastructure/systems doc produce
  equally differentiated findings, or would it collapse into assumption-finding?
- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
  commissions): is this promptable on other models? Could we explicitly ask GPT-5
  "what should this system compute but doesn't" and get similar results?~~
  **ANSWERED (Finding #26):** YES. All three models find regulatory gaps and
  missing features when explicitly prompted. Opus's unique behavior in #22 was
  an emergent DEFAULT tendency, not a capability. Prompt framing dominates
  model personality.

- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
  docs (fills vs events, position ownership, signal persistence). Does running
  this analysis across MORE document pairs (e.g., domain readmes vs implementation
  docs, design docs vs plan docs) yield additional real inconsistencies? Could
  become a systematic documentation maintenance tool.
- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
  on cross-document consistency. Is this because cross-doc contradictions are
  easy to verify once spotted (reducing GPT-5's verification advantage)? Or
  because boundary reasoning (Opus's strength) is the primary skill needed?

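Several questions above turn on coverage: how much of one model's output a cheaper combination reproduces (the GPT-5 Mini + Sonnet pairing from Finding #19, the Sonnet 4.5 + 4.6 union from Finding #16). A minimal sketch of that overlap calculation follows. The `coverage` helper is hypothetical and the finding IDs are placeholders, not results from any of the experiments.

```python
def coverage(reference, *candidates):
    """Fraction of the reference findings that appear in the union of the candidates."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference set is empty")
    union = set().union(*candidates)
    return len(ref & union) / len(ref)


# Placeholder IDs only; real runs would use hand-matched finding keys.
reference_model = {"a", "b", "c", "d"}
cheap_model_1 = {"a", "b"}
cheap_model_2 = {"b", "c", "x"}  # "x" is found by the cheap model but not the reference

ratio = coverage(reference_model, cheap_model_1, cheap_model_2)
print(f"{ratio:.0%} of reference findings covered")
```

Coverage alone hides findings such as "x" that the reference model missed, so an overlap analysis would want to report those separately.
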
## Methodology Notes

- Internet opinions about models are overwhelmingly about coding. Don't
  extrapolate to analytical work without testing.
- "Just because someone says it on the internet doesn't make it right."
  (Aaron, 2026-04-26.) Opinions need context. Track our own evidence.
- Absence of published methodology for a use case is itself a finding.
- Each finding needs: date, task, **how we used it** (context shape, task
  framing, what info the model had/didn't have), what happened, takeaway.
  No unsupported generalizations.
- **Context dimensions to track:**
  - Rich vs minimal (how much background info)
  - Broad vs focused ("review this" vs "answer this specific question")
  - What kind of context (diff, full files, issue text, research notes,
    project conventions, nothing)
  - Whether the model had access to tools or just text
  - Whether the task was explicit step-by-step or open-ended
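
The required fields and context dimensions listed above could be captured as a small record type, so that an incomplete finding is caught before it is filed. This is only a sketch: the class names, field names, and the example values for Finding 27 are my reading of the write-up, not an existing schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContextShape:
    """The context dimensions from the methodology notes."""
    richness: str          # "rich" or "minimal"
    focus: str             # "broad" or "focused"
    context_kinds: list    # e.g. ["full files"], ["diff", "issue text"], or []
    had_tools: bool
    step_by_step: bool     # False means open-ended


@dataclass
class Finding:
    number: int
    run_date: date
    task: str
    context: ContextShape
    what_happened: str
    takeaway: str

    def missing_fields(self):
        """Names of required text fields that were left blank."""
        required = ("task", "what_happened", "takeaway")
        return [name for name in required if not getattr(self, name).strip()]


# Example values are an interpretation of Finding 27, not authoritative metadata.
finding_27 = Finding(
    number=27,
    run_date=date(2026, 5, 5),
    task="Identify internal design incoherences in risk-controls.md",
    context=ContextShape("minimal", "focused", ["full document text"],
                         had_tools=False, step_by_step=False),
    what_happened="GPT-5 found 6 incoherences, Opus 5, Sonnet 4",
    takeaway="",  # deliberately blank to show the check
)
print(finding_27.missing_fields())
```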