Security-review-bot persona caught HTTPS enforcement inconsistency in write-path
methods (PostReview, DeleteReview, RequestReviewer) that generalist reviewers missed.
Issue fixed within 30 minutes, all reviewers re-approved. Validates specialized
security persona value in multi-model pipeline.
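One way to prevent this kind of inconsistency is to route every write-path
call through a single guard. The sketch below is illustrative only:
`require_https`, `ReviewClient`, the URL paths, and the snake_case method
names are hypothetical stand-ins for PostReview, DeleteReview, and
RequestReviewer, not the project's actual code.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Return url unchanged if it uses HTTPS; raise ValueError otherwise."""
    if urlparse(url).scheme.lower() != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url!r}")
    return url


class ReviewClient:
    """Hypothetical client; every write-path method uses the same guard."""

    def __init__(self, base_url: str):
        self.base_url = require_https(base_url).rstrip("/")

    def _write_url(self, path: str) -> str:
        # Single choke point, so no write method can skip the check.
        return require_https(f"{self.base_url}/{path.lstrip('/')}")

    def post_review(self, pr: int) -> str:
        return self._write_url(f"pulls/{pr}/reviews")

    def delete_review(self, pr: int, review_id: int) -> str:
        return self._write_url(f"pulls/{pr}/reviews/{review_id}")

    def request_reviewer(self, pr: int) -> str:
        return self._write_url(f"pulls/{pr}/requested_reviewers")
```

Centralizing the check means a reviewer only has to confirm that each write
method goes through `_write_url`, rather than auditing each method's own
scheme handling.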
New task type testing distributed systems consistency analysis.
GPT-5 found 18 issues (with 4,416 reasoning tokens), Sonnet found 13.
Key insight: distributed systems reasoning benefits from extended
reasoning. Sonnet reached 72% of GPT-5's count (13 of 18), which falls
between race condition analysis (58%) and assumption-finding (85%).
GPT-5 outperforms Sonnet on cross-context integration analysis:
- GPT-5: 10 findings (4 Critical) in 191s with 7,744 reasoning tokens
- Sonnet: 7 findings (1 Critical) in 23s
Key insight: Cross-context contract verification benefits from extended
reasoning (contrast to Finding #67 where Sonnet was better at inter-doc
contradictions). Flow tracing and subscription gap detection require
systematic verification that GPT-5's exhaustive style excels at.
Discovered actual spec gaps in gargoyle domain model (FillReceived
missing fields, no liquidation instruction event, Risk not subscribing
to LotOpened for PDT, etc.).
New analytical lens: security boundary analysis — identifying where trust
assumptions cross component boundaries in exploitable ways.
Document: gargoyle system-overview.md (323 lines)
Models: Claude Opus (15 findings), Claude Sonnet 4 (10 findings)
Key finding: Opus identified that transient signals (a performance design
choice) create a structural security vulnerability — malicious strategies
can probe risk limits without leaving any audit trail.
This experiment establishes security boundary analysis as a distinct,
viable analytical task type for architecture review.
- GPT-5 finds 20 gaps with exhaustive FINRA cross-referencing
- Opus finds 12 gaps focusing on operational compliance requirements
- Sonnet provides fast screening (16s) with 9 gaps
- Key insight: regulatory gap analysis benefits from reasoning tokens
- New lens for compliance audits of financial software
New analytical lens testing concurrent write hazards against event-catalog.md.
GPT-5 found 19 hazards, Opus 11, Sonnet 12. Union ~27 distinct findings.
Key insight: this lens is high-value for event sourcing docs because replay
correctness depends on ordering invariants that are often implicit.
Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines)
for implementation specification gaps.
Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review
New lens distinct from hidden assumptions and race conditions:
focuses on ambiguity of intent, not risks.
New analytical lens examining implicit assumptions about broker APIs,
market behavior, network conditions, and timing.
Document: gargoyle's feeds-and-instruments.md (115 lines)
Models: GPT-5 (24 findings), Opus (15), Sonnet (15)
Key insight: External system assumptions benefit more from reasoning
depth than internal architecture analysis. GPT-5's exhaustive coverage
of broker implementation details and network failure modes justifies
the token cost for critical external interfaces.
Union of all models finds ~30 distinct assumptions vs ~24 max single model.
Tested on signal-lifecycle.md (111 lines). Results:
- GPT-5: 17 gaps (7,744 reasoning tokens)
- Opus: 11 gaps (design-level focus)
- Sonnet: 8 gaps (fastest, protocol-level)
Key insight: Union of all models (~26 gaps) far exceeds any single
model (max 17). Only 5 gaps found by all three — highly differentiated
outputs make multi-model runs valuable for interface documents.
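The union, common, and per-model unique counts above can be computed with
plain set operations once findings are deduplicated into shared IDs. This
is a minimal sketch: `overlap_summary` is a hypothetical helper, the IDs
are placeholders rather than the experiment's findings, and the
deduplication step itself (deciding that two models found "the same" gap)
is assumed to have been done by hand.

```python
def overlap_summary(findings: dict[str, set[str]]) -> dict[str, object]:
    """Summarize overlap between per-model sets of deduplicated finding IDs."""
    sets = list(findings.values())
    union = set().union(*sets)
    common = set.intersection(*sets)
    unique = {}
    for name, ids in findings.items():
        # Everything found by any other model.
        others = set().union(*(s for n, s in findings.items() if n != name))
        unique[name] = len(ids - others)
    return {
        "union": len(union),
        "common": len(common),
        "largest_single": max(len(s) for s in sets),
        "unique": unique,
    }


# Placeholder IDs only; these are not the experiment's actual findings.
toy = {
    "gpt5": {"f1", "f2", "f3", "f4"},
    "opus": {"f2", "f3", "f5"},
    "sonnet": {"f3", "f6"},
}
summary = overlap_summary(toy)
print(summary)
```

A large gap between `union` and `largest_single`, combined with a small
`common`, is the pattern this entry describes as highly differentiated
outputs.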
New task type: analyzing prescriptive/specification documents for completeness.
- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)
Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example)
that neither Claude model caught. Opus unique in tracing PubSub collision
to actual routing failure scenario.
Task taxonomy: convention gap analysis follows same pattern as architecture
docs - GPT-5 for coverage, Opus for consequences.
GPT-5 finds 16 gaps, Opus 11, Sonnet 9. GPT-5 excels at exhaustive
state space enumeration; Opus finds convention-vs-enforcement gaps;
Sonnet adequate but less thorough.
Key insight: state machine completeness is a GPT-5 sweet spot due to
reasoning tokens enabling systematic combinatorial coverage.
Tests a novel lens for event-sourced architectures: can all state be
reconstructed from documented events alone?
Key findings:
- GPT-5 brings external domain knowledge (broker APIs, compliance)
- Opus reasons through failure modes systematically (crash boundaries)
- Sonnet does rapid structural analysis (missing pieces)
21 unique findings across three models with only 5 in common.
Each model's reasoning style reveals different issue categories.
New pattern: event flow analysis exposes model reasoning styles
that gap-finding and contradiction detection don't surface.
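A coarse first pass at the reconstruction question can be mechanized by
comparing state fields against documented event payloads. The sketch below
is an assumption-heavy illustration: `unreconstructable_fields` and all
field names are hypothetical, and a field appearing in a payload does not
prove the state is derivable, so this only flags the obvious misses.

```python
def unreconstructable_fields(
    state_fields: set[str], event_payloads: dict[str, set[str]]
) -> set[str]:
    """State fields that no documented event payload carries."""
    documented = set().union(*event_payloads.values())
    return set(state_fields) - documented


# Hypothetical field names, not taken from the gargoyle docs.
state = {"order_id", "filled_qty", "fee"}
events = {
    "FillReceived": {"order_id", "filled_qty"},
    "LotOpened": {"order_id"},
}
missing = unreconstructable_fields(state, events)
print(sorted(missing))  # ['fee']
```

Anything the check reports still needs a human or model pass to decide
whether the field is derived, stored elsewhere, or genuinely lost on replay.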
Tests a novel lens asking 'what cognitive/procedural load does this design
place on operators?' Applied to escalation-policy.md with GPT-5, Sonnet 4.6,
and Opus 4.6.
Key findings:
- All models found that the manual liquidate→restrict step has no procedure (CRITICAL)
- GPT-5 excels at exhaustive enumeration (21+ findings, config gaps)
- Opus identifies systemic vulnerabilities (monitor crash → silent unsafe state)
- Sonnet fills procedural gaps (authorization, timeouts)
Recommendation: Opus alone for time-constrained analysis, GPT-5 + Opus for
thoroughness. They find different types of issues with minimal overlap.
New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)
Key insight: "unstated constraints" finds what's IMPLIED but not stated,
distinct from gaps, race conditions, or ambiguities. GPT-5 is best for
catching CRITICAL data integrity constraints; Opus for state machine
implementation details.
Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls.
Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed.
New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
New analytical lens applied to signal-lifecycle.md (111 lines).
All three models (GPT-5, Opus, Sonnet) found 7-9 findings each with
70% at Critical/High severity. Key insight: concurrency analysis
rewards compositional temporal reasoning over enumeration breadth,
narrowing the gap between models compared to other lenses.
Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash
survival contradiction), Sonnet (Signal Risk audit gap after dispatch).
New analytical lens (adversarial/offensive security) tested on gargoyle's
signal audit log spec. GPT-5 most exhaustive (25), Opus deepest individual
attack narratives (14), Sonnet most creative meta-attacks (11).
Adversarial lens is ~2.5x more productive than defensive lenses on
comparable docs. All three models converged on same root cause (trust model).
New analytical lens: where systems rely on single mechanisms rather than
layered defenses. GPT-5 finds exploitable SSRF; Opus identifies trust-root
collapse (session+sudo share SECRET_KEY_BASE); Sonnet is surface-level.
Tests a novel analytical lens on aggregation.md (239 lines): 'what happens
when many correct instances operate simultaneously in a correlated environment?'
Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback
loops. Opus (8 findings, 93s) finds the most consequential single findings
(stop-loss defeated by temporal composition, crash-opportunity correlation).
Sonnet 4.0 (6 findings, 32s) too abstract for this task.
Key insight: This lens finds DEPLOYMENT bugs invisible at design time -
the gap between 'correct by construction' and 'correct in production'.
Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).
Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis — requires reasoning about
negation (what ISN'T instrumented vs what production needs).
Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.
Key insight: Sonnet can detect explicit contradictions (Finding #28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.
Refines Finding #28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?
Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).
Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
New analytical lens: failure propagation chains. Opus matched GPT-5's count
(10 findings each) while using 2.2x fewer tokens. Overview docs are ideal
for this lens. Sonnet produced zero unique insights.
New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.
Results:
- GPT-5: 12 findings (137s, 10,688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding
Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.
Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
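The reconciliation problem can be shown with a toy example. This is a
sketch under stated assumptions, not the system's code: `rederive_pnl`,
`reconciles`, the `proceeds` and `cost_basis` fields, and the broker figure
are all hypothetical. It shows that a check which re-derives from LotClosed
passes when LotClosed is wrong, while a check against an independent source
does not.

```python
def rederive_pnl(lot_closed_events: list[dict]) -> float:
    """Realized P&L recomputed from LotClosed events (hypothetical fields)."""
    return sum(e["proceeds"] - e["cost_basis"] for e in lot_closed_events)


def reconciles(reference_pnl: float, lot_closed_events: list[dict]) -> bool:
    """True when the reference figure matches the re-derived one."""
    return abs(reference_pnl - rederive_pnl(lot_closed_events)) < 0.005


# Assume the true cost basis was 900 but LotClosed recorded 800.
corrupted = [{"proceeds": 1000.0, "cost_basis": 800.0}]

# Ledger built from the same events: the check passes despite the error.
ledger_pnl = rederive_pnl(corrupted)
self_check = reconciles(ledger_pnl, corrupted)

# Figure from an independent source (hypothetical broker statement).
broker_pnl = 1000.0 - 900.0
independent_check = reconciles(broker_pnl, corrupted)

print(self_check, independent_check)  # True False
```

The self-check can only ever confirm that the ledger agrees with its own
input, which is why an independent reference is needed to detect the error.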
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?
Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.
Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.
Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters are
a new failure mode when financial-domain analysis runs through enterprise
proxies.
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.
Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.
Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
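An ordering inversion between two severity ladders that share vocabulary
can be checked mechanically. The sketch below is illustrative:
`inverted_pairs` is a hypothetical helper, and the two ladders are made-up
examples (including the "monitor" rung), not the contents of kill-switch.md
or escalation-policy.md.

```python
from itertools import combinations


def inverted_pairs(
    ladder_a: list[str], ladder_b: list[str]
) -> list[tuple[str, str]]:
    """Pairs of shared terms whose relative order differs between ladders.

    Each ladder lists terms from least to most severe.
    """
    rank_a = {term: i for i, term in enumerate(ladder_a)}
    rank_b = {term: i for i, term in enumerate(ladder_b)}
    shared = [term for term in ladder_a if term in rank_b]
    return [
        (x, y)
        for x, y in combinations(shared, 2)
        if (rank_a[x] < rank_a[y]) != (rank_b[x] < rank_b[y])
    ]


# Made-up ladders for illustration only.
doc_a = ["monitor", "restrict", "liquidate"]
doc_b = ["monitor", "liquidate", "restrict"]
conflicts = inverted_pairs(doc_a, doc_b)
print(conflicts)  # [('restrict', 'liquidate')]
```

A non-empty result means the documents use the same words with conflicting
severity order, which is the class of inconsistency this entry flags as
most dangerous.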
New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.
Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.
Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.
Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md.
Key results:
- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements — GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value
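The coverage figure follows directly from the finding counts above. A
minimal check, where `coverage_gain` is a hypothetical helper and "coverage"
is assumed to mean unique finding count relative to a single model:

```python
def coverage_gain(ensemble: int, single: int) -> float:
    """Percentage increase in unique findings over a single model."""
    return 100.0 * (ensemble - single) / single


gain_vs_gpt5 = coverage_gain(56, 43)
gain_vs_opus = coverage_gain(56, 28)
print(round(gain_vs_gpt5), round(gain_vs_opus))  # 30 100
```

The 30% figure is therefore the gain over GPT-5 alone; measured against
Opus alone, the ensemble doubles the count.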
New analytical lens: where data propagation creates stale, contradictory,
or misleading views for different consumers.
Key result: highest model convergence (45% common ground) due to document's
explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus
identifies strategy attribution dimension. Sonnet adds zero unique value.
Two-model stack (GPT-5 + Opus) optimal.
New analytical lens: observability gap analysis — asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'
Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value
Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.