58 Commits

Author SHA1 Message Date
Rodin 5426026908 docs: regenerate weekly report (2026-05-18) 2026-05-18 16:10:16 +00:00
Rodin afbc013e2e finding #80: config-a/b dispatcher malfunction detected in multi-model review pipeline (3.5x cost overage) 2026-05-15 08:37:01 +00:00
Rodin 8e64f8f012 finding(79): multi-model security review catches HTTPS bypass in GitHub client (PR #131)
Security-review-bot persona caught HTTPS enforcement inconsistency in write-path
methods (PostReview, DeleteReview, RequestReviewer) that generalist reviewers missed.
Issue fixed within 30 minutes, all reviewers re-approved. Validates specialized
security persona value in multi-model pipeline.
2026-05-14 21:56:58 +00:00
Rodin 643a804bdf finding #79: multi-model security review catches CGN + proxy-assisted SSRF gaps
- Python ipaddress.is_private/is_reserved misses CGN (100.64.0.0/10)
- Go http.DefaultTransport clone retains ProxyFromEnvironment (proxy-assisted SSRF)
- Both gaps survived Sonnet+GPT approval; only security-reviewer blocked merge
- Lesson: dedicated security reviewer role required for auth/network security code
2026-05-14 12:24:54 +00:00
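The CGN gap described in 643a804bdf can be reproduced with the Python standard library alone. The sketch below is illustrative and is not the code from the reviewed PR: the helper names `naive_is_internal` and `hardened_is_internal` are hypothetical, and the exact set of extra checks in the hardened version is an assumption about what a fix might look like.

```python
import ipaddress

# RFC 6598 shared address space, used for carrier-grade NAT (CGN).
CGN_NETWORK = ipaddress.ip_network("100.64.0.0/10")


def naive_is_internal(host: str) -> bool:
    """Hypothetical check mirroring the gap: only is_private / is_reserved."""
    addr = ipaddress.ip_address(host)
    return addr.is_private or addr.is_reserved


def hardened_is_internal(host: str) -> bool:
    """Hypothetical hardened check: also rejects CGN, loopback, link-local."""
    addr = ipaddress.ip_address(host)
    # Membership is only meaningful for IPv4 addresses against an IPv4 network.
    in_cgn = addr.version == 4 and addr in CGN_NETWORK
    return (
        addr.is_private
        or addr.is_reserved
        or addr.is_loopback
        or addr.is_link_local
        or in_cgn
    )


if __name__ == "__main__":
    for host in ("100.64.0.1", "10.0.0.1", "8.8.8.8"):
        print(host, naive_is_internal(host), hardened_is_internal(host))
```

The naive check lets `100.64.0.1` through because `ipaddress` treats the shared address space as neither private nor reserved. An explicit network membership test closes that gap without changing how ordinary public addresses are handled.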
aweiker f9523d46b1 data: dev loop effectiveness analysis (2026-05-14) 2026-05-14 06:54:42 +00:00
Rodin 828da269c0 docs: regenerate weekly report (2026-05-11) 2026-05-11 09:04:35 -07:00
Rodin 2ca8c974f3 Add finding #25: Data integrity analysis on audit-log.md
New task type testing distributed systems consistency analysis.
GPT-5 found 18 issues (with 4,416 reasoning tokens), Sonnet found 13.
Key insight: distributed systems reasoning benefits from extended
reasoning - Sonnet at 72% of GPT-5 count, similar to race condition
analysis (58%) and worse than assumption-finding (85%).
2026-05-11 08:49:32 -07:00
Rodin ac55ecdb98 Finding 28: Regulatory compliance analysis on wash sale tracking
- GPT-5 most comprehensive on IRS-specific rules (18 findings, 9600 reasoning tokens)
- Sonnet fast first-pass (14 findings in 25s)
- Opus high-density actionable (11 findings with clear remediation)
- New insight: domain expertise tasks favor GPT-5 reasoning depth
- Updated model assignment for compliance review workflow
2026-05-11 00:29:12 -07:00
Rodin 2b10595bff Finding #68: Cross-context contract coherence analysis
GPT-5 outperforms Sonnet on cross-context integration analysis:
- GPT-5: 10 findings (4 Critical) in 191s with 7,744 reasoning tokens
- Sonnet: 7 findings (1 Critical) in 23s

Key insight: Cross-context contract verification benefits from extended
reasoning (contrast to Finding #67 where Sonnet was better at inter-doc
contradictions). Flow tracing and subscription gap detection require
systematic verification that GPT-5's exhaustive style excels at.

Discovered actual spec gaps in gargoyle domain model (FillReceived
missing fields, no liquidation instruction event, Risk not subscribing
to LotOpened for PDT, etc.).
2026-05-10 21:47:27 -07:00
Rodin 0f43934cb8 Add finding #67: Inter-document contradiction analysis
Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis:
- More findings (5 vs 4)
- Faster (14s vs 136s)
- Better severity calibration (3 Critical vs 0 Critical)

Key insight: GPT-5's extended reasoning (9.7K tokens) doesn't pay off
for this task type. Inter-document comparison requires parallel pattern
matching, not serial verification.
2026-05-10 18:32:45 -07:00
Rodin bb50188e63 Add Finding #30: Boundary violation analysis on context README 2026-05-10 17:28:54 -07:00
Rodin 8adf09b3fb Add security boundary analysis experiment (2026-05-10)
New analytical lens: security boundary analysis — identifying where trust
assumptions cross component boundaries in exploitable ways.

Document: gargoyle system-overview.md (323 lines)
Models: Claude Opus (15 findings), Claude Sonnet 4 (10 findings)

Key finding: Opus identified that transient signals (a performance design
choice) create a structural security vulnerability — malicious strategies
can probe risk limits without leaving any audit trail.

This experiment establishes security boundary analysis as a distinct,
viable analytical task type for architecture review.
2026-05-10 16:05:45 -07:00
Rodin c1eb97ed6c Add finding #65: Temporal correctness analysis (new lens) 2026-05-10 14:50:56 -07:00
Rodin 398f33aad4 Finding #64: Regulatory implementation gap analysis
- GPT-5 finds 20 gaps with exhaustive FINRA cross-referencing
- Opus finds 12 gaps focusing on operational compliance requirements
- Sonnet provides fast screening (16s) with 9 gaps
- Key insight: regulatory gap analysis benefits from reasoning tokens
- New lens for compliance audits of financial software
2026-05-10 12:30:20 -07:00
Rodin 7c64712c2f Add finding #65: concurrent write hazards in event sourcing
New analytical lens testing concurrent write hazards against event-catalog.md.
GPT-5 found 19 hazards, Opus 11, Sonnet 12. Union ~27 distinct findings.
Key insight: this lens is high-value for event sourcing docs because replay
correctness depends on ordering invariants that are often implicit.
2026-05-10 11:48:41 -07:00
Rodin 873591877d Finding #64: Specification gap analysis - new analytical lens
Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines)
for implementation specification gaps.

Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review

New lens distinct from hidden assumptions and race conditions:
focuses on ambiguity of intent, not risks.
2026-05-10 11:10:33 -07:00
Rodin b9036401c2 Finding #63: External System Assumptions Analysis
New analytical lens examining implicit assumptions about broker APIs,
market behavior, network conditions, and timing.

Document: gargoyle's feeds-and-instruments.md (115 lines)
Models: GPT-5 (24 findings), Opus (15), Sonnet (15)

Key insight: External system assumptions benefit more from reasoning
depth than internal architecture analysis. GPT-5's exhaustive coverage
of broker implementation details and network failure modes justifies
the token cost for critical external interfaces.

Union of all models finds ~30 distinct assumptions vs ~24 max single model.
2026-05-10 02:27:53 -07:00
Rodin ce4801e8a3 Add Finding #62: Boundary contract analysis (new analytical lens)
Tested on signal-lifecycle.md (111 lines). Results:
- GPT-5: 17 gaps (7,744 reasoning tokens)
- Opus: 11 gaps (design-level focus)
- Sonnet: 8 gaps (fastest, protocol-level)

Key insight: Union of all models (~26 gaps) far exceeds any single
model (max 17). Only 5 gaps found by all three — highly differentiated
outputs make multi-model runs valuable for interface documents.
2026-05-09 23:35:36 -07:00
Rodin 9f15047892 Finding #62: Data integrity analysis on signal-lifecycle.md
New lens: data integrity analysis — testing whether data survives flow
through systems with correct identity, values, and auditability.

Key insights:
- GPT-5 excels at audit/forensics gaps (idempotency, ordering, provenance)
- Opus finds semantic violations (phantom group, quantity mutation ambiguity)
- Sonnet identifies operational races (restart scenarios)

Document: gargoyle signal-lifecycle.md (102 lines)
Models: GPT-5 (13 findings), Opus (6+), Sonnet (6)
2026-05-09 22:26:46 -07:00
Rodin 527e71a1d6 finding #61: regulatory completeness analysis lens 2026-05-09 20:06:51 -07:00
Rodin af950a33d1 Add finding #60: Counterfactual event ordering analysis
New analytical lens testing what breaks when events arrive out of order.
- GPT-5: 30 findings via exhaustive permutation enumeration
- Opus: 19 findings with operational consequence tracing
- Sonnet: 17 findings with regulatory compliance focus

Key insight: GPT-5's reasoning enables systematic swap/delay/duplicate/
interleave enumeration. Sonnet uniquely connects to regulatory requirements.
2026-05-09 18:28:40 -07:00
Rodin 2988f31fc3 finding 59: convention rule gap analysis
New task type: analyzing prescriptive/specification documents for completeness.

- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example)
that neither Claude model caught. Opus unique in tracing PubSub collision
to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture
docs - GPT-5 for coverage, Opus for consequences.
2026-05-09 17:28:53 -07:00
Rodin 98304604ac Finding 58: State machine completeness analysis on kill-switch.md
GPT-5 finds 16 gaps, Opus 11, Sonnet 9. GPT-5 excels at exhaustive
state space enumeration; Opus finds convention-vs-enforcement gaps;
Sonnet adequate but less thorough.

Key insight: state machine completeness is a GPT-5 sweet spot due to
reasoning tokens enabling systematic combinatorial coverage.
2026-05-09 15:06:32 -07:00
Rodin faaa6d9c11 Finding #57: Event flow correctness analysis - new analytical lens
Tests a novel lens for event-sourced architectures: can all state be
reconstructed from documented events alone?

Key findings:
- GPT-5 brings external domain knowledge (broker APIs, compliance)
- Opus reasons through failure modes systematically (crash boundaries)
- Sonnet does rapid structural analysis (missing pieces)

21 unique findings across three models with only 5 in common.
Each model's reasoning style reveals different issue categories.

New pattern: event flow analysis exposes model reasoning styles
that gap-finding and contradiction detection don't surface.
2026-05-09 13:29:58 -07:00
claw b7acbd7662 Finding #56: Operational burden analysis - new analytical lens
Tests a novel lens asking 'what cognitive/procedural load does this design
place on operators?' Applied to escalation-policy.md with GPT-5, Sonnet 4.6,
and Opus 4.6.

Key findings:
- All models identified manual liquidate→restrict has no procedure (CRITICAL)
- GPT-5 excels at exhaustive enumeration (21+ findings, config gaps)
- Opus identifies systemic vulnerabilities (monitor crash → silent unsafe state)
- Sonnet fills procedural gaps (authorization, timeouts)

Recommendation: Opus alone for time-constrained analysis, GPT-5 + Opus for
thoroughness. They find different types of issues with minimal overlap.
2026-05-09 06:46:29 -07:00
claw 5ee0cff3a8 experiment #55: state reconstruction correctness — new analytical lens
Tests whether event stream supports time-travel queries, retroactive truth,
and audit reconstruction. All three models found CRITICAL issues in a document
that passed previous lenses. Key insight: distinguishes telemetry events from
sourcing events.

Document: gargoyle corporate-actions.md
Models: GPT-5, Sonnet 4.6, Opus 4.6
Lens validation: model-stable, domain-independent, architecturally significant
2026-05-09 05:06:45 -07:00
claw bb191e48d1 finding #54: wash sale multi-model design review analysis
Compared Sonnet 4, GPT-5, and Opus 4.6 on gargoyle wash-sale-tracking.md.
Key insights:
- GPT-5 requires 16K+ completion tokens (4K for reasoning alone)
- Opus caught holding period add-vs-backdate correctness issue
- Sonnet caught Section 1259 (constructive sales) that others missed
- All three missed multi-broker 1099-B reconciliation problem
- Multi-model review justified for tax compliance domains
2026-05-09 03:35:12 -07:00
Rodin 9d0a94bd68 Add finding #53: unstated constraint detection on state machines
New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated,
distinct from gaps, race conditions, or ambiguities. GPT-5 is best for
catching CRITICAL data integrity constraints; Opus for state machine
implementation details.
2026-05-08 23:47:51 -07:00
claw c1ca8cfe46 finding #52: degraded-mode propagation analysis (new lens)
Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls.
Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed.
New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
2026-05-08 14:29:29 -07:00
claw 79915d1dc3 finding 51: implementation ambiguity analysis — new analytical lens 2026-05-08 12:46:32 -07:00
claw 5b8f8caf8c finding 50: concurrency and race condition analysis lens
New analytical lens applied to signal-lifecycle.md (111 lines).
All three models (GPT-5, Opus, Sonnet) found 7-9 findings each with
70% at Critical/High severity. Key insight: concurrency analysis
rewards compositional temporal reasoning over enumeration breadth,
narrowing the gap between models compared to other lenses.

Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash
survival contradiction), Sonnet (Signal Risk audit gap after dispatch).
2026-05-08 11:06:06 -07:00
claw 7ca01f0cbf finding 49: adversarial evasion/tampering analysis on audit-log.md
New analytical lens (adversarial/offensive security) tested on gargoyle's
signal audit log spec. GPT-5 most exhaustive (25), Opus deepest individual
attack narratives (14), Sonnet most creative meta-attacks (11).

Adversarial lens is ~2.5x more productive than defensive lenses on
comparable docs. All three models converged on same root cause (trust model).
2026-05-08 09:09:58 -07:00
claw 8f9e87415e finding #48: defense-in-depth gap analysis on auth-and-credentials.md
New analytical lens: where systems rely on single mechanisms rather than
layered defenses. GPT-5 finds exploitable SSRF; Opus identifies trust-root
collapse (session+sudo share SECRET_KEY_BASE); Sonnet is surface-level.
2026-05-08 03:47:09 -07:00
claw f3266ccc13 finding 47: emergent behavior from rule composition - new analytical lens
Tests a novel analytical lens on aggregation.md (239 lines): 'what happens
when many correct instances operate simultaneously in a correlated environment?'

Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback
loops. Opus (8 findings, 93s) finds the most consequential single findings
(stop-loss defeated by temporal composition, crash-opportunity correlation).
Sonnet 4.0 (6 findings, 32s) too abstract for this task.

Key insight: This lens finds DEPLOYMENT bugs invisible at design time -
the gap between 'correct by construction' and 'correct in production'.
2026-05-08 02:06:25 -07:00
claw b5b5b64a40 finding #46: operational blind spot analysis — new task type
Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).

Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis — requires reasoning about
negation (what ISN'T instrumented vs what production needs).
2026-05-08 00:27:23 -07:00
claw 64fdfebed3 finding 45: operator decision support gap analysis — new task type 2026-05-07 21:07:46 -07:00
claw e127e7b0c7 finding 44: cross-doc consistency on closely related docs
Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.

Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.

Refines Finding 28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
2026-05-07 19:27:20 -07:00
claw d8a030d9e9 finding #43: opus + narrow framing for contradiction detection
Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
2026-05-07 16:05:14 -07:00
claw 296bb21eb7 finding #42: failure propagation chain analysis on system-overview.md
New analytical lens: failure propagation chains. Opus matched GPT-5's count
(10 findings each) while using 2.2x fewer tokens. Overview docs are ideal
for this lens. Sonnet produced zero unique insights.
2026-05-07 14:28:26 -07:00
claw a65c471a3f finding 41: temporal ordering dependency analysis on kill-switch.md
New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.
2026-05-07 12:47:03 -07:00
claw bb0c0d564b Finding #40: Silent data corruption paths in financial accounting
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
2026-05-07 11:09:58 -07:00
claw 0c632c255a finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
2026-05-07 09:26:08 -07:00
claw d27ce6f5e1 finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test)
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.
2026-05-07 07:47:11 -07:00
claw 58e69e21f8 finding 37: cross-doc consistency on tightly coupled risk docs
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.

Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.

Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
2026-05-07 04:29:23 -07:00
claw c071ffc31f Finding #36: Compositional interface analysis - two-document interface assumptions
New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.

Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.

Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.
2026-05-07 02:48:46 -07:00
claw d8ddbc9861 mark adversarial ensemble question as answered (finding #35) 2026-05-06 21:29:35 -07:00
claw 8338ae3019 finding #35: adversarial ensemble (critique+extend) produces 30% more coverage
Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md.
Key results:
- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements — GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value
2026-05-06 21:29:17 -07:00
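The coverage figure in 8338ae3019 can be checked from the counts in the message. This is a minimal sketch of the arithmetic only. The helper name `relative_gain` is hypothetical, and reading "30% more coverage" as the gain over GPT-5 alone is an assumption.

```python
def relative_gain(ensemble: int, baseline: int) -> float:
    """Percent increase of the ensemble's unique findings over a baseline."""
    return (ensemble - baseline) / baseline * 100


# Counts taken from the commit message above.
gpt5_alone, opus_alone, ensemble = 43, 28, 56

if __name__ == "__main__":
    print(f"gain over GPT-5 alone: {relative_gain(ensemble, gpt5_alone):.1f}%")
    print(f"gain over Opus alone:  {relative_gain(ensemble, opus_alone):.1f}%")
```

Under that reading, 56 findings against 43 is a gain of roughly 30%, consistent with the headline; against Opus alone the ensemble doubles the count.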
Rodin 4a69a99d05 finding #34: information flow hazard analysis on lot-accounting.md
New analytical lens: where data propagation creates stale, contradictory,
or misleading views for different consumers.

Key result: highest model convergence (45% common ground) due to document's
explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus
identifies strategy attribution dimension. Sonnet adds zero unique value.
Two-model stack (GPT-5 + Opus) optimal.
2026-05-06 18:29:06 -07:00
Rodin 20c0bd2492 feat: experiment #33 — observability gap analysis on aggregation.md
New analytical lens: observability gap analysis — asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'

Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value

Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
2026-05-06 11:49:05 -07:00
Rodin 8cfabfdc55 experiment #32: testability analysis — new analytical lens
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.
2026-05-06 10:09:05 -07:00