model-research

Author	SHA1	Message	Date
Rodin	5426026908	docs: regenerate weekly report (2026-05-18)	2026-05-18 16:10:16 +00:00
Rodin	afbc013e2e	finding #80: config-a/b dispatcher malfunction detected in multi-model review pipeline (3.5x cost overage)	2026-05-15 08:37:01 +00:00
Rodin	8e64f8f012	finding(79): multi-model security review catches HTTPS bypass in GitHub client (PR #131) Security-review-bot persona caught HTTPS enforcement inconsistency in write-path methods (PostReview, DeleteReview, RequestReviewer) that generalist reviewers missed. Issue fixed within 30 minutes, all reviewers re-approved. Validates specialized security persona value in multi-model pipeline.	2026-05-14 21:56:58 +00:00
Rodin	643a804bdf	finding #79: multi-model security review catches CGN + proxy-assisted SSRF gaps - Python ipaddress.is_private/is_reserved misses CGN (100.64.0.0/10) - Go http.DefaultTransport clone retains ProxyFromEnvironment (proxy-assisted SSRF) - Both gaps survived Sonnet+GPT approval; only security-reviewer blocked merge - Lesson: dedicated security reviewer role required for auth/network security code	2026-05-14 12:24:54 +00:00
aweiker	f9523d46b1	data: dev loop effectiveness analysis (2026-05-14)	2026-05-14 06:54:42 +00:00
Rodin	828da269c0	docs: regenerate weekly report (2026-05-11)	2026-05-11 09:04:35 -07:00
Rodin	2ca8c974f3	Add finding #25: Data integrity analysis on audit-log.md New task type testing distributed systems consistency analysis. GPT-5 found 18 issues (with 4,416 reasoning tokens), Sonnet found 13. Key insight: distributed systems reasoning benefits from extended reasoning - Sonnet at 72% of GPT-5 count, similar to race condition analysis (58%) and worse than assumption-finding (85%).	2026-05-11 08:49:32 -07:00
Rodin	ac55ecdb98	Finding 28: Regulatory compliance analysis on wash sale tracking - GPT-5 most comprehensive on IRS-specific rules (18 findings, 9600 reasoning tokens) - Sonnet fast first-pass (14 findings in 25s) - Opus high-density actionable (11 findings with clear remediation) - New insight: domain expertise tasks favor GPT-5 reasoning depth - Updated model assignment for compliance review workflow	2026-05-11 00:29:12 -07:00
Rodin	2b10595bff	Finding #68: Cross-context contract coherence analysis GPT-5 outperforms Sonnet on cross-context integration analysis: - GPT-5: 10 findings (4 Critical) in 191s with 7,744 reasoning tokens - Sonnet: 7 findings (1 Critical) in 23s Key insight: Cross-context contract verification benefits from extended reasoning (contrast to Finding #67 where Sonnet was better at inter-doc contradictions). Flow tracing and subscription gap detection require systematic verification that GPT-5's exhaustive style excels at. Discovered actual spec gaps in gargoyle domain model (FillReceived missing fields, no liquidation instruction event, Risk not subscribing to LotOpened for PDT, etc.).	2026-05-10 21:47:27 -07:00
Rodin	0f43934cb8	Add finding #67: Inter-document contradiction analysis Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis: - More findings (5 vs 4) - Faster (14s vs 136s) - Better severity calibration (3 Critical vs 0 Critical) Key insight: GPT-5's extended reasoning (9.7K tokens) doesn't pay off for this task type. Inter-document comparison requires parallel pattern matching, not serial verification.	2026-05-10 18:32:45 -07:00
Rodin	bb50188e63	Add Finding #30: Boundary violation analysis on context README	2026-05-10 17:28:54 -07:00
Rodin	8adf09b3fb	Add security boundary analysis experiment (2026-05-10) New analytical lens: security boundary analysis — identifying where trust assumptions cross component boundaries in exploitable ways. Document: gargoyle system-overview.md (323 lines) Models: Claude Opus (15 findings), Claude Sonnet 4 (10 findings) Key finding: Opus identified that transient signals (a performance design choice) create a structural security vulnerability — malicious strategies can probe risk limits without leaving any audit trail. This experiment establishes security boundary analysis as a distinct, viable analytical task type for architecture review.	2026-05-10 16:05:45 -07:00
Rodin	c1eb97ed6c	Add finding #65: Temporal correctness analysis (new lens)	2026-05-10 14:50:56 -07:00
Rodin	398f33aad4	Finding #64: Regulatory implementation gap analysis - GPT-5 finds 20 gaps with exhaustive FINRA cross-referencing - Opus finds 12 gaps focusing on operational compliance requirements - Sonnet provides fast screening (16s) with 9 gaps - Key insight: regulatory gap analysis benefits from reasoning tokens - New lens for compliance audits of financial software	2026-05-10 12:30:20 -07:00
Rodin	7c64712c2f	Add finding #65: concurrent write hazards in event sourcing New analytical lens testing concurrent write hazards against event-catalog.md. GPT-5 found 19 hazards, Opus 11, Sonnet 12. Union ~27 distinct findings. Key insight: this lens is high-value for event sourcing docs because replay correctness depends on ordering invariants that are often implicit.	2026-05-10 11:48:41 -07:00
Rodin	873591877d	Finding #64: Specification gap analysis - new analytical lens Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines) for implementation specification gaps. Key findings: - Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5) - GPT-5 catches operational/financial edge cases (fees, multi-execution) - Opus catches design-level binding ambiguities - Sonnet too shallow for serious spec review New lens distinct from hidden assumptions and race conditions: focuses on ambiguity of intent, not risks.	2026-05-10 11:10:33 -07:00
Rodin	b9036401c2	Finding #63: External System Assumptions Analysis New analytical lens examining implicit assumptions about broker APIs, market behavior, network conditions, and timing. Document: gargoyle's feeds-and-instruments.md (115 lines) Models: GPT-5 (24 findings), Opus (15), Sonnet (15) Key insight: External system assumptions benefit more from reasoning depth than internal architecture analysis. GPT-5's exhaustive coverage of broker implementation details and network failure modes justifies the token cost for critical external interfaces. Union of all models finds ~30 distinct assumptions vs ~24 max single model.	2026-05-10 02:27:53 -07:00
Rodin	ce4801e8a3	Add Finding #62: Boundary contract analysis (new analytical lens) Tested on signal-lifecycle.md (111 lines). Results: - GPT-5: 17 gaps (7,744 reasoning tokens) - Opus: 11 gaps (design-level focus) - Sonnet: 8 gaps (fastest, protocol-level) Key insight: Union of all models (~26 gaps) far exceeds any single model (max 17). Only 5 gaps found by all three — highly differentiated outputs make multi-model runs valuable for interface documents.	2026-05-09 23:35:36 -07:00
Rodin	9f15047892	Finding #62: Data integrity analysis on signal-lifecycle.md New lens: data integrity analysis — testing whether data survives flow through systems with correct identity, values, and auditability. Key insights: - GPT-5 excels at audit/forensics gaps (idempotency, ordering, provenance) - Opus finds semantic violations (phantom group, quantity mutation ambiguity) - Sonnet identifies operational races (restart scenarios) Document: gargoyle signal-lifecycle.md (102 lines) Models: GPT-5 (13 findings), Opus (6+), Sonnet (6)	2026-05-09 22:26:46 -07:00
Rodin	527e71a1d6	finding #61: regulatory completeness analysis lens	2026-05-09 20:06:51 -07:00
Rodin	af950a33d1	Add finding #60: Counterfactual event ordering analysis New analytical lens testing what breaks when events arrive out of order. - GPT-5: 30 findings via exhaustive permutation enumeration - Opus: 19 findings with operational consequence tracing - Sonnet: 17 findings with regulatory compliance focus Key insight: GPT-5's reasoning enables systematic swap/delay/duplicate/interleave enumeration. Sonnet uniquely connects to regulatory requirements.	2026-05-09 18:28:40 -07:00
Rodin	2988f31fc3	finding 59: convention rule gap analysis New task type: analyzing prescriptive/specification documents for completeness. - GPT-5 dominates with exhaustive enumeration (34 findings) - Opus traces gaps to consequences (routing failures, compiler issues) - Sonnet surface-level (not recommended for thorough analysis) Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example) that neither Claude model caught. Opus unique in tracing PubSub collision to actual routing failure scenario. Task taxonomy: convention gap analysis follows same pattern as architecture docs - GPT-5 for coverage, Opus for consequences.	2026-05-09 17:28:53 -07:00
Rodin	98304604ac	Finding 58: State machine completeness analysis on kill-switch.md GPT-5 finds 16 gaps, Opus 11, Sonnet 9. GPT-5 excels at exhaustive state space enumeration; Opus finds convention-vs-enforcement gaps; Sonnet adequate but less thorough. Key insight: state machine completeness is a GPT-5 sweet spot due to reasoning tokens enabling systematic combinatorial coverage.	2026-05-09 15:06:32 -07:00
Rodin	faaa6d9c11	Finding #57: Event flow correctness analysis - new analytical lens Tests a novel lens for event-sourced architectures: can all state be reconstructed from documented events alone? Key findings: - GPT-5 brings external domain knowledge (broker APIs, compliance) - Opus reasons through failure modes systematically (crash boundaries) - Sonnet does rapid structural analysis (missing pieces) 21 unique findings across three models with only 5 in common. Each model's reasoning style reveals different issue categories. New pattern: event flow analysis exposes model reasoning styles that gap-finding and contradiction detection don't surface.	2026-05-09 13:29:58 -07:00
claw	b7acbd7662	Finding #56: Operational burden analysis - new analytical lens Tests a novel lens asking 'what cognitive/procedural load does this design place on operators?' Applied to escalation-policy.md with GPT-5, Sonnet 4.6, and Opus 4.6. Key findings: - All models identified manual liquidate→restrict has no procedure (CRITICAL) - GPT-5 excels at exhaustive enumeration (21+ findings, config gaps) - Opus identifies systemic vulnerabilities (monitor crash → silent unsafe state) - Sonnet fills procedural gaps (authorization, timeouts) Recommendation: Opus alone for time-constrained analysis, GPT-5 + Opus for thoroughness. They find different types of issues with minimal overlap.	2026-05-09 06:46:29 -07:00
claw	5ee0cff3a8	experiment #55: state reconstruction correctness — new analytical lens Tests whether event stream supports time-travel queries, retroactive truth, and audit reconstruction. All three models found CRITICAL issues in a document that passed previous lenses. Key insight: distinguishes telemetry events from sourcing events. Document: gargoyle corporate-actions.md Models: GPT-5, Sonnet 4.6, Opus 4.6 Lens validation: model-stable, domain-independent, architecturally significant	2026-05-09 05:06:45 -07:00
claw	bb191e48d1	finding #54: wash sale multi-model design review analysis Compared Sonnet 4, GPT-5, and Opus 4.6 on gargoyle wash-sale-tracking.md. Key insights: - GPT-5 requires 16K+ completion tokens (4K for reasoning alone) - Opus caught holding period add-vs-backdate correctness issue - Sonnet caught Section 1259 (constructive sales) that others missed - All three missed multi-broker 1099-B reconciliation problem - Multi-model review justified for tax compliance domains	2026-05-09 03:35:12 -07:00
Rodin	9d0a94bd68	Add finding #53: unstated constraint detection on state machines New analytical lens tested on gargoyle order-state-machine.md: - GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis) - Opus: 14 findings (state lifecycle focus, implementation mechanisms) - Sonnet: 10 findings (fast but shallow) Key insight: "unstated constraints" finds what's IMPLIED but not stated, distinct from gaps, race conditions, or ambiguities. GPT-5 is best for catching CRITICAL data integrity constraints; Opus for state machine implementation details.	2026-05-08 23:47:51 -07:00
claw	c1ca8cfe46	finding #52: degraded-mode propagation analysis (new lens) Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.	2026-05-08 14:29:29 -07:00
claw	79915d1dc3	finding 51: implementation ambiguity analysis — new analytical lens	2026-05-08 12:46:32 -07:00
claw	5b8f8caf8c	finding 50: concurrency and race condition analysis lens New analytical lens applied to signal-lifecycle.md (111 lines). All three models (GPT-5, Opus, Sonnet) found 7-9 findings each with 70% at Critical/High severity. Key insight: concurrency analysis rewards compositional temporal reasoning over enumeration breadth, narrowing the gap between models compared to other lenses. Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash survival contradiction), Sonnet (Signal Risk audit gap after dispatch).	2026-05-08 11:06:06 -07:00
claw	7ca01f0cbf	finding 49: adversarial evasion/tampering analysis on audit-log.md New analytical lens (adversarial/offensive security) tested on gargoyle's signal audit log spec. GPT-5 most exhaustive (25), Opus deepest individual attack narratives (14), Sonnet most creative meta-attacks (11). Adversarial lens is ~2.5x more productive than defensive lenses on comparable docs. All three models converged on same root cause (trust model).	2026-05-08 09:09:58 -07:00
claw	8f9e87415e	finding #48: defense-in-depth gap analysis on auth-and-credentials.md New analytical lens: where systems rely on single mechanisms rather than layered defenses. GPT-5 finds exploitable SSRF; Opus identifies trust-root collapse (session+sudo share SECRET_KEY_BASE); Sonnet is surface-level.	2026-05-08 03:47:09 -07:00
claw	f3266ccc13	finding 47: emergent behavior from rule composition - new analytical lens Tests a novel analytical lens on aggregation.md (239 lines): 'what happens when many correct instances operate simultaneously in a correlated environment?' Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback loops. Opus (8 findings, 93s) finds the most consequential single findings (stop-loss defeated by temporal composition, crash-opportunity correlation). Sonnet 4.0 (6 findings, 32s) too abstract for this task. Key insight: This lens finds DEPLOYMENT bugs invisible at design time - the gap between 'correct by construction' and 'correct in production'.	2026-05-08 02:06:25 -07:00
claw	b5b5b64a40	finding #46: operational blind spot analysis — new task type Novel experiment testing 'what's invisible to operators' on gargoyle's observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10). Key discovery: 'actively misleads' category (observability creating false confidence) is highest-value and Opus-dominated. Distinct from assumption-finding, race conditions, or gap analysis — requires reasoning about negation (what ISN'T instrumented vs what production needs).	2026-05-08 00:27:23 -07:00
claw	64fdfebed3	finding 45: operator decision support gap analysis — new task type	2026-05-07 21:07:46 -07:00
claw	e127e7b0c7	finding 44: cross-doc consistency on closely related docs Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts. Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6) but completely fails on implication conflicts where one doc's simplified model creates false impressions about another doc's complete specification. Refines Finding 28 and confirms cross-document consistency is actually TWO distinct tasks with different model requirements.	2026-05-07 19:27:20 -07:00
claw	d8a030d9e9	finding #43: opus + narrow framing for contradiction detection Tests the open question from Finding #39: does Opus's internal reasoning depth suffice for self-contradiction verification? Key result: wrong question. Opus finds a different CLASS of contradiction than GPT-5. GPT-5 finds specification conflicts (statement comparison). Opus finds logical impossibilities (deductive rule interaction). Neither dominates — they don't overlap. Sonnet remains unreliable (~33% precision). Document tested: escalation-policy.md (228 lines) Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6	2026-05-07 16:05:14 -07:00
claw	296bb21eb7	finding #42: failure propagation chain analysis on system-overview.md New analytical lens: failure propagation chains. Opus matched GPT-5's count (10 findings each) while using 2.2x fewer tokens. Overview docs are ideal for this lens. Sonnet produced zero unique insights.	2026-05-07 14:28:26 -07:00
claw	a65c471a3f	finding 41: temporal ordering dependency analysis on kill-switch.md New analytical lens testing whether models can identify sequential operations where order matters but isn't mechanically enforced. GPT-5 finds systemic gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction is dangerous), Sonnet identifies themes without unique depth.	2026-05-07 12:47:03 -07:00
claw	bb0c0d564b	Finding #40: Silent data corruption paths in financial accounting New analytical lens applied to lot-accounting.md (181 lines). Tests how models identify sequences of individually correct operations that produce silently wrong financial results. Results: - GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge - Opus: 8 findings (121s) - concurrent systems / crash recovery focus - Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding Key insight: First experiment where domain-specific knowledge (tax law) is the primary differentiator. Models reason from different knowledge domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns. Sonnet produced the most architecturally significant finding: that the system's reconciliation mechanism confirms corruption rather than detecting it (because it re-derives from LotClosed which is itself the corrupted source).	2026-05-07 11:09:58 -07:00
claw	0c632c255a	finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency Tested open question from Finding #5: does narrow framing give Sonnet GPT-5-level semantic analysis? Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions but only 1 was genuine (2 were analytical errors/misreads). GPT-5 found 4 all-genuine findings with precise reasoning. Key insight: framing controls scope, not reasoning depth. For tasks requiring logical verification (contradictions, race conditions, invariant violations), reasoning tokens are necessary — framing alone is insufficient. Updated open-questions.md: marked Sonnet+narrow as answered, added new question about Opus+narrow for contradiction detection.	2026-05-07 09:26:08 -07:00
claw	d27ce6f5e1	finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test) First experiment testing domain-specific regulatory knowledge rather than pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210 knowledge; GPT-5 finds broker-API semantic mismatches; content filters are a new failure mode for financial domain analysis via enterprise proxies.	2026-05-07 07:47:11 -07:00
claw	58e69e21f8	finding 37: cross-doc consistency on tightly coupled risk docs Tested kill-switch.md + escalation-policy.md (same bounded context, shared vocabulary). Key insight: shared vocabulary claims are the most dangerous inconsistency — same words with opposite severity ordering. Opus found the severity-ordering inversion (restrict/liquidate ladders run in opposite directions). GPT-5 found the meta-issue (the 'same vocabulary' claim is itself the problem). Sonnet fast but shallow. Tightly coupled docs produce more Critical findings than loosely coupled ones (Finding #28).	2026-05-07 04:29:23 -07:00
claw	c071ffc31f	Finding #36: Compositional interface analysis - two-document interface assumptions New experiment type: give models two related architecture documents and ask them to identify assumptions each document makes about the other that could be violated. Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10 findings, 111s, structural/architectural) both found unique interface gaps. Sonnet (7 findings, 29s) found nothing unique - all its findings were simplified versions of GPT-5/Opus findings. Key insight: Interface analysis requires holding two mental models simultaneously and is harder than single-document analysis. Sonnet produced 0 unique findings (vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this task type.	2026-05-07 02:48:46 -07:00
claw	d8ddbc9861	mark adversarial ensemble question as answered (finding #35)	2026-05-06 21:29:35 -07:00
claw	8338ae3019	finding #35: adversarial ensemble (critique+extend) produces 30% more coverage Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md. Key results: - Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone - Zero full disagreements — GPT-5's coverage is reliable signal - Critique phase (severity calibration) more valuable than extension phase - 28% more tokens for 30% more coverage + structured prioritization - Answers open question about adversarial ensemble value	2026-05-06 21:29:17 -07:00
Rodin	4a69a99d05	finding #34: information flow hazard analysis on lot-accounting.md New analytical lens: where data propagation creates stale, contradictory, or misleading views for different consumers. Key result: highest model convergence (45% common ground) due to document's explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus identifies strategy attribution dimension. Sonnet adds zero unique value. Two-model stack (GPT-5 + Opus) optimal.	2026-05-06 18:29:06 -07:00
Rodin	20c0bd2492	feat: experiment #33 — observability gap analysis on aggregation.md New analytical lens: observability gap analysis — asking 'when something goes wrong, can you SEE it?' rather than 'what can go wrong?' Results on aggregation.md (239 lines): - GPT-5: 23 findings (12 unique), exhaustive telemetry architecture - Opus: 14 findings (6 unique), operator-behavioral insights - Sonnet: 11 findings (0 unique), no added value Key insight: GPT-5 designs the instrumentation; Opus identifies where available signals mislead operators toward wrong remediations. Two-model (GPT-5 + Opus) optimal for this task type.	2026-05-06 11:49:05 -07:00
Rodin	8cfabfdc55	experiment #32: testability analysis — new analytical lens Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec. Opus found a genuine spec bug (trigger logic described backwards). Confirms pattern: GPT-5 for breadth, Opus for logic contradictions, Sonnet adds no value for systematic analytical tasks.	2026-05-06 10:09:05 -07:00
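
Note on commit 643a804bdf (finding #79): the CGN gap it describes can be reproduced with a minimal sketch. Python's `ipaddress` module reports `is_private` and `is_reserved` as False for the shared address space 100.64.0.0/10, so a guard built only on those two properties lets CGN addresses through. The helper name `is_internal_address` and the exact set of checks are assumptions made for illustration; the log does not show the repository's actual fix.

```python
import ipaddress

# RFC 6598 shared address space, used for carrier-grade NAT (CGN).
CGN_NETWORK = ipaddress.ip_network("100.64.0.0/10")


def is_internal_address(host: str) -> bool:
    """Hypothetical SSRF guard: True if `host` should be refused.

    Assumption: `host` is already a literal IP address. Resolving a DNS
    name first (and pinning the result) is out of scope for this sketch.
    """
    ip = ipaddress.ip_address(host)
    # Explicit CGN check, because is_private/is_reserved do not cover it.
    if ip.version == 4 and ip in CGN_NETWORK:
        return True
    return ip.is_private or ip.is_reserved or ip.is_loopback or ip.is_link_local


if __name__ == "__main__":
    cgn = ipaddress.ip_address("100.64.0.1")
    # The two properties named in the commit message, then the guard result.
    print(cgn.is_private, cgn.is_reserved, is_internal_address("100.64.0.1"))
```

The second gap in that commit (a cloned Go `http.DefaultTransport` keeping `ProxyFromEnvironment`) is separate and is not covered by this sketch.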
