model-research

Author	SHA1	Message	Date
claw	d8a030d9e9	finding #43: opus + narrow framing for contradiction detection. Tests the open question from Finding #39: does Opus's internal reasoning depth suffice for self-contradiction verification? Key result: wrong question. Opus finds a different CLASS of contradiction than GPT-5. GPT-5 finds specification conflicts (statement comparison). Opus finds logical impossibilities (deductive rule interaction). Neither dominates; they don't overlap. Sonnet remains unreliable (~33% precision). Document tested: escalation-policy.md (228 lines). Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6.	2026-05-07 16:05:14 -07:00
claw	296bb21eb7	finding #42: failure propagation chain analysis on system-overview.md. New analytical lens: failure propagation chains. Opus matched GPT-5's count (10 findings each) while using 2.2x fewer tokens. Overview docs are ideal for this lens. Sonnet produced zero unique insights.	2026-05-07 14:28:26 -07:00
claw	a65c471a3f	finding #41: temporal ordering dependency analysis on kill-switch.md. New analytical lens testing whether models can identify sequential operations where order matters but isn't mechanically enforced. GPT-5 finds systemic gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction is dangerous), Sonnet identifies themes without unique depth.	2026-05-07 12:47:03 -07:00
claw	bb0c0d564b	Finding #40: Silent data corruption paths in financial accounting. New analytical lens applied to lot-accounting.md (181 lines). Tests how models identify sequences of individually correct operations that produce silently wrong financial results. Results: GPT-5: 12 findings (137s, 10688 reasoning tokens), tax law domain knowledge; Opus: 8 findings (121s), concurrent systems / crash recovery focus; Sonnet: 8 findings (111s), structural meta-analysis, highest-leverage finding. Key insight: First experiment where domain-specific knowledge (tax law) is the primary differentiator. Models reason from different knowledge domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns. Sonnet produced the most architecturally significant finding: that the system's reconciliation mechanism confirms corruption rather than detecting it (because it re-derives from LotClosed which is itself the corrupted source).	2026-05-07 11:09:58 -07:00
claw	0c632c255a	finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency. Tested open question from Finding #5: does narrow framing give Sonnet GPT-5-level semantic analysis? Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions but only 1 was genuine (2 were analytical errors/misreads). GPT-5 found 4 all-genuine findings with precise reasoning. Key insight: framing controls scope, not reasoning depth. For tasks requiring logical verification (contradictions, race conditions, invariant violations), reasoning tokens are necessary; framing alone is insufficient. Updated open-questions.md: marked Sonnet+narrow as answered, added new question about Opus+narrow for contradiction detection.	2026-05-07 09:26:08 -07:00
claw	d27ce6f5e1	finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test). First experiment testing domain-specific regulatory knowledge rather than pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210 knowledge; GPT-5 finds broker-API semantic mismatches; content filters are a new failure mode for financial domain analysis via enterprise proxies.	2026-05-07 07:47:11 -07:00
claw	58e69e21f8	finding #37: cross-doc consistency on tightly coupled risk docs. Tested kill-switch.md + escalation-policy.md (same bounded context, shared vocabulary). Key insight: shared vocabulary claims are the most dangerous inconsistency: same words with opposite severity ordering. Opus found the severity-ordering inversion (restrict/liquidate ladders run in opposite directions). GPT-5 found the meta-issue (the 'same vocabulary' claim is itself the problem). Sonnet was fast but shallow. Tightly coupled docs produce more Critical findings than loosely coupled ones (Finding #28).	2026-05-07 04:29:23 -07:00
claw	c071ffc31f	Finding #36: Compositional interface analysis - two-document interface assumptions. New experiment type: give models two related architecture documents and ask them to identify assumptions each document makes about the other that could be violated. Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10 findings, 111s, structural/architectural) both found unique interface gaps. Sonnet (7 findings, 29s) found nothing unique - all its findings were simplified versions of GPT-5/Opus findings. Key insight: Interface analysis requires holding two mental models simultaneously and is harder than single-document analysis. Sonnet produced 0 unique findings (vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this task type.	2026-05-07 02:48:46 -07:00
claw	8338ae3019	finding #35: adversarial ensemble (critique+extend) produces 30% more coverage. Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md. Key results: ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone; zero full disagreements (GPT-5's coverage is reliable signal); critique phase (severity calibration) more valuable than extension phase; 28% more tokens for 30% more coverage + structured prioritization; answers open question about adversarial ensemble value.	2026-05-06 21:29:17 -07:00
Rodin	4a69a99d05	finding #34: information flow hazard analysis on lot-accounting.md. New analytical lens: where data propagation creates stale, contradictory, or misleading views for different consumers. Key result: highest model convergence (45% common ground) due to document's explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus identifies strategy attribution dimension. Sonnet adds zero unique value. Two-model stack (GPT-5 + Opus) optimal.	2026-05-06 18:29:06 -07:00
Rodin	20c0bd2492	feat: experiment #33, observability gap analysis on aggregation.md. New analytical lens: observability gap analysis, asking 'when something goes wrong, can you SEE it?' rather than 'what can go wrong?' Results on aggregation.md (239 lines): GPT-5: 23 findings (12 unique), exhaustive telemetry architecture; Opus: 14 findings (6 unique), operator-behavioral insights; Sonnet: 11 findings (0 unique), no added value. Key insight: GPT-5 designs the instrumentation; Opus identifies where available signals mislead operators toward wrong remediations. Two-model (GPT-5 + Opus) optimal for this task type.	2026-05-06 11:49:05 -07:00
Rodin	8cfabfdc55	experiment #32: testability analysis, a new analytical lens. Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec. Opus found a genuine spec bug (trigger logic described backwards). Confirms pattern: GPT-5 for breadth, Opus for logic contradictions, Sonnet adds no value for systematic analytical tasks.	2026-05-06 10:09:05 -07:00
Rodin	ee3063997a	finding #31: spec-gap analysis on continuous-risk-monitoring.md. New task type: specification gap/completeness analysis (vs adversarial gaming). GPT-5 dominates count (25 findings), Opus produces best single insight (realized P&L non-reversibility violates de-escalation model assumption). Sonnet adds no unique value for this task type, so skip it for completeness audits.	2026-05-06 08:27:00 -07:00
Rodin	6af8a6ee10	refactor(findings): split ALL-FINDINGS.md into per-experiment files. Break the monolithic 3,249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes; purely structural reorganization.	2026-05-06 07:15:50 -07:00
Rodin	1b108ff66e	Initial publish: 29 findings, 6 prompts, methodology, open questions. Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding). Contents: findings/ALL-FINDINGS.md (complete 3,249-line research log with all 29 findings, methodology notes, and open questions); prompts/ (6 exact prompts used across experiments); methodology.md (experimental setup and evaluation criteria); open-questions.md (unanswered questions for future work); README.md (overview and summary table). Key findings: Cross-document consistency: Opus is 2.4x faster with more findings; Gap-finding: GPT-5 reasoning tokens find domain-specific gaps; Race conditions: Opus excels at temporal interaction reasoning; Bias detection: Signal-to-noise ratio > model capability; Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different. Signed-off-by: Rodin	2026-05-05 19:13:03 -07:00