Commit Graph

23 Commits

Author SHA1 Message Date
claw 64fdfebed3 finding 45: operator decision support gap analysis — new task type 2026-05-07 21:07:46 -07:00
claw e127e7b0c7 finding 44: cross-doc consistency on closely related docs
Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.

Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.

Refines Finding 28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
2026-05-07 19:27:20 -07:00
claw d8a030d9e9 finding #43: opus + narrow framing for contradiction detection
Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
2026-05-07 16:05:14 -07:00
claw 296bb21eb7 finding #42: failure propagation chain analysis on system-overview.md
New analytical lens: failure propagation chains. Opus matched GPT-5's count
(10 findings each) while using 2.2x fewer tokens. Overview docs are ideal
for this lens. Sonnet produced zero unique insights.
2026-05-07 14:28:26 -07:00
claw a65c471a3f finding 41: temporal ordering dependency analysis on kill-switch.md
New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.
2026-05-07 12:47:03 -07:00
claw bb0c0d564b Finding #40: Silent data corruption paths in financial accounting
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed, which is itself the corrupted source).
2026-05-07 11:09:58 -07:00
claw 0c632c255a finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
2026-05-07 09:26:08 -07:00
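
The counts in finding #39 can be restated as a precision figure (genuine findings divided by reported findings). The sketch below is illustrative only: the `precision` helper and the variable names are assumptions, not part of the repository, and the inputs are just the counts quoted in the commit message above.

```python
def precision(genuine: int, reported: int) -> float:
    """Fraction of reported findings that were judged genuine."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    if not 0 <= genuine <= reported:
        raise ValueError("genuine must be between 0 and reported")
    return genuine / reported


# Counts taken from the finding #39 commit message.
sonnet_narrow = precision(genuine=1, reported=3)  # 3 contradictions, 1 genuine
gpt5 = precision(genuine=4, reported=4)           # 4 findings, all genuine

print(f"Sonnet (narrow framing): {sonnet_narrow:.0%}")  # 33%
print(f"GPT-5: {gpt5:.0%}")                             # 100%
```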
claw d27ce6f5e1 finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test)
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.
2026-05-07 07:47:11 -07:00
claw 58e69e21f8 finding 37: cross-doc consistency on tightly coupled risk docs
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.

Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet was fast but shallow.

Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
2026-05-07 04:29:23 -07:00
claw c071ffc31f Finding #36: Compositional interface analysis - two-document interface assumptions
New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.

Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.

Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.
2026-05-07 02:48:46 -07:00
claw d8ddbc9861 mark adversarial ensemble question as answered (finding #35) 2026-05-06 21:29:35 -07:00
claw 8338ae3019 finding #35: adversarial ensemble (critique+extend) produces 30% more coverage
Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md.
Key results:
- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements — GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value
2026-05-06 21:29:17 -07:00
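
As a quick check of the "30% more coverage" figure in finding #35, the relative gain can be recomputed from the counts quoted above. This is a minimal sketch: `relative_gain` and the variable names are hypothetical, and the per-token ratio assumes the 28% token overhead is measured against the same GPT-5-only baseline as the coverage gain, which the commit message does not state explicitly.

```python
def relative_gain(combined: int, baseline: int) -> float:
    """Relative increase of `combined` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (combined - baseline) / baseline


# Unique-finding counts from the finding #35 commit message.
ensemble, gpt5_alone, opus_alone = 56, 43, 28

gain_vs_gpt5 = relative_gain(ensemble, gpt5_alone)  # 13 / 43
gain_vs_opus = relative_gain(ensemble, opus_alone)  # 28 / 28

print(f"vs GPT-5 alone: {gain_vs_gpt5:.0%}")  # 30%
print(f"vs Opus alone:  {gain_vs_opus:.0%}")  # 100%

# Assumption: the 28% token overhead uses the GPT-5-only run as its baseline.
token_overhead = 0.28
coverage_per_token = (1 + gain_vs_gpt5) / (1 + token_overhead)
print(f"coverage per token vs GPT-5 alone: {coverage_per_token:.2f}x")
```

Under that assumption the ensemble's coverage per token is close to that of GPT-5 alone, so any extra value would come mainly from the structured prioritization mentioned in the message.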
Rodin 4a69a99d05 finding #34: information flow hazard analysis on lot-accounting.md
New analytical lens: where data propagation creates stale, contradictory,
or misleading views for different consumers.

Key result: highest model convergence (45% common ground) due to the document's
explicit failure mode table. GPT-5 finds event-level provenance gaps; Opus
identifies strategy attribution dimension. Sonnet adds zero unique value.
Two-model stack (GPT-5 + Opus) optimal.
2026-05-06 18:29:06 -07:00
Rodin 20c0bd2492 feat: experiment #33 — observability gap analysis on aggregation.md
New analytical lens: observability gap analysis — asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'

Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value

Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
2026-05-06 11:49:05 -07:00
Rodin 8cfabfdc55 experiment #32: testability analysis — new analytical lens
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.
2026-05-06 10:09:05 -07:00
Rodin ee3063997a finding #31: spec-gap analysis on continuous-risk-monitoring.md
New task type: specification gap/completeness analysis (vs adversarial gaming).
GPT-5 dominates count (25 findings), Opus produces best single insight
(realized P&L non-reversibility violates de-escalation model assumption).
Sonnet adds no unique value for this task type — skip for completeness audits.
2026-05-06 08:27:00 -07:00
Rodin cfcad67baa feat: add generic review prompts and generation guide
- review-prompts/generic/sonnet.md: language-agnostic structural review
- review-prompts/generic/gpt5.md: language-agnostic semantic/domain review
- review-prompts/generic/opus.md: language-agnostic design coherence review
- review-prompts/GENERATE.md: meta-prompt for tailoring to any repo
- review-prompts/ORCHESTRATION.md: multi-model review orchestration pattern
2026-05-06 08:00:59 -07:00
Rodin a3aebc7cc1 docs(readme): add Reports section with links to REPORT.md and LESSONS.md
Explains what each file contains, that they're auto-regenerated weekly,
and includes generation timestamps.
2026-05-06 07:29:03 -07:00
Rodin b832f32a16 docs: add generation timestamps to REPORT.md and LESSONS.md 2026-05-06 07:26:48 -07:00
Rodin f865a0d778 docs: add research report and actionable lessons summary
REPORT.md — full analysis of 29 experiments: model strengths, task-type
mappings, meta-findings, cost-effectiveness, and open questions.

LESSONS.md — distilled operational playbook: which model for which task,
anti-patterns, decision framework, and the three core rules.
2026-05-06 07:24:12 -07:00
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
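
The `YYYY-MM-DD-NN-slug.md` naming scheme described in this commit can be parsed and sorted mechanically. The sketch below is an assumption-laden illustration: the regular expression, the `FindingFile` type, and the example file names are hypothetical, and the exact slug character set and the meaning of `NN` are not specified in the commit message.

```python
import re
from typing import NamedTuple

# Assumed pattern for YYYY-MM-DD-NN-slug.md; the slug alphabet is a guess.
FINDING_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{2,})-([a-z0-9-]+)\.md$")


class FindingFile(NamedTuple):
    date: str    # ISO date, so string order matches chronological order
    number: int  # the NN component
    slug: str


def parse_finding_name(name: str) -> FindingFile:
    """Split a findings file name into its date, number, and slug parts."""
    match = FINDING_NAME.match(name)
    if match is None:
        raise ValueError(f"not a findings file name: {name!r}")
    date, number, slug = match.groups()
    return FindingFile(date, int(number), slug)


# Hypothetical file names, used only to show the ordering.
names = [
    "2026-05-06-02-example-b.md",
    "2026-05-05-10-example-a.md",
    "2026-05-06-01-example-c.md",
]

# Sort by (date, number); the slug does not affect the order.
ordered = sorted(names, key=lambda n: parse_finding_name(n)[:2])
print(ordered)
```

When `NN` is zero-padded to a fixed width, a plain lexicographic sort of the file names gives the same order, which is presumably what makes the scheme convenient for chronological listing.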
Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: signal-to-noise ratio matters more than model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
rodin 4aea0d004b Initial commit 2026-05-06 02:10:14 +00:00