refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,58 @@
+# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried
+
+**Date:** 2026-04-27
+**Task:** Detect directional bias in 8 deliberately biased hypotheses about
+microservices vs monolith architecture for fintech startups.
+**How we used them:** Created fresh test material (8 hypotheses with pro-
+microservices bias via absolutes like "inevitably," "necessary," "must,"
+"requires," plus one factually inverted claim about consistency guarantees).
+Ran 4 conditions in parallel sub-agents:
+
+| Condition | Model | Framing | Context |
+|---|---|---|---|
+| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
+| B | Sonnet | Same narrow question | Hypotheses only |
+| C | GPT-5 | Same narrow question | Hypotheses only |
+| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
+
+**Results:**
+- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
+- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
+  similar output: per-hypothesis verdict, biasing words, neutral version,
+  severity assessment.
+- All 3 narrow-framing models flagged H8's factual inversion (distributed
+  transactions DON'T provide stronger consistency than monolithic ACID).
+- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
+  Overflow, Basecamp) — marginally richer analysis.
+- Sonnet with the broad mandate also caught the bias, framing it as one of
+  three "systemic problems" (deterministic language, pro-microservices
+  framing bias, underspecified constructs). It also provided testability and
+  operationalization analysis that the narrow framing didn't ask for.
+- Broad Sonnet took ~72s vs ~39s for the narrow conditions (more output).
+
+**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
+all three models, including the cheapest (GPT-4.1 Mini), detect it with the
+narrow question, and Sonnet also detects it with the broad one. This appears
+to **contradict** original finding #2 ("cheap model + narrow lens > expensive
+model + broad review"), but the key difference is context noise:
+
+- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
+  FULL PR REVIEW with rich project context (diff, file content, issue text,
+  acceptance criteria, project conventions). The hypotheses were buried in
+  layers of review mechanics.
+- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
+  hypothesis text — no diff, no PR structure, no project context noise.
+
+**Refined hypothesis:** The original finding #2 was about **signal-to-noise
+ratio**, not about model capability or framing precision. When biased text
+is presented in isolation, every model tested here catches it. When biased
+text is buried in a large PR review with many other things to check, the
+bias signal gets lost in the noise unless you explicitly ask about it. The
+"narrow lens" worked because it eliminated the noise, not because smaller
+models are better at bias detection.
+
+**Next experiment to confirm:** Give a model the FULL PR review context
+(diff, files, issue, AC) but add the narrow bias question as an explicit
+review checklist item. If the model catches the bias despite the rich
+context, that supports the signal-to-noise hypothesis. If it misses the bias,
+something else is likely at play (attention allocation, task switching cost).
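
The proposed follow-up differs from this finding's narrow conditions in one variable only: whether the PR context surrounds the hypotheses. A minimal sketch of that setup, assuming a plain-text prompt, might look like the following. The two question strings are taken from the conditions table; everything else (`Condition`, `build_prompt`, `detection_rate`, the label `E`, and the prompt section headings) is hypothetical and not part of the original experiment harness.

```python
from dataclasses import dataclass

# Question wording copied from the conditions table in this finding.
NARROW_QUESTION = "Do any lead toward a predetermined conclusion?"
BROAD_QUESTION = "Review quality, clarity, testability, and issues"


@dataclass(frozen=True)
class Condition:
    """One experimental condition (hypothetical structure)."""
    label: str
    question: str
    include_pr_context: bool  # True = diff, files, issue, AC are included


def build_prompt(cond: Condition, hypotheses: list[str], pr_context: str) -> str:
    """Assemble the review prompt for one condition."""
    parts = []
    if cond.include_pr_context:
        parts.append("## PR context\n" + pr_context)
    numbered = "\n".join(f"H{i}: {h}" for i, h in enumerate(hypotheses, start=1))
    parts.append("## Hypotheses\n" + numbered)
    parts.append("## Review checklist\n- " + cond.question)
    return "\n\n".join(parts)


def detection_rate(flagged: set[str], biased: set[str]) -> float:
    """Fraction of known-biased hypotheses that the review flagged."""
    return len(flagged & biased) / len(biased)


# Baseline from this finding: hypotheses only, narrow question.
baseline = Condition("B", NARROW_QUESTION, include_pr_context=False)
# Proposed follow-up (label "E" is an assumption): full context, same question.
follow_up = Condition("E", NARROW_QUESTION, include_pr_context=True)
```

Holding the question fixed and toggling only `include_pr_context` keeps the comparison clean: a drop in `detection_rate` between `baseline` and `follow_up` would point at the added context rather than at the framing.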