Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
# Finding 8: Bias detection: all models catch it with any framing, when the signal isn't buried

**Date:** 2026-04-27
**Task:** Detect directional bias in 8 deliberately biased hypotheses about
microservices vs monolith architecture for fintech startups.
**How we used them:** Created fresh test material (8 hypotheses with
pro-microservices bias via absolutes like "inevitably," "necessary," "must,"
"requires," plus one factually inverted claim about consistency guarantees).
Ran 4 conditions in parallel sub-agents:

| Condition | Model | Framing | Context |
|---|---|---|---|
| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
| B | Sonnet | Same narrow question | Hypotheses only |
| C | GPT-5 | Same narrow question | Hypotheses only |
| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |

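
The four conditions in the table can be written down as data, which makes it easier to see that only the model and the question vary while the context stays fixed. This is a minimal sketch, not the harness actually used: the `Condition` class, the `build_prompt` helper, and the `H1:`-style prompt layout are assumptions for illustration, and the model names are the display names from the table rather than API identifiers.

```python
from dataclasses import dataclass

# Framing questions quoted from the conditions table.
NARROW = "Do any lead toward a predetermined conclusion?"
BROAD = "Review quality, clarity, testability, and issues"


@dataclass(frozen=True)
class Condition:
    label: str
    model: str     # display name from the table, not an API model ID
    framing: str   # "narrow" or "broad"
    question: str


CONDITIONS = [
    Condition("A", "GPT-4.1 Mini", "narrow", NARROW),
    Condition("B", "Sonnet", "narrow", NARROW),
    Condition("C", "GPT-5", "narrow", NARROW),
    Condition("D", "Sonnet", "broad", BROAD),
]


def build_prompt(condition: Condition, hypotheses: list[str]) -> str:
    """Hypothetical layout: the question, then the hypotheses and nothing else."""
    numbered = "\n".join(
        f"H{i}: {text}" for i, text in enumerate(hypotheses, start=1)
    )
    return f"{condition.question}\n\n{numbered}"
```

Every condition receives the same hypothesis list, so any difference in output comes from the model or the framing, not from the context.
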
**Results:**
- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
  similar output: per-hypothesis verdict, biasing words, neutral version,
  severity assessment.
- All 3 narrow-framing models flagged H8's factual inversion (distributed
  transactions DON'T provide stronger consistency than monolithic ACID).
- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
  Overflow, Basecamp), which made its analysis marginally richer.
- Sonnet with the broad mandate also caught the bias, framing it as one of
  three "systemic problems" (deterministic language, pro-microservices
  framing bias, underspecified constructs). It additionally provided
  testability and operationalization analysis that the narrow framing
  didn't ask for.
- Sonnet broad took ~72s vs ~39s for the narrow conditions (more output).

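
The shared output shape of the narrow-framing runs (verdict, biasing words, neutral version, severity) and the 8/8 tally could be recorded roughly as follows. This is a sketch under assumptions: the `Verdict` fields and the `detection_score` helper are hypothetical names, the severity scale is not specified in this write-up, and the sample records are placeholders rather than actual model output.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    hypothesis_id: str                 # e.g. "H8"
    biased: bool                       # per-hypothesis verdict
    biasing_words: list[str] = field(default_factory=list)
    neutral_version: str = ""
    severity: str = ""                 # free-form; scale is an open assumption


def detection_score(verdicts: list[Verdict], total: int = 8) -> str:
    """Return 'detected/total', counting each hypothesis at most once."""
    flagged = {v.hypothesis_id for v in verdicts if v.biased}
    return f"{len(flagged)}/{total}"


# Placeholder records for illustration only, not results from the experiment.
sample = [
    Verdict("H1", True, ["inevitably"]),
    Verdict("H2", False),
]
print(detection_score(sample, total=2))  # prints "1/2"
```

Counting by hypothesis ID rather than by record keeps a model that flags the same hypothesis twice from inflating the score.
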
**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
all tested models, including the cheapest (GPT-4.1 Mini), detect bias
regardless of whether the question is narrow or broad. This appears to
**contradict** original finding #2 ("cheap model + narrow lens > expensive
model + broad review"), but the key difference is context noise:

- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
  FULL PR REVIEW with rich project context (diff, file content, issue text,
  acceptance criteria, project conventions). The hypotheses were buried in
  layers of review mechanics.
- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
  hypothesis text: no diff, no PR structure, no project context noise.

**Refined hypothesis:** The original finding #2 was about **signal-to-noise
ratio**, not about model capability or framing precision. When biased text
is presented in isolation, every model tested here catches it. When biased
text is buried in a large PR review with many other things to check, the
bias signal gets lost in the noise unless you explicitly ask about it. The
"narrow lens" worked because it eliminated the noise, not because smaller
models are better at bias detection.

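
One crude way to put a number on "signal-to-noise" here is the share of the input taken up by the text under test. The write-up does not define a measure, so the character-count proxy and the `signal_share` name below are assumptions for illustration only.

```python
def signal_share(target_text: str, context_parts: list[str]) -> float:
    """Fraction of total input characters taken up by the text under test."""
    total = len(target_text) + sum(len(part) for part in context_parts)
    return len(target_text) / total if total else 0.0
```

With no surrounding context the share is 1.0, which corresponds to this experiment's "hypotheses only" conditions. Adding a diff, file contents, and issue text pushes it toward 0, which corresponds to the full PR review setting.
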
**Next experiment to confirm:** Give a model the FULL PR review context
(diff, files, issue, acceptance criteria) but add the narrow bias question
as an explicit review checklist item. If the model catches bias despite the
rich context, it confirms the signal-to-noise hypothesis. If it misses, it
suggests something else is at play (attention allocation, task switching
cost).

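
The proposed follow-up could be set up along these lines. This is a sketch, not an existing harness: `build_review_prompt`, `interpret`, the section headings, and the checkbox layout are hypothetical, and only the bias question itself is taken from the earlier conditions.

```python
# Narrow question reused from conditions A to C.
BIAS_QUESTION = "Do any lead toward a predetermined conclusion?"


def build_review_prompt(context: dict[str, str], checklist: list[str]) -> str:
    """Assemble a full-context review prompt ending in an explicit checklist."""
    sections = [f"## {name}\n{body}" for name, body in context.items()]
    items = "\n".join(f"- [ ] {item}" for item in checklist)
    return "\n\n".join(sections + [f"## Review checklist\n{items}"])


def interpret(bias_caught: bool) -> str:
    """Map the outcome onto the two readings described in the text."""
    if bias_caught:
        return "consistent with the signal-to-noise hypothesis"
    return "suggests another factor (attention allocation, task switching cost)"
```

A call such as `build_review_prompt({"Diff": diff_text, "Issue": issue_text}, ["Hypotheses: " + BIAS_QUESTION])` keeps the rich context intact and varies only whether the checklist names bias explicitly, which is the single variable the follow-up is meant to isolate.
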