refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,125 @@
+# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
+
+**Date:** 2026-05-04
+**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
+— a well-structured state machine specification covering order lifecycle, fill precedence,
+TIF semantics, and parameter resolution.
+**How we used them:** Same document, same prompt, same model (GPT-5), same
+max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
+"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
+endpoint). No tools, no project context beyond the document.
+
+| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
+| Medium | 94,824 | 7,112 | 4,160 | 30 |
+| High | 88,607 | 6,891 | 3,712 | 30 |
+
+**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
+FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
+pattern (high effort → more reasoning → more depth) was inverted.
+
+**Per-finding metrics (remarkably consistent):**
+
+| Effort | Output tokens/finding | Reasoning tokens/finding |
+|---|---|---|
+| Low | 232 | 129 |
+| Medium | 237 | 138 |
+| High | 229 | 123 |
+
+The depth per finding was nearly identical across all three levels. The models
+didn't get more detailed or rigorous per-finding at higher effort — they just
+found slightly fewer things.
+
+**Severity distributions (similar across all three):**
+- Low: 7 Critical, 21 High, 5 Medium (33 findings)
+- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
+- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
+
+**Qualitative differences — WHAT they found:**
+
+High-effort unique findings (not in low):
+- Single-writer authority to broker (no out-of-band modifications)
+- Broker emits fills for all executed quantities (no silent netting)
+- Instrument identity remains stable across corporate actions
+- Late-fill override won't violate downstream invariants
+- Validation covers lot sizes, price ticks, borrow/locate constraints
+- Multiple accounts and venues are part of the correlation key
+- Streaming and polling APIs are consistent
+- System can handle multi-leg instruments
+
+Low-effort unique findings (not in high):
+- Acks arrive before fills (no pre-ack fills)
+- Cancel-before-ack handling (submitted → cancelled missing)
+- Fill totals never exceed requested quantity
+- Deterministic ordering within a broker stream
+- Exercise/assignment and non-order position changes
+- Client-side idempotency of "place order"
+- Partial accept/normalize on replace
+- No "child" order fragmentation at broker
+- Submitted state can receive terminal events
+- Late cancel vs local expired mismatch
+
+**Character of the differences:**
+- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
+  instruments, streaming vs polling consistency, downstream invariant violations,
+  corporate actions). These require reasoning about the system's relationship
+  to the broader world.
+- LOW-unique findings tend to be more **implementation-specific edge cases**
+  (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
+  These require reasoning about specific event interleavings and protocol details.
+
+Both sets are valid and actionable. Neither is clearly "better." They represent
+different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
+
+**Key insight — reasoning_effort doesn't scale analysis linearly:**
+
+Three possible explanations for the inverted behavior:
+
+1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
+   of the effort parameter.** The ~4K reasoning tokens across all three levels
+   (4288/4160/3712) are too similar to reflect a genuine effort gradient. The
+   parameter may primarily affect OTHER task types (math, code, logic puzzles)
+   where reasoning depth is more variable.
+
+2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
+   may spend more of its reasoning on VERIFYING whether findings are genuine
+   before including them — similar to the extreme selectivity observed in
+   Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
+   would explain fewer findings despite theoretically "trying harder."
+
+3. **The parameter has minimal practical effect for this model version.**
+   The differences (33 vs 30 vs 30) are within normal stochastic variation.
+   Repeated runs at the same effort level might show similar variance.
+
+**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
+accelerated processing, but doesn't explain the reasoning token difference.**
+
+**Comparison to previous findings:**
+In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
+for 3 findings — extreme verification behavior. Here, at default effort on a
+different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings.
+This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
+behavior than the reasoning_effort parameter. The invariant violation prompt
+triggered deep verification; the assumption-finding prompt triggers broad
+exploration regardless of effort setting.
+
+**Practical implication:**
+For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
+the reasoning_effort parameter appears to have negligible practical effect on
+GPT-5. Don't bother tuning it for these tasks — the default is fine. The
+parameter may be more meaningful for:
+- Tasks with verifiable correct answers (math, logic)
+- Tasks where the model could short-circuit (simple questions)
+- Extremely long documents where exploration budget matters
+
+For architecture review specifically: reasoning_effort is NOT a useful lever.
+Task framing (the prompt structure) and document selection remain the dominant
+variables for output quality. Save reasoning_effort tuning for coding/math tasks
+where the parameter was likely trained and evaluated.
+
+**Open question:** Would running the same experiment 5x at each level show that
+the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
+effectively a no-op for analytical prompts. If not, low-effort consistently
+produces more (less filtered) output, which could be useful for brainstorming-
+style analysis where you want maximum coverage before manual triage.