refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
This commit is contained in:
@@ -0,0 +1,125 @@
|
||||
# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
|
||||
— a well-structured state machine specification covering order lifecycle, fill precedence,
|
||||
TIF semantics, and parameter resolution.
|
||||
**How we used them:** Same document, same prompt, same model (GPT-5), same
|
||||
max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
|
||||
"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
|
||||
endpoint). No tools, no project context beyond the document.
|
||||
|
||||
| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
|
||||
|---|---|---|---|---|
|
||||
| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
|
||||
| Medium | 94,824 | 7,112 | 4,160 | 30 |
|
||||
| High | 88,607 | 6,891 | 3,712 | 30 |
|
||||
|
||||
**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
|
||||
FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
|
||||
pattern (high effort → more reasoning → more depth) was inverted.
|
||||
|
||||
**Per-finding metrics (remarkably consistent):**
|
||||
|
||||
| Effort | Output tokens/finding | Reasoning tokens/finding |
|
||||
|---|---|---|
|
||||
| Low | 232 | 129 |
|
||||
| Medium | 237 | 138 |
|
||||
| High | 229 | 123 |
|
||||
|
||||
The depth per finding was nearly identical across all three levels. The models
|
||||
didn't get more detailed or rigorous per-finding at higher effort — they just
|
||||
found slightly fewer things.
|
||||
|
||||
**Severity distributions (similar across all three):**
|
||||
- Low: 7 Critical, 21 High, 5 Medium (33 findings)
|
||||
- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
|
||||
- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
|
||||
|
||||
**Qualitative differences — WHAT they found:**
|
||||
|
||||
High-effort unique findings (not in low):
|
||||
- Single-writer authority to broker (no out-of-band modifications)
|
||||
- Broker emits fills for all executed quantities (no silent netting)
|
||||
- Instrument identity remains stable across corporate actions
|
||||
- Late-fill override won't violate downstream invariants
|
||||
- Validation covers lot sizes, price ticks, borrow/locate constraints
|
||||
- Multiple accounts and venues are part of the correlation key
|
||||
- Streaming and polling APIs are consistent
|
||||
- System can handle multi-leg instruments
|
||||
|
||||
Low-effort unique findings (not in high):
|
||||
- Acks arrive before fills (no pre-ack fills)
|
||||
- Cancel-before-ack handling (submitted → cancelled missing)
|
||||
- Fill totals never exceed requested quantity
|
||||
- Deterministic ordering within a broker stream
|
||||
- Exercise/assignment and non-order position changes
|
||||
- Client-side idempotency of "place order"
|
||||
- Partial accept/normalize on replace
|
||||
- No "child" order fragmentation at broker
|
||||
- Submitted state can receive terminal events
|
||||
- Late cancel vs local expired mismatch
|
||||
|
||||
**Character of the differences:**
|
||||
- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
|
||||
instruments, streaming vs polling consistency, downstream invariant violations,
|
||||
corporate actions). These require reasoning about the system's relationship
|
||||
to the broader world.
|
||||
- LOW-unique findings tend to be more **implementation-specific edge cases**
|
||||
(cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
|
||||
These require reasoning about specific event interleavings and protocol details.
|
||||
|
||||
Both sets are valid and actionable. Neither is clearly "better." They represent
|
||||
different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
|
||||
|
||||
**Key insight — reasoning_effort doesn't scale analysis linearly:**
|
||||
|
||||
Three possible explanations for the inverted behavior:
|
||||
|
||||
1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
|
||||
of the effort parameter.** The ~4K reasoning tokens across all three levels
|
||||
(4288/4160/3712) are too similar to reflect a genuine effort gradient. The
|
||||
parameter may primarily affect OTHER task types (math, code, logic puzzles)
|
||||
where reasoning depth is more variable.
|
||||
|
||||
2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
|
||||
may spend more of its reasoning on VERIFYING whether findings are genuine
|
||||
before including them — similar to the extreme selectivity observed in
|
||||
Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
|
||||
would explain fewer findings despite theoretically "trying harder."
|
||||
|
||||
3. **The parameter has minimal practical effect for this model version.**
|
||||
The differences (33 vs 30 vs 30) are within normal stochastic variation.
|
||||
Repeated runs at the same effort level might show similar variance.
|
||||
|
||||
**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
|
||||
accelerated processing, but doesn't explain the reasoning token difference.**
|
||||
|
||||
**Comparison to previous findings:**
|
||||
In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
|
||||
for 3 findings — extreme verification behavior. Here, at default effort on a
|
||||
different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings.
|
||||
This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
|
||||
behavior than the reasoning_effort parameter. The invariant violation prompt
|
||||
triggered deep verification; the assumption-finding prompt triggers broad
|
||||
exploration regardless of effort setting.
|
||||
|
||||
**Practical implication:**
|
||||
For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
|
||||
the reasoning_effort parameter appears to have negligible practical effect on
|
||||
GPT-5. Don't bother tuning it for these tasks — the default is fine. The
|
||||
parameter may be more meaningful for:
|
||||
- Tasks with verifiable correct answers (math, logic)
|
||||
- Tasks where the model could short-circuit (simple questions)
|
||||
- Extremely long documents where exploration budget matters
|
||||
|
||||
For architecture review specifically: reasoning_effort is NOT a useful lever.
|
||||
Task framing (the prompt structure) and document selection remain the dominant
|
||||
variables for output quality. Save reasoning_effort tuning for coding/math tasks
|
||||
where the parameter was likely trained and evaluated.
|
||||
|
||||
**Open question:** Would running the same experiment 5x at each level show that
|
||||
the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
|
||||
effectively a no-op for analytical prompts. If not, low-effort consistently
|
||||
produces more (less filtered) output, which could be useful for brainstorming-
|
||||
style analysis where you want maximum coverage before manual triage.
|
||||
Reference in New Issue
Block a user