6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
126 lines
6.4 KiB
Markdown
126 lines
6.4 KiB
Markdown
# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
|
|
|
|
**Date:** 2026-05-04
|
|
**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
|
|
— a well-structured state machine specification covering order lifecycle, fill precedence,
|
|
TIF semantics, and parameter resolution.
|
|
**How we used them:** Same document, same prompt, same model (GPT-5), same
|
|
max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
|
|
"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
|
|
endpoint). No tools, no project context beyond the document.
|
|
|
|
| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
|
|
|---|---|---|---|---|
|
|
| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
|
|
| Medium | 94,824 | 7,112 | 4,160 | 30 |
|
|
| High | 88,607 | 6,891 | 3,712 | 30 |
|
|
|
|
**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
|
|
FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
|
|
pattern (high effort → more reasoning → more depth) was inverted.
|
|
|
|
**Per-finding metrics (remarkably consistent):**
|
|
|
|
| Effort | Output tokens/finding | Reasoning tokens/finding |
|
|
|---|---|---|
|
|
| Low | 232 | 129 |
|
|
| Medium | 237 | 138 |
|
|
| High | 229 | 123 |
|
|
|
|
The depth per finding was nearly identical across all three levels. The models
|
|
didn't get more detailed or rigorous per-finding at higher effort — they just
|
|
found slightly fewer things.
|
|
|
|
**Severity distributions (similar across all three):**
|
|
- Low: 7 Critical, 21 High, 5 Medium (33 findings)
|
|
- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
|
|
- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
|
|
|
|
**Qualitative differences — WHAT they found:**
|
|
|
|
High-effort unique findings (not in low):
|
|
- Single-writer authority to broker (no out-of-band modifications)
|
|
- Broker emits fills for all executed quantities (no silent netting)
|
|
- Instrument identity remains stable across corporate actions
|
|
- Late-fill override won't violate downstream invariants
|
|
- Validation covers lot sizes, price ticks, borrow/locate constraints
|
|
- Multiple accounts and venues are part of the correlation key
|
|
- Streaming and polling APIs are consistent
|
|
- System can handle multi-leg instruments
|
|
|
|
Low-effort unique findings (not in high):
|
|
- Acks arrive before fills (no pre-ack fills)
|
|
- Cancel-before-ack handling (submitted → cancelled missing)
|
|
- Fill totals never exceed requested quantity
|
|
- Deterministic ordering within a broker stream
|
|
- Exercise/assignment and non-order position changes
|
|
- Client-side idempotency of "place order"
|
|
- Partial accept/normalize on replace
|
|
- No "child" order fragmentation at broker
|
|
- Submitted state can receive terminal events
|
|
- Late cancel vs local expired mismatch
|
|
|
|
**Character of the differences:**
|
|
- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
|
|
instruments, streaming vs polling consistency, downstream invariant violations,
|
|
corporate actions). These require reasoning about the system's relationship
|
|
to the broader world.
|
|
- LOW-unique findings tend to be more **implementation-specific edge cases**
|
|
(cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
|
|
These require reasoning about specific event interleavings and protocol details.
|
|
|
|
Both sets are valid and actionable. Neither is clearly "better." They represent
|
|
different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
|
|
|
|
**Key insight — reasoning_effort doesn't scale analysis linearly:**
|
|
|
|
Three possible explanations for the inverted behavior:
|
|
|
|
1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
|
|
of the effort parameter.** The ~4K reasoning tokens across all three levels
|
|
(4288/4160/3712) are too similar to reflect a genuine effort gradient. The
|
|
parameter may primarily affect OTHER task types (math, code, logic puzzles)
|
|
where reasoning depth is more variable.
|
|
|
|
2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
|
|
may spend more of its reasoning on VERIFYING whether findings are genuine
|
|
before including them — similar to the extreme selectivity observed in
|
|
Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
|
|
would explain fewer findings despite theoretically "trying harder."
|
|
|
|
3. **The parameter has minimal practical effect for this model version.**
|
|
The differences (33 vs 30 vs 30) are within normal stochastic variation.
|
|
Repeated runs at the same effort level might show similar variance.
|
|
|
|
**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
|
|
accelerated processing, but doesn't explain the reasoning token difference.**
|
|
|
|
**Comparison to previous findings:**
|
|
In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
|
|
for 3 findings — extreme verification behavior. Here, at default effort on a
|
|
different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings.
|
|
This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
|
|
behavior than the reasoning_effort parameter. The invariant violation prompt
|
|
triggered deep verification; the assumption-finding prompt triggers broad
|
|
exploration regardless of effort setting.
|
|
|
|
**Practical implication:**
|
|
For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
|
|
the reasoning_effort parameter appears to have negligible practical effect on
|
|
GPT-5. Don't bother tuning it for these tasks — the default is fine. The
|
|
parameter may be more meaningful for:
|
|
- Tasks with verifiable correct answers (math, logic)
|
|
- Tasks where the model could short-circuit (simple questions)
|
|
- Extremely long documents where exploration budget matters
|
|
|
|
For architecture review specifically: reasoning_effort is NOT a useful lever.
|
|
Task framing (the prompt structure) and document selection remain the dominant
|
|
variables for output quality. Save reasoning_effort tuning for coding/math tasks
|
|
where the parameter was likely trained and evaluated.
|
|
|
|
**Open question:** Would running the same experiment 5x at each level show that
|
|
the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
|
|
effectively a no-op for analytical prompts. If not, low-effort consistently
|
|
produces more (less filtered) output, which could be useful for brainstorming-
|
|
style analysis where you want maximum coverage before manual triage.
|