model-research/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md

# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis

**Date:** 2026-05-04
**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
— a well-structured state machine specification covering order lifecycle, fill precedence,
TIF semantics, and parameter resolution.
**How we used them:** Same document, same prompt, same model (GPT-5), same
max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
endpoint). No tools, no project context beyond the document.

| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
| Medium | 94,824 | 7,112 | 4,160 | 30 |
| High | 88,607 | 6,891 | 3,712 | 30 |

**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
pattern (high effort → more reasoning → more depth) was inverted.

**Per-finding metrics (remarkably consistent):**

| Effort | Output tokens/finding | Reasoning tokens/finding |
|---|---|---|
| Low | 232 | 129 |
| Medium | 237 | 138 |
| High | 229 | 123 |

The depth per finding was nearly identical across all three levels. The models
didn't get more detailed or rigorous per-finding at higher effort — they just
found slightly fewer things.

**Severity distributions (similar across all three):**
- Low: 7 Critical, 21 High, 5 Medium (33 findings)
- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)

**Qualitative differences — WHAT they found:**

High-effort unique findings (not in low):
- Single-writer authority to broker (no out-of-band modifications)
- Broker emits fills for all executed quantities (no silent netting)
- Instrument identity remains stable across corporate actions
- Late-fill override won't violate downstream invariants
- Validation covers lot sizes, price ticks, borrow/locate constraints
- Multiple accounts and venues are part of the correlation key
- Streaming and polling APIs are consistent
- System can handle multi-leg instruments

Low-effort unique findings (not in high):
- Acks arrive before fills (no pre-ack fills)
- Cancel-before-ack handling (submitted → cancelled missing)
- Fill totals never exceed requested quantity
- Deterministic ordering within a broker stream
- Exercise/assignment and non-order position changes
- Client-side idempotency of "place order"
- Partial accept/normalize on replace
- No "child" order fragmentation at broker
- Submitted state can receive terminal events
- Late cancel vs local expired mismatch

**Character of the differences:**
- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
  instruments, streaming vs polling consistency, downstream invariant violations,
  corporate actions). These require reasoning about the system's relationship
  to the broader world.
- LOW-unique findings tend to be more **implementation-specific edge cases**
  (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
  These require reasoning about specific event interleavings and protocol details.

Both sets are valid and actionable. Neither is clearly "better." They represent
different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).

**Key insight — reasoning_effort doesn't scale analysis linearly:**

Three possible explanations for the inverted behavior:

1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
   of the effort parameter.** The ~4K reasoning tokens across all three levels
   (4288/4160/3712) are too similar to reflect a genuine effort gradient. The
   parameter may primarily affect OTHER task types (math, code, logic puzzles)
   where reasoning depth is more variable.

2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
   may spend more of its reasoning on VERIFYING whether findings are genuine
   before including them — similar to the extreme selectivity observed in
   Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
   would explain fewer findings despite theoretically "trying harder."

3. **The parameter has minimal practical effect for this model version.**
   The differences (33 vs 30 vs 30) are within normal stochastic variation.
   Repeated runs at the same effort level might show similar variance.

**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
accelerated processing, but doesn't explain the reasoning token difference.**

**Comparison to previous findings:**
In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
for 3 findings — extreme verification behavior. Here, at default effort on a
different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings.
This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
behavior than the reasoning_effort parameter. The invariant violation prompt
triggered deep verification; the assumption-finding prompt triggers broad
exploration regardless of effort setting.

**Practical implication:**
For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
the reasoning_effort parameter appears to have negligible practical effect on
GPT-5. Don't bother tuning it for these tasks — the default is fine. The
parameter may be more meaningful for:
- Tasks with verifiable correct answers (math, logic)
- Tasks where the model could short-circuit (simple questions)
- Extremely long documents where exploration budget matters

For architecture review specifically: reasoning_effort is NOT a useful lever.
Task framing (the prompt structure) and document selection remain the dominant
variables for output quality. Save reasoning_effort tuning for coding/math tasks
where the parameter was likely trained and evaluated.

**Open question:** Would running the same experiment 5x at each level show that
the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
effectively a no-op for analytical prompts. If not, low-effort consistently
produces more (less filtered) output, which could be useful for brainstorming-
style analysis where you want maximum coverage before manual triage.