Files
model-research/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md
T
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

6.4 KiB

Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis

Date: 2026-05-04 Task: Identify hidden assumptions in gargoyle's order-state-machine.md (221 lines) — a well-structured state machine specification covering order lifecycle, fill precedence, TIF semantics, and parameter resolution. How we used them: Same document, same prompt, same model (GPT-5), same max_completion_tokens (16K). Only variable: reasoning.effort parameter set to "low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document.

Effort Time (ms) Output tokens Reasoning tokens Findings
Low 97,913 7,657 4,288 33 (+11 recs)
Medium 94,824 7,112 4,160 30
High 88,607 6,891 3,712 30

The counterintuitive result: Higher reasoning effort produced FEWER findings, FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected pattern (high effort → more reasoning → more depth) was inverted.

Per-finding metrics (remarkably consistent):

Effort Output tokens/finding Reasoning tokens/finding
Low 232 129
Medium 237 138
High 229 123

The depth per finding was nearly identical across all three levels. The models didn't get more detailed or rigorous per-finding at higher effort — they just found slightly fewer things.

Severity distributions (similar across all three):

  • Low: 7 Critical, 21 High, 5 Medium (33 findings)
  • Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
  • High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)

Qualitative differences — WHAT they found:

High-effort unique findings (not in low):

  • Single-writer authority to broker (no out-of-band modifications)
  • Broker emits fills for all executed quantities (no silent netting)
  • Instrument identity remains stable across corporate actions
  • Late-fill override won't violate downstream invariants
  • Validation covers lot sizes, price ticks, borrow/locate constraints
  • Multiple accounts and venues are part of the correlation key
  • Streaming and polling APIs are consistent
  • System can handle multi-leg instruments

Low-effort unique findings (not in high):

  • Acks arrive before fills (no pre-ack fills)
  • Cancel-before-ack handling (submitted → cancelled missing)
  • Fill totals never exceed requested quantity
  • Deterministic ordering within a broker stream
  • Exercise/assignment and non-order position changes
  • Client-side idempotency of "place order"
  • Partial accept/normalize on replace
  • No "child" order fragmentation at broker
  • Submitted state can receive terminal events
  • Late cancel vs local expired mismatch

Character of the differences:

  • HIGH-unique findings tend to be more architectural/systemic (multi-leg instruments, streaming vs polling consistency, downstream invariant violations, corporate actions). These require reasoning about the system's relationship to the broader world.
  • LOW-unique findings tend to be more implementation-specific edge cases (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts). These require reasoning about specific event interleavings and protocol details.

Both sets are valid and actionable. Neither is clearly "better." They represent different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).

Key insight — reasoning_effort doesn't scale analysis linearly:

Three possible explanations for the inverted behavior:

  1. GPT-5 already uses near-maximum reasoning for analytical tasks regardless of the effort parameter. The ~4K reasoning tokens across all three levels (4288/4160/3712) are too similar to reflect a genuine effort gradient. The parameter may primarily affect OTHER task types (math, code, logic puzzles) where reasoning depth is more variable.

  2. Higher effort increases FILTERING, not exploration. At high effort, GPT-5 may spend more of its reasoning on VERIFYING whether findings are genuine before including them — similar to the extreme selectivity observed in Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This would explain fewer findings despite theoretically "trying harder."

  3. The parameter has minimal practical effect for this model version. The differences (33 vs 30 vs 30) are within normal stochastic variation. Repeated runs at the same effort level might show similar variance.

The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly accelerated processing, but doesn't explain the reasoning token difference.

Comparison to previous findings: In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens for 3 findings — extreme verification behavior. Here, at default effort on a different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings. This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning behavior than the reasoning_effort parameter. The invariant violation prompt triggered deep verification; the assumption-finding prompt triggers broad exploration regardless of effort setting.

Practical implication: For open-ended analytical tasks (assumption-finding, gap analysis, spec review), the reasoning_effort parameter appears to have negligible practical effect on GPT-5. Don't bother tuning it for these tasks — the default is fine. The parameter may be more meaningful for:

  • Tasks with verifiable correct answers (math, logic)
  • Tasks where the model could short-circuit (simple questions)
  • Extremely long documents where exploration budget matters

For architecture review specifically: reasoning_effort is NOT a useful lever. Task framing (the prompt structure) and document selection remain the dominant variables for output quality. Save reasoning_effort tuning for coding/math tasks where the parameter was likely trained and evaluated.

Open question: Would running the same experiment 5x at each level show that the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is effectively a no-op for analytical prompts. If not, low-effort consistently produces more (less filtered) output, which could be useful for brainstorming- style analysis where you want maximum coverage before manual triage.