mark adversarial ensemble question as answered (finding #35)
+9
-3
@@ -29,10 +29,16 @@ consistency" framing instead of broad review? The hypothesis: Sonnet's
 
 ## Medium Priority
 
-### Adversarial analysis ensemble (from Finding #29)
-Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
+### ~~Adversarial analysis ensemble (from Finding #29)~~ → ANSWERED (Finding #35)
+~~Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
 and ask it to critique and extend. Does the ensemble find more than either
-alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
+alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?~~
+
+**YES.** Ensemble produces 56 findings vs 43 (GPT-5) or 28 (Opus) alone (30%
+improvement). Zero full disagreements — critique phase calibrates severity
+without discarding. Extension phase adds 13 genuinely new findings (4 High).
+The critique's structured assessment is more valuable than raw extensions.
+Cost: ~28% more tokens for 30% more coverage + prioritization.
 
 ### Reasoning effort parameter (from Finding #21)
 Reasoning effort (low/medium/high) had negligible effect on GPT-5's