Add finding #25: Data integrity analysis on audit-log.md
New task type testing distributed systems consistency analysis. GPT-5 found 18 issues (with 4,416 reasoning tokens), Sonnet found 13. Key insight: distributed systems reasoning benefits from extended reasoning - Sonnet reached 72% of GPT-5's count, which falls between race condition analysis (58%) and assumption-finding (85%).
@@ -0,0 +1,110 @@
# Finding: Audit Log Data Integrity Analysis — GPT-5 excels at distributed systems reasoning; Sonnet identifies core issues but lacks depth

**Date:** 2026-05-11
**Task:** Identify data integrity violations in gargoyle's `audit-log.md` (170 lines) — scenarios where the audit log could become inconsistent, lose entries, or fail to be the authoritative record it claims to be.
**New task type:** Data integrity analysis — focused on distributed systems concerns (write ordering, referential integrity, consistency windows, recovery correctness, concurrent access hazards).

**How we used them:** Same document (full text) + same focused analytical question to both models via HAI proxy. Structured prompt with 5 categories and required output format. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 134s | 7,274 | 4,416 | 18 (+ 1 design note) |
| Sonnet 4.6 | 26s | 1,792 | (internal) | 13 |
## What they found — common ground (both identified):

- **Portfolio Risk outcome visible before decision record** — replication lag or write ordering can cause PR outcome to appear before the decision exists
- **Multi-aggregator signal processing race** — same signal appears under multiple decisions in arbitrary timestamp order
- **Orphaned decision references** — atomic write fails partway, leaving partial or phantom decision records
- **Missing signal risk rejections** — write failure leaves gap where signal appears to bypass controls
- **Signal expiration race window** — signals marked as "expired" that actually contributed
- **Portfolio Risk evaluation gap** — trades execute but decision shows no risk evaluation for a window
- **Duplicate signal processing** — network retries cause same signal to contribute to multiple decisions
- **Portfolio Risk duplicate evaluation** — message duplication causes conflicting outcomes (approved AND rejected)
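Several of the shared findings reduce to invariants that can be checked mechanically over the log. The following is a minimal sketch only: the entry shape (`seq`, `kind`, `decision_id`, `outcome`) and the function name `check_log` are assumptions made for illustration and are not taken from `audit-log.md`. It flags PR outcomes with no decision record, PR outcomes that precede their decision, and decisions with conflicting outcomes.

```python
from collections import defaultdict


def check_log(entries):
    """Return sorted (problem, decision_id) tuples for a list of log entries.

    Each entry is a dict with hypothetical keys:
      'seq'         - position of the entry in the log
      'kind'        - 'decision' or 'pr_outcome'
      'decision_id' - the decision the entry belongs to
      'outcome'     - only for PR outcomes, e.g. 'approved' or 'rejected'
    """
    decision_seq = {}                # decision_id -> seq of first decision record
    outcomes = defaultdict(set)      # decision_id -> distinct PR outcomes seen
    pr_seqs = defaultdict(list)      # decision_id -> seqs of PR outcome entries

    for e in entries:
        if e["kind"] == "decision":
            decision_seq.setdefault(e["decision_id"], e["seq"])
        elif e["kind"] == "pr_outcome":
            outcomes[e["decision_id"]].add(e["outcome"])
            pr_seqs[e["decision_id"]].append(e["seq"])

    problems = []
    for did, seqs in pr_seqs.items():
        if did not in decision_seq:
            # PR outcome references a decision that was never recorded.
            problems.append(("orphaned_pr_outcome", did))
        elif min(seqs) < decision_seq[did]:
            # PR outcome became visible before the decision record.
            problems.append(("pr_outcome_before_decision", did))
        if len(outcomes[did]) > 1:
            # Duplicate evaluation produced both approved and rejected.
            problems.append(("conflicting_pr_outcomes", did))
    return sorted(problems)


sample = [
    {"seq": 1, "kind": "pr_outcome", "decision_id": "d1", "outcome": "approved"},
    {"seq": 2, "kind": "decision", "decision_id": "d1"},
    {"seq": 3, "kind": "pr_outcome", "decision_id": "d2", "outcome": "approved"},
    {"seq": 4, "kind": "decision", "decision_id": "d3"},
    {"seq": 5, "kind": "pr_outcome", "decision_id": "d3", "outcome": "approved"},
    {"seq": 6, "kind": "pr_outcome", "decision_id": "d3", "outcome": "rejected"},
]
report = check_log(sample)
```

A checker like this only detects violations after the fact; it says nothing about which write path caused them.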
## GPT-5 unique findings (not in Sonnet):

1. **Cross-service clock skew** — queries "ordered by time" mix stages incorrectly when clocks drift; SLA measurements and timelines become meaningless
2. **Large atomic batches not truly atomic across partitions** — distributed storage with sharding breaks atomicity guarantee; partial batches visible
3. **Decision_id collisions across aggregators** — without globally unique scheme, different decisions can share IDs; referential integrity collapses
4. **Duplicate PR outcomes from retries** — at-least-once delivery without idempotency creates conflicting terminal states
5. **Correction entries referencing missing entries** — correction chain breaks if original was lost
6. **Expired signal entries with bad signal_ids** — orphan "expired" rows with no other entries
7. **Read-replica lag windows** — transiently incomplete views depending on which replica is hit
8. **Corrections append-only interim truth problem** — queries between error and correction see wrong state
9. **Permanent holes when store unavailable** — no backfill = permanent gaps in "authoritative" record
10. **Approved logged but no order sent** — crash between PR write and OM handoff = factually wrong audit
11. **Aggregator duplicates decision after crash** — input replay creates duplicate or mutated decisions
12. **Conflicting terminal outcomes from concurrent PR paths** — multiple controls race, no finality rule
13. **At-least-once writers without idempotency keys** — duplicates inflate counts and confuse traces
14. **Two aggregators both own same decision** — split-brain creates conflicting decisions for same opportunity
15. **Mixed writers without transactional boundaries** — external Risk writes interleave with DE writes
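Findings 4 and 13 both hinge on retried writes that carry no idempotency key. As an illustration only (`audit-log.md` is not known to specify any such mechanism, and `IdempotentLog`, `ConflictingWrite`, and the key format are hypothetical), an append-only writer could deduplicate retries by a caller-supplied key and surface a retry whose payload differs instead of silently recording a second terminal state:

```python
class ConflictingWrite(Exception):
    """Raised when an idempotency key is reused with a different payload."""


class IdempotentLog:
    """Toy append-only log that accepts each idempotency key at most once."""

    def __init__(self):
        self.entries = []   # append-only list of (key, payload) tuples
        self._by_key = {}   # idempotency key -> index into self.entries

    def append(self, key, payload):
        """Append payload once per key and return the entry's index.

        A retry with the same key and payload is a no-op that returns the
        original index. A retry with a different payload raises.
        """
        if key in self._by_key:
            idx = self._by_key[key]
            if self.entries[idx][1] != payload:
                raise ConflictingWrite(key)
            return idx
        self.entries.append((key, payload))
        self._by_key[key] = len(self.entries) - 1
        return self._by_key[key]


log = IdempotentLog()
first = log.append("pr-outcome:d1", {"outcome": "approved"})
retry = log.append("pr-outcome:d1", {"outcome": "approved"})  # network retry

try:
    log.append("pr-outcome:d1", {"outcome": "rejected"})
    conflict_detected = False
except ConflictingWrite:
    conflict_detected = True
```

This sketch keeps the key index in memory; a real store would need the uniqueness check to be atomic with the append, which is exactly the kind of guarantee the findings question.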
## Sonnet unique findings (not in GPT-5):

1. **Partial recovery with ID sequence reset** — crash during write + checkpoint restart can cause ID reuse, creating duplicate IDs for different decisions (GPT-5 addressed this via different framing in #14)
2. **Inconsistent recovery state** — signal both rejected AND contributing after replay (GPT-5's #14 is similar but framed differently)
3. **Concurrent decision ID assignment** — ID service race returns same ID to multiple aggregators (GPT-5's #6 is similar)
## Quality assessment:

**GPT-5** was significantly more thorough and demonstrated deeper distributed systems expertise. Key observations:

- Found **18 distinct issues** with detailed sequences and precise impact analysis
- Identified issues Sonnet missed entirely: clock skew, replica lag windows, correction chain integrity, permanent holes semantics, the "approved but not sent" crash window
- Each finding named specific components and described exact interleaving scenarios
- Added a "modeling gap" note about signal_id/decision_id query asymmetry that isn't strictly a violation but creates incomplete narratives
- Output was about 4x longer (7,274 vs 1,792 output tokens; GPT-5's figure includes its 4,416 reasoning tokens) with substantially more depth per finding

**Sonnet 4.6** identified the core issues but with less depth:

- Found **13 issues** — 72% of GPT-5's count
- Many findings overlapped with GPT-5 but with less precise sequences
- Some findings were near-duplicates of each other under different category headings
- Missed the clock skew, replica lag, correction chain, and permanent-hole issues
- Completed in 26s (5x faster) — useful for a quick first pass but not comprehensive
## Key insight — distributed systems reasoning benefits significantly from reasoning tokens:

This experiment tested a new task type: analyzing an architecture document for distributed systems consistency issues. This requires reasoning about:
- Message ordering across services
- Crash-recovery semantics
- Replication lag visibility windows
- Idempotency and exactly-once delivery
- Atomic write guarantees across storage boundaries

GPT-5's 4,416 reasoning tokens enabled it to trace through complex multi-step scenarios (e.g., "aggregator writes, PR evaluates, replica lags, query hits stale replica" as a 4-step sequence). Sonnet's findings were shallower — it identified the category of issue but often didn't trace through the full causal chain.

This is consistent with Finding #13 (race condition identification), where Sonnet struggled with temporal/sequential reasoning. Distributed systems integrity analysis is essentially "race conditions at architecture scale": the temporal reasoning gap Sonnet showed at the code level also shows up at the system design level.
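The four-step sequence quoted above ("aggregator writes, PR evaluates, replica lags, query hits stale replica") can be reproduced with a toy model. This is a sketch under stated assumptions: the `Replica` class, the out-of-order delivery, and the entry fields are invented for illustration and do not describe gargoyle's actual storage.

```python
class Replica:
    """Toy read replica that applies entries in whatever order they arrive."""

    def __init__(self):
        self.applied = []

    def apply(self, entry):
        self.applied.append(entry)

    def query(self, decision_id):
        """Return the kinds of entries visible for one decision, in applied order."""
        return [e["kind"] for e in self.applied if e["decision_id"] == decision_id]


primary = []

# Step 1: the aggregator writes the decision record to the primary.
primary.append({"kind": "decision", "decision_id": "d1"})

# Step 2: Portfolio Risk evaluates and writes its outcome to the primary.
primary.append({"kind": "pr_outcome", "decision_id": "d1"})

replica = Replica()

# Step 3: the replica lags. Here the PR outcome arrives first (assumed to
# travel via a different partition) while the decision is still in flight.
replica.apply(primary[1])

# Step 4: a query hits the stale replica and sees an outcome with no decision.
stale_view = replica.query("d1")

# Once the replica catches up, both entries are visible, but in the wrong order.
replica.apply(primary[0])
caught_up_view = replica.query("d1")
```

The primary never violated write ordering in this model; the inconsistency exists only in what a reader of the replica can observe during the lag window.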
## Comparison to previous findings:

| Task type | GPT-5 | Sonnet | Ratio | Notes |
|---|---|---|---|---|
| Assumption-finding (#12) | 20 | 17 | 85% | Sonnet's best relative performance |
| Cross-component interaction (#14) | 10 | 8 | 80% | Structured prompt helped Sonnet |
| Race condition identification (#13) | 12 | 7 | 58% | Sonnet struggled with concurrency |
| **Data integrity analysis** | 18 | 13 | 72% | New task type, between extremes |

Data integrity analysis is between "cross-component interaction" (where Sonnet does well) and "race condition identification" (where Sonnet struggles). The task combines both: understanding component interactions (Sonnet's strength) but also reasoning through temporal/ordering scenarios (Sonnet's weakness).
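The Ratio column is Sonnet's finding count divided by GPT-5's, rounded to a whole percent. A short sketch that recomputes it from the counts in the table (the variable and function names are illustrative):

```python
# (GPT-5 findings, Sonnet findings) per task type, copied from the table.
counts = {
    "Assumption-finding (#12)": (20, 17),
    "Cross-component interaction (#14)": (10, 8),
    "Race condition identification (#13)": (12, 7),
    "Data integrity analysis": (18, 13),
}


def sonnet_ratio(gpt5, sonnet):
    """Sonnet's count as a whole-number percentage of GPT-5's count."""
    return round(100 * sonnet / gpt5)


ratios = {task: sonnet_ratio(*pair) for task, pair in counts.items()}

# Task types ordered from Sonnet's weakest to strongest relative showing.
ranked = sorted(ratios, key=ratios.get)
```

Raw counts do not account for overlap or for the depth of each finding, so the ratio is a coarse indicator rather than a quality score.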
## Practical implications:

1. **For distributed systems design review:** Use GPT-5. The depth of analysis on issues like "permanent holes," "correction chain integrity," and "replica lag windows" provides genuine value that Sonnet misses.

2. **For quick sanity checks:** Sonnet is viable — it catches the obvious issues (orphaned references, duplicate processing) in 1/5 the time. But don't rely on it for thoroughness.

3. **Task framing helps but doesn't close the gap:** The structured prompt (5 categories, required output format) helped both models produce organized output. But unlike Finding #14 where structure helped Sonnet recover to 80% of GPT-5's count, here structure only got Sonnet to 72%. The task itself is inherently harder for non-reasoning models.

4. **New task type confirmed:** "Data integrity analysis" is a distinct analytical lens useful for architecture review. It complements assumption-finding (what must be true) and race condition analysis (what can interleave) with a focus on what can become inconsistent.
## Cost-effectiveness:

- GPT-5: 134s, ~8.5K total tokens (1.3K prompt + 7.2K completion including 4.4K reasoning)
- Sonnet: 26s, ~3.2K total tokens (1.4K prompt + 1.8K completion)

GPT-5 found 5 issues Sonnet missed entirely. At ~2.7x token cost and 5x time cost, this is justified for architecture review of data-critical systems where consistency violations have financial/regulatory impact.
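The multipliers follow directly from the approximate totals listed above. The sketch below recomputes them and adds one derived figure that is not stated in the original: the marginal cost per additional finding, assuming the 5 extra findings are the only benefit being counted.

```python
# Approximate totals from the bullet list above.
gpt5 = {"seconds": 134, "tokens": 8_500, "findings": 18}
sonnet = {"seconds": 26, "tokens": 3_200, "findings": 13}

token_ratio = round(gpt5["tokens"] / sonnet["tokens"], 1)   # token cost multiplier
time_ratio = round(gpt5["seconds"] / sonnet["seconds"], 1)  # wall-clock multiplier

extra_findings = gpt5["findings"] - sonnet["findings"]

# Marginal cost of each finding GPT-5 produced beyond Sonnet's count.
tokens_per_extra = (gpt5["tokens"] - sonnet["tokens"]) / extra_findings
seconds_per_extra = (gpt5["seconds"] - sonnet["seconds"]) / extra_findings
```

Token counts are a proxy for price only when both models are billed at the same per-token rate, which this document does not state.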
## Document analyzed:

gargoyle's `docs/domain/contexts/decision-engine/audit-log.md` (170 lines) — describes the Decision Engine's append-only audit trail for signals, decisions, and risk evaluations. Claims to be "authoritative record" with "immutability" guarantees.