model-research/findings/2026-05-11-audit-log-data-integrity-analysis.md
Rodin 2ca8c974f3 Add finding #25: Data integrity analysis on audit-log.md
New task type testing distributed systems consistency analysis.
GPT-5 found 18 issues (with 4,416 reasoning tokens), Sonnet found 13.
Key insight: distributed systems reasoning benefits from extended
reasoning - Sonnet at 72% of GPT-5 count, similar to race condition
analysis (58%) and worse than assumption-finding (85%).
2026-05-11 08:49:32 -07:00

Finding: Audit Log Data Integrity Analysis — GPT-5 excels at distributed systems reasoning; Sonnet identifies core issues but lacks depth

Date: 2026-05-11

Task: Identify data integrity violations in gargoyle's audit-log.md (170 lines): scenarios where the audit log could become inconsistent, lose entries, or fail to be the authoritative record it claims to be.

New task type: Data integrity analysis, focused on distributed systems concerns (write ordering, referential integrity, consistency windows, recovery correctness, concurrent access hazards).

How we used them: Same document (full text) + same focused analytical question to both models via HAI proxy. Structured prompt with 5 categories and required output format. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 134s | 7,274 | 4,416 | 18 (+ 1 design note) |
| Sonnet 4.6 | 26s | 1,792 | (internal) | 13 |

What they found — common ground (both identified):

  • Portfolio Risk outcome visible before decision record — replication lag or write ordering can cause PR outcome to appear before the decision exists
  • Multi-aggregator signal processing race — same signal appears under multiple decisions in arbitrary timestamp order
  • Orphaned decision references — atomic write fails partway, leaving partial or phantom decision records
  • Missing signal risk rejections — write failure leaves gap where signal appears to bypass controls
  • Signal expiration race window — signals marked as "expired" that actually contributed
  • Portfolio Risk evaluation gap — trades execute but decision shows no risk evaluation for a window
  • Duplicate signal processing — network retries cause same signal to contribute to multiple decisions
  • Portfolio Risk duplicate evaluation — message duplication causes conflicting outcomes (approved AND rejected)
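
Several of the shared findings above reduce to referential-integrity checks that could be run against the log itself. The following is a minimal sketch, assuming a hypothetical entry shape (`kind`, `decision_id`, `outcome`) that is not taken from audit-log.md. It flags Portfolio Risk outcomes whose decision record is missing, and decisions that carry conflicting PR outcomes.

```python
from collections import defaultdict


def find_integrity_violations(entries):
    """Scan audit entries (hypothetical dict shape) for two violation types."""
    decision_ids = {e["decision_id"] for e in entries if e["kind"] == "decision"}

    # Collect the distinct PR outcomes recorded for each decision_id.
    outcomes = defaultdict(set)
    for e in entries:
        if e["kind"] == "pr_outcome":
            outcomes[e["decision_id"]].add(e["outcome"])

    # A PR outcome exists but the decision it refers to does not.
    orphaned = sorted(d for d in outcomes if d not in decision_ids)
    # The same decision carries more than one distinct terminal outcome.
    conflicting = sorted(d for d, seen in outcomes.items() if len(seen) > 1)
    return {"orphaned": orphaned, "conflicting": conflicting}


# Illustrative data only; not drawn from the analyzed system.
sample_log = [
    {"kind": "decision", "decision_id": "d1"},
    {"kind": "pr_outcome", "decision_id": "d1", "outcome": "approved"},
    {"kind": "pr_outcome", "decision_id": "d1", "outcome": "rejected"},
    {"kind": "pr_outcome", "decision_id": "d2", "outcome": "approved"},
]
violations = find_integrity_violations(sample_log)
```

On this sample, `d2` is reported as orphaned and `d1` as conflicting. A check like this only detects violations after the fact; it does not prevent the write-ordering problems that cause them.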

GPT-5 unique findings (not in Sonnet):

  1. Cross-service clock skew — queries "ordered by time" mix stages incorrectly when clocks drift; SLA measurements and timelines become meaningless
  2. Large atomic batches not truly atomic across partitions — distributed storage with sharding breaks atomicity guarantee; partial batches visible
  3. Decision_id collisions across aggregators — without globally unique scheme, different decisions can share IDs; referential integrity collapses
  4. Duplicate PR outcomes from retries — at-least-once delivery without idempotency creates conflicting terminal states
  5. Correction entries referencing missing entries — correction chain breaks if original was lost
  6. Expired signal entries with bad signal_ids — orphan "expired" rows with no other entries
  7. Read-replica lag windows — transiently incomplete views depending on which replica is hit
  8. Corrections append-only interim truth problem — queries between error and correction see wrong state
  9. Permanent holes when store unavailable — no backfill = permanent gaps in "authoritative" record
  10. Approved logged but no order sent — crash between PR write and OM handoff = factually wrong audit
  11. Aggregator duplicates decision after crash — input replay creates duplicate or mutated decisions
  12. Conflicting terminal outcomes from concurrent PR paths — multiple controls race, no finality rule
  13. At-least-once writers without idempotency keys — duplicates inflate counts and confuse traces
  14. Two aggregators both own same decision — split-brain creates conflicting decisions for same opportunity
  15. Mixed writers without transactional boundaries — external Risk writes interleave with DE writes
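
Several items in this list (duplicate PR outcomes from retries, at-least-once writers without idempotency keys) share one common mitigation: deduplicating appends by an idempotency key. The sketch below is illustrative only. The `AuditLog` class, the key format, and the in-memory storage are assumptions for the example, not the design described in audit-log.md.

```python
class AuditLog:
    """In-memory stand-in for an append-only store (illustrative only)."""

    def __init__(self):
        self.entries = []
        self._seen_keys = set()

    def append(self, idempotency_key, entry):
        """Append once per key; return False when the call is a retry duplicate."""
        if idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(idempotency_key)
        self.entries.append(entry)
        return True


log = AuditLog()
entry = {"decision_id": "d1", "outcome": "approved"}
first = log.append("pr-eval:d1", entry)
retry = log.append("pr-eval:d1", entry)  # simulated at-least-once redelivery
```

Here the retry is rejected and the log holds a single entry. In a real distributed store the key set would itself need to be durable and consistent, which is where the partitioning and split-brain findings above come back into play.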

Sonnet unique findings (not in GPT-5):

  1. Partial recovery with ID sequence reset — crash during write + checkpoint restart can cause ID reuse, creating duplicate IDs for different decisions (GPT-5 addressed this via different framing in #14)
  2. Inconsistent recovery state — signal both rejected AND contributing after replay (GPT-5's #14 is similar but framed differently)
  3. Concurrent decision ID assignment — ID service race returns same ID to multiple aggregators (GPT-5's #6 is similar)

Quality assessment:

GPT-5 was significantly more thorough and demonstrated deeper distributed systems expertise. Key observations:

  • Found 18 distinct issues with detailed sequences and precise impact analysis
  • Identified issues Sonnet missed entirely: clock skew, replica lag windows, correction chain integrity, permanent holes semantics, the "approved but not sent" crash window
  • Each finding named specific components and described exact interleaving scenarios
  • Added a "modeling gap" note about signal_id/decision_id query asymmetry that isn't strictly a violation but creates incomplete narratives
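  • Completion was about 4x larger (7,274 vs 1,792 tokens), but 4,416 of GPT-5's completion tokens were reasoning; its visible output was about 2,858 tokens (roughly 1.6x Sonnet's 1,792), still with substantially more depth per finding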

Sonnet 4.6 identified the core issues but with less depth:

  • Found 13 issues — 72% of GPT-5's count
  • Many findings overlapped with GPT-5 but with less precise sequences
  • Some findings were near-duplicates of each other under different category headings
  • Missed the clock skew, replica lag, correction chain, and permanent-hole issues
  • Completed in 26s (about 5x faster): useful for a quick first pass, but not comprehensive

Key insight — distributed systems reasoning benefits significantly from reasoning tokens:

This experiment tested a new task type: analyzing an architecture document for distributed systems consistency issues. This requires reasoning about:

  • Message ordering across services
  • Crash-recovery semantics
  • Replication lag visibility windows
  • Idempotency and exactly-once delivery
  • Atomic write guarantees across storage boundaries

GPT-5's 4,416 reasoning tokens enabled it to trace through complex multi-step scenarios (e.g., "aggregator writes, PR evaluates, replica lags, query hits stale replica" as a 4-step sequence). Sonnet's findings were shallower — it identified the category of issue but often didn't trace through the full causal chain.
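
The 4-step sequence quoted above can be reproduced with a toy model. This is a sketch under stated assumptions: `ReplicatedLog`, its manual `replicate` step, and the entry fields are hypothetical, chosen only to make the stale-read window visible.

```python
class ReplicatedLog:
    """Primary plus one lagging replica; replication is triggered manually."""

    def __init__(self):
        self.primary = []
        self.replica = []

    def write(self, entry):
        self.primary.append(entry)

    def replicate(self, n=1):
        # Copy at most n not-yet-replicated entries, preserving write order.
        start = len(self.replica)
        self.replica.extend(self.primary[start:start + n])


def decision_trace(store, decision_id):
    """Return the kinds of entries visible for one decision in a given store."""
    return [e["kind"] for e in store if e["decision_id"] == decision_id]


log = ReplicatedLog()
log.write({"kind": "decision", "decision_id": "d1"})    # 1. aggregator writes
log.write({"kind": "pr_outcome", "decision_id": "d1"})  # 2. PR evaluates
log.replicate(1)                                        # 3. replica lags by one entry
stale_view = decision_trace(log.replica, "d1")          # 4. query hits stale replica
fresh_view = decision_trace(log.primary, "d1")
```

The replica shows a decision with no risk evaluation while the primary shows both entries, which is the "evaluation gap" a reader of the audit log would observe during the lag window.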

This is consistent with Finding #13 (race condition identification), where Sonnet struggled with temporal/sequential reasoning. Distributed systems integrity analysis is essentially "race conditions at architecture scale": the weakness Sonnet shows at the code level also shows up at the system design level.

Comparison to previous findings:

| Task type | GPT-5 | Sonnet | Ratio | Notes |
|---|---|---|---|---|
| Assumption-finding (#12) | 20 | 17 | 85% | Sonnet's best relative performance |
| Cross-component interaction (#14) | 10 | 8 | 80% | Structured prompt helped Sonnet |
| Race condition identification (#13) | 12 | 7 | 58% | Sonnet struggled with concurrency |
| Data integrity analysis | 18 | 13 | 72% | New task type, between extremes |
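
The Ratio column is Sonnet's finding count divided by GPT-5's, rounded to a whole percent. A short recomputation from the counts in the table (the dictionary layout is just for this example):

```python
# (GPT-5 findings, Sonnet findings) per task type, copied from the table.
counts = {
    "Assumption-finding (#12)": (20, 17),
    "Cross-component interaction (#14)": (10, 8),
    "Race condition identification (#13)": (12, 7),
    "Data integrity analysis": (18, 13),
}

# Sonnet's count as a percentage of GPT-5's count.
ratios = {
    task: round(100 * sonnet / gpt5)
    for task, (gpt5, sonnet) in counts.items()
}

for task, pct in ratios.items():
    print(f"{task}: {pct}%")
```

These ratios compare raw counts only; they do not account for overlap, near-duplicate findings, or depth per finding.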

Data integrity analysis is between "cross-component interaction" (where Sonnet does well) and "race condition identification" (where Sonnet struggles). The task combines both: understanding component interactions (Sonnet's strength) but also reasoning through temporal/ordering scenarios (Sonnet's weakness).

Practical implications:

  1. For distributed systems design review: Use GPT-5. The depth of analysis on issues like "permanent holes," "correction chain integrity," and "replica lag windows" provides genuine value that Sonnet misses.

  2. For quick sanity checks: Sonnet is viable — it catches the obvious issues (orphaned references, duplicate processing) in 1/5 the time. But don't rely on it for thoroughness.

  3. Task framing helps but doesn't close the gap: The structured prompt (5 categories, required output format) helped both models produce organized output. But unlike Finding #14 where structure helped Sonnet recover to 80% of GPT-5's count, here structure only got Sonnet to 72%. The task itself is inherently harder for non-reasoning models.

  4. New task type confirmed: "Data integrity analysis" is a distinct analytical lens useful for architecture review. It complements assumption-finding (what must be true) and race condition analysis (what can interleave) with a focus on what can become inconsistent.

Cost-effectiveness:

  • GPT-5: 134s, ~8.5K total tokens (1.3K prompt + 7.2K completion including 4.4K reasoning)
  • Sonnet: 26s, ~3.2K total tokens (1.4K prompt + 1.8K completion)

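GPT-5 reported 5 more findings than Sonnet (18 vs 13), and its unique findings include issues Sonnet missed entirely, such as clock skew and replica lag windows. At ~2.7x token cost and ~5x time cost, this is justified for architecture review of data-critical systems where consistency violations have financial/regulatory impact.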
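
The cost ratios can be rechecked from the figures reported in this section. The sketch below uses the rounded token totals given above; the variable names are only for illustration.

```python
# Figures as reported in this finding (token totals are the rounded values).
gpt5 = {"seconds": 134, "total_tokens": 8500, "completion": 7274,
        "reasoning": 4416, "findings": 18}
sonnet = {"seconds": 26, "total_tokens": 3200, "completion": 1792,
          "findings": 13}

token_ratio = round(gpt5["total_tokens"] / sonnet["total_tokens"], 1)
time_ratio = round(gpt5["seconds"] / sonnet["seconds"], 1)
extra_findings = gpt5["findings"] - sonnet["findings"]

# Completion tokens left after subtracting reasoning tokens.
gpt5_visible_output = gpt5["completion"] - gpt5["reasoning"]

# Additional wall-clock seconds spent per additional finding.
seconds_per_extra_finding = (gpt5["seconds"] - sonnet["seconds"]) / extra_findings
```

This gives a token ratio of about 2.7, a time ratio of about 5.2, and roughly 21.6 extra seconds per additional finding. Token counts are a proxy for cost here; actual pricing per token differs between providers and is not covered by this finding.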

Document analyzed:

gargoyle's docs/domain/contexts/decision-engine/audit-log.md (170 lines) — describes the Decision Engine's append-only audit trail for signals, decisions, and risk evaluations. Claims to be "authoritative record" with "immutability" guarantees.