finding 59: convention rule gap analysis

New task type: analyzing prescriptive/specification documents for completeness.
- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example) that neither Claude model caught. Opus unique in tracing PubSub collision to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture docs - GPT-5 for coverage, Opus for consequences.
2026-05-09 17:28:53 -07:00
parent 98304604ac
commit 2988f31fc3
1 changed files with 117 additions and 0 deletions
@@ -0,0 +1,117 @@
+# Finding 59: Convention Rule Gap Analysis — GPT-5 Dominates with Exhaustive Enumeration; Opus Finds Design Contradictions; Sonnet Is Surface-Level
+
+**Date:** 2026-05-09
+**Task:** Identify gaps, ambiguities, and internal inconsistencies in gargoyle's `naming-conventions.md` (196 lines) — a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
+**Document type:** Convention/specification document (rules rather than architecture), testing whether models can analyze prescriptive text for completeness.
+
+## Methodology
+
+All three models received the same document (full text) and the same focused analytical question via the HAI AI Core proxy. The prompt specified five categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 122s | 10,563 | 7,232 | 34 |
+| Claude Opus 4.6 | 69s | 3,898 | (internal) | 18 (12 fully written, 6 cut off) |
+| Claude Sonnet 4.6 | 28s | 1,415 | (internal) | 14 |
+
+## What they found — common ground (all 3 identified)
+
+- `Gargoyle.Accounts` namespace exists "outside bounded contexts" but module rules say "every module belongs to exactly one of seven contexts" — creates ambiguity
+- Behaviour vs struct/schema naming can collide (both use `Gargoyle.<Context>.<Name>`)
+- No guidance for behaviour *implementations* (adapter naming patterns)
+- Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
+- Process naming "role not action" rule is ambiguous (where's the boundary?)
+- Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
+- Registry naming undefined for multiple registries in same context
+- PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
+- Telemetry sub-namespace mapping from modules is unclear
+
+## GPT-5 unique findings (found by neither Claude model)
+
+- **PubSub subtopic casing/separators undefined** — snake_case vs kebab-case vs plain not specified
+- **PubSub entity key encoding not specified** — colons in keys (BTC-PERP), case sensitivity, composite keys
+- **PubSub tenant/org scope missing** — only user/system/entity defined; no per-org or multi-tenant
+- **Telemetry verb rule contradicts examples** — rule requires past-tense but `stale` appears in examples (not `became_stale`)
+- **Telemetry tag naming unspecified** — key/value casing and naming patterns missing
+- **`_all_users` assumes user is only cross-scope concept** — no guidance for `_all_orgs`, `_all_tenants`
+- **`_all_users` for non-list actions undefined** — what about `count_all_users`, `purge_all_users!`?
+- **Test fake namespace collision risk** — fakes use production namespace (`Gargoyle.*`) in test files
+- **Exceptions/error modules lack conventions** — `OrderRejectedError` vs `OrderRejected` vs `Trading.Error.OrderRejected`
+- **Protocols and Ecto types not covered** — custom `Ecto.Type`, embedded schemas, defimpl placement
+- **Shared struct vs Ecto schema collision** — non-persisted struct named `Order` conflicts with schema
+- **Acronym normalization for topics/events** — `IB` → `ib` vs `i_b` not explicit
+- **Pluralization guidance incomplete** — schemas singular, but sub-namespace pluralization varies (Strategies vs Positioning)
+- **Web component naming gaps** — no guidance for LiveComponents, functional components, .JSON modules
+- **Short-name disambiguation left to reader inference** — "Supervisor" in prose without context is ambiguous
+- **Telemetry noun vs sub-namespace boundary ambiguous** — when is something a sub-namespace segment vs noun?
+- **Telemetry depth/mapping from module nesting unclear** — how many segments for `Trading.BrokerAdapters.IB.OrderRouter`?
+
+## Claude Opus unique findings (found by neither other model)
+
+- **PubSub topic pattern collision is CRITICAL, not just ambiguous**: Opus worked through how the per-entity topic `"market_data:AAPL"` and the per-user topic `"market_data:42:status"` share the `"market_data:"` prefix, so the subtopic segment is the only thing that tells them apart. GPT-5 noted the conflict but didn't trace through the routing failure scenario.
+- **Cross-context data structures have no guidance**: what happens to the `Signal` struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and whether shared DTOs are allowed.
+- **Test fake location vs namespace mismatch is a compiler issue**: per Opus, placing fakes in `test/support/fakes/` under the `Gargoyle.*` namespace causes `mix compile` scoping ambiguity and accidental import risk. GPT-5 mentioned the collision; Opus traced the consequence.
+- **No rules for implementation modules**: pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is `Aggregator.Calculator` a valid name? What about `Internal.AggregatorHelpers`?
+
+## Claude Sonnet findings (less depth than others)
+
+- **Module spanning contexts lacks split guidance** — mentioned but didn't work through specific scenarios
+- **Third-party integration modules unaddressed** — Bloomberg, broker APIs: belong in relevant context or special treatment?
+- **Validation modules at context boundaries** — data validation between Market Data and Decision Engine: which context?
+- **Utility/helper module placement missing** — date helpers, math functions used across contexts
+
+Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.
+
+## Quality assessment
+
+- **GPT-5** produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, `_all_users` functions, and web naming, and identified edge cases in each. The telemetry verb/example contradiction it flagged is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.
+
+- **Claude Opus** produced fewer findings (18, with 6 cut off by the token limit), but several went deeper than the corresponding GPT-5 findings. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about *why* the gap matters at the architectural level, not just *that* it exists.
+
+- **Claude Sonnet** produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and validation module finding were valid but generic — they identify gaps without demonstrating concrete breakage.
+
+## Key insight — convention analysis requires exhaustive enumeration + constraint reasoning
+
+This task had two components:
+1. **Enumeration**: Systematically cover every rule in the document and ask "what's missing?"
+2. **Constraint reasoning**: For each gap, reason about whether two developers following the rules could produce incompatible results
+
+GPT-5 excelled at (1) — its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, `_all_users`, fakes, web) got explicit attention.
+
+Opus excelled at (2) — when it identified a gap, it reasoned through the *consequence* (routing failure, compiler ambiguity, architectural drift). Fewer findings, but each finding included the "why it breaks" reasoning that makes it actionable.
+
+Sonnet did neither particularly well — it identified gaps but didn't enumerate exhaustively or reason through consequences deeply.
+
+## Task taxonomy update
+
+**Convention/specification gap analysis** is a new task type with distinct model performance:
+- **GPT-5**: Best for exhaustive coverage — will find the acronym rules, the tag naming, the entity key encoding that others miss
+- **Opus**: Best for consequence reasoning — will trace gaps to architectural or operational failures
+- **Sonnet**: Adequate for quick sanity check; not recommended for thorough analysis
+
+This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus finds tensions the document can't see about itself; Sonnet lacks depth for analytical work.
+
+## Unique GPT-5 insight worth highlighting
+
+The telemetry verb contradiction (rule says past-tense, example shows `stale`) is a genuine internal inconsistency in the document. It is not a "gap" or an "ambiguity" but a self-contradiction: following the rule would produce `became_stale`, while the example shows `stale`. GPT-5 caught this; neither Claude model did.
+
+## Unique Opus insight worth highlighting
+
+The PubSub topic collision finding traced through *why* the current rules break: `"market_data:AAPL"` (per-entity) is pattern-indistinguishable from `"market_data:42:status"` (per-user) for a subscriber matching `"market_data:" <> _`. This shows the gap doesn't just allow inconsistency — it allows *silent routing failures*. Opus consistently elevates gaps to their operational consequences.
+
+## Practical implications
+
+For convention/specification review:
+- Run GPT-5 for exhaustive coverage (will catch encoding rules, acronyms, pluralization — the details)
+- Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
+- Sonnet is not recommended for this task type
+
+The ideal workflow is GPT-5 first for a comprehensive gap list, then Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).
+
+## Meta-observation: Document type affects model performance less than expected
+
+This is the first experiment on a *prescriptive rules* document rather than an *architecture* document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.
+
+## Raw data
+
+Full model outputs are preserved in `/tmp/gpt5-conventions.json`, `/tmp/opus-conventions.json`, `/tmp/sonnet-conventions.json`.
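
The PubSub collision described under "Unique Opus insight worth highlighting" can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not gargoyle code: the two topic shapes are inferred only from the examples quoted in the finding, `matches_prefix` is a Python stand-in for the Elixir pattern `"market_data:" <> _`, and `classify` is a hypothetical router that guesses a topic's scope from its segment count.

```python
# Assumed topic shapes (inferred from the finding's examples, not from the
# actual naming-conventions.md):
#   per-entity: "<context>:<entity_key>"          e.g. "market_data:AAPL"
#   per-user:   "<context>:<user_id>:<subtopic>"  e.g. "market_data:42:status"


def matches_prefix(topic: str, prefix: str = "market_data:") -> bool:
    """Python analogue of the Elixir binary pattern `"market_data:" <> _`."""
    return topic.startswith(prefix)


def classify(topic: str) -> str:
    """Hypothetical router: guess the scope purely from the segment count."""
    segments = topic.split(":")
    if len(segments) == 2:
        return "per-entity"
    if len(segments) == 3:
        return "per-user"
    return "unknown"


topics = [
    "market_data:AAPL",       # per-entity topic from the finding
    "market_data:42:status",  # per-user topic from the finding
    "market_data:42",         # per-user topic with the subtopic omitted
]

# Every topic passes the prefix match, so a prefix subscriber sees them all.
prefix_matches = [matches_prefix(t) for t in topics]

# Segment counting cannot tell user 42 from an entity whose key is "42".
shapes = {t: classify(t) for t in topics}
```

Under these assumptions all three topics pass the prefix match, and `"market_data:42"` is classified as per-entity even if the publisher meant user 42. Nothing raises an error along the way, which is why the finding describes the failure as silent.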