finding 59: convention rule gap analysis

New task type: analyzing prescriptive/specification documents for completeness.
- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example) that neither Claude model caught. Opus unique in tracing PubSub collision to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture docs - GPT-5 for coverage, Opus for consequences.
2026-05-09 17:28:53 -07:00
parent 98304604ac
commit 2988f31fc3
1 changed files with 117 additions and 0 deletions
@@ -0,0 +1,117 @@
 # Finding 59: Convention Rule Gap Analysis — GPT-5 Dominates with Exhaustive Enumeration; Opus Traces Gaps to Consequences; Sonnet Is Surface-Level
 **Date:** 2026-05-09
 **Task:** Identify gaps, ambiguities, and internal inconsistencies in gargoyle's `naming-conventions.md` (196 lines) — a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
 **Document type:** Convention/specification document (rules rather than architecture), testing whether models can analyze prescriptive text for completeness.
 ## Methodology
 All 3 models received the same document (full text) and the same focused analytical question via HAI AI Core proxy. The prompt specified 5 categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.
 | Model | Time | Output tokens | Reasoning tokens | Findings |
 |---|---|---|---|---|
 | GPT-5 | 122s | 10,563 | 7,232 | 34 |
 | Claude Opus 4.6 | 69s | 3,898 | (internal) | 18 (12 fully written, 6 cut off) |
 | Claude Sonnet 4.6 | 28s | 1,415 | (internal) | 14 |
 ## What they found — common ground (all 3 identified)
 - `Gargoyle.Accounts` namespace exists "outside bounded contexts" but module rules say "every module belongs to exactly one of seven contexts" — creates ambiguity
 - Behaviour vs struct/schema naming can collide (both use `Gargoyle.<Context>.<Name>`)
 - No guidance for behaviour *implementations* (adapter naming patterns)
 - Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
 - Process naming "role not action" rule is ambiguous (where's the boundary?)
 - Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
 - Registry naming undefined for multiple registries in same context
 - PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
 - Telemetry sub-namespace mapping from modules is unclear
 ## GPT-5 unique findings (not in either Claude model)
 - **PubSub subtopic casing/separators undefined** — snake_case vs kebab-case vs plain not specified
 - **PubSub entity key encoding not specified** — colons in keys (BTC-PERP), case sensitivity, composite keys
 - **PubSub tenant/org scope missing** — only user/system/entity defined; no per-org or multi-tenant
 - **Telemetry verb rule contradicts examples** — rule requires past-tense but `stale` appears in examples (not `became_stale`)
 - **Telemetry tag naming unspecified** — key/value casing and naming patterns missing
 - **`_all_users` assumes user is only cross-scope concept** — no guidance for `_all_orgs`, `_all_tenants`
 - **`_all_users` for non-list actions undefined** — what about `count_all_users`, `purge_all_users!`?
 - **Test fake namespace collision risk** — fakes use production namespace (`Gargoyle.*`) in test files
 - **Exceptions/error modules lack conventions** — `OrderRejectedError` vs `OrderRejected` vs `Trading.Error.OrderRejected`
 - **Protocols and Ecto types not covered** — custom `Ecto.Type`, embedded schemas, defimpl placement
 - **Shared struct vs Ecto schema collision** — non-persisted struct named `Order` conflicts with schema
 - **Acronym normalization for topics/events** — `IB` → `ib` vs `i_b` not explicit
 - **Pluralization guidance incomplete** — schemas singular, but sub-namespace pluralization varies (Strategies vs Positioning)
 - **Web component naming gaps** — no guidance for LiveComponents, functional components, .JSON modules
 - **Short-name disambiguation left to reader inference** — "Supervisor" in prose without context is ambiguous
 - **Telemetry noun vs sub-namespace boundary ambiguous** — when is something a sub-namespace segment vs noun?
 - **Telemetry depth/mapping from module nesting unclear** — how many segments for `Trading.BrokerAdapters.IB.OrderRouter`?
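The acronym and telemetry-depth findings can be made concrete with a minimal Python sketch. It is an illustration only, not taken from `naming-conventions.md` or from any model output: the two helper functions and their names are hypothetical, and the only input borrowed from the findings is the module path `Trading.BrokerAdapters.IB.OrderRouter`. Both helpers are reasonable readings of "convert the module name to snake_case", yet they disagree on the acronym segment.

```python
import re


def snake_acronym_as_word(name: str) -> str:
    """Treat a run of capitals as one word: "IB" -> "ib"."""
    # Split an acronym from a following capitalized word: "IBGateway" -> "IB_Gateway".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "OrderRouter" -> "Order_Router".
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()


def snake_per_capital(name: str) -> str:
    """Insert an underscore before every capital except the first: "IB" -> "i_b"."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def event_segments(module: str, convert) -> list:
    """Map each module segment to a topic/event segment (hypothetical mapping)."""
    return [convert(part) for part in module.split(".")]


module = "Trading.BrokerAdapters.IB.OrderRouter"
print(event_segments(module, snake_acronym_as_word))
# ['trading', 'broker_adapters', 'ib', 'order_router']
print(event_segments(module, snake_per_capital))
# ['trading', 'broker_adapters', 'i_b', 'order_router']
```

Two developers who each pick one of these conversions would emit different topic or event names for the same module, which is the kind of incompatibility the finding points at. The sketch also leaves open how many of the four segments belong in a telemetry event name, since the convention document does not say.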
 ## Claude Opus unique findings (not in either of the other models)
 - **PubSub topic pattern collision is CRITICAL, not just ambiguous** — Opus worked through how a per-user topic such as `"market_data:42:status"` becomes indistinguishable from a per-entity topic such as `"market_data:AAPL"` once the subtopic segment is dropped (`"market_data:42"` has the same shape as a per-entity topic). GPT-5 noted the conflict but didn't trace through the routing failure scenario.
 - **Cross-context data structures have no guidance** — what happens to the `Signal` struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and whether shared DTOs are allowed.
 - **Test fake location vs namespace mismatch is a compiler issue** — fakes in `test/support/fakes/` with a `Gargoyle.*` namespace cause `mix compile` scoping ambiguity and accidental import risk. GPT-5 mentioned the collision; Opus traced the consequence.
 - **No rules for implementation modules** — pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is `Aggregator.Calculator` a valid name? What about `Internal.AggregatorHelpers`?
 ## Claude Sonnet findings (less depth than others)
 - **Module spanning contexts lacks split guidance** — mentioned but didn't work through specific scenarios
 - **Third-party integration modules unaddressed** — Bloomberg, broker APIs: belong in relevant context or special treatment?
 - **Validation modules at context boundaries** — data validation between Market Data and Decision Engine: which context?
 - **Utility/helper module placement missing** — date helpers, math functions used across contexts
 Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.
 ## Quality assessment
 - **GPT-5** produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, `_all_users` functions, web naming, and identified edge cases in each. The telemetry verb/example contradiction finding is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.
 - **Claude Opus** produced fewer findings (18, with 6 cut off by token limit) but several were qualitatively superior. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about *why* the gap matters at the architectural level, not just *that* it exists.
 - **Claude Sonnet** produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and validation module finding were valid but generic — they identify gaps without demonstrating concrete breakage.
 ## Key insight — convention analysis requires exhaustive enumeration + constraint reasoning
 This task had two components:
 1. **Enumeration**: Systematically cover every rule in the document and ask "what's missing?"
 2. **Constraint reasoning**: For each gap, reason about whether two developers following the rules could produce incompatible results
 GPT-5 excelled at (1) — its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, `_all_users`, fakes, web) got explicit attention.
 Opus excelled at (2) — when it identified a gap, it reasoned through the *consequence* (routing failure, compiler ambiguity, architectural drift). Fewer findings, but each finding included the "why it breaks" reasoning that makes it actionable.
 Sonnet did neither particularly well — it identified gaps but didn't enumerate exhaustively or reason through consequences deeply.
 ## Task taxonomy update
 **Convention/specification gap analysis** is a new task type with distinct model performance:
 - **GPT-5**: Best for exhaustive coverage — will find the acronym rules, the tag naming, the entity key encoding that others miss
 - **Opus**: Best for consequence reasoning — will trace gaps to architectural or operational failures
 - **Sonnet**: Adequate for quick sanity check; not recommended for thorough analysis
 This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus finds tensions the document can't see about itself; Sonnet lacks depth for analytical work.
 ## Unique GPT-5 insight worth highlighting
 The telemetry verb contradiction (rule says past-tense, example shows `stale`) is a genuine internal inconsistency the document contains. This is different from "gap" or "ambiguity" — it's a self-contradiction where following the rule would produce `became_stale` but the example shows `stale`. GPT-5 caught this; neither Claude model did.
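A contradiction of this kind is mechanically checkable. The sketch below is a hypothetical lint in Python, not something the convention document or GPT-5 proposed: it assumes the verb is the first word of the final event segment, uses a crude "ends in -ed or is a listed irregular" heuristic for past tense, and the event tuples other than the `stale` / `became_stale` endings are invented for illustration.

```python
# Irregular past-tense verbs the heuristic should accept (illustrative, not exhaustive).
IRREGULAR_PAST = {"became", "began", "lost", "sent"}


def violates_past_tense_rule(event: tuple) -> bool:
    """Return True when the final segment does not start with a past-tense verb.

    Assumption: the verb is the first underscore-separated word of the last
    segment, e.g. "became" in "became_stale".
    """
    verb = event[-1].split("_")[0]
    return not (verb.endswith("ed") or verb in IRREGULAR_PAST)


# Hypothetical event names; only the endings come from the finding above.
events = [
    ("gargoyle", "market_data", "feed", "stale"),
    ("gargoyle", "market_data", "feed", "became_stale"),
    ("gargoyle", "trading", "order", "submitted"),
]

for event in events:
    status = "VIOLATION" if violates_past_tense_rule(event) else "ok"
    print(".".join(event), status)
```

Run against the document's own examples, a check like this would flag `stale` while accepting `became_stale`, surfacing the rule-versus-example conflict automatically. The heuristic is deliberately naive (a noun such as `speed` would pass), so it is a starting point rather than a reliable linter.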
 ## Unique Opus insight worth highlighting
 The PubSub topic collision finding traced through *why* the current rules break: `"market_data:AAPL"` (per-entity) is pattern-indistinguishable from `"market_data:42:status"` (per-user) for a subscriber matching `"market_data:" <> _`. This shows the gap doesn't just allow inconsistency — it allows *silent routing failures*. Opus consistently elevates gaps to their operational consequences.
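The collision can be reproduced in a few lines. This is a minimal Python sketch under stated assumptions, not Opus's output or the project's code: the topic shapes (`<context>:<entity_key>` for per-entity, `<context>:<user_id>[:<subtopic>]` for per-user) are inferred from the two examples above, user ids are assumed to be integers, and both function names are hypothetical.

```python
def prefix_subscriber_matches(topic: str, prefix: str = "market_data:") -> bool:
    """Mimic a subscriber that checks only the leading segment,
    like the Elixir pattern `"market_data:" <> _`."""
    return topic.startswith(prefix)


def possible_scopes(topic: str) -> set:
    """Return every scope the topic could belong to under the assumed shapes."""
    parts = topic.split(":")
    scopes = set()
    if len(parts) == 2:
        # "<context>:<entity_key>"
        scopes.add("per-entity")
    if len(parts) in (2, 3) and parts[1].isdigit():
        # "<context>:<user_id>" with an optional ":<subtopic>"
        scopes.add("per-user")
    return scopes


for topic in ("market_data:AAPL", "market_data:42:status", "market_data:42"):
    print(topic, prefix_subscriber_matches(topic), sorted(possible_scopes(topic)))
# market_data:AAPL True ['per-entity']
# market_data:42:status True ['per-user']
# market_data:42 True ['per-entity', 'per-user']
```

The prefix subscriber accepts all three topics, and `"market_data:42"` is valid under both shapes, so nothing in the topic string tells the receiver which scope it is handling. A scope marker segment would remove the ambiguity, but that is a possible remedy, not something the convention document defines.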
 ## Practical implications
 For convention/specification review:
 - Run GPT-5 for exhaustive coverage (will catch encoding rules, acronyms, pluralization — the details)
 - Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
 - Sonnet is not recommended for thorough analysis of this task type; it is adequate only as a quick sanity check
 The ideal workflow: GPT-5 first for a comprehensive gap list, then Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).
 ## Meta-observation: Document type affects model performance less than expected
 This is the first experiment on a *prescriptive rules* document rather than an *architecture* document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.
 ## Raw data
 Full model outputs preserved in `/tmp/gpt5-conventions.json`, `/tmp/opus-conventions.json`, `/tmp/sonnet-conventions.json`.