finding 59: convention rule gap analysis

New task type: analyzing prescriptive/specification documents for completeness.

- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example)
that neither Claude model caught. Opus unique in tracing PubSub collision
to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture
docs - GPT-5 for coverage, Opus for consequences.
Rodin
2026-05-09 17:28:53 -07:00
parent 98304604ac
commit 2988f31fc3
@@ -0,0 +1,117 @@
# Finding 59: Convention Rule Gap Analysis — GPT-5 Dominates with Exhaustive Enumeration; Opus Traces Gaps to Consequences; Sonnet Is Surface-Level
**Date:** 2026-05-09
**Task:** Identify gaps, ambiguities, and internal inconsistencies in gargoyle's `naming-conventions.md` (196 lines) — a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
**Document type:** Convention/specification document (rules rather than architecture), testing whether models can analyze prescriptive text for completeness.
## Methodology
All three models received the same document (full text) and the same focused analytical question via the HAI AI Core proxy. The prompt specified five categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 122s | 10,563 | 7,232 | 34 |
| Claude Opus 4.6 | 69s | 3,898 | (internal) | 18 (12 fully written, 6 cut off) |
| Claude Sonnet 4.6 | 28s | 1,415 | (internal) | 14 |
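For a rough sense of output density, the table's figures can be turned into findings per 1,000 output tokens and seconds per finding. This is a minimal sketch using only the numbers above. The `runs` dictionary and `density` helper are illustrative names, and the ratios say nothing about finding quality. Whether GPT-5's output-token count includes its reasoning tokens is not stated, so the per-token figure is only approximate.

```python
# Figures copied from the table above; the derived ratios are illustrative only.
runs = {
    "GPT-5": {"seconds": 122, "output_tokens": 10_563, "findings": 34},
    "Claude Opus 4.6": {"seconds": 69, "output_tokens": 3_898, "findings": 18},
    "Claude Sonnet 4.6": {"seconds": 28, "output_tokens": 1_415, "findings": 14},
}


def density(run):
    """Return (findings per 1,000 output tokens, seconds per finding)."""
    per_kilo_token = run["findings"] / (run["output_tokens"] / 1000)
    seconds_each = run["seconds"] / run["findings"]
    return per_kilo_token, seconds_each


summary = {model: density(run) for model, run in runs.items()}

for model, (per_kilo_token, seconds_each) in summary.items():
    print(f"{model}: {per_kilo_token:.2f} findings/1k tokens, {seconds_each:.1f} s/finding")
```

Sonnet comes out with the most findings per token, which fits terse findings rather than deeper ones, so density should not be read as a quality score.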
## What they found — common ground (all 3 identified)
- `Gargoyle.Accounts` namespace exists "outside bounded contexts" but module rules say "every module belongs to exactly one of seven contexts" — creates ambiguity
- Behaviour vs struct/schema naming can collide (both use `Gargoyle.<Context>.<Name>`)
- No guidance for behaviour *implementations* (adapter naming patterns)
- Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
- Process naming "role not action" rule is ambiguous (where's the boundary?)
- Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
- Registry naming undefined for multiple registries in same context
- PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
- Telemetry sub-namespace mapping from modules is unclear
## GPT-5 unique findings (not in either Claude model)
- **PubSub subtopic casing/separators undefined** — snake_case vs kebab-case vs plain not specified
- **PubSub entity key encoding not specified** — colons in keys (BTC-PERP), case sensitivity, composite keys
- **PubSub tenant/org scope missing** — only user/system/entity defined; no per-org or multi-tenant
- **Telemetry verb rule contradicts examples** — rule requires past-tense but `stale` appears in examples (not `became_stale`)
- **Telemetry tag naming unspecified** — key/value casing and naming patterns missing
- **`_all_users` assumes user is only cross-scope concept** — no guidance for `_all_orgs`, `_all_tenants`
- **`_all_users` for non-list actions undefined** — what about `count_all_users`, `purge_all_users!`?
- **Test fake namespace collision risk** — fakes use production namespace (`Gargoyle.*`) in test files
- **Exceptions/error modules lack conventions** — `OrderRejectedError` vs `OrderRejected` vs `Trading.Error.OrderRejected`
- **Protocols and Ecto types not covered** — custom `Ecto.Type`, embedded schemas, defimpl placement
- **Shared struct vs Ecto schema collision** — non-persisted struct named `Order` conflicts with schema
- **Acronym normalization for topics/events** — `IB` → `ib` vs `i_b` not explicit
- **Pluralization guidance incomplete** — schemas singular, but sub-namespace pluralization varies (Strategies vs Positioning)
- **Web component naming gaps** — no guidance for LiveComponents, functional components, .JSON modules
- **Short-name disambiguation left to reader inference** — "Supervisor" in prose without context is ambiguous
- **Telemetry noun vs sub-namespace boundary ambiguous** — when is something a sub-namespace segment vs noun?
- **Telemetry depth/mapping from module nesting unclear** — how many segments for `Trading.BrokerAdapters.IB.OrderRouter`?
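The acronym finding is easy to reproduce: two reasonable CamelCase-to-snake_case conversions disagree on `IB`. The sketch below is a Python stand-in, not gargoyle's actual tooling. Both helper names and both regular expressions are assumptions chosen to show the two outcomes the finding names.

```python
import re


def naive_snake(name: str) -> str:
    """Insert '_' before every capital letter except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def acronym_aware_snake(name: str) -> str:
    """Keep a run of capitals together; split at lower->Upper and ACRONYM->Word."""
    pattern = r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(pattern, "_", name).lower()


# Module segments taken from the findings above.
for segment in ["IB", "OrderRouter", "BrokerAdapters"]:
    print(segment, naive_snake(segment), acronym_aware_snake(segment))
```

The two helpers agree on `OrderRouter` and `BrokerAdapters` and differ only on the acronym, so the convention needs an explicit rule for that case.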
## Claude Opus unique findings (new findings, or depth not reached by either other model)
- **PubSub topic pattern collision is CRITICAL, not just ambiguous** — Opus worked through how `"market_data:AAPL"` (per-entity) and `"market_data:42:status"` (per-user) share the `market_data:` prefix, so a subscriber matching on that prefix cannot tell them apart. GPT-5 noted the conflict but did not trace through the routing failure scenario.
- **Cross-context data structures have no guidance** — what happens to the `Signal` struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and whether shared DTOs are allowed.
- **Test fake location vs namespace mismatch is a compiler issue** — fakes in `test/support/fakes/` that use the `Gargoyle.*` namespace cause `mix compile` scoping ambiguity and accidental import risk. GPT-5 mentioned the collision; Opus traced the consequence.
- **No rules for implementation modules** — pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is `Aggregator.Calculator` a valid name? What about `Internal.AggregatorHelpers`?
## Claude Sonnet findings (less depth than others)
- **Module spanning contexts lacks split guidance** — mentioned but didn't work through specific scenarios
- **Third-party integration modules unaddressed** — Bloomberg, broker APIs: belong in relevant context or special treatment?
- **Validation modules at context boundaries** — data validation between Market Data and Decision Engine: which context?
- **Utility/helper module placement missing** — date helpers, math functions used across contexts
Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.
## Quality assessment
- **GPT-5** produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, `_all_users` functions, web naming, and identified edge cases in each. The telemetry verb/example contradiction finding is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.
- **Claude Opus** produced fewer findings (18, with 6 cut off by token limit) but several were qualitatively superior. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about *why* the gap matters at the architectural level, not just *that* it exists.
- **Claude Sonnet** produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and validation module finding were valid but generic — they identify gaps without demonstrating concrete breakage.
## Key insight — convention analysis requires exhaustive enumeration + constraint reasoning
This task had two components:
1. **Enumeration**: Systematically cover every rule in the document and ask "what's missing?"
2. **Constraint reasoning**: For each gap, reason about whether two developers following the rules could produce incompatible results
GPT-5 excelled at (1) — its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, `_all_users`, fakes, web) got explicit attention.
Opus excelled at (2) — when it identified a gap, it reasoned through the *consequence* (routing failure, compiler ambiguity, architectural drift). Fewer findings, but each finding included the "why it breaks" reasoning that makes it actionable.
Sonnet did neither particularly well — it identified gaps but didn't enumerate exhaustively or reason through consequences deeply.
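The constraint-reasoning step can be made mechanical for one class of gap: two module kinds that share the `Gargoyle.<Context>.<Name>` shape. The sketch below is a hedged illustration. The `proposals` list is hypothetical, loosely based on the `Order` struct/schema collision noted earlier, and `find_collisions` is an illustrative helper rather than anything from the gargoyle codebase.

```python
from collections import defaultdict


def full_name(context: str, name: str) -> str:
    """Behaviours and structs/schemas both use the Gargoyle.<Context>.<Name> shape."""
    return f"Gargoyle.{context}.{name}"


def find_collisions(proposals):
    """Group proposals by full module name; return names claimed by more than one kind."""
    kinds_by_name = defaultdict(set)
    for kind, context, name in proposals:
        kinds_by_name[full_name(context, name)].add(kind)
    return {name: sorted(kinds) for name, kinds in kinds_by_name.items() if len(kinds) > 1}


# Hypothetical proposals from two developers who each followed the written rules.
proposals = [
    ("schema", "Trading", "Order"),
    ("struct", "Trading", "Order"),
    ("behaviour", "Trading", "OrderRouter"),
]

collisions = find_collisions(proposals)
print(collisions)
```

Any non-empty result is a case where the rules as written allow two incompatible outcomes, which is the test step (2) describes.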
## Task taxonomy update
**Convention/specification gap analysis** is a new task type with distinct model performance:
- **GPT-5**: Best for exhaustive coverage — will find the acronym rules, the tag naming, the entity key encoding that others miss
- **Opus**: Best for consequence reasoning — will trace gaps to architectural or operational failures
- **Sonnet**: Adequate for quick sanity check; not recommended for thorough analysis
This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus finds tensions the document can't see about itself; Sonnet lacks depth for analytical work.
## Unique GPT-5 insight worth highlighting
The telemetry verb contradiction (rule says past-tense, example shows `stale`) is a genuine internal inconsistency the document contains. This is different from "gap" or "ambiguity" — it's a self-contradiction where following the rule would produce `became_stale` but the example shows `stale`. GPT-5 caught this; neither Claude model did.
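A crude lint check shows why this is a contradiction rather than a gap: the document's own example fails the document's own rule. The sketch below is an assumption-laden heuristic, not a real linter. The `IRREGULAR_PAST` allow-list is hypothetical, and the "ends in `ed`" test will misjudge some words.

```python
# Hypothetical allow-list of irregular past-tense verbs; a real check would need more.
IRREGULAR_PAST = {"became", "sent", "lost", "began"}


def looks_past_tense(verb_segment: str) -> bool:
    """Crude heuristic: the first word ends in 'ed' or is a known irregular past form."""
    first_word = verb_segment.split("_")[0]
    return first_word.endswith("ed") or first_word in IRREGULAR_PAST


# The two forms discussed above: the documented example and the rule-compliant form.
for segment in ["stale", "became_stale"]:
    status = "ok" if looks_past_tense(segment) else "violates past-tense rule"
    print(f"{segment}: {status}")
```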
## Unique Opus insight worth highlighting
The PubSub topic collision finding traced through *why* the current rules break: `"market_data:AAPL"` (per-entity) is pattern-indistinguishable from `"market_data:42:status"` (per-user) for a subscriber matching `"market_data:" <> _`. This shows the gap doesn't just allow inconsistency — it allows *silent routing failures*. Opus consistently elevates gaps to their operational consequences.
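The silent failure can be sketched with a handler written for per-entity topics. This is a Python stand-in in which `str.startswith` plays the role of the Elixir prefix match `"market_data:" <> _`. The `entity_handler` function is hypothetical and does not reflect how gargoyle subscribers are actually written.

```python
PREFIX = "market_data:"


def entity_handler(topic: str):
    """Handler written for per-entity topics.

    Returns what it believes is the entity key, or None if the prefix does not match.
    """
    if topic.startswith(PREFIX):
        return topic[len(PREFIX):]
    return None


per_entity = entity_handler("market_data:AAPL")
per_user = entity_handler("market_data:42:status")

print(per_entity)  # the intended entity key
print(per_user)    # a per-user topic accepted as if it were an entity key
```

The per-user topic is accepted without error and yields a bogus key, so nothing signals that the message reached the wrong handler.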
## Practical implications
For convention/specification review:
- Run GPT-5 for exhaustive coverage (will catch encoding rules, acronyms, pluralization — the details)
- Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
- Sonnet is not recommended for this task type
The ideal workflow: run GPT-5 first to build a comprehensive gap list, then run Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).
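The two-stage workflow can be outlined as a small pipeline. This is only a sketch. `enumerate_gaps` and `assess_consequence` are hypothetical callables standing in for the enumeration and consequence passes, and the stubs below return canned values rather than calling any model API.

```python
from typing import Callable, Iterable


def review(
    document: str,
    enumerate_gaps: Callable[[str], Iterable[str]],
    assess_consequence: Callable[[str, str], int],
) -> list[tuple[int, str]]:
    """Stage 1 lists gaps; stage 2 scores each one; result is sorted highest score first."""
    gaps = list(enumerate_gaps(document))
    scored = [(assess_consequence(document, gap), gap) for gap in gaps]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Stubs with canned output, standing in for the two model passes.
def stub_enumerator(document: str) -> list[str]:
    return ["tag naming unspecified", "topic pattern collision"]


def stub_assessor(document: str, gap: str) -> int:
    return 3 if "collision" in gap else 1


ranked = review("naming-conventions.md text", stub_enumerator, stub_assessor)
print(ranked)
```

Passing the two stages in as callables keeps the pipeline independent of which model fills each role.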
## Meta-observation: Document type affects model performance less than expected
This is the first experiment on a *prescriptive rules* document rather than an *architecture* document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.
## Raw data
Full model outputs preserved in `/tmp/gpt5-conventions.json`, `/tmp/opus-conventions.json`, `/tmp/sonnet-conventions.json`.