finding 59: convention rule gap analysis
New task type: analyzing prescriptive/specification documents for completeness.

- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example) that neither Claude model caught. Opus unique in tracing PubSub collision to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture docs - GPT-5 for coverage, Opus for consequences.
@@ -0,0 +1,117 @@
# Finding 59: Convention Rule Gap Analysis (GPT-5 Dominates with Exhaustive Enumeration; Opus Traces Gaps to Consequences; Sonnet Is Surface-Level)

**Date:** 2026-05-09
**Task:** Identify gaps, ambiguities, and internal inconsistencies in gargoyle's `naming-conventions.md` (196 lines), a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
**Document type:** Convention/specification document (rules rather than architecture). This tests whether models can analyze prescriptive text for completeness.

## Methodology

All three models received the same document (full text) and the same focused analytical question via the HAI AI Core proxy. The prompt specified five categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 122s | 10,563 | 7,232 | 34 |
| Claude Opus 4.6 | 69s | 3,898 | (internal) | 18 (12 fully written, 6 cut off) |
| Claude Sonnet 4.6 | 28s | 1,415 | (internal) | 14 |

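
One rough way to read the table is to normalize tokens and time by finding count. The Python sketch below does only that arithmetic on the figures reported above. It assumes the "Output tokens" column is comparable across models (the source does not say whether GPT-5's count includes its reasoning tokens), and it counts all 18 Opus findings even though six were cut off, so treat the ratios as indicative rather than as a measurement.

```python
# Figures copied from the results table; the derived ratios are illustrative only.
runs = {
    "GPT-5": {"seconds": 122, "output_tokens": 10_563, "findings": 34},
    "Claude Opus 4.6": {"seconds": 69, "output_tokens": 3_898, "findings": 18},
    "Claude Sonnet 4.6": {"seconds": 28, "output_tokens": 1_415, "findings": 14},
}


def per_finding(run):
    """Return average output tokens and wall-clock seconds spent per finding."""
    return {
        "tokens_per_finding": run["output_tokens"] / run["findings"],
        "seconds_per_finding": run["seconds"] / run["findings"],
    }


summary = {model: per_finding(run) for model, run in runs.items()}

for model, stats in summary.items():
    print(
        f"{model}: {stats['tokens_per_finding']:.0f} tokens/finding, "
        f"{stats['seconds_per_finding']:.1f} s/finding"
    )
```

On these numbers Sonnet spends roughly a third as many output tokens per finding as GPT-5, which is consistent with the later observation that its findings carry less scenario detail.
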
## What they found: common ground (all 3 identified)

- The `Gargoyle.Accounts` namespace exists "outside bounded contexts", but the module rules say "every module belongs to exactly one of seven contexts", which creates ambiguity
- Behaviour vs struct/schema naming can collide (both use `Gargoyle.<Context>.<Name>`)
- No guidance for behaviour *implementations* (adapter naming patterns)
- Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
- Process naming "role not action" rule is ambiguous (where's the boundary?)
- Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
- Registry naming undefined for multiple registries in same context
- PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
- Telemetry sub-namespace mapping from modules is unclear

## GPT-5 unique findings (not in either Claude model)

- **PubSub subtopic casing/separators undefined**: snake_case vs kebab-case vs plain is not specified
- **PubSub entity key encoding not specified**: colons and other separator characters in keys (e.g. `BTC-PERP`), case sensitivity, composite keys
- **PubSub tenant/org scope missing**: only user/system/entity scopes are defined; no per-org or multi-tenant scope
- **Telemetry verb rule contradicts examples**: the rule requires past tense, but `stale` appears in the examples (not `became_stale`)
- **Telemetry tag naming unspecified**: key/value casing and naming patterns are missing
- **`_all_users` assumes user is the only cross-scope concept**: no guidance for `_all_orgs`, `_all_tenants`
- **`_all_users` for non-list actions undefined**: what about `count_all_users`, `purge_all_users!`?
- **Test fake namespace collision risk**: fakes use the production namespace (`Gargoyle.*`) in test files
- **Exceptions/error modules lack conventions**: `OrderRejectedError` vs `OrderRejected` vs `Trading.Error.OrderRejected`
- **Protocols and Ecto types not covered**: custom `Ecto.Type`, embedded schemas, `defimpl` placement
- **Shared struct vs Ecto schema collision**: a non-persisted struct named `Order` conflicts with the schema of the same name
- **Acronym normalization for topics/events**: `IB` → `ib` vs `i_b` is not explicit
- **Pluralization guidance incomplete**: schemas are singular, but sub-namespace pluralization varies (`Strategies` vs `Positioning`)
- **Web component naming gaps**: no guidance for LiveComponents, functional components, `.JSON` modules
- **Short-name disambiguation left to reader inference**: "Supervisor" in prose without context is ambiguous
- **Telemetry noun vs sub-namespace boundary ambiguous**: when is something a sub-namespace segment and when is it a noun?
- **Telemetry depth/mapping from module nesting unclear**: how many segments for `Trading.BrokerAdapters.IB.OrderRouter`?

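
The acronym finding above is easy to reproduce. The sketch below is in Python rather than Elixir, and both helper functions are hypothetical, not taken from gargoyle. It shows two reasonable CamelCase-to-snake_case conversions that agree on ordinary names and diverge on `IB`, which is the kind of divergence an unstated rule leaves open.

```python
import re


def snake_split_every_capital(name: str) -> str:
    """Insert '_' before every capital letter except the first one."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_keep_acronyms(name: str) -> str:
    """Keep a run of capitals together, so an acronym stays one word."""
    pattern = r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(pattern, "_", name).lower()


for segment in ["OrderRouter", "IB", "IBOrderRouter"]:
    print(
        f"{segment}: {snake_split_every_capital(segment)} "
        f"vs {snake_keep_acronyms(segment)}"
    )
# OrderRouter: order_router vs order_router
# IB: i_b vs ib
# IBOrderRouter: i_b_order_router vs ib_order_router
```

Two developers who each pick one of these conversions would publish and subscribe to different topic or event names for the same module without either of them breaking a written rule.
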
## Claude Opus unique findings (not in either other model)

- **PubSub topic pattern collision is CRITICAL, not just ambiguous**: Opus worked through how `"market_data:AAPL"` (per-entity) and `"market_data:42:status"` (per-user) share the same `"market_data:<key>"` shape, so without the subtopic segment the two are indistinguishable. GPT-5 noted the conflict but didn't trace through the routing failure scenario.
- **Cross-context data structures have no guidance**: what happens to the `Signal` struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and on whether shared DTOs are allowed.
- **Test fake location vs namespace mismatch is a compiler issue**: fakes in `test/support/fakes/` with a `Gargoyle.*` namespace cause `mix compile` scoping ambiguity and accidental import risk. GPT-5 mentioned the collision; Opus traced the consequence.
- **No rules for implementation modules**: pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is `Aggregator.Calculator` a valid name? What about `Internal.AggregatorHelpers`?

## Claude Sonnet findings (less depth than others)

- **Module spanning contexts lacks split guidance**: mentioned, but without working through specific scenarios
- **Third-party integration modules unaddressed**: Bloomberg, broker APIs: do they belong in the relevant context or get special treatment?
- **Validation modules at context boundaries**: data validation between Market Data and Decision Engine: which context?
- **Utility/helper module placement missing**: date helpers, math functions used across contexts

Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.

## Quality assessment

- **GPT-5** produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, `_all_users` functions, and web naming, and identified edge cases in each. The telemetry verb/example contradiction finding is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.

- **Claude Opus** produced fewer findings (18, with 6 cut off by the token limit) but several were qualitatively superior. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about *why* the gap matters at the architectural level, not just *that* it exists.

- **Claude Sonnet** produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and the validation module finding were valid but generic: they identify gaps without demonstrating concrete breakage.

## Key insight: convention analysis requires exhaustive enumeration + constraint reasoning

This task had two components:

1. **Enumeration**: Systematically cover every rule in the document and ask "what's missing?"
2. **Constraint reasoning**: For each gap, reason about whether two developers following the rules could produce incompatible results

GPT-5 excelled at (1): its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, `_all_users`, fakes, web) got explicit attention.

Opus excelled at (2): when it identified a gap, it reasoned through the *consequence* (routing failure, compiler ambiguity, architectural drift). It produced fewer findings, but each included the "why it breaks" reasoning that makes a finding actionable.

Sonnet did neither particularly well: it identified gaps but neither enumerated them exhaustively nor reasoned through their consequences in depth.

## Task taxonomy update

**Convention/specification gap analysis** is a new task type with distinct model performance:

- **GPT-5**: Best for exhaustive coverage. It will find the acronym rules, the tag naming, and the entity key encoding that others miss
- **Opus**: Best for consequence reasoning. It will trace gaps to architectural or operational failures
- **Sonnet**: Adequate for a quick sanity check; not recommended for thorough analysis

This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus finds tensions the document can't see about itself; Sonnet lacks depth for analytical work.

## Unique GPT-5 insight worth highlighting

The telemetry verb contradiction (the rule says past tense, the example shows `stale`) is a genuine internal inconsistency in the document. This is different from a "gap" or an "ambiguity": it is a self-contradiction, where following the rule would produce `became_stale` but the example shows `stale`. GPT-5 caught this; neither Claude model did.

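
A contradiction of this kind can also be caught mechanically. The Python sketch below is a heuristic lint, not part of gargoyle: the event names, the irregular-verb list, and the helper names are all hypothetical, and the only facts taken from the source are the past-tense rule and the `stale` / `became_stale` pair.

```python
# Hypothetical list of irregular past-tense verbs; a real check needs a fuller list.
IRREGULAR_PAST = {"became", "sent", "lost", "built", "made"}


def is_past_tense(verb_segment: str) -> bool:
    """Heuristic: the first word of the segment ends in 'ed' or is a known irregular."""
    head = verb_segment.split("_")[0]
    return head.endswith("ed") or head in IRREGULAR_PAST


def violations(events):
    """Return the events whose final (verb) segment is not past tense."""
    return [event for event in events if not is_past_tense(event[-1])]


# Hypothetical event names standing in for the examples in the conventions document.
documented_examples = [
    ("gargoyle", "trading", "order", "submitted"),
    ("gargoyle", "market_data", "feed", "stale"),
    ("gargoyle", "market_data", "feed", "became_stale"),
]

print(violations(documented_examples))
# [('gargoyle', 'market_data', 'feed', 'stale')]
```

The `endswith("ed")` test is deliberately crude (nouns such as `feed` would pass it), so a check like this is best read as a prompt for human review of the document's examples rather than as a validator.
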
## Unique Opus insight worth highlighting

The PubSub topic collision finding traced through *why* the current rules break: `"market_data:AAPL"` (per-entity) is pattern-indistinguishable from `"market_data:42:status"` (per-user) for a subscriber matching `"market_data:" <> _`. This shows that the gap doesn't just allow inconsistency; it allows *silent routing failures*. Opus consistently elevates gaps to their operational consequences.

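
The failure mode can be shown in a few lines. The sketch below uses Python's `str.startswith` as a stand-in for the Elixir binary prefix match `"market_data:" <> _`. The `handle` function is hypothetical and assumes a subscriber that was written with only per-entity topics in mind.

```python
PREFIX = "market_data:"


def handle(topic: str) -> str:
    """Mimic a subscriber that matches only on the "market_data:" prefix."""
    if not topic.startswith(PREFIX):
        return "ignored"
    # Written for per-entity topics, so the first segment is assumed to be a symbol.
    key = topic[len(PREFIX):].split(":")[0]
    return f"entity:{key}"


print(handle("market_data:AAPL"))       # entity:AAPL
print(handle("market_data:42:status"))  # entity:42 (a user id misread as a symbol)
print(handle("trading:42:status"))      # ignored
```

Nothing raises or logs in the second call: the per-user message is accepted and processed as if it were entity data, which is what makes the failure silent.
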
## Practical implications

For convention/specification review:

- Run GPT-5 for exhaustive coverage (it will catch the details: encoding rules, acronyms, pluralization)
- Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
- Sonnet is not recommended for this task type

The ideal workflow: GPT-5 first for a comprehensive gap list, then Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).

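
The two-stage workflow can be sketched as a small pipeline. This is a minimal Python outline under stated assumptions: the two model calls are passed in as plain callables, the stub functions below stand in for them and do not reflect any real API, and the severity labels are hypothetical.

```python
from typing import Callable, Dict, List

Gap = Dict[str, str]

# Hypothetical severity labels, ordered from most to least urgent.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


def review(
    document: str,
    enumerate_gaps: Callable[[str], List[Gap]],
    assess_consequence: Callable[[str, Gap], str],
) -> List[Gap]:
    """Stage 1 lists gaps; stage 2 attaches a severity to each; result is sorted."""
    gaps = enumerate_gaps(document)
    assessed = [
        {**gap, "severity": assess_consequence(document, gap)} for gap in gaps
    ]
    return sorted(
        assessed,
        key=lambda gap: SEVERITY_ORDER.get(gap["severity"], len(SEVERITY_ORDER)),
    )


# Stubs standing in for the enumeration model and the consequence model.
def fake_enumerator(document: str) -> List[Gap]:
    return [
        {"title": "Telemetry tag naming unspecified"},
        {"title": "PubSub topic collision"},
    ]


def fake_assessor(document: str, gap: Gap) -> str:
    return "critical" if "collision" in gap["title"] else "minor"


ranked = review("naming-conventions.md contents", fake_enumerator, fake_assessor)
for gap in ranked:
    print(gap["severity"], "-", gap["title"])
```

Keeping the stages as injected callables means either model can be swapped out without touching the ranking step.
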
## Meta-observation: Document type affects model performance less than expected

This is the first experiment on a *prescriptive rules* document rather than an *architecture* document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.

## Raw data

Full model outputs preserved in `/tmp/gpt5-conventions.json`, `/tmp/opus-conventions.json`, `/tmp/sonnet-conventions.json`.