diff --git a/findings/2026-05-09-59-convention-rule-gap-analysis.md b/findings/2026-05-09-59-convention-rule-gap-analysis.md
new file mode 100644
index 0000000..3c1a48f
--- /dev/null
+++ b/findings/2026-05-09-59-convention-rule-gap-analysis.md
@@ -0,0 +1,117 @@
+# Finding 59: Convention Rule Gap Analysis - GPT-5 Dominates with Exhaustive Enumeration; Opus Finds Design Contradictions; Sonnet Is Surface-Level
+
+**Date:** 2026-05-09
+**Task:** Identify gaps, ambiguities, and internal inconsistencies in gargoyle's `naming-conventions.md` (196 lines), a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
+**Document type:** Convention/specification document (rules rather than architecture), testing whether models can analyze prescriptive text for completeness.
+
+## Methodology
+
+All 3 models received the same document (full text) and the same focused analytical question via the HAI AI Core proxy. The prompt specified 5 categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 122s | 10,563 | 7,232 | 34 |
+| Claude Opus 4.6 | 69s | 3,898 | (internal) | 18 (12 fully written, 6 cut off) |
+| Claude Sonnet 4.6 | 28s | 1,415 | (internal) | 14 |
+
+## What they found - common ground (all 3 identified)
+
+- `Gargoyle.Accounts` namespace exists "outside bounded contexts" but module rules say "every module belongs to exactly one of seven contexts", which creates ambiguity
+- Behaviour vs struct/schema naming can collide (both use `Gargoyle..`)
+- No guidance for behaviour *implementations* (adapter naming patterns)
+- Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
+- Process naming "role not action" rule is ambiguous (where's the boundary?)
+- Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
+- Registry naming undefined for multiple registries in same context
+- PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
+- Telemetry sub-namespace mapping from modules is unclear
+
+## GPT-5 unique findings (not in either Claude model)
+
+- **PubSub subtopic casing/separators undefined**: snake_case vs kebab-case vs plain not specified
+- **PubSub entity key encoding not specified**: colons in keys (BTC-PERP), case sensitivity, composite keys
+- **PubSub tenant/org scope missing**: only user/system/entity defined; no per-org or multi-tenant
+- **Telemetry verb rule contradicts examples**: rule requires past-tense but `stale` appears in examples (not `became_stale`)
+- **Telemetry tag naming unspecified**: key/value casing and naming patterns missing
+- **`_all_users` assumes user is only cross-scope concept**: no guidance for `_all_orgs`, `_all_tenants`
+- **`_all_users` for non-list actions undefined**: what about `count_all_users`, `purge_all_users!`?
+- **Test fake namespace collision risk**: fakes use production namespace (`Gargoyle.*`) in test files
+- **Exceptions/error modules lack conventions**: `OrderRejectedError` vs `OrderRejected` vs `Trading.Error.OrderRejected`
+- **Protocols and Ecto types not covered**: custom `Ecto.Type`, embedded schemas, defimpl placement
+- **Shared struct vs Ecto schema collision**: non-persisted struct named `Order` conflicts with schema
+- **Acronym normalization for topics/events**: `IB` → `ib` vs `i_b` not explicit
+- **Pluralization guidance incomplete**: schemas singular, but sub-namespace pluralization varies (Strategies vs Positioning)
+- **Web component naming gaps**: no guidance for LiveComponents, functional components, .JSON modules
+- **Short-name disambiguation left to reader inference**: "Supervisor" in prose without context is ambiguous
+- **Telemetry noun vs sub-namespace boundary ambiguous**: when is something a sub-namespace segment vs noun?
+- **Telemetry depth/mapping from module nesting unclear**: how many segments for `Trading.BrokerAdapters.IB.OrderRouter`?
+
+## Claude Opus unique findings (not in either other model)
+
+- **PubSub topic pattern collision is CRITICAL, not just ambiguous**: Opus specifically worked through how `"market_data:AAPL"` (per-entity) is indistinguishable from `"market_data:42:status"` (per-user) without the subtopic segment. GPT-5 noted the conflict but didn't trace through the routing failure scenario.
+- **Cross-context data structures have no guidance**: what happens to the `Signal` struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and whether shared DTOs are allowed.
+- **Test fake location vs namespace mismatch is a compiler issue**: fakes in `test/support/fakes/` with `Gargoyle.*` namespace causes `mix compile` scoping ambiguity and accidental import risk. GPT-5 mentioned collision; Opus traced the consequence.
+- **No rules for implementation modules**: pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is `Aggregator.Calculator` a valid name? What about `Internal.AggregatorHelpers`?
+
+## Claude Sonnet findings (less depth than others)
+
+- **Module spanning contexts lacks split guidance**: mentioned but didn't work through specific scenarios
+- **Third-party integration modules unaddressed**: Bloomberg, broker APIs: belong in relevant context or special treatment?
+- **Validation modules at context boundaries**: data validation between Market Data and Decision Engine: which context?
+- **Utility/helper module placement missing**: date helpers, math functions used across contexts
+
+Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.
+
+## Quality assessment
+
+- **GPT-5** produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, `_all_users` functions, web naming, and identified edge cases in each. The telemetry verb/example contradiction finding is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.
+
+- **Claude Opus** produced fewer findings (18, with 6 cut off by token limit) but several were qualitatively superior. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about *why* the gap matters at the architectural level, not just *that* it exists.
+
+- **Claude Sonnet** produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and validation module finding were valid but generic: they identify gaps without demonstrating concrete breakage.
+
+## Key insight - convention analysis requires exhaustive enumeration + constraint reasoning
+
+This task had two components:
+1. **Enumeration**: Systematically cover every rule in the document and ask "what's missing?"
+2. **Constraint reasoning**: For each gap, reason about whether two developers following the rules could produce incompatible results
+
+GPT-5 excelled at (1): its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, `_all_users`, fakes, web) got explicit attention.
+
+Opus excelled at (2): when it identified a gap, it reasoned through the *consequence* (routing failure, compiler ambiguity, architectural drift). Fewer findings, but each finding included the "why it breaks" reasoning that makes it actionable.
+
+Sonnet did neither particularly well: it identified gaps but didn't enumerate exhaustively or reason through consequences deeply.
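To make the constraint-reasoning step concrete, the PubSub collision from the Opus findings can be sketched in a few lines. This is an illustrative Python stand-in, not gargoyle's Elixir code: the two topic shapes are taken from the examples quoted in this finding, while the helper names `matches_prefix` and `classify`, and the colon-containing key `BTC:PERP`, are hypothetical.

```python
# Topic strings quoted in the finding. The shapes are assumed to be
# per-entity "<context>:<entity_key>" and per-user "<context>:<user_id>:<subtopic>".
PER_ENTITY_TOPIC = "market_data:AAPL"
PER_USER_TOPIC = "market_data:42:status"


def matches_prefix(topic: str, prefix: str = "market_data:") -> bool:
    """Mimic a subscriber that only pattern-matches the leading segment."""
    return topic.startswith(prefix)


def classify(topic: str) -> str:
    """Guess the scope from segment count, the only signal the two shapes offer."""
    segments = topic.split(":")
    if len(segments) == 2:
        return "per-entity"
    if len(segments) == 3:
        return "per-user"
    return "unknown"


if __name__ == "__main__":
    # "market_data:42" and "market_data:BTC:PERP" are hypothetical inputs:
    # a numeric key with no subtopic, and an entity key that contains a colon.
    for topic in (PER_ENTITY_TOPIC, PER_USER_TOPIC, "market_data:42", "market_data:BTC:PERP"):
        print(f"{topic!r}: prefix match={matches_prefix(topic)}, guessed scope={classify(topic)}")
```

Under these assumptions a prefix-matching subscriber receives both scopes, and a segment-count heuristic misfiles any entity key containing a colon, so two developers following the rules as written can still produce topics that route to the wrong handler.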
+
+## Task taxonomy update
+
+**Convention/specification gap analysis** is a new task type with distinct model performance:
+- **GPT-5**: Best for exhaustive coverage; it will find the acronym rules, the tag naming, and the entity key encoding that others miss
+- **Opus**: Best for consequence reasoning; it will trace gaps to architectural or operational failures
+- **Sonnet**: Adequate for a quick sanity check; not recommended for thorough analysis
+
+This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus finds tensions the document can't see about itself; Sonnet lacks depth for analytical work.
+
+## Unique GPT-5 insight worth highlighting
+
+The telemetry verb contradiction (rule says past-tense, example shows `stale`) is a genuine internal inconsistency the document contains. This is different from a "gap" or "ambiguity": it is a self-contradiction where following the rule would produce `became_stale` but the example shows `stale`. GPT-5 caught this; neither Claude model did.
+
+## Unique Opus insight worth highlighting
+
+The PubSub topic collision finding traced through *why* the current rules break: `"market_data:AAPL"` (per-entity) is pattern-indistinguishable from `"market_data:42:status"` (per-user) for a subscriber matching `"market_data:" <> _`. This shows the gap doesn't just allow inconsistency; it allows *silent routing failures*. Opus consistently elevates gaps to their operational consequences.
+
+## Practical implications
+
+For convention/specification review:
+- Run GPT-5 for exhaustive coverage (it will catch the details: encoding rules, acronyms, pluralization)
+- Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
+- Sonnet is not recommended for this task type
+
+The ideal workflow: GPT-5 first for a comprehensive gap list, then Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).
+
+## Meta-observation: Document type affects model performance less than expected
+
+This is the first experiment on a *prescriptive rules* document rather than an *architecture* document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.
+
+## Raw data
+
+Full model outputs preserved in `/tmp/gpt5-conventions.json`, `/tmp/opus-conventions.json`, `/tmp/sonnet-conventions.json`.