Rodin 2988f31fc3 finding 59: convention rule gap analysis

New task type: analyzing prescriptive/specification documents for completeness.

- GPT-5 dominates with exhaustive enumeration (34 findings)
- Opus traces gaps to consequences (routing failures, compiler issues)
- Sonnet surface-level (not recommended for thorough analysis)

Key insight: GPT-5 found internal contradiction (telemetry verb rule vs example)
that neither Claude model caught. Opus unique in tracing PubSub collision
to actual routing failure scenario.

Task taxonomy: convention gap analysis follows same pattern as architecture
docs - GPT-5 for coverage, Opus for consequences.

2026-05-09 17:28:53 -07:00


Finding 59: Convention Rule Gap Analysis - GPT-5 Dominates with Exhaustive Enumeration; Opus Traces Gaps to Consequences; Sonnet Is Surface-Level

Date: 2026-05-09
Task: Identify gaps, ambiguities, and internal inconsistencies in gargoyle's naming-conventions.md (196 lines), a prescriptive document defining mechanical naming rules for modules, topics, and metrics.
Document type: Convention/specification document (rules rather than architecture), testing whether models can analyze prescriptive text for completeness.

Methodology

The same document (full text) and the same focused analytical question were sent to all 3 models via the HAI AI Core proxy. The prompt specified 5 categories: ambiguous decision points, missing scenarios, internal contradictions, implicit assumptions, and edge cases. It required a specific output format with concrete scenarios and severity ratings. The models had no tools and no project context beyond the document.

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	122s	10,563	7,232	34
Claude Opus 4.6	69s	3,898	(internal)	18 (12 fully written, 6 cut off)
Claude Sonnet 4.6	28s	1,415	(internal)	14
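
One rough way to read the table is to normalise output by findings and by time. The sketch below derives output tokens per finding and output tokens per second from the figures above. Both metric names are illustrative assumptions rather than measures the experiment reported, and Opus's count of 18 includes the 6 findings that were cut off.

```python
# Figures copied from the table above: (time in seconds, output tokens, findings).
RUNS = {
    "GPT-5": (122, 10_563, 34),
    "Claude Opus 4.6": (69, 3_898, 18),   # 6 of the 18 findings were cut off
    "Claude Sonnet 4.6": (28, 1_415, 14),
}


def derived_metrics(runs):
    """Return {model: (tokens_per_finding, tokens_per_second)}, rounded to 1 decimal."""
    out = {}
    for model, (seconds, tokens, findings) in runs.items():
        out[model] = (round(tokens / findings, 1), round(tokens / seconds, 1))
    return out


METRICS = derived_metrics(RUNS)
for model, (per_finding, per_second) in METRICS.items():
    print(f"{model}: {per_finding} tokens/finding, {per_second} tokens/s")
```

Tokens per finding is only a crude proxy for how much detail each finding carries, and it ignores GPT-5's separately reported reasoning tokens.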

What they found — common ground (all 3 identified)

Gargoyle.Accounts namespace exists "outside bounded contexts" but module rules say "every module belongs to exactly one of seven contexts" — creates ambiguity
Behaviour vs struct/schema naming can collide (both use Gargoyle.<Context>.<Name>)
No guidance for behaviour implementations (adapter naming patterns)
Sub-namespace creation criteria are subjective ("different lifecycles, failure modes, or team ownership")
Process naming "role not action" rule is ambiguous (where's the boundary?)
Schema qualification ("when ambiguous, qualify it") lacks canonical qualifier terms
Registry naming undefined for multiple registries in same context
PubSub per-entity pattern conflicts with per-user pattern (topic collision risk)
Telemetry sub-namespace mapping from modules is unclear

GPT-5 unique findings (not in either Claude model)

PubSub subtopic casing/separators undefined — snake_case vs kebab-case vs plain not specified
PubSub entity key encoding not specified — colons in keys (BTC-PERP), case sensitivity, composite keys
PubSub tenant/org scope missing — only user/system/entity defined; no per-org or multi-tenant
Telemetry verb rule contradicts examples — rule requires past-tense but stale appears in examples (not became_stale)
Telemetry tag naming unspecified — key/value casing and naming patterns missing
_all_users assumes user is only cross-scope concept — no guidance for _all_orgs, _all_tenants
_all_users is undefined for non-list actions such as count_all_users or purge_all_users!
Test fake namespace collision risk — fakes use production namespace (Gargoyle.*) in test files
Exceptions/error modules lack conventions — OrderRejectedError vs OrderRejected vs Trading.Error.OrderRejected
Protocols and Ecto types not covered — custom Ecto.Type, embedded schemas, defimpl placement
Shared struct vs Ecto schema collision — non-persisted struct named Order conflicts with schema
Acronym normalization for topics/events — IB → ib vs i_b not explicit
Pluralization guidance incomplete — schemas singular, but sub-namespace pluralization varies (Strategies vs Positioning)
Web component naming gaps — no guidance for LiveComponents, functional components, .JSON modules
Short-name disambiguation left to reader inference — "Supervisor" in prose without context is ambiguous
Telemetry noun vs sub-namespace boundary ambiguous — when is something a sub-namespace segment vs noun?
Telemetry depth/mapping from module nesting unclear — how many segments for Trading.BrokerAdapters.IB.OrderRouter?
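
The acronym finding can be made concrete. The sketch below, written in Python rather than the project's Elixir, contrasts a naive CamelCase-to-snake_case conversion with an acronym-aware one. Both functions are illustrative assumptions; neither is taken from naming-conventions.md, and the document does not say which behaviour it intends.

```python
import re


def naive_snake(name: str) -> str:
    """Insert an underscore before every capital letter except the first."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def acronym_aware_snake(name: str) -> str:
    """Keep runs of capitals together, so an acronym stays one segment."""
    # Split an acronym from a following capitalised word: "IBOrder" -> "IB_Order".
    step1 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "rR" -> "r_R".
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


for module_segment in ["IB", "OrderRouter", "IBOrderRouter"]:
    print(module_segment, naive_snake(module_segment), acronym_aware_snake(module_segment))
```

The two conversions agree on ordinary names such as OrderRouter and diverge only on acronyms, so two developers can both follow a "use snake_case" rule and still publish to different topics.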

Claude Opus unique findings (not in either other model)

PubSub topic pattern collision is CRITICAL, not just ambiguous — Opus specifically worked through how "market_data:AAPL" (per-entity) is indistinguishable from "market_data:42:status" (per-user) without the subtopic segment. GPT-5 noted the conflict but didn't trace through the routing failure scenario.
Cross-context data structures have no guidance — what happens to the Signal struct that Engine produces and Trading consumes? GPT-5 focused on module placement; Opus focused on the DDD principle that "same concept has different representations" and whether shared DTOs are allowed.
Test fake location vs namespace mismatch is a compiler issue: fakes that live in test/support/fakes/ but use the Gargoyle.* namespace cause mix compile scoping ambiguity and risk accidental imports. GPT-5 mentioned the collision; Opus traced the consequence.
No rules for implementation modules — pure helper modules that aren't schemas, processes, or behaviours have no placement guidance. Is Aggregator.Calculator a valid name? What about Internal.AggregatorHelpers?

Claude Sonnet findings (less depth than others)

Module spanning contexts lacks split guidance — mentioned but didn't work through specific scenarios
Third-party integration modules unaddressed — Bloomberg, broker APIs: belong in relevant context or special treatment?
Validation modules at context boundaries — data validation between Market Data and Decision Engine: which context?
Utility/helper module placement missing — date helpers, math functions used across contexts

Sonnet's findings were valid but less specific. Several overlapped with GPT-5/Opus without adding the concrete scenario detail that makes findings actionable.

Quality assessment

GPT-5 produced the most exhaustive analysis (34 findings) with deep coverage of every section of the document. It systematically worked through PubSub rules, telemetry rules, _all_users functions, web naming, and identified edge cases in each. The telemetry verb/example contradiction finding is a genuine internal inconsistency in the document. The acronym, pluralization, and encoding findings show attention to implementation-level detail. GPT-5 also provided a concrete "Suggestions to clarify" section with remediation proposals.
Claude Opus produced fewer findings (18, with 6 cut off by token limit) but several were qualitatively superior. The PubSub collision finding traced through the routing failure consequence that GPT-5 only hinted at. The cross-context data structure finding engaged with DDD principles ("same concept, different representations") rather than just flagging the gap. Opus consistently reasoned about why the gap matters at the architectural level, not just that it exists.
Claude Sonnet produced the fewest findings (14) with the least depth. Several findings were restatements of the prompt categories ("no guidance for X") without working through specific collision scenarios. The third-party integration finding and validation module finding were valid but generic — they identify gaps without demonstrating concrete breakage.

Key insight — convention analysis requires exhaustive enumeration + constraint reasoning

This task had two components:

(1) Enumeration: Systematically cover every rule in the document and ask "what's missing?"
(2) Constraint reasoning: For each gap, reason about whether two developers following the rules could produce incompatible results

GPT-5 excelled at (1) — its 7,232 reasoning tokens enabled methodical section-by-section coverage. Every major topic (modules, PubSub, telemetry, _all_users, fakes, web) got explicit attention.

Opus excelled at (2) — when it identified a gap, it reasoned through the consequence (routing failure, compiler ambiguity, architectural drift). Fewer findings, but each finding included the "why it breaks" reasoning that makes it actionable.

Sonnet did neither particularly well — it identified gaps but didn't enumerate exhaustively or reason through consequences deeply.

Task taxonomy update

Convention/specification gap analysis is a new task type with distinct model performance:

GPT-5: Best for exhaustive coverage — will find the acronym rules, the tag naming, the entity key encoding that others miss
Opus: Best for consequence reasoning — will trace gaps to architectural or operational failures
Sonnet: Adequate for quick sanity check; not recommended for thorough analysis

This pattern is consistent with previous findings: GPT-5's reasoning enables systematic enumeration; Opus surfaces tensions that the document itself does not acknowledge; Sonnet lacks depth for analytical work.

Unique GPT-5 insight worth highlighting

The telemetry verb contradiction (rule says past-tense, example shows stale) is a genuine internal inconsistency the document contains. This is different from "gap" or "ambiguity" — it's a self-contradiction where following the rule would produce became_stale but the example shows stale. GPT-5 caught this; neither Claude model did.
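
A contradiction of this kind is mechanical enough to lint for. The sketch below is a minimal heuristic, assuming the verb is a single snake_case segment and that "past tense" can be approximated by an "-ed" suffix plus a small allowlist of irregular forms. The allowlist and the sample verbs other than stale and became_stale are hypothetical, not taken from the conventions document.

```python
# Hypothetical allowlist of irregular past-tense verbs; extend as needed.
IRREGULAR_PAST = {"became", "sent", "lost", "began"}


def looks_past_tense(verb: str) -> bool:
    """Heuristic: judge the first word of a snake_case verb segment."""
    first_word = verb.split("_")[0]
    return first_word.endswith("ed") or first_word in IRREGULAR_PAST


def rule_violations(verbs):
    """Return the verbs that do not look past-tense under the heuristic."""
    return [verb for verb in verbs if not looks_past_tense(verb)]


# "stale" and "became_stale" come from the finding; the others are made up.
SAMPLE_VERBS = ["received", "stale", "became_stale", "connected"]
print(rule_violations(SAMPLE_VERBS))
```

The suffix check is deliberately crude (a noun such as "feed" would pass it), so a check like this is best treated as a prompt for human review rather than an enforcement gate.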

Unique Opus insight worth highlighting

The PubSub topic collision finding traced through why the current rules break: "market_data:AAPL" (per-entity) is pattern-indistinguishable from "market_data:42:status" (per-user) for a subscriber matching "market_data:" <> _. This shows the gap doesn't just allow inconsistency — it allows silent routing failures. Opus consistently elevates gaps to their operational consequences.
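
The collision can be reproduced in a few lines. The sketch below uses Python's str.startswith as a rough analogue of the Elixir binary pattern "market_data:" <> _, and adds a hypothetical segment-counting classifier to show what happens when the subtopic segment is missing. The classifier and the topic "market_data:42" are assumptions for illustration; only the two topics quoted in the finding come from the analysis.

```python
PER_ENTITY_TOPIC = "market_data:AAPL"      # per-entity topic quoted in the finding
PER_USER_TOPIC = "market_data:42:status"   # per-user topic quoted in the finding


def matches_prefix(topic: str, prefix: str = "market_data:") -> bool:
    """Rough analogue of matching the Elixir pattern `"market_data:" <> _`."""
    return topic.startswith(prefix)


def classify_by_segments(topic: str) -> str:
    """Hypothetical classifier that guesses the scope from the segment count."""
    segments = topic.split(":")
    if len(segments) == 2:
        return "per-entity"
    if len(segments) == 3:
        return "per-user"
    return "unknown"


# Both scopes match the same prefix, so a prefix subscriber cannot tell them apart.
print(matches_prefix(PER_ENTITY_TOPIC), matches_prefix(PER_USER_TOPIC))

# A per-user topic published without its subtopic looks like a per-entity topic.
print(classify_by_segments("market_data:42"))
```

Nothing in either check raises an error, which is the point of the finding: the messages are delivered, just to the wrong handler.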

Practical implications

For convention/specification review:

Run GPT-5 for exhaustive coverage (will catch encoding rules, acronyms, pluralization — the details)
Run Opus to identify which gaps have architectural consequences (routing failures, compiler issues)
Sonnet is not recommended for this task type

The ideal workflow: GPT-5 first for a comprehensive gap list, then Opus to prioritize which gaps matter. This is the same pattern as state machine analysis (Finding #58) and assumption-finding (Findings #10-12).

Meta-observation: Document type affects model performance less than expected

This is the first experiment on a prescriptive rules document rather than an architecture document. The model performance ranking (GPT-5 > Opus > Sonnet) held. The task skills (enumeration for GPT-5, consequence reasoning for Opus) also held. This suggests the analytical framework from previous findings generalizes beyond architecture docs.

Raw data

Full model outputs preserved in /tmp/gpt5-conventions.json, /tmp/opus-conventions.json, /tmp/sonnet-conventions.json.
