model-research/findings/2026-05-07-38-regulatory-compliance-gap-analysis.md
claw d27ce6f5e1 finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test)
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.
2026-05-07 07:47:11 -07:00
# Finding #38: Regulatory Compliance Gap Analysis
**Date:** 2026-05-07
**Document:** `docs/impl/dtbp-margin-call.md` (363 lines)
**Task type:** Domain-specific regulatory knowledge test (FINRA/SEC PDT rules)
**Models:** GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
## Experiment Design
First experiment testing **domain-specific regulatory knowledge** rather than pure
architectural reasoning. Asked models to identify where the implementation design
might violate or inadequately handle actual FINRA/SEC regulatory requirements
around Pattern Day Trader (PDT) rules and margin calls.
The prompt specified five categories:
1. Regulatory gaps (FINRA/SEC PDT rules, Reg T requirements)
2. Broker semantic mismatches (API field meanings under real conditions)
3. Temporal edge cases (market boundaries, holidays, early closes)
4. State machine incompleteness (missing states/transitions)
5. Calculation correctness (DTBP arithmetic under specific order patterns)
## Results
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 155s | 11,734 | 9,024 | 13+ (cut off by content filter) |
| Claude Opus 4.6 | 117s | 5,049 | (internal) | 15 |
| Claude Sonnet 4.6 | ~39s | 1,938 | (internal) | 12 |
**NOTE:** GPT-5's response was terminated by the SAP HAI proxy's content safety
filter (financial/trading content triggered it), cutting off mid-finding #13.
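
One way to compare verbosity across the three runs is output tokens per finding. The sketch below uses only the figures in the table; the dictionary layout and the `tokens_per_finding` helper are illustrative names. GPT-5's count of 13 is a lower bound because its response was truncated, so its ratio is an upper bound, and the table does not say whether the output-token figure includes reasoning tokens.

```python
# Figures copied from the results table above.
# GPT-5's finding count is a lower bound (reported as "13+").
results = {
    "GPT-5": {"seconds": 155, "output_tokens": 11_734, "findings": 13},
    "Claude Opus 4.6": {"seconds": 117, "output_tokens": 5_049, "findings": 15},
    "Claude Sonnet 4.6": {"seconds": 39, "output_tokens": 1_938, "findings": 12},
}


def tokens_per_finding(row: dict) -> float:
    """Output tokens spent per reported finding."""
    return row["output_tokens"] / row["findings"]


for model, row in results.items():
    print(f"{model}: {tokens_per_finding(row):.0f} output tokens per finding")
```

With these figures the ratios come out to roughly 900, 340, and 160 output tokens per finding for GPT-5, Opus, and Sonnet respectively.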
## Common Ground (all 3 identified)
- Short sale DTBP consumption not tracked (buy-only accumulation)
- Options assignment creating untracked DTBP consumption
- Market close/open boundary timing issues
- Margin call detection relying solely on DTBP numeric comparison
- 5-day cure period calendar computation edge cases
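
The first item (buy-only accumulation) can be illustrated with a small accumulator. This is a hypothetical sketch, not the design's code: `Order`, `opening_quantity`, and `dtbp_consumed` are names invented here, and it assumes that any fill which opens or extends a position, long or short, consumes DTBP at notional cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    side: str             # "buy" or "sell"
    quantity: int         # shares filled (positive)
    price: float
    position_before: int  # signed share count before the fill (negative = short)


def opening_quantity(order: Order) -> int:
    """Shares of this fill that open or extend exposure, long or short."""
    signed = order.quantity if order.side == "buy" else -order.quantity
    pos = order.position_before
    if pos == 0 or (pos > 0) == (signed > 0):
        return order.quantity  # same direction as the position, or starting from flat
    # Opposite direction: only the part that goes past flat opens new exposure.
    return max(0, order.quantity - abs(pos))


def dtbp_consumed(orders) -> float:
    """Total DTBP consumed, counting short sales as well as buys."""
    return sum(opening_quantity(o) * o.price for o in orders)
```

Under this assumption a short sale from a flat position consumes DTBP just as a buy does, while a sell that only closes a long consumes none.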
## GPT-5 Unique Findings
- `account.buying_power` already being 2× from broker → system double-multiplies
to 4× in overnight mode (concrete implementation bug)
- After-hours trades consume DTBP after the 4pm reset (`dtbp_used_today` is reset
too early for the same-day extended session)
- Premarket DTBP enforcement gap (broker enforces DTBP in extended hours but system
uses 2× overnight mode pre-open)
- House/concentration surcharges consuming DTBP faster than notional cost
- GTC orders executing after-hours at 4× sizing while system is in 2× overnight
- FIFO/LIFO matching ambiguity for partial sell DTBP release
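
The double-multiplication finding is a unit mismatch, which a few lines can make concrete. This is a minimal sketch under the finding's premise that the broker field is already leveraged 2×; the function names, the `already_leveraged` flag, and the dollar amounts are hypothetical and do not come from the design document.

```python
OVERNIGHT_MULTIPLIER = 2  # assumed overnight leverage, per the finding


def overnight_bp_buggy(reported_buying_power: float) -> float:
    # Treats the broker figure as an unleveraged base and applies leverage again.
    return reported_buying_power * OVERNIGHT_MULTIPLIER


def overnight_bp_fixed(reported_buying_power: float, already_leveraged: bool) -> float:
    # Apply leverage only when the broker figure is an unleveraged base.
    if already_leveraged:
        return reported_buying_power
    return reported_buying_power * OVERNIGHT_MULTIPLIER


base = 10_000.0       # hypothetical unleveraged base
reported = base * 2   # what the broker field would hold if it is already 2x

print(overnight_bp_buggy(reported))        # 4x the base
print(overnight_bp_fixed(reported, True))  # 2x the base
```

Recording which multiplier each broker field already carries, rather than inferring it at the call site, is one way to keep this class of error out of the sizing logic.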
## Claude Opus Unique Findings
- **PDT designation trigger gap:** System passively reads PDT status but doesn't
preemptively gate the 4th day trade that CAUSES designation; $25k equity not
verified before triggering trade
- **90-day freeze allows day trades:** Design restricts to 1× buying power but
FINRA actually PROHIBITS the activity entirely during escalation (not just
restricts leverage) — a genuine regulatory violation
- **Margin call issuance date recovery:** If the pipeline is down when the call is
issued, the system sets `issued_at` to detection time rather than actual issuance,
which extends the cure period beyond the regulatory 5 days
- **Time-and-tick accounting requirement:** FINRA requires tracking maximum open
commitment (high water mark) for DTBP, not net basis — the release logic may
violate this
- **Multiple concurrent margin calls:** Second call upserts over first, losing the
earlier deadline (single-state-per-user model inadequate)
- **`dtbp_used_today` NOT reset in margin call mode:** Close sequence guard
(`bp_mode != :margin_dtbp`) skips reset, causing stale accumulation
- **Cash account free-riding 90-day freeze:** Broader Reg T scope not modeled
- **Broker re-query race on rapid fills:** Response ordering creates stale DTBP
window between consecutive fills
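
Two of these findings (issuance date recovery and concurrent calls) concern how the cure deadline is stored. The sketch below is hypothetical: `add_business_days`, `record_call`, and the call ids are invented for illustration. It assumes the 5-day cure period is counted in business days starting the day after issuance, and it uses an empty holiday set as a placeholder.

```python
from datetime import date, timedelta

CURE_BUSINESS_DAYS = 5  # cure period referenced above, assumed to be business days


def add_business_days(start: date, days: int, holidays=frozenset()) -> date:
    """Return the date `days` business days after `start` (start is not counted)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


def record_call(open_calls: dict, call_id: str, issued_on: date,
                holidays=frozenset()) -> dict:
    """Store one deadline per call id; an existing call is never overwritten."""
    deadline = add_business_days(issued_on, CURE_BUSINESS_DAYS, holidays)
    open_calls.setdefault(call_id, deadline)
    return open_calls


calls = {}
record_call(calls, "call-1", date(2026, 5, 7))   # issued on a Thursday
record_call(calls, "call-2", date(2026, 5, 11))  # second call, issued on a Monday
earliest_deadline = min(calls.values())          # the deadline that binds first
```

Keying by call id keeps the earlier deadline when a second call arrives. In this example the issuance date of 2026-05-07 gives a deadline of 2026-05-14, whereas substituting a detection date of 2026-05-11 would move it to 2026-05-18.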
## Claude Sonnet Unique Findings
- PDT designation timing mismatch (Gargoyle vs broker overnight batch)
- Wash sale impact on maintenance requirements affecting DTBP (IRS interaction)
## Key Insights
### 1. Regulatory domain expertise varies significantly across models
- **Opus has the deepest regulatory knowledge.** It cited specific FINRA Rule 4210
subsections, understood the distinction between restricting leverage and
prohibiting activity, and knew about time-and-tick DTBP accounting.
- **GPT-5 has the deepest broker-API semantic knowledge.** It reasoned about what
specific broker API fields actually mean vs what the design assumes
(buying_power already being 2×, DTBP in extended hours, house surcharges).
- **Sonnet is competent but surface-level.** Good coverage for a first pass,
but it doesn't match the regulatory depth of Opus or the semantic precision of GPT-5.
### 2. Domain-specific lens changes model ranking
In general assumption-finding (previous experiments):
- GPT-5 > Sonnet > Opus (by count)
- Opus > GPT-5 > Sonnet (by insight per finding)
In regulatory compliance analysis:
- Opus > GPT-5 > Sonnet (by regulatory significance)
- GPT-5 > Opus > Sonnet (by broker-semantic precision)
The regulatory lens ELEVATED Opus, apparently because the task draws on
domain-specific knowledge that Opus holds in more depth than the other models.
### 3. Content filters as a new failure mode
Enterprise AI proxies may filter financial/regulatory analytical content.
GPT-5's response was cut off by content safety — a failure mode not seen
in architectural analysis. For production regulatory compliance review,
use direct API access or configure filters for analytical discourse.
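
A truncated response is easy to mistake for a complete one, so it helps to check why generation stopped before counting findings. This is a minimal sketch, assuming the proxy returns an OpenAI-style chat completion payload as a dictionary. The `finish_reason` values `"content_filter"` and `"length"` exist in the OpenAI Chat Completions API; whether the SAP HAI proxy passes them through unchanged is an assumption, and `truncation_reason` is a hypothetical helper.

```python
def truncation_reason(response: dict):
    """Return why generation stopped early, or None if it finished normally."""
    choices = response.get("choices") or []
    if not choices:
        return "no_choices"
    reason = choices[0].get("finish_reason")
    if reason in ("content_filter", "length"):
        return reason
    return None


# Hypothetical payloads for illustration only.
filtered = {"choices": [{"finish_reason": "content_filter", "message": {"content": "13. ..."}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "done"}}]}

print(truncation_reason(filtered))
print(truncation_reason(complete))
```

Flagging the run as incomplete at this point keeps a cut-off answer from being scored as if the model had finished.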
## Practical Implications
For systems with regulatory requirements (finance, healthcare, legal):
- **Run Opus for regulatory compliance analysis** — its domain knowledge
produces findings other models won't surface
- **Combine with GPT-5 for implementation semantics** — what does this API
field actually mean in practice?
- **Use Sonnet for a fast first pass**, but not as the sole reviewer for regulatory matters
- **Direct API access for financial domain** — enterprise proxy content
filters may interfere
## Comparison to Previous Experiments
This extends the findings from #11 and #13 that task type changes model
performance. Here we show that task DOMAIN also matters. A model's strength
on architectural reasoning doesn't predict its strength on regulatory
reasoning. The optimal model assignment depends on both:
- Task type (assumptions vs races vs compliance)
- Task domain (architecture vs regulation vs security)
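
The two-axis assignment can be sketched as a lookup keyed on (task type, task domain). This is a hypothetical illustration: `ASSIGNMENTS` and `reviewers_for` are invented names, the only entry filled in is the one this experiment supports, and any unmeasured pair falls back to running all three models.

```python
ALL_MODELS = ["GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6"]

# Only the pair measured in this finding is filled in; other pairs are unknown.
ASSIGNMENTS = {
    ("compliance", "regulation"): {
        "primary": "Claude Opus 4.6",       # regulatory depth
        "secondary": "GPT-5",               # broker-API semantics
        "first_pass": "Claude Sonnet 4.6",  # fast, but not the sole reviewer
    },
}


def reviewers_for(task_type: str, domain: str) -> list:
    """Return the models to run, falling back to all models when unmeasured."""
    plan = ASSIGNMENTS.get((task_type, domain))
    if plan is None:
        return list(ALL_MODELS)
    return [plan["primary"], plan["secondary"]]


print(reviewers_for("compliance", "regulation"))
print(reviewers_for("races", "security"))
```

Falling back to every model for unmeasured pairs reflects the point above: strength on one type or domain does not predict strength on another.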