finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test)
First experiment testing domain-specific regulatory knowledge rather than pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210 knowledge; GPT-5 finds broker-API semantic mismatches; content filters are a new failure mode for financial domain analysis via enterprise proxies.
@@ -0,0 +1,131 @@
# Finding #38: Regulatory Compliance Gap Analysis

**Date:** 2026-05-07
**Document:** `docs/impl/dtbp-margin-call.md` (363 lines)
**Task type:** Domain-specific regulatory knowledge test (FINRA/SEC PDT rules)
**Models:** GPT-5, Claude Opus 4.6, Claude Sonnet 4.6

## Experiment Design

This is the first experiment to test **domain-specific regulatory knowledge** rather than pure
architectural reasoning. The models were asked to identify where the implementation design
might violate or inadequately handle actual FINRA/SEC regulatory requirements
around Pattern Day Trader (PDT) rules and margin calls.

The prompt specified five categories:
1. Regulatory gaps (FINRA/SEC PDT rules, Reg T requirements)
2. Broker semantic mismatches (API field meanings under real conditions)
3. Temporal edge cases (market boundaries, holidays, early closes)
4. State machine incompleteness (missing states/transitions)
5. Calculation correctness (DTBP arithmetic under specific order patterns)

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 155s | 11,734 | 9,024 | 13+ (cut off by content filter) |
| Claude Opus 4.6 | 117s | 5,049 | (internal) | 15 |
| Claude Sonnet 4.6 | ~39s | 1,938 | (internal) | 12 |

**NOTE:** GPT-5's response was terminated by the SAP HAI proxy's content safety
filter, which the financial/trading content triggered. The cutoff landed partway
through finding #13.

## Common Ground (all 3 identified)

- Short sale DTBP consumption not tracked (buy-only accumulation)
- Options assignment creating untracked DTBP consumption
- Market close/open boundary timing issues
- Margin call detection relying solely on DTBP numeric comparison
- 5-day cure period calendar computation edge cases
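
The cure-period item is the easiest of these to get subtly wrong, so a small sketch may help. It assumes the five days are business days, that the issuance date is day 0 and is not counted, and that the caller supplies the market holiday set. The function name `cure_deadline` and its parameters are hypothetical and do not come from the design document.

```python
from datetime import date, timedelta


def cure_deadline(issued_on: date, business_days: int = 5,
                  holidays: frozenset = frozenset()) -> date:
    """Return the date `business_days` trading days after `issued_on`.

    Assumption: the issuance date is day 0 and is not counted.
    """
    current = issued_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0 for Monday and 6 for Sunday, so 5 and 6 are the weekend.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


memorial_day = date(2026, 5, 25)
# A call issued on Thursday 2026-05-21, with and without the holiday.
print(cure_deadline(date(2026, 5, 21)))                                    # 2026-05-28
print(cure_deadline(date(2026, 5, 21), holidays=frozenset({memorial_day})))  # 2026-05-29
```

A holiday inside the window moves the deadline by a full day. This is the kind of edge case the three models flagged.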

## GPT-5 Unique Findings

- `account.buying_power` is already 2× from the broker, so the system double-multiplies
  to 4× in overnight mode (concrete implementation bug)
- After-hours trades consume DTBP, but `dtbp_used_today` resets at 4pm, too
  early for the same-day extended session
- Premarket DTBP enforcement gap (the broker enforces DTBP in extended hours, but the
  system uses 2× overnight mode pre-open)
- House/concentration surcharges consuming DTBP faster than notional cost
- GTC orders executing after-hours at 4× sizing while the system is in 2× overnight mode
- FIFO/LIFO matching ambiguity when a partial sell releases DTBP
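
The double-multiplication finding comes down to one line of arithmetic. The sketch below assumes, as the finding states, that the broker-reported buying power already includes the 2× multiple. The function names and the 30,000 equity figure are illustrative only and do not come from the design document.

```python
def overnight_limit_buggy(broker_buying_power: int) -> int:
    # Treats the broker figure as raw equity and applies the 2x overnight multiple again.
    return broker_buying_power * 2


def overnight_limit_fixed(broker_buying_power: int) -> int:
    # The broker figure already includes the 2x multiple, so it is used unchanged.
    return broker_buying_power


equity = 30_000
broker_buying_power = equity * 2  # what the broker reports under the stated assumption

print(overnight_limit_buggy(broker_buying_power) // equity)  # 4
print(overnight_limit_fixed(broker_buying_power) // equity)  # 2
```

If the broker field instead reports raw equity, the original multiplication is the correct one. This is why the finding is about field semantics rather than arithmetic.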

## Claude Opus Unique Findings

- **PDT designation trigger gap:** The system passively reads PDT status but doesn't
  preemptively gate the 4th day trade that CAUSES designation; $25k equity is not
  verified before the triggering trade
- **90-day freeze allows day trades:** The design restricts the account to 1× buying power, but
  FINRA actually PROHIBITS the activity entirely during escalation (rather than just
  restricting leverage), a genuine regulatory violation
- **Margin call issuance date recovery:** If the pipeline is down when a call is issued,
  the system sets `issued_at` to detection time, not actual issuance, which extends the cure
  period beyond the regulatory 5 days
- **Time-and-tick accounting requirement:** FINRA requires tracking maximum open
  commitment (high water mark) for DTBP, not net basis; the release logic may
  violate this
- **Multiple concurrent margin calls:** A second call upserts over the first, losing the
  earlier deadline (the single-state-per-user model is inadequate)
- **`dtbp_used_today` NOT reset in margin call mode:** The close sequence guard
  (`bp_mode != :margin_dtbp`) skips the reset, causing stale accumulation
- **Cash account free-riding 90-day freeze:** Broader Reg T scope not modeled
- **Broker re-query race on rapid fills:** Response ordering creates a stale DTBP
  window between consecutive fills
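
The time-and-tick finding contrasts net-basis release with high-water-mark tracking. Here is a minimal sketch of the difference, assuming integer dollar notionals. The names `DtbpUsage`, `on_open`, and `on_close` are hypothetical and do not appear in the design document, and the sketch models the accounting as the finding describes it rather than as the rule text defines it.

```python
from dataclasses import dataclass


@dataclass
class DtbpUsage:
    open_commitment: int = 0  # notional currently tied up in open day-trade positions
    high_water_mark: int = 0  # largest open commitment seen so far today

    def on_open(self, notional: int) -> None:
        self.open_commitment += notional
        self.high_water_mark = max(self.high_water_mark, self.open_commitment)

    def on_close(self, notional: int) -> None:
        # Closing frees the net figure but never lowers the high-water mark.
        self.open_commitment = max(0, self.open_commitment - notional)


usage = DtbpUsage()
usage.on_open(10_000)
usage.on_close(10_000)
usage.on_open(8_000)
print(usage.open_commitment, usage.high_water_mark)  # 8000 10000
```

A design that releases DTBP on every sell effectively reports `open_commitment`. The finding argues that `high_water_mark` is the figure that matters.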

## Claude Sonnet Unique Findings

- PDT designation timing mismatch (Gargoyle vs broker overnight batch)
- Wash sale impact on maintenance requirements affecting DTBP (IRS interaction)

## Key Insights

### 1. Regulatory domain expertise varies significantly across models

- **Opus has the deepest regulatory knowledge.** It cited specific FINRA Rule 4210
  subsections, understood the distinction between restricting leverage and
  prohibiting activity, and knew about time-and-tick DTBP accounting.
- **GPT-5 has the deepest broker-API semantic knowledge.** It reasoned about what
  specific broker API fields actually mean versus what the design assumes
  (`buying_power` already being 2×, DTBP in extended hours, house surcharges).
- **Sonnet is competent but surface-level.** It gives good coverage for a first pass
  but doesn't match the regulatory depth of Opus or the semantic precision of GPT-5.

### 2. Domain-specific lens changes model ranking

In general assumption-finding (previous experiments):
- GPT-5 > Sonnet > Opus (by count)
- Opus > GPT-5 > Sonnet (by insight per finding)

In regulatory compliance analysis:
- Opus > GPT-5 > Sonnet (by regulatory significance)
- GPT-5 > Opus > Sonnet (by broker-semantic precision)

The regulatory lens ELEVATED Opus because it triggered domain-specific
knowledge that Opus possesses more deeply than the other models.

### 3. Content filters as a new failure mode

Enterprise AI proxies may filter financial/regulatory analytical content.
GPT-5's response was cut off by the content safety filter, a failure mode not seen
in architectural analysis. For production regulatory compliance review,
use direct API access or configure filters for analytical discourse.
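
One way to surface this failure mode in an experiment harness is to check why a completion ended before counting its findings. The sketch assumes the response exposes a `finish_reason` string with the values the OpenAI Chat Completions API uses (`stop`, `length`, `content_filter`). Whether the SAP HAI proxy passes that field through unchanged is not established here, and `classify_completion` is a hypothetical helper.

```python
from typing import Optional


def classify_completion(finish_reason: Optional[str]) -> str:
    """Map a finish reason to a coarse status for the results table."""
    if finish_reason == "content_filter":
        return "filtered"   # output was cut by a safety filter; finding count is a lower bound
    if finish_reason == "length":
        return "truncated"  # output hit the token limit
    if finish_reason == "stop":
        return "complete"
    return "unknown"        # missing or proxy-specific value; inspect manually


print(classify_completion("content_filter"))  # filtered
print(classify_completion(None))              # unknown
```

A status other than `complete` would justify reporting a count such as "13+" instead of a firm number.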

## Practical Implications

For systems with regulatory requirements (finance, healthcare, legal):
- **Run Opus for regulatory compliance analysis:** its domain knowledge
  produces findings other models won't surface
- **Combine with GPT-5 for implementation semantics:** what does this API
  field actually mean in practice?
- **Use Sonnet for a fast first pass**, but not as the sole reviewer for regulatory matters
- **Use direct API access for the financial domain:** enterprise proxy content
  filters may interfere

## Comparison to Previous Experiments

This extends the finding from #11 and #13 that task type changes model
performance. Here we show that task DOMAIN also matters. A model's strength
on architectural reasoning doesn't predict its strength on regulatory
reasoning. The optimal model assignment depends on both:
- Task type (assumptions vs races vs compliance)
- Task domain (architecture vs regulation vs security)