model-research/findings/2026-05-07-38-regulatory-compliance-gap-analysis.md
claw d27ce6f5e1 finding #38: regulatory compliance gap analysis (FINRA/PDT domain knowledge test)
First experiment testing domain-specific regulatory knowledge rather than
pure architectural reasoning. Opus demonstrates deepest FINRA Rule 4210
knowledge; GPT-5 finds broker-API semantic mismatches; content filters
are a new failure mode for financial domain analysis via enterprise proxies.
2026-05-07 07:47:11 -07:00


Finding #38: Regulatory Compliance Gap Analysis

Date: 2026-05-07
Document: docs/impl/dtbp-margin-call.md (363 lines)
Task type: Domain-specific regulatory knowledge test (FINRA/SEC PDT rules)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6

Experiment Design

This is the first experiment to test domain-specific regulatory knowledge rather than pure architectural reasoning. Each model was asked to identify where the implementation design might violate, or inadequately handle, actual FINRA/SEC regulatory requirements around Pattern Day Trader (PDT) rules and margin calls.

Prompt specified 5 categories:

  1. Regulatory gaps (FINRA/SEC PDT rules, Reg T requirements)
  2. Broker semantic mismatches (API field meanings under real conditions)
  3. Temporal edge cases (market boundaries, holidays, early closes)
  4. State machine incompleteness (missing states/transitions)
  5. Calculation correctness (DTBP arithmetic under specific order patterns)

Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 155s | 11,734 | 9,024 | 13+ (cut off by content filter) |
| Claude Opus 4.6 | 117s | 5,049 | (internal) | 15 |
| Claude Sonnet 4.6 | ~39s | 1,938 | (internal) | 12 |

NOTE: GPT-5's response was terminated by the SAP HAI proxy's content safety filter (financial/trading content triggered it), cutting off mid-finding #13.

Common Ground (all 3 identified)

  • Short sale DTBP consumption not tracked (buy-only accumulation)
  • Options assignment creating untracked DTBP consumption
  • Market close/open boundary timing issues
  • Margin call detection relying solely on DTBP numeric comparison
  • 5-day cure period calendar computation edge cases
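The cure-period edge cases are easier to reason about with a concrete calculation. The following is a minimal sketch, not the design's actual logic: `add_business_days` and the `holidays` argument are hypothetical names, and it assumes that the five days are business days counted from the day after issuance, with weekends and market holidays skipped.

```python
from datetime import date, timedelta


def add_business_days(
    start: date, days: int, holidays: frozenset[date] = frozenset()
) -> date:
    """Return the date `days` business days after `start`.

    Assumption: counting begins on the day after `start`, and both
    weekends and the supplied holidays are skipped.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0-4 for Monday-Friday, 5-6 for the weekend
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


# A call issued on Thursday 2026-07-02, with Friday 2026-07-03 treated
# as a market holiday (illustrative input, not a maintained calendar).
deadline = add_business_days(date(2026, 7, 2), 5, frozenset({date(2026, 7, 3)}))
```

Under these assumptions the holiday moves the deadline from 2026-07-09 to 2026-07-10, which is the kind of one-day drift a naive calendar-day calculation would miss.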

GPT-5 Unique Findings

  • account.buying_power already being 2× from broker → system double-multiplies to 4× in overnight mode (concrete implementation bug)
  • After-hours trades consuming DTBP that resets at 4pm (dtbp_used_today reset too early for same-day extended session)
  • Premarket DTBP enforcement gap (broker enforces DTBP in extended hours but system uses 2× overnight mode pre-open)
  • House/concentration surcharges consuming DTBP faster than notional cost
  • GTC orders executing after-hours at 4× sizing while system is in 2× overnight
  • FIFO/LIFO matching ambiguity for partial sell DTBP release
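The double-multiplication finding can be illustrated with a small arithmetic sketch. The names below (`BrokerAccount`, `overnight_bp_naive`, `overnight_bp_from_equity`) are hypothetical, and the sketch assumes, as the finding does, that the broker's `buying_power` field is already 2× equity.

```python
from dataclasses import dataclass

OVERNIGHT_MULTIPLIER = 2.0  # assumed overnight leverage


@dataclass(frozen=True)
class BrokerAccount:
    equity: float
    buying_power: float  # assumption: broker already reports 2x equity here


def overnight_bp_naive(account: BrokerAccount) -> float:
    """Reproduces the reported bug: multiplies an already-leveraged figure."""
    return account.buying_power * OVERNIGHT_MULTIPLIER


def overnight_bp_from_equity(account: BrokerAccount) -> float:
    """Applies the multiplier once, starting from unleveraged equity."""
    return account.equity * OVERNIGHT_MULTIPLIER


account = BrokerAccount(equity=30_000.0, buying_power=60_000.0)
naive = overnight_bp_naive(account)            # 4x equity
corrected = overnight_bp_from_equity(account)  # 2x equity
```

With $30,000 of equity the naive path yields $120,000 of overnight buying power instead of $60,000. Whether a given broker's field is pre-multiplied has to be confirmed against that broker's documentation.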

Claude Opus Unique Findings

  • PDT designation trigger gap: System passively reads PDT status but doesn't preemptively gate the 4th day trade that CAUSES designation; $25k equity not verified before triggering trade
  • 90-day freeze allows day trades: Design restricts to 1× buying power but FINRA actually PROHIBITS the activity entirely during escalation (not just restricts leverage) — a genuine regulatory violation
  • Margin call issuance date recovery: If pipeline is down when call is issued, system sets issued_at to detection time, not actual issuance → extends cure period beyond regulatory 5 days
  • Time-and-tick accounting requirement: FINRA requires tracking maximum open commitment (high water mark) for DTBP, not net basis — the release logic may violate this
  • Multiple concurrent margin calls: Second call upserts over first, losing the earlier deadline (single-state-per-user model inadequate)
  • dtbp_used_today NOT reset in margin call mode: Close sequence guard (bp_mode != :margin_dtbp) skips reset, causing stale accumulation
  • Cash account free-riding 90-day freeze: Broader Reg T scope not modeled
  • Broker re-query race on rapid fills: Response ordering creates stale DTBP window between consecutive fills
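The concurrent-margin-call finding suggests keying calls individually instead of keeping one state per user. The sketch below is one possible shape under that assumption, not the design's data model: `MarginCall`, `MarginCallBook`, and their fields are hypothetical names. It also keeps the first-seen record for a call, so a later re-detection cannot overwrite the original issuance date.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MarginCall:
    call_id: str      # hypothetical broker-supplied identifier
    issued_on: date   # actual issuance date, not detection time
    due_on: date


@dataclass
class MarginCallBook:
    """Tracks every open call for one account, keyed by call id."""

    calls: dict[str, MarginCall] = field(default_factory=dict)

    def record(self, call: MarginCall) -> None:
        # setdefault keeps the first record for an id, so re-detecting
        # the same call does not move its issuance date or deadline.
        self.calls.setdefault(call.call_id, call)

    def resolve(self, call_id: str) -> None:
        self.calls.pop(call_id, None)

    def earliest_deadline(self) -> Optional[date]:
        """Return the soonest due date among open calls, or None."""
        return min((c.due_on for c in self.calls.values()), default=None)


book = MarginCallBook()
book.record(MarginCall("call-1", date(2026, 5, 4), date(2026, 5, 11)))
book.record(MarginCall("call-2", date(2026, 5, 6), date(2026, 5, 13)))
```

Here `book.earliest_deadline()` stays at the first call's due date until that call is resolved, which is the deadline a single upserted record would lose.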

Claude Sonnet Unique Findings

  • PDT designation timing mismatch (Gargoyle vs broker overnight batch)
  • Wash sale impact on maintenance requirements affecting DTBP (IRS interaction)

Key Insights

1. Regulatory domain expertise varies significantly across models

  • Opus has deepest regulatory knowledge. Cited specific FINRA Rule 4210 subsections, understood the distinction between restricting leverage vs prohibiting activity, and knew about time-and-tick DTBP accounting.
  • GPT-5 has deepest broker-API semantic knowledge. Reasoned about what specific broker API fields actually mean vs what the design assumes (buying_power already being 2×, DTBP in extended hours, house surcharges).
  • Sonnet is competent but surface-level. Good coverage for a first pass but doesn't match regulatory depth of Opus or semantic precision of GPT-5.

2. Domain-specific lens changes model ranking

In general assumption-finding (previous experiments):

  • GPT-5 > Sonnet > Opus (by count)
  • Opus > GPT-5 > Sonnet (by insight per finding)

In regulatory compliance analysis:

  • Opus > GPT-5 > Sonnet (by regulatory significance)
  • GPT-5 > Opus > Sonnet (by broker-semantic precision)

The regulatory lens ELEVATED Opus because it triggered domain-specific knowledge that Opus possesses more deeply than the other models.

3. Content filters as a new failure mode

Enterprise AI proxies may filter financial/regulatory analytical content. GPT-5's response was cut off by content safety — a failure mode not seen in architectural analysis. For production regulatory compliance review, use direct API access or configure filters for analytical discourse.
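A truncated response is easy to mistake for a complete one, so it may help to check for it explicitly. The sketch below assumes the proxy passes through an OpenAI-style chat completion payload with a `finish_reason` on each entry in `choices`; this document does not record what the proxy actually returned, so the payload shape and the `was_cut_off` helper are assumptions.

```python
from typing import Any, Mapping

# finish_reason values that indicate an incomplete answer in
# OpenAI-style chat completion payloads.
INCOMPLETE_REASONS = {"length", "content_filter"}


def was_cut_off(response: Mapping[str, Any]) -> bool:
    """Return True if any choice looks truncated or filtered.

    A missing `choices` list or a missing finish_reason is treated as
    incomplete, on the assumption that silence is not success.
    """
    choices = response.get("choices") or []
    if not choices:
        return True
    for choice in choices:
        reason = choice.get("finish_reason")
        if reason is None or reason in INCOMPLETE_REASONS:
            return True
    return False


filtered = {"choices": [{"finish_reason": "content_filter"}]}
complete = {"choices": [{"finish_reason": "stop"}]}
```

A check like this would let an experiment harness mark a run as partial rather than counting a cut-off answer as the model's full set of findings.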

Practical Implications

For systems with regulatory requirements (finance, healthcare, legal):

  • Run Opus for regulatory compliance analysis — its domain knowledge produces findings other models won't surface
  • Combine with GPT-5 for implementation semantics — what does this API field actually mean in practice?
  • Use Sonnet for a fast first pass, but not as the sole reviewer for regulatory matters
  • Use direct API access for financial-domain analysis, since enterprise proxy content filters may interfere

Comparison to Previous Experiments

This extends the finding from #11 and #13 that task type changes model performance. Here we show that task DOMAIN also matters. A model's strength on architectural reasoning doesn't predict its strength on regulatory reasoning. The optimal model assignment depends on both:

  • Task type (assumptions vs races vs compliance)
  • Task domain (architecture vs regulation vs security)