From 6e930aed9416b45e0a38176f4256c94b05b93836 Mon Sep 17 00:00:00 2001
From: Rodin
Date: Thu, 30 Apr 2026 11:45:53 -0700
Subject: [PATCH] docs: add full architectural analysis

Repo shape, import hierarchy, HSM/CHASM architecture, PR discussions,
code quality metrics, cross-ecosystem comparisons
---
 analysis.md | 381 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 381 insertions(+)
 create mode 100644 analysis.md

diff --git a/analysis.md b/analysis.md
new file mode 100644
index 0000000..c653b39
--- /dev/null
+++ b/analysis.md
@@ -0,0 +1,381 @@
+# Temporal: Architectural Analysis
+
+**Repo:** github.com/temporalio/temporal
+**Size:** 181M, 2,645 Go files, 8,958 commits, 290
+contributors
+**Category:** Durable execution engine / workflow
+orchestrator
+
+---
+
+## Repository Shape
+
+```
+temporal/
+├── api/                  # Generated protobuf service definitions
+├── chasm/                # NEW: Coordinated Heterogeneous Application State Machines
+├── client/               # Internal service clients
+├── cmd/                  # Entry points (server, tools)
+├── common/               # Shared infrastructure (massive)
+│   ├── backoff/
+│   ├── channel/
+│   ├── clock/
+│   ├── dynamicconfig/    # 566 runtime-configurable settings
+│   ├── goro/             # Goroutine lifecycle management
+│   ├── log/
+│   ├── metrics/
+│   ├── namespace/
+│   ├── persistence/      # Multi-backend storage abstraction
+│   ├── quotas/           # Rate limiting infrastructure
+│   ├── softassert/       # Production assertions (log, don't crash)
+│   └── tasks/            # Scheduler primitives (IWRR, FIFO, etc.)
+├── components/           # Feature modules (callbacks, nexus, schedulers)
+├── service/
+│   ├── frontend/         # gRPC API handlers
+│   ├── history/          # Workflow state machine execution
+│   │   ├── hsm/          # Hierarchical State Machine framework
+│   │   └── queues/       # Task queue processing
+│   ├── matching/         # Task dispatch / worker routing
+│   └── worker/           # System workflows
+└── tests/                # Integration / functional tests
+```
+
+### Import Hierarchy (most depended-upon)
+
+1. `common`: 7,257 imports (the foundation)
+2.
+   `api`: 1,731 (protobuf contracts)
+3. `service`: 1,693 (business logic)
+4. `chasm`: 497 (rapidly growing new framework)
+5. `tests`: 125 (integration harness)
+
+---
+
+## Key Architectural Patterns
+
+### 1. Hierarchical State Machines (HSM)
+
+**PR #5494 (Mar 2024, 51 review comments):**
+
+The HSM framework is Temporal's core abstraction. Every
+workflow execution is a tree of state machines: the
+workflow itself, its activities, child workflows, timers,
+callbacks, and Nexus operations.
+
+```go
+type StateMachine[S comparable] interface {
+	TaskRegenerator
+	State() S
+	SetState(S)
+}
+
+type Transition[S comparable, SM StateMachine[S], E any] struct {
+	Sources     []S
+	Destination S
+	apply       func(SM, E) (TransitionOutput, error)
+}
+```
+
+**The key insight:** Type-safe state transitions with
+source validation. `Transition.Apply()` checks
+`slices.Contains(t.Sources, sm.State())` before
+allowing the state change. Invalid transitions return
+`ErrInvalidTransition` rather than silently corrupting
+state.
+
+**From PR discussion (tdeebswihart):**
+> "I wish we'd gone with the standard `fsm` name here.
+> HSM keeps making me think of Hardware Security
+> Modules."
+
+**From PR discussion (bergundy, the author):**
+> "I don't consider this a final approach but I do think
+> it's a step in the right direction. We need to model
+> more state machines on top of this to form a more
+> solid API."
+
+The author is explicit that the design is iterative. The
+framework shipped "not final" and evolved through real usage.
+
+### 2. CHASM (Coordinated Heterogeneous Application State Machines)
+
+**PR #6987 (Dec 2024–Jan 2025, 60 review comments):**
+
+CHASM replaces the old ad-hoc component system.
+It's a framework for building HSM-based components with:
+- Declarative field definitions
+- Mutable vs immutable contexts (type-enforced)
+- Parent-child component relationships
+- Task generation from transitions
+
+**Key discussion points:**
+
+**bergundy (author):** "I would put this in a top level
+`chasm` directory. There's likely going to be some
+chasm related code in other services."
+
+**yycptt:** "Having the implementation in the top level
+package instead of service/history feels weird."
+(The response proposed a re-export strategy.)
+
+**Sushisource:** "I think I prefer them separate,
+because what happens if you mutate something and then
+say 'not ready'? That would be some weird violation
+that shouldn't be possible, and separate contexts
+enforces that at the type level."
+
+→ **Decision: Split MutableContext vs Context at the
+type level** to make invalid operations unrepresentable.
+This is the "making wrong things impossible" philosophy
+in action.
+
+### 3. Goroutine Lifecycle (goro.Handle)
+
+**PR #1892 (Sep 2021, 15 review comments):**
+
+Introduced to fix a **double-close panic** in the task
+writer. The pattern is strikingly similar to
+CockroachDB's Handle (introduced 2025), but predates it
+by 3.5 years.
+
+```go
+type Handle struct {
+	context context.Context
+	cancel  context.CancelFunc
+	done    chan struct{}
+	err     atomic.Value
+}
+```
+
+**From PR discussion (mmcshane, author):**
+> "One thing you might not guess about Stop() is that
+> it removes itself from the parent matching engine. I
+> don't like this 'remove yourself' behavior because it
+> puts the control logic in the wrong place (i.e. in
+> the controlled object rather than the controller)."
+
+**Reviewer (paulnpdev):**
+> "If an expert questions what the code is doing, it
+> deserves a comment."
+
+This principle ("if a reviewer needs to ask, the code
+needs a comment") is enforced through review culture.
+
+### 4.
+Soft Assertions (softassert)
+
+**PR #7411 (Mar 2025, 46 review comments):**
+
+Production code that logs errors for invariant
+violations but doesn't crash:
+
+```go
+softassert.That(logger, object.state == "ready",
+	"object is not ready")
+```
+
+**From PR (stephanos):**
+> "**Why not panic?** Maybe in the future. For now,
+> we're happy with finding these failed assertions in
+> functional tests."
+
+This is Temporal's version of CockroachDB's
+`errors.AssertionFailedf`: a way to mark "this should
+never happen" without crashing production. The key
+difference: CockroachDB promotes these to errors that
+may crash; Temporal logs them and continues.
+
+### 5. Dynamic Configuration (566 settings)
+
+Temporal's most extreme pattern: **566
+runtime-configurable settings** with type-safe resolution
+and namespace-scoped overrides.
+
+```go
+var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
+	"admin.enableListHistoryTasks",
+	true,
+	`Description here`,
+)
+```
+
+Settings use generics for type safety and resolve with
+precedence: task queue → namespace → global.
+
+The `Collection` uses `weak.Pointer` for cache
+invalidation (Go 1.24 feature) and `goro.Group` for
+background polling, which shows how internal packages
+compose.
+
+### 6. Persistence Plugin System (init registration)
+
+```go
+func init() {
+	sql.RegisterPlugin(PluginName, &plugin{
+		driver: &driver.PQDriver{},
+	})
+}
+```
+
+Classic Go plugin pattern using `init()` + global
+registry. Supports: PostgreSQL (lib/pq + pgx), MySQL,
+SQLite, Cassandra. The init-time registration means a
+plugin is only available if its package is imported
+(the `cmd/` packages import the plugins they want).
+
+### 7. uber/fx Dependency Injection
+
+Temporal uses uber/fx for service construction.
+Each service has an `fx.go` that declares providers and
+consumers:
+
+```go
+type GrpcServerOptionsParams struct {
+	fx.In
+	Logger                        log.Logger
+	RPCFactory                    common.RPCFactory
+	RetryableInterceptor          *interceptor.RetryableInterceptor
+	NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
+}
+```
+
+This is unusual for Go; most projects avoid DI
+frameworks. Temporal justifies it because the service
+graph is genuinely complex (4 services × multiple
+backends × configurable interceptors).
+
+---
+
+## Code Quality Markers
+
+| Metric | Count |
+|--------|-------|
+| TODOs (non-test) | 738 |
+| FIXMEs | 0 |
+| HACKs | 5 |
+| Mock files | 152 |
+| Test files | 785 |
+| Integration tests | 113 |
+| Generic usages | 1,928 |
+
+**TODO style:** `// TODO: description` (no owner tag).
+Compare with CockroachDB's `// TODO(username):` convention;
+Temporal doesn't track who is responsible for a TODO.
+
+---
+
+## Patterns Unique to Temporal
+
+### ShutdownOnce (safe multi-close)
+
+```go
+func (c *ShutdownOnceImpl) Shutdown() {
+	if atomic.CompareAndSwapInt32(
+		&c.status,
+		shutdownOnceStatusOpen,
+		shutdownOnceStatusClosed,
+	) {
+		close(c.channel)
+	}
+}
+```
+
+CAS-based channel close that's safe to call multiple
+times. Solves the "close of closed channel" panic that
+plagues concurrent shutdown code.
+
+### Interleaved Weighted Round Robin Scheduler
+
+Custom task scheduler that interleaves tasks from
+different channels based on configurable weights.
+Uses dynamic config for weight updates without restart.
+This is their answer to fair scheduling across
+namespaces with different SLAs.
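
The interleaving idea can be illustrated with a minimal sketch, assuming static integer weights. The names `weightedChannel` and `iwrrSchedule` are hypothetical and are not taken from Temporal's `common/tasks` package; the real scheduler also handles dynamic weight updates and concurrent workers, which this sketch leaves out.

```go
package main

import "fmt"

// weightedChannel is a hypothetical stand-in for a task channel
// together with its scheduling weight.
type weightedChannel struct {
	name   string
	weight int
}

// iwrrSchedule returns one full scheduling cycle. In round r, every
// channel whose weight is greater than r gets exactly one slot, so
// heavy channels are spread across the cycle instead of running in
// one long burst.
func iwrrSchedule(channels []weightedChannel) []string {
	maxWeight := 0
	for _, ch := range channels {
		if ch.weight > maxWeight {
			maxWeight = ch.weight
		}
	}

	var schedule []string
	for round := 0; round < maxWeight; round++ {
		for _, ch := range channels {
			if ch.weight > round {
				schedule = append(schedule, ch.name)
			}
		}
	}
	return schedule
}

func main() {
	cycle := iwrrSchedule([]weightedChannel{
		{name: "A", weight: 3},
		{name: "B", weight: 1},
		{name: "C", weight: 2},
	})
	fmt.Println(cycle) // prints [A B C A C A]
}
```

With weights 3, 1, and 2, a plain weighted round robin would emit `A A A B C C`. The interleaved order gives each channel the same share per cycle, but the low-weight channel `B` is served in the first round rather than after a burst from `A`.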
+
+### serviceerror Package (domain error types)
+
+Instead of wrapping standard errors, Temporal defines
+domain-specific error types that map directly to gRPC
+status codes:
+- `StickyWorkerUnavailable`
+- `ShardOwnershipLost`
+- `TaskAlreadyStarted`
+- `CurrentBranchChanged`
+
+Each is a struct implementing `error` with specific
+fields needed for retry/recovery decisions.
+
+---
+
+## Cross-Ecosystem Observations
+
+### Temporal vs CockroachDB
+
+| Concern | Temporal | CockroachDB |
+|---------|----------|-------------|
+| Goroutine mgmt | goro.Handle (2021) | stop.Handle (2025) |
+| Assertions | softassert (log) | AssertionFailedf (error) |
+| Config | 566 dynamic settings | Cluster settings |
+| DI | uber/fx | Manual wiring |
+| State machines | First-class HSM framework | Ad-hoc per component |
+| Error types | Domain structs → gRPC | Sentinel + wrapping |
+| TODO style | No owner | `TODO(username)` |
+
+### Temporal vs Prometheus
+
+| Concern | Temporal | Prometheus |
+|---------|----------|------------|
+| Plugin system | init() registration | init() registration |
+| Logging | Custom log package | promslog (slog) |
+| Interfaces | Heavy use | Minimal, targeted |
+| Generics | 1,928 usages | Minimal |
+| Global state | Avoided (fx wiring) | Accepted for hot paths |
+
+### Key Differences from CockroachDB
+
+1. **uber/fx is a conscious choice**: Temporal's service
+   graph is complex enough to justify a DI framework.
+   CockroachDB explicitly avoids frameworks.
+
+2. **HSM is THE architecture**: Everything in Temporal
+   is a state machine. CockroachDB has state machines
+   but doesn't have a unified framework for them.
+
+3. **CHASM splits mutable/immutable at the type level**:
+   This is Temporal's strongest pattern. It makes
+   mutation impossible in read paths via the type system.
+
+4.
+   **goro.Handle predates CockroachDB's Handle by 3.5
+   years**: Same problem (goroutine lifecycle), same
+   solution (context + done channel + safe multi-stop),
+   invented independently.
+
+---
+
+## Lessons for Code Review
+
+1. **"If a reviewer needs to ask, the code needs a
+   comment"**: Temporal's review culture promotes
+   comments that explain non-obvious decisions.
+
+2. **Separate mutable from immutable contexts at the
+   type level**: Don't rely on documentation to prevent
+   mutation in read paths.
+
+3. **Soft assertions > panics in distributed systems**:
+   Log the invariant violation and continue serving.
+   Catch the failed assertion later in functional tests.
+
+4. **Domain error types beat generic wrapping** when
+   errors drive retry/routing decisions. A struct with
+   specific fields is more useful than `fmt.Errorf`.
+
+5. **DI frameworks are justified when the service graph
+   is genuinely complex**: 4 services × multiple
+   backends × configurable interceptors × optional
+   features = real complexity.
+
+6. **HSM frameworks centralize correctness**: Moving
+   state transition validation into a framework means
+   every component gets it right by construction instead
+   of by discipline.
+
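
The source-validation check described for `Transition.Apply()` in the HSM section can be sketched as a small self-contained program. This is a reduced illustration under stated assumptions, not Temporal's implementation: `TaskRegenerator` and `TransitionOutput` are omitted, `ErrInvalidTransition` is redefined locally, and `Timer`, `TimerState`, and `newFireTransition` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// ErrInvalidTransition is a local stand-in for the error named in the analysis.
var ErrInvalidTransition = errors.New("invalid transition")

// StateMachine is a reduced form of the interface quoted earlier
// (the embedded TaskRegenerator is left out of this sketch).
type StateMachine[S comparable] interface {
	State() S
	SetState(S)
}

// Transition lists the states it may start from and the state it ends in.
type Transition[S comparable, SM StateMachine[S], E any] struct {
	Sources     []S
	Destination S
	apply       func(SM, E) error
}

// Apply rejects the event unless the machine is in one of the allowed
// source states; only then does it move to Destination and run the hook.
func (t Transition[S, SM, E]) Apply(sm SM, event E) error {
	if !slices.Contains(t.Sources, sm.State()) {
		return ErrInvalidTransition
	}
	sm.SetState(t.Destination)
	if t.apply == nil {
		return nil
	}
	return t.apply(sm, event)
}

// TimerState and Timer are hypothetical; they exist only for this example.
type TimerState int

const (
	Scheduled TimerState = iota
	Fired
)

type Timer struct {
	state TimerState
	fired int
}

func (t *Timer) State() TimerState     { return t.state }
func (t *Timer) SetState(s TimerState) { t.state = s }

// newFireTransition allows Scheduled -> Fired and nothing else.
func newFireTransition() Transition[TimerState, *Timer, struct{}] {
	return Transition[TimerState, *Timer, struct{}]{
		Sources:     []TimerState{Scheduled},
		Destination: Fired,
		apply: func(t *Timer, _ struct{}) error {
			t.fired++
			return nil
		},
	}
}

func main() {
	timer := &Timer{state: Scheduled}
	fire := newFireTransition()

	first := fire.Apply(timer, struct{}{})
	second := fire.Apply(timer, struct{}{}) // timer is already Fired

	fmt.Println("first apply succeeded:", first == nil)
	fmt.Println("second apply rejected:", errors.Is(second, ErrInvalidTransition))
}
```

Because the check lives in `Apply`, a component author only declares `Sources` and `Destination`; firing the same timer twice is rejected without any per-component guard code.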