docs: add full architectural analysis

Repo shape, import hierarchy, HSM/CHASM architecture, PR discussions,
code quality metrics, cross-ecosystem comparisons
Rodin
2026-04-30 11:45:53 -07:00
parent 1784995383
commit 6e930aed94
# Temporal: Architectural Analysis
**Repo:** github.com/temporalio/temporal
**Size:** 181M, 2,645 Go files, 8,958 commits, 290 contributors
**Category:** Durable execution engine / workflow orchestrator
---
## Repository Shape
```
temporal/
├── api/ # Generated protobuf service definitions
├── chasm/ # NEW: Component-based HSM Architecture
├── client/ # Internal service clients
├── cmd/ # Entry points (server, tools)
├── common/ # Shared infrastructure (massive)
│ ├── backoff/
│ ├── channel/
│ ├── clock/
│ ├── dynamicconfig/ # 566 runtime-configurable settings
│ ├── goro/ # Goroutine lifecycle management
│ ├── log/
│ ├── metrics/
│ ├── namespace/
│ ├── persistence/ # Multi-backend storage abstraction
│ ├── quotas/ # Rate limiting infrastructure
│ ├── softassert/ # Production assertions (log, don't crash)
│ └── tasks/ # Scheduler primitives (IWRR, FIFO, etc.)
├── components/ # Feature modules (callbacks, nexus, schedulers)
├── service/
│ ├── frontend/ # gRPC API handlers
│ ├── history/ # Workflow state machine execution
│ │ ├── hsm/ # Hierarchical State Machine framework
│ │ └── queues/ # Task queue processing
│ ├── matching/ # Task dispatch / worker routing
│ └── worker/ # System workflows
└── tests/ # Integration / functional tests
```
### Import Hierarchy (most depended-upon)
1. `common` — 7,257 imports (the foundation)
2. `api` — 1,731 (protobuf contracts)
3. `service` — 1,693 (business logic)
4. `chasm` — 497 (rapidly growing new framework)
5. `tests` — 125 (integration harness)
---
## Key Architectural Patterns
### 1. Hierarchical State Machines (HSM)
**PR #5494 (Mar 2024, 51 review comments):**
The HSM framework is Temporal's core abstraction. Every
workflow execution is a tree of state machines — the
workflow itself, its activities, child workflows, timers,
callbacks, nexus operations.
```go
type StateMachine[S comparable] interface {
	TaskRegenerator
	State() S
	SetState(S)
}

type Transition[S comparable, SM StateMachine[S], E any] struct {
	Sources     []S
	Destination S
	apply       func(SM, E) (TransitionOutput, error)
}
```
**The key insight:** Type-safe state transitions with
source validation. `Transition.Apply()` checks
`slices.Contains(t.Sources, sm.State())` before
allowing the state change. Invalid transitions return
`ErrInvalidTransition` rather than silently corrupting
state.
**From PR discussion (tdeebswihart):**
> "I wish we'd gone with the standard `fsm` name here.
> HSM keeps making me think of Hardware Security
> Modules."
**From PR discussion (bergundy, the author):**
> "I don't consider this a final approach but I do think
> it's a step in the right direction. We need to model
> more state machines on top of this to form a more
> solid API."
This is explicit about being iterative. The framework
shipped "not final" and evolved through real usage.
### 2. CHASM (Component Architecture for State Machines)
**PR #6987 (Dec 2024 to Jan 2025, 60 review comments):**
CHASM replaces the old ad-hoc component system. It's
a framework for building HSM-based components with:
- Declarative field definitions
- Mutable vs immutable contexts (type-enforced)
- Parent-child component relationships
- Task generation from transitions
**Key discussion points:**
**bergundy (author):** "I would put this in a top level
`chasm` directory. There's likely going to be some
chasm related code in other services."
**yycptt:** "Having the implementation in the top level
package instead of service/history feels weird."
(Responded with re-export strategy.)
**Sushisource:** "I think I prefer them separate,
because what happens if you mutate something and then
say 'not ready'? That would be some weird violation
that shouldn't be possible, and separate contexts
enforces that at the type level."
→ **Decision: Split MutableContext vs Context at the
type level** to make invalid operations unrepresentable.
This is the "making wrong things impossible" philosophy
in action.
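A minimal sketch of what the split buys, assuming two hypothetical interfaces. The names `Context`, `MutableContext`, `IsReady`, and `MarkReady` are illustrative only and are not the `chasm` package's actual API.

```go
package main

import "fmt"

// Context is a read-only view of a component (hypothetical shape).
type Context interface {
	Status() string
}

// MutableContext adds write access on top of the read-only view.
type MutableContext interface {
	Context
	SetStatus(string)
}

// node is a toy component that can back both views.
type node struct{ status string }

func (n *node) Status() string     { return n.status }
func (n *node) SetStatus(s string) { n.status = s }

// IsReady can only read: calling ctx.SetStatus here would not compile,
// so "mutate, then report not ready" cannot be written in a read path.
func IsReady(ctx Context) bool {
	return ctx.Status() == "ready"
}

// MarkReady needs write access, so its signature has to say so.
func MarkReady(ctx MutableContext) {
	ctx.SetStatus("ready")
}

func main() {
	n := &node{status: "pending"}
	fmt.Println(IsReady(n))
	MarkReady(n)
	fmt.Println(IsReady(n))
}
```

The function signature alone tells a reviewer whether a code path can change state.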
### 3. Goroutine Lifecycle (goro.Handle)
**PR #1892 (Sep 2021, 15 review comments):**
Introduced to fix a **double-close panic** in the task
writer. The pattern is strikingly similar to
CockroachDB's Handle (introduced 2025), but predates it
by 3.5 years.
```go
type Handle struct {
	context context.Context
	cancel  context.CancelFunc
	done    chan struct{}
	err     atomic.Value
}
```
**From PR discussion (mmcshane, author):**
> "One thing you might not guess about Stop() is that
> it removes itself from the parent matching engine. I
> don't like this 'remove yourself' behavior because it
> puts the control logic in the wrong place (i.e. in
> the controlled object rather than the controller)."
**Reviewer (paulnpdev):**
> "If an expert questions what the code is doing, it
> deserves a comment."
This principle — "if a reviewer needs to ask, the code
needs a comment" — is enforced through review culture.
### 4. Soft Assertions (softassert)
**PR #7411 (Mar 2025, 46 review comments):**
Production code that logs errors for invariant
violations but doesn't crash:
```go
softassert.That(logger, object.state == "ready",
	"object is not ready")
```
**From PR (stephanos):**
> "**Why not panic?** Maybe in the future. For now,
> we're happy with finding these failed assertions in
> functional tests."
This is Temporal's counterpart to CockroachDB's
`errors.AssertionFailedf`: a way to mark "this should
never happen" without crashing production. The key
difference: CockroachDB turns these into errors that
propagate and may crash; Temporal logs them and continues.
### 5. Dynamic Configuration (566 settings)
Temporal's most extreme pattern: **566 runtime-
configurable settings** with type-safe resolution and
namespace-scoped overrides.
```go
var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
	"admin.enableListHistoryTasks",
	true,
	`Description here`,
)
```
Settings use generics for type safety and resolve with
precedence: task queue → namespace → global.
The `Collection` uses `weak.Pointer` for cache
invalidation (Go 1.24 feature) and `goro.Group` for
background polling — showing how internal packages
compose.
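The precedence rule (task queue, then namespace, then global) can be sketched with a small generic type. Everything here is hypothetical: the `Setting` struct, its map-based overrides, and the `"namespace/taskQueue"` key format are assumptions chosen to keep the example short, not the `dynamicconfig` API.

```go
package main

import "fmt"

// Setting holds a global default plus scoped overrides (hypothetical shape).
type Setting[T any] struct {
	Default     T
	ByNamespace map[string]T
	ByTaskQueue map[string]T // keyed by "namespace/taskQueue"
}

// Get resolves the most specific value: task queue, then namespace,
// then the global default.
func (s Setting[T]) Get(namespace, taskQueue string) T {
	if v, ok := s.ByTaskQueue[namespace+"/"+taskQueue]; ok {
		return v
	}
	if v, ok := s.ByNamespace[namespace]; ok {
		return v
	}
	return s.Default
}

func main() {
	maxPollers := Setting[int]{
		Default:     10,
		ByNamespace: map[string]int{"billing": 20},
		ByTaskQueue: map[string]int{"billing/invoices": 50},
	}
	fmt.Println(maxPollers.Get("billing", "invoices")) // task queue override
	fmt.Println(maxPollers.Get("billing", "refunds"))  // namespace override
	fmt.Println(maxPollers.Get("search", "index"))     // global default
}
```

The type parameter means a caller asking for an `int` setting cannot accidentally receive a string, which is the type-safety point made above.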
### 6. Persistence Plugin System (init registration)
```go
func init() {
	sql.RegisterPlugin(PluginName, &plugin{
		driver: &driver.PQDriver{},
	})
}
```
Classic Go plugin pattern using `init()` + global
registry. Supports: PostgreSQL (lib/pq + pgx), MySQL,
SQLite, Cassandra. Because registration happens in
`init()`, a plugin is available only if its package is
imported into the binary (the `cmd/` packages import the
plugins they want).
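A self-contained sketch of the registry side of this pattern. The `Plugin` interface, `RegisterPlugin`, `Lookup`, and the `fakePostgres` type are all hypothetical stand-ins; in a real layout the `init()` would live in the plugin's own package and the binary would pull it in with a blank import.

```go
package main

import "fmt"

// Plugin is a hypothetical minimal plugin interface.
type Plugin interface {
	DriverName() string
}

// registry is the global table that init() functions fill in.
var registry = map[string]Plugin{}

// RegisterPlugin adds a plugin and rejects duplicate names early.
func RegisterPlugin(name string, p Plugin) {
	if _, dup := registry[name]; dup {
		panic("plugin registered twice: " + name)
	}
	registry[name] = p
}

// Lookup returns the plugin registered under name, if any.
func Lookup(name string) (Plugin, error) {
	p, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown plugin %q (is its package imported?)", name)
	}
	return p, nil
}

// fakePostgres stands in for a real driver wrapper.
type fakePostgres struct{}

func (fakePostgres) DriverName() string { return "postgres" }

// init runs before main, once this package is part of the binary.
func init() {
	RegisterPlugin("postgres", fakePostgres{})
}

func main() {
	p, err := Lookup("postgres")
	fmt.Println(p.DriverName(), err)

	_, err = Lookup("oracle")
	fmt.Println(err)
}
```

The lookup error hints at the usual failure mode: a plugin that was never imported simply does not exist at runtime.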
### 7. uber/fx Dependency Injection
Temporal uses uber/fx for service construction. Each
service has an `fx.go` that declares providers and
consumers:
```go
type GrpcServerOptionsParams struct {
	fx.In

	Logger                        log.Logger
	RPCFactory                    common.RPCFactory
	RetryableInterceptor          *interceptor.RetryableInterceptor
	NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
}
```
This is unusual for Go — most projects avoid DI
frameworks. Temporal justifies it because the service
graph is genuinely complex (4 services × multiple
backends × configurable interceptors).
---
## Code Quality Markers
| Metric | Count |
|--------|-------|
| TODOs (non-test) | 738 |
| FIXMEs | 0 |
| HACKs | 5 |
| Mock files | 152 |
| Test files | 785 |
| Integration tests | 113 |
| Generic usages | 1,928 |
**TODO style:** `// TODO: description` (no owner tag).
Compare to CockroachDB's `// TODO(username):` — Temporal
doesn't track WHO is responsible for a TODO.
---
## Patterns Unique to Temporal
### ShutdownOnce (safe multi-close)
```go
func (c *ShutdownOnceImpl) Shutdown() {
	if atomic.CompareAndSwapInt32(
		&c.status,
		shutdownOnceStatusOpen,
		shutdownOnceStatusClosed,
	) {
		close(c.channel)
	}
}
```
CAS-based channel close that's safe to call multiple
times. Solves the "close of closed channel" panic that
plagues concurrent shutdown code.
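The snippet above omits the surrounding type. A runnable sketch of the same idea follows; the constructor and the `IsShutdown` and `Channel` helpers are assumptions added so the example is complete, and the type name is shortened to mark it as illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	statusOpen int32 = iota
	statusClosed
)

// ShutdownOnce closes its channel at most once (hypothetical rebuild).
type ShutdownOnce struct {
	status  int32
	channel chan struct{}
}

func NewShutdownOnce() *ShutdownOnce {
	return &ShutdownOnce{channel: make(chan struct{})}
}

// Shutdown is safe to call from many goroutines: only the caller that
// wins the compare-and-swap closes the channel.
func (c *ShutdownOnce) Shutdown() {
	if atomic.CompareAndSwapInt32(&c.status, statusOpen, statusClosed) {
		close(c.channel)
	}
}

// IsShutdown reports whether Shutdown has been called.
func (c *ShutdownOnce) IsShutdown() bool {
	return atomic.LoadInt32(&c.status) == statusClosed
}

// Channel is closed once shutdown has happened.
func (c *ShutdownOnce) Channel() <-chan struct{} { return c.channel }

func main() {
	s := NewShutdownOnce()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Shutdown() // ten concurrent calls, one close
		}()
	}
	wg.Wait()
	<-s.Channel()
	fmt.Println(s.IsShutdown())
}
```

`sync.Once` would give the same at-most-once guarantee; the CAS form additionally exposes the status as a cheap atomic read.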
### Interleaved Weighted Round Robin Scheduler
Custom task scheduler that interleaves tasks from
different channels based on configurable weights.
Uses dynamic config for weight updates without restart.
This is their answer to fair scheduling across
namespaces with different SLAs.
### serviceerror Package (domain error types)
Instead of wrapping standard errors, Temporal defines
domain-specific error types that map directly to gRPC
status codes:
- `StickyWorkerUnavailable`
- `ShardOwnershipLost`
- `TaskAlreadyStarted`
- `CurrentBranchChanged`
Each is a struct implementing `error` with specific
fields needed for retry/recovery decisions.
---
## Cross-Ecosystem Observations
### Temporal vs CockroachDB
| Concern | Temporal | CockroachDB |
|---------|----------|-------------|
| Goroutine mgmt | goro.Handle (2021) | stop.Handle (2025) |
| Assertions | softassert (log) | AssertionFailed (error) |
| Config | 566 dynamic settings | Cluster settings |
| DI | uber/fx | Manual wiring |
| State machines | First-class HSM framework | Ad-hoc per component |
| Error types | Domain structs → gRPC | Sentinel + wrapping |
| TODO style | No owner | `TODO(username)` |
### Temporal vs Prometheus
| Concern | Temporal | Prometheus |
|---------|----------|------------|
| Plugin system | init() registration | init() registration |
| Logging | Custom log package | promslog (slog) |
| Interfaces | Heavy use | Minimal, targeted |
| Generics | 1,928 usages | Minimal |
| Global state | Avoided (fx wiring) | Accepted for hot paths |
### Key Differences from CockroachDB
1. **uber/fx is a conscious choice** — Temporal's service
graph is complex enough to justify a DI framework.
CockroachDB explicitly avoids frameworks.
2. **HSM is THE architecture** — Everything in Temporal
is a state machine. CockroachDB has state machines
but doesn't have a unified framework for them.
3. **CHASM splits mutable/immutable at the type level**
— This is Temporal's strongest pattern. Making
mutation impossible in read paths via the type system.
4. **goro.Handle predates CockroachDB's Handle by 3.5
years** — Same problem (goroutine lifecycle), same
solution (context + done channel + safe multi-stop),
invented independently.
---
## Lessons for Code Review
1. **"If a reviewer needs to ask, the code needs a
comment"** — Temporal's review culture promotes
comments that explain non-obvious decisions.
2. **Separate mutable from immutable contexts at the
type level** — Don't rely on documentation to prevent
mutation in read paths.
3. **Soft assertions > panics in distributed systems**
— Log the invariant violation, continue serving.
   Catch the failed assertion later in functional tests.
4. **Domain error types beat generic wrapping** when
errors drive retry/routing decisions. A struct with
specific fields is more useful than `fmt.Errorf`.
5. **DI frameworks are justified when the service graph
is genuinely complex** — 4 services × multiple
backends × configurable interceptors × optional
features = real complexity.
6. **HSM frameworks centralize correctness** — Moving
state transition validation into a framework means
every component gets it right by construction instead
of by discipline.
<!-- PATTERN_COMPLETE -->