docs: add full architectural analysis
Repo shape, import hierarchy, HSM/CHASM architecture, PR discussions, code quality metrics, cross-ecosystem comparisons
# Temporal: Architectural Analysis

**Repo:** github.com/temporalio/temporal
**Size:** 181M, 2,645 Go files, 8,958 commits, 290 contributors
**Category:** Durable execution engine / workflow orchestrator

---

## Repository Shape

```
temporal/
├── api/              # Generated protobuf service definitions
├── chasm/            # NEW: Component-based HSM Architecture
├── client/           # Internal service clients
├── cmd/              # Entry points (server, tools)
├── common/           # Shared infrastructure (massive)
│   ├── backoff/
│   ├── channel/
│   ├── clock/
│   ├── dynamicconfig/  # 566 runtime-configurable settings
│   ├── goro/           # Goroutine lifecycle management
│   ├── log/
│   ├── metrics/
│   ├── namespace/
│   ├── persistence/    # Multi-backend storage abstraction
│   ├── quotas/         # Rate limiting infrastructure
│   ├── softassert/     # Production assertions (log, don't crash)
│   └── tasks/          # Scheduler primitives (IWRR, FIFO, etc.)
├── components/       # Feature modules (callbacks, nexus, schedulers)
├── service/
│   ├── frontend/     # gRPC API handlers
│   ├── history/      # Workflow state machine execution
│   │   ├── hsm/      # Hierarchical State Machine framework
│   │   └── queues/   # Task queue processing
│   ├── matching/     # Task dispatch / worker routing
│   └── worker/       # System workflows
└── tests/            # Integration / functional tests
```

### Import Hierarchy (most depended-upon)

1. `common`: 7,257 imports (the foundation)
2. `api`: 1,731 (protobuf contracts)
3. `service`: 1,693 (business logic)
4. `chasm`: 497 (rapidly growing new framework)
5. `tests`: 125 (integration harness)

---

## Key Architectural Patterns

### 1. Hierarchical State Machines (HSM)

**PR #5494 (Mar 2024, 51 review comments):**

The HSM framework is Temporal's core abstraction. Every
workflow execution is a tree of state machines: the
workflow itself, its activities, child workflows, timers,
callbacks, and Nexus operations.

```go
// StateMachine is implemented by any component whose
// state of type S can be read and replaced.
type StateMachine[S comparable] interface {
	TaskRegenerator
	State() S
	SetState(S)
}

// Transition moves a machine from any state in Sources to
// Destination when an event of type E is applied.
type Transition[S comparable, SM StateMachine[S], E any] struct {
	Sources     []S
	Destination S
	apply       func(SM, E) (TransitionOutput, error)
}
```

**The key insight:** Type-safe state transitions with
source validation. `Transition.Apply()` checks
`slices.Contains(t.Sources, sm.State())` before
allowing the state change. Invalid transitions return
`ErrInvalidTransition` rather than silently corrupting
state.

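The source check described above can be sketched in a few lines. This is a simplified illustration, not the framework's code: the `machine` interface, the `transition` type without an event payload, and the `timer` example are all assumptions made for the sketch; only the `slices.Contains` check and the `ErrInvalidTransition` name come from the text.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// ErrInvalidTransition reuses the error name quoted above; this definition is illustrative.
var ErrInvalidTransition = errors.New("invalid transition")

// machine is a hypothetical, reduced form of the StateMachine interface.
type machine[S comparable] interface {
	State() S
	SetState(S)
}

// transition is a simplified stand-in for Transition: no event payload, no output.
type transition[S comparable] struct {
	Sources     []S
	Destination S
}

// Apply moves sm to Destination only if its current state is an allowed source.
func (t transition[S]) Apply(sm machine[S]) error {
	if !slices.Contains(t.Sources, sm.State()) {
		return ErrInvalidTransition
	}
	sm.SetState(t.Destination)
	return nil
}

// timer is a toy state machine with string states.
type timer struct{ state string }

func (t *timer) State() string     { return t.state }
func (t *timer) SetState(s string) { t.state = s }

func main() {
	fire := transition[string]{Sources: []string{"scheduled"}, Destination: "fired"}
	tm := &timer{state: "scheduled"}
	fmt.Println(fire.Apply(tm), tm.State()) // first call is a valid transition
	fmt.Println(fire.Apply(tm), tm.State()) // "fired" is not a source, so this is rejected
}
```

Because the check lives in `Apply`, a component cannot skip it by accident; the state only changes when the current state is listed in `Sources`.
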
**From PR discussion (tdeebswihart):**
> "I wish we'd gone with the standard `fsm` name here.
> HSM keeps making me think of Hardware Security
> Modules."

**From PR discussion (bergundy, the author):**
> "I don't consider this a final approach but I do think
> it's a step in the right direction. We need to model
> more state machines on top of this to form a more
> solid API."

The author is explicit that this is iterative: the
framework shipped as "not final" and evolved through
real usage.

### 2. CHASM (Component Architecture for State Machines)

**PR #6987 (Dec 2024–Jan 2025, 60 review comments):**

CHASM replaces the old ad-hoc component system. It's
a framework for building HSM-based components with:
- Declarative field definitions
- Mutable vs immutable contexts (type-enforced)
- Parent-child component relationships
- Task generation from transitions

**Key discussion points:**

**bergundy (author):** "I would put this in a top level
`chasm` directory. There's likely going to be some
chasm related code in other services."

**yycptt:** "Having the implementation in the top level
package instead of service/history feels weird."
(Responded with re-export strategy.)

**Sushisource:** "I think I prefer them separate,
because what happens if you mutate something and then
say 'not ready'? That would be some weird violation
that shouldn't be possible, and separate contexts
enforces that at the type level."

→ **Decision: Split MutableContext vs Context at the
type level** to make invalid operations unrepresentable.
This is the "making wrong things impossible" philosophy
in action.

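A minimal sketch of the read/write split, assuming two hypothetical interfaces. The names `Context` and `MutableContext` come from the discussion above; the `Get`/`Set` methods, the `store` type, and both helper functions are invented for illustration and do not reflect the CHASM API.

```go
package main

import "fmt"

// Context is a hypothetical read-only view of a component.
type Context interface {
	Get(key string) string
}

// MutableContext adds write access; only transition code receives it.
type MutableContext interface {
	Context
	Set(key, value string)
}

// store is a toy implementation backing both views.
type store struct{ fields map[string]string }

func (s *store) Get(key string) string { return s.fields[key] }
func (s *store) Set(key, value string) { s.fields[key] = value }

// isReady is a read path: it receives Context, so ctx.Set(...) would not compile here.
func isReady(ctx Context) bool {
	return ctx.Get("status") == "ready"
}

// markReady is a write path and must be handed a MutableContext.
func markReady(ctx MutableContext) {
	ctx.Set("status", "ready")
}

func main() {
	s := &store{fields: map[string]string{}}
	fmt.Println(isReady(s))
	markReady(s)
	fmt.Println(isReady(s))
}
```

The guarantee is a compile-time one: a function that only accepts `Context` has no method through which it could mutate state and then report "not ready".
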
### 3. Goroutine Lifecycle (goro.Handle)

**PR #1892 (Sep 2021, 15 review comments):**

`goro.Handle` was introduced to fix a **double-close
panic** in the task writer. The pattern is strikingly
similar to CockroachDB's Handle (introduced 2025), but
predates it by roughly 3.5 years.

```go
type Handle struct {
	context context.Context
	cancel  context.CancelFunc
	done    chan struct{}
	err     atomic.Value
}
```

**From PR discussion (mmcshane, author):**
> "One thing you might not guess about Stop() is that
> it removes itself from the parent matching engine. I
> don't like this 'remove yourself' behavior because it
> puts the control logic in the wrong place (i.e. in
> the controlled object rather than the controller)."

**Reviewer (paulnpdev):**
> "If an expert questions what the code is doing, it
> deserves a comment."

This principle ("if a reviewer needs to ask, the code
needs a comment") is enforced through review culture.

### 4. Soft Assertions (softassert)

**PR #7411 (Mar 2025, 46 review comments):**

Production code that logs errors for invariant
violations but doesn't crash:

```go
softassert.That(logger, object.state == "ready",
	"object is not ready")
```

**From PR (stephanos):**
> "**Why not panic?** Maybe in the future. For now,
> we're happy with finding these failed assertions in
> functional tests."

This is Temporal's version of CockroachDB's
`errors.AssertionFailedf`: a way to mark "this should
never happen" without crashing production. The key
difference: CockroachDB promotes these to errors that
may crash; Temporal logs them and continues.

### 5. Dynamic Configuration (566 settings)

Temporal's most extreme pattern: **566
runtime-configurable settings** with type-safe
resolution and namespace-scoped overrides.

```go
var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
	"admin.enableListHistoryTasks",
	true,
	`Description here`,
)
```

Settings use generics for type safety. The most specific
override wins, in the order task queue → namespace →
global.

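The precedence rule is easy to see in a toy resolver. Everything here is an assumption for illustration: the `setting` type, its map-based overrides, the `namespace/taskQueue` key format, and the example values are not taken from the `dynamicconfig` package.

```go
package main

import "fmt"

// setting is a hypothetical typed setting with layered overrides.
type setting[T any] struct {
	global      T
	byNamespace map[string]T
	byTaskQueue map[string]T // keyed by "namespace/taskQueue" in this sketch
}

// get resolves the most specific value: task queue, then namespace, then global.
func (s setting[T]) get(namespace, taskQueue string) T {
	if v, ok := s.byTaskQueue[namespace+"/"+taskQueue]; ok {
		return v
	}
	if v, ok := s.byNamespace[namespace]; ok {
		return v
	}
	return s.global
}

func main() {
	maxPollers := setting[int]{
		global:      10,
		byNamespace: map[string]int{"billing": 20},
		byTaskQueue: map[string]int{"billing/invoices": 50},
	}
	fmt.Println(maxPollers.get("billing", "invoices")) // task queue override
	fmt.Println(maxPollers.get("billing", "refunds"))  // namespace override
	fmt.Println(maxPollers.get("search", "index"))     // global default
}
```

The type parameter means a caller asking for an `int` setting cannot receive a string by mistake, which is the type-safety point made above.
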
The `Collection` uses `weak.Pointer` for cache
invalidation (the `weak` package was added in Go 1.24)
and `goro.Group` for background polling, showing how
internal packages compose.

### 6. Persistence Plugin System (init registration)

```go
func init() {
	sql.RegisterPlugin(PluginName, &plugin{
		driver: &driver.PQDriver{},
	})
}
```

This is the classic Go plugin pattern: `init()` plus a
global registry. Supported backends: PostgreSQL
(lib/pq + pgx), MySQL, SQLite, Cassandra. Because
registration happens at init time, a plugin is available
only if its package is imported, so the `cmd/` packages
import the plugins they want.

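The registration mechanics can be sketched in a single file. The `plugin` interface, `registerPlugin`, and `postgresPlugin` below are hypothetical names for this example; in a real layout the `init` function would live in the plugin's own package and run only when that package is imported.

```go
package main

import (
	"fmt"
	"sort"
)

// plugin is a hypothetical stand-in for a SQL persistence plugin.
type plugin interface {
	DriverName() string
}

// registry is the package-level map that init functions populate.
var registry = map[string]plugin{}

// registerPlugin adds a plugin under a unique name and panics on duplicates.
func registerPlugin(name string, p plugin) {
	if _, dup := registry[name]; dup {
		panic("plugin registered twice: " + name)
	}
	registry[name] = p
}

type postgresPlugin struct{}

func (postgresPlugin) DriverName() string { return "postgres" }

// init runs before main, so the plugin is registered without any explicit call.
func init() {
	registerPlugin("postgres", postgresPlugin{})
}

// registered lists the plugin names in sorted order.
func registered() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(registered())
}
```

With plugins split across packages, a binary typically selects its backends with blank imports such as `import _ "example.com/plugins/postgres"` (a placeholder path).
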
### 7. uber/fx Dependency Injection

Temporal uses uber/fx for service construction. Each
service has an `fx.go` that declares providers and
consumers:

```go
type GrpcServerOptionsParams struct {
	fx.In
	Logger                        log.Logger
	RPCFactory                    common.RPCFactory
	RetryableInterceptor          *interceptor.RetryableInterceptor
	NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
}
```

This is unusual for Go, where most projects avoid DI
frameworks. The justification is that Temporal's service
graph is genuinely complex (4 services × multiple
backends × configurable interceptors).

---

## Code Quality Markers

| Metric | Count |
|--------|-------|
| TODOs (non-test) | 738 |
| FIXMEs | 0 |
| HACKs | 5 |
| Mock files | 152 |
| Test files | 785 |
| Integration tests | 113 |
| Generic usages | 1,928 |

**TODO style:** `// TODO: description` (no owner tag).
Compare to CockroachDB's `// TODO(username):`. Temporal
doesn't track who is responsible for a TODO.

---

## Patterns Unique to Temporal

### ShutdownOnce (safe multi-close)

```go
func (c *ShutdownOnceImpl) Shutdown() {
	if atomic.CompareAndSwapInt32(
		&c.status,
		shutdownOnceStatusOpen,
		shutdownOnceStatusClosed,
	) {
		close(c.channel)
	}
}
```

CAS-based channel close that's safe to call multiple
times. Solves the "close of closed channel" panic that
plagues concurrent shutdown code.

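A self-contained variant makes the concurrent case concrete. This sketch is an assumption-laden rewrite, not the original type: it uses `atomic.Bool` instead of an `int32` status, and `shutdownOnce`, `newShutdownOnce`, and `IsShutdown` are names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// shutdownOnce is a simplified, hypothetical version of the type above.
type shutdownOnce struct {
	closed  atomic.Bool
	channel chan struct{}
}

func newShutdownOnce() *shutdownOnce {
	return &shutdownOnce{channel: make(chan struct{})}
}

// Shutdown closes the channel on the first call and does nothing afterwards.
func (s *shutdownOnce) Shutdown() {
	if s.closed.CompareAndSwap(false, true) {
		close(s.channel)
	}
}

// IsShutdown reports whether Shutdown has been called.
func (s *shutdownOnce) IsShutdown() bool { return s.closed.Load() }

func main() {
	s := newShutdownOnce()

	// Several goroutines race to shut down; only one wins the CAS and closes.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Shutdown()
		}()
	}
	wg.Wait()

	<-s.channel // returns immediately because the channel is closed
	fmt.Println(s.IsShutdown())
}
```

With a bare `close(s.channel)` in place of the CAS guard, the second goroutine to arrive would panic.
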
### Interleaved Weighted Round Robin Scheduler

A custom task scheduler that interleaves tasks from
different channels based on configurable weights. The
weights come from dynamic config, so they can be updated
without a restart. This is Temporal's answer to fair
scheduling across namespaces with different SLAs.

### serviceerror Package (domain error types)

Instead of wrapping standard errors, Temporal defines
domain-specific error types that map directly to gRPC
status codes:
- `StickyWorkerUnavailable`
- `ShardOwnershipLost`
- `TaskAlreadyStarted`
- `CurrentBranchChanged`

Each is a struct implementing `error` with specific
fields needed for retry/recovery decisions.

---

## Cross-Ecosystem Observations

### Temporal vs CockroachDB

| Concern | Temporal | CockroachDB |
|---------|----------|-------------|
| Goroutine mgmt | goro.Handle (2021) | stop.Handle (2025) |
| Assertions | softassert (log) | AssertionFailedf (error) |
| Config | 566 dynamic settings | Cluster settings |
| DI | uber/fx | Manual wiring |
| State machines | First-class HSM framework | Ad-hoc per component |
| Error types | Domain structs → gRPC | Sentinel + wrapping |
| TODO style | No owner | `TODO(username)` |

### Temporal vs Prometheus

| Concern | Temporal | Prometheus |
|---------|----------|------------|
| Plugin system | init() registration | init() registration |
| Logging | Custom log package | promslog (slog) |
| Interfaces | Heavy use | Minimal, targeted |
| Generics | 1,928 usages | Minimal |
| Global state | Avoided (fx wiring) | Accepted for hot paths |

### Key Differences from CockroachDB

1. **uber/fx is a conscious choice**: Temporal's service
   graph is complex enough to justify a DI framework.
   CockroachDB explicitly avoids frameworks.

2. **HSM is THE architecture**: Everything in Temporal
   is a state machine. CockroachDB has state machines
   but doesn't have a unified framework for them.

3. **CHASM splits mutable/immutable at the type level**:
   This is Temporal's strongest pattern. It makes
   mutation impossible in read paths via the type system.

4. **goro.Handle predates CockroachDB's Handle by 3.5
   years**: Same problem (goroutine lifecycle), same
   solution (context + done channel + safe multi-stop),
   invented independently.

---

## Lessons for Code Review

1. **"If a reviewer needs to ask, the code needs a
   comment"**: Temporal's review culture promotes
   comments that explain non-obvious decisions.

2. **Separate mutable from immutable contexts at the
   type level**: Don't rely on documentation to prevent
   mutation in read paths.

3. **Soft assertions > panics in distributed systems**:
   Log the invariant violation and continue serving.
   Catch the failed assertion later in functional tests.

4. **Domain error types beat generic wrapping** when
   errors drive retry/routing decisions. A struct with
   specific fields is more useful than `fmt.Errorf`.

5. **DI frameworks are justified when the service graph
   is genuinely complex**: 4 services × multiple
   backends × configurable interceptors × optional
   features = real complexity.

6. **HSM frameworks centralize correctness**: Moving
   state transition validation into a framework means
   every component gets it right by construction instead
   of by discipline.

<!-- PATTERN_COMPLETE -->