docs: add full architectural analysis
Repo shape, import hierarchy, HSM/CHASM architecture, PR discussions, code quality metrics, cross-ecosystem comparisons
# Temporal: Architectural Analysis

**Repo:** github.com/temporalio/temporal
**Size:** 181M, 2,645 Go files, 8,958 commits, 290
contributors
**Category:** Durable execution engine / workflow
orchestrator

---

## Repository Shape

```
temporal/
├── api/                  # Generated protobuf service definitions
├── chasm/                # NEW: Component-based HSM Architecture
├── client/               # Internal service clients
├── cmd/                  # Entry points (server, tools)
├── common/               # Shared infrastructure (massive)
│   ├── backoff/
│   ├── channel/
│   ├── clock/
│   ├── dynamicconfig/    # 566 runtime-configurable settings
│   ├── goro/             # Goroutine lifecycle management
│   ├── log/
│   ├── metrics/
│   ├── namespace/
│   ├── persistence/      # Multi-backend storage abstraction
│   ├── quotas/           # Rate limiting infrastructure
│   ├── softassert/       # Production assertions (log, don't crash)
│   └── tasks/            # Scheduler primitives (IWRR, FIFO, etc.)
├── components/           # Feature modules (callbacks, nexus, schedulers)
├── service/
│   ├── frontend/         # gRPC API handlers
│   ├── history/          # Workflow state machine execution
│   │   ├── hsm/          # Hierarchical State Machine framework
│   │   └── queues/       # Task queue processing
│   ├── matching/         # Task dispatch / worker routing
│   └── worker/           # System workflows
└── tests/                # Integration / functional tests
```

### Import Hierarchy (most depended-upon)

1. `common`: 7,257 imports (the foundation)
2. `api`: 1,731 (protobuf contracts)
3. `service`: 1,693 (business logic)
4. `chasm`: 497 (rapidly growing new framework)
5. `tests`: 125 (integration harness)

---

## Key Architectural Patterns

### 1. Hierarchical State Machines (HSM)

**PR #5494 (Mar 2024, 51 review comments):**

The HSM framework is Temporal's core abstraction. Every
workflow execution is a tree of state machines: the
workflow itself, its activities, child workflows, timers,
callbacks, and Nexus operations.

```go
// Excerpt: TaskRegenerator and TransitionOutput are defined elsewhere in the package.
type StateMachine[S comparable] interface {
	TaskRegenerator
	State() S
	SetState(S)
}

// Transition moves a state machine from one of Sources to Destination.
type Transition[S comparable, SM StateMachine[S], E any] struct {
	Sources     []S
	Destination S
	apply       func(SM, E) (TransitionOutput, error)
}
```

**The key insight:** Type-safe state transitions with
source validation. `Transition.Apply()` checks
`slices.Contains(t.Sources, sm.State())` before
allowing the state change. Invalid transitions return
`ErrInvalidTransition` rather than silently corrupting
state.

**From PR discussion (tdeebswihart):**
> "I wish we'd gone with the standard `fsm` name here.
> HSM keeps making me think of Hardware Security
> Modules."

**From PR discussion (bergundy, the author):**
> "I don't consider this a final approach but I do think
> it's a step in the right direction. We need to model
> more state machines on top of this to form a more
> solid API."

The author is explicit that the design is iterative: the
framework shipped "not final" and evolved through real usage.

### 2. CHASM (Component Architecture for State Machines)

**PR #6987 (Dec 2024–Jan 2025, 60 review comments):**

CHASM replaces the old ad-hoc component system. It's
a framework for building HSM-based components with:
- Declarative field definitions
- Mutable vs immutable contexts (type-enforced)
- Parent-child component relationships
- Task generation from transitions

**Key discussion points:**

**bergundy (author):** "I would put this in a top level
`chasm` directory. There's likely going to be some
chasm related code in other services."

**yycptt:** "Having the implementation in the top level
package instead of service/history feels weird."
(The response proposed a re-export strategy.)

**Sushisource:** "I think I prefer them separate,
because what happens if you mutate something and then
say 'not ready'? That would be some weird violation
that shouldn't be possible, and separate contexts
enforces that at the type level."

→ **Decision: Split MutableContext vs Context at the
type level** to make invalid operations unrepresentable.
This is the "making wrong things impossible" philosophy
in action.

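The split can be sketched in a few lines. The interfaces below only illustrate the idea and are not CHASM's actual definitions: the method sets (`Now`, `AddTask`), the `txn` implementation, and the `counter` component are all assumptions made for this example.

```go
package main

import "fmt"

// Context is the read-only view: it exposes no operation that changes state.
type Context interface {
	Now() int64
}

// MutableContext embeds Context and adds the write operations.
type MutableContext interface {
	Context
	AddTask(name string)
}

// txn is a hypothetical implementation that records generated tasks.
type txn struct {
	now   int64
	tasks []string
}

func (x *txn) Now() int64          { return x.now }
func (x *txn) AddTask(name string) { x.tasks = append(x.tasks, name) }

// counter is a hypothetical component.
type counter struct{ n int }

// IsReady is a read path: holding only a Context, it cannot schedule tasks.
func (c *counter) IsReady(ctx Context) bool { return c.n > 0 && ctx.Now() > 0 }

// Increment is a write path and therefore asks for a MutableContext.
func (c *counter) Increment(ctx MutableContext) {
	c.n++
	ctx.AddTask("counter-updated")
}

func main() {
	tx := &txn{now: 1}
	c := &counter{}
	fmt.Println(c.IsReady(tx))
	c.Increment(tx)
	fmt.Println(c.IsReady(tx), tx.tasks)
}
```

Calling `ctx.AddTask` inside `IsReady` would not compile, which is the guarantee the reviewers wanted. In this sketch the guarantee covers only operations that go through the context; a pointer receiver could still modify its own fields.
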
### 3. Goroutine Lifecycle (goro.Handle)

**PR #1892 (Sep 2021, 15 review comments):**

Introduced to fix a **double-close panic** in the task
writer. The pattern closely resembles CockroachDB's
Handle (introduced 2025) and predates it by about
3.5 years.

```go
// Handle tracks a single goroutine: its context, a cancel
// function, a channel closed on exit, and the final error.
type Handle struct {
	context context.Context
	cancel  context.CancelFunc
	done    chan struct{}
	err     atomic.Value
}
```

**From PR discussion (mmcshane, author):**
> "One thing you might not guess about Stop() is that
> it removes itself from the parent matching engine. I
> don't like this 'remove yourself' behavior because it
> puts the control logic in the wrong place (i.e. in
> the controlled object rather than the controller)."

**Reviewer (paulnpdev):**
> "If an expert questions what the code is doing, it
> deserves a comment."

This principle, "if a reviewer needs to ask, the code
needs a comment", is enforced through review culture.

### 4. Soft Assertions (softassert)

**PR #7411 (Mar 2025, 46 review comments):**

Production code that logs errors for invariant
violations but doesn't crash:

```go
softassert.That(logger, object.state == "ready",
	"object is not ready")
```

**From PR (stephanos):**
> "**Why not panic?** Maybe in the future. For now,
> we're happy with finding these failed assertions in
> functional tests."

This is Temporal's counterpart to CockroachDB's
`errors.AssertionFailedf`: a way to mark "this should
never happen" without crashing production. The key
difference: CockroachDB promotes these to errors that
may crash; Temporal logs them and continues.

### 5. Dynamic Configuration (566 settings)

Temporal's most extreme pattern: **566 runtime-configurable
settings** with type-safe resolution and
namespace-scoped overrides.

```go
var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
	"admin.enableListHistoryTasks",
	true,
	`Description here`,
)
```

Settings use generics for type safety and resolve with
precedence: task queue → namespace → global.

The `Collection` uses `weak.Pointer` for cache
invalidation (Go 1.24 feature) and `goro.Group` for
background polling, showing how internal packages
compose.

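The precedence rule (task queue, then namespace, then global) can be illustrated with a small generic resolver. Everything here is hypothetical: the `Setting` and `Overrides` types, the `"namespace/taskQueue"` key format, and the `example.maxPollers` key are not the `dynamicconfig` API.

```go
package main

import "fmt"

// Setting is a hypothetical typed setting: a key plus a default of type T.
type Setting[T any] struct {
	Key     string
	Default T
}

// Overrides holds configured values by scope.
type Overrides[T any] struct {
	ByTaskQueue map[string]T // keyed by "namespace/taskQueue"
	ByNamespace map[string]T
	Global      *T
}

// Resolve returns the most specific value available: task queue, then
// namespace, then global, then the compiled-in default.
func (s Setting[T]) Resolve(o Overrides[T], namespace, taskQueue string) T {
	if v, ok := o.ByTaskQueue[namespace+"/"+taskQueue]; ok {
		return v
	}
	if v, ok := o.ByNamespace[namespace]; ok {
		return v
	}
	if o.Global != nil {
		return *o.Global
	}
	return s.Default
}

func main() {
	maxPollers := Setting[int]{Key: "example.maxPollers", Default: 10}
	global := 20
	o := Overrides[int]{
		ByNamespace: map[string]int{"billing": 50},
		Global:      &global,
	}
	fmt.Println(maxPollers.Resolve(o, "billing", "invoices")) // namespace override: 50
	fmt.Println(maxPollers.Resolve(o, "search", "index"))     // global override: 20
}
```

The type parameter carries the value type from the declaration to the call site, so a caller cannot read an `int` setting as a `bool`.
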
### 6. Persistence Plugin System (init registration)

```go
// Runs when this package is imported and adds the plugin
// to the global SQL plugin registry.
func init() {
	sql.RegisterPlugin(PluginName, &plugin{
		driver: &driver.PQDriver{},
	})
}
```

Classic Go plugin pattern using `init()` + global
registry. Supports: PostgreSQL (lib/pq + pgx), MySQL,
SQLite, Cassandra. Because registration happens at init
time, a plugin is available only if its package is
imported (the `cmd/` packages import the plugins they
want).

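A stripped-down version of the registry side looks roughly like this. The `Plugin` interface, the `Lookup` helper, the panic on duplicate names, and the `postgres-example` plugin are assumptions for the sketch, not Temporal's `sql` package.

```go
package main

import (
	"fmt"
	"sort"
)

// Plugin is a hypothetical stand-in for a storage backend.
type Plugin interface{ Name() string }

// registry is the package-level table that init functions fill in.
var registry = map[string]Plugin{}

// RegisterPlugin panics on duplicates so that a misconfigured build fails at startup.
func RegisterPlugin(name string, p Plugin) {
	if _, dup := registry[name]; dup {
		panic("plugin " + name + " already registered")
	}
	registry[name] = p
}

// Lookup returns the plugin registered under name, or an error listing what is available.
func Lookup(name string) (Plugin, error) {
	if p, ok := registry[name]; ok {
		return p, nil
	}
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return nil, fmt.Errorf("unknown plugin %q (registered: %v)", name, names)
}

type fakePostgres struct{}

func (fakePostgres) Name() string { return "postgres-example" }

// In a real layout this init would live in the plugin's own package and run
// only when that package is imported, often through a blank import.
func init() { RegisterPlugin("postgres-example", fakePostgres{}) }

func main() {
	p, err := Lookup("postgres-example")
	fmt.Println(p.Name(), err)
	_, err = Lookup("mysql-example")
	fmt.Println(err)
}
```

The lookup error lists the registered names, which helps when a binary was built without the blank import for the backend named in its configuration.
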
### 7. uber/fx Dependency Injection

Temporal uses uber/fx for service construction. Each
service has an `fx.go` that declares providers and
consumers:

```go
// fx.In marks the struct as a parameter object: fx fills
// each field from the dependency graph.
type GrpcServerOptionsParams struct {
	fx.In
	Logger                        log.Logger
	RPCFactory                    common.RPCFactory
	RetryableInterceptor          *interceptor.RetryableInterceptor
	NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
}
```

This is unusual for Go, where most projects avoid DI
frameworks. Temporal justifies it because the service
graph is genuinely complex (4 services × multiple
backends × configurable interceptors).

---

## Code Quality Markers

| Metric | Count |
|--------|-------|
| TODOs (non-test) | 738 |
| FIXMEs | 0 |
| HACKs | 5 |
| Mock files | 152 |
| Test files | 785 |
| Integration tests | 113 |
| Generic usages | 1,928 |

**TODO style:** `// TODO: description` (no owner tag).
Compare to CockroachDB's `// TODO(username):` style;
Temporal does not record who is responsible for a TODO.

---

## Patterns Unique to Temporal

### ShutdownOnce (safe multi-close)

```go
// Shutdown closes the channel at most once: only the caller
// that wins the compare-and-swap performs the close.
func (c *ShutdownOnceImpl) Shutdown() {
	if atomic.CompareAndSwapInt32(
		&c.status,
		shutdownOnceStatusOpen,
		shutdownOnceStatusClosed,
	) {
		close(c.channel)
	}
}
```

CAS-based channel close that's safe to call multiple
times. Solves the "close of closed channel" panic that
plagues concurrent shutdown code.

### Interleaved Weighted Round Robin Scheduler

Custom task scheduler that interleaves tasks from
different channels based on configurable weights.
Uses dynamic config for weight updates without restart.
This is their answer to fair scheduling across
namespaces with different SLAs.

### serviceerror Package (domain error types)

Instead of wrapping standard errors, Temporal defines
domain-specific error types that map directly to gRPC
status codes:
- `StickyWorkerUnavailable`
- `ShardOwnershipLost`
- `TaskAlreadyStarted`
- `CurrentBranchChanged`

Each is a struct implementing `error` with specific
fields needed for retry/recovery decisions.

---

## Cross-Ecosystem Observations

### Temporal vs CockroachDB

| Concern | Temporal | CockroachDB |
|---------|----------|-------------|
| Goroutine mgmt | goro.Handle (2021) | stop.Handle (2025) |
| Assertions | softassert (log) | AssertionFailedf (error) |
| Config | 566 dynamic settings | Cluster settings |
| DI | uber/fx | Manual wiring |
| State machines | First-class HSM framework | Ad-hoc per component |
| Error types | Domain structs → gRPC | Sentinel + wrapping |
| TODO style | No owner | `TODO(username)` |

### Temporal vs Prometheus

| Concern | Temporal | Prometheus |
|---------|----------|------------|
| Plugin system | init() registration | init() registration |
| Logging | Custom log package | promslog (slog) |
| Interfaces | Heavy use | Minimal, targeted |
| Generics | 1,928 usages | Minimal |
| Global state | Avoided (fx wiring) | Accepted for hot paths |

### Key Differences from CockroachDB

1. **uber/fx is a conscious choice**: Temporal's service
   graph is complex enough to justify a DI framework.
   CockroachDB explicitly avoids frameworks.

2. **HSM is THE architecture**: Everything in Temporal
   is a state machine. CockroachDB has state machines
   but doesn't have a unified framework for them.

3. **CHASM splits mutable/immutable at the type level**:
   This is Temporal's strongest pattern. The type system
   makes mutation impossible in read paths.

4. **goro.Handle predates CockroachDB's Handle by 3.5
   years**: Same problem (goroutine lifecycle), same
   solution (context + done channel + safe multi-stop),
   invented independently.

---

## Lessons for Code Review

1. **"If a reviewer needs to ask, the code needs a
   comment"**: Temporal's review culture promotes
   comments that explain non-obvious decisions.

2. **Separate mutable from immutable contexts at the
   type level**: Don't rely on documentation to prevent
   mutation in read paths.

3. **Soft assertions > panics in distributed systems**:
   Log the invariant violation, keep serving, and catch
   the failure in functional tests.

4. **Domain error types beat generic wrapping** when
   errors drive retry/routing decisions. A struct with
   specific fields is more useful than `fmt.Errorf`.

5. **DI frameworks are justified when the service graph
   is genuinely complex**: 4 services × multiple
   backends × configurable interceptors × optional
   features = real complexity.

6. **HSM frameworks centralize correctness**: Moving
   state transition validation into a framework means
   every component gets it right by construction instead
   of by discipline.

<!-- PATTERN_COMPLETE -->