# Temporal: Architectural Analysis

**Repo:** github.com/temporalio/temporal
**Size:** 181M, 2,645 Go files, 8,958 commits, 290 contributors
**Category:** Durable execution engine / workflow orchestrator

---

## Repository Shape

```
temporal/
├── api/                  # Generated protobuf service definitions
├── chasm/                # NEW: Component-based HSM Architecture
├── client/               # Internal service clients
├── cmd/                  # Entry points (server, tools)
├── common/               # Shared infrastructure (massive)
│   ├── backoff/
│   ├── channel/
│   ├── clock/
│   ├── dynamicconfig/    # 566 runtime-configurable settings
│   ├── goro/             # Goroutine lifecycle management
│   ├── log/
│   ├── metrics/
│   ├── namespace/
│   ├── persistence/      # Multi-backend storage abstraction
│   ├── quotas/           # Rate limiting infrastructure
│   ├── softassert/       # Production assertions (log, don't crash)
│   └── tasks/            # Scheduler primitives (IWRR, FIFO, etc.)
├── components/           # Feature modules (callbacks, nexus, schedulers)
├── service/
│   ├── frontend/         # gRPC API handlers
│   ├── history/          # Workflow state machine execution
│   │   ├── hsm/          # Hierarchical State Machine framework
│   │   └── queues/       # Task queue processing
│   ├── matching/         # Task dispatch / worker routing
│   └── worker/           # System workflows
└── tests/                # Integration / functional tests
```

### Import Hierarchy (most depended-upon)

1. `common`: 7,257 imports (the foundation)
2. `api`: 1,731 (protobuf contracts)
3. `service`: 1,693 (business logic)
4. `chasm`: 497 (rapidly growing new framework)
5. `tests`: 125 (integration harness)

---

## Key Architectural Patterns

### 1. Hierarchical State Machines (HSM)

**PR #5494 (Mar 2024, 51 review comments):**

The HSM framework is Temporal's core abstraction. Every
workflow execution is a tree of state machines: the
workflow itself, its activities, child workflows, timers,
callbacks, and nexus operations.

```go
type StateMachine[S comparable] interface {
	TaskRegenerator
	State() S
	SetState(S)
}

type Transition[S comparable, SM StateMachine[S], E any] struct {
	Sources     []S
	Destination S
	apply       func(SM, E) (TransitionOutput, error)
}
```

**The key insight:** Type-safe state transitions with
source validation. `Transition.Apply()` checks
`slices.Contains(t.Sources, sm.State())` before
allowing the state change. Invalid transitions return
`ErrInvalidTransition` rather than silently corrupting
state.

**From PR discussion (tdeebswihart):**
> "I wish we'd gone with the standard `fsm` name here.
> HSM keeps making me think of Hardware Security
> Modules."

**From PR discussion (bergundy, the author):**
> "I don't consider this a final approach but I do think
> it's a step in the right direction. We need to model
> more state machines on top of this to form a more
> solid API."

The author is explicit that the design is iterative: the
framework shipped "not final" and evolved through real usage.

### 2. CHASM (Component Architecture for State Machines)

**PR #6987 (Dec 2024–Jan 2025, 60 review comments):**

CHASM replaces the old ad-hoc component system. It's
a framework for building HSM-based components with:
- Declarative field definitions
- Mutable vs immutable contexts (type-enforced)
- Parent-child component relationships
- Task generation from transitions

**Key discussion points:**

**bergundy (author):** "I would put this in a top level
`chasm` directory. There's likely going to be some
chasm related code in other services."

**yycptt:** "Having the implementation in the top level
package instead of service/history feels weird."
(Responded with re-export strategy.)

**Sushisource:** "I think I prefer them separate,
because what happens if you mutate something and then
say 'not ready'? That would be some weird violation
that shouldn't be possible, and separate contexts
enforces that at the type level."

→ **Decision: Split MutableContext vs Context at the
type level** to make invalid operations unrepresentable.
This is the "making wrong things impossible" philosophy
in action.

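A minimal sketch of the mutable/read-only split follows. The interface names echo the decision above, but the methods (`Attempt`, `SetAttempt`) and the `execution` type are invented for illustration and are not CHASM's API.

```go
package main

import "fmt"

// Context is a hypothetical read-only view of component state.
type Context interface {
	Attempt() int
}

// MutableContext adds write access on top of Context.
type MutableContext interface {
	Context
	SetAttempt(int)
}

// execution is a hypothetical backing type that satisfies both interfaces.
type execution struct{ attempt int }

func (e *execution) Attempt() int     { return e.attempt }
func (e *execution) SetAttempt(n int) { e.attempt = n }

// isReady is a read path: it only receives Context, so a call to
// ctx.SetAttempt here would be rejected by the compiler.
func isReady(ctx Context) bool {
	return ctx.Attempt() < 3
}

// retry is a write path and therefore asks for MutableContext.
func retry(ctx MutableContext) {
	ctx.SetAttempt(ctx.Attempt() + 1)
}

func main() {
	e := &execution{}
	retry(e)
	fmt.Println(isReady(e), e.Attempt()) // true 1
}
```

The guarantee is only as strong as Go's interfaces allow: a read path could still type-assert its way back to the mutable interface, so the split documents and checks intent rather than making mutation physically impossible.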
### 3. Goroutine Lifecycle (goro.Handle)

**PR #1892 (Sep 2021, 15 review comments):**

Introduced to fix a **double-close panic** in the task
writer. The pattern is strikingly similar to
CockroachDB's Handle (introduced 2025), but predates it
by 3.5 years.

```go
type Handle struct {
	context context.Context
	cancel  context.CancelFunc
	done    chan struct{}
	err     atomic.Value
}
```

**From PR discussion (mmcshane, author):**
> "One thing you might not guess about Stop() is that
> it removes itself from the parent matching engine. I
> don't like this 'remove yourself' behavior because it
> puts the control logic in the wrong place (i.e. in
> the controlled object rather than the controller)."

**Reviewer (paulnpdev):**
> "If an expert questions what the code is doing, it
> deserves a comment."

This principle ("if a reviewer needs to ask, the code
needs a comment") is enforced through review culture.

### 4. Soft Assertions (softassert)

**PR #7411 (Mar 2025, 46 review comments):**

Production code that logs errors for invariant
violations but doesn't crash:

```go
softassert.That(logger, object.state == "ready",
	"object is not ready")
```

**From PR (stephanos):**
> "**Why not panic?** Maybe in the future. For now,
> we're happy with finding these failed assertions in
> functional tests."

This is Temporal's version of CockroachDB's
`errors.AssertionFailed`, a way to mark "this should
never happen" without crashing production. The key
difference: CockroachDB promotes these to errors that
may crash; Temporal logs them and continues.

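A log-and-continue assertion can be sketched as below. The `Logger` interface, the recording logger, and the message prefix are assumptions for the example; the boolean return value is also an assumption, chosen so callers can branch on the failed invariant.

```go
package main

import "fmt"

// Logger is a hypothetical minimal logging interface.
type Logger interface {
	Error(msg string)
}

// recordingLogger keeps messages in memory so the example can print them.
type recordingLogger struct{ msgs []string }

func (l *recordingLogger) Error(msg string) { l.msgs = append(l.msgs, msg) }

// That logs when condition is false and returns condition unchanged.
// It never panics, so the caller decides how to proceed.
func That(logger Logger, condition bool, msg string) bool {
	if !condition {
		logger.Error("failed assertion: " + msg)
	}
	return condition
}

func main() {
	logger := &recordingLogger{}
	state := "pending"
	if !That(logger, state == "ready", "object is not ready") {
		fmt.Println("continuing on a fallback path")
	}
	fmt.Println(logger.msgs) // [failed assertion: object is not ready]
}
```

A test harness can then fail the run whenever the log contains the assertion prefix, which is how failures surface in tests without crashing production.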
### 5. Dynamic Configuration (566 settings)

Temporal's most extreme pattern: **566
runtime-configurable settings** with type-safe resolution
and namespace-scoped overrides.

```go
var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
	"admin.enableListHistoryTasks",
	true,
	`Description here`,
)
```

Settings use generics for type safety and resolve with
precedence: task queue → namespace → global.

The `Collection` uses `weak.Pointer` for cache
invalidation (Go 1.24 feature) and `goro.Group` for
background polling, which shows how internal packages
compose.

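The precedence rule can be illustrated with a small generic resolver. Everything here (`Setting`, `Override`, `Get`, the sample values) is hypothetical and much simpler than the real `dynamicconfig` package; it only demonstrates "most specific match wins".

```go
package main

import "fmt"

// Override is a hypothetical scoped value. Empty Namespace and TaskQueue
// make it a global override; TaskQueue overrides are namespace-qualified.
type Override[T any] struct {
	Namespace string
	TaskQueue string
	Value     T
}

// Setting holds a default plus any number of scoped overrides.
type Setting[T any] struct {
	Default   T
	Overrides []Override[T]
}

// Get returns the most specific matching value:
// task queue (3) beats namespace (2) beats global (1) beats default (0).
func (s Setting[T]) Get(namespace, taskQueue string) T {
	best, bestRank := s.Default, 0
	for _, o := range s.Overrides {
		rank := 0
		switch {
		case o.TaskQueue != "":
			if o.TaskQueue == taskQueue && o.Namespace == namespace {
				rank = 3
			}
		case o.Namespace != "":
			if o.Namespace == namespace {
				rank = 2
			}
		default:
			rank = 1
		}
		if rank > bestRank {
			best, bestRank = o.Value, rank
		}
	}
	return best
}

func main() {
	maxPollers := Setting[int]{Default: 10, Overrides: []Override[int]{
		{Value: 20},
		{Namespace: "billing", Value: 50},
		{Namespace: "billing", TaskQueue: "invoices", Value: 200},
	}}
	fmt.Println(maxPollers.Get("billing", "invoices")) // 200
	fmt.Println(maxPollers.Get("billing", "refunds"))  // 50
	fmt.Println(maxPollers.Get("shipping", "labels"))  // 20
}
```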
### 6. Persistence Plugin System (init registration)

```go
func init() {
	sql.RegisterPlugin(PluginName, &plugin{
		driver: &driver.PQDriver{},
	})
}
```

Classic Go plugin pattern using `init()` + global
registry. Supports: PostgreSQL (lib/pq + pgx), MySQL,
SQLite, Cassandra. The init-time registration means
a backend is only available if something imports its
package (the `cmd/` packages import the plugins they
want).

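The registration pattern reduces to a package-level map that `init()` functions write into. The sketch below keeps everything in one file for brevity; the `Plugin` interface, the registry variable, and the duplicate-name panic are assumptions, and in a real layout each plugin would live in its own package that the binary pulls in with a blank import.

```go
package main

import (
	"fmt"
	"sort"
)

// Plugin is a hypothetical storage plugin interface.
type Plugin interface {
	Name() string
}

// registry is the global table that init() functions populate.
var registry = map[string]Plugin{}

// RegisterPlugin adds p under name. Panicking on duplicates is an
// assumption made here so misconfiguration fails at startup.
func RegisterPlugin(name string, p Plugin) {
	if _, dup := registry[name]; dup {
		panic("plugin registered twice: " + name)
	}
	registry[name] = p
}

// postgresPlugin stands in for a real driver-backed plugin.
type postgresPlugin struct{}

func (postgresPlugin) Name() string { return "postgres" }

// init runs before main, so the plugin is registered as soon as
// the package containing it is linked into the binary.
func init() {
	RegisterPlugin("postgres", postgresPlugin{})
}

func main() {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names) // [postgres]
}
```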
### 7. uber/fx Dependency Injection

Temporal uses uber/fx for service construction. Each
service has an `fx.go` that declares providers and
consumers:

```go
type GrpcServerOptionsParams struct {
	fx.In
	Logger                        log.Logger
	RPCFactory                    common.RPCFactory
	RetryableInterceptor          *interceptor.RetryableInterceptor
	NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
}
```

This is unusual for Go, where most projects avoid DI
frameworks. Temporal's justification is that the service
graph is genuinely complex (4 services × multiple
backends × configurable interceptors).

---

## Code Quality Markers

| Metric | Count |
|--------|-------|
| TODOs (non-test) | 738 |
| FIXMEs | 0 |
| HACKs | 5 |
| Mock files | 152 |
| Test files | 785 |
| Integration tests | 113 |
| Generic usages | 1,928 |

**TODO style:** `// TODO: description` (no owner tag).
Compare to CockroachDB's `// TODO(username):`. Temporal
doesn't track WHO is responsible for a TODO.

---

## Patterns Unique to Temporal

### ShutdownOnce (safe multi-close)

```go
func (c *ShutdownOnceImpl) Shutdown() {
	if atomic.CompareAndSwapInt32(
		&c.status,
		shutdownOnceStatusOpen,
		shutdownOnceStatusClosed,
	) {
		close(c.channel)
	}
}
```

CAS-based channel close that's safe to call multiple
times. Solves the "close of closed channel" panic that
plagues concurrent shutdown code.

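For a version that runs on its own, the sketch below fills in the pieces the excerpt leaves out. The constructor and the `IsShutdown` and `Channel` accessors are assumptions added for the example; only the compare-and-swap in `Shutdown` comes from the excerpt.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	statusOpen int32 = iota
	statusClosed
)

// ShutdownOnce closes its channel at most once, however many callers race.
type ShutdownOnce struct {
	status  int32
	channel chan struct{}
}

func NewShutdownOnce() *ShutdownOnce {
	return &ShutdownOnce{channel: make(chan struct{})}
}

// Shutdown closes the channel only for the caller that wins the CAS.
func (c *ShutdownOnce) Shutdown() {
	if atomic.CompareAndSwapInt32(&c.status, statusOpen, statusClosed) {
		close(c.channel)
	}
}

// IsShutdown reports whether Shutdown has been called.
func (c *ShutdownOnce) IsShutdown() bool {
	return atomic.LoadInt32(&c.status) == statusClosed
}

// Channel is closed once shutdown has happened.
func (c *ShutdownOnce) Channel() <-chan struct{} { return c.channel }

func main() {
	s := NewShutdownOnce()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Shutdown() // eight concurrent calls, one close
		}()
	}
	wg.Wait()
	<-s.Channel()
	fmt.Println(s.IsShutdown()) // true
}
```

`sync.Once` would give the same at-most-once guarantee; the CAS variant additionally exposes the status as a cheap atomic read.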
### Interleaved Weighted Round Robin Scheduler

Custom task scheduler that interleaves tasks from
different channels based on configurable weights.
Uses dynamic config for weight updates without restart.
This is their answer to fair scheduling across
namespaces with different SLAs.

### serviceerror Package (domain error types)

Instead of wrapping standard errors, Temporal defines
domain-specific error types that map directly to gRPC
status codes:
- `StickyWorkerUnavailable`
- `ShardOwnershipLost`
- `TaskAlreadyStarted`
- `CurrentBranchChanged`

Each is a struct implementing `error` with specific
fields needed for retry/recovery decisions.


---

## Cross-Ecosystem Observations

### Temporal vs CockroachDB

| Concern | Temporal | CockroachDB |
|---------|----------|-------------|
| Goroutine mgmt | goro.Handle (2021) | stop.Handle (2025) |
| Assertions | softassert (log) | AssertionFailed (error) |
| Config | 566 dynamic settings | Cluster settings |
| DI | uber/fx | Manual wiring |
| State machines | First-class HSM framework | Ad-hoc per component |
| Error types | Domain structs → gRPC | Sentinel + wrapping |
| TODO style | No owner | `TODO(username)` |

### Temporal vs Prometheus

| Concern | Temporal | Prometheus |
|---------|----------|------------|
| Plugin system | init() registration | init() registration |
| Logging | Custom log package | promslog (slog) |
| Interfaces | Heavy use | Minimal, targeted |
| Generics | 1,928 usages | Minimal |
| Global state | Avoided (fx wiring) | Accepted for hot paths |

### Key Differences from CockroachDB

1. **uber/fx is a conscious choice.** Temporal's service
   graph is complex enough to justify a DI framework.
   CockroachDB explicitly avoids frameworks.

2. **HSM is THE architecture.** Everything in Temporal
   is a state machine. CockroachDB has state machines
   but doesn't have a unified framework for them.

3. **CHASM splits mutable/immutable at the type level.**
   This is Temporal's strongest pattern: mutation in
   read paths is ruled out by the type system.

4. **goro.Handle predates CockroachDB's Handle by 3.5
   years.** Same problem (goroutine lifecycle), same
   solution (context + done channel + safe multi-stop),
   invented independently.

---

## Lessons for Code Review

1. **"If a reviewer needs to ask, the code needs a
   comment."** Temporal's review culture promotes
   comments that explain non-obvious decisions.

2. **Separate mutable from immutable contexts at the
   type level.** Don't rely on documentation to prevent
   mutation in read paths.

3. **Soft assertions > panics in distributed systems.**
   Log the invariant violation and keep serving; catch
   the failed assertion in functional tests.

4. **Domain error types beat generic wrapping** when
   errors drive retry/routing decisions. A struct with
   specific fields is more useful than `fmt.Errorf`.

5. **DI frameworks are justified when the service graph
   is genuinely complex.** 4 services × multiple
   backends × configurable interceptors × optional
   features = real complexity.

6. **HSM frameworks centralize correctness.** Moving
   state transition validation into a framework means
   every component gets it right by construction instead
   of by discipline.
