Rodin 6e930aed94 docs: add full architectural analysis
Repo shape, import hierarchy, HSM/CHASM architecture, PR discussions,
code quality metrics, cross-ecosystem comparisons
2026-04-30 11:45:53 -07:00


Temporal: Architectural Analysis

Repo: github.com/temporalio/temporal
Size: 181M, 2,645 Go files, 8,958 commits, 290 contributors
Category: Durable execution engine / workflow orchestrator


Repository Shape

temporal/
├── api/          # Generated protobuf service definitions
├── chasm/        # NEW: Component-based HSM Architecture
├── client/       # Internal service clients
├── cmd/          # Entry points (server, tools)
├── common/       # Shared infrastructure (massive)
│   ├── backoff/
│   ├── channel/
│   ├── clock/
│   ├── dynamicconfig/  # 566 runtime-configurable settings
│   ├── goro/           # Goroutine lifecycle management
│   ├── log/
│   ├── metrics/
│   ├── namespace/
│   ├── persistence/    # Multi-backend storage abstraction
│   ├── quotas/         # Rate limiting infrastructure
│   ├── softassert/     # Production assertions (log, don't crash)
│   └── tasks/          # Scheduler primitives (IWRR, FIFO, etc.)
├── components/   # Feature modules (callbacks, nexus, schedulers)
├── service/
│   ├── frontend/ # gRPC API handlers
│   ├── history/  # Workflow state machine execution
│   │   ├── hsm/  # Hierarchical State Machine framework
│   │   └── queues/ # Task queue processing
│   ├── matching/ # Task dispatch / worker routing
│   └── worker/   # System workflows
└── tests/        # Integration / functional tests

Import Hierarchy (most depended-upon)

  1. common — 7,257 imports (the foundation)
  2. api — 1,731 (protobuf contracts)
  3. service — 1,693 (business logic)
  4. chasm — 497 (rapidly growing new framework)
  5. tests — 125 (integration harness)

Key Architectural Patterns

1. Hierarchical State Machines (HSM)

PR #5494 (Mar 2024, 51 review comments):

The HSM framework is Temporal's core abstraction. Every workflow execution is a tree of state machines — the workflow itself, its activities, child workflows, timers, callbacks, nexus operations.

type StateMachine[S comparable] interface {
    TaskRegenerator
    State() S
    SetState(S)
}

type Transition[S comparable, SM StateMachine[S], E any] struct {
    Sources     []S
    Destination S
    apply       func(SM, E) (TransitionOutput, error)
}

The key insight: Type-safe state transitions with source validation. Transition.Apply() checks slices.Contains(t.Sources, sm.State()) before allowing the state change. Invalid transitions return ErrInvalidTransition rather than silently corrupting state.
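The source check described above can be sketched in a few lines. This is a simplified illustration, not the framework's code: the Timer machine, its State values, the fire transition, and the reduced signature of apply (no TransitionOutput) are all assumptions made to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// ErrInvalidTransition reuses the error name from the text; this definition is illustrative.
var ErrInvalidTransition = errors.New("invalid transition")

// State is a hypothetical state enum for a timer-like machine.
type State int

const (
	Scheduled State = iota
	Fired
)

// Timer is a hypothetical state machine that only tracks its current state.
type Timer struct{ state State }

func (t *Timer) State() State     { return t.state }
func (t *Timer) SetState(s State) { t.state = s }

// Transition is a simplified stand-in for the generic Transition type shown above.
type Transition[E any] struct {
	Sources     []State
	Destination State
	apply       func(*Timer, E) error
}

// Apply refuses the event unless the machine is currently in an allowed source state.
func (tr Transition[E]) Apply(t *Timer, event E) error {
	if !slices.Contains(tr.Sources, t.State()) {
		return ErrInvalidTransition
	}
	t.SetState(tr.Destination)
	return tr.apply(t, event)
}

// fire is only legal from Scheduled.
var fire = Transition[struct{}]{
	Sources:     []State{Scheduled},
	Destination: Fired,
	apply:       func(*Timer, struct{}) error { return nil },
}

func main() {
	t := &Timer{state: Scheduled}
	fmt.Println(fire.Apply(t, struct{}{})) // allowed: Scheduled -> Fired
	fmt.Println(fire.Apply(t, struct{}{})) // rejected: the timer is already Fired
}
```

Because the allowed sources live on the transition value, the legality check is written once and every component that declares a transition inherits it.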

From PR discussion (tdeebswihart):

"I wish we'd gone with the standard fsm name here. HSM keeps making me think of Hardware Security Modules."

From PR discussion (bergundy, the author):

"I don't consider this a final approach but I do think it's a step in the right direction. We need to model more state machines on top of this to form a more solid API."

This is explicit about being iterative. The framework shipped "not final" and evolved through real usage.

2. CHASM (Component Architecture for State Machines)

PR #6987 (Dec 2024 to Jan 2025, 60 review comments):

CHASM replaces the old ad-hoc component system. It's a framework for building HSM-based components with:

  • Declarative field definitions
  • Mutable vs immutable contexts (type-enforced)
  • Parent-child component relationships
  • Task generation from transitions

Key discussion points:

bergundy (author): "I would put this in a top level chasm directory. There's likely going to be some chasm related code in other services."

yycptt: "Having the implementation in the top level package instead of service/history feels weird." (Responded with re-export strategy.)

Sushisource: "I think I prefer them separate, because what happens if you mutate something and then say 'not ready'? That would be some weird violation that shouldn't be possible, and separate contexts enforces that at the type level."

Decision: Split MutableContext vs Context at the type level to make invalid operations unrepresentable. This is the "making wrong things impossible" philosophy in action.

3. Goroutine Lifecycle (goro.Handle)

PR #1892 (Sep 2021, 15 review comments):

Introduced to fix a double-close panic in the task writer. The pattern is strikingly similar to CockroachDB's stop.Handle (introduced 2025), but predates it by roughly 3.5 years.

type Handle struct {
    context context.Context
    cancel  context.CancelFunc
    done    chan struct{}
    err     atomic.Value
}
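To show how a struct of this shape avoids double-close panics, here is a simplified sketch. It is not the goro package: the Go constructor, the Cancel and Wait methods, and the plain error field (in place of atomic.Value) are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Handle is a hypothetical, reduced goroutine handle.
type Handle struct {
	cancel context.CancelFunc
	done   chan struct{}
	err    error // written once by the goroutine before done is closed
}

// Go starts f in a new goroutine and returns a handle to it.
func Go(parent context.Context, f func(context.Context) error) *Handle {
	ctx, cancel := context.WithCancel(parent)
	h := &Handle{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(h.done) // only the goroutine itself ever closes done
		h.err = f(ctx)
	}()
	return h
}

// Cancel asks the goroutine to stop. Context cancel functions are
// idempotent, so calling Cancel repeatedly is safe.
func (h *Handle) Cancel() { h.cancel() }

// Wait blocks until the goroutine has exited, then returns its error.
func (h *Handle) Wait() error {
	<-h.done
	return h.err
}

// cancelTwiceAndWait demonstrates that stopping twice does not panic.
func cancelTwiceAndWait() error {
	h := Go(context.Background(), func(ctx context.Context) error {
		<-ctx.Done() // stand-in for work that runs until canceled
		return ctx.Err()
	})
	h.Cancel()
	h.Cancel() // second call is a no-op
	return h.Wait()
}

func main() {
	fmt.Println(cancelTwiceAndWait())
}
```

The done channel has exactly one closer, the goroutine's own deferred close, so callers never close anything themselves and can stop as many times as they like.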

From PR discussion (mmcshane, author):

"One thing you might not guess about Stop() is that it removes itself from the parent matching engine. I don't like this 'remove yourself' behavior because it puts the control logic in the wrong place (i.e. in the controlled object rather than the controller)."

Reviewer (paulnpdev):

"If an expert questions what the code is doing, it deserves a comment."

This principle — "if a reviewer needs to ask, the code needs a comment" — is enforced through review culture.

4. Soft Assertions (softassert)

PR #7411 (Mar 2025, 46 review comments):

Production code that logs errors for invariant violations but doesn't crash:

softassert.That(logger, object.state == "ready",
    "object is not ready")

From PR (stephanos):

"Why not panic? Maybe in the future. For now, we're happy with finding these failed assertions in functional tests."

This is Temporal's version of CockroachDB's errors.AssertionFailed — a way to mark "this should never happen" without crashing production. The key difference: CockroachDB promotes these to errors that may crash; Temporal logs them and continues.
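The log-and-continue behavior can be sketched as follows. This is an assumed shape, not the softassert package itself: the Logger interface, the recordingLogger type, the boolean return value, and the process function are hypothetical.

```go
package main

import "fmt"

// Logger is a hypothetical minimal logging interface.
type Logger interface {
	Error(msg string)
}

// recordingLogger keeps messages in memory so the example is observable.
type recordingLogger struct{ errors []string }

func (l *recordingLogger) Error(msg string) { l.errors = append(l.errors, msg) }

// That logs an error when the condition is false and never panics.
// It returns the condition so callers can branch on the result.
func That(logger Logger, condition bool, msg string) bool {
	if !condition {
		logger.Error("failed assertion: " + msg)
	}
	return condition
}

type object struct{ state string }

// process skips its work when the invariant does not hold, instead of crashing.
func process(logger Logger, obj object) string {
	if !That(logger, obj.state == "ready", "object is not ready") {
		return "skipped"
	}
	return "processed"
}

func main() {
	logger := &recordingLogger{}
	fmt.Println(process(logger, object{state: "pending"}))
	fmt.Println(logger.errors)
}
```

A test harness can then fail the run whenever the logger recorded a failed assertion, which matches the "find them in functional tests" approach quoted above.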

5. Dynamic Configuration (566 settings)

Temporal's most extreme pattern: 566 runtime-configurable settings with type-safe resolution and namespace-scoped overrides.

var AdminEnableListHistoryTasks = NewGlobalBoolSetting(
    "admin.enableListHistoryTasks",
    true,
    `Description here`,
)

Settings use generics for type safety and resolve with precedence: task queue → namespace → global.
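The precedence rule can be illustrated with a small generic type. This is a sketch only: the Setting struct, its fields, the "namespace/taskQueue" key format, and the Get method are hypothetical and do not mirror the dynamicconfig package.

```go
package main

import "fmt"

// Setting is a hypothetical typed setting with overrides at three scopes.
type Setting[T any] struct {
	Default     T
	global      *T           // nil means no global override
	byNamespace map[string]T // keyed by namespace
	byTaskQueue map[string]T // keyed by "namespace/taskQueue"
}

// Get resolves the most specific value first:
// task queue, then namespace, then global, then the compiled-in default.
func (s Setting[T]) Get(namespace, taskQueue string) T {
	if v, ok := s.byTaskQueue[namespace+"/"+taskQueue]; ok {
		return v
	}
	if v, ok := s.byNamespace[namespace]; ok {
		return v
	}
	if s.global != nil {
		return *s.global
	}
	return s.Default
}

func main() {
	global := 100
	rps := Setting[int]{
		Default:     50,
		global:      &global,
		byNamespace: map[string]int{"billing": 200},
		byTaskQueue: map[string]int{"billing/invoices": 500},
	}
	fmt.Println(rps.Get("billing", "invoices")) // task queue override
	fmt.Println(rps.Get("billing", "refunds"))  // namespace override
	fmt.Println(rps.Get("search", "index"))     // global override
}
```

The type parameter is what gives callers a typed value back without a cast, which is the type-safety benefit the text refers to.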

The Collection uses weak.Pointer for cache invalidation (Go 1.24 feature) and goro.Group for background polling — showing how internal packages compose.

6. Persistence Plugin System (init registration)

func init() {
    sql.RegisterPlugin(PluginName, &plugin{
        driver: &driver.PQDriver{},
    })
}

Classic Go plugin pattern using init() + global registry. Supports: PostgreSQL (lib/pq + pgx), MySQL, SQLite, Cassandra. Because registration happens at init time, a plugin exists only if its package is imported somewhere in the binary; the cmd/ packages import the plugins they want.
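A single-file sketch of the registry side of this pattern follows. The Plugin interface, the registry map, the Lookup function, and the duplicate-name panic are assumptions for illustration; in a real layout the init function would live in the plugin's own package and be pulled in by a blank import.

```go
package main

import "fmt"

// Plugin is a hypothetical minimal plugin interface.
type Plugin interface {
	Name() string
}

// registry is the package-level table that init functions write into.
var registry = map[string]Plugin{}

// RegisterPlugin adds a plugin under a unique name.
// Panicking on duplicates surfaces wiring mistakes at startup.
func RegisterPlugin(name string, p Plugin) {
	if _, dup := registry[name]; dup {
		panic("plugin already registered: " + name)
	}
	registry[name] = p
}

// Lookup returns the plugin registered under name, if any.
func Lookup(name string) (Plugin, error) {
	p, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown plugin %q (is its package imported?)", name)
	}
	return p, nil
}

// postgres stands in for a plugin that would normally live in its own package.
type postgres struct{}

func (postgres) Name() string { return "postgres" }

// init runs before main, so the plugin is registered by the time Lookup is called.
func init() {
	RegisterPlugin("postgres", postgres{})
}

func main() {
	p, err := Lookup("postgres")
	fmt.Println(p.Name(), err)

	_, err = Lookup("mysql") // never registered in this binary
	fmt.Println(err)
}
```

The failed lookup shows the trade-off: a missing import is not a compile error, it only shows up when the plugin name is resolved at runtime.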

7. uber/fx Dependency Injection

Temporal uses uber/fx for service construction. Each service has an fx.go that declares providers and consumers:

type GrpcServerOptionsParams struct {
    fx.In
    Logger                        log.Logger
    RPCFactory                    common.RPCFactory
    RetryableInterceptor          *interceptor.RetryableInterceptor
    NamespaceRateLimitInterceptor interceptor.NamespaceRateLimitInterceptor `optional:"true"`
}

This is unusual for Go — most projects avoid DI frameworks. Temporal justifies it because the service graph is genuinely complex (4 services × multiple backends × configurable interceptors).


Code Quality Markers

Metric              Count
TODOs (non-test)    738
FIXMEs              0
HACKs               5
Mock files          152
Test files          785
Integration tests   113
Generic usages      1,928

TODO style: // TODO: description (no owner tag). Compare to CockroachDB's // TODO(username): — Temporal doesn't track WHO is responsible for a TODO.


Patterns Unique to Temporal

ShutdownOnce (safe multi-close)

func (c *ShutdownOnceImpl) Shutdown() {
    if atomic.CompareAndSwapInt32(
        &c.status,
        shutdownOnceStatusOpen,
        shutdownOnceStatusClosed,
    ) {
        close(c.channel)
    }
}

CAS-based channel close that's safe to call multiple times. Solves the "close of closed channel" panic that plagues concurrent shutdown code.
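For a version that runs on its own, the snippet can be completed as below. The type name, constants, constructor, and IsShutdown method are assumptions added for the example; only the compare-and-swap guard around close comes from the excerpt above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	statusOpen int32 = iota
	statusClosed
)

// shutdownOnce is a hypothetical self-contained variant of the type above.
type shutdownOnce struct {
	status  int32
	channel chan struct{}
}

func newShutdownOnce() *shutdownOnce {
	return &shutdownOnce{channel: make(chan struct{})}
}

// Shutdown closes the channel exactly once: only the caller that wins
// the compare-and-swap reaches close.
func (c *shutdownOnce) Shutdown() {
	if atomic.CompareAndSwapInt32(&c.status, statusOpen, statusClosed) {
		close(c.channel)
	}
}

// IsShutdown reports whether Shutdown has been called.
func (c *shutdownOnce) IsShutdown() bool {
	return atomic.LoadInt32(&c.status) == statusClosed
}

func main() {
	s := newShutdownOnce()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Shutdown() // racing callers are safe
		}()
	}
	wg.Wait()
	<-s.channel // returns immediately because the channel is closed
	fmt.Println("shut down:", s.IsShutdown())
}
```

Eight goroutines race to shut down and none of them panics, which is the property the pattern exists to provide.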

Interleaved Weighted Round Robin Scheduler

Custom task scheduler that interleaves tasks from different channels based on configurable weights. Uses dynamic config for weight updates without restart. This is their answer to fair scheduling across namespaces with different SLAs.
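The interleaving can be sketched with the textbook form of the algorithm. This is an illustration, not Temporal's scheduler: the schedule function and the example weights are hypothetical, and real channels, task dispatch, and dynamic weight updates are left out.

```go
package main

import "fmt"

// schedule returns the order in which channels are visited during one
// interleaved weighted round-robin cycle. weights[i] is the weight of
// channel i. In each round, every channel whose weight exceeds the round
// number gets one slot, so heavy channels are spread across the cycle
// instead of being served in one burst.
func schedule(weights []int) []int {
	maxWeight := 0
	for _, w := range weights {
		if w > maxWeight {
			maxWeight = w
		}
	}

	var order []int
	for round := maxWeight - 1; round >= 0; round-- {
		for ch, w := range weights {
			if w > round {
				order = append(order, ch)
			}
		}
	}
	return order
}

func main() {
	// Channel 0 has weight 3, channel 1 weight 1, channel 2 weight 2.
	fmt.Println(schedule([]int{3, 1, 2})) // [0 0 2 0 1 2]
}
```

Each channel appears as many times as its weight, but a plain weighted round robin would emit 0 0 0 1 2 2, making the low-weight channels wait behind a burst from the heaviest one.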

serviceerror Package (domain error types)

Instead of wrapping standard errors, Temporal defines domain-specific error types that map directly to gRPC status codes:

  • StickyWorkerUnavailable
  • ShardOwnershipLost
  • TaskAlreadyStarted
  • CurrentBranchChanged

Each is a struct implementing error with specific fields needed for retry/recovery decisions.
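A sketch of how such a type drives a routing decision: the field names on ShardOwnershipLost and the redirectTarget helper are assumptions for illustration, and the gRPC status mapping is omitted to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
)

// ShardOwnershipLost is modeled on the error named above.
// The fields are hypothetical.
type ShardOwnershipLost struct {
	OwnerHost   string
	CurrentHost string
}

func (e *ShardOwnershipLost) Error() string {
	return fmt.Sprintf("shard is owned by %s, not %s", e.OwnerHost, e.CurrentHost)
}

// redirectTarget inspects the error chain and, if the shard moved,
// reports which host the caller should retry against.
func redirectTarget(err error) (string, bool) {
	var lost *ShardOwnershipLost
	if errors.As(err, &lost) {
		return lost.OwnerHost, true
	}
	return "", false
}

func main() {
	err := fmt.Errorf("update workflow: %w", &ShardOwnershipLost{
		OwnerHost:   "history-2",
		CurrentHost: "history-1",
	})
	host, ok := redirectTarget(err)
	fmt.Println(host, ok)
}
```

The caller gets the new owner from a typed field instead of parsing an error string, which is what makes the error usable for retry logic.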


Cross-Ecosystem Observations

Temporal vs CockroachDB

Concern          Temporal                    CockroachDB
Goroutine mgmt   goro.Handle (2021)          stop.Handle (2025)
Assertions       softassert (log)            AssertionFailed (error)
Config           566 dynamic settings        Cluster settings
DI               uber/fx                     Manual wiring
State machines   First-class HSM framework   Ad-hoc per component
Error types      Domain structs → gRPC       Sentinel + wrapping
TODO style       No owner                    TODO(username)

Temporal vs Prometheus

Concern         Temporal              Prometheus
Plugin system   init() registration   init() registration
Logging         Custom log package    promslog (slog)
Interfaces      Heavy use             Minimal, targeted
Generics        1,928 usages          Minimal
Global state    Avoided (fx wiring)   Accepted for hot paths

Key Differences from CockroachDB

  1. uber/fx is a conscious choice — Temporal's service graph is complex enough to justify a DI framework. CockroachDB explicitly avoids frameworks.

  2. HSM is THE architecture — Everything in Temporal is a state machine. CockroachDB has state machines but doesn't have a unified framework for them.

  3. CHASM splits mutable/immutable at the type level — This is Temporal's strongest pattern. Making mutation impossible in read paths via the type system.

  4. goro.Handle predates CockroachDB's Handle by 3.5 years — Same problem (goroutine lifecycle), same solution (context + done channel + safe multi-stop), invented independently.


Lessons for Code Review

  1. "If a reviewer needs to ask, the code needs a comment" — Temporal's review culture promotes comments that explain non-obvious decisions.

  2. Separate mutable from immutable contexts at the type level — Don't rely on documentation to prevent mutation in read paths.

  3. Soft assertions > panics in distributed systems — Log the invariant violation, continue serving. Crash later in tests.

  4. Domain error types beat generic wrapping when errors drive retry/routing decisions. A struct with specific fields is more useful than fmt.Errorf.

  5. DI frameworks are justified when the service graph is genuinely complex — 4 services × multiple backends × configurable interceptors × optional features = real complexity.

  6. HSM frameworks centralize correctness — Moving state transition validation into a framework means every component gets it right by construction instead of by discipline.