Rodin 40f024b477 fix: update drifted source citations to match current upstream

Verified all 17 file:line citations against elixir-lang/elixir HEAD.
Fixed 10 citations where line numbers had shifted due to upstream changes:

- patterns/genserver.md: agent.ex:246 → 279 (start_link spec)
- patterns/process-design.md: task.ex:282 → 327 (child_spec)
- smells/anti-patterns.md: registry_test.exs:28 → 29, gen_server_test.exs:166 → 164,
  test_helper.exs:98 → 99
- smells/common-mistakes.md: registry_test.exs:28 → 29, callbacks.ex:423 → 433,
  task_test.exs:297,305,315,330 → 300,308,316,327,
  supervisor_test.exs:278 → 289, callbacks.ex:277 → 520

2026-05-06 16:33:21 -07:00

Process Design Patterns — From the Elixir Source

Analysis of lib/elixir/lib/supervisor.ex, lib/elixir/lib/dynamic_supervisor.ex, lib/elixir/lib/task.ex, lib/elixir/lib/task/supervisor.ex, lib/elixir/lib/process.ex, and lib/elixir/lib/registry.ex.

Pattern 1: Static vs Dynamic Supervision — Choose the Right Tool
Pattern 2: PartitionSupervisor for Scalability
Pattern 3: Supervision Strategies — Choosing the Right Restart Behavior
Pattern 4: Restart Intensity (max_restarts / max_seconds)
Pattern 5: Restart Values — :permanent vs :transient vs :temporary
Pattern 6: Automatic Shutdown for Pipeline Supervisors
Pattern 7: Task.async/await for Concurrent Value Computation
Pattern 8: Task.Supervisor.async_nolink for Fault-Tolerant Task Execution
Pattern 9: Task Supervisor as DynamicSupervisor Specialization
Pattern 10: Registry for Dynamic Process Naming and PubSub
Pattern 11: Shutdown Semantics — Graceful Termination
Pattern 12: DynamicSupervisor Internal State — Struct with Restart Tracking
Pattern 13: Restart Logic with Exponential Backoff via :try_again
Pattern 14: $ancestors and $callers — Process Lineage Tracking
Pattern 15: GenServer.reply/2 for Deferred Responses
Pattern 16: Process.alias for Safe Request/Response
Pattern 17: Registry Partitioning Strategies
Pattern 18: init/1 Return Values — The Full Spectrum

Pattern 1: Static vs Dynamic Supervision — Choose the Right Tool

Source: lib/elixir/lib/supervisor.ex#L1 vs lib/elixir/lib/dynamic_supervisor.ex#L1

What it does: Elixir provides two distinct supervisor types:

Supervisor: for mostly static children that are listed when the supervisor starts and are started in the given order
DynamicSupervisor: for children started on demand at runtime, with no ordering guarantees

Why: Static supervisors guarantee startup order (critical for dependencies like "DB pool must start before web server"). Dynamic supervisors optimize for scale — they can hold millions of children using efficient data structures and shut down concurrently.

Anti-pattern: Using a Supervisor when children are created dynamically (e.g., one process per WebSocket connection). You'll hit performance issues and ordering semantics you don't need. Conversely, using DynamicSupervisor for fixed infrastructure (DB pool, PubSub) loses startup order guarantees.

Code example from source (dynamic_supervisor.ex:1-15):

# DynamicSupervisor docs explain the distinction:
# "The Supervisor module was designed to handle mostly static children
#  that are started in the given order when the supervisor starts. A
#  DynamicSupervisor starts with no children. Instead, children are
#  started on demand via start_child/2 and there is no ordering between
#  children."
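The difference can be observed directly. The following is a minimal sketch, assuming plain Agent processes as stand-in children (the ids :first and :second are illustrative only): the static supervisor has its children from the moment it starts, while the dynamic one starts empty and grows on demand.

```elixir
# Static: the child list is fixed when the supervisor starts.
children = [
  Supervisor.child_spec({Agent, fn -> :first end}, id: :first),
  Supervisor.child_spec({Agent, fn -> :second end}, id: :second)
]

{:ok, static} = Supervisor.start_link(children, strategy: :one_for_one)
static_count = Supervisor.count_children(static).active

# Dynamic: starts with no children; they are added on demand.
{:ok, dynamic} = DynamicSupervisor.start_link(strategy: :one_for_one)
before_count = DynamicSupervisor.count_children(dynamic).active
{:ok, _pid} = DynamicSupervisor.start_child(dynamic, {Agent, fn -> 0 end})
after_count = DynamicSupervisor.count_children(dynamic).active

IO.inspect({static_count, before_count, after_count})
# → {2, 0, 1}
```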

When to Use

Triggers:

You know at startup exactly which children need to run (DB pool, PubSub, caches)
Children have ordering dependencies (pool must start before consumers)
You're building application-level infrastructure in your supervision tree

Example — before:

# Using DynamicSupervisor for fixed infrastructure — wrong tool
defmodule MyApp.Application do
  def start(_type, _args) do
    children = [{DynamicSupervisor, name: MyApp.InfraSupervisor}]
    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

# Manually starting fixed children after supervisor boots
DynamicSupervisor.start_child(MyApp.InfraSupervisor, MyApp.Repo)
DynamicSupervisor.start_child(MyApp.InfraSupervisor, MyApp.PubSub)
DynamicSupervisor.start_child(MyApp.InfraSupervisor, MyApp.Endpoint)
# No startup ordering guarantee!

Example — after:

defmodule MyApp.Application do
  def start(_type, _args) do
    children = [
      MyApp.Repo,        # DB pool starts first
      MyApp.PubSub,      # PubSub starts after DB is ready
      MyApp.Endpoint     # Web server starts last
    ]
    Supervisor.start_link(children, strategy: :rest_for_one)
  end
end

When NOT to Use

Don't use this when:

Children are created on-demand (per-connection, per-request, per-user)
The number of children is unbounded or varies significantly at runtime
You don't need ordering guarantees between children

Over-application example:

# Static supervisor for per-WebSocket connections — will be clunky
defmodule MyApp.ConnectionSupervisor do
  use Supervisor

  def init(_) do
    # Can't list the children here: they aren't known when the supervisor starts!
    Supervisor.init([], strategy: :one_for_one)
  end

  # Awkward: using Supervisor for dynamic children
  def add_connection(socket) do
    spec = {ConnectionHandler, socket}
    Supervisor.start_child(__MODULE__, spec)
  end
end

Better alternative:

defmodule MyApp.ConnectionSupervisor do
  use DynamicSupervisor

  def start_link(_), do: DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  def init(_), do: DynamicSupervisor.init(strategy: :one_for_one)

  def add_connection(socket) do
    DynamicSupervisor.start_child(__MODULE__, {ConnectionHandler, socket})
  end
end

Why: Static supervisors excel at ordered, fixed infrastructure. DynamicSupervisor excels at scale with runtime-determined children. Pick based on whether the children are known when the supervisor starts.

Pattern 2: PartitionSupervisor for Scalability

Source: lib/elixir/lib/dynamic_supervisor.ex#L60 and lib/elixir/lib/task/supervisor.ex#L35

What it does: Both DynamicSupervisor and Task.Supervisor document the same scalability pattern: when a single supervisor becomes a bottleneck, wrap it in a PartitionSupervisor which starts N instances (one per core by default) and routes via a key.

Why: A supervisor is a single process. Under heavy start_child load, it serializes all spawn operations. PartitionSupervisor distributes the load across multiple supervisor processes, using self() as the routing key to ensure each caller consistently hits the same partition.

Anti-pattern: Creating your own load-balancing logic for supervisors, or just accepting the bottleneck. The standard library provides this pattern explicitly.

Code example from source (dynamic_supervisor.ex):

# Instead of a single DynamicSupervisor:
children = [
  {PartitionSupervisor,
   child_spec: DynamicSupervisor,
   name: MyApp.DynamicSupervisors}
]

# Start children through the partition supervisor:
DynamicSupervisor.start_child(
  {:via, PartitionSupervisor, {MyApp.DynamicSupervisors, self()}},
  {Counter, 0}
)
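To picture the routing step, the sketch below maps a key onto one of N partitions by hashing. RoutingSketch and partition_for/2 are hypothetical names used only for illustration; the actual routing inside PartitionSupervisor is an implementation detail and may differ.

```elixir
defmodule RoutingSketch do
  # Hypothetical helper: maps any term to a partition index in 0..partitions-1.
  # :erlang.phash2/2 is deterministic, so a given key always yields the same index.
  def partition_for(key, partitions) when is_integer(partitions) and partitions > 0 do
    :erlang.phash2(key, partitions)
  end
end

partitions = System.schedulers_online()
index = RoutingSketch.partition_for(self(), partitions)

# The same caller is routed to the same partition on every call.
^index = RoutingSketch.partition_for(self(), partitions)
IO.puts("caller #{inspect(self())} -> partition #{index} of #{partitions}")
```

Because the key is the caller's pid, load spreads across partitions only when many different processes start children; a single hot caller still lands on one partition.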

When to Use

Triggers:

A single DynamicSupervisor or Task.Supervisor is a bottleneck under high start_child load
You're seeing latency spikes when spawning tasks/children under concurrency
Profiling shows the supervisor process has a large message queue

Example — before:

# Single supervisor — serializes all spawn operations
defmodule MyApp.TaskRunner do
  def run_async(fun) do
    Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fun)
  end
end

# Under 10k concurrent requests, this single process becomes a bottleneck

Example — after:

# Partitioned — distributes load across N supervisor processes
defmodule MyApp.Application do
  def start(_type, _args) do
    children = [
      {PartitionSupervisor,
       child_spec: Task.Supervisor,
       name: MyApp.TaskSupervisors}
    ]
    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

defmodule MyApp.TaskRunner do
  def run_async(fun) do
    Task.Supervisor.async_nolink(
      {:via, PartitionSupervisor, {MyApp.TaskSupervisors, self()}},
      fun
    )
  end
end

When NOT to Use

Don't use this when:

Spawn rates are low enough that a single supervisor keeps up (check its message queue length before partitioning)
You need ordering guarantees between children (partitioning breaks ordering)
The supervisor has few children total (partitioning adds overhead for no gain)

Over-application example:

# Partitioning a supervisor that starts 5 children at boot — pointless
{PartitionSupervisor,
 child_spec: DynamicSupervisor,
 name: MyApp.ConfigSupervisors,
 partitions: System.schedulers_online()}
# One partition per scheduler (16 on a 16-core machine) for 5 children: extra processes, zero benefit

Better alternative:

# Just use a plain DynamicSupervisor
{DynamicSupervisor, name: MyApp.ConfigSupervisor}

Why: PartitionSupervisor exists for high-throughput spawn scenarios. If you're not hitting supervisor mailbox limits, the extra processes and routing logic add complexity without benefit.

Pattern 3: Supervision Strategies — Choosing the Right Restart Behavior

Source: lib/elixir/lib/supervisor.ex#L315 (Strategies section)

What it does: Three strategies model three dependency patterns:

:one_for_one — independent children (crash of A doesn't affect B)
:one_for_all — tightly coupled children (if one fails, all state is inconsistent)
:rest_for_one — sequential dependencies (children started after the crashed one depend on it)

Why: These map directly to runtime dependency graphs. A connection pool and its consumers are :rest_for_one — consumers can't work without the pool. Multiple independent request handlers are :one_for_one. Workers sharing a cache are :one_for_all — stale cache state after a crash could cause inconsistency.

Anti-pattern: Defaulting to :one_for_one everywhere without thinking about dependencies. If process B depends on process A's state and A crashes, B will be working with stale assumptions.

Code example from source (supervisor.ex docs):

# Independent workers — one crash doesn't affect others
Supervisor.start_link(children, strategy: :one_for_one)

# Tightly coupled — all must restart together for consistency
Supervisor.start_link(children, strategy: :one_for_all)

# Sequential dependency — later children depend on earlier ones
Supervisor.start_link(children, strategy: :rest_for_one)
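The effect of each strategy can be modeled as a pure function. This is a simplified sketch, not the supervisor's implementation: StrategySketch and restarted/3 are hypothetical names, and the model only answers "which children are restarted when one crashes?" for a start-ordered list of child ids.

```elixir
defmodule StrategySketch do
  # `children` is the start-ordered list of child ids,
  # `crashed` is the id of the child that terminated.
  def restarted(:one_for_one, _children, crashed), do: [crashed]
  def restarted(:one_for_all, children, _crashed), do: children

  def restarted(:rest_for_one, children, crashed) do
    # Keep the crashed child and everything started after it.
    Enum.drop_while(children, &(&1 != crashed))
  end
end

children = [:source, :transformer, :sink]

IO.inspect(StrategySketch.restarted(:one_for_one, children, :transformer))
# → [:transformer]
IO.inspect(StrategySketch.restarted(:one_for_all, children, :transformer))
# → [:source, :transformer, :sink]
IO.inspect(StrategySketch.restarted(:rest_for_one, children, :transformer))
# → [:transformer, :sink]
```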

When to Use

Triggers:

You're deciding how a supervisor should react when one child fails
Children share state or resources that become inconsistent if one crashes
You have a pipeline: A feeds B feeds C

Example — before:

# Defaulting to :one_for_one without thinking about dependencies
defmodule MyApp.DataPipeline do
  use Supervisor

  def init(_) do
    children = [
      MyApp.DataSource,    # Produces data
      MyApp.Transformer,   # Transforms data (holds reference to DataSource)
      MyApp.Sink           # Writes transformed data
    ]
    # If DataSource crashes, Transformer has a stale reference!
    Supervisor.init(children, strategy: :one_for_one)
  end
end

Example — after:

defmodule MyApp.DataPipeline do
  use Supervisor

  def init(_) do
    children = [
      MyApp.DataSource,    # If this crashes...
      MyApp.Transformer,   # ...these must restart too (stale refs)
      MyApp.Sink
    ]
    # rest_for_one: crash of DataSource restarts Transformer and Sink
    Supervisor.init(children, strategy: :rest_for_one)
  end
end

When NOT to Use

Don't use this when:

Children are truly independent (HTTP request handlers, job workers)
You're using :one_for_all because you're unsure — analyze dependencies first
The restart strategy masks a design problem (maybe use separate supervision subtrees)

Over-application example:

# one_for_all when children are actually independent
defmodule MyApp.Workers do
  use Supervisor

  def init(_) do
    children = [
      {MyApp.EmailWorker, []},
      {MyApp.SMSWorker, []},
      {MyApp.PushWorker, []}
    ]
    # If email crashes, why restart SMS and Push? They're independent!
    Supervisor.init(children, strategy: :one_for_all)
  end
end

Better alternative:

defmodule MyApp.Workers do
  use Supervisor

  def init(_) do
    children = [
      {MyApp.EmailWorker, []},
      {MyApp.SMSWorker, []},
      {MyApp.PushWorker, []}
    ]
    # Independent workers — one crash doesn't affect others
    Supervisor.init(children, strategy: :one_for_one)
  end
end

Why: Restart strategies model dependency graphs. Using :one_for_all for independent workers causes unnecessary restarts, losing in-progress work for no benefit.

Pattern 4: Restart Intensity (`max_restarts` / `max_seconds`)

Source: lib/elixir/lib/supervisor.ex#L309, lib/elixir/lib/dynamic_supervisor.ex#L730 (implementation)

What it does: Supervisors track restart frequency. If more than max_restarts restarts occur within max_seconds (counted across all of the supervisor's children), the supervisor itself shuts down, escalating the failure to its parent. Defaults: 3 restarts in 5 seconds.

Why: Prevents infinite restart loops that waste CPU and mask bugs. If a child keeps crashing within seconds, it's a systemic problem that the current supervisor level can't fix. Escalating to the parent allows a higher-level strategy to respond (perhaps restarting the entire subsystem with fresh state).

Anti-pattern: Setting max_restarts extremely high to "prevent crashes." This hides bugs and wastes resources. Let supervisors escalate — that's the point of the hierarchy.

Code example from source (dynamic_supervisor.ex internal logic):

defp add_restart(state) do
  %{max_seconds: max_seconds, max_restarts: max_restarts, restarts: restarts} = state

  now = :erlang.monotonic_time(1)
  restarts = add_restart([now | restarts], now, max_seconds)
  state = %{state | restarts: restarts}

  if length(restarts) <= max_restarts do
    {:ok, state}
  else
    {:shutdown, state}
  end
end

defp add_restart(restarts, now, period) do
  for then <- restarts, now <= then + period, do: then
end
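The sliding window is easier to follow with concrete timestamps. The sketch below replays the same window logic with the clock passed in as an argument; IntensitySketch is a hypothetical module, and the plain map stands in for the real supervisor state.

```elixir
defmodule IntensitySketch do
  # Hypothetical stand-in for the supervisor state: a map with the same
  # three keys that the excerpt above destructures.
  def new(max_restarts, max_seconds) do
    %{max_restarts: max_restarts, max_seconds: max_seconds, restarts: []}
  end

  # Same window logic as the excerpt, but `now` (in seconds) is a parameter
  # so the behaviour can be replayed deterministically.
  def add_restart(state, now) do
    %{max_seconds: max_seconds, max_restarts: max_restarts, restarts: restarts} = state
    restarts = for then <- [now | restarts], now <= then + max_seconds, do: then
    state = %{state | restarts: restarts}
    if length(restarts) <= max_restarts, do: {:ok, state}, else: {:shutdown, state}
  end
end

state = IntensitySketch.new(3, 5)
{:ok, state} = IntensitySketch.add_restart(state, 0)
{:ok, state} = IntensitySketch.add_restart(state, 1)
{:ok, state} = IntensitySketch.add_restart(state, 2)

# A fourth restart inside the 5-second window exceeds max_restarts.
{:shutdown, _} = IntensitySketch.add_restart(state, 3)

# Had the fourth restart come at t=10 instead, the earlier entries would
# have aged out of the window and the supervisor would keep running.
{:ok, _} = IntensitySketch.add_restart(state, 10)
```

Note that the limit is exceeded on the fourth restart, not the third: the check is `length(restarts) <= max_restarts`.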

When to Use

Triggers:

You want to prevent infinite restart loops from burning CPU
You're tuning a supervisor for a child that occasionally crashes under load
You need the supervisor to escalate when a systemic problem prevents recovery

Example — before:

# Default: 3 restarts in 5 seconds — might be too aggressive for flaky networks
defmodule MyApp.ExternalAPISupervisor do
  use Supervisor

  def init(_) do
    children = [{MyApp.APIClient, []}]
    # Default max_restarts: 3, max_seconds: 5
    # A network blip causing 4 crashes within 5 seconds exceeds the limit:
    # the supervisor exits and the failure escalates to its parent
    Supervisor.init(children, strategy: :one_for_one)
  end
end

Example — after:

defmodule MyApp.ExternalAPISupervisor do
  use Supervisor

  def init(_) do
    children = [{MyApp.APIClient, []}]
    # Allow more restarts for transient network issues
    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 60
    )
  end
end

When NOT to Use

Don't use this when:

You're setting max_restarts very high to "prevent crashes" — you're hiding bugs
The child crash is deterministic (same input = same crash) — fix the bug instead
You're relying on restart intensity as a backoff mechanism (use explicit backoff)

Over-application example:

# Setting absurdly high restart limits to "never crash"
Supervisor.init(children,
  strategy: :one_for_one,
  max_restarts: 1000,
  max_seconds: 1
)
# This allows 1000 crashes per second — you'll burn CPU and hide bugs

Better alternative:

# Reasonable limits + fix the underlying crash
Supervisor.init(children,
  strategy: :one_for_one,
  max_restarts: 5,
  max_seconds: 30
)
# If 5 crashes in 30 seconds isn't enough, the problem is the child, not the limit

Why: Restart intensity is a circuit breaker, not a throttle. It should escalate systemic failures, not suppress them. If you need aggressive restarts, your child has a bug.

Pattern 5: Restart Values — `:permanent` vs `:transient` vs `:temporary`

Source: lib/elixir/lib/supervisor.ex#L130 (Restart values section)

What it does: Three restart policies control what happens when a child terminates:

:permanent — always restart (default for GenServer/Agent/Supervisor)
:transient — restart only on abnormal exit (not :normal, :shutdown, {:shutdown, term})
:temporary — never restart (default for Task)

Why: Different processes have different lifecycle expectations. A database pool should always be running (:permanent). A task that computes a value and exits is done when it's done (:temporary). A connection process should restart on crash but not on graceful disconnect (:transient).

Anti-pattern: Making everything :permanent. If a one-shot task keeps restarting, it'll trigger restart intensity limits and take down the supervisor.

Code example from source:

# Task defaults to :temporary — intentional one-shot work
# (from task.ex:327)
def child_spec(arg) do
  %{
    id: Task,
    start: {Task, :start_link, [arg]},
    restart: :temporary
  }
end

# Customize via use:
use GenServer, restart: :transient
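The three restart values amount to a small decision table. The following sketch encodes that table as a pure function; RestartSketch and restart?/2 are hypothetical names for illustration, not part of the Supervisor API.

```elixir
defmodule RestartSketch do
  # Should a child with the given restart value be restarted
  # after exiting with `reason`?
  def restart?(:permanent, _reason), do: true
  def restart?(:temporary, _reason), do: false
  def restart?(:transient, :normal), do: false
  def restart?(:transient, :shutdown), do: false
  def restart?(:transient, {:shutdown, _term}), do: false
  def restart?(:transient, _abnormal), do: true
end

for restart <- [:permanent, :transient, :temporary],
    reason <- [:normal, {:shutdown, :done}, :boom] do
  IO.puts("#{restart} / #{inspect(reason)} -> #{RestartSketch.restart?(restart, reason)}")
end
```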

When to Use

Triggers:

You have different process types with different lifecycle expectations
One-shot tasks keep restarting and wasting resources
A connection should gracefully disconnect without triggering restart

Example — before:

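# One-shot worker with the default :permanent restart: it is restarted after finishing
defmodule MyApp.BatchWorker do
  use GenServer  # restart defaults to :permanent

  def start_link(batch), do: GenServer.start_link(__MODULE__, batch)

  def init(batch), do: {:ok, batch, {:continue, :process}}

  def handle_continue(:process, batch) do
    process_batch(batch)
    # Exits :normal, yet the supervisor restarts it and the batch runs again!
    {:stop, :normal, batch}
  end
end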

Example — after:

defmodule MyApp.BatchTask do
  use Task, restart: :temporary  # Don't restart completed tasks (already the default for use Task)

  def start_link(batch) do
    Task.start_link(__MODULE__, :run, [batch])
  end

  def run(batch), do: process_batch(batch)
end

defmodule MyApp.ConnectionWorker do
  use GenServer, restart: :transient  # Restart on crash, not graceful disconnect

  def disconnect(pid) do
    GenServer.stop(pid, :normal)  # Won't trigger restart
  end
end

When NOT to Use

Don't use this when:

You're using :temporary to avoid fixing a crash (the child just stays dead)
You set everything to :transient without thinking — :permanent is usually right for services

Over-application example:

# Making a critical service :temporary so it "doesn't bother the supervisor"
defmodule MyApp.PaymentProcessor do
  use GenServer, restart: :temporary
  # If this crashes, it stays dead! Payments stop working silently!
end

Better alternative:

defmodule MyApp.PaymentProcessor do
  use GenServer, restart: :permanent
  # Critical services should always restart — that's the whole point
end

Why: :permanent is the safe default for anything that should "always be running." Only use :transient for processes that have a valid "done" state, and :temporary for truly one-shot work.

Pattern 6: Automatic Shutdown for Pipeline Supervisors

Source: lib/elixir/lib/supervisor.ex#L349 (Automatic shutdown section)

What it does: Supervisors support :auto_shutdown which terminates the supervisor when significant children exit. Options: :any_significant (first significant child exits → shutdown) or :all_significant (all significant children must exit → shutdown).

Why: Models pipeline/workflow patterns where a supervisor's purpose is tied to its children's work. If all significant workers finish, the supervisor should clean up. This is useful for batch processing supervisors or connection-scoped process groups.

Anti-pattern: Manually monitoring children and calling Supervisor.stop/1. The automatic shutdown mechanism handles this cleanly within OTP semantics.

Code example (from docs):

# Only :transient and :temporary children can be marked significant
children = [
  Supervisor.child_spec({BatchWorker, args}, significant: true, restart: :transient)
]

Supervisor.start_link(children,
  strategy: :one_for_one,
  auto_shutdown: :all_significant
)
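The shutdown rule can be sketched as a predicate over the set of significant children. This is a simplified model under stated assumptions: AutoShutdownSketch and shutdown?/3 are hypothetical names, and `exited` is assumed to hold only children that terminated for good (for example, a :transient child that exited normally).

```elixir
defmodule AutoShutdownSketch do
  # `significant` is the list of significant child ids,
  # `exited` is the list of children that have terminated for good.
  def shutdown?(:never, _significant, _exited), do: false

  def shutdown?(:any_significant, significant, exited) do
    Enum.any?(significant, &(&1 in exited))
  end

  def shutdown?(:all_significant, significant, exited) do
    significant != [] and Enum.all?(significant, &(&1 in exited))
  end
end

significant = [:worker_a, :worker_b]

IO.inspect(AutoShutdownSketch.shutdown?(:any_significant, significant, [:worker_a]))
# → true
IO.inspect(AutoShutdownSketch.shutdown?(:all_significant, significant, [:worker_a]))
# → false
IO.inspect(AutoShutdownSketch.shutdown?(:all_significant, significant, [:worker_b, :worker_a]))
# → true
```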

When to Use

Triggers:

You have a supervisor managing a batch/pipeline where completion means "job done"
A supervisor's existence only makes sense while its children are doing work
You're building a workflow that should self-terminate

Example — before:

# Manual cleanup when batch workers finish
defmodule MyApp.BatchSupervisor do
  use DynamicSupervisor

  def all_done?(supervisor) do
    # Polling... ugly
    DynamicSupervisor.count_children(supervisor).active == 0
  end
end

# Somewhere else, a monitor process watches and cleans up
defmodule MyApp.BatchMonitor do
  use GenServer

  def handle_info(:check, state) do
    if MyApp.BatchSupervisor.all_done?(state.sup) do
      Supervisor.stop(state.sup)
    end
    {:noreply, state}
  end
end

Example — after:

defmodule MyApp.BatchSupervisor do
  use Supervisor

  def start_link(tasks) do
    Supervisor.start_link(__MODULE__, tasks)
  end

  def init(tasks) do
    children =
      Enum.map(tasks, fn task ->
        Supervisor.child_spec({MyApp.BatchWorker, task},
          id: task.id, restart: :transient, significant: true)
      end)

    Supervisor.init(children,
      strategy: :one_for_one,
      auto_shutdown: :all_significant
    )
  end
end
# When all workers complete normally, supervisor shuts down automatically

When NOT to Use

Don't use this when:

Children are long-lived services that should never "complete"
You want the supervisor to keep running even after children exit (for later restarts)
Children have :permanent restart — they can't be :significant

Over-application example:

# auto_shutdown on an infrastructure supervisor: it shuts down as soon as a significant child exits
children = [
  Supervisor.child_spec(MyApp.Cache, significant: true, restart: :transient),
  MyApp.WebServer
]
Supervisor.init(children,
  strategy: :one_for_one,
  auto_shutdown: :any_significant
)
# If the :transient Cache ever exits with :normal, the WHOLE supervisor shuts down, taking WebServer with it

Better alternative:

# Infrastructure supervisors should NOT auto-shutdown
Supervisor.init(children, strategy: :one_for_one)
# Only use auto_shutdown for workflow/batch supervisors with finite lifetimes

Why: auto_shutdown models "this supervisor's job is done when its children finish." It's for finite work, not long-lived services.

Pattern 7: Task.async/await for Concurrent Value Computation

Source: lib/elixir/lib/task.ex#L1 and lib/elixir/lib/task.ex#L300

What it does: Task.async spawns a linked, monitored process and returns a %Task{} struct. Task.await blocks until the result arrives or times out. This is the canonical pattern for "compute a value concurrently."

Why: Tasks provide structured concurrency — the caller is linked to the task, so failures propagate naturally. The monitor reference enables safe await with timeout. This is explicit and composable unlike raw spawn_link + receive.

Anti-pattern: Using spawn_link + send + receive for one-shot concurrent computation. You lose: error propagation, structured monitoring, timeout handling, and the caller-tracking metadata that tasks provide.

Code example from source:

task = Task.async(fn -> do_some_work() end)
res = do_some_other_work()
res + Task.await(task)

Key constraint from docs: "If you start an async, you must await. This is either done by calling Task.await/2 or Task.yield/2 followed by Task.shutdown/2."
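The second option in that constraint, Task.yield/2 followed by Task.shutdown/2, might look like the sketch below. The 50 ms sleep and the value 42 are placeholders for real work, and the mapping to {:ok, _} / {:error, _} tuples is one possible convention rather than anything the Task module prescribes.

```elixir
task =
  Task.async(fn ->
    Process.sleep(50)
    42
  end)

# Wait up to 1 second; if nothing has arrived, shut the task down.
# Task.shutdown/1 may itself return {:ok, value} if the reply arrived
# in the meantime, so both calls feed the same case.
result =
  case Task.yield(task, 1_000) || Task.shutdown(task) do
    {:ok, value} -> {:ok, value}
    {:exit, reason} -> {:error, reason}
    nil -> {:error, :timeout}
  end

IO.inspect(result)
# → {:ok, 42}
```

Unlike Task.await/2, this form does not exit the caller on timeout, which makes it a reasonable fit when a slow task should be abandoned rather than treated as a failure.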

When to Use

Triggers:

You need to compute a value concurrently and use it in the current flow
Multiple independent computations can run in parallel to reduce latency
The current process should crash if the computation fails (linked failure)

Example — before:

# Sequential — total time = sum of all operations
def build_dashboard(user_id) do
  profile = fetch_profile(user_id)        # 200ms
  orders = fetch_recent_orders(user_id)   # 300ms
  recommendations = compute_recs(user_id) # 500ms
  # Total: 1000ms

  %{profile: profile, orders: orders, recommendations: recommendations}
end

Example — after:

# Concurrent — total time = max of all operations
def build_dashboard(user_id) do
  profile_task = Task.async(fn -> fetch_profile(user_id) end)
  orders_task = Task.async(fn -> fetch_recent_orders(user_id) end)
  recs_task = Task.async(fn -> compute_recs(user_id) end)

  %{
    profile: Task.await(profile_task),
    orders: Task.await(orders_task),
    recommendations: Task.await(recs_task)
  }
  # Total: ~500ms (limited by slowest task)
end

When NOT to Use

Don't use this when:

The task might fail and you don't want the caller to crash (use async_nolink)
You're inside a GenServer and can't block on await (use async_nolink + handle_info)
The computation is trivial (< 1ms) — spawning a process adds overhead

Over-application example:

# Spawning a task for trivial work — overhead exceeds benefit
def format_name(user) do
  task = Task.async(fn -> String.upcase(user.name) end)
  Task.await(task)
  # Process spawn + message passing for a microsecond operation!
end

Better alternative:

def format_name(user) do
  String.upcase(user.name)
end

Why: Tasks add process spawn overhead (typically on the order of microseconds) plus message passing. Only use them when the work is expensive enough to justify parallelism, typically more than about 1ms, or when running multiple operations concurrently.

Pattern 8: Task.Supervisor.async_nolink for Fault-Tolerant Task Execution

Source: lib/elixir/lib/task/supervisor.ex#L240 (async_nolink docs with GenServer example)

What it does: Unlike Task.async, async_nolink spawns a task that is NOT linked to the caller. The caller monitors it and handles success/failure via handle_info. This prevents a task crash from killing the caller.

Why: In a GenServer, you often want to spawn work that might fail without taking down the server. The pattern: spawn with async_nolink, receive the result as {ref, answer}, and handle failure as {:DOWN, ref, :process, _pid, reason}.

Anti-pattern: Using Task.async inside a GenServer when the task might fail. The link means the GenServer crashes too. Use async_nolink + handle_info for resilient concurrent work.

Code example from source (task/supervisor.ex):

defmodule MyApp.Server do
  use GenServer

  def handle_call(:start_task, _from, %{ref: nil} = state) do
    task =
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        # potentially failing work
      end)

    {:reply, :ok, %{state | ref: task.ref}}
  end

  # Task completed successfully
  def handle_info({ref, answer}, %{ref: ref} = state) do
    Process.demonitor(ref, [:flush])
    {:noreply, %{state | ref: nil}}
  end

  # Task failed
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    {:noreply, %{state | ref: nil}}
  end
end
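The two message shapes can also be observed outside a GenServer. The following is a minimal sketch that runs in a plain process: it starts an unnamed Task.Supervisor and receives first a success reply and then a failure notification. The exit reason {:shutdown, :boom} is an arbitrary choice that keeps the task from logging a crash report.

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Success: the reply arrives as {ref, answer}.
ok_task = Task.Supervisor.async_nolink(sup, fn -> :done end)
ok_ref = ok_task.ref

success =
  receive do
    {^ok_ref, answer} ->
      # Drop the monitor and flush its pending :DOWN message.
      Process.demonitor(ok_ref, [:flush])
      {:ok, answer}
  after
    1_000 -> :timeout
  end

# Failure: no reply is sent; only the monitor's :DOWN message arrives,
# and the caller keeps running because there is no link.
bad_task = Task.Supervisor.async_nolink(sup, fn -> exit({:shutdown, :boom}) end)
bad_ref = bad_task.ref

failure =
  receive do
    {:DOWN, ^bad_ref, :process, _pid, reason} -> {:error, reason}
  after
    1_000 -> :timeout
  end

IO.inspect({success, failure})
```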

When to Use

Triggers:

A GenServer needs to spawn work that might fail without crashing the server
You're building a "request/response with timeout" pattern inside a GenServer
External calls (HTTP, DB) from a GenServer should be non-blocking and resilient

Example — before:

defmodule MyApp.Enricher do
  use GenServer

  @impl true
  def handle_call({:enrich, data}, _from, state) do
    # If this HTTP call crashes, the entire GenServer dies!
    result = Task.async(fn -> HTTPClient.post!("/api/enrich", data) end)
    enriched = Task.await(result, 5_000)
    {:reply, {:ok, enriched}, state}
  end
end

Example — after:

defmodule MyApp.Enricher do
  use GenServer

  @impl true
  def handle_call({:enrich, data}, from, state) do
    task = Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
      HTTPClient.post!("/api/enrich", data)
    end)
    {:noreply, Map.put(state, task.ref, from)}
  end

  @impl true
  def handle_info({ref, result}, state) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    {from, state} = Map.pop(state, ref)
    GenServer.reply(from, {:ok, result})
    {:noreply, state}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {from, state} = Map.pop(state, ref)
    GenServer.reply(from, {:error, reason})
    {:noreply, state}
  end
end

When NOT to Use

Don't use this when:

Task failure should crash the caller (you WANT linked failure propagation)
The caller is a short-lived process that can simply crash along with the task (note that try/rescue does not intercept a linked task's exit signal)
The work is fast and synchronous is acceptable

Over-application example:

# Using async_nolink for work that SHOULD crash the caller on failure
defmodule MyApp.CriticalPayment do
  use GenServer

  def handle_call({:charge, card}, _from, state) do
    task = Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
      PaymentGateway.charge!(card)
    end)
    # Now you have to manually handle the failure...
    # But if payment fails, maybe this GenServer SHOULD crash
    # to trigger a supervisor restart with clean state
  end
end

Better alternative:

# If failure should crash the GenServer, use Task.async (linked)
def handle_call({:charge, card}, _from, state) do
  result = Task.async(fn -> PaymentGateway.charge!(card) end)
  {:reply, Task.await(result), state}
end

Why: async_nolink is for resilient, non-critical work. If the task's failure means your GenServer's state is invalid, you want the link — let it crash and restart clean.

Pattern 9: Task Supervisor as DynamicSupervisor Specialization

Source: lib/elixir/lib/task/supervisor.ex#L151 (start_link implementation)

What it does: Task.Supervisor is implemented directly on top of DynamicSupervisor. It stores default restart/shutdown settings in the process dictionary and delegates init to DynamicSupervisor.init.

Why: Specialization without duplication. Task.Supervisor adds task-specific behavior (caller tracking, async/nolink patterns, stream support) on top of the generic dynamic supervision infrastructure. It's a compositional pattern — build specialized supervisors by wrapping the generic one.

Anti-pattern: Re-implementing task supervision from scratch with a plain DynamicSupervisor + custom start logic. Use Task.Supervisor — it handles caller tracking, owner propagation, and proper shutdown.

Code example from source:

# Task.Supervisor.start_link delegates to DynamicSupervisor
def start_link(options \\ []) do
  {restart, options} = Keyword.pop(options, :restart)
  {shutdown, options} = Keyword.pop(options, :shutdown)
  keys = [:max_children, :max_seconds, :max_restarts]
  {sup_opts, start_opts} = Keyword.split(options, keys)
  restart_and_shutdown = {restart || :temporary, shutdown || 5000}
  DynamicSupervisor.start_link(__MODULE__, {restart_and_shutdown, sup_opts}, start_opts)
end

def init({{_restart, _shutdown} = arg, options}) do
  Process.put(__MODULE__, arg)
  DynamicSupervisor.init([strategy: :one_for_one] ++ options)
end
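One visible consequence of this layering is that a Task.Supervisor process also answers the generic DynamicSupervisor queries. The sketch below illustrates that under the assumption that the layering stays as shown in the excerpt; the 500 ms sleep only keeps the task alive long enough to be observed.

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Start a task that stays alive long enough to be observed.
{:ok, pid} = Task.Supervisor.start_child(sup, fn -> Process.sleep(500) end)

# The task-specific API and the generic DynamicSupervisor API both see it.
task_pids = Task.Supervisor.children(sup)
counts = DynamicSupervisor.count_children(sup)

IO.inspect({task_pids == [pid], counts.active})
# → {true, 1}
```

In application code the Task.Supervisor functions remain the better choice; calling DynamicSupervisor functions on it relies on an implementation detail.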

When to Use

Triggers:

You're building task infrastructure that needs proper shutdown and caller tracking
You want async_nolink + streaming + concurrency limiting for tasks
You need tasks to be supervised (restarted, tracked, shut down gracefully)

Example — before:

# Rolling your own task management on DynamicSupervisor
defmodule MyApp.Workers do
  def start_task(fun) do
    spec = %{id: make_ref(), start: {Task, :start_link, [fun]}, restart: :temporary}
    DynamicSupervisor.start_child(MyApp.WorkerSup, spec)
  end

  # No caller tracking, no async_nolink, no stream support
  # Must manually build all of that
end

Example — after:

# Task.Supervisor gives you all of this for free
defmodule MyApp.Application do
  def start(_type, _args) do
    children = [
      {Task.Supervisor, name: MyApp.TaskSupervisor}
    ]
    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

# Now you get: async, async_nolink, async_stream, start_child, etc.
Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn -> work() end)

When NOT to Use

Don't use this when:

Tasks are fire-and-forget and you don't need supervision (just Task.start/1)
You need custom child specs with complex init logic (use DynamicSupervisor directly)
You're spawning non-Task children (GenServers, Agents)

Over-application example:

# Using Task.Supervisor to start GenServers — wrong tool
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  # This spawns a Task that starts a GenServer... awkward
  {:ok, _} = MyApp.Worker.start_link(args)
  Process.sleep(:infinity)  # Keep the task alive??
end)

Better alternative:

# Use DynamicSupervisor for non-Task children
DynamicSupervisor.start_child(MyApp.WorkerSupervisor, {MyApp.Worker, args})

Why: Task.Supervisor is purpose-built for Task processes. It adds $callers propagation and task-specific APIs (async, async_nolink, async_stream). For anything that isn't a Task, use DynamicSupervisor.

Pattern 10: Registry for Dynamic Process Naming and PubSub

Source: lib/elixir/lib/registry.ex#L1 (module docs), lib/elixir/lib/registry.ex#L250 (whereis_name via callbacks)

What it does: Registry provides two modes:

:unique keys — each key maps to exactly one process (name registry, process lookup)
:duplicate keys — each key maps to many processes (PubSub topics, event dispatch)

Processes are automatically unregistered on death. Registry integrates with GenServer naming via {:via, Registry, {registry, key}}.

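Why: Solves the dynamic naming problem without atom leaks. Also provides local PubSub without external dependencies. The registry is backed by ETS, so lookups don't funnel through a single process, and it can be split across several partitions (the :partitions option, default 1) for scalability.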

Anti-pattern: Building custom ETS-based process registries with manual cleanup on process death. Registry handles monitor-based cleanup automatically.

Code example from source (registry.ex :via callbacks):

# :via integration — GenServer uses these callbacks
def whereis_name({registry, key}), do: whereis_name(registry, key)
def whereis_name({registry, key, _value}), do: whereis_name(registry, key)

defp whereis_name(registry, key) do
  case key_info!(registry) do
    {:unique, partitions, key_ets} ->
      key_ets = key_ets || key_ets!(registry, key, partitions)
      case lookup_second(:unique, key_ets, key) do
        {pid, _} ->
          if Process.alive?(pid), do: pid, else: :undefined
        _ ->
          :undefined
      end
  end
end

When to Use

Triggers:

You need to look up processes by a dynamic key without atom leaks
You want local PubSub (subscribe/dispatch to topics) without external deps
You're building per-entity process pools (per-user, per-room, per-device)

Example — before:

# Custom ETS-based registry with manual cleanup
defmodule MyApp.ProcessRegistry do
  def register(key, pid) do
    ref = Process.monitor(pid)
    :ets.insert(:registry, {key, pid, ref})
  end

  def lookup(key) do
    case :ets.lookup(:registry, key) do
      [{^key, pid, _ref}] -> {:ok, pid}
      [] -> :error
    end
  end

  # Must handle :DOWN manually to clean up dead entries
  def handle_info({:DOWN, ref, :process, pid, _reason}, state) do
    :ets.match_delete(:registry, {:_, pid, ref})
    {:noreply, state}
  end
end

Example — after:

# Registry handles all of this automatically
# In supervision tree:
{Registry, keys: :unique, name: MyApp.GameRegistry}

# Registration happens via :via tuple — automatic cleanup on death
defmodule MyApp.GameSession do
  use GenServer

  def start_link(game_id) do
    GenServer.start_link(__MODULE__, game_id,
      name: {:via, Registry, {MyApp.GameRegistry, game_id}})
  end
end
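The :duplicate mode described above can be sketched as a tiny local PubSub. This is an illustrative example only; the registry name MyApp.PubSub, the topic string, and the message shape are assumptions, not part of the source.

```elixir
# Hypothetical registry name; :duplicate lets many processes share one key.
{:ok, _} = Registry.start_link(keys: :duplicate, name: MyApp.PubSub)

# Subscribe the current process to a topic. The value (:meta) is arbitrary.
{:ok, _owner} = Registry.register(MyApp.PubSub, "room:42", :meta)

# Broadcast: the callback receives every {pid, value} entry under the key.
Registry.dispatch(MyApp.PubSub, "room:42", fn entries ->
  for {pid, _value} <- entries, do: send(pid, {:broadcast, "room:42", :hello})
end)

received =
  receive do
    {:broadcast, topic, msg} -> {topic, msg}
  after
    1_000 -> :nothing
  end

IO.inspect(received)
```

When the subscribing process exits, its entries are removed automatically, so there is no unsubscribe bookkeeping to write.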

When NOT to Use

Don't use this when:

You need distributed/cluster-wide process registration (use Horde, :global, or pg)
Process lookup is the hot path and you need sub-microsecond latency (direct PID passing)
You have a fixed set of processes known at compile time (atom names are simpler)

Over-application example:

# Using Registry when you could just pass the PID directly
defmodule MyApp.Pipeline do
  def process(data) do
    # Register a process just to look it up one line later...
    {:ok, pid} = MyApp.Worker.start_link(data)
    # Why not just use `pid` directly?
    worker = Registry.lookup(MyApp.Registry, data.id) |> List.first()
    GenServer.call(elem(worker, 0), :process)
  end
end

Better alternative:

defmodule MyApp.Pipeline do
  def process(data) do
    {:ok, pid} = MyApp.Worker.start_link(data)
    GenServer.call(pid, :process)
  end
end

Why: Registry shines when the looker-upper doesn't know the PID (arrived in a different request, different process tree). If you already have the PID, just use it directly.

Pattern 11: Shutdown Semantics — Graceful Termination

Source: lib/elixir/lib/supervisor.ex#L156 (Shutdown values section)

What it does: Three shutdown modes:

:brutal_kill — immediate Process.exit(child, :kill), no cleanup
integer (ms) — send :shutdown signal, wait N ms, then :kill
:infinity — wait forever (default for supervisor children)

Workers default to 5000ms. Supervisors default to :infinity (to give their children time).

Why: Graceful shutdown enables cleanup (closing connections, flushing buffers, deregistering from services). The timeout prevents hung processes from blocking system shutdown indefinitely. The hierarchy matters: supervisors need infinite time because they're waiting for their own children to shut down.

Anti-pattern: Setting :brutal_kill on processes that hold external resources (DB connections, file handles). They'll leak. Also: setting :infinity on worker processes — a bug in terminate/2 will hang your entire shutdown.

Code example from source:

# From supervisor.ex docs:
# :brutal_kill - unconditional and immediate termination
# integer >= 0 - wait that many ms after :shutdown signal
# :infinity - wait forever (recommended for supervisors)

# Worker default: 5000ms
%{shutdown: 5_000, type: :worker}

# Supervisor default: :infinity
%{shutdown: :infinity, type: :supervisor}
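A minimal sketch of how a shutdown value ends up in a child spec, assuming a throwaway module named ShutdownDemo.Worker (hypothetical). It shows the per-child override through Supervisor.child_spec/2 as an alternative to passing shutdown: to use GenServer.

```elixir
defmodule ShutdownDemo.Worker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

# Spec generated by `use GenServer`: no :shutdown key is set here,
# so the supervisor applies its default for workers.
default = ShutdownDemo.Worker.child_spec(:arg)

# Override at the call site, e.g. inside a supervisor's children list.
custom = Supervisor.child_spec({ShutdownDemo.Worker, :arg}, shutdown: 10_000)

IO.inspect(default)
IO.inspect(custom)
```

Overriding at the call site keeps the module reusable when only one supervision tree needs the longer timeout.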

When to Use

Triggers:

You're deploying and need processes to flush buffers, close connections, or deregister
Child processes hold external resources that leak if killed immediately
Your system has a clean shutdown requirement (compliance, data integrity)

Example — before:

# :brutal_kill on a process that writes to disk — data loss
defmodule MyApp.WriteAheadLog do
  use GenServer, shutdown: :brutal_kill  # BAD: loses buffered writes

  @impl true
  def terminate(_reason, state) do
    # This never runs with :brutal_kill!
    flush_buffer_to_disk(state.buffer)
  end
end

Example — after:

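defmodule MyApp.WriteAheadLog do
  use GenServer, shutdown: 10_000  # 10 seconds to flush

  @impl true
  def init(state) do
    # terminate/2 only runs on supervisor shutdown if the process traps exits
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(_reason, state) do
    flush_buffer_to_disk(state.buffer)
    close_file_handle(state.fd)
  end
end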

When NOT to Use

Don't use this when:

The process is a worker and you're tempted to set :infinity (a bug in terminate/2 then hangs your entire shutdown)
The process holds no external resources (the default 5000ms is fine)
The process is a supervisor and you're tempted to set :brutal_kill (supervisors need time to stop their children)

Over-application example:

# :infinity shutdown on a worker — if terminate hangs, deployment hangs
defmodule MyApp.Worker do
  use GenServer, shutdown: :infinity

  @impl true
  def terminate(_reason, state) do
    # If this HTTP call hangs forever, your entire app can't shut down
    HTTPClient.post!("/api/deregister", %{id: state.id})
  end
end

Better alternative:

defmodule MyApp.Worker do
  use GenServer, shutdown: 15_000  # Generous but bounded

  @impl true
  def terminate(_reason, state) do
    # Put a timeout on the cleanup call too, and stop the task if it overruns
    task = Task.async(fn -> HTTPClient.post("/api/deregister", %{id: state.id}) end)
    Task.yield(task, 10_000) || Task.shutdown(task, :brutal_kill)
  end
end

Why: :infinity is safe for supervisors (they're waiting for children) but dangerous for workers. A hung terminate/2 with infinite shutdown blocks your entire deployment pipeline.

Pattern 12: DynamicSupervisor Internal State — Struct with Restart Tracking

Source: lib/elixir/lib/dynamic_supervisor.ex#L165 (defstruct)

What it does: The DynamicSupervisor uses a struct for its GenServer state with explicit fields: children (map of pid → child spec), restarts (list of timestamps for rate limiting), and configuration fields.

Why: This shows the Elixir team's state design philosophy: use a struct with named fields, not a bare map or tuple. The children field uses a %{} map keyed by PID for O(1) lookup/deletion on child exit. The restarts list uses a simple sliding-window approach for restart intensity.

Anti-pattern: Using a list for children lookup (O(n) on every EXIT message), or using a tuple-based state that requires positional knowledge.

Code example from source:

defstruct [
  :args,
  :extra_arguments,
  :mod,
  :name,
  :strategy,
  :max_children,
  :max_restarts,
  :max_seconds,
  children: %{},
  restarts: []
]
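The sliding-window idea behind the restarts field can be sketched as follows. This is a simplified illustration under stated assumptions, not the DynamicSupervisor implementation: the module name RestartWindow is hypothetical, and timestamps are plain integers in seconds, newest first.

```elixir
defmodule RestartWindow do
  # Hypothetical helper: record a restart at `now` and decide whether the
  # supervisor may continue. Timestamps older than max_seconds are dropped.
  def add_restart(restarts, now, max_restarts, max_seconds) do
    recent = Enum.filter([now | restarts], fn t -> now - t <= max_seconds end)

    if length(recent) > max_restarts do
      {:shutdown, recent}
    else
      {:ok, recent}
    end
  end
end

# Three restarts within 5 seconds are tolerated, the fourth is not.
{:ok, r} = RestartWindow.add_restart([], 100, 3, 5)
{:ok, r} = RestartWindow.add_restart(r, 101, 3, 5)
{:ok, r} = RestartWindow.add_restart(r, 102, 3, 5)
IO.inspect(RestartWindow.add_restart(r, 103, 3, 5))

# A restart long after the burst starts a fresh window.
IO.inspect(RestartWindow.add_restart(r, 110, 3, 5))
```

Keeping only timestamps inside the window means the list stays short, which is why a plain list is adequate for this field.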

When to Use

Triggers:

You need to understand the internal implementation of a supervisor for debugging
You're building a custom supervisor-like process
You want to understand why DynamicSupervisor uses a map keyed by PID

Example — before:

# Using a list to track children — O(n) on every EXIT message
defmodule MyApp.CustomSupervisor do
  use GenServer

  @impl true
  def init(_) do
    {:ok, %{children: []}}  # List! Every EXIT scans the whole thing
  end

  def handle_info({:EXIT, pid, _reason}, state) do
    # O(n) scan to find and remove the dead child
    children = Enum.reject(state.children, fn {p, _spec} -> p == pid end)
    {:noreply, %{state | children: children}}
  end
end

Example — after:

defmodule MyApp.CustomSupervisor do
  use GenServer

  defstruct children: %{}, restarts: [], max_restarts: 3, max_seconds: 5

  @impl true
  def init(_) do
    Process.flag(:trap_exit, true)
    {:ok, %__MODULE__{}}
  end

  def handle_info({:EXIT, pid, _reason}, state) do
    # O(1) lookup and delete
    {_spec, children} = Map.pop(state.children, pid)
    {:noreply, %{state | children: children}}
  end
end

When NOT to Use

Don't use this when:

You're building a production supervisor (use Supervisor/DynamicSupervisor)
You don't need custom supervision logic (the standard supervisors cover 99% of cases)
You're optimizing prematurely — most apps never have enough children for data structure choice to matter

Over-application example:

# Building a custom supervisor because "I want more control"
defmodule MyApp.FancySupervisor do
  use GenServer
  # 200 lines of restart logic, child tracking, shutdown handling...
  # Congratulations, you've reimplemented DynamicSupervisor with more bugs
end

Better alternative:

# Just use the standard one
{DynamicSupervisor, name: MyApp.FancySupervisor, strategy: :one_for_one}

Why: The standard supervisors are battle-tested over decades. Build custom only when you need semantics they don't provide (e.g., priority-based restart, custom backoff).

Pattern 13: Restart Logic with Exponential Backoff via :try_again

Source: lib/elixir/lib/dynamic_supervisor.ex#L710 (restart_child and related functions)

What it does: When a child fails to restart (its start function returns an error), DynamicSupervisor doesn't retry in a blocking loop. It stores the child as {:restarting, child}, sends itself a {:"$gen_restart", pid} message, and tries again when that message is handled. This prevents the supervisor from blocking on a transiently failing child.

Why: During network partitions or resource exhaustion, a child might fail to start immediately but succeed moments later. Routing the retry through the supervisor's own mailbox lets it keep serving start_child calls and other exits between attempts. Two caveats, as far as the source shows: the retry message is sent immediately, with no delay or backoff, and each attempt passes through the same restart-intensity accounting as any other restart, so a child that keeps failing still trips max_restarts.

Anti-pattern: Relying on the supervisor to ride out a long outage. Repeated start failures exhaust max_restarts quickly during transient failures like port conflicts; put delayed or backed-off retries inside the child instead.

Code example from source:

defp restart_child(:one_for_one, current_pid, child, state) do
  {{m, f, args} = mfa, restart, shutdown, type, modules} = child
  %{extra_arguments: extra} = state

  case start_child(m, f, extra ++ args) do
    {:ok, pid, _} ->
      state = delete_child(current_pid, state)
      {:ok, save_child(pid, mfa, restart, shutdown, type, modules, state)}

    {:ok, pid} ->
      state = delete_child(current_pid, state)
      {:ok, save_child(pid, mfa, restart, shutdown, type, modules, state)}

    :ignore ->
      {:ok, delete_child(current_pid, state)}

    {:error, reason} ->
      report_error(:start_error, reason, {:restarting, current_pid}, child, state)
      state = put_in(state.children[current_pid], {:restarting, child})
      {:try_again, state}
  end
end

When to Use

Triggers:

A child fails to start due to transient conditions (port conflict, network partition)
You're seeing restart intensity limits hit because start failures count as restarts
You need a supervisor that tolerates temporary resource unavailability

Example — before:

# Every start failure counts against restart intensity
# 3 failures in 5 seconds → supervisor crashes → cascading failure
defmodule MyApp.ConnectionPool do
  use GenServer

  @impl true
  def init(config) do
    # If DB is temporarily unreachable, this crashes...
    # ...which counts as a restart...
    # ...which can exhaust restart intensity
    {:ok, conn} = DBConnection.start_link(config)
    {:ok, %{conn: conn}}
  end
end

Example — after:

defmodule MyApp.ConnectionPool do
  use GenServer

  @impl true
  def init(config) do
    # Use handle_continue for the connection attempt
    {:ok, %{config: config, conn: nil}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    case DBConnection.start_link(state.config) do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn}}
      {:error, _reason} ->
        # Retry after delay without counting against restart intensity
        Process.send_after(self(), :retry_connect, 5_000)
        {:noreply, state}
    end
  end

  @impl true
  def handle_info(:retry_connect, state) do
    {:noreply, state, {:continue, :connect}}
  end
end
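The retry above waits a fixed 5 seconds. If a growing delay is preferred, the delay can be computed per attempt. The helper below is a hypothetical sketch (module name, base delay, and cap are all assumptions); it is not something DynamicSupervisor or GenServer provides.

```elixir
defmodule Backoff do
  # Hypothetical helper: the delay doubles with each attempt (0-based)
  # and is capped at max_ms so it never grows without bound.
  def delay(attempt, base_ms \\ 500, max_ms \\ 30_000)
      when is_integer(attempt) and attempt >= 0 do
    min(base_ms * Integer.pow(2, attempt), max_ms)
  end
end

IO.inspect(Enum.map(0..7, &Backoff.delay/1))
# => [500, 1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

In the GenServer above, this would mean keeping an attempt counter in state and calling Process.send_after(self(), :retry_connect, Backoff.delay(attempt)) in the error branch, resetting the counter on success.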

When NOT to Use

Don't use this when:

The start failure is deterministic (config error, missing module) — fix the bug
You're relying on automatic retry to avoid proper health checking
The child should NOT start at all if dependencies are unavailable (use :ignore)

Over-application example:

# Retrying forever when the error is permanent
defmodule MyApp.MisconfiguredWorker do
  use GenServer

  @impl true
  def init(%{api_key: nil}) do
    # This will NEVER succeed — the key is nil!
    # Infinite retry just wastes resources
    Process.send_after(self(), :retry, 5_000)
    {:ok, %{}}
  end
end

Better alternative:

defmodule MyApp.MisconfiguredWorker do
  use GenServer

  @impl true
  def init(%{api_key: nil}) do
    # Fail fast on permanent configuration errors
    {:stop, {:error, :missing_api_key}}
  end
end

Why: Retry logic is for transient failures (network, resource contention). For permanent errors (bad config, missing deps), fail fast so the operator can fix the actual problem.

Pattern 14: $ancestors and $callers — Process Lineage Tracking

Source: lib/elixir/lib/task.ex#L227 (Ancestor and Caller Tracking section)

What it does: Elixir uses two process dictionary keys for lineage:

$ancestors — the supervision hierarchy (who spawned/supervises this process)
$callers — the logical call chain (who requested this work)

These are different! A task's ancestor is its supervisor, but its caller is the process that initiated the async operation.

Why: Debugging and tracing. When a task crashes, the log includes both its supervisor (for restart context) and its caller (for business logic context). This dual tracking is essential for understanding failures in systems where the spawner and supervisor are different processes.

Anti-pattern: Ignoring caller tracking when building custom process spawning. If you build something like Task.Supervisor, propagate $callers so crash logs are meaningful.

Code example from source (task.ex):

defp get_callers(owner) do
  case :erlang.get(:"$callers") do
    [_ | _] = list -> [owner | list]
    _ -> [owner]
  end
end

# Task.start_link propagates both owner and callers
def start_link(module, function, args)
    when is_atom(module) and is_atom(function) and is_list(args) do
  mfa = {module, function, args}
  Task.Supervised.start_link(get_owner(self()), get_callers(self()), mfa)
end
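The propagation can be observed directly by reading the process dictionary inside a task. This is a small sketch for illustration; it only relies on Task.async/1 and Process.get/1.

```elixir
parent = self()

# Inside the task, $callers holds the chain of processes that requested the work.
task = Task.async(fn -> Process.get(:"$callers") end)
callers = Task.await(task)

# The head of the list is the process that called Task.async/1.
IO.inspect(hd(callers) == parent)
# => true
```

If the parent itself was started by a task, its own callers follow in the tail of the list.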

When to Use

Triggers:

You're debugging crashes and need to understand where a task was spawned from
You're building custom process spawning and want crash logs to show the call chain
You need to trace a request through multiple spawned processes

Example — before:

# Custom spawner that loses caller context
defmodule MyApp.BackgroundJob do
  def run_async(fun) do
    spawn_link(fn ->
      # When this crashes, the log shows no context about WHO spawned it
      fun.()
    end)
  end
end

# Crash log:
# [error] Process #PID<0.234.0> raised an exception
# ** (RuntimeError) something went wrong
# No idea who called run_async or why!

Example — after:

defmodule MyApp.BackgroundJob do
  def run_async(fun) do
    owner = self()
    callers = case Process.get(:"$callers") do
      [_ | _] = list -> [owner | list]
      _ -> [owner]
    end

    spawn_link(fn ->
      Process.put(:"$callers", callers)
      fun.()
    end)
  end
end

# $callers in the spawned process now holds the chain, e.g.
# [#PID<0.200.0>, #PID<0.150.0>], for any tool or log handler that reads it
# (a bare spawn_link crash report may not print it on its own)

When NOT to Use

Don't use this when:

You're using Task/Task.Supervisor (they propagate callers automatically)
The process is long-lived and the original caller is irrelevant after startup
You're spawning processes that outlive their callers (callers list becomes stale)

Over-application example:

# Tracking callers for a permanent GenServer — pointless after init
defmodule MyApp.Cache do
  use GenServer

  def start_link(_) do
    # The "caller" of start_link is the supervisor — not useful for debugging
    # After boot, the cache serves many callers — the original spawner is irrelevant
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end
end

Better alternative:

# For long-lived processes, use Logger.metadata or OpenTelemetry spans
# to track per-request context, not process lineage.
# Logger metadata is per-process, so the caller passes its request_id along.
def handle_call({:get, key, request_id}, _from, state) do
  Logger.metadata(request_id: request_id)
  {:reply, Map.get(state, key), state}
end

Why: $callers is useful for short-lived spawned work (tasks, one-shot processes). For long-lived services, per-request tracing (metadata, spans) is more appropriate than process lineage.

Pattern 15: GenServer.reply/2 for Deferred Responses

Source: lib/elixir/lib/gen_server.ex#L620 (callback docs), lib/elixir/lib/gen_server.ex#L1328 (reply/2 function)

What it does: A handle_call can return {:noreply, state} without replying, then later call GenServer.reply(from, response) from any process. This decouples request receipt from response delivery.

Why: Three use cases (from the source):

Reply before returning (response known, but need to do cleanup after)
Reply after returning (response not yet available, computed asynchronously)
Reply from another process (delegate work to a task)

This enables non-blocking request handling in GenServers that would otherwise be bottlenecked.

Anti-pattern: Spawning a task to do work and then having the GenServer block on Task.await inside handle_call. This defeats the purpose — use reply/2 from the task instead.

Code example from source:

def handle_call(:reply_in_one_second, from, state) do
  Process.send_after(self(), {:reply, from}, 1_000)
  {:noreply, state}
end

def handle_info({:reply, from}, state) do
  GenServer.reply(from, :one_second_has_passed)
  {:noreply, state}
end
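The first use case, replying before returning, is not covered by the example above. A minimal sketch, assuming a hypothetical module named ReplyEarly that counts served requests after the caller has already been unblocked:

```elixir
defmodule ReplyEarly do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{served: 0}}

  @impl true
  def handle_call(:ping, from, state) do
    # Reply first so the caller is unblocked...
    GenServer.reply(from, :pong)
    # ...then do bookkeeping the caller does not need to wait for.
    {:noreply, %{state | served: state.served + 1}}
  end

  def handle_call(:served, _from, state), do: {:reply, state.served, state}
end

{:ok, pid} = ReplyEarly.start_link()
IO.inspect(GenServer.call(pid, :ping))
IO.inspect(GenServer.call(pid, :served))
```

The callback must return {:noreply, state} after calling reply/2; returning {:reply, ...} as well would send a second, stray reply.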

When to Use

Triggers:

A GenServer needs to do async work before replying (DB query, HTTP call, aggregation)
You want to reply from a different process than the one that received the request
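You need to park a request (keep from in state) and answer it when a later event arrives; each call accepts only one reply, so progress updates need ordinary messages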

Example — before:

defmodule MyApp.Aggregator do
  use GenServer

  @impl true
  def handle_call(:aggregate, _from, state) do
    # Blocks the GenServer for potentially seconds
    # No other calls can be processed during this time
    result = Enum.reduce(state.sources, %{}, fn source, acc ->
      data = HTTPClient.get!(source.url).body
      Map.merge(acc, Jason.decode!(data))
    end)
    {:reply, result, state}
  end
end

Example — after:

defmodule MyApp.Aggregator do
  use GenServer

  @impl true
  def handle_call(:aggregate, from, state) do
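    # Don't block: run the work in a supervised task and reply from there.
    # start_child is used (not async_nolink) because this server never awaits the result.
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->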
      result = Enum.reduce(state.sources, %{}, fn source, acc ->
        data = HTTPClient.get!(source.url).body
        Map.merge(acc, Jason.decode!(data))
      end)
      GenServer.reply(from, result)
    end)
    {:noreply, state}
  end
end

When NOT to Use

Don't use this when:

The work is fast (< 1ms) — just reply inline
You need the reply to be ordered with respect to other calls (deferred replies break ordering)
The from reference escapes to a long-lived process (it holds a monitor that should be cleaned up)

Over-application example:

# Deferring reply for trivial work — unnecessary complexity
defmodule MyApp.Counter do
  use GenServer

  @impl true
  def handle_call(:get, from, state) do
    # This is instant! Why defer?
    Task.start(fn -> GenServer.reply(from, state.count) end)
    {:noreply, state}
  end
end

Better alternative:

defmodule MyApp.Counter do
  use GenServer

  @impl true
  def handle_call(:get, _from, state) do
    {:reply, state.count, state}
  end
end

Why: reply/2 enables non-blocking GenServers for expensive operations. For cheap operations, it adds process spawn overhead, potential ordering issues, and code complexity for no benefit.

Pattern 16: Process.alias for Safe Request/Response

Source: lib/elixir/lib/process.ex#L32 (Aliases section)

What it does: Process aliases (Erlang/OTP 24+) provide a deactivatable reference for receiving replies. After sending a request with an alias as the reply address, you can deactivate the alias if you no longer want the response — any messages sent to a deactivated alias are silently dropped.

Why: Solves the "late reply" problem. In request/response patterns, if the requester times out and moves on, a late reply to its PID could confuse future receive blocks. With aliases, you deactivate after timeout and the late reply harmlessly vanishes.

Anti-pattern: Using bare PIDs for reply addresses in protocols where timeouts are possible. Late messages pollute the mailbox.

Code example from source:

server = spawn(&server/0)

source_alias = Process.alias()
send(server, {:ping, source_alias})

receive do
  :pong -> :pong
end

# Deactivate — late replies to this alias are silently dropped
Process.unalias(source_alias)
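The late-reply case can be reproduced in a few lines. This is a sketch under stated assumptions: the spawned process stands in for a slow server, and the 200 ms and 50 ms figures are arbitrary values chosen so the reply arrives after the timeout.

```elixir
reply_alias = Process.alias()

# Hypothetical slow server: it answers 200 ms after being started.
spawn(fn ->
  Process.sleep(200)
  send(reply_alias, {:reply, :late})
end)

result =
  receive do
    {:reply, value} -> {:ok, value}
  after
    50 ->
      # Give up and deactivate the alias so the late reply is dropped.
      Process.unalias(reply_alias)
      {:error, :timeout}
  end

# Wait past the server's reply time and check whether anything arrived.
leftover =
  receive do
    {:reply, _} = msg -> {:unexpected, msg}
  after
    300 -> :mailbox_clean
  end

IO.inspect({result, leftover})
```

Replacing reply_alias with self() in the send call makes the late message land in the mailbox, which is the problem the alias avoids.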

When to Use

Triggers:

You're building request/response patterns with timeouts where late replies pollute the mailbox
A GenServer sends a request and moves on after timeout, but the response arrives later
You need safe cancellation of pending responses

Example — before:

defmodule MyApp.RequestRouter do
  use GenServer

  @impl true
  def handle_call({:request, payload}, _from, state) do
    send(state.backend, {:request, self(), payload})
    receive do
      {:response, result} -> {:reply, result, state}
    after
      5_000 ->
        # Timeout... but the response might still arrive later!
        # It'll sit in our mailbox and confuse future receives
        {:reply, {:error, :timeout}, state}
    end
  end
end

Example — after:

defmodule MyApp.RequestRouter do
  use GenServer

  @impl true
  def handle_call({:request, payload}, _from, state) do
    alias_ref = Process.alias([:reply])
    send(state.backend, {:request, alias_ref, payload})

    receive do
      {^alias_ref, result} -> {:reply, result, state}
    after
      5_000 ->
        # Deactivate the alias — late replies are silently dropped
        Process.unalias(alias_ref)
        {:reply, {:error, :timeout}, state}
    end
  end
end

When NOT to Use

Don't use this when:

You're using GenServer.call (it already handles this with its own ref-based protocol)
The response will always arrive (no timeout scenario)
You're on OTP < 24 (aliases aren't available)

Over-application example:

# Using aliases for GenServer.call — it already handles late replies
defmodule MyApp.Client do
  def get_data(server) do
    alias_ref = Process.alias([:reply])
    # Pointless — GenServer.call already uses monitor-based protocol
    # that handles late replies correctly
    GenServer.call(server, {:get, alias_ref})
  end
end

Better alternative:

defmodule MyApp.Client do
  def get_data(server) do
    # GenServer.call already handles timeouts and late replies correctly
    GenServer.call(server, :get, 5_000)
  end
end

Why: Aliases solve the problem for custom protocols where you build your own request/response. GenServer.call already has equivalent protections built in. Use aliases when you're implementing raw message-based protocols.

Pattern 17: Registry Partitioning Strategies

Source: lib/elixir/lib/registry.ex#L310 (start_link partitioning docs)

What it does: Duplicate registries support two partitioning strategies:

{:duplicate, :pid} (default) — groups entries by the registering process's PID. Good for few keys with many entries (e.g., one PubSub topic with many subscribers).
{:duplicate, :key} — groups entries by key. Good for many keys with few entries each (e.g., many topics with few subscribers).

Why: The partitioning strategy determines which partition(s) need to be scanned during lookup. With :key partitioning, a key lookup hits exactly one partition (O(1) partitions). With :pid partitioning, key lookups must scan all partitions but process-based operations (unregister on death) are localized.

Anti-pattern: Using default :pid partitioning with millions of unique keys and frequent lookups. Each lookup scans all partitions. Switch to {:duplicate, :key}.

Code example from source:

# Many topics, few subscribers each — use key partitioning
Registry.start_link(
  keys: {:duplicate, :key},
  name: MyApp.TopicRegistry,
  partitions: System.schedulers_online()
)

# Few topics, many subscribers — use pid partitioning (default)
Registry.start_link(
  keys: :duplicate,
  name: MyApp.BroadcastRegistry,
  partitions: System.schedulers_online()
)
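The tradeoff can be pictured with a toy model of which partitions a key lookup has to visit. This is purely illustrative and is not Registry's internal code: the module name PartitionSketch is hypothetical, and hashing with :erlang.phash2/2 is an assumption made for the sketch.

```elixir
defmodule PartitionSketch do
  # Key partitioning: the key alone determines the partition, so one table is read.
  def partitions_to_scan(:key, key, n), do: [:erlang.phash2(key, n)]

  # Pid partitioning: entries for a key may sit in any partition, so all are read.
  def partitions_to_scan(:pid, _key, n), do: Enum.to_list(0..(n - 1))
end

IO.inspect(length(PartitionSketch.partitions_to_scan(:key, "order.created", 16)))
# => 1
IO.inspect(length(PartitionSketch.partitions_to_scan(:pid, "order.created", 16)))
# => 16
```

The same model applies in reverse to cleanup on process death: with pid partitioning the dead process's entries are in one partition, with key partitioning they may be spread across all of them.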

When to Use

Triggers:

You have a PubSub with many topics and few subscribers per topic — key lookups are slow
Profiling shows Registry.dispatch scanning many partitions for key-based lookups
You're choosing between "optimize for subscribe/unsubscribe" vs "optimize for dispatch"

Example — before:

# Default :pid partitioning with many unique keys
# Each dispatch must scan ALL partitions to find subscribers for a key
Registry.start_link(keys: :duplicate, name: MyApp.Events, partitions: 16)

# With 16 partitions and 100k unique event types,
# every dispatch scans 16 ETS tables
Registry.dispatch(MyApp.Events, "order.created", fn entries ->
  for {pid, _} <- entries, do: send(pid, :notify)
end)

Example — after:

# Key partitioning — dispatch hits exactly ONE partition per key
Registry.start_link(
  keys: {:duplicate, :key},
  name: MyApp.Events,
  partitions: System.schedulers_online()
)

# Now dispatch only scans one ETS table — O(1) partitions
Registry.dispatch(MyApp.Events, "order.created", fn entries ->
  for {pid, _} <- entries, do: send(pid, :notify)
end)

When NOT to Use

Don't use this when:

You have few keys with many subscribers (:pid partitioning is better for cleanup)
Process death cleanup is the hot path (:key partitioning must scan all partitions on death)
You're not hitting performance issues with the default (premature optimization)

Over-application example:

# Key partitioning for a "presence" system where processes die frequently
# Each death must scan ALL partitions to unregister
Registry.start_link(
  keys: {:duplicate, :key},
  name: MyApp.Presence,
  partitions: 16
)
# With 50k users connecting/disconnecting per second,
# each disconnect scans 16 partitions — worse than default!

Better alternative:

# Pid partitioning — death cleanup is localized to one partition
Registry.start_link(
  keys: :duplicate,
  name: MyApp.Presence,
  partitions: System.schedulers_online()
)

Why: Partitioning is a tradeoff. :key optimizes dispatch (one partition per lookup) at the cost of death cleanup (scan all). :pid optimizes death cleanup (one partition) at the cost of dispatch (scan all). Pick based on which operation is hotter.

Pattern 18: init/1 Return Values — The Full Spectrum

Source: lib/elixir/lib/gen_server.ex#L498 (init callback spec)

What it does: init/1 supports six return values:

{:ok, state} — normal start
{:ok, state, timeout} — start with idle timeout
{:ok, state, :hibernate} — start and immediately hibernate (GC + compact heap)
{:ok, state, {:continue, arg}} — start then immediately invoke handle_continue
:ignore — don't start, supervisor treats as successful (child can be restarted later)
{:stop, reason} — initialization failed

Why: Each covers a real scenario:

:ignore — process is disabled by configuration but might be enabled later via Supervisor.restart_child/2
{:stop, reason} — unrecoverable initialization failure
:hibernate — process will be idle for a long time, minimize memory
{:continue, _} — split fast init from slow setup

Anti-pattern: Using {:stop, reason} when :ignore is appropriate. If a feature is disabled by config, :ignore keeps the child spec in the supervisor for later activation. {:stop, reason} signals a real failure.
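What the caller of the start function sees for each outcome can be checked with a throwaway module. This is a minimal sketch; InitDemo and its init arguments are hypothetical. GenServer.start/2 is used instead of start_link/2 so the {:stop, reason} case does not also send an exit signal to the calling process.

```elixir
defmodule InitDemo do
  use GenServer

  @impl true
  def init(:ok), do: {:ok, %{}}
  def init(:disabled), do: :ignore
  def init(:broken), do: {:stop, :bad_config}
end

# Normal start: a pid is returned.
{:ok, pid} = GenServer.start(InitDemo, :ok)

# :ignore: no process is running, and no error is reported.
ignored = GenServer.start(InitDemo, :disabled)

# {:stop, reason}: the start function returns an error tuple.
failed = GenServer.start(InitDemo, :broken)

IO.inspect({is_pid(pid), ignored, failed})
# => {true, :ignore, {:error, :bad_config}}
```

A supervisor sees the same return values from the child's start function, which is how it tells a deliberately skipped child from a failed one.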

When to Use

Triggers:

You need to communicate "don't start this child" without the supervisor treating it as failure
A feature is disabled by config but the child spec should remain for hot-enabling
A process discovers during init that it's a duplicate and should yield to the existing one

Example — before:

defmodule MyApp.OptionalFeature do
  use GenServer

  @impl true
  def init(_) do
    if Application.get_env(:my_app, :feature_enabled) do
      {:ok, %{}}
    else
      # {:stop, :disabled} makes start_link return {:error, :disabled}: the supervisor sees a failed start!
      {:stop, :disabled}
    end
  end
end

Example — after:

defmodule MyApp.OptionalFeature do
  use GenServer

  @impl true
  def init(_) do
    if Application.get_env(:my_app, :feature_enabled) do
      {:ok, %{}}
    else
      # :ignore — supervisor is happy, child spec stays for later activation
      :ignore
    end
  end
end

# Later, to enable:
# Update config, then:
# Supervisor.restart_child(MyApp.Supervisor, MyApp.OptionalFeature)

When NOT to Use

Don't use this when:

The failure is real and should count toward restart intensity (use {:stop, reason})
You want the supervisor to NOT have a child spec for this module (just don't add it)
The process should retry on its own later (start it and retry from handle_continue/handle_info, as in Pattern 13)

Over-application example:

# Using :ignore for a real failure — hides the problem
defmodule MyApp.DBConnection do
  use GenServer

  @impl true
  def init(config) do
    case connect(config) do
      {:ok, conn} -> {:ok, conn}
      {:error, _} -> :ignore  # BAD: DB is down but we pretend everything is fine
    end
  end
end

Better alternative:

defmodule MyApp.DBConnection do
  use GenServer

  @impl true
  def init(config) do
    case connect(config) do
      {:ok, conn} -> {:ok, conn}
      {:error, reason} -> {:stop, reason}  # Let supervisor handle the failure
    end
  end
end

Why: :ignore means "this child intentionally should not run right now." {:stop, reason} means "this child tried to start and failed." Conflating the two hides real failures from your supervision tree.

Decision Tree

If you have children known at compile time with ordering dependencies → Pattern 1: Static vs Dynamic Supervision
If a single DynamicSupervisor or Task.Supervisor is a bottleneck under high spawn load → Pattern 2: PartitionSupervisor for Scalability
If you need to decide how a supervisor reacts when children share state or have dependencies → Pattern 3: Supervision Strategies
If you want to tune how many restarts are tolerated before escalation → Pattern 4: Restart Intensity
If different processes have different lifecycle expectations (one-shot vs permanent) → Pattern 5: Restart Values
If a supervisor should self-terminate when its children finish their work → Pattern 6: Automatic Shutdown
If you need to compute values concurrently and the caller should crash on failure → Pattern 7: Task.async/await
If a GenServer needs to spawn work that might fail without taking down the server → Pattern 8: Task.Supervisor.async_nolink
If you need supervised tasks with caller tracking, async_nolink, and streaming → Pattern 9: Task Supervisor
If you need to look up processes by a dynamic key without atom leaks → Pattern 10: Registry
If processes hold external resources that need cleanup on shutdown → Pattern 11: Shutdown Semantics
If you are building a custom supervisor-like process and need efficient child tracking → Pattern 12: DynamicSupervisor Internal State
If a child fails to start due to transient conditions and you want non-blocking retry → Pattern 13: Restart Logic with Backoff
If you need to trace which process initiated spawned work for debugging → Pattern 14: Process Lineage Tracking
If a GenServer needs to do async work before replying to a caller → Pattern 15: GenServer.reply/2
If you build a custom request/response protocol with timeouts and need to prevent late replies → Pattern 16: Process.alias
If your Registry dispatch is slow because of the wrong partitioning strategy → Pattern 17: Registry Partitioning
If you need to communicate "don't start this child" or split init into fast/slow phases → Pattern 18: init/1 Return Values
