docs: add when/when-not to all Kubernetes patterns

2026-04-30 13:41:33 +00:00
parent 039f0ad0fa
commit 5d2e5b43c3
3 changed files with 1480 additions and 4 deletions
@@ -18,7 +18,6 @@ deploymentCopy := deployment.DeepCopy()
 deploymentCopy.Spec.Replicas = ptr.To[int32](3)
 ```

-
 ### When to Apply This Rule

 **Triggers:**
@@ -42,6 +41,12 @@ copy.Spec.Replicas = ptr.To[int32](5)
 client.Update(ctx, copy)
 ```

+### Exceptions
+
+- **Read-only access:** If you only read fields and never modify the object, deep copy is unnecessary overhead
+- **Single-owner data:** If the struct was created locally and isn't shared (e.g., you just built it with `&Deployment{}`), copying your own data is wasteful
+- **Immutable value types:** Primitive fields (string, int, bool) are copied by value in Go — you only need DeepCopy for slices, maps, and pointer fields
+
 **Evidence:** The `runtime.Object` interface *mandates* `DeepCopyObject()`. Every API type has generated deep copy methods. The entire architecture assumes immutable reads.

 ---
@@ -64,6 +69,48 @@ func (q *Typed[T]) Get() (item T, shutdown bool) {
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- Multiple goroutines/workers process items from a shared queue or event stream
+- Processing involves read-modify-write cycles on external state (API server, database)
+- You observe optimistic concurrency conflicts (409 Conflict) or duplicate operations in logs
+
+**Example — detecting the smell:**
+```go
+// Multiple workers, no key-level serialization
+for i := 0; i < 10; i++ {
+    go func() {
+        for key := range eventChannel { // same key can go to any worker
+            obj, _ := client.Get(key)
+            obj.Status.Count++
+            client.Update(obj) // CONFLICT: another worker updated first
+        }
+    }()
+}
+```
+
+**Example — fixed:**
+```go
+// Workqueue guarantees: one worker per key at a time
+queue := workqueue.NewTypedRateLimitingQueue[string](limiter)
+for i := 0; i < 10; i++ {
+    go func() {
+        for {
+            key, _ := queue.Get() // key is exclusively ours until Done()
+            reconcile(key)
+            queue.Done(key)
+        }
+    }()
+}
+```
+
+### Exceptions
+
+- **Read-only operations:** If workers only read (metrics collection, logging), concurrent access to the same key is safe
+- **Truly idempotent writes:** If the write is unconditional (e.g., PUT with a fixed value, not read-modify-write), concurrent processing produces the same result
+- **Sharded ownership:** If each key is deterministically routed to exactly one worker (consistent hashing), the queue's built-in serialization is unnecessary
+
 ---

 ## 3. Never Use Edge-Triggered Logic
@@ -74,7 +121,6 @@ func (q *Typed[T]) Get() (item T, shutdown bool) {

 **The pattern K8s enforces:** Level-triggered reconciliation. The `syncHandler` reads *current state from the cache*, computes *desired state from the spec*, and makes the world match:

-
 ### When to Apply This Rule

 **Triggers:**
@@ -97,6 +143,12 @@ func reconcile(deployment Deployment) {
 }
 ```

+### Exceptions
+
+- **Audit logging / event streams:** Recording "what happened" is inherently edge-triggered — you want to log each event, not reconstruct history from state
+- **Notifications:** "User X just logged in" is an event that triggers a one-time notification — level-triggered makes no sense here
+- **Ordering-sensitive operations:** If the sequence of events matters (transaction log, command queue), level-triggered reconciliation would lose ordering information
+
 ```go
 // The sync function always reads current state, never relies on "what happened"
 func (dc *DeploymentController) syncDeployment(ctx context.Context, key string) error {
@@ -136,6 +188,46 @@ func (dc *DeploymentController) handleErr(ctx context.Context, err error, key st
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- You're using a rate-limited workqueue and items can both succeed and fail
+- You notice items processing slower over time for no apparent reason (accumulated backoff)
+- Your error handling only calls `AddRateLimited` on failure but has no `Forget` on success
+
+**Example — detecting the smell:**
+```go
+func processItem(queue workqueue.RateLimitingInterface, key string) {
+    defer queue.Done(key)
+    err := reconcile(key)
+    if err != nil {
+        queue.AddRateLimited(key) // retry with backoff
+        return
+    }
+    // BUG: no queue.Forget(key)!
+    // Next time this key is processed (even for a new event),
+    // it gets rate-limited delay from the old failure counter
+}
+```
+
+**Example — fixed:**
+```go
+func processItem(queue workqueue.RateLimitingInterface, key string) {
+    defer queue.Done(key)
+    err := reconcile(key)
+    if err != nil {
+        queue.AddRateLimited(key)
+        return
+    }
+    queue.Forget(key) // clear backoff counter on success
+}
+```
+
+### Exceptions
+
+- **You're using a plain (non-rate-limited) queue:** `Forget()` only matters on `RateLimitingInterface`. Plain `Interface` has no rate limiter to clear.
+- **Permanent failures where the key should never be fast-tracked:** If a key represents a permanently broken resource, you might intentionally not `Forget` so that future events for it are naturally throttled (rare — usually you'd remove it from the queue entirely).
+
 ---

 ## 5. Never Hit the API Server in a Tight Loop
@@ -153,6 +245,50 @@ deployment, err := dc.dLister.Deployments(namespace).Get(name)
 _, err = dc.client.AppsV1().Deployments(namespace).Update(ctx, deployment, ...)
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- Your reconcile function makes API calls (Get/List) inside a loop
+- You observe API server throttling (429 responses) or high request latency
+- Multiple controllers in the same process each independently fetch the same resources
+
+**Example — detecting the smell:**
+```go
+func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
+    for i := 0; i < int(*deployment.Spec.Replicas); i++ {
+        // API call inside a loop — O(replicas) calls per sync
+        pod, err := c.client.CoreV1().Pods(ns).Get(ctx, podName(i), metav1.GetOptions{})
+        if errors.IsNotFound(err) {
+            c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{})
+        }
+    }
+    return nil
+}
+```
+
+**Example — fixed:**
+```go
+func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
+    // Single cache read (local, free) — gets ALL pods at once
+    existingPods, _ := c.podLister.Pods(ns).List(selectorForDeployment(deployment))
+    existingSet := make(map[string]bool)
+    for _, p := range existingPods { existingSet[p.Name] = true }
+    
+    for i := 0; i < int(*deployment.Spec.Replicas); i++ {
+        if !existingSet[podName(i)] {
+            c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{})
+        }
+    }
+    return nil
+}
+```
+
+### Exceptions
+
+- **Writes are unavoidable:** You must hit the API server for creates/updates/deletes — the rule is about *reads*, not writes
+- **Cache staleness is unacceptable:** Rare cases where you need a strongly consistent read (e.g., before an irreversible action) justify a direct Get
+- **One-shot tools:** CLI commands or migration scripts don't benefit from informer caches (they exit after one operation)
+
 ---

 ## 6. Never Sync Before Caches Are Warm
@@ -170,6 +306,46 @@ if !cache.WaitForNamedCacheSyncWithContext(ctx,
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- Your controller uses informer caches (Listers) to determine what actions to take
+- You observe a burst of spurious creates/deletes immediately after startup
+- Your reconcile logic compares "desired count vs actual count" using cached data
+
+**Example — detecting the smell:**
+```go
+func (c *Controller) Run(ctx context.Context) {
+    // BUG: starts workers immediately — cache might be empty
+    for i := 0; i < workers; i++ {
+        go c.worker(ctx)
+    }
+    // Informers haven't finished initial List yet
+    // Worker reads cache → sees 0 pods → creates all replicas again
+}
+```
+
+**Example — fixed:**
+```go
+func (c *Controller) Run(ctx context.Context) {
+    // Gate: wait until all caches have completed initial List
+    if !cache.WaitForCacheSync(ctx.Done(), c.podsSynced, c.rsSynced) {
+        runtime.HandleError(fmt.Errorf("caches never synced"))
+        return
+    }
+    // Now safe to start workers — cache reflects reality
+    for i := 0; i < workers; i++ {
+        go c.worker(ctx)
+    }
+}
+```
+
+### Exceptions
+
+- **Read-only controllers:** If your controller only observes and reports (metrics, logging), processing with a partial cache produces incomplete but not harmful results
+- **Controllers that check existence before acting:** If your reconcile logic does a direct API Get (not cache read) before creating, the cache warmth gate is less critical (but still good practice)
+- **Fast-starting controllers with explicit "not found" handling:** If your reconcile function handles "object not in cache" gracefully (e.g., requeues the key), it may tolerate starting before full sync
+
 ---

 ## 7. Never Ignore Tombstones in Delete Handlers
@@ -195,6 +371,51 @@ func (dc *DeploymentController) deleteDeployment(logger klog.Logger, obj interfa
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- You're implementing a `DeleteFunc` event handler for an informer
+- Your delete handler type-asserts directly to the concrete type without a fallback
+- You observe "leaked" resources that should have been cleaned up after their parent was deleted
+
+**Example — detecting the smell:**
+```go
+informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+    DeleteFunc: func(obj interface{}) {
+        pod := obj.(*v1.Pod) // PANIC during watch reconnection
+        cleanupPod(pod)
+    },
+})
+```
+
+**Example — fixed:**
+```go
+informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+    DeleteFunc: func(obj interface{}) {
+        pod, ok := obj.(*v1.Pod)
+        if !ok {
+            tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
+            if !ok {
+                runtime.HandleError(fmt.Errorf("unexpected type %T", obj))
+                return
+            }
+            pod, ok = tombstone.Obj.(*v1.Pod)
+            if !ok {
+                runtime.HandleError(fmt.Errorf("tombstone contained %T", tombstone.Obj))
+                return
+            }
+        }
+        cleanupPod(pod)
+    },
+})
+```
+
+### Exceptions
+
+- **Delete handler only extracts the key:** If your handler only needs `namespace/name` to enqueue work, you can use `cache.MetaNamespaceKeyFunc(obj)` which handles tombstones internally
+- **Non-informer event sources:** If your events don't come from client-go informers (e.g., custom message queues), tombstones don't apply
+- **Level-triggered reconcilers that don't use delete handlers:** If your reconcile loop discovers deletions via "not found" from the lister, the delete handler is optional anyway
+
 ---

 ## 8. Never Use ResourceVersion for Equality
@@ -216,6 +437,39 @@ func (dc *DeploymentController) updateReplicaSet(logger klog.Logger, old, cur in
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- You're comparing ResourceVersions with `<`, `>`, or numeric parsing
+- You're storing ResourceVersion as an integer for ordering or pagination
+- You're using ResourceVersion to determine "which version is newer"
+
+**Example — detecting the smell:**
+```go
+// Treating ResourceVersion as a number — WRONG
+rv1, _ := strconv.Atoi(obj1.ResourceVersion)
+rv2, _ := strconv.Atoi(obj2.ResourceVersion)
+if rv2 > rv1 {
+    // "obj2 is newer" — this assumption may break in future implementations
+}
+```
+
+**Example — fixed:**
+```go
+// Only valid comparison: equality (to detect resync vs real update)
+if cur.ResourceVersion == old.ResourceVersion {
+    return // no change — this is a periodic resync event
+}
+// Don't compare ordering — just process the update
+processUpdate(cur)
+```
+
+### Exceptions
+
+- **The `== ` check for resync detection:** Comparing ResourceVersion for equality (`==`) to detect "no change" is valid and explicitly how Kubernetes uses it
+- **Watch continuity:** Passing ResourceVersion to Watch/List for "resume from" is the intended API — you're not comparing, you're providing a cursor
+- **Internal etcd tooling:** If you're operating directly on etcd (not through the Kubernetes API), mod_revision semantics are documented there
+
 ---

 ## 9. Never Panic in Production Goroutines (Without Recovery)
@@ -241,6 +495,41 @@ func BackoffUntilWithContext(ctx context.Context, f func(ctx context.Context), .
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- You're launching goroutines without `defer recover()` or `defer HandleCrash()`
+- Your process hosts multiple independent subsystems (any one panicking kills all)
+- You observe random process deaths in production with nil pointer or index-out-of-range panics
+
+**Example — detecting the smell:**
+```go
+// Naked goroutine — one panic kills the whole process
+func (c *Controller) Run(ctx context.Context) {
+    for i := 0; i < 5; i++ {
+        go c.worker(ctx) // no recovery — nil pointer in worker kills everything
+    }
+}
+```
+
+**Example — fixed:**
+```go
+func (c *Controller) Run(ctx context.Context) {
+    for i := 0; i < 5; i++ {
+        go func() {
+            defer utilruntime.HandleCrash()
+            c.worker(ctx)
+        }()
+    }
+}
+```
+
+### Exceptions
+
+- **Main goroutine of a CLI tool:** Let it panic — the stack trace IS the error report. Recovery here just hides bugs.
+- **Panics that indicate programming errors you WANT to crash on:** `panic("unreachable")` or assertion failures during development should crash to catch bugs early.
+- **Test code:** Tests should panic/fail loudly. `HandleCrash` in tests would swallow test failures.
+
 ---

 ## 10. Never Block Workers Indefinitely
@@ -257,6 +546,52 @@ defer cancel()
 _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- Your sync handler calls external services without a timeout
+- Workers are a fixed pool and you observe queue depth growing while workers appear idle
+- Your code has `select {}`, unbounded channel reads, or mutex waits without deadline
+
+**Example — detecting the smell:**
+```go
+func (c *Controller) syncItem(ctx context.Context, key string) error {
+    // Calls external service with no timeout — blocks indefinitely if service is down
+    resp, err := http.Get("http://external-service/api/" + key)
+    if err != nil { return err }
+    
+    // Waits on a channel that might never close
+    result := <-c.resultChan // blocks forever if producer crashes
+    return c.updateStatus(ctx, key, result)
+}
+```
+
+**Example — fixed:**
+```go
+func (c *Controller) syncItem(ctx context.Context, key string) error {
+    // Bounded timeout on external call
+    reqCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
+    defer cancel()
+    req, _ := http.NewRequestWithContext(reqCtx, "GET", "http://external-service/api/"+key, nil)
+    resp, err := http.DefaultClient.Do(req)
+    if err != nil { return err } // timeout returns error, worker is freed
+    
+    // Bounded wait on channel
+    select {
+    case result := <-c.resultChan:
+        return c.updateStatus(ctx, key, result)
+    case <-ctx.Done():
+        return ctx.Err() // worker freed, key requeued
+    }
+}
+```
+
+### Exceptions
+
+- **The worker loop itself:** The top-level `queue.Get()` blocks intentionally when the queue is empty — this is by design (workers sleep until there's work)
+- **Graceful shutdown drains:** During shutdown, waiting for in-flight work to complete is acceptable (with a bounded overall shutdown timeout)
+- **Leader election acquire:** Blocking until leadership is acquired is intentional — non-leaders should idle
+
 ---

 ## 11. Never Use sync.Mutex Where sync.Once Suffices
@@ -284,6 +619,55 @@ func (m *BaseControllerRefManager) CanAdopt(ctx context.Context) error {
 }
 ```

+### When to Apply This Rule
+
+**Triggers:**
+- You see a `sync.Mutex` protecting a `bool` flag that tracks "has this been initialized?"
+- The protected code runs exactly once and caches the result
+- You see patterns like `mu.Lock(); if !done { doThing(); done = true }; mu.Unlock()`
+
+**Example — detecting the smell:**
+```go
+type Manager struct {
+    mu          sync.Mutex
+    initialized bool
+    config      *Config
+}
+
+func (m *Manager) GetConfig() *Config {
+    m.mu.Lock()
+    defer m.mu.Unlock()
+    if !m.initialized {
+        m.config = loadConfig()
+        m.initialized = true
+    }
+    return m.config
+}
+// Bug-prone: what if someone adds code that sets initialized=false?
+```
+
+**Example — fixed:**
+```go
+type Manager struct {
+    initOnce sync.Once
+    config   *Config
+}
+
+func (m *Manager) GetConfig() *Config {
+    m.initOnce.Do(func() {
+        m.config = loadConfig()
+    })
+    return m.config
+    // Impossible to accidentally re-initialize
+}
+```
+
+### Exceptions
+
+- **Resettable state:** If the "one-time" operation might need to run again (e.g., reconnection after disconnect), `sync.Once` doesn't support reset — use a mutex with a flag
+- **Multiple initialization phases:** If initialization has multiple stages that can partially fail and retry, `sync.Once` is too coarse (it only fires once, even on error)
+- **Pre-Go 1.21 error handling:** Before Go 1.21's `sync.OnceValue`/`sync.OnceFunc`, handling errors from `Do` required workarounds. Now `sync.OnceValue` cleanly supports fallible initialization.
+
 ---

 ## 12. Never Expose Mutable State Through Interfaces
@@ -294,6 +678,48 @@ func (m *BaseControllerRefManager) CanAdopt(ctx context.Context) error {

 **The pattern K8s enforces:** Listers return objects from the read-only cache. The `DeepCopy()` pattern ensures mutation safety is the caller's responsibility, not the cache's.

+### When to Apply This Rule
+
+**Triggers:**
+- Your getter/accessor returns a pointer to a field that's part of your struct's internal state
+- Multiple goroutines call the accessor, and at least one caller might modify the result
+- You observe race conditions in `go test -race` pointing to internal fields accessed from outside
+
+**Example — detecting the smell:**
+```go
+type Registry struct {
+    mu    sync.RWMutex
+    items map[string]*Item
+}
+
+func (r *Registry) Get(key string) *Item {
+    r.mu.RLock()
+    defer r.mu.RUnlock()
+    return r.items[key] // returns internal pointer — caller can mutate!
+}
+// Caller does: item := registry.Get("foo"); item.Name = "bar"
+// Now every other caller sees the mutated name — data race
+```
+
+**Example — fixed:**
+```go
+func (r *Registry) Get(key string) *Item {
+    r.mu.RLock()
+    defer r.mu.RUnlock()
+    if item, ok := r.items[key]; ok {
+        return item.DeepCopy() // caller gets their own copy
+    }
+    return nil
+}
+// Caller mutates their copy — registry is unaffected
+```
+
+### Exceptions
+
+- **Single-threaded access:** If your struct is only accessed from one goroutine (no concurrency), returning pointers to internals is safe and avoids copy overhead
+- **Intentionally shared mutable state:** Some designs explicitly want callers to see mutations (e.g., shared counters, observable state). Document this clearly.
+- **Performance-critical read paths:** If deep-copying on every access is too expensive and callers are trusted to not mutate, return a pointer with clear documentation: `// WARNING: do not mutate returned value`
+
 ---

 ## Summary: The Philosophy