docs: add when/when-not to all Kubernetes patterns
+428
-2
@@ -18,7 +18,6 @@ deploymentCopy := deployment.DeepCopy()
deploymentCopy.Spec.Replicas = ptr.To[int32](3)
```

### When to Apply This Rule

**Triggers:**
@@ -42,6 +41,12 @@ copy.Spec.Replicas = ptr.To[int32](5)
client.Update(ctx, copy)
```

### Exceptions

- **Read-only access:** If you only read fields and never modify the object, a deep copy is unnecessary overhead
- **Single-owner data:** If the struct was created locally and isn't shared (e.g., you just built it with `&Deployment{}`), copying your own data is wasteful
- **Value-type fields after a shallow copy:** Primitive fields (string, int, bool) are copied by value when the struct itself is copied (`c := *obj`), so a shallow copy is enough if you only change those. Writing through the original pointer still mutates the shared object, and slices, maps, and pointer fields stay aliased until you call `DeepCopy()`

**Evidence:** The `runtime.Object` interface *mandates* `DeepCopyObject()`. Every API type has generated deep copy methods. The entire architecture assumes that objects read from shared caches are never mutated in place.
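
The difference between a shallow and a deep copy can be seen in a small stdlib-only sketch. `Spec` and its hand-written `DeepCopy` are hypothetical stand-ins, not the generated Kubernetes types; the point is only that struct assignment copies value fields while the map stays shared.

```go
package main

import "fmt"

// Spec is a hypothetical stand-in for an API object's spec.
type Spec struct {
	Replicas int               // value type: copied by plain assignment
	Labels   map[string]string // reference type: shared after a shallow copy
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (s Spec) DeepCopy() Spec {
	out := Spec{Replicas: s.Replicas}
	if s.Labels != nil {
		out.Labels = make(map[string]string, len(s.Labels))
		for k, v := range s.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

func main() {
	cached := Spec{Replicas: 1, Labels: map[string]string{"app": "web"}}

	shallow := cached // copies Replicas, aliases Labels
	shallow.Replicas = 3
	shallow.Labels["app"] = "mutated"
	fmt.Println(cached.Replicas, cached.Labels["app"]) // 1 mutated

	cached.Labels["app"] = "web" // reset the shared map
	deep := cached.DeepCopy()
	deep.Labels["app"] = "changed"
	fmt.Println(cached.Labels["app"]) // web
}
```

The first line of output shows the leak through the aliased map; the second shows that the deep copy is isolated.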

---

@@ -64,6 +69,48 @@ func (q *Typed[T]) Get() (item T, shutdown bool) {
}
```

### When to Apply This Rule

**Triggers:**
- Multiple goroutines/workers process items from a shared queue or event stream
- Processing involves read-modify-write cycles on external state (API server, database)
- You observe optimistic concurrency conflicts (409 Conflict) or duplicate operations in logs

**Example (detecting the smell):**
```go
// Multiple workers, no key-level serialization
for i := 0; i < 10; i++ {
	go func() {
		for key := range eventChannel { // same key can go to any worker
			obj, _ := client.Get(key)
			obj.Status.Count++
			client.Update(obj) // CONFLICT if another worker updated first
		}
	}()
}
```

**Example (fixed):**
```go
// Workqueue guarantees: one worker per key at a time
queue := workqueue.NewTypedRateLimitingQueue[string](limiter)
for i := 0; i < 10; i++ {
	go func() {
		for {
			key, shutdown := queue.Get() // key is exclusively ours until Done()
			if shutdown {
				return // queue was shut down; stop this worker
			}
			reconcile(key)
			queue.Done(key)
		}
	}()
}
```

### Exceptions

- **Read-only operations:** If workers only read (metrics collection, logging), concurrent access to the same key is safe
- **Truly idempotent writes:** If the write is unconditional (e.g., PUT with a fixed value, not read-modify-write), concurrent processing produces the same result
- **Sharded ownership:** If each key is deterministically routed to exactly one worker (consistent hashing), the queue's built-in serialization is unnecessary
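
The sharded-ownership exception can be sketched with nothing but a hash. `shardFor` is a hypothetical helper, and it uses plain modulo hashing rather than true consistent hashing, so changing the worker count reshuffles most keys; it only illustrates that a fixed key always lands on the same worker.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of n workers (n must be > 0). The same key
// always maps to the same worker, so no two workers handle one key at once.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash.Write never returns an error
	return int(h.Sum32() % uint32(n))
}

func main() {
	const workers = 4
	for _, key := range []string{"default/web", "default/db", "default/web"} {
		fmt.Printf("%s -> worker %d\n", key, shardFor(key, workers))
	}
}
```

Each worker would then read only from its own channel or queue, which is what makes the workqueue's per-key serialization redundant in this design.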

---

## 3. Never Use Edge-Triggered Logic

@@ -74,7 +121,6 @@ func (q *Typed[T]) Get() (item T, shutdown bool) {

**The pattern K8s enforces:** Level-triggered reconciliation. The `syncHandler` reads *current state from the cache*, computes *desired state from the spec*, and makes the world match:

### When to Apply This Rule

**Triggers:**
@@ -97,6 +143,12 @@ func reconcile(deployment Deployment) {
}
```

### Exceptions

- **Audit logging / event streams:** Recording "what happened" is inherently edge-triggered: you want to log each event, not reconstruct history from state
- **Notifications:** "User X just logged in" is an event that triggers a one-time notification, so level-triggered logic makes no sense here
- **Ordering-sensitive operations:** If the sequence of events matters (transaction log, command queue), level-triggered reconciliation would lose ordering information
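
For cases outside these exceptions, a level-triggered handler can be sketched as a pure function from (desired, observed) to actions. The name `plan` and the string pod names are assumptions for illustration, not part of any Kubernetes API.

```go
package main

import "fmt"

// plan compares desired and observed state and returns the actions needed to
// converge. It never looks at which event triggered the call, so running it
// again after a missed or duplicated event still gives the right answer.
func plan(desired int, observed []string) (toCreate int, toDelete []string) {
	if desired < 0 {
		desired = 0
	}
	if len(observed) < desired {
		return desired - len(observed), nil
	}
	return 0, observed[desired:]
}

func main() {
	toCreate, toDelete := plan(3, []string{"pod-a"})
	fmt.Println(toCreate, toDelete) // 2 []

	toCreate, toDelete = plan(1, []string{"pod-a", "pod-b", "pod-c"})
	fmt.Println(toCreate, toDelete) // 0 [pod-b pod-c]
}
```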

```go
// The sync function always reads current state, never relies on "what happened"
func (dc *DeploymentController) syncDeployment(ctx context.Context, key string) error {
@@ -136,6 +188,46 @@ func (dc *DeploymentController) handleErr(ctx context.Context, err error, key st
}
```

### When to Apply This Rule

**Triggers:**
- You're using a rate-limited workqueue and items can both succeed and fail
- You notice items processing slower over time for no apparent reason (accumulated backoff)
- Your error handling only calls `AddRateLimited` on failure but has no `Forget` on success

**Example (detecting the smell):**
```go
func processItem(queue workqueue.RateLimitingInterface, key string) {
	defer queue.Done(key)
	err := reconcile(key)
	if err != nil {
		queue.AddRateLimited(key) // retry with backoff
		return
	}
	// BUG: no queue.Forget(key)!
	// The failure counter for this key is never reset, so the next
	// AddRateLimited(key), even for an unrelated later failure,
	// resumes backoff from the old count.
}
```

**Example (fixed):**
```go
func processItem(queue workqueue.RateLimitingInterface, key string) {
	defer queue.Done(key)
	err := reconcile(key)
	if err != nil {
		queue.AddRateLimited(key)
		return
	}
	queue.Forget(key) // clear backoff counter on success
}
```

### Exceptions

- **You're using a plain (non-rate-limited) queue:** `Forget()` only matters on `RateLimitingInterface`. Plain `Interface` has no rate limiter to clear.
- **Permanent failures where the key should never be fast-tracked:** If a key represents a permanently broken resource, you might intentionally not `Forget` so that future events for it are naturally throttled (rare; usually you'd remove it from the queue entirely).
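
Why a missing `Forget` slows a key down over time is easier to see with a toy limiter. The `backoff` type below is a hypothetical, simplified per-key exponential limiter written for this sketch; it is not the client-go implementation and omits the maximum-delay cap a real limiter would have.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// backoff is a hypothetical per-key exponential failure limiter.
type backoff struct {
	mu       sync.Mutex
	base     time.Duration
	failures map[string]int
}

func newBackoff(base time.Duration) *backoff {
	return &backoff{base: base, failures: make(map[string]int)}
}

// When returns the delay for the next retry of key and records one failure.
func (b *backoff) When(key string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := b.failures[key]
	b.failures[key] = n + 1
	return b.base << n // base * 2^n; a real limiter would cap this
}

// Forget resets the failure count for key.
func (b *backoff) Forget(key string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, key)
}

func main() {
	b := newBackoff(10 * time.Millisecond)
	fmt.Println(b.When("k"), b.When("k"), b.When("k")) // 10ms 20ms 40ms

	b.Forget("k") // what a successful sync should do
	fmt.Println(b.When("k")) // 10ms
}
```

Without the `Forget` call, the fourth retry would wait 80ms even if the key had synced successfully in between.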

---

## 5. Never Hit the API Server in a Tight Loop

@@ -153,6 +245,50 @@ deployment, err := dc.dLister.Deployments(namespace).Get(name)
_, err = dc.client.AppsV1().Deployments(namespace).Update(ctx, deployment, ...)
```

### When to Apply This Rule

**Triggers:**
- Your reconcile function makes API calls (Get/List) inside a loop
- You observe API server throttling (429 responses) or high request latency
- Multiple controllers in the same process each independently fetch the same resources

**Example (detecting the smell):**
```go
func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
	for i := 0; i < int(*deployment.Spec.Replicas); i++ {
		// API call inside a loop: O(replicas) calls per sync
		_, err := c.client.CoreV1().Pods(ns).Get(ctx, podName(i), metav1.GetOptions{})
		if errors.IsNotFound(err) {
			c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{})
		}
	}
	return nil
}
```

**Example (fixed):**
```go
func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
	// Single read from the local informer cache (no API round trip) gets ALL pods at once
	existingPods, err := c.podLister.Pods(ns).List(selectorForDeployment(deployment))
	if err != nil {
		return err
	}
	existingSet := make(map[string]bool, len(existingPods))
	for _, p := range existingPods {
		existingSet[p.Name] = true
	}

	for i := 0; i < int(*deployment.Spec.Replicas); i++ {
		if !existingSet[podName(i)] {
			if _, err := c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```

### Exceptions

- **Writes are unavoidable:** You must hit the API server for creates/updates/deletes; the rule is about *reads*, not writes
- **Cache staleness is unacceptable:** Rare cases where you need a strongly consistent read (e.g., before an irreversible action) justify a direct Get
- **One-shot tools:** CLI commands or migration scripts don't benefit from informer caches (they exit after one operation)

---

## 6. Never Sync Before Caches Are Warm

@@ -170,6 +306,46 @@ if !cache.WaitForNamedCacheSyncWithContext(ctx,
}
```

### When to Apply This Rule

**Triggers:**
- Your controller uses informer caches (Listers) to determine what actions to take
- You observe a burst of spurious creates/deletes immediately after startup
- Your reconcile logic compares "desired count vs actual count" using cached data

**Example (detecting the smell):**
```go
func (c *Controller) Run(ctx context.Context) {
	// BUG: starts workers immediately; cache might be empty
	for i := 0; i < workers; i++ {
		go c.worker(ctx)
	}
	// Informers haven't finished initial List yet
	// Worker reads cache → sees 0 pods → creates all replicas again
}
```

**Example (fixed):**
```go
func (c *Controller) Run(ctx context.Context) {
	// Gate: wait until all caches have completed initial List
	if !cache.WaitForCacheSync(ctx.Done(), c.podsSynced, c.rsSynced) {
		// utilruntime is k8s.io/apimachinery/pkg/util/runtime
		utilruntime.HandleError(fmt.Errorf("caches never synced"))
		return
	}
	// Now safe to start workers: the cache reflects at least the initial List
	for i := 0; i < workers; i++ {
		go c.worker(ctx)
	}
}
```

### Exceptions

- **Read-only controllers:** If your controller only observes and reports (metrics, logging), processing with a partial cache produces incomplete but not harmful results
- **Controllers that check existence before acting:** If your reconcile logic does a direct API Get (not cache read) before creating, the cache warmth gate is less critical (but still good practice)
- **Fast-starting controllers with explicit "not found" handling:** If your reconcile function handles "object not in cache" gracefully (e.g., requeues the key), it may tolerate starting before full sync
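
The gate itself is a small polling loop. The sketch below is a stdlib-only approximation of the idea, not the client-go function; `waitForSync`, its polling interval, and the simulated `ready` flag are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// waitForSync polls the checks until all return true or done is closed.
// It reports whether every check passed.
func waitForSync(done <-chan struct{}, interval time.Duration, checks ...func() bool) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		allSynced := true
		for _, synced := range checks {
			if !synced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-done:
			return false // shutting down before the caches filled
		case <-ticker.C:
		}
	}
}

func main() {
	var ready atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated initial List
		ready.Store(true)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if !waitForSync(ctx.Done(), 5*time.Millisecond, ready.Load) {
		fmt.Println("caches never synced")
		return
	}
	fmt.Println("caches synced, starting workers")
}
```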

---

## 7. Never Ignore Tombstones in Delete Handlers

@@ -195,6 +371,51 @@ func (dc *DeploymentController) deleteDeployment(logger klog.Logger, obj interfa
}
```

### When to Apply This Rule

**Triggers:**
- You're implementing a `DeleteFunc` event handler for an informer
- Your delete handler type-asserts directly to the concrete type without a fallback
- You observe "leaked" resources that should have been cleaned up after their parent was deleted

**Example (detecting the smell):**
```go
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) {
		// PANICS when obj is a cache.DeletedFinalStateUnknown tombstone,
		// e.g. after a delete was missed during a watch reconnection
		pod := obj.(*v1.Pod)
		cleanupPod(pod)
	},
})
```

**Example (fixed):**
```go
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) {
		pod, ok := obj.(*v1.Pod)
		if !ok {
			tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
			if !ok {
				utilruntime.HandleError(fmt.Errorf("unexpected type %T", obj))
				return
			}
			pod, ok = tombstone.Obj.(*v1.Pod)
			if !ok {
				utilruntime.HandleError(fmt.Errorf("tombstone contained %T", tombstone.Obj))
				return
			}
		}
		cleanupPod(pod)
	},
})
```

### Exceptions

- **Delete handler only extracts the key:** If your handler only needs `namespace/name` to enqueue work, use `cache.DeletionHandlingMetaNamespaceKeyFunc(obj)`, which unwraps tombstones internally (plain `cache.MetaNamespaceKeyFunc` does not)
- **Non-informer event sources:** If your events don't come from client-go informers (e.g., custom message queues), tombstones don't apply
- **Level-triggered reconcilers that don't use delete handlers:** If your reconcile loop discovers deletions via "not found" from the lister, the delete handler is optional anyway
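
When several delete handlers need the same fallback, the unwrapping can be factored into one generic helper. This is a sketch under assumptions: `Pod` and `deletedFinalStateUnknown` are local stand-ins for `v1.Pod` and `cache.DeletedFinalStateUnknown` so that it runs without client-go, and `unwrap` is a hypothetical name.

```go
package main

import "fmt"

// Pod and deletedFinalStateUnknown are hypothetical stand-ins for the
// client-go types, defined here so the sketch is self-contained.
type Pod struct{ Name string }

type deletedFinalStateUnknown struct {
	Key string
	Obj any
}

// unwrap returns the object of type T carried by obj, whether it arrives
// directly or wrapped in a tombstone. ok is false for anything else.
func unwrap[T any](obj any) (T, bool) {
	if v, ok := obj.(T); ok {
		return v, true
	}
	if ts, ok := obj.(deletedFinalStateUnknown); ok {
		v, ok := ts.Obj.(T)
		return v, ok
	}
	var zero T
	return zero, false
}

func main() {
	events := []any{
		&Pod{Name: "direct"},
		deletedFinalStateUnknown{Key: "default/missed", Obj: &Pod{Name: "missed"}},
		"not a pod",
	}
	for _, obj := range events {
		if pod, ok := unwrap[*Pod](obj); ok {
			fmt.Println("cleanup", pod.Name)
		} else {
			fmt.Printf("unexpected type %T\n", obj)
		}
	}
}
```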

---

## 8. Never Use ResourceVersion for Anything but Equality

@@ -216,6 +437,39 @@ func (dc *DeploymentController) updateReplicaSet(logger klog.Logger, old, cur in
}
```

### When to Apply This Rule

**Triggers:**
- You're comparing ResourceVersions with `<`, `>`, or numeric parsing
- You're storing ResourceVersion as an integer for ordering or pagination
- You're using ResourceVersion to determine "which version is newer"

**Example (detecting the smell):**
```go
// Treating ResourceVersion as a number: WRONG
rv1, _ := strconv.Atoi(obj1.ResourceVersion)
rv2, _ := strconv.Atoi(obj2.ResourceVersion)
if rv2 > rv1 {
	// "obj2 is newer": this assumption may break in future implementations
}
```

**Example (fixed):**
```go
// Only valid comparison: equality (to detect resync vs real update)
if cur.ResourceVersion == old.ResourceVersion {
	return // no change: this is a periodic resync event
}
// Don't compare ordering; just process the update
processUpdate(cur)
```

### Exceptions

- **The `==` check for resync detection:** Comparing ResourceVersion for equality (`==`) to detect "no change" is valid and explicitly how Kubernetes uses it
- **Watch continuity:** Passing ResourceVersion to Watch/List for "resume from" is the intended API: you're not comparing, you're providing a cursor
- **Internal etcd tooling:** If you're operating directly on etcd (not through the Kubernetes API), mod_revision semantics are documented there

---

## 9. Never Panic in Production Goroutines (Without Recovery)

@@ -241,6 +495,41 @@ func BackoffUntilWithContext(ctx context.Context, f func(ctx context.Context), .
}
```

### When to Apply This Rule

**Triggers:**
- You're launching goroutines without a deferred function that calls `recover()` or without `defer HandleCrash()` (a bare `defer recover()` does not stop a panic)
- Your process hosts multiple independent subsystems (any one panicking kills all)
- You observe random process deaths in production with nil pointer or index-out-of-range panics

**Example (detecting the smell):**
```go
// Naked goroutine: one panic kills the whole process
func (c *Controller) Run(ctx context.Context) {
	for i := 0; i < 5; i++ {
		go c.worker(ctx) // no recovery: nil pointer in worker kills everything
	}
}
```

**Example (fixed):**
```go
func (c *Controller) Run(ctx context.Context) {
	for i := 0; i < 5; i++ {
		go func() {
			// Logs the panic and runs the registered panic handlers. With the default
			// utilruntime.ReallyCrash = true it then re-panics; set it to false
			// if the process should keep running.
			defer utilruntime.HandleCrash()
			c.worker(ctx)
		}()
	}
}
```

### Exceptions

- **Main goroutine of a CLI tool:** Let it panic; the stack trace IS the error report. Recovery here just hides bugs.
- **Panics that indicate programming errors you WANT to crash on:** `panic("unreachable")` or assertion failures during development should crash to catch bugs early.
- **Test code:** Tests should panic/fail loudly. Recovering panics in test code can swallow test failures.
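
The recovery mechanics can be shown without any Kubernetes dependency. `safeGo` below is a hypothetical helper, not a client-go function; it demonstrates that `recover` has to be called directly inside a deferred function, and that the process survives a panicking worker.

```go
package main

import "fmt"

// safeGo runs fn in a new goroutine and turns a panic into a call to onPanic
// instead of a process crash. The returned channel is closed when fn is done.
func safeGo(fn func(), onPanic func(recovered any)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer func() {
			// recover only works when called directly by a deferred function;
			// a bare `defer recover()` would not stop the panic.
			if r := recover(); r != nil {
				onPanic(r)
			}
		}()
		fn()
	}()
	return done
}

func main() {
	done := safeGo(func() {
		var m map[string]int
		m["x"] = 1 // panics: assignment to entry in nil map
	}, func(r any) {
		fmt.Println("worker panicked:", r)
	})
	<-done
	fmt.Println("process still running")
}
```

In a real controller the `onPanic` callback would typically log and increment a metric so the failure stays visible.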

---

## 10. Never Block Workers Indefinitely

@@ -257,6 +546,52 @@ defer cancel()
_, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
```

### When to Apply This Rule

**Triggers:**
- Your sync handler calls external services without a timeout
- Workers are a fixed pool and you observe queue depth growing while workers appear idle
- Your code has `select {}`, unbounded channel reads, or mutex waits without deadline

**Example (detecting the smell):**
```go
func (c *Controller) syncItem(ctx context.Context, key string) error {
	// Calls external service with no timeout: blocks indefinitely if the service hangs
	resp, err := http.Get("http://external-service/api/" + key)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Waits on a channel that might never deliver
	result := <-c.resultChan // blocks forever if producer crashes
	return c.updateStatus(ctx, key, result)
}
```

**Example (fixed):**
```go
func (c *Controller) syncItem(ctx context.Context, key string) error {
	// Bounded timeout covering both the external call and the channel wait
	reqCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(reqCtx, http.MethodGet, "http://external-service/api/"+key, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // timeout returns an error, worker is freed
	}
	defer resp.Body.Close()

	// Bounded wait on channel
	select {
	case result := <-c.resultChan:
		return c.updateStatus(ctx, key, result)
	case <-reqCtx.Done():
		return reqCtx.Err() // worker freed; the caller can requeue the key
	}
}
```

### Exceptions

- **The worker loop itself:** The top-level `queue.Get()` blocks intentionally when the queue is empty; this is by design (workers sleep until there's work)
- **Graceful shutdown drains:** During shutdown, waiting for in-flight work to complete is acceptable (with a bounded overall shutdown timeout)
- **Leader election acquire:** Blocking until leadership is acquired is intentional, since non-leaders should idle
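
Outside those exceptions, any channel read in a worker can be given a deadline with a small helper. `waitResult`, `errTimeout`, and `errClosed` are hypothetical names for this sketch; the technique is simply a `select` between the channel and a timer.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errTimeout = errors.New("timed out waiting for result")
	errClosed  = errors.New("result channel closed")
)

// waitResult returns the next value from ch, or an error if none arrives
// within d. The calling worker is therefore blocked for at most d.
func waitResult[T any](ch <-chan T, d time.Duration) (T, error) {
	var zero T
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case v, ok := <-ch:
		if !ok {
			return zero, errClosed // producer went away
		}
		return v, nil
	case <-timer.C:
		return zero, errTimeout
	}
}

func main() {
	ch := make(chan string) // nobody ever sends on this channel
	if _, err := waitResult[string](ch, 50*time.Millisecond); err != nil {
		fmt.Println("giving up:", err)
	}
}
```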

---

## 11. Never Use sync.Mutex Where sync.Once Suffices

@@ -284,6 +619,55 @@ func (m *BaseControllerRefManager) CanAdopt(ctx context.Context) error {
}
```

### When to Apply This Rule

**Triggers:**
- You see a `sync.Mutex` protecting a `bool` flag that tracks "has this been initialized?"
- The protected code runs exactly once and caches the result
- You see patterns like `mu.Lock(); if !done { doThing(); done = true }; mu.Unlock()`

**Example (detecting the smell):**
```go
type Manager struct {
	mu          sync.Mutex
	initialized bool
	config      *Config
}

func (m *Manager) GetConfig() *Config {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.initialized {
		m.config = loadConfig()
		m.initialized = true
	}
	return m.config
}
// Bug-prone: what if someone adds code that sets initialized=false?
```

**Example (fixed):**
```go
type Manager struct {
	initOnce sync.Once
	config   *Config
}

func (m *Manager) GetConfig() *Config {
	// Impossible to accidentally re-initialize: Do runs the function at most once
	m.initOnce.Do(func() {
		m.config = loadConfig()
	})
	return m.config
}
```

### Exceptions

- **Resettable state:** If the "one-time" operation might need to run again (e.g., reconnection after disconnect), `sync.Once` doesn't support reset, so use a mutex with a flag
- **Multiple initialization phases:** If initialization has multiple stages that can partially fail and retry, `sync.Once` is too coarse (it only fires once, even on error)
- **Pre-Go 1.21 error handling:** `sync.Once.Do` takes a `func()` with no return value, so before Go 1.21 capturing an initialization error meant storing it in a field. Go 1.21 added `sync.OnceValues`, which returns a value and an error from a one-time call. It caches the error as well and never retries, so it does not help when a failed initialization must be attempted again.
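
The caching behaviour of `sync.OnceValues` is worth seeing once, because it is easy to assume a failed call will be retried. The sketch requires Go 1.21 or later; `Config`, `loadConfig`, and the `loadCalls` counter are hypothetical and exist only to make the call count observable.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Config struct{ Addr string }

var loadCalls int // counts how often the loader actually ran

// loadConfig is a hypothetical loader that always fails.
func loadConfig() (*Config, error) {
	loadCalls++
	return nil, errors.New("config file missing")
}

// getConfig runs loadConfig at most once and caches both return values,
// including the error.
var getConfig = sync.OnceValues(loadConfig)

func main() {
	_, err1 := getConfig()
	_, err2 := getConfig()
	fmt.Println(err1)
	fmt.Println(err2)
	fmt.Println("loader ran", loadCalls, "time(s)")
}
```

Both calls report the same error and the loader runs once, which is why retryable initialization still needs a mutex and a flag.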

---

## 12. Never Expose Mutable State Through Interfaces

@@ -294,6 +678,48 @@ func (m *BaseControllerRefManager) CanAdopt(ctx context.Context) error {

**The pattern K8s enforces:** Listers return objects from the read-only cache. The `DeepCopy()` pattern ensures mutation safety is the caller's responsibility, not the cache's.

### When to Apply This Rule

**Triggers:**
- Your getter/accessor returns a pointer to a field that's part of your struct's internal state
- Multiple goroutines call the accessor, and at least one caller might modify the result
- You observe race conditions in `go test -race` pointing to internal fields accessed from outside

**Example (detecting the smell):**
```go
type Registry struct {
	mu    sync.RWMutex
	items map[string]*Item
}

func (r *Registry) Get(key string) *Item {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.items[key] // returns internal pointer: caller can mutate!
}
// Caller does: item := registry.Get("foo"); item.Name = "bar"
// Now every other caller sees the mutated name, and the write is a data race
```

**Example (fixed):**
```go
func (r *Registry) Get(key string) *Item {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if item, ok := r.items[key]; ok {
		return item.DeepCopy() // caller gets their own copy
	}
	return nil
}
// Caller mutates their copy; the registry is unaffected
```

### Exceptions

- **Single-threaded access:** If your struct is only accessed from one goroutine (no concurrency), returning pointers to internals is safe and avoids copy overhead
- **Intentionally shared mutable state:** Some designs explicitly want callers to see mutations (e.g., shared counters, observable state). Document this clearly.
- **Performance-critical read paths:** If deep-copying on every access is too expensive and callers are trusted to not mutate, return a pointer with clear documentation: `// WARNING: do not mutate returned value`
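
A complete, runnable version of the copy-on-read accessor might look like the following. `Item`, `Clone`, `NewRegistry`, and `Put` are assumptions added so the sketch is self-contained; generated Kubernetes types would use `DeepCopy()` instead of the hand-written `Clone`.

```go
package main

import (
	"fmt"
	"sync"
)

// Item is a hypothetical stored value with one reference-typed field.
type Item struct {
	Name string
	Tags []string
}

// Clone returns a copy that shares no memory with the receiver.
func (i *Item) Clone() *Item {
	return &Item{Name: i.Name, Tags: append([]string(nil), i.Tags...)}
}

type Registry struct {
	mu    sync.RWMutex
	items map[string]*Item
}

func NewRegistry() *Registry {
	return &Registry{items: make(map[string]*Item)}
}

// Put stores a private copy so later caller mutations cannot leak in.
func (r *Registry) Put(key string, it *Item) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.items[key] = it.Clone()
}

// Get returns a private copy, or nil if key is absent.
func (r *Registry) Get(key string) *Item {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if it, ok := r.items[key]; ok {
		return it.Clone()
	}
	return nil
}

func main() {
	r := NewRegistry()
	r.Put("foo", &Item{Name: "original", Tags: []string{"a"}})

	got := r.Get("foo")
	got.Name = "bar" // mutates only the caller's copy
	got.Tags[0] = "zz"

	fmt.Println(r.Get("foo").Name, r.Get("foo").Tags) // original [a]
}
```

Copying on `Put` as well as on `Get` closes the other direction of the leak, where the caller keeps a pointer to what it stored.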

---

## Summary: The Philosophy