736 lines
27 KiB
Markdown
736 lines
27 KiB
Markdown
# Anti-Patterns: What Kubernetes Avoids (and Why)
|
|
|
|
## 1. Never Mutate Shared Cache Objects
|
|
|
|
**What they avoid:** Modifying objects returned by Listers/Informers without deep-copying first.
|
|
|
|
**Why:** The informer cache is shared across all controllers. Mutating a cached object corrupts state for every other consumer.
|
|
|
|
**The pattern K8s enforces:**
|
|
```go
|
|
// WRONG — mutates shared cache
|
|
deployment, _ := dc.dLister.Deployments(ns).Get(name)
|
|
deployment.Spec.Replicas = ptr.To[int32](3) // CORRUPTION!
|
|
|
|
// RIGHT — deep copy before mutating
|
|
deployment, _ := dc.dLister.Deployments(ns).Get(name)
|
|
deploymentCopy := deployment.DeepCopy()
|
|
deploymentCopy.Spec.Replicas = ptr.To[int32](3)
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You're reading from any shared data structure (cache, registry, concurrent map)
|
|
- Your function modifies a struct that was returned by a Lister, cache, or lookup
|
|
- You see code like `obj.Field = newValue` where `obj` came from a shared source
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
// This line is the red flag: modifying something obtained from a shared cache
|
|
deployment, _ := deploymentLister.Get(name)
|
|
deployment.Spec.Replicas = ptr.To[int32](5) // SMELL: mutating cached object
|
|
client.Update(ctx, deployment)
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
deployment, _ := deploymentLister.Get(name)
|
|
copy := deployment.DeepCopy() // isolate your mutation
|
|
copy.Spec.Replicas = ptr.To[int32](5)
|
|
client.Update(ctx, copy)
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Read-only access:** If you only read fields and never modify the object, deep copy is unnecessary overhead
|
|
- **Single-owner data:** If the struct was created locally and isn't shared (e.g., you just built it with `&Deployment{}`), copying your own data is wasteful
|
|
- **Immutable value types:** Primitive fields (string, int, bool) are copied by value in Go — you only need DeepCopy for slices, maps, and pointer fields
|
|
|
|
**Evidence:** The `runtime.Object` interface *mandates* `DeepCopyObject()`. Every API type has generated deep copy methods. The entire architecture assumes immutable reads.
|
|
|
|
---
|
|
|
|
## 2. Never Process the Same Key Concurrently
|
|
|
|
**What they avoid:** Multiple goroutines syncing the same object simultaneously.
|
|
|
|
**Why:** Two goroutines reading the same Deployment, each computing a different desired state, then both writing → conflict errors and potential state corruption.
|
|
|
|
**The pattern K8s enforces:** The workqueue's `processing` set ensures a key is only handed to one worker at a time. If an item is added while being processed, it's re-queued *after* `Done()` is called:
|
|
|
|
```go
|
|
// From queue.go — the processing set blocks concurrent access
|
|
func (q *Typed[T]) Get() (item T, shutdown bool) {
|
|
// ...
|
|
q.processing.Insert(item) // Mark as being worked on
|
|
q.dirty.Delete(item)
|
|
return item, false
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Multiple goroutines/workers process items from a shared queue or event stream
|
|
- Processing involves read-modify-write cycles on external state (API server, database)
|
|
- You observe optimistic concurrency conflicts (409 Conflict) or duplicate operations in logs
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
// Multiple workers, no key-level serialization
|
|
for i := 0; i < 10; i++ {
|
|
go func() {
|
|
for key := range eventChannel { // same key can go to any worker
|
|
obj, _ := client.Get(key)
|
|
obj.Status.Count++
|
|
client.Update(obj) // CONFLICT: another worker updated first
|
|
}
|
|
}()
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
// Workqueue guarantees: one worker per key at a time
|
|
queue := workqueue.NewTypedRateLimitingQueue[string](limiter)
|
|
for i := 0; i < 10; i++ {
|
|
go func() {
|
|
for {
|
|
key, _ := queue.Get() // key is exclusively ours until Done()
|
|
reconcile(key)
|
|
queue.Done(key)
|
|
}
|
|
}()
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Read-only operations:** If workers only read (metrics collection, logging), concurrent access to the same key is safe
|
|
- **Truly idempotent writes:** If the write is unconditional (e.g., PUT with a fixed value, not read-modify-write), concurrent processing produces the same result
|
|
- **Sharded ownership:** If each key is deterministically routed to exactly one worker (consistent hashing), the queue's built-in serialization is unnecessary
|
|
|
|
---
|
|
|
|
## 3. Never Use Edge-Triggered Logic
|
|
|
|
**What they avoid:** Controllers that react to *what changed* rather than *what the current state is*.
|
|
|
|
**Why:** Events can be missed (watch disconnects), delivered out of order, or duplicated. If your controller says "a pod was deleted, so decrement counter" rather than "count current pods and compare to desired", you'll drift.
|
|
|
|
**The pattern K8s enforces:** Level-triggered reconciliation. The `syncHandler` reads *current state from the cache*, computes *desired state from the spec*, and makes the world match:
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Your handler says "X happened, so do Y" instead of "the world should look like Z, make it so"
|
|
- You're incrementing/decrementing counters based on events instead of counting actual state
|
|
- A missed event (network blip, restart) would cause permanent drift
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
func onPodDeleted(pod Pod) {
|
|
deployment.Status.Replicas-- // edge-triggered: if we miss a delete, count is wrong forever
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func reconcile(deployment Deployment) {
|
|
actualPods := listPodsForDeployment(deployment)
|
|
deployment.Status.Replicas = int32(len(actualPods)) // level-triggered: always correct
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Audit logging / event streams:** Recording "what happened" is inherently edge-triggered — you want to log each event, not reconstruct history from state
|
|
- **Notifications:** "User X just logged in" is an event that triggers a one-time notification — level-triggered makes no sense here
|
|
- **Ordering-sensitive operations:** If the sequence of events matters (transaction log, command queue), level-triggered reconciliation would lose ordering information
|
|
|
|
```go
|
|
// The sync function always reads current state, never relies on "what happened"
|
|
func (dc *DeploymentController) syncDeployment(ctx context.Context, key string) error {
|
|
deployment, err := dc.dLister.Deployments(namespace).Get(name)
|
|
// Compute desired state from deployment.Spec
|
|
// Read actual state from replicaset lister
|
|
// Reconcile difference
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Never Forget to Call queue.Forget() on Success
|
|
|
|
**What they avoid:** Letting the rate limiter track items that succeeded.
|
|
|
|
**Why:** The rate limiter is per-item. If you never call `Forget()`, the next time the same key needs processing (even for a new event), it will be rate-limited as if it previously failed.
|
|
|
|
**Source:** The rate_limiting_queue.go comments explicitly warn:
|
|
```go
|
|
// NewTypedRateLimitingQueue constructs a new workqueue with rateLimited queuing ability
|
|
// Remember to call Forget! If you don't, you may end up tracking failures forever.
|
|
```
|
|
|
|
**The pattern K8s enforces:**
|
|
```go
|
|
func (dc *DeploymentController) handleErr(ctx context.Context, err error, key string) {
|
|
if err == nil {
|
|
dc.queue.Forget(key) // CRITICAL: clear the failure counter
|
|
return
|
|
}
|
|
if dc.queue.NumRequeues(key) < maxRetries {
|
|
dc.queue.AddRateLimited(key)
|
|
return
|
|
}
|
|
dc.queue.Forget(key) // Also forget when giving up
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You're using a rate-limited workqueue and items can both succeed and fail
|
|
- You notice items processing slower over time for no apparent reason (accumulated backoff)
|
|
- Your error handling only calls `AddRateLimited` on failure but has no `Forget` on success
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
func processItem(queue workqueue.RateLimitingInterface, key string) {
|
|
defer queue.Done(key)
|
|
err := reconcile(key)
|
|
if err != nil {
|
|
queue.AddRateLimited(key) // retry with backoff
|
|
return
|
|
}
|
|
// BUG: no queue.Forget(key)!
|
|
// Next time this key is processed (even for a new event),
|
|
// it gets rate-limited delay from the old failure counter
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func processItem(queue workqueue.RateLimitingInterface, key string) {
|
|
defer queue.Done(key)
|
|
err := reconcile(key)
|
|
if err != nil {
|
|
queue.AddRateLimited(key)
|
|
return
|
|
}
|
|
queue.Forget(key) // clear backoff counter on success
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **You're using a plain (non-rate-limited) queue:** `Forget()` only matters on `RateLimitingInterface`. Plain `Interface` has no rate limiter to clear.
|
|
- **Permanent failures where the key should never be fast-tracked:** If a key represents a permanently broken resource, you might intentionally not `Forget` so that future events for it are naturally throttled (rare — usually you'd remove it from the queue entirely).
|
|
|
|
---
|
|
|
|
## 5. Never Hit the API Server in a Tight Loop
|
|
|
|
**What they avoid:** Direct API calls for reads. List/Get calls in hot paths.
|
|
|
|
**Why:** The API server is a shared resource. If 100 controllers each make 10 API calls per reconciliation at 1 sync/second, that's 1000 req/s to the API server per controller manager instance.
|
|
|
|
**The pattern K8s enforces:** Read from Listers (local cache), write to API server:
|
|
```go
|
|
// READ from cache (free, local, fast)
|
|
deployment, err := dc.dLister.Deployments(namespace).Get(name)
|
|
|
|
// WRITE to API server (expensive, remote, rate-limited)
|
|
_, err = dc.client.AppsV1().Deployments(namespace).Update(ctx, deployment, ...)
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Your reconcile function makes API calls (Get/List) inside a loop
|
|
- You observe API server throttling (429 responses) or high request latency
|
|
- Multiple controllers in the same process each independently fetch the same resources
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
|
|
for i := 0; i < int(*deployment.Spec.Replicas); i++ {
|
|
// API call inside a loop — O(replicas) calls per sync
|
|
pod, err := c.client.CoreV1().Pods(ns).Get(ctx, podName(i), metav1.GetOptions{})
|
|
if errors.IsNotFound(err) {
|
|
c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{})
|
|
}
|
|
}
|
|
return nil
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func (c *Controller) reconcile(ctx context.Context, deployment *apps.Deployment) error {
|
|
// Single cache read (local, free) — gets ALL pods at once
|
|
existingPods, _ := c.podLister.Pods(ns).List(selectorForDeployment(deployment))
|
|
existingSet := make(map[string]bool)
|
|
for _, p := range existingPods { existingSet[p.Name] = true }
|
|
|
|
for i := 0; i < int(*deployment.Spec.Replicas); i++ {
|
|
if !existingSet[podName(i)] {
|
|
c.client.CoreV1().Pods(ns).Create(ctx, newPod(i), metav1.CreateOptions{})
|
|
}
|
|
}
|
|
return nil
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Writes are unavoidable:** You must hit the API server for creates/updates/deletes — the rule is about *reads*, not writes
|
|
- **Cache staleness is unacceptable:** Rare cases where you need a strongly consistent read (e.g., before an irreversible action) justify a direct Get
|
|
- **One-shot tools:** CLI commands or migration scripts don't benefit from informer caches (they exit after one operation)
|
|
|
|
---
|
|
|
|
## 6. Never Sync Before Caches Are Warm
|
|
|
|
**What they avoid:** Processing items before the informer has done its initial List.
|
|
|
|
**Why:** With an empty cache, a controller might think "no pods exist, must create all of them" — causing a thundering herd of duplicate creates.
|
|
|
|
**The pattern K8s enforces:**
|
|
```go
|
|
// pkg/controller/deployment/deployment_controller.go:189
|
|
if !cache.WaitForNamedCacheSyncWithContext(ctx,
|
|
dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
|
|
return // Don't start workers until all caches are populated
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Your controller uses informer caches (Listers) to determine what actions to take
|
|
- You observe a burst of spurious creates/deletes immediately after startup
|
|
- Your reconcile logic compares "desired count vs actual count" using cached data
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
func (c *Controller) Run(ctx context.Context) {
|
|
// BUG: starts workers immediately — cache might be empty
|
|
for i := 0; i < workers; i++ {
|
|
go c.worker(ctx)
|
|
}
|
|
// Informers haven't finished initial List yet
|
|
// Worker reads cache → sees 0 pods → creates all replicas again
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func (c *Controller) Run(ctx context.Context) {
|
|
// Gate: wait until all caches have completed initial List
|
|
if !cache.WaitForCacheSync(ctx.Done(), c.podsSynced, c.rsSynced) {
|
|
runtime.HandleError(fmt.Errorf("caches never synced"))
|
|
return
|
|
}
|
|
// Now safe to start workers — cache reflects reality
|
|
for i := 0; i < workers; i++ {
|
|
go c.worker(ctx)
|
|
}
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Read-only controllers:** If your controller only observes and reports (metrics, logging), processing with a partial cache produces incomplete but not harmful results
|
|
- **Controllers that check existence before acting:** If your reconcile logic does a direct API Get (not cache read) before creating, the cache warmth gate is less critical (but still good practice)
|
|
- **Fast-starting controllers with explicit "not found" handling:** If your reconcile function handles "object not in cache" gracefully (e.g., requeues the key), it may tolerate starting before full sync
|
|
|
|
---
|
|
|
|
## 7. Never Ignore Tombstones in Delete Handlers
|
|
|
|
**What they avoid:** Assuming delete handlers always receive the concrete type.
|
|
|
|
**Why:** If a watch disconnects and reconnects, missed deletes arrive as `DeletedFinalStateUnknown` (tombstones). Ignoring them means your controller never learns about those deletions.
|
|
|
|
**The pattern K8s enforces:**
|
|
```go
|
|
// Every delete handler must check for tombstones
|
|
func (dc *DeploymentController) deleteDeployment(logger klog.Logger, obj interface{}) {
|
|
d, ok := obj.(*apps.Deployment)
|
|
if !ok {
|
|
tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
|
|
if !ok {
|
|
utilruntime.HandleError(...)
|
|
return
|
|
}
|
|
d, ok = tombstone.Obj.(*apps.Deployment)
|
|
// ...
|
|
}
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You're implementing a `DeleteFunc` event handler for an informer
|
|
- Your delete handler type-asserts directly to the concrete type without a fallback
|
|
- You observe "leaked" resources that should have been cleaned up after their parent was deleted
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
|
|
DeleteFunc: func(obj interface{}) {
|
|
pod := obj.(*v1.Pod) // PANIC during watch reconnection
|
|
cleanupPod(pod)
|
|
},
|
|
})
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
|
|
DeleteFunc: func(obj interface{}) {
|
|
pod, ok := obj.(*v1.Pod)
|
|
if !ok {
|
|
tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
|
|
if !ok {
|
|
runtime.HandleError(fmt.Errorf("unexpected type %T", obj))
|
|
return
|
|
}
|
|
pod, ok = tombstone.Obj.(*v1.Pod)
|
|
if !ok {
|
|
runtime.HandleError(fmt.Errorf("tombstone contained %T", tombstone.Obj))
|
|
return
|
|
}
|
|
}
|
|
cleanupPod(pod)
|
|
},
|
|
})
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Delete handler only extracts the key:** If your handler only needs `namespace/name` to enqueue work, you can use `cache.MetaNamespaceKeyFunc(obj)` which handles tombstones internally
|
|
- **Non-informer event sources:** If your events don't come from client-go informers (e.g., custom message queues), tombstones don't apply
|
|
- **Level-triggered reconcilers that don't use delete handlers:** If your reconcile loop discovers deletions via "not found" from the lister, the delete handler is optional anyway
|
|
|
|
---
|
|
|
|
## 8. Never Use ResourceVersion for Equality
|
|
|
|
**What they avoid:** Comparing ResourceVersion to check if an object changed.
|
|
|
|
**Why:** ResourceVersion is opaque (currently etcd's mod_revision, but this is an implementation detail). The only valid operation is "!=" to detect change.
|
|
|
|
**The pattern K8s uses:**
|
|
```go
|
|
// pkg/controller/deployment/deployment_controller.go:284-288
|
|
func (dc *DeploymentController) updateReplicaSet(logger klog.Logger, old, cur interface{}) {
|
|
curRS := cur.(*apps.ReplicaSet)
|
|
oldRS := old.(*apps.ReplicaSet)
|
|
if curRS.ResourceVersion == oldRS.ResourceVersion {
|
|
return // Periodic resync, nothing actually changed
|
|
}
|
|
// ... process the real update
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You're comparing ResourceVersions with `<`, `>`, or numeric parsing
|
|
- You're storing ResourceVersion as an integer for ordering or pagination
|
|
- You're using ResourceVersion to determine "which version is newer"
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
// Treating ResourceVersion as a number — WRONG
|
|
rv1, _ := strconv.Atoi(obj1.ResourceVersion)
|
|
rv2, _ := strconv.Atoi(obj2.ResourceVersion)
|
|
if rv2 > rv1 {
|
|
// "obj2 is newer" — this assumption may break in future implementations
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
// Only valid comparison: equality (to detect resync vs real update)
|
|
if cur.ResourceVersion == old.ResourceVersion {
|
|
return // no change — this is a periodic resync event
|
|
}
|
|
// Don't compare ordering — just process the update
|
|
processUpdate(cur)
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **The `== ` check for resync detection:** Comparing ResourceVersion for equality (`==`) to detect "no change" is valid and explicitly how Kubernetes uses it
|
|
- **Watch continuity:** Passing ResourceVersion to Watch/List for "resume from" is the intended API — you're not comparing, you're providing a cursor
|
|
- **Internal etcd tooling:** If you're operating directly on etcd (not through the Kubernetes API), mod_revision semantics are documented there
|
|
|
|
---
|
|
|
|
## 9. Never Panic in Production Goroutines (Without Recovery)
|
|
|
|
**What they avoid:** Unhandled panics killing the entire controller manager.
|
|
|
|
**Why:** A single nil pointer in one controller's sync loop would crash all 30+ controllers running in the same process.
|
|
|
|
**The pattern K8s enforces:**
|
|
```go
|
|
// Every goroutine gets crash protection
|
|
func (dc *DeploymentController) Run(ctx context.Context, workers int) {
|
|
defer utilruntime.HandleCrash() // Top-level recovery
|
|
// ...
|
|
}
|
|
|
|
// And in every polling loop:
|
|
func BackoffUntilWithContext(ctx context.Context, f func(ctx context.Context), ...) {
|
|
func() {
|
|
defer runtime.HandleCrashWithContext(ctx) // Per-iteration recovery
|
|
f(ctx)
|
|
}()
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You're launching goroutines without `defer recover()` or `defer HandleCrash()`
|
|
- Your process hosts multiple independent subsystems (any one panicking kills all)
|
|
- You observe random process deaths in production with nil pointer or index-out-of-range panics
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
// Naked goroutine — one panic kills the whole process
|
|
func (c *Controller) Run(ctx context.Context) {
|
|
for i := 0; i < 5; i++ {
|
|
go c.worker(ctx) // no recovery — nil pointer in worker kills everything
|
|
}
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func (c *Controller) Run(ctx context.Context) {
|
|
for i := 0; i < 5; i++ {
|
|
go func() {
|
|
defer utilruntime.HandleCrash()
|
|
c.worker(ctx)
|
|
}()
|
|
}
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Main goroutine of a CLI tool:** Let it panic — the stack trace IS the error report. Recovery here just hides bugs.
|
|
- **Panics that indicate programming errors you WANT to crash on:** `panic("unreachable")` or assertion failures during development should crash to catch bugs early.
|
|
- **Test code:** Tests should panic/fail loudly. `HandleCrash` in tests would swallow test failures.
|
|
|
|
---
|
|
|
|
## 10. Never Block Workers Indefinitely
|
|
|
|
**What they avoid:** Unbounded blocking in a sync handler (e.g., waiting for a condition that may never occur).
|
|
|
|
**Why:** Workers are a finite pool. If one blocks forever, that's one fewer worker processing the queue. At scale, this cascades.
|
|
|
|
**The pattern K8s enforces:** All API calls take context (with timeouts), all waits are bounded:
|
|
```go
|
|
// Timeouts on API calls
|
|
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
|
|
defer cancel()
|
|
_, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Your sync handler calls external services without a timeout
|
|
- Workers are a fixed pool and you observe queue depth growing while workers appear idle
|
|
- Your code has `select {}`, unbounded channel reads, or mutex waits without deadline
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
func (c *Controller) syncItem(ctx context.Context, key string) error {
|
|
// Calls external service with no timeout — blocks indefinitely if service is down
|
|
resp, err := http.Get("http://external-service/api/" + key)
|
|
if err != nil { return err }
|
|
|
|
// Waits on a channel that might never close
|
|
result := <-c.resultChan // blocks forever if producer crashes
|
|
return c.updateStatus(ctx, key, result)
|
|
}
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func (c *Controller) syncItem(ctx context.Context, key string) error {
|
|
// Bounded timeout on external call
|
|
reqCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
|
|
defer cancel()
|
|
req, _ := http.NewRequestWithContext(reqCtx, "GET", "http://external-service/api/"+key, nil)
|
|
resp, err := http.DefaultClient.Do(req)
|
|
if err != nil { return err } // timeout returns error, worker is freed
|
|
|
|
// Bounded wait on channel
|
|
select {
|
|
case result := <-c.resultChan:
|
|
return c.updateStatus(ctx, key, result)
|
|
case <-ctx.Done():
|
|
return ctx.Err() // worker freed, key requeued
|
|
}
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **The worker loop itself:** The top-level `queue.Get()` blocks intentionally when the queue is empty — this is by design (workers sleep until there's work)
|
|
- **Graceful shutdown drains:** During shutdown, waiting for in-flight work to complete is acceptable (with a bounded overall shutdown timeout)
|
|
- **Leader election acquire:** Blocking until leadership is acquired is intentional — non-leaders should idle
|
|
|
|
---
|
|
|
|
## 11. Never Use sync.Mutex Where sync.Once Suffices
|
|
|
|
**What they avoid:** Full mutual exclusion for one-shot operations.
|
|
|
|
**Why:** `sync.Once` is semantically clearer and avoids the bug where you forget to check a "done" flag under the mutex.
|
|
|
|
**The pattern K8s uses:**
|
|
```go
|
|
// pkg/controller/controller_ref_manager.go:43-49
|
|
type BaseControllerRefManager struct {
|
|
canAdoptErr error
|
|
canAdoptOnce sync.Once // One-shot lazy evaluation
|
|
CanAdoptFunc func(ctx context.Context) error
|
|
}
|
|
|
|
func (m *BaseControllerRefManager) CanAdopt(ctx context.Context) error {
|
|
m.canAdoptOnce.Do(func() {
|
|
if m.CanAdoptFunc != nil {
|
|
m.canAdoptErr = m.CanAdoptFunc(ctx)
|
|
}
|
|
})
|
|
return m.canAdoptErr
|
|
}
|
|
```
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- You see a `sync.Mutex` protecting a `bool` flag that tracks "has this been initialized?"
|
|
- The protected code runs exactly once and caches the result
|
|
- You see patterns like `mu.Lock(); if !done { doThing(); done = true }; mu.Unlock()`
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
type Manager struct {
|
|
mu sync.Mutex
|
|
initialized bool
|
|
config *Config
|
|
}
|
|
|
|
func (m *Manager) GetConfig() *Config {
|
|
m.mu.Lock()
|
|
defer m.mu.Unlock()
|
|
if !m.initialized {
|
|
m.config = loadConfig()
|
|
m.initialized = true
|
|
}
|
|
return m.config
|
|
}
|
|
// Bug-prone: what if someone adds code that sets initialized=false?
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
type Manager struct {
|
|
initOnce sync.Once
|
|
config *Config
|
|
}
|
|
|
|
func (m *Manager) GetConfig() *Config {
|
|
m.initOnce.Do(func() {
|
|
m.config = loadConfig()
|
|
})
|
|
return m.config
|
|
// Impossible to accidentally re-initialize
|
|
}
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Resettable state:** If the "one-time" operation might need to run again (e.g., reconnection after disconnect), `sync.Once` doesn't support reset — use a mutex with a flag
|
|
- **Multiple initialization phases:** If initialization has multiple stages that can partially fail and retry, `sync.Once` is too coarse (it only fires once, even on error)
|
|
- **Pre-Go 1.21 error handling:** Before Go 1.21's `sync.OnceValue`/`sync.OnceFunc`, handling errors from `Do` required workarounds. Now `sync.OnceValue` cleanly supports fallible initialization.
|
|
|
|
---
|
|
|
|
## 12. Never Expose Mutable State Through Interfaces
|
|
|
|
**What they avoid:** Returning pointers to internal state through public interfaces.
|
|
|
|
**Why:** Callers can accidentally mutate internal state, creating subtle bugs that only manifest under concurrency.
|
|
|
|
**The pattern K8s enforces:** Listers return objects from the read-only cache. The `DeepCopy()` pattern ensures mutation safety is the caller's responsibility, not the cache's.
|
|
|
|
### When to Apply This Rule
|
|
|
|
**Triggers:**
|
|
- Your getter/accessor returns a pointer to a field that's part of your struct's internal state
|
|
- Multiple goroutines call the accessor, and at least one caller might modify the result
|
|
- You observe race conditions in `go test -race` pointing to internal fields accessed from outside
|
|
|
|
**Example — detecting the smell:**
|
|
```go
|
|
type Registry struct {
|
|
mu sync.RWMutex
|
|
items map[string]*Item
|
|
}
|
|
|
|
func (r *Registry) Get(key string) *Item {
|
|
r.mu.RLock()
|
|
defer r.mu.RUnlock()
|
|
return r.items[key] // returns internal pointer — caller can mutate!
|
|
}
|
|
// Caller does: item := registry.Get("foo"); item.Name = "bar"
|
|
// Now every other caller sees the mutated name — data race
|
|
```
|
|
|
|
**Example — fixed:**
|
|
```go
|
|
func (r *Registry) Get(key string) *Item {
|
|
r.mu.RLock()
|
|
defer r.mu.RUnlock()
|
|
if item, ok := r.items[key]; ok {
|
|
return item.DeepCopy() // caller gets their own copy
|
|
}
|
|
return nil
|
|
}
|
|
// Caller mutates their copy — registry is unaffected
|
|
```
|
|
|
|
### Exceptions
|
|
|
|
- **Single-threaded access:** If your struct is only accessed from one goroutine (no concurrency), returning pointers to internals is safe and avoids copy overhead
|
|
- **Intentionally shared mutable state:** Some designs explicitly want callers to see mutations (e.g., shared counters, observable state). Document this clearly.
|
|
- **Performance-critical read paths:** If deep-copying on every access is too expensive and callers are trusted to not mutate, return a pointer with clear documentation: `// WARNING: do not mutate returned value`
|
|
|
|
---
|
|
|
|
## Summary: The Philosophy
|
|
|
|
Kubernetes avoids these anti-patterns because of one fundamental truth: **in a distributed system, every assumption you make about state being consistent is wrong.**
|
|
|
|
The patterns exist because:
|
|
1. **Events are unreliable** → level-triggered reconciliation
|
|
2. **Reads are stale** → always compare desired vs actual
|
|
3. **Concurrent access is inevitable** → deep copy, queue serialization
|
|
4. **Failures are normal** → retry with backoff, graceful degradation
|
|
5. **Resources are shared** → cache reads, rate-limit writes
|
|
6. **Systems outlive their authors** → code generation, type registries, feature gates
|