Triage a production incident on Kubernetes
When to use: An app is misbehaving in prod and you need to inspect pods, events, and logs without alt-tabbing between tools.
Prerequisites
- kubeconfig with access to the cluster: standard aws eks update-kubeconfig or equivalent
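One way to confirm that the kubeconfig really grants the read access this flow relies on is to ask the API server with kubectl auth can-i. The sketch below only builds the commands; the helper name access_check_commands is hypothetical, and the context and namespace values are the ones used in the flow that follows.

```python
def access_check_commands(context: str, namespace: str) -> list[list[str]]:
    """Build kubectl commands that check the read permissions triage needs."""
    # Context and namespace are passed explicitly on every command.
    base = ["kubectl", "--context", context, "--namespace", namespace]
    checks = [
        ["list", "pods"],
        ["list", "events"],
        ["get", "pods", "--subresource=log"],  # permission to read pod logs
    ]
    return [base + ["auth", "can-i", *check] for check in checks]


for cmd in access_check_commands("prod-us-east", "checkout"):
    print(" ".join(cmd))
```

Each command can be passed to subprocess.run as an argument list; kubectl auth can-i answers yes or no for the current identity.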
Flow
- Find unhealthy pods
  k8s: in context prod-us-east, namespace checkout, list pods not in Running state. Include reason + restart count.
  → Pods shown with state, reason, restarts
- Get events
  Get events in that namespace from the last 30 minutes, sorted by time.
  → Events list; OOMKilled or ImagePullBackOff visible if present
- Get logs
  For the pod with the most recent restart, tail the previous container's logs (last 200 lines).
  → Stack trace / cause visible
- Diagnose
  Synthesize: what's the likely root cause and what should we do? Be specific.
  → Concrete next step (e.g. raise memory limit + roll out)
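For readers who want to see roughly what the first three steps amount to against plain kubectl, here is a minimal sketch that works on the JSON kubectl returns. It is not how the k8s tool itself is implemented. The function names are hypothetical, "unhealthy" is assumed to mean "pod phase is not Running, or some container is not in the running state", and only events that carry lastTimestamp are considered.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional

CONTEXT, NAMESPACE = "prod-us-east", "checkout"  # values from the flow above


def kubectl(*args: str) -> list[str]:
    """Build a kubectl argv with context and namespace pinned on every call."""
    return ["kubectl", "--context", CONTEXT, "--namespace", NAMESPACE, *args]


def unhealthy_pods(pod_list_json: str) -> list[dict]:
    """Summarize pods from `kubectl get pods -o json` that are not fully running."""
    rows = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        reason = status.get("reason", "")  # pod-level reason, e.g. Evicted
        last_restart, last_exit = "", ""
        for cs in containers:
            state = cs.get("state", {})
            for key in ("waiting", "terminated"):
                if key in state and not reason:
                    reason = state[key].get("reason", "")
            term = cs.get("lastState", {}).get("terminated", {})
            finished = term.get("finishedAt") or ""
            # Same-format UTC timestamps compare correctly as strings.
            if finished > last_restart:
                last_restart, last_exit = finished, term.get("reason", "")
        all_running = bool(containers) and all(
            "running" in cs.get("state", {}) for cs in containers
        )
        if status.get("phase") != "Running" or not all_running:
            rows.append({
                "name": pod["metadata"]["name"],
                "phase": status.get("phase", "Unknown"),
                "reason": reason,
                "restarts": sum(cs.get("restartCount", 0) for cs in containers),
                "last_restart": last_restart,
                "last_exit": last_exit,  # e.g. OOMKilled
            })
    return rows


def recent_events(event_list_json: str, now: datetime, minutes: int = 30) -> list[dict]:
    """Keep events from `kubectl get events -o json` in the window, oldest first."""
    cutoff = now - timedelta(minutes=minutes)
    kept = []
    for ev in json.loads(event_list_json).get("items", []):
        ts = ev.get("lastTimestamp")
        if not ts:
            continue  # events without lastTimestamp are skipped in this sketch
        when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if when >= cutoff:
            kept.append({"time": when, "reason": ev.get("reason", ""),
                         "message": ev.get("message", "")})
    return sorted(kept, key=lambda e: e["time"])


def previous_logs_command(rows: list[dict], tail: int = 200) -> Optional[list[str]]:
    """Build the logs command for the pod with the most recent restart, if any."""
    restarted = [r for r in rows if r["last_restart"]]
    if not restarted:
        return None
    pod = max(restarted, key=lambda r: r["last_restart"])
    # For multi-container pods, a -c <container> argument may also be needed.
    return kubectl("logs", pod["name"], "--previous", f"--tail={tail}")
```

The JSON inputs would come from running kubectl("get", "pods", "-o", "json") and kubectl("get", "events", "-o", "json") through subprocess.run. The diagnosis step stays a judgment call made on the resulting pod rows, events, and log lines.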
Outcome: Triage in <5 minutes with cited pod names + log lines.
Pitfalls
- Logs from a previous container are not always available: if the pod has restarted at most once, check the current container's logs first, and ask for the previous container's logs only if it crashed
- Wrong context: always specify the context on every call; don't rely on current-context, which can drift
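Both pitfalls can be guarded against in a small wrapper. The sketch below is one possible approach, not a prescribed one: it passes the context explicitly on every call and falls back to the current container's logs when kubectl cannot return the previous container's. The names fetch_logs and run are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output as text."""
    return subprocess.run(list(cmd), capture_output=True, text=True)


def fetch_logs(context: str, namespace: str, pod: str,
               tail: int = 200, runner: Runner = run) -> tuple[str, str]:
    """Return (source, logs), where source is 'previous' or 'current'."""
    # The context is passed on every call instead of relying on current-context.
    base = ["kubectl", "--context", context, "--namespace", namespace,
            "logs", pod, f"--tail={tail}"]
    previous = runner(base + ["--previous"])
    if previous.returncode == 0:
        return "previous", previous.stdout
    # No previous container (or it could not be read): use the current one.
    current = runner(base)
    return "current", current.stdout
```

Calling fetch_logs("prod-us-east", "checkout", pod_name) returns which container the lines came from, so the diagnosis can cite the right source.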