
Troubleshooting Kubernetes


Common Issues and Solutions

1. Viewing Pod Status and Logs

Check Pod Status:

# List all pods across namespaces
kubectl get pods -A

# List pods with more details
kubectl get pods -o wide

# Watch pods in real-time
kubectl get pods -w

# Get pod details
kubectl describe pod <pod-name> -n <namespace>

# Get pod events
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

View Logs:

# Basic log viewing
kubectl logs <pod-name> -n <namespace>

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Follow logs in real-time (like tail -f)
kubectl logs -f <pod-name> -n <namespace>

# View previous container logs (after crash/restart)
kubectl logs <pod-name> --previous -n <namespace>

# Get last N lines
kubectl logs <pod-name> --tail=100 -n <namespace>

# Logs since specific time
kubectl logs <pod-name> --since=1h -n <namespace>

# Logs with timestamps
kubectl logs <pod-name> --timestamps=true -n <namespace>

# Export logs to file
kubectl logs <pod-name> -n <namespace> > pod-logs.txt
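The flags above can be combined to collect logs from every pod of a workload in one command. A sketch, assuming a hypothetical label `app=my-app` (substitute your workload's actual labels):

```shell
# Logs from all pods matching a label, all containers, prefixed with pod name
kubectl logs -l app=my-app -n <namespace> \
  --all-containers \
  --prefix \
  --tail=50 \
  --since=30m > my-app-logs.txt
```

Note that with `-l`, kubectl defaults to the last 10 lines per pod unless `--tail` is given explicitly.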

2. Common Pod States and Solutions

Pod lifecycle (Mermaid state diagram):

stateDiagram-v2
    [*] --> Pending
    Pending --> Running: Resources Available
    Pending --> Failed: Scheduling Failed
    Running --> Succeeded: Completed
    Running --> Failed: Error
    Running --> CrashLoopBackOff: Repeated Crashes
    CrashLoopBackOff --> Running: Restart
    Failed --> [*]
    Succeeded --> [*]

Pending Pod:

# Diagnose pending pod
kubectl describe pod <pod-name> -n <namespace>

# Check node resources
kubectl top nodes
kubectl describe nodes

# Check for PVC issues
kubectl get pvc -n <namespace>

# Common causes and solutions:
# 1. Insufficient CPU/Memory: Scale down other apps or add nodes
# 2. Node selector mismatch: Check nodeSelector/affinity rules
# 3. Taint/toleration issues: Verify node taints
# 4. PVC not bound: Check storage class and PV availability

CrashLoopBackOff:

# Check recent logs
kubectl logs <pod-name> -n <namespace> --previous

# Check all container statuses
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Common debugging steps:
# 1. Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Last State"

# 2. Run debug container
kubectl debug <pod-name> -it --image=busybox -n <namespace>

# 3. Check resource limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 resources:
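The exit code and restart count can also be extracted directly with jsonpath, which is easier to script than grepping `describe` output. A sketch:

```shell
# Exit code of the last terminated container
# (137 = SIGKILL, commonly the OOM killer; 1 = generic application error)
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Restart count per container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
```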

ImagePullBackOff:

# Check the exact error
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Failed"

# Verify image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check if secret exists (for private registries)
kubectl get secrets -n <namespace>

# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry-server> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
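Creating the secret is not enough: pods must reference it via `imagePullSecrets` in their spec, or you can attach it to the namespace's ServiceAccount so new pods pick it up automatically. A sketch, assuming the `regcred` secret created above and the `default` ServiceAccount:

```shell
# Attach the pull secret to the default ServiceAccount in this namespace
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```

Existing pods must be recreated before they use the new secret.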

3. Node Troubleshooting

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

# Node resource usage
kubectl top nodes

# Check kubelet logs (on the node)
journalctl -u kubelet -f

# Cordon node (prevent new pods)
kubectl cordon <node-name>

# Drain node (evicts pods); note --force also deletes pods not managed by a controller
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=60

# Uncordon node
kubectl uncordon <node-name>
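Beyond the overall Ready status, each node reports pressure conditions that explain why it stops accepting pods. A sketch that prints them all (nested `range` is supported by kubectl's jsonpath):

```shell
# Report all conditions per node; any pressure condition with status True is a problem
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}{"\t"}{.type}{"="}{.status}{"\n"}{end}{end}'
```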

4. Service and Networking Issues

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Verify service selectors match pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A 5 selector:
kubectl get pods -n <namespace> --show-labels

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -n <namespace> -- nslookup <service-name>

# Test service connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -n <namespace> -- curl <service-name>:<port>

# Check network policies
kubectl get networkpolicies -n <namespace>
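The most common cause of an empty endpoints list is a selector/label mismatch, which the two selector commands above will reveal. A sketch of the check, assuming a hypothetical selector `app=my-app` (read the real one from the service first):

```shell
# Print the service's selector as a JSON map
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'

# Then confirm pods actually carry those labels; no output means no match
kubectl get pods -n <namespace> -l app=my-app
```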

5. Port Forwarding Issues

Common Issues and Solutions:

# Verify pod is running
kubectl get pod <pod-name> -n <namespace>

# Check container port
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].ports[*]}'

# Port forward to pod
kubectl port-forward -n <namespace> pod/<pod-name> 8080:80

# Port forward via a service (kubectl resolves it to one backing pod;
# the tunnel still breaks if that pod restarts)
kubectl port-forward -n <namespace> svc/<service-name> 8080:80

# Bind to all interfaces (reachable from other machines - use with care)
kubectl port-forward --address 0.0.0.0 -n <namespace> svc/<service-name> 8080:80

Important Port Forwarding Notes:

  • kubectl port-forward blocks the terminal while it runs - this is expected behavior
  • Use a separate terminal for other commands
  • "Handling connection" messages are normal and simply indicate traffic
  • Idle connections may be reset; re-run the command if forwarding stops
  • For production traffic, use a proper Ingress or load balancer instead
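Because the tunnel dies whenever the backing pod restarts, a simple retry loop is a common workaround for longer debugging sessions. A sketch only; for anything long-lived, prefer an Ingress or LoadBalancer as noted above:

```shell
# Restart port-forward automatically if the connection drops
while true; do
  kubectl port-forward -n <namespace> svc/<service-name> 8080:80
  echo "port-forward exited; retrying in 2s" >&2
  sleep 2
done
```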