Kubernetes Pod Debugging: Master Troubleshooting in 2024
Master Kubernetes pod debugging with kubectl, ephemeral containers & node inspection. Automate diagnostics with OpsSqad's K8s Squad & save hours on troubleshooting.

Mastering Kubernetes Pod Debugging: From Pending to Production
When a Kubernetes pod fails in production, every second counts. The application is down, users are affected, and you need answers fast. Whether your pod is stuck in Pending, crashing with cryptic exit codes, or silently failing health checks, effective debugging requires both systematic methodology and the right tools. This comprehensive guide walks you through the entire spectrum of Kubernetes pod debugging—from understanding pod lifecycle states to executing advanced troubleshooting techniques that solve real-world production issues.
TL;DR: Debugging Kubernetes pods requires understanding pod lifecycle states (Pending, ContainerCreating, Running, Failed), mastering essential kubectl commands (describe, logs, exec), and applying advanced techniques like ephemeral debug containers and node-level inspection. Common issues include resource constraints, image pull failures, network misconfigurations, and application crashes—each requiring specific diagnostic approaches and remediation strategies.
Understanding the Kubernetes Pod Lifecycle and Common Failure States
Kubernetes pod debugging becomes significantly easier when you understand what's supposed to happen during normal operation. A pod is the smallest deployable unit in Kubernetes, representing one or more containers that share storage and network resources. Every pod progresses through a defined lifecycle, and recognizing when that lifecycle deviates from normal patterns is the first step in effective troubleshooting.
The Normal Pod Lifecycle: From Creation to Running
A healthy Kubernetes pod transitions through several distinct phases from creation to steady-state operation. The pod phase is a high-level summary of where the pod is in its lifecycle, reported in the status.phase field when you query the pod.
Pending is the initial state after pod creation. During this phase, the Kubernetes scheduler is working to find an appropriate node for the pod based on resource requirements, node selectors, affinity rules, and taints/tolerations. The pod manifest has been accepted by the cluster, but one or more containers haven't been created yet. A pod should only remain in Pending briefly—typically seconds to a few minutes depending on cluster load.
ContainerCreating (technically part of the Pending phase but shown separately in some tools) indicates the scheduler has assigned the pod to a node, and the kubelet on that node is actively pulling container images and setting up volumes. This phase duration depends on image size and network speed—a small Alpine-based image might take 5-10 seconds, while a multi-gigabyte application image could take several minutes on first pull.
Running means at least one container has started and is executing. This is the desired steady state for long-running applications like web servers or background workers. The pod remains in Running as long as at least one container is active, even if others have terminated.
Succeeded applies to pods designed to run to completion, such as batch jobs or database migrations. All containers in the pod have terminated successfully with exit code 0, and the pod will not be restarted. This is a terminal state.
Failed indicates all containers in the pod have terminated, and at least one terminated in failure: a non-zero exit code, or termination by the system for policy reasons (such as exceeding resource limits). For pods controlled by a Deployment or ReplicaSet, Kubernetes will typically attempt to restart failed containers based on the pod's restart policy.
Understanding these normal transitions helps you quickly identify abnormal behavior. A pod stuck in Pending for 10 minutes signals a scheduling problem. A pod cycling between Running and Failed every 30 seconds suggests an application crash loop.
Diagnosing "Pending" Pods: When Your Pod Can't Get Scheduled
A pod stuck in the Pending state means the Kubernetes scheduler cannot find a suitable node to run it. This is one of the most common issues in resource-constrained clusters, and the root cause is almost always found in the pod's events.
The primary diagnostic tool for Pending pods is kubectl describe pod:
kubectl describe pod my-app-7d8f9c5b6-xkj2m
Look immediately at the Events section at the bottom of the output. You'll typically see one of these failure patterns:
Insufficient resources is the most common cause. The scheduler cannot find a node with enough CPU or memory to satisfy the pod's resource requests:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
This message tells you exactly what's wrong: all three nodes in your cluster lack sufficient CPU. Check your pod's resource requests in the spec.containers[].resources.requests section. A request of cpu: 4 (4 CPU cores) might be too high for nodes that only have 2 cores each. Either reduce the request or add nodes with more capacity.
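For reference, this is roughly how such a request looks in a pod manifest (illustrative values; a request of cpu: 4 would never fit on 2-core nodes, while 500m would):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: myapp:v1.2.3
    resources:
      requests:
        cpu: "500m"      # half a core; schedulable on a 2-core node
        memory: "256Mi"
```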
Node affinity or anti-affinity rules can prevent scheduling when labels don't match:
Events:
Warning FailedScheduling 1m default-scheduler 0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules.
Review your pod spec for affinity or nodeSelector fields. You might be requiring a label like disktype: ssd that no nodes in your cluster possess. Verify node labels with kubectl get nodes --show-labels.
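If the pod does carry a selector like the following hypothetical example, at least one node must have an exactly matching label:

```yaml
spec:
  nodeSelector:
    disktype: ssd   # scheduling fails unless some node carries this label
```

Running kubectl label nodes node-3 disktype=ssd on a suitable node would satisfy it.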
Taints and tolerations work as node-level gatekeepers. Nodes can be tainted to repel pods unless those pods have matching tolerations:
Events:
Warning FailedScheduling 30s default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Check node taints with kubectl describe node <node-name> and look for the Taints section. Common taints include node.kubernetes.io/not-ready or custom taints like dedicated=gpu:NoSchedule. Add appropriate tolerations to your pod spec if it should run on tainted nodes.
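A toleration matching the dedicated=gpu:NoSchedule taint mentioned above would look like this sketch:

```yaml
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```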
Persistent volume issues prevent scheduling when a pod requires a PersistentVolumeClaim that cannot be satisfied:
Events:
Warning FailedScheduling 45s default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
This indicates your pod references a PVC that either doesn't exist or cannot be bound to an available PersistentVolume. Check PVC status with kubectl get pvc and verify that storage classes are properly configured.
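For comparison, a minimal PVC that binds dynamically through a storage class (assuming a class named standard exists in your cluster; verify with kubectl get storageclass):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
```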
Troubleshooting "ContainerCreating" and "Waiting" States
When a pod moves past Pending but gets stuck in ContainerCreating or shows containers in a Waiting state, the issue has shifted from scheduling to container initialization. These problems typically involve image retrieval, volume mounting, or container runtime issues.
The ContainerCreating phase should complete within a few minutes at most. If a pod remains in this state longer, use kubectl describe pod to examine the Events and container statuses:
kubectl describe pod my-app-7d8f9c5b6-xkj2m
Image pull failures are extremely common, especially with private registries or typos in image names:
Events:
Warning Failed 2m kubelet Failed to pull image "myregistry.io/myapp:v1.2.3": rpc error: code = Unknown desc = Error response from daemon: pull access denied for myregistry.io/myapp, repository does not exist or may require 'docker login'
This error indicates authentication issues or a non-existent image. Verify the image name and tag are correct. For private registries, ensure you've created an imagePullSecret and referenced it in the pod spec:
spec:
  imagePullSecrets:
  - name: my-registry-secret
  containers:
  - name: app
    image: myregistry.io/myapp:v1.2.3
Create the secret with: kubectl create secret docker-registry my-registry-secret --docker-server=myregistry.io --docker-username=user --docker-password=pass
Volume mounting problems occur when the kubelet cannot attach or mount required volumes:
Events:
Warning FailedMount 1m kubelet MountVolume.SetUp failed for volume "config-volume" : configmap "app-config" not found
This example shows a missing ConfigMap. Verify all referenced ConfigMaps, Secrets, and PersistentVolumeClaims exist in the same namespace as the pod. Check with kubectl get configmap,secret,pvc -n <namespace>.
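The failing mount above corresponds to a volume definition along these lines (a sketch; marking the ConfigMap optional lets the pod start without it, at the cost of missing configuration):

```yaml
spec:
  volumes:
  - name: config-volume
    configMap:
      name: app-config   # must exist in the pod's namespace
      optional: true     # pod can start even if the ConfigMap is missing
  containers:
  - name: app
    image: myapp:v1.2.3
    volumeMounts:
    - name: config-volume
      mountPath: /etc/app
```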
Container Waiting states appear in the container status section of kubectl describe pod output:
Containers:
  app:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
The Waiting state with reason CrashLoopBackOff means the container started but immediately crashed, and Kubernetes is backing off before attempting another restart. This is an application-level issue, not an infrastructure problem. Check container logs immediately:
kubectl logs my-app-7d8f9c5b6-xkj2m
If the container crashes before logging anything, you may need to check previous container instances:
kubectl logs my-app-7d8f9c5b6-xkj2m --previous
This retrieves logs from the last terminated container, which often contains the startup error that caused the crash.
The Essential Toolkit: Mastering kubectl for Pod Debugging
The kubectl command-line tool is your primary interface for Kubernetes debugging. While many graphical dashboards exist, experienced engineers rely on kubectl for its speed, scriptability, and comprehensive access to cluster state. Mastering three core commands—describe, logs, and exec—solves approximately 80% of pod debugging scenarios.
Inspecting Pod Details with kubectl describe pod
The kubectl describe pod command provides a comprehensive, human-readable summary of a pod's current state, configuration, and recent events. This single command often contains all the information needed to diagnose a problem.
kubectl describe pod my-app-7d8f9c5b6-xkj2m -n production
The output is organized into several critical sections. Name and Namespace confirm you're examining the correct pod. Node shows which node is running the pod—essential for node-level debugging. Status provides the high-level phase (Pending, Running, Failed, etc.).
The Conditions section contains boolean flags that indicate pod readiness:
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
A pod with Ready: False cannot receive traffic from Services. The ContainersReady: False condition indicates at least one container is not passing its readiness probe. Cross-reference this with the container states below to identify which container is failing.
The Containers section details each container's state, image, ports, and resource configuration:
Containers:
  app:
    Container ID:   containerd://a1b2c3d4e5f6
    Image:          myapp:v1.2.3
    Image ID:       myregistry.io/myapp@sha256:abcd1234...
    Port:           8080/TCP
    State:          Running
      Started:      Mon, 15 Jan 2024 10:23:45 -0800
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  256Mi
    Liveness:   http-get http://:8080/health delay=10s timeout=1s period=10s
    Readiness:  http-get http://:8080/ready delay=5s timeout=1s period=5s
Pay attention to Restart Count. A count above zero indicates the container has crashed and been restarted. High restart counts (10+) suggest a persistent crash loop. The State field shows current status—Running, Waiting, or Terminated. For Terminated containers, you'll see the exit code and reason.
The Events section at the bottom provides a chronological timeline of significant pod activities:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned production/my-app-7d8f9c5b6-xkj2m to node-3
Normal Pulling 5m kubelet Pulling image "myapp:v1.2.3"
Normal Pulled 4m kubelet Successfully pulled image
Normal Created 4m kubelet Created container app
Normal Started 4m kubelet Started container app
Warning Unhealthy 3m (x12 over 4m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
This timeline is invaluable for understanding what happened and when. In this example, the container started successfully but is failing readiness probes with HTTP 503 errors. The application is running but not ready to serve traffic—likely still initializing or unable to connect to a dependency like a database.
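If the application legitimately needs longer to initialize, loosening the probe timing (or adding a startupProbe) avoids these 503 failures during warm-up. A sketch, with illustrative values:

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30   # was 5s; give slow dependencies time to come up
  periodSeconds: 5
  failureThreshold: 6
```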
Uncovering Application Errors with kubectl logs
Container logs are the primary source of application-level debugging information. The kubectl logs command streams stdout and stderr from containers, providing access to application error messages, stack traces, and diagnostic output.
Basic usage retrieves logs from a single-container pod:
kubectl logs my-app-7d8f9c5b6-xkj2m
For pods with multiple containers, specify which container's logs to retrieve:
kubectl logs my-app-7d8f9c5b6-xkj2m -c app
kubectl logs my-app-7d8f9c5b6-xkj2m -c sidecar
The -f flag follows logs in real-time, similar to tail -f:
kubectl logs -f my-app-7d8f9c5b6-xkj2m
This is essential for observing application behavior during startup or while reproducing an issue. Press Ctrl+C to stop following.
When a container is crash-looping, the current container instance may not have logs yet, or may have crashed before logging anything useful. Use --previous to retrieve logs from the last terminated instance:
kubectl logs my-app-7d8f9c5b6-xkj2m --previous
This often reveals the startup error that caused the crash:
2024-01-15T10:23:47Z [ERROR] Failed to connect to database: connection refused at postgres.production.svc.cluster.local:5432
2024-01-15T10:23:47Z [FATAL] Cannot start application without database connection
Warning: Logs are stored on the node's disk and are subject to rotation. By default, Kubernetes retains logs only until the container is removed or the log file reaches 10MB. For long-term log retention, implement a centralized logging solution like ELK stack, Loki, or a cloud-native logging service.
Common log troubleshooting scenarios:
Empty logs might indicate the application hasn't started yet, or is logging to a file instead of stdout/stderr. Kubernetes only captures stdout and stderr. If your application logs to /var/log/app.log, you won't see those logs with kubectl logs. Either reconfigure the application to log to stdout or use kubectl exec to examine log files directly.
Truncated logs occur when you're viewing a high-volume log stream. Use --tail to limit output:
kubectl logs my-app-7d8f9c5b6-xkj2m --tail=100
This shows only the last 100 lines, which is usually sufficient for recent errors.
Logs from all pods in a deployment can be retrieved using label selectors:
kubectl logs -l app=my-app --all-containers=true
This aggregates logs from all pods matching the label app=my-app, useful for distributed debugging across replicas.
Executing Commands Inside Running Containers with kubectl exec
The kubectl exec command allows you to execute arbitrary commands inside a running container, effectively giving you shell access to the container's environment. This is invaluable for inspecting file systems, testing network connectivity, and verifying runtime configuration.
To start an interactive shell session:
kubectl exec -it my-app-7d8f9c5b6-xkj2m -- /bin/bash
The -it flags allocate an interactive terminal. If the container doesn't have bash, try /bin/sh:
kubectl exec -it my-app-7d8f9c5b6-xkj2m -- /bin/sh
Once inside, you can navigate the filesystem, check environment variables, and run diagnostic commands:
# Inside the container
pwd
ls -la /app
env | grep DATABASE
cat /etc/resolv.conf
For multi-container pods, specify the container with -c:
kubectl exec -it my-app-7d8f9c5b6-xkj2m -c app -- /bin/bash
You can also execute single commands without an interactive session:
kubectl exec my-app-7d8f9c5b6-xkj2m -- ps aux
kubectl exec my-app-7d8f9c5b6-xkj2m -- cat /app/config.yaml
kubectl exec my-app-7d8f9c5b6-xkj2m -- curl -v http://api-service:8080/health
This is particularly useful for testing network connectivity from within the pod's network namespace:
# Test DNS resolution
kubectl exec my-app-7d8f9c5b6-xkj2m -- nslookup postgres.production.svc.cluster.local
# Test connectivity to another service
kubectl exec my-app-7d8f9c5b6-xkj2m -- wget -O- http://backend-service:3000/api/status
# Check if a port is listening
kubectl exec my-app-7d8f9c5b6-xkj2m -- netstat -tlnp
Security considerations: kubectl exec provides powerful access to container internals and should be restricted in production environments. Use RBAC policies to limit which users can execute commands in which namespaces. Audit all kubectl exec usage—many compliance frameworks require logging of interactive shell access. Some organizations disable kubectl exec entirely in production and rely on other debugging methods.
Note: kubectl exec only works on running containers. If your container is crash-looping or stuck in ContainerCreating, you cannot exec into it. In these cases, use ephemeral debug containers or debug pod copies instead.
Advanced Pod Debugging Techniques
When standard kubectl commands don't provide sufficient insight, advanced debugging techniques offer deeper access to pod behavior without disrupting production workloads. These methods are particularly valuable when dealing with minimal container images, networking issues, or intermittent failures that are difficult to reproduce.
Leveraging Ephemeral Debug Containers for Non-Intrusive Inspection
Ephemeral debug containers are temporary containers added to a running pod specifically for debugging purposes. Introduced as beta in Kubernetes 1.23, they solve a critical problem: how do you debug a container built from a minimal image like scratch or distroless that contains no shell or diagnostic tools?
Traditional debugging with kubectl exec fails when the target container has no shell:
kubectl exec -it my-app-7d8f9c5b6-xkj2m -- /bin/bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "abc123": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
Ephemeral containers bypass this limitation by injecting a fully-featured debug image alongside your application container:
kubectl debug my-app-7d8f9c5b6-xkj2m -it --image=ubuntu:22.04 --target=app
This command creates a new ephemeral container running Ubuntu 22.04 in the same pod. The --target=app flag is crucial—it shares the process namespace with the target container, allowing you to see and interact with the application's processes.
Once inside the debug container, you have access to a full Linux environment with tools like ps, netstat, curl, and package managers:
# List all processes in the pod, including those from the target container
ps aux
# Inspect network connections
netstat -tlnp
# Install additional tools as needed
apt-get update && apt-get install -y tcpdump
# Capture network traffic
tcpdump -i any -w /tmp/capture.pcap
Common debug images and their use cases:
ubuntu:22.04 or debian:bullseye provide full-featured environments with apt package management. Use these when you need to install specialized tools or have complex debugging requirements.
busybox is a minimal image (1-5MB) with basic Unix utilities. Perfect for quick network tests or file inspection when you don't need a full distribution.
nicolaka/netshoot is purpose-built for network debugging, including tools like tcpdump, curl, wget, dig, nslookup, netstat, iperf, and more. This is the go-to image for diagnosing connectivity issues.
Ephemeral containers have important limitations. They cannot be removed once added (they exist until the pod is deleted), they don't persist across pod restarts, and they share the pod's resource limits—adding a memory-hungry debug container could trigger OOMKills.
Warning: Ephemeral containers are controlled by the EphemeralContainers feature gate, which has been enabled by default since Kubernetes 1.23 (beta); the feature became stable in 1.25. On older clusters, you may need to enable it explicitly or use alternative debugging methods.
Creating Debug Copies of Pods for Controlled Experimentation
Sometimes you need to modify a pod's configuration to isolate an issue—changing the container image, altering the startup command, or adding sidecar containers. Modifying a production pod directly is risky, but creating a debug copy allows safe experimentation without affecting the original workload.
The kubectl debug command can create a modified copy of a pod:
kubectl debug my-app-7d8f9c5b6-xkj2m --copy-to=my-app-debug --set-image=app=ubuntu:22.04
This creates a new pod named my-app-debug with the app container's image replaced by Ubuntu. The original pod continues running unchanged. This technique is useful when you suspect an application bug and want to replace the application container with a shell for manual testing:
# Create a copy with a different image and override the command
kubectl debug my-app-7d8f9c5b6-xkj2m --copy-to=my-app-debug --set-image=app=ubuntu:22.04 -- sleep infinity
Now you can exec into the debug copy and manually test application startup:
kubectl exec -it my-app-debug -- /bin/bash
# Manually run the application with debug flags
/app/start.sh --debug --verbose
You can also create a copy with modified environment variables or command arguments to test configuration changes:
kubectl debug my-app-7d8f9c5b6-xkj2m --copy-to=my-app-debug --env="LOG_LEVEL=debug"
When to use debug copies versus ephemeral containers: Use ephemeral containers when you need to inspect a running application without changing it. Use debug copies when you need to modify the pod configuration, test different images, or repeatedly start/stop containers during troubleshooting. Debug copies consume additional cluster resources and should be deleted after debugging is complete.
Node-Level Debugging: When the Problem Isn't Just the Pod
Some issues originate at the node level rather than within the pod itself. Disk pressure, network configuration problems, kernel issues, or kubelet failures require node-level investigation. Understanding the relationship between pods and nodes is essential for comprehensive debugging.
First, identify which node is hosting your problematic pod:
kubectl get pod my-app-7d8f9c5b6-xkj2m -o wide
The output includes the NODE column:
NAME READY STATUS RESTARTS AGE IP NODE
my-app-7d8f9c5b6-xkj2m 1/1 Running 0 2h 10.244.2.5 node-3
Common node-level issues include:
Disk pressure prevents new pods from being scheduled and can cause running pods to be evicted. Check node conditions:
kubectl describe node node-3
Look for conditions like DiskPressure: True or MemoryPressure: True. The Allocated Resources section shows how much CPU and memory is committed versus available.
Network connectivity problems at the node level affect all pods on that node. If multiple pods on the same node are experiencing network issues, the problem likely resides in the node's network configuration or CNI plugin.
To access the node for deeper investigation, you traditionally need SSH access:
ssh user@node-3-ip-address
Once on the node, check:
# Disk space
df -h
# System resource usage
top
free -m
# Kubelet logs (systemd-based systems)
journalctl -u kubelet -n 100
# Container runtime logs
journalctl -u containerd -n 100
# Network interfaces and routing
ip addr
ip route
However, SSH access requires proper network configuration, firewall rules, and key management. Many production clusters restrict SSH access for security reasons, and cloud-managed Kubernetes services often don't provide direct node access at all.
Pro tip: Solutions that use reverse TCP connections, like OpsSqad, eliminate the need for complex firewall configurations or VPN setups. The agent on your node establishes an outbound connection to the control plane, allowing secure command execution without exposing SSH ports or managing inbound access rules. This architecture is particularly valuable in multi-cloud or hybrid environments where traditional remote access is cumbersome.
Kubernetes also provides a node debugging feature that creates a privileged pod with access to the node's filesystem:
kubectl debug node/node-3 -it --image=ubuntu:22.04
This creates a pod on the specified node with the host's filesystem mounted at /host, allowing you to inspect node-level files and configurations without SSH:
# Inside the debug pod
chroot /host
systemctl status kubelet
journalctl -u kubelet -n 50
This approach works even when SSH is unavailable, making it valuable for managed Kubernetes services like EKS, GKE, or AKS.
Addressing Specific Debugging Challenges and Error Patterns
Real-world Kubernetes debugging often involves recognizing common error patterns and applying targeted solutions. This section addresses the most frequent debugging scenarios DevOps engineers encounter in production environments.
Debugging Application Crashes and Exit Codes
When a container terminates unexpectedly, the exit code provides critical information about why it failed. Exit codes are integers between 0 and 255, where 0 indicates success and any non-zero value indicates failure.
Check the exit code using kubectl describe pod:
Containers:
  app:
    State:          Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 15 Jan 2024 14:23:10 -0800
      Finished:     Mon, 15 Jan 2024 14:23:45 -0800
Common exit codes and their meanings:
Exit Code 0: Normal termination. The container completed its work successfully. This is expected for batch jobs and init containers.
Exit Code 1: Generic application error. The application exited due to an error condition. Check logs with kubectl logs --previous to see the error message. Common causes include configuration errors, failed dependency connections, or unhandled exceptions.
Exit Code 137: The container was killed with SIGKILL (137 = 128 + signal 9). In Kubernetes this most often means the OOM (Out Of Memory) killer terminated the container for exceeding its memory limit; the kernel forcibly killed the process to prevent node instability. Increase the container's memory limit or optimize application memory usage.
Exit Code 143: The container received SIGTERM and exited gracefully. This is normal during pod deletion or rolling updates. If you see this unexpectedly, check if something is sending termination signals to your pod.
Exit Code 126: The container's entrypoint command cannot be executed. Often caused by incorrect file permissions. Verify the entrypoint script has execute permissions in the container image.
Exit Code 127: The container's entrypoint command was not found. The specified command doesn't exist in the container. Check for typos in the command or verify the command is installed in the image.
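These conventions come from POSIX shells (codes above 128 mean the process died from a signal: 128 plus the signal number) and can be reproduced locally with plain sh, no cluster required:

```shell
# Reproduce common container exit codes with plain sh
sh -c 'exit 0'; echo "clean exit:        $?"
sh -c 'exit 1'; echo "generic error:     $?"
sh -c 'kill -KILL $$' 2>/dev/null; echo "SIGKILL (OOM-style): $?"   # 128 + 9  = 137
sh -c 'kill -TERM $$' 2>/dev/null; echo "SIGTERM:             $?"   # 128 + 15 = 143
sh -c 'no-such-cmd-xyz' 2>/dev/null; echo "not found:           $?" # 127
```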
To investigate crashes, retrieve logs from the terminated container:
kubectl logs my-app-7d8f9c5b6-xkj2m --previous
Look for stack traces, error messages, or warnings immediately before termination. For intermittent crashes, enable application-level crash dumps or increase log verbosity to capture more diagnostic information.
If the application crashes immediately on startup before producing logs, create a debug copy with a different command to investigate:
kubectl debug my-app-7d8f9c5b6-xkj2m --copy-to=my-app-debug -- sleep infinity
kubectl exec -it my-app-debug -- /bin/bash
# Manually execute the startup command with debugging enabled
Troubleshooting Pod Network Issues
Network problems are among the most challenging Kubernetes issues to debug because they involve multiple layers: pod networking, Services, DNS, NetworkPolicies, and external connectivity. A systematic approach is essential.
Pod-to-pod connectivity failures prevent applications from communicating within the cluster. Test connectivity by executing curl or wget from within the pod:
kubectl exec my-app-7d8f9c5b6-xkj2m -- curl -v http://10.244.3.8:8080
If the connection fails, verify the target pod's IP address is correct and the pod is running. Check if NetworkPolicies are blocking traffic:
kubectl get networkpolicy -n production
kubectl describe networkpolicy allow-from-frontend -n production
NetworkPolicies use label selectors to allow or deny traffic. Ensure your pods have the correct labels to match the policy rules.
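As a sketch, a policy like the following admits traffic to backend pods only from pods labeled role: frontend; a client missing that label sees its connections silently dropped (labels and ports here are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: backend        # the policy applies to these pods
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend  # only these clients are allowed
    ports:
    - port: 8080
      protocol: TCP
```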
Service connectivity issues occur when pods cannot reach a Service by its DNS name. First, verify the Service exists and has endpoints:
kubectl get service backend-api -n production
kubectl get endpoints backend-api -n production
If the Endpoints list is empty, no pods match the Service's selector. Check the Service selector against your pod labels:
kubectl describe service backend-api -n production
kubectl get pods -n production --show-labels
Test DNS resolution from within the pod:
kubectl exec my-app-7d8f9c5b6-xkj2m -- nslookup backend-api.production.svc.cluster.local
If DNS resolution fails, check the cluster's DNS service (typically CoreDNS):
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
External connectivity problems prevent pods from reaching services outside the cluster. Test with curl:
kubectl exec my-app-7d8f9c5b6-xkj2m -- curl -v https://api.external-service.com
If this fails but the same curl command works from the node, check egress NetworkPolicies or firewall rules. Verify the pod's DNS can resolve external domains:
kubectl exec my-app-7d8f9c5b6-xkj2m -- nslookup google.com
For detailed network debugging, use a debug container with network tools:
kubectl debug my-app-7d8f9c5b6-xkj2m -it --image=nicolaka/netshoot --target=app
# Inside the debug container
# Test connectivity with traceroute
traceroute backend-api.production.svc.cluster.local
# Capture packets to analyze traffic
tcpdump -i any -n port 8080
# Check routing table
ip route
Warning: Network debugging can be time-consuming because issues may be intermittent or dependent on specific timing or load conditions. Reproduce the issue consistently before attempting fixes, and make changes incrementally while testing after each change.
Handling Resource Constraints and OOMKilled Pods
Resource management is critical for stable Kubernetes operations. Pods that exceed their memory limits are terminated by the OOM killer, while pods that exceed CPU limits are throttled, causing performance degradation.
When a pod is OOMKilled, you'll see this in the container state:
kubectl describe pod my-app-7d8f9c5b6-xkj2m
Containers:
  app:
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  5
The Restart Count incrementing rapidly indicates a persistent memory issue. The container starts, consumes memory until it hits the limit, gets killed, and restarts in a loop.
Check the container's memory limit and actual usage:
Limits:
  memory: 256Mi
Requests:
  memory: 128Mi
If your application genuinely needs more memory, increase the limit:
resources:
  limits:
    memory: 512Mi
  requests:
    memory: 256Mi
However, increasing limits without understanding the root cause can lead to node instability. Investigate why the application is consuming excessive memory:
- Check for memory leaks by monitoring memory growth over time
- Review application logs for excessive object creation or caching
- Profile the application using language-specific memory profilers
- Analyze heap dumps if available
CPU throttling is more subtle than OOM kills. A throttled pod remains running but responds slowly. Check if CPU throttling is occurring by examining cgroup metrics on the node:
kubectl debug node/node-3 -it --image=ubuntu:22.04
# Inside the debug pod
cat /host/sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/cpu.stat
Look for nr_throttled and throttled_time values (the path shown is for cgroup v1; it differs under cgroup v2). High throttling indicates the container is hitting its CPU limit frequently.
Set appropriate resource requests and limits based on actual application behavior:
- Requests should reflect the minimum resources needed for the application to run acceptably
- Limits should provide headroom for traffic spikes but prevent runaway resource consumption
Use tools like Vertical Pod Autoscaler (VPA) or metrics from Prometheus to determine optimal values based on historical usage patterns. A good starting point is setting requests at the 50th percentile of actual usage and limits at the 95th percentile.
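Under that 50th/95th-percentile rule of thumb, deriving candidate values from sampled usage is simple enough to sketch in shell (the samples below are hypothetical memory readings in MiB, e.g. exported from Prometheus):

```shell
# Pick request = p50 and limit = p95 from hypothetical memory samples (MiB)
samples="210 180 195 240 260 300 220 190 205 480"
sorted=$(printf '%s\n' $samples | sort -n)
count=$(printf '%s\n' $samples | wc -l)
p50_idx=$(( (count * 50 + 99) / 100 ))   # ceiling index for the 50th percentile
p95_idx=$(( (count * 95 + 99) / 100 ))   # ceiling index for the 95th percentile
p50=$(printf '%s\n' $sorted | sed -n "${p50_idx}p")
p95=$(printf '%s\n' $sorted | sed -n "${p95_idx}p")
echo "request=${p50}Mi limit=${p95}Mi"
```

Note how the single 480 MiB spike lands in the limit but not the request, which is exactly the headroom behavior the guideline intends.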
Skip the Manual Work: How OpsSqad Automates Kubernetes Pod Debugging
You've just learned powerful techniques for debugging Kubernetes pods—from interpreting pod states and executing kubectl commands to leveraging ephemeral containers and node-level inspection. These skills are essential for any DevOps engineer, but executing them manually in production environments is time-consuming and error-prone. What took you 15 minutes of running kubectl commands, analyzing outputs, and cross-referencing documentation could take 90 seconds with the right automation.
The OpsSqad Advantage: Instant Access and AI-Powered Resolution
OpsSqad fundamentally changes how you interact with Kubernetes infrastructure through its reverse TCP architecture. Instead of configuring VPNs, opening firewall ports, or managing SSH keys, you install a lightweight node agent that establishes an outbound connection to the OpsSqad cloud platform. This means you can access any cluster from anywhere—whether it's behind corporate firewalls, in air-gapped environments, or distributed across multiple cloud providers—without complex networking setup.
Once connected, OpsSqad's AI agents execute the exact debugging workflows you've learned in this guide, but through natural language conversation. The K8s Squad is specifically trained on Kubernetes troubleshooting patterns, kubectl command syntax, and common failure scenarios. It combines the expertise of the debugging techniques covered in this article with the speed of automation.
Your 5-Step Journey to Effortless Kubernetes Debugging with OpsSqad
Step 1: Create Your Free Account & Install a Node
Sign up at app.opssquad.ai and navigate to the Nodes section in your dashboard. Create a new Node with a descriptive name like "production-k8s-cluster". The dashboard generates unique credentials: a Node ID and authentication token that you'll use for installation.
SSH to a server that has kubectl access to your cluster (this could be a bastion host, a control plane node, or any machine with kubeconfig configured). Install the OpsSqad agent:
curl -fsSL https://install.opssquad.ai/install.sh | bash
opssquad node install --node-id=<your-node-id-from-dashboard> --token=<your-token-from-dashboard>
opssquad node start

The agent establishes an outbound reverse TCP connection to OpsSqad's cloud platform. No inbound firewall rules required, no VPN configuration needed. The connection is encrypted and authenticated using the credentials you generated.
Step 2: Deploy the OpsSqad Agent
The installation process automatically registers your node with the OpsSqad platform. Verify the connection in your dashboard—you should see the node status change to "Connected" with a green indicator. The agent is now ready to receive commands from your Squads.
Step 3: Deploy the K8s Squad
Navigate to the Squad Marketplace in your OpsSqad dashboard. Browse available Squads and select the "K8s Troubleshooting Squad". Click Deploy to create your private instance. This Squad includes multiple AI agents specialized in different aspects of Kubernetes debugging: pod lifecycle analysis, network troubleshooting, resource management, and log analysis.
The deployment process takes about 30 seconds. The Squad initializes with pre-configured command whitelists (only approved kubectl commands can be executed), sandboxing rules (commands run in isolated contexts), and audit logging (every action is recorded for compliance).
Step 4: Link Agents to Nodes
Open your deployed K8s Squad and navigate to the Agents tab. You'll see a list of AI agents within the Squad. Click "Manage Access" and grant the agents permission to execute commands on your Kubernetes node. This establishes the link between the AI agents and your infrastructure.
The permission model is granular—you can restrict which namespaces the Squad can access, which types of commands are allowed, and set time-based access windows. For initial testing, grant broad access, then tighten permissions based on your organization's security requirements.
Step 5: Start Debugging via Natural Language Chat
Navigate to chat.opssquad.ai and select your K8s Squad from the Squad selector. You're now ready to debug using natural language instead of manual kubectl commands.
Real-World Debugging with the K8s Squad: A Chat Interaction Example
Let's walk through a realistic debugging scenario. Your payment-processor pod in the production namespace is stuck in CrashLoopBackOff, and you need to identify the root cause quickly.
You: "K8s Squad, my payment-processor pod in production namespace is crash-looping. What's causing it?"
K8s Squad:
[INFO] Analyzing pod payment-processor in namespace production...
[INFO] Executing: kubectl get pods -n production -l app=payment-processor
[INFO] Found pod: payment-processor-7d9c8f5b6-xkj2m
[INFO] Status: CrashLoopBackOff (Restart count: 8)
[INFO] Executing: kubectl describe pod payment-processor-7d9c8f5b6-xkj2m -n production
[INFO] Container State: Waiting (Reason: CrashLoopBackOff)
[INFO] Last Termination: Exit Code 1 (Error)
[INFO] Executing: kubectl logs payment-processor-7d9c8f5b6-xkj2m -n production --previous
[INFO] Retrieved logs from previous container instance:
2024-01-15T18:45:23Z [ERROR] Failed to connect to Redis at redis.production.svc.cluster.local:6379
2024-01-15T18:45:23Z [ERROR] Error: ECONNREFUSED - Connection refused
2024-01-15T18:45:23Z [FATAL] Cannot start payment processor without cache connection
[ANALYSIS] The payment-processor pod is crashing because it cannot connect to Redis.
Root cause: Connection refused to redis.production.svc.cluster.local:6379
Recommended next steps:
1. Verify Redis service is running and healthy
2. Check if Redis endpoints are available
3. Test network connectivity from payment-processor to Redis
You: "Can you check if Redis is running and test connectivity?"
K8s Squad:
[INFO] Executing: kubectl get service redis -n production
[INFO] Service found: redis (ClusterIP: 10.96.45.12, Port: 6379)
[INFO] Executing: kubectl get endpoints redis -n production
[WARNING] Endpoints list is EMPTY - no pods are backing this service!
[INFO] Executing: kubectl get pods -n production -l app=redis
[INFO] Found pod: redis-master-0
[INFO] Status: Pending
[INFO] Reason: 0/3 nodes are available: 3 Insufficient memory.
[ANALYSIS] Redis pod cannot be scheduled due to insufficient memory on all nodes.
This is why payment-processor cannot connect - the Redis service has no healthy endpoints.
Root cause chain:
1. Redis pod stuck in Pending (insufficient memory to schedule)
2. Redis service has no endpoints
3. Payment-processor cannot connect to Redis
4. Payment-processor crashes with connection refused
Recommended actions:
1. Free up memory on cluster nodes or add new nodes
2. Reduce Redis memory requests if they're overprovisioned
3. Check if other pods can be scaled down temporarily
You: "What are Redis's memory requests? Can we reduce them?"
K8s Squad:
[INFO] Executing: kubectl get pod redis-master-0 -n production -o jsonpath='{.spec.containers[0].resources}'
[INFO] Current resource configuration:
Requests:
memory: 4Gi
cpu: 1000m
Limits:
memory: 4Gi
cpu: 2000m
[INFO] Executing: kubectl top nodes
[INFO] Node memory availability:
node-1: 3.2Gi available / 8Gi total
node-2: 2.8Gi available / 8Gi total
node-3: 3.5Gi available / 8Gi total
[ANALYSIS] Redis is requesting 4Gi memory but no node has 4Gi available.
Checking Redis actual usage from previous runs...
[INFO] Based on historical metrics, Redis typically uses 1.5-2Gi in this environment.
The 4Gi request appears overprovisioned.
Would you like me to create a debug copy of the Redis pod with reduced memory requests (2Gi) to test if it schedules successfully?
You: "Yes, create the debug copy with 2Gi memory."
K8s Squad:
[INFO] Creating debug copy of Redis pod with adjusted resources...
[INFO] Executing: kubectl debug pod/redis-master-0 --copy-to=redis-debug -n production
[INFO] Modifying resource requests in debug copy...
[SUCCESS] Created pod redis-debug with memory request: 2Gi
[INFO] Monitoring scheduling...
[INFO] Pod redis-debug: Status changed to Running
[INFO] Container started successfully on node-1
[INFO] Testing Redis connectivity from debug pod:
[INFO] Executing: kubectl exec redis-debug -n production -- redis-cli ping
[INFO] Response: PONG
[SUCCESS] Redis debug pod is running and responding to health checks.
You can now update the Redis StatefulSet to use 2Gi memory requests instead of 4Gi.
Suggested next steps:
1. Update Redis StatefulSet resource requests to 2Gi
2. Delete the debug pod once verified
3. Monitor payment-processor for successful connection
This conversation demonstrates how OpsSqad transforms complex multi-step debugging workflows into natural language interactions. The K8s Squad executed eight different kubectl commands, analyzed outputs, identified the root cause chain, and proposed a solution—all in under two minutes. The same investigation manually would involve:
- Running kubectl get pods to identify the pod name
- Running kubectl describe pod to check status
- Running kubectl logs --previous to examine crash logs
- Recognizing the Redis connection error
- Running kubectl get service to verify the Redis service
- Running kubectl get endpoints to check for backing pods
- Running kubectl get pods filtered for Redis
- Running kubectl describe pod on the Redis pod
- Running kubectl top nodes to check available resources
- Manually calculating if resources are available
- Looking up historical metrics for typical Redis usage
- Crafting a kubectl debug command with correct syntax
- Monitoring the debug pod status
- Testing the debug pod functionality
Each step requires remembering command syntax, interpreting output, and deciding the next action. With OpsSqad, you describe the problem in plain English and receive actionable insights backed by real command execution.
The reverse TCP architecture means this works identically whether your cluster is in AWS, GCP, Azure, on-premises, or across multiple environments. The command whitelisting ensures only approved operations execute, the sandboxing prevents accidental damage, and the audit logging provides complete traceability for compliance requirements. What previously required VPN access, kubectl expertise, and 15 minutes of manual work now takes a single chat message and 90 seconds.
Prevention and Best Practices for Robust Kubernetes Deployments
Effective debugging skills are essential, but preventing issues before they reach production is even more valuable. Implementing proactive monitoring, proper resource management, and deployment best practices reduces the frequency and severity of pod failures.
Implementing Health Checks and Readiness Probes
Kubernetes health checks allow the platform to automatically detect and respond to application failures without manual intervention. Two types of probes serve different purposes: liveness probes detect when an application is broken and needs restarting, while readiness probes determine when an application is ready to receive traffic.
A liveness probe tells Kubernetes whether a container is running properly. If the probe fails repeatedly, Kubernetes kills and restarts the container. Configure liveness probes to detect application deadlocks or unrecoverable errors:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3

This configuration waits 30 seconds after container start (giving the application time to initialize), then checks the /health endpoint every 10 seconds. If three consecutive checks fail, Kubernetes restarts the container.
A readiness probe determines whether a container should receive traffic from Services. A failing readiness probe removes the pod from Service endpoints without restarting it:
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2

The readiness probe typically has a shorter initial delay than the liveness probe because you want to start routing traffic as soon as the application is ready. Use readiness probes to handle temporary conditions like database connection pool initialization or cache warming.
Best practices for probe configuration:
- Set initialDelaySeconds longer than your application's typical startup time to prevent premature failures
- Use different endpoints for liveness and readiness—the liveness check should be simple (application is alive), while readiness can check dependencies
- Avoid expensive operations in probe handlers; they execute frequently and should complete in milliseconds
- Set appropriate timeoutSeconds values based on expected response times under load
- For applications with long startup times, consider using startup probes (Kubernetes 1.18+) to handle initialization separately
Resource Management: Requests, Limits, and Quotas
Proper resource management prevents resource starvation, node instability, and unpredictable application performance. Kubernetes uses resource requests for scheduling decisions and limits to enforce consumption boundaries.
Resource requests tell the scheduler how much CPU and memory a container needs. The scheduler only places pods on nodes with sufficient unreserved resources:
resources:
requests:
memory: "256Mi"
cpu: "500m"

This requests 256 megabytes of memory and 500 millicores (0.5 CPU cores). Set requests based on typical application usage, not peak usage. Under-requesting leads to oversubscription and potential node instability; over-requesting wastes cluster capacity.
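Kubernetes quantity strings like 500m and 256Mi follow a simple grammar, and a small converter makes the arithmetic concrete. This sketch covers only the common suffixes, not the full Kubernetes quantity specification:

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ('500m' or '2') to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory_mib(quantity):
    """Convert a memory quantity with a binary suffix to MiB.
    Handles only Ki/Mi/Gi/Ti and plain bytes, a subset of the spec."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024, "Ti": 1024 * 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 * 1024)  # plain byte count

print(parse_cpu("500m"))          # 0.5
print(parse_memory_mib("256Mi"))  # 256.0
print(parse_memory_mib("1Gi"))    # 1024.0
```

Converters like this are handy when summing requests across a namespace to check them against a ResourceQuota.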
Resource limits cap how much a container can consume. Exceeding memory limits triggers OOMKills; exceeding CPU limits causes throttling:
resources:
limits:
memory: "512Mi"
cpu: "1000m"

The ratio between limits and requests determines your quality of service (QoS) class. Pods with requests equal to limits get Guaranteed QoS (highest priority during eviction). Pods with requests lower than limits get Burstable QoS. Pods without requests or limits get BestEffort QoS (evicted first under pressure).
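The QoS rules can be sketched as a small classifier. This is a simplification for a single container; the real algorithm inspects every container in the pod and both resource types, and defaults requests to limits when only limits are set:

```python
def qos_class(requests, limits):
    """Simplified mirror of Kubernetes QoS classification for a
    single-container pod (real classification checks all containers)."""
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "500m", "memory": "256Mi"},
                {"cpu": "500m", "memory": "256Mi"}))     # Guaranteed
print(qos_class({"memory": "128Mi"}, {"memory": "512Mi"}))  # Burstable
print(qos_class({}, {}))                                  # BestEffort
```

Under node memory pressure the kubelet evicts BestEffort pods first, then Burstable pods exceeding their requests, which is why pinning requests equal to limits matters for critical workloads.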
ResourceQuotas control total resource consumption at the namespace level:
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "50"
requests.memory: "100Gi"
limits.cpu: "100"
limits.memory: "200Gi"
pods: "50"

This quota prevents the production namespace from consuming more than 50 CPU cores of requests, 100Gi of memory requests, and deploying more than 50 pods. Quotas prevent runaway resource consumption and enforce fair sharing in multi-tenant clusters.
Use LimitRanges to set default requests and limits for containers that don't specify them:
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
memory: "512Mi"
cpu: "1000m"
defaultRequest:
memory: "256Mi"
cpu: "500m"
type: Container

This ensures every container has reasonable defaults, preventing accidental resource starvation or consumption.
Effective Logging and Monitoring Strategies
Comprehensive logging and monitoring enable proactive issue detection before users are affected. Kubernetes generates metrics and logs at multiple layers: cluster components, nodes, pods, and containers.
Implement centralized logging to aggregate logs from all pods and nodes into a searchable system. Popular solutions include:
- ELK Stack (Elasticsearch, Logstash, Kibana): Self-hosted, powerful querying, resource-intensive
- Loki (with Grafana): Lightweight, optimized for Kubernetes, integrates with Prometheus
- Cloud-native services: CloudWatch (AWS), Cloud Logging (GCP), Azure Monitor
Deploy a logging agent (Fluentd, Fluent Bit, or Filebeat) as a DaemonSet to collect logs from every node. Configure structured logging in your applications using JSON format to enable field-based filtering and aggregation.
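A minimal JSON formatter for Python's standard logging module shows the structured-logging idea; field names such as `request_id` are illustrative choices, not a required schema:

```python
import datetime
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so Fluent Bit / Loki can index fields."""
    def format(self, record):
        entry = {
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Pick up structured context passed via logger.info(..., extra={...})
        for key in ("pod", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment-processor")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("processed payment", extra={"request_id": "abc123"})
```

With every log line a self-describing JSON object, the aggregation layer can filter on fields like `level` or `request_id` instead of regex-matching free text.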
Monitoring and alerting should cover both infrastructure and application metrics:
# Prometheus ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: payment-processor-metrics
spec:
selector:
matchLabels:
app: payment-processor
endpoints:
- port: metrics
interval: 30s

Set up alerts for critical conditions:
- Pod crash loops (restart count increasing)
- Failed deployments (pods not reaching Ready state)
- Resource exhaustion (node CPU/memory above 85%)
- Application error rates exceeding thresholds
- Slow response times or increased latency
Use Prometheus with Alertmanager for Kubernetes-native monitoring, or leverage cloud provider monitoring services. Configure alert routing to notify the appropriate teams through Slack, PagerDuty, or other incident management systems.
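As a toy illustration of the crash-loop condition, a threshold check over restart counts looks like this. The snapshot data is hypothetical; in production this would be a PromQL alert over the `kube_pod_container_status_restarts_total` metric from kube-state-metrics:

```python
def crash_loop_alerts(restart_counts, threshold=5):
    """Flag pods whose restart count meets or exceeds the threshold,
    the same condition a crash-loop Prometheus alert would encode."""
    return sorted(
        name for name, restarts in restart_counts.items()
        if restarts >= threshold
    )

# Hypothetical snapshot of per-pod restart counts
snapshot = {
    "payment-processor-7d9c8f5b6-xkj2m": 8,
    "redis-master-0": 0,
    "web-5f6d7": 2,
}
print(crash_loop_alerts(snapshot))  # ['payment-processor-7d9c8f5b6-xkj2m']
```

In practice the alert should fire on the rate of increase over a window rather than the absolute count, so long-lived pods with old restarts don't page anyone.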
Immutable Infrastructure and GitOps Principles
Treating infrastructure as code and implementing GitOps workflows improves reliability, traceability, and disaster recovery capabilities. Every Kubernetes resource should be defined in version-controlled YAML files, never created manually.
GitOps workflow uses Git as the single source of truth:
- All Kubernetes manifests are stored in Git repositories
- Changes go through code review via pull requests
- Automated pipelines deploy changes to clusters
- Cluster state is continuously reconciled with Git state
Tools like ArgoCD or Flux automate this reconciliation:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: payment-processor
spec:
project: production
source:
repoURL: https://github.com/company/k8s-manifests
targetRevision: main
path: apps/payment-processor
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true

This ArgoCD Application continuously monitors the Git repository and automatically applies changes to the cluster. If someone manually modifies a resource in the cluster, ArgoCD reverts it to match the Git state.
Benefits of immutable infrastructure:
- Complete audit trail of all changes through Git history
- Easy rollback to any previous state
- Consistent deployments across environments (dev, staging, production)
- Reduced configuration drift
- Simplified disaster recovery (rebuild cluster from Git)
Implement deployment strategies that minimize risk:
- Rolling updates: Gradually replace old pods with new ones (default Kubernetes behavior)
- Blue-green deployments: Run two identical environments, switch traffic between them
- Canary deployments: Route a small percentage of traffic to the new version, gradually increase
Use Deployment annotations to track change metadata:
metadata:
annotations:
kubernetes.io/change-cause: "Update to v2.3.1 - fixes payment timeout issue"

This appears in rollout history, making it easier to understand what changed and when during debugging sessions.
Conclusion
Mastering Kubernetes pod debugging requires understanding pod lifecycle states, proficiency with kubectl commands, and knowledge of advanced techniques like ephemeral containers and node-level inspection. By systematically applying the diagnostic approaches covered in this guide—from analyzing Pending pods and interpreting exit codes to troubleshooting network issues and resource constraints—you can resolve the vast majority of production issues efficiently. Combining these reactive debugging skills with proactive measures like proper health checks, resource management, and GitOps workflows creates a robust operational foundation.
If you want to automate these debugging workflows and reduce troubleshooting time from 15 minutes to 90 seconds, OpsSqad's K8s Squad transforms manual kubectl commands into natural language conversations backed by AI-powered analysis. Create your free account at app.opssquad.ai and experience how reverse TCP architecture and specialized AI agents can streamline your entire Kubernetes operations workflow.
