Kubernetes Pod Stuck Terminating: Fix & Troubleshoot in 2024
Kubernetes Pod Stuck Terminating: A Deep Dive into Causes and Solutions
A pod stuck in the Terminating state is one of the most common Kubernetes issues that DevOps engineers encounter. When you run kubectl delete pod expecting a clean removal, but the pod remains in limbo for minutes or even hours, it signals an underlying problem that can cascade into broader cluster issues. This comprehensive guide walks you through the diagnostic process, common root causes, and proven solutions for resolving stuck terminating pods.
TL;DR: Pods get stuck terminating due to finalizers blocking deletion, application processes ignoring SIGTERM signals, resource constraints like OOMKilled, volume attachment issues, or network policy misconfigurations. The diagnostic process involves checking pod events with kubectl describe, examining finalizers in the pod spec, reviewing container exit codes, and investigating controller behavior. Solutions range from fixing application shutdown logic to manually removing finalizers or force-deleting pods as a last resort.
Understanding the "Terminating" State
Pods in Kubernetes are designed to be ephemeral, but sometimes they get stuck in the Terminating state, preventing their graceful removal and potentially impacting your cluster's stability. Understanding what happens during the termination lifecycle helps you diagnose why pods fail to complete this process.
What Does "Terminating" Mean?
A pod enters the Terminating state when a deletion request is initiated—either through an explicit kubectl delete pod command, a scaling operation by a controller, or an eviction triggered by resource pressure or node maintenance. The termination process follows a specific sequence designed to give applications time to shut down gracefully.
When Kubernetes marks a pod for deletion, it sets a deletion timestamp on the pod object in the API server; kubectl then displays the pod as Terminating (a display state derived from that timestamp, not a distinct pod phase). The kubelet on the node hosting the pod receives this update and begins the shutdown sequence. First, it sends a SIGTERM signal to the main process in each container, giving the application a chance to close connections, flush buffers, and clean up resources. By default, Kubernetes waits 30 seconds (the terminationGracePeriodSeconds) for containers to exit voluntarily.
If containers haven't exited after the grace period expires, the kubelet sends a SIGKILL signal, which forcefully terminates the processes. Once all containers have stopped, the kubelet performs cleanup tasks like unmounting volumes and removing network configurations. Finally, it reports back to the API server, which removes the pod object from etcd.
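As a rough mental model, the kubelet's escalation from SIGTERM to SIGKILL can be condensed into a toy decision function. The sketch below is illustrative only, not the kubelet's actual implementation; signal_for is a hypothetical helper:

```python
def signal_for(elapsed_seconds: float, grace_period_seconds: int = 30) -> str:
    # SIGTERM is sent at the start of termination; once the grace period
    # elapses without the container exiting, the kubelet escalates to SIGKILL.
    return "SIGTERM" if elapsed_seconds < grace_period_seconds else "SIGKILL"

print(signal_for(0))    # SIGTERM: graceful shutdown begins
print(signal_for(45))   # SIGKILL: default 30s grace period has expired
```

Everything that follows in this guide is about what happens when this escalation stalls, or when it completes but the pod object still is not removed.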
However, this graceful shutdown or forceful termination process can get interrupted at multiple points. A pod stuck in Terminating means the deletion process has stalled somewhere in this sequence, and the pod object persists in the API server despite being marked for removal.
Why is a Stuck Pod a Problem?
Pods stuck in Terminating consume cluster resources and create operational headaches that extend beyond just the individual pod. Even though the containers might have stopped running, the pod object still exists in the API server, and associated resources may remain allocated or locked.
Resource consumption is the most immediate concern. If the pod had PersistentVolumeClaims attached, those volumes might remain locked and unavailable for other pods to use. The pod's IP address stays reserved in the cluster's networking layer, preventing it from being reassigned. On the node itself, remnants of the pod's cgroup configurations or network interfaces might persist, consuming small amounts of memory and creating clutter.
Stuck terminating pods can block scaling operations and deployments. If you're trying to update a Deployment with a RollingUpdate strategy, Kubernetes won't create new pods until old ones terminate successfully. For StatefulSets, the situation is even more critical—StatefulSet pods have stable identities, and a pod stuck in Terminating prevents a new pod with the same identity from being created, effectively breaking your stateful application.
The presence of stuck pods also indicates underlying issues that will likely recur. Whether it's a misconfigured finalizer, a buggy application that doesn't handle shutdown signals, or infrastructure problems with storage or networking, these root causes will continue affecting new pods until addressed. Monitoring systems might generate alerts for these stuck pods, creating noise that obscures more critical issues.
Diagnosing the Root Cause: Initial kubectl Investigations
Before diving into complex scenarios, it's crucial to perform fundamental checks using kubectl to gather initial clues about why a pod might be stuck. The diagnostic process should be systematic, starting with broad observations and narrowing down to specific issues.
Checking Pod Status and Events
The first step in diagnosing a Kubernetes pod stuck terminating is gathering comprehensive information about the pod's current state and the events that led to its stuck status. Kubernetes maintains a rich event log that provides crucial insights into what's happening during the termination process.
Start with a broad view of all pods to identify which ones are stuck:
kubectl get pods -A -o wide | grep Terminating

This command shows all pods across all namespaces that are in the Terminating state, along with the nodes they're scheduled on. Note that the AGE column shows time since the pod was created, not time spent terminating; to see how long deletion has been pending, check the pod's deletionTimestamp. A pod that remains in Terminating for more than a few minutes typically indicates a problem.
Next, examine the specific pod in detail:
kubectl describe pod <pod-name> -n <namespace>

The output from kubectl describe is your most valuable diagnostic tool. The Events section at the bottom chronologically lists everything Kubernetes has done with this pod. Look for specific patterns:
Events:
  Type     Reason       Age               From     Message
  ----     ------       ----              ----     -------
  Normal   Killing      5m                kubelet  Stopping container nginx
  Warning  FailedMount  3m (x12 over 5m)  kubelet  Unable to attach or mount volumes
  Warning  FailedSync   1m (x25 over 5m)  kubelet  Error syncing pod
Common event messages that indicate termination issues include "FailedMount" (volume attachment problems), "FailedSync" (kubelet unable to reconcile pod state), and repeated "Killing" messages without corresponding container exits.
The pod's metadata section in the describe output also reveals critical information. Check the DeletionTimestamp field—this shows when the deletion was initiated. If this timestamp is more than a few minutes old, you have a stuck pod. Also examine the Finalizers field:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

If this returns any values, those finalizers are likely preventing deletion. We'll cover finalizers in depth later, but their presence is an immediate clue.
Container logs can reveal application-level issues preventing graceful shutdown:
kubectl logs <pod-name> -n <namespace> -c <container-name>

If the container has already exited, retrieve the previous container's logs:
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

Look for error messages around the time the termination began, particularly errors related to database disconnections, file I/O operations, or signal handling. Applications that log "Received SIGTERM" or "Shutting down gracefully" indicate they're attempting to respond to the termination signal, while absence of such messages suggests the application isn't handling signals properly.
Understanding Container Exit Codes
When a container terminates, it does so with an exit code that provides direct insights into how and why the termination occurred. These numeric codes follow standard Unix conventions, with Kubernetes-specific interpretations for certain values.
Exit code 0 indicates successful termination—the process completed its work and exited cleanly. This is what you want to see during normal pod shutdowns.
Exit code 1 is a general error code that applications use to indicate something went wrong. The specific meaning depends entirely on the application, so you'll need to check container logs to understand what failed.
Exit code 137 is particularly significant in Kubernetes contexts. This value equals 128 + 9, where 9 is the signal number for SIGKILL. When you see exit code 137, it means the container was forcefully killed, most commonly because it exceeded its memory limit and was terminated by the OOM killer. You can confirm this by checking for "OOMKilled" in the pod's status:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

Exit code 143 equals 128 + 15, where 15 is the signal number for SIGTERM. This indicates the container received the graceful shutdown signal and exited in response to it. If a pod is stuck terminating even though its containers show exit code 143, the containers themselves shut down cleanly, so the blockage lies later in the deletion sequence, typically in finalizers or volume cleanup rather than in the application.
Exit code 130 (128 + 2) indicates SIGINT was received, typically from a Ctrl+C or interrupt signal. This is less common in production Kubernetes but can occur during manual interventions.
To check exit codes for all containers in a pod:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'

Understanding these exit codes helps you quickly categorize the problem: resource constraints (137), SIGTERM-related shutdown behavior (143), or application errors (1 or other non-standard codes).
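The 128-plus-signal convention is easy to decode mechanically. This small Python sketch (decode_exit_code is a hypothetical helper, not part of kubectl) maps the codes discussed above:

```python
import signal

def decode_exit_code(code: int) -> str:
    # Exit codes above 128 follow the Unix convention: 128 + signal number.
    if code == 0:
        return "clean exit"
    if code > 128:
        name = signal.Signals(code - 128).name  # e.g. 137 - 128 = 9 -> SIGKILL
        return f"killed by {name}"
    return f"application error (exit code {code})"

print(decode_exit_code(137))  # killed by SIGKILL (often the OOM killer)
print(decode_exit_code(143))  # killed by SIGTERM
print(decode_exit_code(130))  # killed by SIGINT
```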
Pro tip: Automate initial checks with OpsSqad
Instead of running multiple kubectl commands and correlating their outputs, you can ask OpsSqad's K8s Squad to gather this information for you in a single chat interaction, receiving a synthesized analysis of pod status, events, finalizers, and exit codes in seconds.
Common Culprits: Resource Constraints and Application Issues
Many pods get stuck terminating due to resource limitations or application-level problems that prevent them from shutting down cleanly. These issues are among the most frequent causes and often the easiest to identify and resolve.
The Dreaded OOMKilled
One of the most common reasons for pods being forcefully terminated is running out of memory. While OOMKilled pods usually terminate eventually, they can get stuck if the cleanup process encounters issues or if the node itself is under severe memory pressure.
What is OOMKilled? When a container exceeds its allocated memory limit, the Linux kernel's Out-Of-Memory killer intervenes to protect the system. The OOM killer selects processes to terminate based on various factors, including how much memory they're using and their OOM score adjustments. In Kubernetes, when a container exceeds its memory limit, it becomes a prime candidate for termination.
The OOM killer sends a SIGKILL signal (not SIGTERM), which immediately terminates the process without giving it any opportunity for graceful shutdown. This forceful termination can leave resources in inconsistent states, potentially contributing to stuck termination if cleanup operations fail.
Identifying OOMKilled: The clearest indicator is in the pod's container status:
kubectl describe pod <pod-name> -n <namespace>

Look for output like this in the container status section:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 15 Jan 2024 10:23:45 -0800
  Finished:     Mon, 15 Jan 2024 10:28:12 -0800
The combination of "OOMKilled" reason and exit code 137 definitively indicates memory exhaustion. You might also see events like:
Warning BackOff 2m (x10 over 5m) kubelet Back-off restarting failed container
To check current memory usage versus limits:
kubectl top pod <pod-name> -n <namespace>

This shows real-time resource consumption. If memory usage is consistently near the limit, you're likely experiencing OOM issues.
CPU and Core in Kubernetes: Understanding resource allocation is vital for preventing OOM situations. In Kubernetes, CPU is measured in "cores" or fractions thereof. One CPU core equals 1000 millicores (or 1000m). When you specify cpu: 500m, you're allocating half a CPU core's worth of processing time.
Memory is measured in bytes, typically expressed as Mi (mebibytes) or Gi (gibibytes). A memory limit of 512Mi means the container can use up to 512 mebibytes before the OOM killer intervenes.
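These unit conventions can be parsed mechanically. The Python sketch below is a simplified illustration (parse_cpu and parse_memory are hypothetical helpers; real Kubernetes quantities also accept decimal suffixes like K/M/G and exponent notation):

```python
def parse_cpu(quantity: str) -> float:
    # "500m" means 500 millicores, i.e. half a core; "2" means two full cores.
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)

_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_memory(quantity: str) -> int:
    # Binary suffixes: 512Mi = 512 * 1024^2 bytes; a bare number is bytes.
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)

print(parse_cpu("500m"))      # 0.5
print(parse_memory("512Mi"))  # 536870912
```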
The distinction between requests and limits is crucial:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Requests determine scheduling—Kubernetes only places the pod on nodes with at least that much available capacity. Limits determine enforcement—the container cannot exceed these values. For memory, exceeding the limit triggers OOMKilled. For CPU, the container is throttled but not killed.
Misconfigured limits are a primary cause of OOMKilled pods. Setting limits too low for the application's actual needs causes repeated OOM terminations. Setting requests much lower than actual usage leads to overcommitted nodes where multiple pods compete for insufficient memory, triggering OOM conditions even when individual pods haven't exceeded their limits.
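To see how requests and limits interact on a node, here is a toy Python check (node_fit is a hypothetical helper; all values are in Mi). It captures the overcommit scenario described above: requests gate scheduling, while the sum of limits can still exceed the node:

```python
def node_fit(requests_mi, limits_mi, allocatable_mi):
    # The scheduler only verifies that the sum of requests fits the node.
    # Limits are enforced per container, so their sum may overcommit memory.
    return {
        "schedulable": sum(requests_mi) <= allocatable_mi,
        "overcommitted": sum(limits_mi) > allocatable_mi,
    }

# Three pods whose requests fit a 2048Mi node but whose limits overcommit it.
print(node_fit([256, 512, 1024], [512, 1024, 2048], 2048))
# {'schedulable': True, 'overcommitted': True}
```

A node in this state admits all three pods, yet if they all approach their limits at once, the kernel's OOM killer intervenes even though no single pod breached its own limit.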
Application Unresponsiveness During Shutdown
Sometimes the application within the container simply doesn't respond to the SIGTERM signal, causing the pod to hang during the grace period and potentially get stuck if subsequent cleanup also fails.
Applications get stuck during shutdown for several reasons. The most common is that the application doesn't have signal handlers implemented. Many applications, particularly those not designed for containerized environments, don't explicitly handle SIGTERM. When they receive the signal, their default behavior might be to ignore it or handle it improperly.
Another frequent issue is deadlocks during shutdown. An application might be waiting for a lock that will never be released, or trying to flush data to a database that's no longer responding, or waiting for worker threads that have themselves deadlocked. These situations cause the application to hang indefinitely during its shutdown routine.
Some applications have shutdown sequences that simply take longer than the default 30-second grace period. Complex applications might need to drain message queues, complete in-flight transactions, or persist state to disk—operations that legitimately require more time.
You can identify application unresponsiveness by checking if containers are still running after the termination grace period:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.deletionTimestamp}'

Compare this timestamp to the current time. If it's more than 30 seconds old (or whatever grace period you've configured) and the pod is still terminating, the application likely isn't responding to SIGTERM.
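You can script this staleness check. The sketch below (is_stuck is a hypothetical helper) parses the RFC 3339 timestamp format the API server returns and adds a small buffer on top of the grace period before declaring the pod stuck:

```python
from datetime import datetime, timedelta, timezone

def is_stuck(deletion_timestamp: str, grace_seconds: int, now: datetime,
             buffer_seconds: int = 30) -> bool:
    # deletionTimestamp looks like "2024-01-15T10:25:00Z" (UTC).
    started = datetime.strptime(deletion_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    # Allow the full grace period plus a buffer for normal cleanup latency.
    deadline = started + timedelta(seconds=grace_seconds + buffer_seconds)
    return now > deadline

now = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(is_stuck("2024-01-15T10:25:00Z", 30, now))  # True: five minutes past deletion
print(is_stuck("2024-01-15T10:29:45Z", 60, now))  # False: still within grace + buffer
```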
Check the configured grace period:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.terminationGracePeriodSeconds}'

If your application needs more time, you can increase this value in your pod spec:
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: myapp
    image: myapp:latest

Debugging Application Shutdown Logic
If you suspect application unresponsiveness, you'll need to examine your application's code and its shutdown procedures. The application should explicitly handle SIGTERM signals and perform cleanup operations.
For applications written in Go, proper signal handling looks like this:
import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
)

sigterm := make(chan os.Signal, 1)
signal.Notify(sigterm, syscall.SIGTERM)
<-sigterm
log.Println("Received SIGTERM, starting graceful shutdown")
// Stop accepting new requests and wait for in-flight ones to complete
if err := server.Shutdown(context.Background()); err != nil {
    log.Printf("HTTP server shutdown error: %v", err)
}
// Close database connections and clean up remaining resources
db.Close()

Python applications should use similar patterns:
import logging
import signal
import sys

logger = logging.getLogger(__name__)

def sigterm_handler(signum, frame):
    logger.info("Received SIGTERM, shutting down gracefully")
    # Close connections
    db.close()
    # Cleanup resources
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)

Warning: A common mistake is catching SIGTERM but performing cleanup operations that themselves can hang. Always implement timeouts in your shutdown logic. If database disconnection takes more than a few seconds, log an error and exit anyway rather than hanging indefinitely.
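One way to enforce such timeouts without relying on every dependency's API is to bound each cleanup step externally. This Python sketch (cleanup_with_timeout is a hypothetical helper) runs each step in a daemon thread and abandons any step that overruns:

```python
import threading
import time

def cleanup_with_timeout(step, timeout_seconds, name):
    # Run one cleanup step in a daemon thread; if it overruns, abandon it
    # so a hung dependency cannot stall the whole shutdown sequence.
    worker = threading.Thread(target=step, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        print(f"{name} did not finish within {timeout_seconds}s, skipping")
        return False
    return True

print(cleanup_with_timeout(lambda: None, 2, "close-db"))               # True
print(cleanup_with_timeout(lambda: time.sleep(60), 1, "flush-queue"))  # False, after ~1s
```

Because the worker is a daemon thread, the process can still exit even though the hung step never returns.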
If you cannot modify the application code, consider using a preStop hook to perform cleanup:
spec:
  containers:
  - name: myapp
    image: myapp:latest
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/app/graceful-shutdown.sh"]

The preStop hook executes before SIGTERM is sent, giving you a chance to trigger application-specific shutdown procedures. Note that the hook runs within the termination grace period, not in addition to it: a long-running hook consumes time the application would otherwise have to respond to SIGTERM, and if the hook is still running when the grace period expires, the kubelet grants only a brief extension (two seconds) before sending SIGKILL. Budget your grace period to cover both the hook and the application's own shutdown.
Network Policies and Dependencies
Networking issues, particularly those involving NetworkPolicy objects, can contribute to pods getting stuck in a terminating state, though these causes are less common than resource constraints or application issues.
How Network Policies Can Interfere
NetworkPolicy objects control traffic flow between pods in Kubernetes clusters. While they're essential for security, misconfigured policies can inadvertently block communication necessary for proper pod termination.
During the termination process, the kubelet needs to communicate with containers to send signals and monitor their shutdown. In some CNI implementations, the kubelet also needs to communicate with network components to clean up pod networking. If a NetworkPolicy blocks this communication, termination can stall.
Checking Network Policy Application: First, identify which NetworkPolicies might affect your pod:
kubectl get networkpolicy -n <namespace>

Then examine each policy's selectors to determine if they apply to your stuck pod:
kubectl describe networkpolicy <policy-name> -n <namespace>

Look at the podSelector field. If it matches your pod's labels, the policy applies. For example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database

This policy applies to pods with label app: api and restricts both ingress and egress traffic. If your terminating pod has the app: api label, this policy affects it.
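Selector matching itself is simple to reason about. This Python sketch (selector_matches is a hypothetical helper) mirrors how matchLabels is evaluated against a pod's labels, including the empty-selector case discussed below:

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    # A podSelector with matchLabels applies when every key/value pair is
    # present on the pod's labels; an empty selector ({}) matches every pod.
    return all(pod_labels.get(k) == v for k, v in match_labels.items())

print(selector_matches({"app": "api"}, {"app": "api", "tier": "backend"}))  # True
print(selector_matches({"app": "api"}, {"app": "frontend"}))                # False
print(selector_matches({}, {"app": "anything"}))                            # True
```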
Common Pitfalls: Overly restrictive egress rules are the most frequent culprit. If a NetworkPolicy denies all egress traffic except to specific destinations, it might prevent the pod from communicating with cluster DNS, the API server, or external dependencies it needs to contact during shutdown.
A particularly problematic configuration is denying all traffic without explicit allows:
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

This policy applies to all pods in the namespace and blocks all traffic unless explicitly allowed by other rules. If you don't have corresponding rules allowing necessary communication, pods can't perform network operations during shutdown.
To verify if NetworkPolicy is causing your issue, you can temporarily remove the policy:
kubectl delete networkpolicy <policy-name> -n <namespace>

Then attempt to delete the stuck pod again. If it terminates successfully, you've confirmed the NetworkPolicy was interfering. You'll need to adjust the policy to allow necessary traffic.
Note: Some CNI plugins handle NetworkPolicy enforcement differently. Calico, Cilium, and Weave each have their own implementations, and bugs in these implementations can occasionally cause termination issues. Check your CNI plugin's documentation and issue tracker if you suspect NetworkPolicy-related problems.
External Dependencies and Graceful Shutdown
Pods often depend on external services like databases, message queues, or APIs. If these dependencies are unavailable or unresponsive during termination, the pod might hang while trying to disconnect or clean up its connections.
Ensuring Graceful Disconnects: Applications should implement timeouts for all external communications during shutdown. If you're trying to flush data to a database, set a maximum wait time:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := flushDataToDatabase(ctx); err != nil {
    log.Printf("Failed to flush data, exiting anyway: %v", err)
}

This pattern ensures that even if the database is unresponsive, your application exits after 5 seconds rather than hanging indefinitely.
Connection pooling libraries often have shutdown methods that attempt to gracefully close all connections. These can hang if remote services aren't responding:
# Set a timeout for connection pool shutdown
try:
    db_pool.close(timeout=5)
except TimeoutError:
    logger.warning("Database pool didn't close cleanly, forcing shutdown")
    db_pool.terminate()
db_pool.terminate()For applications using message queues, implement similar patterns to stop consuming messages, process any in-flight messages (with a timeout), and then disconnect:
consumer.Stop()
// Wait for in-flight messages with timeout
select {
case <-consumer.Done():
    log.Println("Consumer stopped cleanly")
case <-time.After(10 * time.Second):
    log.Println("Consumer didn't stop in time, forcing exit")
}

If your application makes API calls during shutdown (like deregistering from a service registry), wrap these in timeouts and handle failures gracefully. A pod shouldn't hang indefinitely just because a service registry is unavailable.
Advanced Troubleshooting: Finalizers and Controller Issues
When basic checks don't reveal the problem, it's time to look at more advanced Kubernetes concepts like finalizers and how controllers manage pod lifecycles. These issues are less common but can be particularly stubborn.
The Role of Finalizers
Finalizers are keys added to a resource's metadata.finalizers field that prevent the resource from being deleted until specific cleanup operations complete. They're a critical mechanism for ensuring proper cleanup, but malfunctioning finalizers are a common cause of Kubernetes pods stuck in Terminating.
A finalizer is simply a string that identifies a controller responsible for some cleanup task. When you delete a resource with finalizers, Kubernetes doesn't immediately remove it from the API server. Instead, it sets the deletionTimestamp and waits for controllers to perform their cleanup and remove their finalizers. Only when all finalizers are removed does Kubernetes actually delete the resource.
Identifying Finalizers: Check if your stuck pod has any finalizers:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 finalizers

You might see output like:
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  - example.com/custom-cleanup

Each string in the finalizers list represents a controller that must complete its work. The kubernetes.io/pvc-protection finalizer is common and prevents deletion of pods with PersistentVolumeClaims until the volumes are properly detached.
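A quick way to triage a finalizer list is to split it by prefix, since built-in finalizers live under kubernetes.io/ while anything else points at a custom controller or operator you need to locate. classify_finalizers below is a hypothetical helper illustrating the convention:

```python
def classify_finalizers(finalizers):
    # Built-in finalizers use the kubernetes.io/ prefix; third-party
    # controllers conventionally use their own domain names.
    groups = {"builtin": [], "third_party": []}
    for f in finalizers:
        key = "builtin" if f.startswith("kubernetes.io/") else "third_party"
        groups[key].append(f)
    return groups

print(classify_finalizers(["kubernetes.io/pvc-protection", "example.com/custom-cleanup"]))
# {'builtin': ['kubernetes.io/pvc-protection'], 'third_party': ['example.com/custom-cleanup']}
```

Anything in the third_party bucket is a lead: find the controller that owns that domain and check whether it is healthy.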
Common Scenarios: Storage controllers frequently use finalizers. When a pod uses a PersistentVolumeClaim, the PVC controller adds a finalizer to ensure the volume isn't deleted while still in use. If the storage controller is malfunctioning, unable to detach the volume, or if there are issues with the underlying storage system, this finalizer won't be removed and the pod stays stuck.
Custom controllers and operators often add their own finalizers. For example, a database operator might add a finalizer to ensure backups complete before a database pod is deleted. If the operator is down, misconfigured, or has a bug, it won't remove its finalizer.
Admission webhooks can also add finalizers during pod creation. If the webhook service is later removed or becomes unavailable, there's no longer anything to remove the finalizer during deletion.
To identify which controller is responsible for a finalizer, examine the finalizer string. Standard Kubernetes finalizers use the kubernetes.io/ prefix. Third-party controllers typically use their own domain names. You can search for the string in your cluster's CustomResourceDefinitions and deployed controllers:
kubectl get crd -o yaml | grep <finalizer-string>

Check if the responsible controller is running:
kubectl get pods -A | grep <controller-name>

If the controller pod is in CrashLoopBackOff or missing entirely, that explains why the finalizer isn't being removed.
Understanding Controller Behavior
Kubernetes controllers like Deployment, StatefulSet, and ReplicaSet manage pod lifecycles, and issues with these controllers can manifest as stuck terminating pods.
Checking Controller Status: If your stuck pod is managed by a Deployment, examine the Deployment's events:
kubectl describe deployment <deployment-name> -n <namespace>

Look for error messages about pod deletion or scaling operations. The ReplicaSet that actually manages the pods might have more specific information:
kubectl get rs -n <namespace>
kubectl describe rs <replicaset-name> -n <namespace>

For StatefulSets, the situation is more complex because pods have stable identities. A StatefulSet won't create a new pod with identity app-0 until the old app-0 pod is completely deleted. Check the StatefulSet controller's status:
kubectl describe statefulset <statefulset-name> -n <namespace>

Events might indicate issues like "FailedDelete" or "FailedUpdate" that explain why pods aren't terminating properly.
Controller issues often stem from the controller manager itself being under-resourced or experiencing bugs. Check the controller manager logs:
kubectl logs -n kube-system <controller-manager-pod>

Look for errors related to your namespace or pod names. Common issues include API server communication failures, resource quota problems, or bugs in the controller logic for specific edge cases.
Volume Management and Permissions
Issues with volume mounts and underlying storage can prevent proper pod termination, particularly when the kubelet can't cleanly unmount volumes.
VolumeMount User Group and File Permissions: Incorrectly set user/group IDs or file permissions on mounted volumes can prevent applications from performing cleanup operations during shutdown. If your application needs to delete temporary files, flush buffers to disk, or update state files during shutdown, but lacks the necessary permissions, it might hang.
Check the securityContext in your pod spec:
spec:
  securityContext:
    fsGroup: 2000
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - name: myapp
    volumeMounts:
    - name: data
      mountPath: /data

The fsGroup setting ensures that mounted volumes are accessible to the specified group. If this doesn't match your application's expectations, file operations during shutdown might fail.
You can verify current permissions by execing into a running container:
kubectl exec -it <pod-name> -n <namespace> -- ls -la /data

If permissions are incorrect, you might see errors in the container logs during shutdown attempts.
PersistentVolumeClaim Issues: Problems with the underlying PersistentVolume or PVC can block pod termination. Check the PVC status:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

A PVC stuck in Terminating state itself will prevent pods using it from terminating. This often happens when the PVC has the kubernetes.io/pvc-protection finalizer but the volume can't be detached from the node.
Check the PV status:
kubectl get pv
kubectl describe pv <pv-name>

Look for events indicating attachment or detachment failures. Cloud provider storage (AWS EBS, GCE Persistent Disk, Azure Disk) can have specific issues:
- AWS EBS: Volume might be attached to a different instance, or the instance might have been terminated without properly detaching volumes
- GCE Persistent Disk: Disk might be in a different zone than the node, or quota limits might prevent operations
- Azure Disk: Disk might be locked by another resource, or the VM might have reached its disk attachment limit
For local or NFS volumes, check if the mount point on the node is hung:
# SSH to the node
mount | grep <volume-name>

A hung NFS mount will prevent the kubelet from unmounting it, blocking pod termination. You might need to forcefully unmount:
umount -f /var/lib/kubelet/pods/<pod-uid>/volumes/...

Warning: Forceful unmounting can cause data loss if writes are in progress. Only do this as a last resort when you're certain the volume is truly hung.
Forcefully Terminating Stuck Pods (When All Else Fails)
Sometimes, despite all diagnostic efforts, a pod remains stubbornly stuck in terminating state. In such cases, you might need to resort to more forceful methods. Use these approaches with caution, as they bypass normal graceful shutdown procedures and can leave orphaned resources.
Removing Finalizers Manually
If you've identified that a finalizer is preventing deletion and the responsible controller is unavailable or malfunctioning, you can manually remove the finalizer to allow the pod to be deleted.
Steps to remove finalizers:
First, get the current pod definition:
kubectl get pod <pod-name> -n <namespace> -o yaml > pod-backup.yaml

Always save a backup before making manual edits. This allows you to restore the original state if something goes wrong.
Edit the pod directly:
kubectl edit pod <pod-name> -n <namespace>

This opens the pod definition in your default editor. Locate the metadata.finalizers section:
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  - example.com/custom-cleanup

Remove the problematic finalizer entry. You can remove all finalizers or just the specific one causing issues. If you're removing the example.com/custom-cleanup finalizer, the section should look like:
metadata:
  finalizers:
  - kubernetes.io/pvc-protection

Or if removing all finalizers:

metadata:
  finalizers: []

Save and exit the editor. Kubernetes will immediately attempt to proceed with deletion now that the blocking finalizer is removed.
Alternatively, you can remove finalizers using a JSON patch without opening an editor:
kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}' --type=merge

This sets the finalizers field to null, removing all finalizers at once.
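If you'd rather remove one finalizer while preserving the rest, you can construct the patch body programmatically before feeding it to kubectl patch or a Kubernetes client library. remove_finalizer_patch below is a hypothetical helper sketching this:

```python
import json

def remove_finalizer_patch(current_finalizers, target):
    # Keep every finalizer except the one being removed; an empty list
    # becomes null so the field is cleared entirely in the merge patch.
    remaining = [f for f in current_finalizers if f != target]
    return json.dumps({"metadata": {"finalizers": remaining or None}})

print(remove_finalizer_patch(
    ["kubernetes.io/pvc-protection", "example.com/custom-cleanup"],
    "example.com/custom-cleanup",
))
# {"metadata": {"finalizers": ["kubernetes.io/pvc-protection"]}}
```

This avoids accidentally clearing a storage-protection finalizer while removing a stale custom one.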
Warning: Removing finalizers bypasses the cleanup operations they were meant to ensure. If you remove a storage finalizer, you might leave volumes attached to nodes or orphaned in your cloud provider. If you remove a custom controller's finalizer, whatever cleanup logic it was supposed to perform won't happen. Only remove finalizers when you understand the consequences and have verified the cleanup either isn't necessary or you'll handle it manually.
After removing finalizers, the pod should delete within seconds. If it still doesn't delete, proceed to force deletion.
Deleting the Pod with --force and --grace-period=0
Force deletion is the most aggressive method for removing a stuck pod. This should be your absolute last resort after all other approaches have failed.
Command:
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

Explanation of flags:
The --grace-period=0 flag tells Kubernetes to skip the termination grace period entirely. Instead of waiting 30 seconds (or whatever grace period is configured) for containers to exit after receiving SIGTERM, Kubernetes immediately sends SIGKILL to forcefully terminate all processes.
The --force flag instructs Kubernetes to remove the pod object from the API server immediately, even if the kubelet hasn't confirmed that all containers have stopped and cleanup is complete. This is a directive to the API server to delete the pod record regardless of actual state on the node.
When you combine these flags, you're telling Kubernetes: "I don't care about graceful shutdown, and I don't care if the node has finished cleanup—delete this pod object now."
The output will typically show:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "<pod-name>" force deleted
This warning is important. Force deleting a pod removes it from Kubernetes' view, but the containers might still be running on the node. The kubelet will eventually notice the discrepancy and clean up, but there's a window where the pod is "gone" from Kubernetes but still consuming node resources.
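To make the timing concrete: the kubelet escalates from SIGTERM to SIGKILL once the grace period elapses after the deletion timestamp, and --grace-period=0 collapses that window to zero. A small sketch of the arithmetic:

```python
from datetime import datetime, timedelta, timezone

def sigkill_deadline(deletion_timestamp: datetime,
                     grace_period_seconds: int) -> datetime:
    """When the kubelet escalates from SIGTERM to SIGKILL."""
    return deletion_timestamp + timedelta(seconds=grace_period_seconds)

deleted = datetime(2024, 1, 15, 14, 23, 11, tzinfo=timezone.utc)
print(sigkill_deadline(deleted, 30))  # default: SIGKILL 30s after deletion
print(sigkill_deadline(deleted, 0))   # --grace-period=0: SIGKILL immediately
```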
Consequences and risks:
Force deletion can lead to several problems. If the pod had PersistentVolumeClaims, those volumes might remain attached to the node even though the pod is gone. You'll need to manually check and detach them if necessary.
For StatefulSets, force deleting a pod can create split-brain scenarios. If you force delete app-0, Kubernetes will create a new app-0 pod immediately. But if the old pod's containers are still running on the node, you temporarily have two pods with the same identity, which can cause data corruption in stateful applications like databases.
Network resources might not be properly cleaned up. The pod's IP address might remain allocated, and network policies might still reference the pod until the kubelet completes its cleanup.
Application state can be left inconsistent. If the application was in the middle of a database transaction or file write when forcefully killed, you might have partial data or corrupted files.
When force deletion is appropriate:
Despite these risks, force deletion is sometimes necessary. Appropriate scenarios include:
- The node hosting the pod is completely unavailable (crashed, network partitioned, terminated) and will never come back
- You've already manually cleaned up volumes and other resources and just need to remove the API server object
- The pod is blocking critical operations (like cluster upgrades or security patches) and the risks of leaving it are greater than the risks of force deletion
- You've exhausted all other options and verified that the pod's containers aren't actually running
Before force deleting, verify the node status:
kubectl get nodes
kubectl describe node <node-name>
If the node shows as NotReady or has been removed from the cluster, force deletion is safer because you know the kubelet isn't going to complete cleanup anyway.
After force deleting a pod, monitor for orphaned resources:
# Check for volumes still attached
kubectl get pv
kubectl describe pv <pv-name>
# For StatefulSets, verify the new pod starts correctly
kubectl get pods -n <namespace> -l <statefulset-labels>
kubectl logs <new-pod-name> -n <namespace>
If you're running on cloud infrastructure, check your cloud provider's console for orphaned volumes or network interfaces that might need manual cleanup.
Skip the Manual Work: How OpsSqad's K8s Squad Solves This For You
You've just learned dozens of kubectl commands, explored complex Kubernetes internals, and walked through a multi-step diagnostic process for troubleshooting stuck terminating pods. What if you could achieve the same results—or better—with a simple conversation? OpsSqad's K8s Squad transforms this complex debugging workflow into an intuitive chat interaction that takes minutes instead of hours.
The OpsSqad User Journey: From Node to Resolution
Getting started with OpsSqad takes about three minutes and requires no complex infrastructure changes. The platform uses reverse TCP architecture, meaning your infrastructure initiates the connection to OpsSqad's cloud—no inbound firewall rules, no VPN setup, no exposed endpoints.
Step 1: Create Your Account and Node
Sign up for a free account at app.opssquad.ai. After logging in, navigate to the Nodes section in the dashboard. Click "Create Node" and give it a descriptive name like "production-k8s-cluster" or "staging-environment." The dashboard generates a unique Node ID and authentication token—these are your credentials for connecting infrastructure to OpsSqad.
Step 2: Deploy the Agent
SSH into your server or jump host that has kubectl access to your cluster. Run the installation commands using the Node ID and token from your dashboard:
curl -fsSL https://install.opssquad.ai/install.sh | bash
Then install and start the node:
opssqad node install --node-id=<your-node-id-from-dashboard> --token=<your-token-from-dashboard>
opssqad node start
The agent is lightweight (under 50MB) and immediately establishes a secure reverse TCP connection to OpsSqad cloud. Your infrastructure connects outbound to OpsSqad—nothing is listening for inbound connections, so your existing firewall rules don't need modification. The agent runs with minimal resource overhead and can be deployed on any server with terminal access.
Step 3: Browse and Deploy the K8s Squad
Back in the OpsSqad dashboard, navigate to the Squad Marketplace. Find the "K8s Troubleshooting Squad" (or similar—the exact name may vary) and click "Deploy Squad." This creates a private instance of the Squad with all its specialized AI agents. The Squad includes agents trained on Kubernetes operations, common failure scenarios, and troubleshooting workflows.
Step 4: Link Agents to Nodes
Open your deployed Squad and navigate to the Agents tab. Here you'll see the AI agents that make up the Squad. Click to give these agents access to your Node(s). This grants permission for the agents to execute commands on your infrastructure through the secure reverse connection.
OpsSqad's security model ensures that only whitelisted commands can be executed. The K8s Squad comes with pre-approved commands like kubectl get, kubectl describe, kubectl logs, and diagnostic utilities. Commands are executed in a sandboxed environment, and every action is logged to a detailed audit trail you can review in the dashboard.
Step 5: Start Debugging
Navigate to chat.opssquad.ai and select your K8s Squad. You're now ready to debug Kubernetes issues through natural language conversation.
Real-Time Debugging: A Stuck Pod Scenario
Imagine you've just received an alert that a pod is stuck terminating. Instead of running through the entire diagnostic process you learned in this article, you can simply describe the problem:
You: "I have a pod called payment-processor-7d9f8b-xkj2m in the production namespace that's been stuck in Terminating state for 15 minutes. Can you investigate why?"
Behind the scenes, the K8s Squad executes the diagnostic workflow you learned earlier—checking pod status, examining events, reviewing finalizers, inspecting container exit codes—but it does this in seconds rather than minutes. The agent synthesizes the information into a clear analysis:
K8s Squad: "I've investigated payment-processor-7d9f8b-xkj2m. Here's what I found:
- Pod has been in Terminating state for 17 minutes (deletion timestamp: 2024-01-15T14:23:11Z)
- The pod has one finalizer: volume.kubernetes.io/storage-protection
- PersistentVolumeClaim payment-data-pvc is also stuck in Terminating state
- The underlying PersistentVolume shows a detachment failure from node worker-node-3
- Node worker-node-3 is in NotReady state (last heartbeat 22 minutes ago)
The root cause appears to be that the node became unavailable while the pod was running, preventing proper volume detachment. The volume controller added a finalizer to protect the volume, but can't complete detachment because the node is unreachable.
Would you like me to:
- Force delete the pod and manually remove the finalizer
- Attempt to detach the volume from the cloud provider
- Show you the full diagnostic output for manual review"
This single response gives you more insight than running a dozen manual commands. The agent has correlated information across multiple resources (pod, PVC, PV, node) and identified the root cause.
You: "Yes, force delete the pod and remove the finalizer. Also check if we need to manually detach the volume in AWS."
K8s Squad: "Executing the following actions:
- Backing up pod definition to audit log
- Removing finalizer volume.kubernetes.io/storage-protection from pod payment-processor-7d9f8b-xkj2m
- Force deleting pod with --grace-period=0 --force
- Checking AWS EBS volume attachment status
Results:
- Pod payment-processor-7d9f8b-xkj2m successfully deleted
- AWS EBS volume vol-0a1b2c3d4e5f is still attached to instance i-0x9y8z7w6v (worker-node-3)
- Since the node is NotReady, I recommend detaching the volume in AWS and allowing Kubernetes to reattach it when a new pod is scheduled
Would you like me to:
- Detach the EBS volume in AWS (requires AWS credentials configured)
- Provide the AWS CLI command for manual detachment
- Monitor for the new pod creation and verify it starts successfully"
You: "Provide the AWS CLI command and monitor for the new pod."
K8s Squad: "Here's the AWS CLI command to detach the volume:
aws ec2 detach-volume --volume-id vol-0a1b2c3d4e5f --force
Monitoring for new pod creation...
A new pod payment-processor-7d9f8b-9kx3n has been created and is currently in ContainerCreating state. Waiting for it to reach Running state...
Pod payment-processor-7d9f8b-9kx3n is now Running. All containers are healthy. The PVC has successfully attached. Issue resolved.
Total time from initial investigation to resolution: 90 seconds."
The OpsSqad Advantage
This interaction demonstrates several key benefits:
Time Savings: What would have taken 15-20 minutes of manual investigation—running multiple kubectl commands, checking node status, examining cloud provider resources, correlating information across different systems—took 90 seconds through chat. The agent executed dozens of commands in parallel and synthesized the results.
Reduced Cognitive Load: You don't need to remember the exact syntax for checking finalizers, the difference between --force and --grace-period=0, or which AWS CLI command detaches volumes. The agent handles these details, letting you focus on decision-making rather than command syntax.
Enhanced Security: Every command the agent executed was from a pre-approved whitelist. The commands ran in a sandboxed environment. Every action was logged to an audit trail showing exactly what was executed, when, and by whom. You get the power of automated troubleshooting with the safety of strict access controls.
Reverse Architecture Benefits: The OpsSqad agent on your infrastructure initiated the connection to OpsSqad cloud. No inbound firewall rules were needed. No VPN configuration. No bastion hosts. The connection works from anywhere—your office, home, or while traveling—without exposing your cluster to the internet.
Contextual Intelligence: The K8s Squad didn't just run commands—it understood the relationships between resources. It knew that a stuck pod with a storage finalizer might indicate PVC issues, which might relate to node availability, which might require cloud provider intervention. This contextual awareness comes from training on thousands of real-world Kubernetes troubleshooting scenarios.
Audit Trail and Learning: After resolving the issue, you can review the complete audit log in the OpsSqad dashboard to see every command that was executed. This serves as documentation for post-mortems and helps junior team members learn the troubleshooting process. The chat transcript itself becomes a runbook for similar issues in the future.
Prevention and Best Practices
While troubleshooting stuck terminating pods is a valuable skill, preventing these issues in the first place saves time and reduces operational stress. The following best practices address the root causes you learned about in this article.
Resource Management Best Practices
Proper resource configuration prevents OOMKilled scenarios and resource-related termination issues.
Set Realistic Requests and Limits: Don't guess at resource requirements—measure them. Deploy your application to a development environment and monitor actual resource usage over time using metrics:
kubectl top pod <pod-name> -n <namespace> --containers
Set requests based on typical usage and limits based on peak usage with a reasonable buffer. A good starting point is setting limits at 150-200% of requests:
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
For memory, be conservative with limits. Setting them too low causes frequent OOMKilled events. Setting them too high allows memory leaks to consume excessive node resources. Monitor your application's memory growth over time to detect leaks before they cause problems.
For CPU, limits cause throttling rather than termination, but excessive throttling can make applications unresponsive during shutdown, preventing them from completing cleanup within the grace period. If you see high CPU throttling metrics, increase CPU limits.
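The 150-200% rule of thumb is simple arithmetic over the measured request. A small illustrative helper (suggest_limit is my own name, and the quantity parsing is deliberately simplified to the suffixes used above):

```python
def suggest_limit(request: str, factor: float = 1.5) -> str:
    """Scale a Kubernetes resource quantity by `factor`.
    Handles only the suffixes used in this article: Mi, Gi,
    and m (CPU millicores). Real quantity parsing is richer."""
    for suffix in ("Mi", "Gi", "m"):
        if request.endswith(suffix):
            scaled = float(request[: -len(suffix)]) * factor
            return f"{scaled:g}{suffix}"  # :g drops a trailing .0
    return f"{float(request) * factor:g}"  # plain number, e.g. whole CPUs

print(suggest_limit("512Mi"))      # 768Mi  (150% of the request)
print(suggest_limit("250m", 2.0))  # 500m   (200% of the request)
```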
Implement PodDisruptionBudgets: PodDisruptionBudgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions like node drains, cluster upgrades, or manual deletions. While PDBs don't directly prevent stuck terminating pods, they prevent scenarios where too many pods are terminating simultaneously, overwhelming the cluster's ability to handle graceful shutdowns:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: payment-processor-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: payment-processorThis PDB ensures at least 2 payment-processor pods remain running during disruptions, preventing situations where all pods are terminating at once.
Application Design for Graceful Shutdown
Applications designed with containerized environments in mind handle termination much more reliably.
Handle SIGTERM Gracefully: Every containerized application should implement proper signal handling. When your application receives SIGTERM, it should:
- Stop accepting new requests or work items
- Complete in-flight requests (with a timeout)
- Close database connections and external service connections
- Flush buffers and persist state
- Exit with code 0
Set realistic timeouts for each step. If completing in-flight requests might take 20 seconds, set terminationGracePeriodSeconds to at least 30 seconds to allow time for connection cleanup afterward:
spec:
  terminationGracePeriodSeconds: 45
Implement Health Checks Correctly: Readiness and liveness probes help Kubernetes manage pod lifecycle, but poorly configured probes can contribute to termination issues.
Liveness probes should check if the application is fundamentally broken and needs to be restarted. They should be simple and fast:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness probes should check if the application is ready to serve traffic. During shutdown, the application should fail readiness checks immediately after receiving SIGTERM, causing Kubernetes to stop sending it traffic:
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
Warning: Don't make probes dependent on external services. If your liveness probe checks database connectivity and the database becomes unavailable, Kubernetes will restart all your pods simultaneously, creating a thundering herd that makes the problem worse.
Kubernetes Configuration Best Practices
Cluster-level configuration choices affect how reliably pods terminate.
Use Finalizers Judiciously: If you're developing custom controllers or operators, only add finalizers when absolutely necessary for cleanup operations. Ensure your controller is highly available and has proper error handling. If the cleanup operation fails, your controller should:
- Log detailed error information
- Retry with exponential backoff
- Eventually remove the finalizer after a maximum retry count to prevent indefinite blocking
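The retry behavior above can be sketched as a bounded exponential backoff loop; run_cleanup_with_backoff is an illustrative name, not a client-go or operator-sdk API, and whether to remove the finalizer after exhausting retries is a judgment call against the orphaned-resource risks discussed earlier:

```python
import time

def run_cleanup_with_backoff(cleanup, max_retries=6,
                             base_delay=1.0, cap=60.0) -> bool:
    """Retry `cleanup` with exponential backoff. Returns True when
    cleanup succeeded (safe to remove the finalizer); False after
    max_retries failures (caller decides whether to remove it anyway)."""
    for attempt in range(max_retries):
        try:
            cleanup()
            return True
        except Exception as exc:
            delay = min(base_delay * (2 ** attempt), cap)  # capped backoff
            print(f"cleanup attempt {attempt + 1} failed ({exc}); "
                  f"retrying in {delay:.0f}s")
            time.sleep(delay)
    return False  # exhausted retries; log loudly before unblocking deletion
```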
Implement health checks for your custom controllers so Kubernetes can restart them if they become unresponsive, preventing scenarios where finalizers are never removed.
Regularly Review Network Policies: Network policies should be treated as code—version controlled, reviewed, and tested. Periodically audit your policies to ensure they're not overly restrictive:
kubectl get networkpolicy -A -o yaml > networkpolicies-backup.yaml
Test policy changes in a development environment before applying to production. Use tools like kubectl-netpol or network policy simulators to verify that policies allow necessary traffic.
Monitor Termination Metrics: Set up monitoring for pod termination patterns. Track metrics like:
- Number of pods stuck in Terminating state
- Time spent in Terminating state (should be under 1 minute for most pods)
- Frequency of force deletions
- Pods terminated due to OOMKilled
Create alerts for anomalies. If you see a sudden spike in stuck terminating pods, it might indicate a cluster-wide issue like a failing storage system or network problem that needs immediate attention.
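The time-in-Terminating metric falls out of each pod's metadata.deletionTimestamp. A hedged sketch of the check, assuming RFC 3339 timestamps as they appear in pod metadata (stuck_terminating is an illustrative helper, not a kubectl feature):

```python
from datetime import datetime, timezone, timedelta

def stuck_terminating(deletion_timestamp: str,
                      now: datetime,
                      threshold: timedelta = timedelta(minutes=1)) -> bool:
    """True if a pod marked for deletion at `deletion_timestamp`
    (RFC 3339 UTC, as in pod metadata) has been Terminating longer
    than `threshold` -- the one-minute guideline above."""
    deleted_at = datetime.strptime(
        deletion_timestamp, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - deleted_at > threshold

now = datetime(2024, 1, 15, 14, 40, 0, tzinfo=timezone.utc)
print(stuck_terminating("2024-01-15T14:23:11Z", now))  # True: ~17 minutes
print(stuck_terminating("2024-01-15T14:39:30Z", now))  # False: 30 seconds
```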
Conclusion
Pods stuck in the terminating state are a common Kubernetes challenge with multiple potential root causes—from resource constraints and application shutdown issues to finalizers, volume management problems, and network policy misconfigurations. By systematically diagnosing the issue using kubectl describe, examining container exit codes, checking for finalizers, and investigating controller behavior, you can identify and resolve most stuck pod scenarios. When prevention fails, force deletion with proper precautions provides a last-resort solution.
If you want to transform this multi-step diagnostic process into a simple conversation that takes seconds instead of minutes, OpsSqad's K8s Squad provides AI-powered troubleshooting through a secure, reverse TCP architecture that requires no firewall changes or VPN setup. Create your free account at app.opssquad.ai and experience how natural language commands can replace dozens of manual kubectl operations while maintaining full audit trails and security controls.