Kubernetes Troubleshooting
Inspecting the Cluster
# List the control plane address and cluster services
kubectl cluster-info
# Shell output:
Kubernetes control plane is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
# Dump detailed cluster state (resources and pod logs) for debugging and diagnosis
kubectl cluster-info dump
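By default, kubectl cluster-info dump prints everything to stdout, which quickly becomes hard to read. The sketch below assumes you want the dump written to files instead. The helper dump_cluster_state is hypothetical. It passes the real cluster-info dump flags --all-namespaces and --output-directory, and falls back to a timestamped directory name. The KUBECTL variable is an assumption added only so the command can be swapped out, for example with echo for a dry run.

```shell
# Hypothetical helper: dump the cluster state of all namespaces into a directory.
# KUBECTL defaults to "kubectl"; set KUBECTL=echo to only print the arguments.
dump_cluster_state() {
  # First argument is the target directory, else ./cluster-dump-<UTC timestamp>
  dump_dir=${1:-./cluster-dump-$(date -u +%Y%m%dT%H%M%SZ)}
  "${KUBECTL:-kubectl}" cluster-info dump --all-namespaces --output-directory="$dump_dir"
}

# Usage (requires cluster access):
#   dump_cluster_state /tmp/cluster-state
```

Writing the dump to a directory gives one file per resource type and namespace, so you can search it with grep instead of scrolling.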
Kubernetes Nodes
List Nodes
# List Kubernetes nodes
kubectl get nodes
# List Kubernetes nodes with more details (IP addresses, OS image, kernel and container runtime)
kubectl get nodes -o wide
# Shell output:
NAME      STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
ubuntu1   Ready    control-plane,master   42d   v1.30.5+k3s1   192.168.30.10   <none>        Ubuntu 24.04.1 LTS   6.8.0-48-generic   containerd://1.7.21-k3s2
ubuntu2   Ready    worker                 42d   v1.30.5+k3s1   192.168.30.11   <none>        Ubuntu 24.04.1 LTS   6.8.0-45-generic   containerd://1.7.21-k3s2
ubuntu3   Ready    worker                 42d   v1.30.5+k3s1   192.168.30.12   <none>        Ubuntu 24.04.1 LTS   6.8.0-45-generic   containerd://1.7.21-k3s2
ubuntu4   Ready    worker                 42d   v1.30.5+k3s1   192.168.30.13   <none>        Ubuntu 24.04.1 LTS   6.8.0-45-generic   containerd://1.7.21-k3s2
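In a larger cluster, filtering is easier than reading the whole table. This minimal awk sketch assumes kubectl get nodes --no-headers output, where STATUS is the second column (also with -o wide). The function name not_ready_nodes is hypothetical.

```shell
# Print every node whose STATUS is not plain "Ready",
# e.g. "NotReady" or "Ready,SchedulingDisabled" (cordoned).
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

No output means every node is plain Ready.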
Node Status
| Status / Condition | Explanation |
|---|---|
| Ready | The node is healthy and can run pods |
| DiskPressure | Free disk space or inodes on the node are low |
| MemoryPressure | Available memory on the node is low |
| PIDPressure | Too many processes are running on the node |
| NetworkUnavailable | The node's network is not configured correctly |
| SchedulingDisabled | The node is cordoned (kubectl cordon), so no new pods are scheduled on it |
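The conditions in the table can be checked on every node at once. This sketch assumes each node's name and its Type=Status pairs are printed one node per line, using the kubectl JSONPath template in the usage comment. The hypothetical bad_conditions function flags Ready when it is not True, and any other condition that is True. SchedulingDisabled is not a condition: kubectl derives it from the node's spec.unschedulable field, so it will not appear in this output.

```shell
# Flag unhealthy node conditions. Expects one line per node on stdin:
#   <node> <Type>=<Status> <Type>=<Status> ...
# Assumption: Ready should be True; every other condition should not be True.
bad_conditions() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")                       # "DiskPressure=True" -> kv[1], kv[2]
      if ((kv[1] == "Ready" && kv[2] != "True") || (kv[1] != "Ready" && kv[2] == "True"))
        print $1, kv[1], kv[2]
    }
  }'
}

# Usage (requires cluster access):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{" "}{.type}{"="}{.status}{end}{"\n"}{end}' | bad_conditions
```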
Node Details
These are the most important parts of the node details:
- Conditions: Detailed node status, e.g. memory, disk and PID pressure
- Non-terminated Pods: Resource requests and limits of the pods running on the node
- Allocated resources: Sum of the CPU, memory and storage requests and limits of the pods on the node, as a share of its allocatable capacity
# List details of a node
kubectl describe node ubuntu1
# Shell output snippet:
...
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 12 Nov 2024 12:10:58 +0000   Mon, 30 Sep 2024 17:40:37 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 12 Nov 2024 12:10:58 +0000   Mon, 30 Sep 2024 17:40:37 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 12 Nov 2024 12:10:58 +0000   Mon, 30 Sep 2024 17:40:37 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 12 Nov 2024 12:10:58 +0000   Mon, 30 Sep 2024 17:40:38 +0000   KubeletReady                 kubelet is posting ready status
...
Non-terminated Pods:          (7 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-7b98449c4-wmbsb                    100m (2%)     0 (0%)      70Mi (1%)        170Mi (4%)     42d
  kube-system                 csi-nfs-node-kwqxb                         30m (0%)      0 (0%)      60Mi (1%)        500Mi (12%)    41d
  kube-system                 local-path-provisioner-6795b5f9d8-vvdjl    0 (0%)        0 (0%)      0 (0%)           0 (0%)         42d
  kube-system                 metrics-server-cdcc87586-r9t67             100m (2%)     0 (0%)      70Mi (1%)        0 (0%)         42d
  kube-system                 svclb-postgres-cluster-1-2a758881-x76kf    0 (0%)        0 (0%)      0 (0%)           0 (0%)         41d
  kube-system                 svclb-traefik-29cb166f-nhrkl               0 (0%)        0 (0%)      0 (0%)           0 (0%)         42d
  kube-system                 traefik-67f6c94c47-hgbv6                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         42d
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                230m (5%)   0 (0%)
  memory             200Mi (5%)  670Mi (17%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
# Optional: output the node details in YAML format
kubectl get node ubuntu1 -o yaml
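The Allocated resources section helps explain why a pod stays Pending. The scheduler only places a pod on a node whose unrequested allocatable capacity still covers the pod's requests. This sketch assumes the plain-text kubectl describe node output shown above. The name high_requests is hypothetical, and the default of 80 % is an arbitrary example threshold.

```shell
# Report resources whose *requests* exceed a threshold (percent of allocatable).
# Reads `kubectl describe node <name>` output on stdin; first argument is the
# threshold in percent (default 80, an arbitrary example value).
high_requests() {
  awk -v limit="${1:-80}" '
    /^Allocated resources:/ { in_block = 1; next }
    # A line starting at column 0 begins the next section, e.g. "Events:"
    in_block && /^[^ ]/ { in_block = 0 }
    in_block && match($0, /\([0-9]+%\)/) {
      # The first "(N%)" on a row belongs to the Requests column
      pct = substr($0, RSTART + 1, RLENGTH - 3) + 0
      if (pct > limit + 0) print $1, pct "%"
    }
  '
}

# Usage (requires cluster access):
#   kubectl describe node ubuntu1 | high_requests 80
```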
List CPU, Memory, Processes & Available Storage
# Show CPU, memory and process usage (run on the node itself)
top
# List available disk storage
df -h
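A node reports DiskPressure when its free disk space or inodes drop below the kubelet's eviction thresholds. With upstream defaults on Linux, the disk space threshold is nodefs.available<10%. Distributions, including K3s, may configure this differently. The sketch below spots nearly full filesystems. It assumes df -P output, which uses the POSIX format with a fixed column layout and one line per filesystem. The name full_filesystems is hypothetical, and 85 % is an arbitrary threshold.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin: column 5 is Capacity, column 6 the mount point.
full_filesystems() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)                 # "92%" -> "92"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage (on the node):
#   df -P | full_filesystems 85
```

For cluster-wide CPU and memory usage per node, kubectl top node reads from the metrics-server that kubectl cluster-info lists.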
Inspecting Kubernetes Components
Pod Details
# List Kubernetes system pods
kubectl get pods -n kube-system
# List Kubernetes system pods and the nodes they run on
kubectl get pods -n kube-system -o wide
# Show pod details
kubectl describe pod pod-name -n kube-system
# Save the pod details in YAML format to a file
kubectl get pod pod-name -n kube-system -o yaml > pod-details.yaml
# Show the logs of a pod
kubectl logs pod-name -n kube-system
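When many pods are running, filtering for unhealthy ones saves scrolling. This sketch assumes kubectl get pods -A --no-headers output, with the columns NAMESPACE NAME READY STATUS RESTARTS AGE. The hypothetical unhealthy_pods function skips Completed pods. It also reports Running pods whose containers are not all ready.

```shell
# List pods that are not healthy: any status other than Running/Completed,
# plus Running pods whose READY count is incomplete (e.g. 0/1).
# Expects `kubectl get pods -A --no-headers` output on stdin.
unhealthy_pods() {
  awk '{
    # Completed pods legitimately show READY 0/1, so skip them first
    if ($4 == "Completed" || $4 == "Succeeded") next
    split($3, ready, "/")              # "1/2" -> ready[1]=1, ready[2]=2
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```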
Pod Status Errors
# Example: Run a pod with a nonexistent image tag, so the image pull fails
kubectl run new-pod --image=nginx:37
# List pods
kubectl get pods
# Shell output:
NAME                                    READY   STATUS         RESTARTS       AGE
new-pod                                 0/1     ErrImagePull   0              3s
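Status errors:
- ErrImagePull / ImagePullBackOff: The image can't be pulled, for example because of a wrong image name or tag. Export the pod's YAML config, correct the image and recreate the pod.
- CrashLoopBackOff: The container keeps crashing, and Kubernetes restarts it with an increasing back-off delay. Use the kubectl describe and kubectl logs commands. Sometimes the error is caused by cluster components.
- Pending: The pod is waiting, usually because it can't be scheduled. Use the kubectl describe command. This can be a scheduling issue caused by no available nodes or by resource requests that exceed what the nodes can offer. Check the node status and use the top command to check the resource allocation.
- Completed: The pod's containers exited successfully (exit code 0), which is unexpected for long-running workloads such as web servers. Use the kubectl describe command.
- Error: The container exited with a non-zero exit code. Use the kubectl describe command.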
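The status list maps each error to the first command worth running. The sketch below encodes that mapping as a case statement. next_step is a hypothetical helper that only prints the suggested commands; it does not run them. For CrashLoopBackOff it adds kubectl logs --previous, which shows the logs of the last terminated container instance.

```shell
# Hypothetical triage helper: print the first command(s) to run for a pod status.
# Usage: next_step <status> <pod> [namespace]   (namespace defaults to "default")
next_step() {
  pod_status=$1
  pod=$2
  ns=${3:-default}
  case "$pod_status" in
    ErrImagePull|ImagePullBackOff|Pending|Completed|Error)
      # The Events section at the end of the describe output usually names the cause
      echo "kubectl describe pod $pod -n $ns"
      ;;
    CrashLoopBackOff)
      echo "kubectl describe pod $pod -n $ns"
      # --previous returns the logs of the last terminated (crashed) container
      echo "kubectl logs $pod -n $ns --previous"
      ;;
    *)
      echo "kubectl get pod $pod -n $ns -o wide"
      ;;
  esac
}
```

For example, next_step ImagePullBackOff new-pod prints kubectl describe pod new-pod -n default.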
Troubleshooting Kubelet Agent
K8s
# Show the kubelet service status
systemctl status kubelet
# Show the kubelet service logs
journalctl -u kubelet.service
# Restart the kubelet
sudo systemctl restart kubelet
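The kubelet log is long and mostly informational. The kubelet logs through klog, whose lines start with a severity letter (I, W, E or F) followed by the date as MMDD and the time. This grep sketch keeps only error and fatal lines. The name klog_errors is hypothetical, and the pattern assumes the default klog text format.

```shell
# Keep only klog error (E) and fatal (F) lines, e.g. "E1112 12:10:58.123456 ...".
# The header may start the line or follow the journald prefix.
klog_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage (on the node):
#   journalctl -u kubelet.service --no-pager | klog_errors
```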
K3s
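K3s bundles the kubelet, the container runtime (containerd) and the other Kubernetes components into a single binary, so there is no separate kubelet service. Instead, the systemd service is called k3s on server (control-plane) nodes and k3s-agent on agent (worker) nodes:
# Check K3s status: server (control-plane) node
systemctl status k3s
# Check K3s status: agent (worker) node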
systemctl status k3s-agent
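On K3s, the logs of the embedded kubelet typically end up in the journal of the K3s unit. This sketch picks the right unit on the current node. It assumes the default unit names created by the K3s install script: k3s.service on servers and k3s-agent.service on agents. The name k3s_unit is hypothetical.

```shell
# Print the K3s systemd unit installed on this node.
# Expects `systemctl list-unit-files --no-legend 'k3s*'` output on stdin.
k3s_unit() {
  awk '$1 == "k3s.service" || $1 == "k3s-agent.service" { print $1; exit }'
}

# Usage (on the node):
#   unit=$(systemctl list-unit-files --no-legend 'k3s*' | k3s_unit)
#   systemctl status "$unit"
#   journalctl -u "$unit" --no-pager | tail -n 50
```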