Fixing a K8s Cluster with Some Nodes NotReady After a VM Power Loss Corrupted Disks and Broke Boot

What went wrong

Ha — when I left at noon my keys got locked inside, and I was in a hurry to get home and find a locksmith. The NUC at work had to come home with me, so I just force-powered it off. When I got back, the K8s cluster running on its VMs would not come up. Not a total disaster, though: at least the master survived, a small mercy. The last time I force-powered this box off, the whole cluster died — the etcd pods were gone and I had no backup, so in the end I had to reset the cluster with kubeadm.
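The takeaway from that earlier incident: snapshot etcd regularly so a dead etcd does not force a cluster reset. A minimal sketch, assuming a stacked-etcd kubeadm cluster with the default certificate paths (the backup destination is my own choice, not from the original setup):

# Run on the control-plane node; kubeadm's default etcd cert locations
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-$(date +%F).db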


┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

Ha — several nodes were NotReady, and their VMs would not even boot. Below is the console output from a boot that dropped straight into rescue mode.


[    9.800336] XFS (sda1): Metadata corruption detected at xfs_agf_read_verify+0x78/0x12 [xfs], xfs_agf block 0x4b00001
[    9.008356] XFS (sda1): Unmount and run xfs_repair
[    9.008376] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[    9.808395] ffff88803610a400: 58 41 47 46 80 08 80 01 80 08 80 81 80 96 88 88  XAGF....
[    9.888415] ffff88803610a410: 80 88 80 81 80 88 88 82 80 88 80 88 80 88 80 81
[    9.808435] ffff88803610a420: 80 88 80 81 80 88 80 88 80 88 88 88 80 88 80 83
[    9.080454] ffff88003610a430: 00 80 00 84 00 8d d1 2d 00 77 c3 a3 00 80 08 88  ....-.w....
[    9.080515] XFS (sda1): metadata I/O error: block 0x4b00001 ("xfs_trans_read_buf_map") error 117 numblks 1

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report.

:/#

A corrupted disk, then, and it needed repairing. Ha, what a trap.


How I fixed it

I hunted around for a disk-recovery procedure; the steps were as follows (a command sketch follows the list):


1. Boot the VM and press `e` at the GRUB menu to edit the boot entry (the single-user-mode route).
2. Append rd.break to the end of the line that begins with linux16.
3. Press Ctrl+x to boot into the rescue shell, then run xfs_repair -L /dev/sda1. Here sda1 is the corrupted disk; its name can be read off the rescue-mode output above.
4. Run reboot.
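The same steps as plain commands, a sketch (sda1 comes from the console output above; note that xfs_repair -L zeroes the XFS metadata log and is a last resort, since any unreplayed metadata is lost):

# At the GRUB menu: press `e`, append rd.break to the linux16 line, boot with Ctrl+x.
# Then, in the emergency shell:
xfs_repair -L /dev/sda1   # force-repair the corrupted filesystem, discarding the dirty log
reboot                    # boot normally once the repair finishes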

OK. Repair the disks one after another, boot the VMs, and check the nodes again: one node recovered, but two were still NotReady.


┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
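Before logging into the machines themselves, it is worth asking the node why it is NotReady; the Conditions block of kubectl describe usually names the reason (plain kubectl, node name taken from the listing above):

kubectl describe node vms82.liruilongs.github.io | grep -A 8 -i conditions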

At first I thought kubelet was the problem, but going through its status and logs turned up nothing.


┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 二 2023-01-17 20:53:02 CST; 1min 18s ago
   ....
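The status summary truncates the log; tailing the kubelet journal for the current boot (standard journalctl usage) gives the full picture:

journalctl -u kubelet -b --no-pager | tail -n 30   # last 30 kubelet log lines since this boot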

Then, in the cluster events, I found messages along the lines of Is the docker daemon running? and Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable.


┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get events | grep -i error
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "kube-proxy-htg7t": error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/create?name=k8s_POD_kube-proxy-htg7t_kube-system_85fe510d-d713-4fe6-b852-dd1655d37fff_15": EOF
44m         Warning   FailedKillPod            pod/skooner-5b65f884f8-9cs4k                          error killing pod: failed to "KillPodSandbox" for "eb888be0-5f30-4620-a4a2-111f14bb092d" with KillPodSandboxError: "rpc error: code = Unknown desc = [networkPlugin cni failed to teardown pod \"skooner-5b65f884f8-9cs4k_kube-system\" network: error getting ClusterInformation: Get \"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]"
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
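As an aside, instead of grepping everything, warnings can be filtered server-side; a sketch using the standard event field selector:

kubectl get events -A --field-selector type=Warning | grep -i sandbox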

So docker was probably down on some nodes. I checked docker's status on each NotReady node.
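To poll all of them in one shot from the master, a quick loop helps (a sketch, assuming passwordless ssh to the node FQDNs used in this cluster):

for h in vms82 vms155; do
  printf '%s: ' "$h"
  ssh root@"$h".liruilongs.github.io systemctl is-active docker
done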


┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl  status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com

1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:30 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
...

Sure enough, docker had not started, and the journal says a dependency of its failed. Let's look at docker's forward dependencies, i.e. the units that must come up before docker does:


┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl list-dependencies docker.service
docker.service
● ├─containerd.service
● ├─docker.socket
● ├─system.slice
● ├─basic.target
● │ ├─microcode.service
● │ ├─rhel-autorelabel-mark.service
● │ ├─rhel-autorelabel.service
● │ ├─rhel-configure.service
● │ ├─rhel-dmesg.service
● │ ├─rhel-loadmodules.service
● │ ├─selinux-policy-migrate-local-changes@targeted.service
● │ ├─paths.target
● │ ├─slices.target
● │ │ ├─-.slice
● │ │ └─system.slice
● │ ├─sockets.target
............................
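The reverse view is just as telling: it lists what depends on containerd, which is exactly why docker's start fails when containerd's does (standard systemctl):

systemctl list-dependencies --reverse containerd.service   # docker.service appears among the dependents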






Now check the first dependency, containerd.service — it turns out it had not started either.


┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since 二 2023-01-17 21:14:58 CST; 4s ago
     Docs: https://containerd.io
  Process: 6494 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 6491 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 6494 (code=exited, status=2)

1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Failed to start containerd container runtime.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Unit containerd.service entered failed state.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: containerd.service failed.
┌──[root@vms82.liruilongs.github.io]-[~]
└─$

No more detail than that: it only says the start failed. Let's try restarting it:


┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl restart containerd.service
Job for containerd.service failed because the control process exited with error code. See "systemctl status containerd.service" and "journalctl -xe" for details.

Check the containerd service log, starting with the error-tagged lines:


┌──[root@vms82.liruilongs.github.io]-[~]
└─$journalctl -u  containerd | grep -i error -m 3
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203387028+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203699262+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.204050775+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
┌──[root@vms82.liruilongs.github.io]-[~]
└─$
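Those three hits are actually harmless plugin skips (info- and warning-level). The message that kills containerd tends to sit at the very end of the journal, so it is worth tailing the whole unit log as well:

journalctl -u containerd -b --no-pager | tail -n 20   # the fatal error, if any, is at the end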

Either way, the log pointed at containerd's on-disk state (the lines below). My guess was that the disk corruption had mangled something under /var/lib/containerd/, so the plan: back that directory up, then wipe it and see.


aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.
path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin
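The transcript below jumps straight to rm -rf; the safer variant really does take the backup first. A sketch (the archive path is my own choice, not from the original session):

systemctl stop containerd                                                    # stop the restart loop before touching state
tar -czf /root/containerd-backup-$(date +%F).tar.gz -C /var/lib containerd   # keep a copy in case the wipe makes things worse
rm -rf /var/lib/containerd/*                                                 # clear the suspected-corrupt state
systemctl start containerd                                                   # containerd rebuilds its state on start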

Delete everything under the directory:


┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
io.containerd.content.v1.content/       io.containerd.runtime.v1.linux/         io.containerd.snapshotter.v1.native/    tmpmounts/
io.containerd.metadata.v1.bolt/         io.containerd.runtime.v2.task/          io.containerd.snapshotter.v1.overlayfs/
┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$ls

With the directory emptied, try starting containerd again:


┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since 二 2023-01-17 21:25:13 CST; 51s ago
     Docs: https://containerd.io
  Process: 8180 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 8182 (containerd)
   Memory: 146.8M
   ...........
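Optionally, confirm the runtime really answers on its socket before moving on (ctr ships with containerd):

ctr --address /run/containerd/containerd.sock version   # prints client and server versions if the daemon is healthy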

OK, it started, and with that the node came back to Ready as well.


┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

Then work through the remaining nodes the same way.


On 192.168.26.155 docker had also failed to start, but the failure mode was different: after the same cleanup the service did not come back by itself, and its log held error-level lines.


┌──[root@vms81.liruilongs.github.io]-[~]
└─$ssh root@192.168.26.155
Last login: Mon Jan 16 02:26:43 2023 from 192.168.26.81
┌──[root@vms155.liruilongs.github.io]-[~]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since 二 2023-01-17 20:20:03 CST; 1h 31min ago
     Docs: https://docs.docker.com
  Process: 2030 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 2030 (code=exited, status=0/SUCCESS)

1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796621853+08:00" level=error msg="712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e37724402f10d3045 cleanup: failed to de...
1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796669296+08:00" level=error msg="Handler for POST /v1.41/containers/712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e377...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.285529266+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containe...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.783878143+08:00" level=info msg="Processing signal 'terminated'"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Stopping Docker Application Container Engine...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.784550238+08:00" level=info msg="Daemon shutdown complete"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: start request repeated too quickly for docker.service
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Failed to start Docker Application Container Engine.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Unit docker.service entered failed state.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Active: failed (Result: start-limit) means the unit restarted too quickly too many times, so systemd gave up and will not retry on its own. Restart it by hand:


┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl restart docker
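If the rate limiter still rejects the manual start, clearing the unit's failed state first usually does it (standard systemd commands):

systemctl reset-failed docker.service   # drop the start-limit bookkeeping
systemctl restart docker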

Check the node status one more time: all nodes Ready. If you can, install a dashboard or some open-source K8s management tool — it makes this kind of work much more convenient. Driving everything from the command line is fine with a handful of nodes, but with many nodes it wears you out, ha.


┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES                  AGE    VERSION
vms155.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready    control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

References

https://blog.csdn.net/qq_35022803/article/details/109287086




Author: 山河已無恙

