K8s Nodes Go Into NotReady State
We have a K8s cluster running a set of services: frontend, backend, and a Redis database.
We expose the frontend service externally using an Ingress resource backed by the NGINX ingress controller.
We run this same setup in two K8s clusters, named Staging and Production.
We experienced an issue on Mar 11 at 12:07 AM IST (GMT+5:30): all of our Kubernetes nodes went into the NotReady state.
Because of that, our frontend service running in the cluster was not reachable (503 Service Unavailable).
However, all nodes returned to the Ready state within 2 minutes. The same issue occurred in both clusters.
We use Datadog for monitoring and logging our infrastructure. It shows that not a single user visited our site at that time. We analysed node metrics, which show a small spike on all nodes in both clusters, and pod-level metrics show a CPU spike in the calico-node pods under the service named cni.
We are not sure why this happened; we assume it may be because calico-node tried to reconfigure itself after the nodes went NotReady, or when they came back to Ready. We have attached the logs streamed at that time.
Time: 12:07:50 IST (GMT+5:30):
```
Events from the node <NODE_NAME>:
  1 NodeNotReady: Node <NODE_NAME> status is now: NodeNotReady
  1 NodeNotReady: Node <NODE_NAME> status is now: NodeNotReady
  1 NodeNotReady: Node <NODE_NAME> status is now: NodeNotReady
  1 NodeNotReady: Node <NODE_NAME> status is now: NodeNotReady
Events from the pod kube-system/calico-node-kxlkc:
  308 Unhealthy: Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
Events from the pod kube-system/calico-node-b64pg:
  10 Unhealthy: Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
```
Time: 12:09:30 IST (GMT+5:30):
```
Events from the node <NODE_NAME>:
  2 NodeHasSufficientPID: Node <NODE_NAME> status is now: NodeHasSufficientPID
  2 NodeReady: Node <NODE_NAME> status is now: NodeReady
  2 NodeHasSufficientMemory: Node <NODE_NAME> status is now: NodeHasSufficientMemory
  2 NodeHasNoDiskPressure: Node <NODE_NAME> status is now: NodeHasNoDiskPressure
```
Logs for pod calico-node (service: cni):
```
[INFO] ipam_plugin.go 68: migrating from host-local to calico-ipam...
[INFO] k8s.go 228: Using Calico IPAM
[INFO] migrate.go 65: checking host-local IPAM data dir existence...
[INFO] migrate.go 67: host-local IPAM data dir not found; no migration necessary, successfully exiting...
[INFO] ipam_plugin.go 98: migration from host-local to calico-ipam complete node="lke8498-14694-5f8ebbc945f3"
ls: cannot access '/calico-secrets': No such file or directory
```
Can you help us with the issue and the following queries?
- Why did all nodes go into the NotReady state? We know some of the reasons mentioned in the K8s documentation, such as insufficient resources, the kubelet stopping, or the container runtime stopping. But those are expected to affect one or two nodes at a time; in our case, all nodes in two different clusters went NotReady.
- Are there any issues with Linode Kubernetes Engine?
- Is there any notification mechanism to alert us when a node goes into the NotReady state? This is bad for our infrastructure: when a node goes NotReady/down, our services go down with it.
- Any suggestions on best practices for handling unexpected issues like this?
I raised a ticket with Linode for this issue, but they suggested I ask here on the community forum.
Has anyone experienced an issue like this on your LKE cluster? If so, what approach did you take to resolve it?
Hey @bewithjonam - apologies for the delay in response here. Hopefully by this point your cluster health has improved and things are up and running. I reached out to one of our LKE administrators about this and we believe we were able to pinpoint the reason why you were seeing your nodes report as NotReady:
- Can you help us with the issue and the following queries?
- Why did all nodes go into the NotReady state?
- Are there any issues with Linode Kubernetes Engine?
At the time you were reporting issues, our LKE team was performing some routine upgrades. During the maintenance, a service became unavailable for a short time while the upgraded version was being put in place. This most likely made the control plane temporarily unavailable and resulted in your nodes being listed as NotReady. Once the service was fully upgraded and the control plane became available again, your nodes should have returned to a Ready state.
All this to say, the upgrade ended up being more disruptive than we anticipated, and in the future we will be posting a maintenance page ahead of time.
- Is there any notification mechanism to alert us when a node goes into the NotReady state?
To set up alerts, we recommend Prometheus and kube-state-metrics. Deploying kube-state-metrics provides various cluster metrics out of the box, including node Ready state. You can configure Prometheus to scrape kube-state-metrics and then configure an alert to fire if nodes become NotReady (or whichever query you'd like to use). You could also configure Alertmanager to forward these alerts to something like Slack or PagerDuty, which can then send SMS or push notifications when the alert is triggered.
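As a rough sketch of the setup described above (the rule name, labels, and thresholds are illustrative, not an official configuration), a Prometheus alerting rule on the kube-state-metrics metric `kube_node_status_condition` could look like this:

```yaml
# Illustrative Prometheus alerting rule group: fire when any node
# reports a Ready condition that is "false" or "unknown" for over a minute.
groups:
  - name: node-health
    rules:
      - alert: KubeNodeNotReady
        # kube-state-metrics exposes one series per node/condition/status;
        # the series value is 1 for the status the node currently reports.
        expr: kube_node_status_condition{condition="Ready",status!="true"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is NotReady"
          description: "Node {{ $labels.node }} has reported Ready={{ $labels.status }} for more than 1 minute."
```

The `for: 1m` window is a judgment call: short enough to catch an outage like the one described here (nodes recovered within ~2 minutes), long enough to suppress one-off scrape blips. Alertmanager routing to Slack or PagerDuty would then be configured separately.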
- Any suggestions on best practices for handling unexpected issues like this?
We try to avoid performing maintenance that interrupts customer workloads, but sometimes it can be necessary. Like I said before, we'll be posting a maintenance page prior to performing similar upgrades in the future. If you require a multi-availability setup, one option is looking at something like Yggdrasil.
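Independent of multi-cluster setups, one in-cluster mitigation is making sure a single unhealthy node can't take a service down: run multiple frontend replicas spread across nodes, and cap voluntary disruptions with a PodDisruptionBudget. A minimal sketch, assuming a Deployment labelled `app: frontend` (the names here are assumptions for the example):

```yaml
# Illustrative: keep at least one frontend pod available during disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: frontend
---
# In the frontend Deployment's pod template, prefer spreading replicas
# across different nodes so one NotReady node leaves replicas elsewhere:
#
# spec:
#   template:
#     spec:
#       affinity:
#         podAntiAffinity:
#           preferredDuringSchedulingIgnoredDuringExecution:
#             - weight: 100
#               podAffinityTerm:
#                 labelSelector:
#                   matchLabels:
#                     app: frontend
#                 topologyKey: kubernetes.io/hostname
```

Note that a PodDisruptionBudget only guards against voluntary disruptions (drains, upgrades); the anti-affinity spread is what helps with involuntary node failures.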
I hope this helps clear some things up after the fact.