At DoiT International we work with customers small and large, and from time to time we recognize common issues, especially with some of our larger-scale customers. We recently ran into one while upgrading Google’s GKE (managed Kubernetes) clusters that we feel is worth sharing, in case others are planning their upgrades or run into similar issues.
Symptoms
It’s important to say that not all customers experience this, but these are the symptoms we’ve witnessed (a quick way to confirm them is sketched after the list):
- after the upgrade, the kubectl CLI cannot interact with the cluster (API-Server not responding)
- the upgrade process appears to be “hanging” and has not completed after 20+ minutes
- an error notice in the console or logs citing the following or similar: “All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-XXXXXXXX-XXXXXXX" is unhealthy.”
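If you suspect you’re hitting this, the quick checks below can confirm the first two symptoms. This is a minimal sketch assuming the kubectl and gcloud CLIs are already pointed at the affected cluster; the zone and filter values are illustrative and may need adjusting for your environment.

```
# Confirm the API-Server is unreachable (short timeout so the call fails fast
# instead of hanging):
kubectl get nodes --request-timeout=10s

# Check whether a control-plane upgrade operation is still running and when it
# started; replace the zone with your cluster's zone.
gcloud container operations list \
  --zone us-central1-a \
  --filter="operationType=UPGRADE_MASTER AND status=RUNNING"
```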
Risk Diagnosis
Although not everyone experiences this, the commonalities we’ve witnessed include:
- GKE cluster version below 1.16
- Zonal cluster (single-zone master for control plane)
- “Chatty” workloads that continuously interact with API-Server like Istio, Flux, or ArgoCD
- Minor version upgrades such as 1.12 -> 1.13, 1.13 -> 1.14, 1.14 -> 1.15, or 1.15 -> 1.16 (typically not seen during patch upgrades)
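A quick way to check whether your clusters fit this profile is sketched below; the output fields are standard GKE cluster attributes, and a zone-style location (e.g. us-central1-a) indicates a zonal cluster, while a region-style location (e.g. us-central1) indicates a regional one.

```
# List clusters with their control-plane location and versions.
gcloud container clusters list \
  --format="table(name, location, currentMasterVersion, currentNodeVersion)"
```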
Who might be impacted?
We’d like to reiterate that this has only occurred with a few customers thus far, and most fit the profile above: zonal clusters with heavy workloads hitting the API-Server, which caused it to take too long to pass health checks during the upgrade and left the process stuck in a “hung” state. We hope this information is helpful, either for planning future version upgrades or for troubleshooting existing issues.
Remediation
Google support case to increase health check timeout
If you are already experiencing this issue but your worker nodes are still serving traffic (you just have no kubectl access to the control plane), you can submit a support request to Google Support, or to your technical support partner (we hope it’s DoiT International), asking to increase the health check timeout to 3 minutes. This gives the API-Server more time to recover and prevents a cycle of failed health checks.
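Before opening the case, it helps to have the cluster’s identifying details ready. A minimal sketch of what we’d typically gather is below; my-cluster and us-central1-a are placeholders for your own cluster name and zone.

```
# Project ID the cluster lives in
gcloud config get-value project

# Cluster name, control-plane endpoint, and current master version
gcloud container clusters describe my-cluster \
  --zone us-central1-a \
  --format="value(name, endpoint, currentMasterVersion)"
```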
Reduce node pool size and load on API-Server
If you can afford potential downtime of your worker nodes, you can take pressure off the API-Server by either scaling down the “chatty” workloads for a period of time, or simply scaling your node pool down to 0, letting the upgrade complete, and then scaling it back up. A sketch of both options follows.
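This is a minimal sketch of both options, assuming Istio as an example “chatty” workload and a node pool named default-pool; substitute your own names, namespaces, zone, and original node count. If cluster autoscaling is enabled on the pool, keep in mind it may scale the pool back up on its own.

```
# Option A: temporarily scale down the chatty workloads (example: all
# Deployments in the istio-system namespace).
kubectl scale deployment --all -n istio-system --replicas=0

# Option B: scale the node pool down to 0 and let the upgrade complete...
gcloud container clusters resize my-cluster \
  --node-pool default-pool --num-nodes 0 --zone us-central1-a

# ...then restore the pool to its previous size once the upgrade finishes.
gcloud container clusters resize my-cluster \
  --node-pool default-pool --num-nodes 3 --zone us-central1-a
```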
Alternative potential causes
Some of our engineers have come across scenarios where the control plane was inaccessible due to a race condition involving the Linux netfilter/conntrack table. This has since been fixed, but older versions were susceptible to it.
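If you want a quick look at conntrack pressure on a node, something like the sketch below works when run on the node itself (for example over SSH). Note that this only shows whether the table is heavily used; it does not detect the race condition itself.

```
# Current vs. maximum number of conntrack entries on this node; a count close
# to the maximum means the table is under pressure.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```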
Long-term solution
Google engineers are aware of the issue, and a fix is planned within the next month for version 1.16 or later (as a manual upgrade). At the time of this blog post, there is no public issue tracker link.
Upgrade your cluster from zonal to regional
Unfortunately, there is no “easy button” to upgrade a cluster from zonal to regional, but one of our cloud architects has written an article describing one approach to the migration using a popular open-source tool, Velero. The core of that flow is sketched below.
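This is only a rough sketch, assuming Velero is already installed in both the old zonal cluster and the new regional cluster and configured to use the same backup storage location; the namespace and backup names are placeholders, and details such as volume snapshots and traffic cutover are covered in the article.

```
# In the old zonal cluster: back up the workloads to migrate.
velero backup create zonal-migration --include-namespaces my-app

# In the new regional cluster (using the same backup storage location):
# restore from that backup.
velero restore create --from-backup zonal-migration
```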