
Upgrading Google Kubernetes Engine

At DoiT International we work with customers small and large, and from time to time we recognize common issues, especially among some of our larger-scale customers. We recently ran into an issue when upgrading Google’s GKE (managed Kubernetes) that we feel is worth sharing, in case others are planning their upgrades or run into similar issues.

Symptoms

It’s important to say that not all customers experience this, but these are the symptoms we’ve witnessed (a quick way to confirm them is sketched after the list):

  • after the upgrade, the kubectl CLI cannot interact with the cluster (API-Server not responding)
  • the upgrade process appears to be “hanging” and has not completed after 20+ minutes
  • an error notice in the console or logs citing the following or similar: “All cluster resources were brought up, but: component “kube-apiserver” from endpoint “gke-XXXXXXXX-XXXXXXX” is unhealthy.”
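
If you suspect you are hitting this, a quick way to confirm is to check whether the API-Server responds at all and whether the master upgrade operation is still running. A minimal sketch (the cluster zone below is a placeholder; substitute your own):

    # Does the control plane respond at all? A hung upgrade typically times out here.
    kubectl cluster-info

    # List recent master upgrade operations; one stuck in RUNNING status for
    # 20+ minutes matches the symptoms above.
    gcloud container operations list \
        --zone us-central1-a \
        --filter="operationType=UPGRADE_MASTER" \
        --sort-by="~startTime" \
        --limit=5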

Risk Diagnosis

Although not everyone experiences this, the commonalities we’ve witnessed include:

  • GKE cluster version below 1.16
  • Zonal cluster (single-zone control plane/master)
  • “Chatty” workloads that continuously interact with API-Server like Istio, Flux, or ArgoCD
  • Minor-version upgrades such as 1.12 -> 1.13, 1.13 -> 1.14, 1.14 -> 1.15, or 1.15 -> 1.16 (typically not seen during patch updates)
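
To check whether a cluster fits this profile, look at its current master version and whether its location is a single zone (zonal) or a region (regional). A minimal sketch with placeholder cluster and zone names:

    # A zone-style location (e.g. us-central1-a) means a zonal, single-zone
    # control plane; a region (e.g. us-central1) means a regional control plane.
    gcloud container clusters describe my-cluster \
        --zone us-central1-a \
        --format="value(currentMasterVersion,location)"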

Who might be impacted?

We’d like to reiterate that this has only occurred with a few customers thus far, and most fit the profile above: zonal clusters with heavy workloads hitting the API-Server, causing it to take too long to pass health checks during the upgrade and leaving it stuck in a “hung” state. We hope this information is helpful, either for planning future version upgrades or for troubleshooting existing issues.

Remediation

Google support case to increase health check timeout

If you are already experiencing this issue but your worker nodes are still serving traffic (just no kubectl access to the control plane), you can submit a support request with Google Support, or with your technical support partner (we hope it’s DoiT International), to increase the health check timeout to 3 minutes. This gives the API-Server more time to recover and prevents a cycle of failed health checks.

Reduce node pool size and load on API-Server

If you can afford potential downtime of your worker nodes, you can ease the pressure on the API-Server by either disabling the “chatty” workloads (scaling them down for a period of time), or simply scaling your node pool down to 0, letting the upgrade complete, and then scaling it back up.
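
As a rough sketch of both options (the cluster, node pool, namespace, and workload names below are placeholders, not taken from any particular setup):

    # Option 1: temporarily scale down the chatty workload(s), e.g. Istio, Flux, or ArgoCD components.
    kubectl -n <namespace> scale deployment <chatty-workload> --replicas=0

    # Option 2: scale the node pool to 0, let the upgrade complete, then scale back up.
    gcloud container clusters resize my-cluster \
        --node-pool default-pool \
        --num-nodes 0 \
        --zone us-central1-a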

Alternative potential causes

Some of our engineers have also come across scenarios where the control plane was inaccessible due to a race condition in the Linux netfilter/conntrack table. This has since been fixed, but older versions were susceptible to it.

Long-term solution

Google engineers are aware of the issue, and a fix is planned within the next month for version 1.16 or later (manual upgrade required). At the time of writing, there is no public issue tracker link.
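
Once a fixed version is available for your cluster, the manual control-plane upgrade itself is a single command. A sketch with placeholder names (pick an actual 1.16 version string from what get-server-config lists for your zone):

    # See which master versions are currently available in the zone.
    gcloud container get-server-config --zone us-central1-a

    # Manually upgrade the control plane (master) to a specific version.
    gcloud container clusters upgrade my-cluster \
        --master \
        --cluster-version 1.16.x-gke.y \
        --zone us-central1-a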

Upgrade your cluster from zonal to regional

Unfortunately, there is no “easy button” to upgrade a cluster from zonal to regional, but one of our cloud architects has written an article that describes one approach to the migration using a popular open-source tool, Velero.
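
We won’t reproduce that article here, but at a very high level a Velero-based migration means backing up the workloads in the zonal cluster and restoring them into a newly created regional cluster. A rough sketch, assuming Velero is already installed in both clusters against object storage that both can reach, and using placeholder names:

    # In the zonal (source) cluster: back up all namespaces, including volume snapshots.
    velero backup create zonal-to-regional --include-namespaces '*' --snapshot-volumes

    # Point kubectl at the new regional cluster, then restore from that backup.
    gcloud container clusters get-credentials my-regional-cluster --region us-central1
    velero restore create --from-backup zonal-to-regional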
