GKE Autopilot Scaling Loop (scale-up followed by scale-down)

Hello everyone,

I’m investigating a scaling loop issue in my GKE Autopilot cluster, and current evidence points to instability within the kube-system namespace. I’d appreciate any advice on how to proceed.

Context:

  • The cluster was stable for 3+ months, and this issue started around 2–3 weeks ago.

  • The cluster repeatedly scales up by one node, then scales down ~10 minutes later.

  • I’ve confirmed that the “Container Restarts” seen on the dashboard are a result of node drain during scale-down — not the cause.

Key evidence:
Each scaling cycle is immediately preceded by a large spike of error logs from the kube-system namespace, visible in my “Container Error Logs/Sec” dashboard. This makes me suspect the root cause lies there.

Question:
What are the common causes of instability in kube-system pods on GKE Autopilot, and what’s the best diagnostic path forward?

I’m currently inspecting Logs Explorer to pinpoint the failing pod. Are there specific kube-system components I should focus on, or known triggers (e.g., version-related regressions, misconfigurations, or known issues) that could cause behavior like this?
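
For anyone looking at the same thing, a Logs Explorer filter along these lines should surface those spikes (the cluster name is just a placeholder); it narrows the view to error-level container logs from kube-system:

```
resource.type="k8s_container"
resource.labels.namespace_name="kube-system"
resource.labels.cluster_name="my-autopilot-cluster"
severity>=ERROR
```

Adding resource.labels.pod_name and resource.labels.container_name as summary fields makes it fairly quick to see which component is producing the errors.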

Any expert insight on interpreting kube-system error logs or which components to focus on would be greatly appreciated.

Thanks in advance! 🙏


We’re having the same issue, on almost exactly the same timescale (it started around 2–3 weeks before your initial post date). It’s now impacting some of our services: long-running cron processes are being terminated before they start or midway through, which wasn’t happening before. Did you manage to resolve this problem?

Not yet, but I’ve found a temporary workaround that has helped stabilize things. You can try setting up a Pod Disruption Budget (PDB) for your key services to prevent them from being disrupted during node scale-down events.
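
As a rough sketch (the name, namespace, labels, and minAvailable value are placeholders you’d adapt to your own workloads), a minimal PDB looks something like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cron-worker-pdb        # placeholder name
  namespace: default           # the namespace your workload runs in
spec:
  minAvailable: 1              # keep at least one matching pod up during voluntary disruptions
  selector:
    matchLabels:
      app: cron-worker         # must match the labels on the pods you want to protect
```

Keep in mind a PDB only limits voluntary disruptions such as autoscaler node drains; it doesn’t stop the scale-up/scale-down loop itself, it just keeps the autoscaler from draining nodes where that would violate the budget.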

After applying PDBs, some of my nodes have stayed stable and have now been running for about 8 days without unexpected terminations. It’s not a full fix, but it definitely helps mitigate the impact while I continue investigating the root cause.


