GKE Autopilot Scaling Loop (scale-up followed by scale-down)

Hello everyone,

I’m investigating a scaling loop issue in my GKE Autopilot cluster, and current evidence points to instability within the kube-system namespace. I’d appreciate any advice on how to proceed.

Context:

  • The cluster was stable for 3+ months, and this issue started around 2–3 weeks ago.

  • The cluster repeatedly scales up by one node, then scales down ~10 minutes later.

  • I’ve confirmed that the “Container Restarts” seen on the dashboard are a result of node drain during scale-down — not the cause.

Key evidence:
Each scaling cycle is immediately preceded by a large spike of error logs from the kube-system namespace, visible in my “Container Error Logs/Sec” dashboard. This makes me suspect the root cause lies there.

Question:
What are the common causes of instability in kube-system pods on GKE Autopilot, and what’s the best diagnostic path forward?

I’m currently inspecting Logs Explorer to pinpoint the failing pod. Are there specific kube-system components I should focus on, or known triggers (e.g., version-related regressions, misconfigurations, or known issues) that could cause behavior like this?
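
For anyone looking at the same thing, a Logs Explorer filter along these lines should surface those spikes (the cluster name is just a placeholder); it narrows the view to error-level container logs from kube-system:

```
resource.type="k8s_container"
resource.labels.namespace_name="kube-system"
resource.labels.cluster_name="my-autopilot-cluster"
severity>=ERROR
```

Adding resource.labels.pod_name and resource.labels.container_name as summary fields makes it fairly quick to see which component is producing the errors.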

Any expert insight on interpreting kube-system error logs or which components to focus on would be greatly appreciated.

Thanks in advance! 🙏


We’re having the same issue, on almost exactly the same timescale (it started around 2–3 weeks before your initial post date). It’s now impacting some of our services: long-running cron processes are being terminated before they start or midway through, which wasn’t happening before. Did you manage to resolve this problem?

Not yet, but I’ve found a temporary workaround that has helped stabilize things. You can try setting up a Pod Disruption Budget (PDB) for your key services to prevent them from being disrupted during node scale-down events.
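
As a rough sketch (the name, namespace, labels, and minAvailable value are placeholders you’d adapt to your own workloads), a minimal PDB looks something like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cron-worker-pdb        # placeholder name
  namespace: default           # the namespace your workload runs in
spec:
  minAvailable: 1              # keep at least one matching pod up during voluntary disruptions
  selector:
    matchLabels:
      app: cron-worker         # must match the labels on the pods you want to protect
```

Keep in mind a PDB only limits voluntary disruptions such as autoscaler node drains; it doesn’t stop the scale-up/scale-down loop itself, it just keeps the autoscaler from draining nodes where that would violate the budget.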

After applying PDBs, some of my nodes have stayed stable and have now been running for about 8 days without unexpected terminations. It’s not a full fix, but it definitely helps mitigate the impact while I continue investigating the root cause.


