Hello everyone,
I’m investigating a scaling loop issue in my GKE Autopilot cluster, and current evidence points to instability within the kube-system namespace. I’d appreciate any advice on how to proceed.
Context:
- The cluster was stable for 3+ months; this issue started around 2–3 weeks ago.
- The cluster repeatedly scales up by one node, then scales down ~10 minutes later.
- I’ve confirmed that the “Container Restarts” seen on the dashboard are a result of node drain during scale-down, not the cause.
Key evidence:
Each scaling cycle is immediately preceded by a large spike of error logs from the kube-system namespace, visible in my “Container Error Logs/Sec” dashboard. This makes me suspect the root cause lies there.
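For reference, this is roughly the filter I’m using to drill into one of those spikes (a sketch only; the cluster/project names and the time window are placeholders for my real values):

```
# Sketch of the Cloud Logging filter I'm using (the same filter body works in Logs Explorer).
# MY_CLUSTER, MY_PROJECT and the timestamps are placeholders.
gcloud logging read '
  resource.type="k8s_container"
  resource.labels.cluster_name="MY_CLUSTER"
  resource.labels.namespace_name="kube-system"
  severity>=ERROR
  timestamp>="2025-01-01T00:00:00Z" AND timestamp<="2025-01-01T00:15:00Z"
' \
  --project=MY_PROJECT \
  --limit=100 \
  --format='value(resource.labels.pod_name, severity, textPayload)'
```

Grouping the results by `resource.labels.pod_name` (or `container_name`) is how I’m trying to narrow down which component the errors are coming from.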
Question:
What are the common causes of instability in kube-system pods on GKE Autopilot, and what’s the best diagnostic path forward?
I’m currently inspecting Logs Explorer to pinpoint the failing pod. Are there specific kube-system components I should focus on, or known issues (e.g., version-related regressions or misconfigurations) that could trigger behavior like this?
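In case it helps to see what I’ve tried so far, these are the read-only kubectl checks I’m running alongside Logs Explorer (a sketch; as far as I can tell kube-system is still readable on Autopilot even though it can’t be modified):

```
# kube-system pods ordered by the restart count of their first container,
# to see whether anything is crash-looping before each scale-up
kubectl get pods -n kube-system \
  --sort-by='.status.containerStatuses[0].restartCount'

# Recent warning events in kube-system, oldest first
kubectl get events -n kube-system \
  --field-selector type=Warning \
  --sort-by='.lastTimestamp'
```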
Any expert insight on interpreting kube-system error logs or which components to focus on would be greatly appreciated.
Thanks in advance!


