Layers of timeout in an Istio+k8s managed cluster

Question

I have a cluster of microservices. The UI calls API1 (I assume through the ingress gateway; correct me if I am wrong), and API1 calls API2 via RestTemplate.

The API2 process is heavy and takes roughly 1.5 minutes to complete; there are no errors or exceptions in the process itself. For testing purposes, I called API2 directly via Bruno (which is set with a sufficiently large timeout value), and it gives a socket hang-up at around 1 minute, which is expected because the AWS LB connection idle timeout is 1 minute. But in Chrome's network tab I see the call complete successfully, with the "waiting for server" time at 1.5 minutes.

I understand that Istio has a default of 2 retries for connection failures and that the timeout is disabled. My question is: why does the call from the UI succeed after 1.5 minutes rather than waiting 3 minutes and failing? Are the pods behaving in a way that I don't understand? My understanding is that after 1 minute the socket gets closed and a retry kicks off, but that retry should also fail and kick off the 2nd retry, which should in turn fail after the next minute. Is the socket somehow reopened within the total time of 3 x 1 = 3 minutes, so that the call succeeds because it takes less than 3 minutes?

P.S. I am a Junior developer who is just getting into the devops world of cluster orchestration, service mesh, etc. Any clarification is deeply appreciated.

I changed the LB connection idle timeout to a higher value and the call was successful in Bruno, as expected. I put arbitrarily large sleep times in the process and altered the idle timeout values to get the expected results. But I don't understand the difference I am seeing between Chrome (UI->API1->API2) and Bruno (Bruno->API2). I read the docs and scoured Google with no satisfactory answer.

IME there are frequently enough proxies and timeouts (an out-of-cluster load balancer, the cluster ingress gateway, the Istio server and client proxies, the service itself) that it's not productive to try to set the timeouts longer than 60 seconds: there are too many places to change and too many possible accidents from infrastructure changes. If your HTTP service is taking longer than 10 seconds or so, restructure it to produce some sort of asynchronous result, like HTTP 202 with a pollable URL.
– David Maze, Commented Sep 30 at 1:12
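
The "HTTP 202 with a pollable URL" idea from the comment can be sketched roughly as follows. This is a minimal, framework-free illustration, not how the asker's services are built: `submit`, `poll`, `wait_for_result`, `slow_operation`, and the `/jobs/<id>` path are all hypothetical names, and the plain functions stand in for HTTP handlers by returning `(status, headers, body)` tuples.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical in-memory job store; a real service would likely persist jobs.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs: dict[str, Future] = {}


def slow_operation(seconds: float) -> str:
    """Stand-in for the long-running API2 work."""
    time.sleep(seconds)
    return "done"


def submit(seconds: float):
    """Start the work in the background and answer immediately with 202."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(slow_operation, seconds)
    return 202, {"Location": f"/jobs/{job_id}"}, None


def poll(path: str):
    """Report 202 while the job is running, 200 with the result once finished."""
    job_id = path.rsplit("/", 1)[-1]
    future = _jobs.get(job_id)
    if future is None:
        return 404, {}, None
    if not future.done():
        return 202, {"Retry-After": "1"}, None
    return 200, {}, future.result()


def wait_for_result(path: str, interval_s: float = 0.05, max_polls: int = 100):
    """Client side: many short requests instead of one long-lived one."""
    for _ in range(max_polls):
        status, _headers, body = poll(path)
        if status != 202:
            return status, body
        time.sleep(interval_s)
    raise TimeoutError(f"job at {path} did not finish in time")
```

Each individual request in this shape returns quickly, so no single connection has to stay silent long enough to hit an idle timeout at any proxy along the way.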

EasyPea · Accepted Answer · 2025-10-08 08:05:39Z

Multiple timeout layers (load balancer, ingress, Istio sidecars, HTTP client) can each cut off the call; the socket does not "reopen." To fix this, extend or disable the timeout at each layer, or break the long-running operation into an async or polling pattern.
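
As a rough way to reason about which layer cuts a call, here is a small sketch. It assumes a simplified model in which the connection is silent while the server works, so the layer with the shortest idle timeout below that silent period fires first. Only the 60 s AWS LB idle timeout and the roughly 90 s duration come from the question; the other layer names and values, and the function `first_layer_to_cut`, are hypothetical.

```python
from typing import Optional

# Each layer is (name, idle_timeout_seconds); None means no timeout configured.
Layer = tuple[str, Optional[float]]


def first_layer_to_cut(layers: list[Layer], silent_wait_s: float) -> Optional[str]:
    """Return the layer whose timeout fires first, or None if the call survives."""
    firing = [
        (timeout, name)
        for name, timeout in layers
        if timeout is not None and timeout < silent_wait_s
    ]
    return min(firing)[1] if firing else None


API2_DURATION_S = 90  # "roughly 1.5 minutes" from the question

# Path with the AWS load balancer in front (60 s idle timeout, from the question).
via_lb = [("aws-lb", 60), ("istio-sidecar", None)]

# Hypothetical path where no layer has a timeout shorter than the call.
no_short_timeout = [("istio-sidecar", None), ("http-client", 120)]

print(first_layer_to_cut(via_lb, API2_DURATION_S))            # aws-lb
print(first_layer_to_cut(no_short_timeout, API2_DURATION_S))  # None
```

Listing the layers that each request path actually traverses, with their real configured values, is the part that has to be checked against the cluster; the sketch only shows that a single short timeout anywhere on the path is enough to end the call.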
