May 14, 2020 11:23 pm GMT

If at first you don't get an answer...

give up quickly and ask someone else. That's my favorite microservices retry policy. To understand what works and what doesn't, we need to think about how distributed systems of services and databases interact with each other, not just a single call. Like my last post on why systems are slow sometimes, I will start simple, pick out key points, and gradually explain the things that will kill your distributed system if you don't have timeouts and retries set up properly. It's also one of the easiest things to fix once you understand what's going on, and it generally doesn't need new code, just configuration changes.

A bad timeout and retry policy can break a distributed systems architecture, but is relatively easy to fix.

I often find the timeout and retry policy across an application is set to whatever the default is for the framework or sample code you copied when starting to write each microservice. If every microservice uses the same timeout, this is guaranteed to fail, because the microservices deeper in the system are still retrying when the microservices nearer the edge have given up waiting for them.

Don't use the same default timeout settings across your distributed system.

This causes an avalanche of extra calls, known as work amplification, triggered when just one microservice or database deep in the system slows down for some reason. I used to work on a system that fell over at 8pm every Monday night because the monolithic backend database backup was scheduled then, and database queries slowed down enough to trigger a retry storm that overloaded the application servers. The simplest way to reduce work amplification is to limit the number of retries to one. I've seen defaults that are higher, and seen people react to seeing timeouts by increasing the number of retries, but this just multiplies the amount of extra work the system is trying to do when it's already struggling.

Limit the number of retries to one, to minimize work amplification.
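
To see why a single retry matters, note that with three layers of calls each retrying once, the deepest service can receive up to 2^3 = 8 attempts for a single user request. As a rough sketch of the configuration change, here is one way to cap retries at one in Python using the requests and urllib3 libraries; the endpoint, status codes and timeout value are illustrative, not recommendations.

```python
# A minimal sketch: cap retries at one to limit work amplification.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_policy = Retry(
    total=1,                           # at most one retry per call
    status_forcelist=[502, 503, 504],  # only retry responses that look transient
)
session.mount("https://", HTTPAdapter(max_retries=retry_policy))

# Hypothetical internal endpoint, for illustration only.
response = session.get("https://inventory.internal.example/items", timeout=2.0)
```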

This should be a fairly easy configuration change to implement, and a good checklist question for your inventory of microservices is how the timeout and retry settings are set, and what is needed to change them. It's really nice if the settings can be changed dynamically and take effect in a few seconds. That provides a mechanism that may be able to bring a catatonic system back to life during an outage. Unfortunately, in many cases the configuration is set once when the application starts, or is hard-coded in the application, or is buried so deep inside some imported library that no one knows a timeout setting exists, or how to change it.

Discovering and documenting current settings and how to change timeouts and retries can be a pain, but it's worth it.
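
As a small sketch of what "changeable in a few seconds" can look like, the settings can live in one place and be re-read periodically, so a configuration push takes effect without a restart. The file path, names and values below are illustrative assumptions.

```python
# A minimal sketch: timeout and retry settings re-read every few seconds,
# so they can be changed during an incident without restarting the service.
import json
import os
import time

_CONFIG_PATH = os.environ.get("TIMEOUT_CONFIG", "/etc/myservice/timeouts.json")
_cache = {
    "loaded_at": 0.0,
    "settings": {"connect_timeout": 0.2, "request_timeout": 2.0, "retries": 1},
}

def timeout_settings(max_age_seconds=5.0):
    """Return the current settings, refreshing from disk every few seconds."""
    now = time.monotonic()
    if now - _cache["loaded_at"] > max_age_seconds:
        try:
            with open(_CONFIG_PATH) as f:
                _cache["settings"] = json.load(f)
        except (OSError, ValueError):
            pass  # keep the last known good settings if the file is missing or invalid
        _cache["loaded_at"] = now
    return _cache["settings"]
```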

Many people don't realize that there are two different kinds of timeout: some coding platforms and libraries bundle them together, and others expose them separately. The best analogy is the difference between sending a letter and making a phone call. If you send a letter with an RSVP, you wait for a response. Maybe after a timeout you send another letter. It's the simplest request/response mechanism, and UDP based protocols like DNS work this way. A phone call is different because the first step is to make a connection by establishing the call. The other party has to pick up the phone and acknowledge that they can hear you, and you have to be sure you are talking to the right person. If they don't pick up, you retry the connection by calling again. Once the call is in progress, you make the request and wait for the other party to respond. You can make several requests during a call, but if the line goes dead, you have to call again. TCP based protocols like HTTP, which underlies many APIs, work this way.

Be clear about the difference between connections and requests for the protocols you are using.

The connection timeout is how long you wait before giving up on getting a connection made to the next step in your request flow. The request timeout is how long you wait for the rest of the entire flow to complete from that point. The common case is to bundle everything into the request at the application level, and have some lower level library handle the connections. Connections may be set up in advance during an initialization and authentication phase, or established when the first request is made. Subsequent application requests could cause an entirely new connection to be made (as in HTTP/1.0), or a keep-alive option (supported in HTTP/1.1 and later) can keep the connection around to get a quicker response.

Dig deep to find any hidden connection timeout and keep-alive settings.
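
As a small sketch of what these settings look like when a library does expose them separately, Python's requests library accepts a (connect, read) timeout pair, and a Session keeps connections alive between requests. The URLs and values are illustrative.

```python
# A minimal sketch: separate connection and request timeouts, plus keep-alive.
import requests

session = requests.Session()  # a Session pools and reuses connections (HTTP keep-alive)

# timeout=(connect, read): time to establish the connection, then time to wait for a response.
first = session.get("https://catalog.internal.example/items/1", timeout=(0.2, 2.0))
second = session.get("https://catalog.internal.example/items/2", timeout=(0.2, 2.0))  # reuses the connection
```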

How should you decide how to set the connection and request timeouts?
While connection timeouts depend on the network latency to get to the service interface, request timeouts depend on the end-to-end time to call through many layers of the system. The closer to the end user, the longer the request timeout needs to be. Deeper into the system, request timeouts need to be shorter. For simple three-tier architectures, like a front end web service, back end application service and database, it's clear what order things are called in, and nested timeouts can be set up for each tier.

Edge timeouts must be long enough, and back end timeouts short enough, that the edge doesn't give up while the back end is still working.
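
As a trivial sketch of nested timeouts for a three-tier system, each tier gets a shorter budget than the one above it, so the edge never gives up while the back end is still working. The numbers are illustrative, not recommendations.

```python
# A minimal sketch: request timeouts (in seconds) that shrink as calls go deeper.
TIER_REQUEST_TIMEOUTS = {
    "web_frontend": 8.0,  # closest to the user, longest budget
    "app_service":  4.0,  # must succeed or fail well before the frontend gives up
    "database":     2.0,  # deepest in the stack, shortest budget
}
```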

Web browsers and mobile apps often have a 30s default timeout before they give up. However, humans tend to hit the refresh button after about 10s if they are staring at a blank screen. If you have too many retries and long timeouts between your services, it's easy for them to add up to more than a few seconds, and while your system is still working to get a response, the end user will ignore the result because they have already given up and sent a new request. The best outcome is that your application returns before they retry manually, with a message that tells them you gave up, and asks if they want to try again. This minimizes the chance that you'll get flooded by work amplification.

For human interactions, if you can't respond within 10s, give up and respond with an error.
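
One way to enforce this at the edge is to run the downstream work under an overall deadline and return an explicit error when it expires. The sketch below uses Python's standard concurrent.futures; build_page and the response codes are hypothetical stand-ins for real handler code.

```python
# A minimal sketch: give the whole flow roughly 10s, then respond with an error.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as RequestTimedOut

executor = ThreadPoolExecutor(max_workers=16)

def handle_page_request(build_page):
    """build_page is a hypothetical callable that fans out to downstream services."""
    future = executor.submit(build_page)
    try:
        return 200, future.result(timeout=10.0)
    except RequestTimedOut:
        # Give up, tell the user, and ask if they want to try again,
        # before they hit refresh themselves and add to the load.
        return 503, "Sorry, that took too long. Try again?"
```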

One question is what to do after giving up on a call. This can be problematic: as described in the Amazon Builders Library piece Avoiding fallback in distributed systems, the fallback code that handles problems is often poorly tested, if it is tested at all. A common problem is that a system that times out calls elsewhere will crash, return bad or corrupted data, or freeze.

Injecting slow and failed responses from the dependencies of a service is an important chaos testing technique.
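
Here is a sketch of what that kind of test can look like, using Python's standard unittest.mock to inject slow and failed dependency responses; the pricing service and functions are hypothetical stand-ins for real code.

```python
# A minimal sketch: chaos-style tests that inject timeouts from a dependency.
import time
from unittest import mock

import requests

def get_price(item_id):
    """Stand-in service code: call a hypothetical pricing dependency, with a fallback."""
    try:
        r = requests.get(f"https://pricing.internal.example/{item_id}", timeout=(0.2, 2.0))
        return r.json()["price"]
    except requests.exceptions.RequestException:
        return None  # degraded: the caller can show "price unavailable" instead of crashing

def test_price_lookup_degrades_when_connection_times_out():
    with mock.patch("requests.get", side_effect=requests.exceptions.ConnectTimeout):
        assert get_price("abc123") is None  # exercises the fallback path, not just the happy path

def test_price_lookup_degrades_when_dependency_is_slow():
    def slow_then_fail(*args, **kwargs):
        time.sleep(0.1)  # simulate a sluggish dependency that eventually gives up
        raise requests.exceptions.ReadTimeout
    with mock.patch("requests.get", side_effect=slow_then_fail):
        assert get_price("abc123") is None
```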

In my last post I talked about how end to end response time is made up of residence times for every step along the way, and those residence times are a combination of wait time and service time. During a timeout, there is an increase in wait time, and you can think of this as an additional timeout queue, where work is being parked that can't move forward. Each retry adds to service time, as the request is processed again.

If there is only one instance of a service endpoint or database, you have nowhere else to go. Attempting to connect to it repeatedly and quickly is unlikely to help it recover, so it's common to use a back-off algorithm. Some systems implement exponentially increasing back-off, but it's important to use bounded exponential back-off with an upper limit; otherwise it can take too long to get connected, and upstream systems will already have given up waiting. This is also sometimes called capped exponential back-off. Random back-off is another good policy, as it disperses the thundering herd of clients that could otherwise back off and return in a loosely synchronized way. Deterministic jittered back-off is a way to give each instance a different back-off, but with a pattern that can be used as a signature for analysis purposes. Marc Brooker of AWS describes this technique in Timeouts, Retries and Backoff with Jitter.

Use random or jittered back-off to disperse clients and avoid thundering herd behaviors.
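
As a rough sketch, capped exponential back-off with full jitter, along the lines Marc Brooker describes, can be just a few lines; the base and cap values are illustrative.

```python
# A minimal sketch: capped ("bounded") exponential back-off with full jitter.
import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    """Seconds to wait before retry number `attempt` (1, 2, 3, ...)."""
    exponential = min(cap, base * (2 ** attempt))  # grows exponentially, but never past the cap
    return random.uniform(0, exponential)          # full jitter disperses the thundering herd

# Example: the first retry waits somewhere in [0, 0.1s), the second in [0, 0.2s), capped at 2s.
```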

Consider a deeply nested microservice architecture, where the request flow depends on what kind of request is being made, and the ideal request timeout ends up being flow dependent. If we can pass a timeout value through the chain of individual spans in the flow, then dynamic timeouts are possible. One way to implement this is to pass a deadline timestamp through the flow. If you can't respond within the deadline, give up. This requires closely synchronized clocks across the system, and may be a useful technique for relatively large timeouts of tens of seconds to minutes. Some batch management systems implement deadline scheduling for long running jobs. For short latency microservice architectures, the ideal policy would be to decrement a latency budget for each span in the flow, and give up when there's not enough budget left to get a result, while there is still time to respond back up and report the failure cleanly before the user times out at the edge.

Dynamic request timeout policies are ideal for deeply nested microservices architectures.
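
As a sketch of the latency budget idea, each span can forward its remaining budget in a header and keep a small reserve for reporting failure cleanly. The header name, reserve, and helper below are illustrative assumptions, not a standard.

```python
# A minimal sketch: propagate a shrinking latency budget down the call chain.
import time

import requests

BUDGET_HEADER = "X-Request-Budget-Ms"  # hypothetical header name

def call_downstream(url, incoming_budget_ms, reserve_ms=50):
    """Forward the remaining budget, keeping a reserve to report failure cleanly."""
    remaining = incoming_budget_ms - reserve_ms
    if remaining <= 0:
        raise TimeoutError("not enough budget left to attempt the call")
    start = time.monotonic()
    response = requests.get(
        url,
        headers={BUDGET_HEADER: str(int(remaining))},
        timeout=remaining / 1000.0,  # never wait longer than the budget we were given
    )
    spent_ms = (time.monotonic() - start) * 1000
    return response, incoming_budget_ms - spent_ms  # what's left for the caller to spend
```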

Connection timeouts depend on the number of round trips needed to set up a connection (at least two, more for secure connections like HTTPS) and the network latency to the next service in line, so they should be very short: a small multiple of the latency itself. Within a datacenter or AWS availability zone, round trip network latency is generally around a millisecond, and a few milliseconds between AWS availability zones. Sometimes people talk about one-way latency numbers, so be clear about what is being discussed, as a round trip is twice as long.

Use a short timeout for connections between co-located microservices, whether direct, or via a service mesh proxy like Envoy.

Round trip latency between AWS regions in the same continent is likely to be in the range 10-100ms, and between continents, a few hundred milliseconds. Calls to third party APIs or your own remote services will need a much higher timeout setting. Your measurements will vary, but will be limited by signals propagating at less than the speed of light, about 300,000km/s, which is about 150km round trip per millisecond. Processing and routing the network packets etc. will always make it much slower than this, but unfortunately the speed of light is not increasing so this is a fundamental latency limit!

Use a longer timeout for connections between AWS regions, and calls to third party APIs.

If you set your connection timeout to be shorter than the round trip, it will fail continuously, so it's common to use a high value. This works fine until there's a problem, and then it magnifies the problem, causing requests to time out when they could be succeeding over a different connection. It's much better to fail fast and give up quickly on a connection that isn't going to work out. It would be ideal for the connection timeout to be adaptive: start out large, then shrink to fit the actual network latency situation. TCP does something like this for congestion control once connections are set up, but I don't know of any platforms that learn and adapt to latency. If you know of one, let me know in the comments and I'll update this post.

Set connection timeouts to be about 10x the round trip latency, if you can...
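
Since, as noted above, platforms don't generally adapt this automatically, here is a hand-rolled sketch of sizing the connection timeout from a smoothed measurement of round trip latency, roughly 10x the RTT with a floor. The EWMA smoothing and the constants are assumptions for illustration.

```python
# A minimal sketch: derive the connection timeout from measured round trip latency.
def update_connect_timeout(measured_rtt_s, smoothed_rtt_s, alpha=0.2, multiple=10, floor_s=0.05):
    """Blend the latest RTT into a smoothed estimate and size the timeout from it."""
    smoothed_rtt_s = (1 - alpha) * smoothed_rtt_s + alpha * measured_rtt_s  # simple EWMA
    connect_timeout_s = max(floor_s, multiple * smoothed_rtt_s)             # ~10x RTT, never absurdly small
    return smoothed_rtt_s, connect_timeout_s

# Example: with ~1ms RTT inside an availability zone, the 50ms floor dominates and sets the timeout.
```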

The default connection retry policy is to just repeat the call, like phoning someone over and over again on the same number. However, if that person has a home number and a mobile number, you might try the other number after the first failure, and increase your chance of getting through. It's common to have horizontally scaled services, and if you've built a stateless microservices architecture (one that doesn't use sessions to route traffic to locally cached state), your default connection retry policy should be to try a different instance of the same service. Unfortunately many microservices frameworks don't have this option, but it was built into the Netflix open source service mesh released in 2012. NetflixOSS Ribbon and the Eureka service registry implemented a Java based service mesh; Envoy implements a language agnostic "side-car" process based service mesh with a configurable connection timeout, and defaults retries to one, with bounded and jittered backoff.

If you have horizontally scaled services, don't retry connections to the same one that just failed.
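
As a sketch of that default policy, a connection-level retry can simply pick a different instance from the list of horizontally scaled endpoints. The instance addresses and limits below are illustrative.

```python
# A minimal sketch: retry a failed connection against a different instance.
import random

import requests

INSTANCES = [
    "https://catalog-1.internal.example",
    "https://catalog-2.internal.example",
    "https://catalog-3.internal.example",
]

def get_with_failover(path, connect_timeout=0.2, read_timeout=2.0, max_attempts=2):
    candidates = random.sample(INSTANCES, k=min(max_attempts, len(INSTANCES)))
    last_error = None
    for base in candidates:  # each attempt goes to a different instance
        try:
            return requests.get(base + path, timeout=(connect_timeout, read_timeout))
        except requests.exceptions.ConnectionError as error:
            last_error = error  # connection refused or timed out; try the next instance
    raise last_error
```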

There's a high probability that an instance of a microservice that failed to connect will fail again for the same reason. It could be overloaded, out of threads, or doing a long garbage collection. If an instance is not listening for connections because its application process has crashed, or it has hit a thread or connection limit, you get a fast fail and a specific error return, as the connection is refused immediately. It's like an unobtainable number error for a phone call. Immediately calling again in this case is clearly pointless.

Fast failures are good, and don't retry to the same instance immediately if at all possible.

I've seen some microservice based applications behaving in a fragile or brittle manner, collapsing when conditions weren't perfect. However, after fixing their timeout and retry policies along these lines, they became stable and robust, absorbing a lot of problems so that customers saw a slightly degraded rather than an offline service. I hope you find these ideas and references useful for improving your own service robustness.

Photo taken by Adrian in March 2019 at the Roman Amphitheater, Mérida, Spain. A nice example of a low latency, high fan-out, communication system.


Original Link: https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85
