Sources Contact Advanced Search Tutorials

An Interest In:

Web News this Week

Search Archive

Some of Our Sources

View All Sources

Help Webnuz

Referal links:

October 21, 2021 10:02 am GMT

Failure Detection. The Phi Accrual Failure Detector

Introduction

In distributed systems, Nodes need to know if their peers are still healthy and update their peer lists accordingly. The simplest way to do this is with heartbeats. If there is a network with nodes A, B, and C, A sends heartbeat messages to its peers B and C saying Im still here. If B and C don't hear from node A after a particular period, they can assume that node A is dead and can no longer handle requests, and nodes B and C will stop sending traffic to node A.
The question is how long should nodes B and C wait for A before marking it as dead. Twenty seconds ?, Thirty seconds ?. If a very short time is chosen as the threshold, there will be lots of false positives on the dead nodes peer list. A lot of nodes will be marked as dead when they are actually not. They might have had bad internet connections etc. On the other hand, if a longer time is chosen as the threshold, there will be lots of failed requests because dead nodes will keep getting traffic as though theyre still active and dropping them. So what exactly is the sweet spot? The Phi Accrual failure detector.

The Phi Accrual Failure Detector

The Phi Accrual failure detector, which was originally introduced in this paper uses a suspicion model. For example, when nodes B and C startup, theyll set the confidence level they have in node A to 0.5, where the scale is from 0 to 1. After a particular period, if node A doesn't respond, the confidence level is dropped by 0.1, which makes the new confidence 0.4. If it does respond, it is increased by 0.1 and the confidence is 0.6. A threshold can then be set, that if node A doesn't respond when it's down to 0.2 then we can mark the node as dead. This is a more compound approach because it recognizes the past response times of the node. Data can also be stored, and used to analyze and decide what to set the threshold for a node to be marked as dead to.
Simulation
I did a small simulation with Docker and Golang, and this is what it looked like.

Leader

The first component is the Leader. I set up the entity in leader.go. The Leader receives the heartbeats, updates the tables, manages the thresholds and the failure points.

var (   threshold = 15   failurePoint = 0.2)

The threshold is set to 15 and is denominated to seconds . This means if we don't hear from a node after fifteen seconds, the confidence level will be decremented by 0.1.
The failurePoint is set to 0.2 and once the nodes confidence level reaches the failurePoint or lower, it will be marked as dead.

Follower

The next entity is the Follower which is setup in
follower.go. The followers will send periodic heartbeats to the leader. A heartbeat looks like this:

type HeartBeat struct {   IpAddress string   Pid       int   Timestamp int64}

The follower sends its IP address, process id (PID), and timestamp to the leader periodically.
I setup two docker containers, phi_leader to hold the leader executable and phi_follower for the follower executable.
I scaled the phi_follower instance to 5 instances on a docker swarm network.

# docker service scale phi_follower=5  phi_follower scaled to 5overall progress: 0 out of 5 tasks 1/5: ready     [======================================>            ] 2/5: ready     [======================================>            ] 3/5: starting  [============================================>      ] 4/5: preparing [=================================>                 ] 5/5: ready     [======================================>            ]

After running this process for about five minutes, I tailed the leaders logs and the peer list looks like this:

# docker service logs phi_leader -fReceived heart beat from process with IP 10.0.4.222 and PID 1updating Accrual failure detection table...    *------------*-----*-----------------*------------*         IP      PID    CONFIDENCE        TIME        *------------*-----*-----------------*------------*     10.0.4.219    1         0.500000  1634756932      10.0.4.223    1         0.500000  1634756940      10.0.4.222    1         0.600000  1634756942      10.0.4.224    1         0.600000  1634756924      10.0.4.225    1         0.900000  1634756922     *------------*-----*-----------------*------------*                        TOTAL PROCESSES      5          *------------*-----*-----------------*------------*

The five different nodes have been registered on the leader, with different confidence levels and timestamps. With the confidence levels, all the peers are healthy.
Now I stopped two nodes, and watched to see if the detector decrements their confidence levels.

// Stop first container# docker stop 14d07e02a1ad            14d07e02a1ad// Stop second container# docker stop 3829bb23fe413829bb23fe41

After 17 seconds, which is below the threshold of 15 seconds, a node hadnt sent a heartbeat and the confidence level was decreased. Node with IP 10.0.4.223 has been dropped to 0.9

# docker service logs phi_leader -fChecking for inactive processes...Havent received a heartbeat from process with ip 10.0.4.223 and pid 1 for the past 17 secondsDecrementing process confidence by 0.1    *------------*-----*-----------------*------------*         IP      PID    CONFIDENCE        TIME        *------------*-----*-----------------*------------*     10.0.4.219    1         1.000000  1634757451      10.0.4.223    1         0.900000  1634757439      10.0.4.222    1         1.000000  1634757453      10.0.4.224    1         1.000000  1634757440      10.0.4.225    1         1.000000  1634757430     *------------*-----*-----------------*------------*                        TOTAL PROCESSES      5          *------------*-----*-----------------*------------*

After a few minutes of tailing the logs, one of the nodes go below the failure point of 0.2and is marked as dead. The leader removes this node from its peer list, and the total drops to 4

# docker service logs phi_leader -fThis process with ip 10.0.4.224 and pid 1 now has confidence 0.200000 which is below or equal to the failure point 0.200000 and will be marked as dead    *------------*-----*-----------------*------------*         IP      PID    CONFIDENCE        TIME        *------------*-----*-----------------*------------*     10.0.4.219    1         0.900000  1634757996      10.0.4.223    1         1.000000  1634757998      10.0.4.222    1         1.000000  1634758005      10.0.4.227    1         0.700000  1634758010     *------------*-----*-----------------*------------*                        TOTAL PROCESSES      4          *------------*-----*-----------------*------------*

So it worked! The code to the full implementation and simulation of the Phi Accrual Detector can be found here.
Thank you for reading!

Originally published here

Notes

The actual phi accrual detector is more complex than this, it uses a better probabilistic model like calculating the mean and standard deviation of previous results to determine the threshold and failure point instead of hardcoding the values as we did. You can find a good read here (phi = -log10(1 - F(timeSinceLastHeartbeat)) )

Original Link: https://dev.to/obbap/failure-detection-the-phi-accrual-failure-detector-4jgj

Share this article:

View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To