May 25, 2022 07:28 pm GMT

How automatic repair works in AKS

AKS continuously monitors the health state of worker nodes and performs automatic node repair if they become unhealthy. The Azure virtual machine (VM) platform performs maintenance on VMs experiencing issues.

AKS and Azure VMs work together to minimize service disruptions for clusters.

In this article, we'll learn how automatic node repair functionality behaves for both Windows and Linux nodes.

How AKS checks for unhealthy nodes

AKS uses the following rules to determine if a node is unhealthy and needs repair:

The node reports NotReady status on consecutive checks within a 10-minute timeframe.
The node doesn't report any status within 10 minutes.

We can manually check the health state of our nodes with kubectl.

kubectl get nodes
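To narrow that output down to the nodes that need attention, a small filter can help. The sketch below is an illustration rather than anything AKS ships: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where the second column is STATUS.

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin, so it also works
# on saved output. A cordoned but healthy node ("Ready,SchedulingDisabled")
# is not reported, because only the part before the first comma is checked.
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node currently reports Ready.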

How automatic repair works

If AKS identifies an unhealthy node that remains unhealthy for 10 minutes, AKS takes the following actions:

1. Reboot the node.
2. If the reboot is unsuccessful, reimage the node.
3. If the reimage is unsuccessful, redeploy the node.
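The escalation order above can be modelled as a simple loop. This is only a sketch of the ordering described in this article, not AKS's actual implementation: repair_in_order and the hook argument are hypothetical names, and the hook stands in for whatever performs each action.

```shell
# Try each repair action in the documented order and stop at the first
# one that succeeds. The hook (hypothetical) is called as:
#   <hook> <action> <node>
# and must return 0 on success, non-zero on failure.
repair_in_order() {
  hook=$1
  node=$2
  for action in reboot reimage redeploy; do
    if "$hook" "$action" "$node"; then
      echo "$action"      # report which action fixed the node
      return 0
    fi
  done
  echo "escalate"         # nothing worked; hand over for manual investigation
  return 1
}
```

The function prints the action that succeeded, or "escalate" when all three fail.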

If auto-repair is unsuccessful, AKS engineers investigate alternative remediations.

If AKS finds multiple unhealthy nodes during a health check, each node is repaired individually before another repair begins.

Node Autodrain

Scheduled Events can occur on the underlying virtual machines (VMs) in any of our node pools. For spot node pools, scheduled events may cause a preempt node event for the node.

Certain node events, such as preempt, cause AKS node autodrain to attempt a cordon and drain of the affected node, which allows for a graceful reschedule of any affected workloads on that node.
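For reference, the same cordon and drain can be done by hand with kubectl. The wrapper below is a minimal sketch: drain_node and the KUBECTL override variable are hypothetical conveniences, and the drain flags shown are a common choice rather than what AKS necessarily uses internally.

```shell
# Cordon a node, then drain it so its pods are rescheduled elsewhere.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# commands without touching a cluster.
drain_node() {
  node=$1
  "${KUBECTL:-kubectl}" cordon "$node" &&
    "${KUBECTL:-kubectl}" drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Usage (requires kubectl 1.20 or later for --delete-emptydir-data):
#   drain_node <node-name>
```

Cordon marks the node unschedulable; drain then evicts the pods already running on it.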

When this happens, we might notice that the node receives the taint "remediator.aks.microsoft.com/unschedulable" because of "kubernetes.azure.com/scalesetpriority: spot".
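To see which nodes currently carry that taint, we can list each node's taint keys and filter on the key. This is a sketch under assumptions: list_node_taints and autodrained_nodes are hypothetical helper names, and the filter expects one line per node in the form name, a tab, then the taint keys.

```shell
# Print "<node-name><TAB><taint keys>" for every node (requires kubectl).
list_node_taints() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# Keep only the nodes whose taint keys include the autodrain taint.
autodrained_nodes() {
  awk -F '\t' 'index($2, "remediator.aks.microsoft.com/unschedulable") { print $1 }'
}

# Usage against a live cluster:
#   list_node_taints | autodrained_nodes
# Spot nodes can be listed separately by label:
#   kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot
```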

The following table shows the node events, and the actions they cause for AKS node autodrain.

Event	Description	Action
Freeze	The VM is scheduled to pause for a few seconds. CPU and network connectivity may be suspended, but there is no impact on memory or open files.	No action
Reboot	The VM is scheduled for reboot. The VM's non-persistent memory is lost.	No action
Redeploy	The VM is scheduled to move to another node. The VM's ephemeral disks are lost.	Cordon and drain
Preempt	The spot VM is being deleted. The VM's ephemeral disks are lost.	Cordon and drain
Terminate	The VM is scheduled to be deleted.	Cordon and drain
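The table can also be expressed as a small lookup, which is handy when scripting around these events. The function name autodrain_action is hypothetical; the mapping itself is copied from the table above.

```shell
# Map a scheduled event type to the action AKS node autodrain takes,
# following the table above. Unknown event names are reported on stderr.
autodrain_action() {
  case "$1" in
    Freeze|Reboot)              echo "No action" ;;
    Redeploy|Preempt|Terminate) echo "Cordon and drain" ;;
    *) echo "Unknown event: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   autodrain_action Preempt
```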

Limitations

In many cases, AKS can determine that a node is unhealthy and attempt to repair the issue, but there are cases where AKS either can't repair the issue or can't detect that there is one. For example, AKS can't detect issues if a node's status is not being reported because of an error in the network configuration, or if the node failed to register as a healthy node in the first place.

Thanks for reading my article to the end. I hope you learned something useful today. If you enjoyed this article, please share it with your friends, and if you have suggestions or thoughts to share with me, please write them in the comment box.

Original Link: https://dev.to/makendrang/how-automatic-repair-works-in-aks-1fi1
