How automatic repair works in AKS
AKS continuously monitors the health state of worker nodes and performs automatic node repair
if they become unhealthy. The Azure virtual machine (VM) platform performs maintenance on VMs experiencing issues.
AKS and Azure VMs work together to minimize service disruptions for clusters.
In this article, we'll learn how automatic node repair functionality
behaves for both Windows and Linux nodes.
How AKS checks for unhealthy nodes
AKS uses the following rules to determine if a node is unhealthy and needs repair:
- The node reports NotReady status on consecutive checks within a 10-minute timeframe.
- The node doesn't report any status within 10 minutes.
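The two rules above can be sketched as a small predicate. This is only an illustration, not AKS's actual implementation: the function name `is_unhealthy`, the `(timestamp, status)` report format, and the reading of "consecutive" as "at least two back-to-back reports" are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# The 10-minute window comes from the rules above.
WINDOW = timedelta(minutes=10)


def is_unhealthy(reports, now):
    """Apply the two health rules to a node's status reports.

    `reports` is a hypothetical list of (timestamp, status) tuples,
    sorted by time, where status is "Ready" or "NotReady".
    """
    # Keep only the reports that fall inside the 10-minute window.
    recent = [(t, s) for t, s in reports if now - t <= WINDOW]

    # Rule 2: the node did not report any status within 10 minutes.
    if not recent:
        return True

    # Rule 1: NotReady on consecutive checks within the window
    # (assumed here to mean two back-to-back NotReady reports).
    for (_, first), (_, second) in zip(recent, recent[1:]):
        if first == "NotReady" and second == "NotReady":
            return True

    return False


now = datetime(2024, 4, 25, 12, 0)
silent_node = [(now - timedelta(minutes=15), "Ready")]
print(is_unhealthy(silent_node, now))
```

The silent node in the usage line is flagged because its last report is older than the window, which is the second rule.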
We can manually check the health state of our nodes with kubectl:
kubectl get nodes
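If we want to check node health from a script rather than by eye, one option is to parse the JSON form of the same command (`kubectl get nodes -o json`) and look at each node's `Ready` condition. The sketch below works on a hand-written sample instead of calling kubectl, so it runs anywhere. The helper name `not_ready_nodes` and the node names in the sample are hypothetical.

```python
import json


def not_ready_nodes(nodes_json):
    """Return names of nodes whose Ready condition is not "True".

    `nodes_json` is expected to look like the output of
    `kubectl get nodes -o json` (a list of nodes under "items").
    """
    data = json.loads(nodes_json)
    result = []
    for node in data.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # Find the condition of type "Ready", if the node reports one.
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing condition, "False", or "Unknown" all count as not ready.
        if ready is None or ready.get("status") != "True":
            result.append(node["metadata"]["name"])
    return result


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
SAMPLE_OUTPUT = json.dumps({
    "items": [
        {"metadata": {"name": "aks-nodepool1-example-0"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "aks-nodepool1-example-1"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "aks-nodepool1-example-2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
})

print(not_ready_nodes(SAMPLE_OUTPUT))
```

In a real script, the sample string would be replaced by the captured output of the kubectl command.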
How automatic repair works
If AKS identifies an unhealthy node that remains unhealthy for 10 minutes, AKS takes the following actions:
- Reboot the node.
- If the reboot is unsuccessful, reimage the node.
- If the reimage is unsuccessful, redeploy the node.
If auto-repair is unsuccessful, AKS engineers investigate alternative remediations.
If AKS finds multiple unhealthy nodes during a health check, each node is repaired individually before another repair begins.
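The escalation order and the one-node-at-a-time behaviour can be pictured with a short sketch. This is a model of the description above, not AKS code: `repair_node`, `repair_all`, and the `try_step` callback are hypothetical names, and the callback stands in for whatever actually performs a reboot, reimage, or redeploy.

```python
# Repair actions in the order described above.
REPAIR_STEPS = ("reboot", "reimage", "redeploy")


def repair_node(node, try_step):
    """Try each repair step in order and stop at the first success.

    `try_step(node, step)` is a hypothetical callback returning True
    when the step fixed the node. Returns the successful step, or
    None if every step failed.
    """
    for step in REPAIR_STEPS:
        if try_step(node, step):
            return step
    return None


def repair_all(nodes, try_step):
    """Repair nodes one after another, never in parallel."""
    outcome = {}
    for node in nodes:
        # The next node is only touched once this one has finished.
        outcome[node] = repair_node(node, try_step)
    return outcome


# Fake callback for illustration: which step fixes which node.
fixes = {"node-a": "reboot", "node-b": "redeploy"}
result = repair_all(
    ["node-a", "node-b", "node-c"],
    lambda node, step: fixes.get(node) == step,
)
print(result)
```

A `None` outcome in this model corresponds to the case where auto-repair is unsuccessful and the issue needs further investigation.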
Node Autodrain
Scheduled Events can occur on the underlying virtual machines (VMs) in any of our node pools. For spot node pools, scheduled events may cause a preempt node event for the node.
Certain node events, such as preempt, cause AKS node autodrain to attempt a cordon and drain of the affected node, which allows for a graceful reschedule of any affected workloads on that node.
When this happens, we might notice that the node receives a taint with "remediator.aks.microsoft.com/unschedulable" because of "kubernetes.azure.com/scalesetpriority: spot".
The following table shows the node events, and the actions they cause for AKS node autodrain.
Event | Description | Action |
---|---|---|
Freeze | The VM is scheduled to pause for a few seconds. CPU and network connectivity may be suspended, but there is no impact on memory or open files. | No action |
Reboot | The VM is scheduled for reboot. The VM's non-persistent memory is lost. | No action |
Redeploy | The VM is scheduled to move to another node. The VM's ephemeral disks are lost. | Cordon and drain |
Preempt | The spot VM is being deleted. The VM's ephemeral disks are lost. | Cordon and drain |
Terminate | The VM is scheduled to be deleted. | Cordon and drain |
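For quick reference in our own tooling, the table can be expressed as a simple lookup. This only restates the table above. The names `AUTODRAIN_ACTIONS` and `should_cordon_and_drain` are made up for the example and are not part of any AKS API.

```python
# Event-to-action mapping copied from the table above.
AUTODRAIN_ACTIONS = {
    "Freeze": "No action",
    "Reboot": "No action",
    "Redeploy": "Cordon and drain",
    "Preempt": "Cordon and drain",
    "Terminate": "Cordon and drain",
}


def should_cordon_and_drain(event_type):
    """Return True if the table lists cordon and drain for this event."""
    # Unknown event types are treated as "no action" in this sketch.
    return AUTODRAIN_ACTIONS.get(event_type) == "Cordon and drain"


for event in AUTODRAIN_ACTIONS:
    print(event, should_cordon_and_drain(event))
```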
Limitations
In many cases, AKS can determine if a node is unhealthy and attempt to repair the issue, but there are cases where AKS either can't repair the issue or can't detect that there is an issue. For example, AKS can't detect issues if a node's status isn't being reported because of an error in the network configuration, or if the node failed to initially register as a healthy node.
Thanks for reading my article to the end. I hope you learned something special today. If you enjoyed this article, please share it with your friends, and if you have suggestions or thoughts to share with me, please write them in the comment box.
Original Link: https://dev.to/makendrang/how-automatic-repair-works-in-aks-1fi1