What steps did you take and what happened:
When a VMSS VM has a provisioning state of `Failed`, the AzureMachinePoolMachine controller marks the AMPM as failed (`exp/controllers/azuremachinepoolmachine_controller.go`, lines 298 to 299 in aad2249) and will not continue to reconcile the AMPM further (lines 265 to 268):

```go
log.Info("Error state detected, skipping reconciliation")
return reconcile.Result{}, nil
```
According to the Cluster API provider contract, CAPI will also stop reconciling this Machine (once the failure has bubbled up to the Machine):

> Note: once any of failureReason or failureMessage surface on the machine who is referencing the infrastructureMachine object, they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine). Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.

(emphasis mine)
However, VMSS VMs may sometimes go into ProvisioningState Failed and recover themselves again. We just had several of those. In the VMSS Activity Log they were marked as Health issues (PlatformInitiated Downtime).
What did you expect to happen:
I'd expect an AMPM to go into the terminal failed state only when it really cannot recover.
Anything else you would like to add:
I'm happy to provide a PR for this, but I'd need some information first, for example whether there is a way to reliably determine that a VMSS VM is Failed and cannot be recovered.
As a quick fix (and to test this), I'll probably resort to not setting FailureReason/Message in this case.
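A minimal sketch of that quick fix, assuming the controller currently sets `FailureReason`/`FailureMessage` when it sees provisioning state `Failed`: instead of populating those terminal fields, the failure is reported through a (hypothetical) condition, so the AMPM stays reconcilable and can recover when Azure brings the VM back. The types here are simplified stand-ins for the real CAPZ objects:

```go
package main

import "fmt"

// ampmStatus is a simplified stand-in for AzureMachinePoolMachineStatus.
type ampmStatus struct {
	FailureReason  string // terminal per the CAPI contract; deliberately left empty
	FailureMessage string
	Conditions     []string
}

// observeProvisioningState applies the quick-fix behavior: a Failed state
// is recorded as a non-terminal condition rather than via the terminal
// FailureReason/FailureMessage fields.
func observeProvisioningState(status *ampmStatus, state string) {
	if state == "Failed" {
		status.Conditions = append(status.Conditions, "VMFailed")
		return
	}
	// On recovery the condition is simply cleared on the next reconcile.
	status.Conditions = nil
}

func main() {
	var s ampmStatus
	observeProvisioningState(&s, "Failed")
	fmt.Println(s.FailureReason == "", s.Conditions)
	observeProvisioningState(&s, "Succeeded")
	fmt.Println(s.Conditions)
}
```

Since the terminal fields are never set, nothing surfaces on the owning Machine and CAPI keeps reconciling it.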
Environment:

- cluster-api-provider-azure version: 1.17.2
- Kubernetes version (use `kubectl version`): 1.31.1
- OS (e.g. from `/etc/os-release`): linux/windows
/kind bug