
AzureMachinePoolMachine marked as terminally failed when ProvisioningState is set to Failed #5303

mweibel opened this issue Nov 22, 2024 · 0 comments
mweibel commented Nov 22, 2024

/kind bug

What steps did you take and what happened:
When a VMSS VM has a ProvisioningState of Failed, the AzureMachinePoolMachine controller marks the AMPM as failed:

machineScope.SetFailureReason(capierrors.UpdateMachineError)
machineScope.SetFailureMessage(errors.Errorf("Azure VM state is %s", state))

and skips any further reconciliation of the AMPM:

if machineScope.AzureMachinePool.Status.FailureReason != nil || machineScope.AzureMachinePool.Status.FailureMessage != nil {
log.Info("Error state detected, skipping reconciliation")
return reconcile.Result{}, nil
}

According to the Cluster API provider contract, CAPI will also stop reconciling the owning Machine once the failure bubbles up to it:

Note: once any of failureReason or failureMessage surface on the machine who is referencing the infrastructureMachine object, they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine). Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.

(emphasis mine)

However, VMSS VMs may sometimes go into ProvisioningState Failed and then recover on their own. We just had several of those; in the VMSS Activity Log they were marked as Health issues (PlatformInitiated Downtime).

What did you expect to happen:
I'd expect an AMPM to go into a terminal failed state only when it really can't recover.

Anything else you would like to add:
I'm happy to provide a PR for this, but I'd need some information first: for example, whether there is a way to reliably determine that a VMSS VM in the Failed state can't recover.
As a quick fix (and to test this), I'll probably resort to not setting FailureReason/FailureMessage in this case.

Environment:

  • cluster-api-provider-azure version: 1.17.2
  • Kubernetes version: (use kubectl version): 1.31.1
  • OS (e.g. from /etc/os-release): linux/windows
@k8s-ci-robot added the kind/bug label on Nov 22, 2024