
AzureMachinePoolMachine marked as terminally failed when ProvisioningState is set to Failed #5303

mweibel opened this issue Nov 22, 2024 · 0 comments
mweibel commented Nov 22, 2024

/kind bug

What steps did you take and what happened:
When a VMSS VM has a ProvisioningState of Failed, the AzureMachinePoolMachine controller marks the AMPM as failed:

machineScope.SetFailureReason(capierrors.UpdateMachineError)
machineScope.SetFailureMessage(errors.Errorf("Azure VM state is %s", state))

and skips any further reconciliation of the AMPM:

if machineScope.AzureMachinePool.Status.FailureReason != nil || machineScope.AzureMachinePool.Status.FailureMessage != nil {
log.Info("Error state detected, skipping reconciliation")
return reconcile.Result{}, nil
}

According to the Cluster API provider contract, CAPI will also stop reconciling the owning Machine once the failure bubbles up to it:

Note: once any of failureReason or failureMessage surface on the machine who is referencing the infrastructureMachine object, they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine). Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.

(emphasis mine)

However, VMSS VMs may sometimes go into ProvisioningState Failed and then recover on their own. We just had several of those; in the VMSS Activity Log they were marked as Health issues (PlatformInitiated Downtime).

What did you expect to happen:
I'd expect an AMPM to go into a terminal failed state only when it really can't recover.

Anything else you would like to add:
I'm happy to provide a PR for this, but I'd need some information first: for example, whether there is a way to reliably determine that a VMSS VM in the Failed state can't recover.
As a quick fix (and to test this), I'll probably resort to not setting FailureReason/FailureMessage in this case.

Environment:

  • cluster-api-provider-azure version: 1.17.2
  • Kubernetes version: (use kubectl version): 1.31.1
  • OS (e.g. from /etc/os-release): linux/windows
@k8s-ci-robot added the kind/bug label on Nov 22, 2024