-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛: elbv2: wait for LB active state instead of resolving DNS name #5093
Conversation
Hi @r4f4. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
In my local tests I see:
I'm going to run Openshift e2e CI tests on this change to check for regressions. /cc @patrickdillon |
@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
* Fix for DNS name resolution too early: kubernetes-sigs/cluster-api-provider-aws#5093
I don't see any install issues in the e2e tests of openshift/installer#8825 |
@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh. |
/ok-to-test |
/test ? |
@r4f4: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance |
Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases. |
I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back. |
For what's worth, when we were using terraform for provisioning there was no such LB DNS name check. But the tf aws provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408 |
No install failures in 11 CI jobs: openshift/installer#8927. |
/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers for approval |
Sorry, posted that PR assignment to the wrong PR 🙄 |
665e6e5
to
a880668
Compare
Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests. |
@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along? |
If it fails again we can see what AWS account is using and then investigate. |
I suspect the timeout failures in the e2e are due to #5131 |
Just for more context, I'm trying to deploy a AWS Cluster on Trying an online Having another method to check the LB status would be much appreciated. |
/test pull-cluster-api-provider-aws-e2e |
Kicked the tests again now that CI seems to be working again. |
Using DNS name resolution as a way to check the load balancer is working can cause problems that are dependent on the host running CAPA. In some systems, the DNS resolution can fail with very large TTLs cached DNS responses, causing very long provisioning times. Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
a880668
to
32ebcce
Compare
Update: rebased on top of current main. /test pull-cluster-api-provider-aws-e2e |
/test pull-cluster-api-provider-aws-e2e-eks |
The EKS e2e failures are not related to this PR, as I can see them In a no-op PR as well here. |
/test pull-cluster-api-provider-aws-e2e-eks |
Both sets of e2e are passing and this only applies to ELBv2, so i think /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: richardcase The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
For lgtm: /assign damdo |
Considering the extensive testing conducted by @r4f4 and team, our CI being happy with this and in general the change introducing more robust and reliable checking than a DNS based one /lgtm |
@AndiDog @richardcase @damdo Any chance we could backport this to a version that doesn't include the bump to CAPI 1.8.4? We need to incorporate this fix in OCP 4.16 and we have limitations on the Golang version we can build that with. |
@r4f4 Have you tested whether it will cherry-pick cleanly onto a previous 2.x? I personally don't have any issues with backporting this and it doesn't have any obvious dependencies on CAPI 1.8 that I can see |
@nrb Yes, clean apply on top of
|
/cherry-pick release-2.6 |
@nrb: new pull request created: #5226 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Using DNS name resolution as a way to check the load balancer is working can cause problems that are dependent on the host running CAPA. In some systems, the DNS resolution can fail with very large TTLs cached DNS responses, causing very long provisioning times.
Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left for the clients to do.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #5032
Special notes for your reviewer:
This is an alternative approach to #4976 and #5033.
Checklist:
Release note: