Renamed deployment causes worker crash and CapacityLimit token issues #16806

Open
Samreay opened this issue Jan 22, 2025 · 0 comments
Labels
bug Something isn't working


Bug summary

Making new issue from #10632 at @cicdw's request :)

We have two replica workers processing jobs from a queue, and our production workers crashed two hours ago. We got them back online by finding the late flow run that was causing the crash and deleting it.

What we see is:

  1. Worker starts fine

Image

  2. Worker tries to submit a flow run (or cancel it, the logs don't make this clear to me)... but cannot find the deployment

Image

  3. Rather than moving on to submit the next flow run, the worker seems to shut down when it cannot find the deployment, reporting this capacity issue:

Image

Image

After some digging, it turns out that the flow run without a deployment was probably a pre-scheduled flow run: when the deployment was renamed, this scheduled flow run did not update its deployment reference (possible bug on the Prefect side?).

Specifically, in our case we had two deployments of a general flow, one called BatterySpecs and a new one we were trialling as BatterySpecsV2. After validating that the BatterySpecsV2 deployment gave the correct results, we deleted the original BatterySpecs deployment and renamed BatterySpecsV2 -> BatterySpecs. However, this rename does not seem to have propagated to the flow runs already scheduled from the original BatterySpecsV2 deployment.
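For anyone hitting the same thing, here is a rough sketch of the kind of check that would surface such runs: look for scheduled flow runs whose deployment ID no longer matches any existing deployment. This is plain Python and not Prefect's client API; `ScheduledRun` and `find_orphaned_runs` are hypothetical names, and the inputs would have to be fetched from the Prefect API separately.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional
from uuid import UUID


@dataclass(frozen=True)
class ScheduledRun:
    """Hypothetical stand-in for a scheduled flow run record."""

    id: UUID
    name: str
    deployment_id: Optional[UUID]  # None for runs not created by a deployment


def find_orphaned_runs(
    runs: Iterable[ScheduledRun],
    existing_deployment_ids: Iterable[UUID],
) -> List[ScheduledRun]:
    """Return runs that reference a deployment ID that no longer exists."""
    existing = set(existing_deployment_ids)
    return [
        run
        for run in runs
        # Runs without a deployment are not orphans, so skip them.
        if run.deployment_id is not None and run.deployment_id not in existing
    ]
```

Anything this returns is a candidate for manual cleanup (in our case, deleting the late flow run was what got the workers back online).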

So when the worker tried to run the pre-scheduled flow run and fetch its deployment, the 404 seems to have brought down the worker instead of simply resulting in a crashed flow run (second Prefect issue?).
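To make the expected behaviour concrete, here is a minimal sketch of what I would have expected: a missing deployment fails only the affected flow run, and the worker carries on with the next one. This is not Prefect's actual worker code; `DeploymentNotFound`, `submit_runs`, and the `submit` callable are all hypothetical names used for illustration.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("worker_sketch")


class DeploymentNotFound(Exception):
    """Hypothetical error standing in for a 404 on the deployment lookup."""


def submit_runs(run_ids: Iterable[str], submit: Callable[[str], None]) -> List[str]:
    """Submit each run in turn; return the IDs that failed on a missing deployment."""
    failed: List[str] = []
    for run_id in run_ids:
        try:
            submit(run_id)
        except DeploymentNotFound:
            # Isolate the failure to this run instead of letting it
            # propagate and take down the whole worker loop.
            logger.warning("Deployment for flow run %s not found; skipping", run_id)
            failed.append(run_id)
    return failed
```

The failed IDs could then be marked as crashed, while the remaining scheduled runs are still picked up.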

Version info

Version:             2.20.10
API version:         0.8.4
Python version:      3.11.8
Git commit:          4fb64ec3
Built:               Wed, Oct 16, 2024 1:24 PM
OS/Arch:             linux/x86_64

Additional context

No response
