Renamed deployment causes worker crash and CapacityLimit token issues #16806

Samreay · 2025-01-22T00:57:53Z

Bug summary

Making new issue from #10632 at @cicdw's request :)

We run two replica workers that process jobs from a queue, and our production workers crashed two hours ago. We got them back online by finding the late flow run that was causing the crash and deleting it.

What we see is:

Worker starts fine

Worker tries to submit a flow run (or cancel it, the logs don't make this clear to me)... but cannot find the deployment

Rather than this triggering it to try and submit the next flow run, it seems that not being able to find the deployment triggers the worker shutdown, reporting this capacity issue:

After some digging, it turns out that the flow run without a deployment was probably a pre-scheduled flow run: when the deployment was renamed, this scheduled flow run did not pick up the renamed deployment (possible bug on the Prefect side?).

Specifically, in our case we had two deployments of a general flow, one called BatterySpecs and a new one we were trialling as BatterySpecsV2. After validating that the BatterySpecsV2 deployment gave the correct results, we deleted the original BatterySpecs deployment and renamed BatterySpecsV2 -> BatterySpecs. However, this rename does not seem to have propagated to the flow runs already scheduled from the original BatterySpecsV2 deployment.
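To make the suspected mismatch concrete, here is a minimal sketch of how one might spot scheduled runs that point at a deployment which no longer exists. It assumes each scheduled run records a deployment identifier at scheduling time, which is my guess and not something confirmed here. `ScheduledRun`, `find_orphaned_runs`, and the IDs are hypothetical illustration names, not Prefect APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledRun:
    # Hypothetical record: a scheduled run and the deployment ID it was created from.
    run_id: str
    deployment_id: str


def find_orphaned_runs(runs, live_deployment_ids):
    """Return the runs whose deployment_id is not among the live deployments."""
    live = set(live_deployment_ids)
    return [run for run in runs if run.deployment_id not in live]


# Made-up data: run-1 still references a deployment ID that is gone.
runs = [
    ScheduledRun("run-1", "dep-old"),
    ScheduledRun("run-2", "dep-new"),
]
orphans = find_orphaned_runs(runs, ["dep-new"])
```

In this sketch only `run-1` is reported, which corresponds to the late flow run we had to find and delete by hand.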

So when the worker tried to run the pre-scheduled flow run and fetch its deployment, the 404 seems to have brought down the worker instead of simply resulting in a crashed flow run (second Prefect issue?).
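The behaviour I would have expected can be sketched roughly as follows. This is not Prefect's worker code: `DeploymentNotFound`, `TokenLimiter`, `load_deployment`, and `submit_all` are hypothetical stand-ins, and releasing the token right after submission is a simplification. The idea is that a missing deployment marks only that flow run as crashed, the token is always released, and the loop moves on to the next run.

```python
class DeploymentNotFound(Exception):
    """Hypothetical stand-in for the 404 raised when a deployment is missing."""


class TokenLimiter:
    """Tiny stand-in for a capacity limiter that tracks tokens per borrower."""

    def __init__(self, total):
        self.total = total
        self.borrowers = set()

    def acquire(self, borrower):
        if borrower in self.borrowers:
            raise RuntimeError("borrower already holds a token")
        if len(self.borrowers) >= self.total:
            raise RuntimeError("no tokens available")
        self.borrowers.add(borrower)

    def release(self, borrower):
        self.borrowers.discard(borrower)


def load_deployment(run_id, deployment_for_run):
    # Simulate the API lookup: a missing entry plays the role of the 404.
    try:
        return deployment_for_run[run_id]
    except KeyError:
        raise DeploymentNotFound(run_id) from None


def submit_all(run_ids, deployment_for_run, limiter):
    """Try every run; a missing deployment fails that run only."""
    results = {}
    for run_id in run_ids:
        limiter.acquire(run_id)
        try:
            load_deployment(run_id, deployment_for_run)
            results[run_id] = "submitted"
        except DeploymentNotFound:
            results[run_id] = "crashed"
        finally:
            # Always hand the token back, so a retry never sees a stale holder.
            limiter.release(run_id)
    return results


limiter = TokenLimiter(total=2)
results = submit_all(["run-1", "run-2"], {"run-2": "BatterySpecs"}, limiter)
```

Here `run-1` ends up crashed, `run-2` is still submitted, and no token is left held afterwards.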

Version info

Version:             2.20.10
API version:         0.8.4
Python version:      3.11.8
Git commit:          4fb64ec3
Built:               Wed, Oct 16, 2024 1:24 PM
OS/Arch:             linux/x86_64

Additional context

No response


Samreay added the bug label Jan 22, 2025

Samreay mentioned this issue Jan 22, 2025

This borrower is already holding one of this CapacityLimiter's tokens #10632

Closed

