Bug summary
Making new issue from #10632 at @cicdw's request :)
We run two replica workers processing jobs from a queue, and our production environment workers crashed two hours ago. We got them back online by finding the late flow run that was causing the crash and deleting it.
What we see is:
Worker starts fine
Worker tries to submit a flow run (or cancel it, the logs don't make this clear to me)... but cannot find the deployment
Rather than moving on to submit the next flow run, the worker seems to shut down when it cannot find the deployment, reporting this capacity issue:
After some digging, it turns out that the flow run which did not have a deployment was probably a pre-scheduled flow run, and when the deployment was renamed, this scheduled flow run's deployment reference was not updated (possible bug on Prefect side?).
Specifically, in our case we had two deployments of a general flow, one called BatterySpecs and a new one we were trialling as BatterySpecsV2. After validating that the BatterySpecsV2 deployment gave the correct results, we deleted the original deployment BatterySpecs and renamed BatterySpecsV2 -> BatterySpecs. However, this renaming seemed to not have propagated to the already scheduled flow runs from the original BatterySpecsV2 deployment.
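To illustrate the state we think we ended up in, here is a minimal sketch that flags scheduled flow runs whose deployment can no longer be resolved. It works on plain dictionaries and does not call the Prefect client; the field names (`id`, `deployment_id`, `state`), the sample IDs, and the helper `find_orphaned_runs` are all hypothetical stand-ins, not Prefect's actual schema or API.

```python
from typing import Dict, Iterable, List


def find_orphaned_runs(
    flow_runs: Iterable[Dict[str, str]],
    deployment_ids: Iterable[str],
) -> List[Dict[str, str]]:
    """Return scheduled runs whose deployment_id matches no known deployment.

    Hypothetical helper: the dictionary keys are assumptions for illustration.
    """
    known = set(deployment_ids)
    return [
        run
        for run in flow_runs
        if run["state"] == "SCHEDULED" and run["deployment_id"] not in known
    ]


# Made-up sample data: "dep-gone" stands for a deployment that no longer exists.
runs = [
    {"id": "run-1", "deployment_id": "dep-a", "state": "SCHEDULED"},
    {"id": "run-2", "deployment_id": "dep-gone", "state": "SCHEDULED"},
    {"id": "run-3", "deployment_id": "dep-gone", "state": "COMPLETED"},
]

orphaned = find_orphaned_runs(runs, ["dep-a"])
orphaned_ids = [run["id"] for run in orphaned]
```

In this sketch only the still-scheduled run pointing at the missing deployment is flagged; completed runs are ignored because the worker would never pick them up again.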
So when the worker tried to run the pre-scheduled flow run and fetch its deployment, the 404 seems to have brought down the worker instead of simply resulting in a crashed flow run (second Prefect issue?)
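The behaviour we expected can be sketched roughly as follows: a failed deployment lookup marks that one run as crashed and the loop carries on with the next run. This is not Prefect's worker code. `DeploymentNotFound`, `submit_runs`, and the `lookup_deployment` / `submit` callables are hypothetical names used only to show the per-run error isolation.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class DeploymentNotFound(Exception):
    """Hypothetical stand-in for a 404 returned by the deployment lookup."""


def submit_runs(
    flow_runs: Iterable[Dict[str, str]],
    lookup_deployment: Callable[[str], Dict[str, str]],
    submit: Callable[[Dict[str, str], Dict[str, str]], None],
) -> Tuple[List[str], List[str]]:
    """Submit each run; a missing deployment crashes that run, not the worker."""
    submitted: List[str] = []
    crashed: List[str] = []
    for run in flow_runs:
        try:
            deployment = lookup_deployment(run["deployment_id"])
        except DeploymentNotFound:
            # Isolate the failure to this run and continue with the next one.
            crashed.append(run["id"])
            continue
        submit(run, deployment)
        submitted.append(run["id"])
    return submitted, crashed


# Made-up in-memory "API" for the demo.
deployments = {"dep-a": {"name": "BatterySpecs"}}


def lookup(deployment_id: str) -> Dict[str, str]:
    if deployment_id not in deployments:
        raise DeploymentNotFound(deployment_id)
    return deployments[deployment_id]


queue = [
    {"id": "run-1", "deployment_id": "dep-missing"},
    {"id": "run-2", "deployment_id": "dep-a"},
]

submitted, crashed = submit_runs(queue, lookup, lambda run, dep: None)
```

Here `run-1` ends up in `crashed` while `run-2` is still submitted, which is the outcome we would have expected instead of both workers going down.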
Version info
Version: 2.20.10
API version: 0.8.4
Python version: 3.11.8
Git commit: 4fb64ec3
Built: Wed, Oct 16, 2024 1:24 PM
OS/Arch: linux/x86_64
Additional context
No response