Add a Dagster RetryPolicy to PUDL's ETL #3913

Open
Tracked by #3956
e-belfer opened this issue Oct 18, 2024 · 0 comments
Labels
dagster Issues related to our use of the Dagster orchestrator internal-onboarding Good first issues, for folks who have access to all of our systems. nightly-builds Anything having to do with nightly builds or continuous deployment.

Comments

@e-belfer
Member

Is your feature request related to a problem? Please describe.
Right now, when an asset crashes due to a memory failure, it kills the entire build. Using Dagster's run-level or asset-level retry policies, we could hopefully avoid some of these build failures.

Describe the solution you'd like

When an asset crashes due to memory failures, we retry it rather than borking the entire build. If an asset fails due to a transient error (e.g., the Bad ZipFile), we also retry it. That way, the nightly build fails only when assets or validations genuinely fail, not when transient problems occur.

We could do this either at the asset or the run level. Each has different advantages - we could target which assets we expect to have problems (e.g., raw_phmsagas and high-memory outputs), or we could apply a blanket policy to the whole run.

To implement retries, we'll need to migrate to a log storage system that supports them, i.e. Postgres. We want to do this anyway. See #3868. It's also possible we'll need to configure a dagster.yaml, in which case we'll run into the questions raised in #3752 and possibly move from using dagster dev instead of dagster webserver in our Makefile.

Describe alternatives you've considered
Asset vs run-level retries:

If we can get them to work, asset-level retries might be preferable to target problem high-memory assets. However:

Additional context
Add any other context or screenshots about the feature request here.

@e-belfer e-belfer added dagster Issues related to our use of the Dagster orchestrator nightly-builds Anything having to do with nightly builds or continuous deployment. labels Oct 18, 2024
@e-belfer e-belfer changed the title Add retries to our dagster jobs Add a Dagster RetryPolicy to PUDL's ETL Oct 18, 2024
@jdangerx jdangerx moved this from New to Backlog in Catalyst Megaproject Oct 23, 2024
@bendnorman bendnorman added the good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. label Nov 18, 2024
@bendnorman bendnorman removed the good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. label Jan 9, 2025
@jdangerx jdangerx added good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. internal internal-onboarding Good first issues, for folks who have access to all of our systems. and removed internal good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. labels Jan 9, 2025
Projects
Status: Backlog
Development

No branches or pull requests

3 participants