Add a Dagster RetryPolicy to PUDL's ETL #3913

e-belfer · 2024-10-18T20:29:31Z

Is your feature request related to a problem? Please describe.
Right now, when an asset crashes due to a memory failure, it kills the entire build. Using Dagster's run-level or asset-level retry policy, we could hopefully avoid some of these build failures.

Describe the solution you'd like

When an asset crashes due to a memory failure, we retry it rather than killing the entire build. If an asset fails due to a transient error (e.g., a BadZipFile error), we also retry it. That way, the nightly build fails only when assets or validations genuinely fail, not when transient problems occur.

We could do this at either the asset level or the run level. Each has different advantages: we could target the assets we expect to have problems (e.g., raw_phmsagas and high-memory outputs), or we could apply a blanket policy to the whole run.

To implement retries, we'll need to migrate to a log storage system that supports them, i.e. Postgres. We want to do this anyway; see #3868. It's also possible we'll need to configure a dagster.yaml, in which case we'll run into the questions raised in #3752 and may need to change whether our Makefile uses dagster dev or dagster webserver.

Describe alternatives you've considered
Asset vs run-level retries:

If we can get them to work, asset-level retries might be preferable, since they let us target the problematic high-memory assets. However:

Asset-level retries didn't cover instances where the asset crashes: How can i ensure that my assets and ops retry if something goes wrong during materialization? dagster-io/dagster#16289
This should have been fixed in September, but I wasn't able to get it working locally: Respect op/asset retry policy when a step crashes or fails a health check dagster-io/dagster#24517
Which is unfortunate because run-level retries don't yet have exponential backoff options: Add delay and backoff to run-level retries dagster-io/dagster#13427
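To make concrete what exponential backoff would buy us, here is a small standard-library sketch of a backoff schedule. It is a generic illustration, not Dagster's exact formula; the function name, the 30-second base delay, and the doubling factor are all assumptions.

```python
def backoff_delays(base_delay: float, max_retries: int, factor: float = 2.0) -> list[float]:
    """Return the wait in seconds before each retry attempt.

    Generic exponential schedule: each wait is `factor` times the previous one.
    """
    return [base_delay * factor**attempt for attempt in range(max_retries)]


# Hypothetical settings: three retries starting from a 30 second wait.
print(backoff_delays(base_delay=30, max_retries=3))  # [30.0, 60.0, 120.0]
```

Without backoff, every retry would start after the same fixed wait, which gives a memory-starved machine less time to recover between attempts.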


e-belfer added the dagster and nightly-builds labels Oct 18, 2024

e-belfer added this to Catalyst Megaproject Oct 18, 2024

github-project-automation bot moved this to New in Catalyst Megaproject Oct 18, 2024

e-belfer changed the title ~~Add retries to our dagster jobs~~ Add a Dagster RetryPolicy to PUDL's ETL Oct 18, 2024

jdangerx moved this from New to Backlog in Catalyst Megaproject Oct 23, 2024

zaneselvans mentioned this issue Nov 11, 2024

Dagster Housekeeping #3956

Open

bendnorman added the good-first-issue label Nov 18, 2024

bendnorman removed the good-first-issue label Jan 9, 2025
