Add a Dagster RetryPolicy to PUDL's ETL #3913
Labels
dagster
Issues related to our use of the Dagster orchestrator
internal-onboarding
Good first issues, for folks who have access to all of our systems.
nightly-builds
Anything having to do with nightly builds or continuous deployment.
Is your feature request related to a problem? Please describe.
Right now, when an asset crashes because it runs out of memory, it kills the entire build. Using Dagster's run-level or asset-level retry policies, we could hopefully avoid some of these build failures.
Describe the solution you'd like
When an asset crashes due to memory failures, we retry it rather than borking the entire build. If an asset fails due to a transient error (e.g., the Bad ZipFile error), we also retry it. That way, the nightly build fails only when assets or validations genuinely fail, not when transient problems occur.
We could do this either at the asset or the run level. Each has different advantages: we could target the assets we expect to have problems (e.g., raw_phmsagas and high-memory outputs), or we could apply a blanket policy to the whole run.
To implement retries, we'll need to migrate to a log storage system that supports them, i.e. Postgres. We want to do this anyway. See #3868. It's also possible we'll need to configure a dagster.yaml, in which case we'll run into the questions raised in #3752 and possibly move from using dagster dev instead of dagster webserver in our Makefile.
Describe alternatives you've considered
Asset vs run-level retries:
If we can get them to work, asset-level retries might be preferable to target problem high-memory assets. However: