Dbt setup #4011

Draft: wants to merge 17 commits into base: main

Changes from 4 commits
237 changes: 132 additions & 105 deletions environments/conda-linux-64.lock.yml

Large diffs are not rendered by default.

4,318 changes: 2,796 additions & 1,522 deletions environments/conda-lock.yml

Large diffs are not rendered by default.

239 changes: 133 additions & 106 deletions environments/conda-osx-64.lock.yml

Large diffs are not rendered by default.

239 changes: 133 additions & 106 deletions environments/conda-osx-arm64.lock.yml

Large diffs are not rendered by default.

7 changes: 5 additions & 2 deletions pyproject.toml
@@ -7,7 +7,7 @@ name = "catalystcoop.pudl"
description = "An open data processing pipeline for US energy data"
readme = { file = "README.rst", content-type = "text/x-rst" }
authors = [{ name = "Catalyst Cooperative", email = "[email protected]" }]
-requires-python = ">=3.12,<3.13"
+requires-python = ">=3.10,<3.13"
dynamic = ["version"]
license = { file = "LICENSE.txt" }
dependencies = [
@@ -23,10 +23,12 @@ dependencies = [
"conda-lock>=2.5.7",
"coverage>=7.6",
"dagster>=1.9",
+"dagster-dbt>=0.25.6,<1",
"dagster-postgres>=0.24,<1", # Update when dagster-postgres graduates to 1.x
"dask>=2024",
"dask-expr", # Required for dask[dataframe]
"datasette>=0.65",
+"dbt-duckdb",
"doc8>=1.1",
"duckdb>=1.1.3",
"email-validator>=1.0.3", # pydantic[email]
@@ -83,6 +85,7 @@ dependencies = [
"sphinxcontrib_googleanalytics>=0.4",
"sqlalchemy>=2",
"sqlglot>=25",
+"s3fs>=2024",
"timezonefinder>=6.2",
"universal_pathlib>=0.2",
"urllib3>=1.26.18",
@@ -343,7 +346,7 @@ nodejs = ">=20"
pandoc = ">=2"
pip = ">=24"
prettier = ">=3.0"
-python = ">=3.12,<3.13"
+python = ">=3.10,<3.13"
sqlite = ">=3.47"
zip = ">=3.0"

4 changes: 4 additions & 0 deletions src/pudl/dbt/.gitignore
@@ -0,0 +1,4 @@

target/
dbt_packages/
logs/
1 change: 1 addition & 0 deletions src/pudl/dbt/.user.yml
@@ -0,0 +1 @@
id: 143b9efc-6985-409a-8029-865947b8f8f1
86 changes: 86 additions & 0 deletions src/pudl/dbt/README.md
@@ -0,0 +1,86 @@
### Overview
This directory contains an initial setup of a `dbt` project meant to write
[data tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The
project is set up with profiles that let you run tests against `nightly` builds,
`etl-full` outputs, or `etl-fast` outputs. The `nightly` profile operates directly
on parquet files in our S3 bucket, while the `etl-full` and `etl-fast` profiles
look for parquet files based on your `PUDL_OUTPUT` environment variable. See the
`Usage` section below for examples using these profiles.


### Development
To set up the `dbt` project, install the PUDL `conda` environment as usual, then
run the following command from this directory:

```
dbt deps
```

#### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). You can do this by editing
the file `src/pudl/dbt/models/schema.yml`. I've already added the table
`out_vcerare__hourly_available_capacity_factor`, which can be used as a reference.
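
For example, registering tables under the `pudl` source follows this shape (a
sketch; the real entries, including the `external_location` configuration, live
in the `schema.yml` included in this PR):

```
sources:
  - name: pudl
    tables:
      - name: out_eia923__boiler_fuel
      - name: out_vcerare__hourly_available_capacity_factor
```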

#### Adding tests
Once a table is included as a `source`, you can add tests for the table. You can
either add a generic test directly in `src/pudl/dbt/models/schema.yml`, or create
**Member:**
Is it required to have one monster `schema.yml` file that defines everything? Or are there common ways to break it down into more manageable thematic chunks?

**Member:**
Pretty sure you can just shove more YAMLs into /models and DBT will just pick them up... here's a guide for how to structure DBT projects.

a `sql` file in the directory `src/pudl/dbt/tests/`, which references the `source`.
When adding `sql` tests like this, construct a query that selects rows indicating
a failure; that is, if the query returns any rows, `dbt` will flag the test as a
failure.
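
As a sketch, a custom test file (hypothetical name
`tests/capacity_factor_solar_in_range.sql`) that flags out-of-range solar
capacity factors could look like:

```
-- Any row returned by this query is reported by dbt as a test failure.
select *
from {{ source('pudl', 'out_vcerare__hourly_available_capacity_factor') }}
where capacity_factor_solar_pv > 1.02
   or capacity_factor_solar_pv < 0.00
```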

The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages include useful tests out of the box that can be applied to any tables
in the project. There are several examples in `src/pudl/dbt/models/schema.yml` which
use `dbt-expectations`.
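
For instance, attaching a couple of these generic tests to a column in
`schema.yml` looks like this (mirroring the
`out_vcerare__hourly_available_capacity_factor` entries in this PR):

```
columns:
  - name: capacity_factor_solar_pv
    data_tests:
      - not_null
      - dbt_expectations.expect_column_max_to_be_between:
          max_value: 1.02
```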

#### Modifying a table before testing
In many cases we modify a table slightly before executing a test. There are a
couple of ways to accomplish this. First, when creating a `sql` test in
`src/pudl/dbt/tests/`, you can structure your query to modify the table/column
before selecting failure rows. The second method is to create a
[model](https://docs.getdbt.com/docs/build/models) in `src/pudl/dbt/models/validation`.
Any model created here becomes a view in the `duckdb` database used by `dbt`. You
can then reference the model in `src/pudl/dbt/models/schema.yml` and apply tests
to it just as you would with `sources`. There's an example of this pattern which
takes the table `out_ferc1__yearly_steam_plants_fuel_by_plant_sched402`, computes
fuel cost per mmbtu in a `sql` model, then applies `dbt_expectations` tests to the
result.

#### Usage
There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:

```
dbt run
```

This will run all models, preparing any `sql` views that will be referenced in
tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table:

```
dbt test --select source:pudl.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```

##### Selecting target profile
To select between `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands.
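
For example, to run the tests for one source table against a local fast ETL
output (a sketch combining the selectors above):

```
dbt test --select source:pudl.out_eia923__boiler_fuel --target etl-fast
```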
**Member:**
Right now only the nightly profile exists, right? And it reads data from the parquet files in S3?

How will the etl-full and etl-fast profiles work? I can imagine just pointing DuckDB at the $PUDL_OUTPUT directory, but there's no way to know whether it is up to date, or whether it contains fast or full outputs, without interacting with Dagster.

14 changes: 14 additions & 0 deletions src/pudl/dbt/dbt_project.yml
@@ -0,0 +1,14 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
test-paths: ["tests"]
90 changes: 90 additions & 0 deletions src/pudl/dbt/models/schema.yml
@@ -0,0 +1,90 @@
version: 2

sources:
- name: pudl
meta:
external_location: |
{%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
**Member:**
Would it also work to point this directly at S3 rather than going through the HTTPS interface?

**Member:**
I changed this to s3://pudl.catalyst.coop/nightly/{name}.parquet and it seems to work. I think going through s3:// directly will probably be more performant, won't it? E.g. in the case where there are efficiencies to be had in querying only small portions of the larger Parquet files.

**Member:**
Interestingly, with the s3:// URL it didn't give me any error, but it also didn't seem to be making much progress. There was just a ton of data being downloaded. Not sure why.

{%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
{%- endif -%}
tables:
- name: out_eia923__boiler_fuel
- name: out_eia923__monthly_boiler_fuel
- name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
- name: out_vcerare__hourly_available_capacity_factor
**Member:**
In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null tests here are a second place where that restriction is being specified -- it's already present in our table metadata, which seems like a recipe for them getting out of sync.

Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.

**Member:**
Looks like the "X column is not null" checks are currently defined in fields.py under the field constraints, is that what you're thinking about?

I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.

That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.

**Member:**
In addition to the field specific constraints I think we automatically add NOT NULL check constraints to the PK fields when we construct the SQLite database -- but more generally I'm just saying that we need to get all of these generated tests integrated non-duplicatively into the dbt tests somehow.

**Member (author):**
It seems totally possible to auto-generate tests, but there are probably many ways to accomplish this, so we should figure out what we want from it. For example, when we talk about auto-generating row count/not null tests, will these be generated once and committed into the repo, or will some/all of them be dynamically generated at runtime?

It definitely seems tricky to minimize duplication between dbt/our existing python schema info. I also wonder how this plays into any refactoring of our metadata system?

**Member:**
It feels like we may need to clearly define the data tests that are ready to be migrated in a straightforward way, and the things that still need design work, so we can point Margay folks at the stuff that's ready to go and keep thinking about the things that still need some scaffolding?

data_tests:
- dbt_expectations.expect_table_row_count_to_equal:
value: |
{%- if target.name == "etl-fast" -%} 27287400
{%- else -%} 136437000
{%- endif -%}
**Member:**
Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?

**Member (author):**
We'd probably need to create a custom macro for this, but that seems totally doable. Big question is how we want to generate/store all of those tests.

**Member:**
The row count tests have functionally become regression tests -- we want to know when they change, and verify that the magnitude and nature of the change is expected based on the code or data that we've changed. Given that there are hundreds of tables (and thousands of table-years) it doesn't seem practical to hand-code all of the expected row counts.

It would be nice to have the per table-year row counts stored in (say) YAML somewhere, and be able to generate a new version of that file based on current ETL outputs. Then we could look at the diffs between the old and the new versions of the file when trying to assess changes in the lengths of the outputs.

- dbt_expectations.expect_compound_columns_to_be_unique:
column_list: ["county_id_fips", "datetime_utc"]
**Member:**
Could be generated based on the PK that's defined for every table?

**Member (author):**
Should be possible. We can also probably come up with a way to generate foreign key checks, so we can actually verify foreign keys for tables that only exist in parquet.

row_condition: "county_id_fips is not null"
columns:
- name: capacity_factor_solar_pv
data_tests:
- not_null
- dbt_expectations.expect_column_max_to_be_between:
max_value: 1.02
- dbt_expectations.expect_column_min_to_be_between:
min_value: 0.00
- name: capacity_factor_offshore_wind
data_tests:
- not_null
- dbt_expectations.expect_column_max_to_be_between:
max_value: 1.00
- dbt_expectations.expect_column_min_to_be_between:
min_value: 0.00
- name: hour_of_year
data_tests:
- not_null
- dbt_expectations.expect_column_max_to_be_between:
min_value: 8759
max_value: 8761
- name: datetime_utc
data_tests:
- not_null
- dbt_expectations.expect_column_values_to_not_be_in_set:
value_set: ["{{ dbt_date.date(2020, 12, 31) }}"]
- name: county_or_lake_name
data_tests:
- not_null
- dbt_expectations.expect_column_values_to_not_be_in_set:
value_set: ["bedford_city", "clifton_forge_city"]
models:
- name: ferc1_fbp_cost_per_mmbtu
columns:
- name: gas_cost_per_mmbtu
data_tests:
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.05
min_value: 1.5
**Member:**
I'm guessing these are not using the weighted quantiles?

**Member (author):**
Yeah, these are just basic quantiles. It's not too hard to write a SQL query that does a version of weighted quantiles, but the existing vs_historical tests are hard because they compute a bunch of quantiles, then compare them all.

- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.90
max_value: 15.0
- dbt_expectations.expect_column_median_to_be_between:
min_value: 2.0
max_value: 10.0
- name: oil_cost_per_mmbtu
data_tests:
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.10
min_value: 3.5
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.90
max_value: 25.0
- dbt_expectations.expect_column_median_to_be_between:
min_value: 6.5
max_value: 17.0
- name: coal_cost_per_mmbtu
data_tests:
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.10
min_value: 0.75
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.90
max_value: 4.0
- dbt_expectations.expect_column_median_to_be_between:
min_value: 1.0
max_value: 2.5
6 changes: 6 additions & 0 deletions src/pudl/dbt/models/validation/ferc1_fbp_cost_per_mmbtu.sql
**Member:**
If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?

**Member (author):**
Yeah I think that seems like a good idea. Maybe just use a validation_ prefix and otherwise follow existing naming conventions?

@@ -0,0 +1,6 @@

select
{% for fuel_type in ["gas", "oil", "coal"] %}
{{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu,
{% endfor %}
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
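
For reference, once `dbt` renders the Jinja loop above, the compiled query looks
roughly like the sketch below. The `source()` call resolves to the parquet
location configured via `external_location` in `schema.yml`, so the exact path
prefix depends on the active target, and DuckDB accepts the trailing comma left
by the loop:

```
select
    gas_fraction_cost * fuel_cost / (gas_fraction_mmbtu * fuel_mmbtu) as gas_cost_per_mmbtu,
    oil_fraction_cost * fuel_cost / (oil_fraction_mmbtu * fuel_mmbtu) as oil_cost_per_mmbtu,
    coal_fraction_cost * fuel_cost / (coal_fraction_mmbtu * fuel_mmbtu) as coal_cost_per_mmbtu,
-- path prefix depends on the target: the S3 bucket for nightly, $PUDL_OUTPUT/parquet otherwise
from '.../out_ferc1__yearly_steam_plants_fuel_by_plant_sched402.parquet'
```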
8 changes: 8 additions & 0 deletions src/pudl/dbt/package-lock.yml
@@ -0,0 +1,8 @@
packages:
- package: calogica/dbt_expectations
version: 0.10.4
- package: dbt-labs/dbt_utils
version: 1.3.0
- package: calogica/dbt_date
version: 0.10.1
sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
5 changes: 5 additions & 0 deletions src/pudl/dbt/packages.yml
@@ -0,0 +1,5 @@
packages:
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
- package: dbt-labs/dbt_utils
version: [">=1.3.0", "<1.4.0"]
**Member:**
I see neither of these are available in conda-forge. It all works fine now with the dbt CLI, but will we need to get them packaged with conda to make dbt work from within an @asset_check as part of our Dagster pipeline?

17 changes: 17 additions & 0 deletions src/pudl/dbt/profiles.yml
@@ -0,0 +1,17 @@
pudl_dbt:
outputs:
# Define targets for nightly builds, and local ETL full/fast
# See models/schema.yml for further configuration
nightly:
type: duckdb
path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
filesystems:
- fs: s3
etl-full:
type: duckdb
path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
etl-fast:
type: duckdb
path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"

target: nightly
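
As a quick usage sketch tying these targets together (assuming the PUDL conda
environment is active, `PUDL_OUTPUT` is set, and, for the local targets, the
parquet outputs already exist under `$PUDL_OUTPUT/parquet/`):

```
cd src/pudl/dbt
dbt deps                     # install dbt-expectations, dbt-utils, and their dependencies
dbt build --target etl-full  # run models, then tests, against local parquet outputs
```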
Empty file added src/pudl/dbt/tests/.gitkeep
Empty file.
2 changes: 1 addition & 1 deletion src/pudl/transform/ferc714.py
@@ -362,7 +362,7 @@ def _filter_for_freshest_data_xbrl(
into the raw instant or duration XBRL table name.
"""
table_name_raw_xbrl = (
-f"{TABLE_NAME_MAP_FERC714[table_name]["xbrl"]}_{instant_or_duration}"
+f"{TABLE_NAME_MAP_FERC714[table_name]['xbrl']}_{instant_or_duration}"
)
xbrl = filter_for_freshest_data_xbrl(
raw_xbrl,