Dbt setup #4011

Draft: wants to merge 17 commits into base: main
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -49,3 +49,9 @@ devtools/datasette/fly/Dockerfile
devtools/datasette/fly/inspect-data.json
devtools/datasette/fly/metadata.yml
devtools/datasette/fly/all_dbs.tar.zst

# dbt specific ignores
dbt/dbt_packages/
dbt/target/
dbt/logs/
dbt/.user.yml
86 changes: 86 additions & 0 deletions dbt/README.md
@@ -0,0 +1,86 @@
### Overview
This directory contains an initial setup of a `dbt` project for writing
[data tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The
project is set up with profiles that let you run tests against `nightly`
builds, `etl-full`, or `etl-fast` outputs. The `nightly` profile operates
directly on parquet files in our S3 bucket, while both the `etl-full` and `etl-fast`
profiles look for parquet files based on your `PUDL_OUTPUT` environment
variable. See the `Usage` section below for examples using these profiles.
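
As a rough illustration of how the profiles locate data, the following Python sketch mirrors the conditional used in this project's source definitions. The function name `resolve_external_location` and the example `PUDL_OUTPUT` value are hypothetical; `dbt` itself performs this resolution through Jinja and the `{name}` placeholder, not through this code.

```python
# Illustrative only: mirrors the Jinja conditional in the source definitions.
NIGHTLY_TEMPLATE = (
    "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet"
)
LOCAL_TEMPLATE = "{pudl_output}/parquet/{name}.parquet"


def resolve_external_location(target_name: str, table_name: str, pudl_output: str) -> str:
    """Return the parquet location a source table would be read from."""
    if target_name == "nightly":
        # Nightly builds are read from the S3 bucket.
        return NIGHTLY_TEMPLATE.format(name=table_name)
    # etl-full and etl-fast both read from the local PUDL_OUTPUT directory.
    return LOCAL_TEMPLATE.format(pudl_output=pudl_output, name=table_name)


# "/home/user/pudl-output" is a made-up PUDL_OUTPUT value.
nightly_path = resolve_external_location(
    "nightly", "out_eia923__boiler_fuel", "/home/user/pudl-output"
)
local_path = resolve_external_location(
    "etl-fast", "out_eia923__boiler_fuel", "/home/user/pudl-output"
)
print(nightly_path)
print(local_path)
```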


### Development
To set up the `dbt` project, install the PUDL `conda` environment as normal,
then run the following command from this directory:

```
dbt deps
```

#### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). You can do this by editing
the `schema.yml` file for the relevant data source under `models/`. I've already added
the table `out_vcerare__hourly_available_capacity_factor` in
`models/vcerare/schema.yml`, which can be used as a reference.

#### Adding tests
Once a table is included as a `source`, you can add tests for the table. You can
either add a generic test directly in the `schema.yml` file that defines the source,
or create a `sql` file in the `tests/` directory that references the `source`.
When adding `sql` tests like this, you should construct a query that `SELECT`s the rows
that indicate a failure. That is, if the query returns any rows, `dbt` will raise a
failure for that test.
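
The failure-row convention can be tried outside of `dbt`. The sketch below is a minimal illustration in Python using the standard-library `sqlite3` module rather than `duckdb`; the table name `capacity_factor` and its rows are made up for the example, and the bounds are borrowed from the `capacity_factor_solar_pv` tests in this project.

```python
import sqlite3

# Hypothetical miniature table; the values are invented for illustration.
rows = [
    ("01001", 0.35),
    ("01003", 1.40),  # out of range, so the test query should return it
    ("01005", 0.00),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE capacity_factor (county_id_fips TEXT, capacity_factor_solar_pv REAL)"
)
conn.executemany("INSERT INTO capacity_factor VALUES (?, ?)", rows)

# The test query selects only the rows that violate the expectation.
failure_query = """
    SELECT county_id_fips, capacity_factor_solar_pv
    FROM capacity_factor
    WHERE capacity_factor_solar_pv < 0.0
       OR capacity_factor_solar_pv > 1.02
"""
failures = conn.execute(failure_query).fetchall()

# No rows means the test passes; any returned row counts as a failure.
test_passed = len(failures) == 0
print(failures, test_passed)
```

In a real `sql` test, only the `SELECT` statement would live in the test file, with the table name replaced by a reference to the `source`.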

The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages include useful tests out of the box that can be applied to any tables
in the project. There are several examples in `models/vcerare/schema.yml` and
`models/ferc1/schema.yml` which use `dbt-expectations`.

#### Modifying a table before test
In many cases we modify a table slightly before executing a test. There are a couple
of ways to accomplish this. First, when creating a `sql` test in `tests/`,
you can structure your query to modify the table/column before selecting failure
rows. The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
under `models/`. Any model created there will create a view in the `duckdb` database
used by `dbt`. You can then reference this model in the corresponding `schema.yml`
and apply tests as you would with `sources`. There's an example of this pattern in
`models/ferc1/ferc1_fbp_cost_per_mmbtu.sql`, which takes the table
`out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` and computes fuel cost per
mmbtu; `models/ferc1/schema.yml` then applies `dbt_expectations` tests to this model.
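
To make the example model easier to read, here is a small Python sketch that expands the same per-fuel expression the model's Jinja loop generates. It is only an illustration: the helper `cost_per_mmbtu_columns` and the placeholder `source_table` are hypothetical, and `dbt` renders the real model with Jinja rather than Python.

```python
FUEL_TYPES = ["gas", "oil", "coal"]


def cost_per_mmbtu_columns(fuel_types):
    """Build one cost-per-mmbtu column expression per fuel type."""
    return [
        f"{fuel}_fraction_cost * fuel_cost"
        f" / ({fuel}_fraction_mmbtu * fuel_mmbtu)"
        f" as {fuel}_cost_per_mmbtu"
        for fuel in fuel_types
    ]


columns = cost_per_mmbtu_columns(FUEL_TYPES)

# "source_table" stands in for the dbt source reference used by the model.
select_sql = "select\n    " + ",\n    ".join(columns) + "\nfrom source_table"
print(select_sql)
```

Each generated column (`gas_cost_per_mmbtu`, `oil_cost_per_mmbtu`, `coal_cost_per_mmbtu`) is then available as a column to attach tests to.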

#### Usage
There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:

```
dbt run
```

This will run all models, thus preparing any `sql` views that will be referenced in
tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table, using the name of the source that
defines it (for example `pudl`, `ferc1`, or `eia923`):

```
dbt test --select source:{source_name}.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```

##### Selecting target profile
To select between `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands. If no target is given,
`dbt` uses the default set in `profiles.yml`, which is `nightly`.
14 changes: 14 additions & 0 deletions dbt/dbt_project.yml
@@ -0,0 +1,14 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
test-paths: ["tests"]
12 changes: 12 additions & 0 deletions dbt/models/eia923/schema.yml
@@ -0,0 +1,12 @@
version: 2

sources:
  - name: eia923
    meta:
      external_location: |
        {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
        {%- endif -%}
    tables:
      - name: out_eia923__monthly_boiler_fuel
      - name: out_eia923__boiler_fuel
6 changes: 6 additions & 0 deletions dbt/models/ferc1/ferc1_fbp_cost_per_mmbtu.sql
@@ -0,0 +1,6 @@

select
{% for fuel_type in ["gas", "oil", "coal"] %}
    {{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu{% if not loop.last %},{% endif %}
{% endfor %}
from {{ source('ferc1', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
48 changes: 48 additions & 0 deletions dbt/models/ferc1/schema.yml
@@ -0,0 +1,48 @@
version: 2

sources:
  - name: ferc1
    meta:
      external_location: |
        {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
        {%- endif -%}
    tables:
      - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402

models:
  - name: ferc1_fbp_cost_per_mmbtu
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.05
              min_value: 1.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 15.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 2.0
              max_value: 10.0
      - name: oil_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 3.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 25.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 6.5
              max_value: 17.0
      - name: coal_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 0.75
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 4.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 1.0
              max_value: 2.5
7 changes: 7 additions & 0 deletions dbt/models/schema.yml
@@ -0,0 +1,7 @@
version: 2

sources:
  - name: ferc1
    description: "Data tests for ferc1 assets."
  - name: eia923
    description: "Data tests for eia923 assets."
51 changes: 51 additions & 0 deletions dbt/models/vcerare/schema.yml
@@ -0,0 +1,51 @@
version: 2

sources:
  - name: pudl
    meta:
      external_location: |
        {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
        {%- endif -%}
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
        data_tests:
          - dbt_expectations.expect_table_row_count_to_equal:
              value: |
                {%- if target.name == "etl-fast" -%} 27287400
                {%- else -%} 136437000
                {%- endif -%}
          - dbt_expectations.expect_compound_columns_to_be_unique:
              column_list: ["county_id_fips", "datetime_utc"]
              row_condition: "county_id_fips is not null"
        columns:
          - name: capacity_factor_solar_pv
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.02
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: capacity_factor_offshore_wind
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.00
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: hour_of_year
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  min_value: 8759
                  max_value: 8761
          - name: datetime_utc
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["{{ dbt_date.date(2020, 12, 31) }}"]
          - name: county_or_lake_name
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["bedford_city", "clifton_forge_city"]
8 changes: 8 additions & 0 deletions dbt/package-lock.yml
@@ -0,0 +1,8 @@
packages:
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: calogica/dbt_date
    version: 0.10.1
sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
5 changes: 5 additions & 0 deletions dbt/packages.yml
@@ -0,0 +1,5 @@
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<1.4.0"]
17 changes: 17 additions & 0 deletions dbt/profiles.yml
@@ -0,0 +1,17 @@
pudl_dbt:
  outputs:
    # Define targets for nightly builds, and local ETL full/fast
    # See models/schema.yml for further configuration
    nightly:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
      filesystems:
        - fs: s3
    etl-full:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
    etl-fast:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"

  target: nightly