Dbt setup #4011

Draft
wants to merge 35 commits into
base: main
Changes from 1 commit
Commits
bf40ffb
Add basic dbt setup
zschira Jan 9, 2025
9aac625
Update to dagster 1.9.7 & grpcio 1.67.1
zaneselvans Jan 10, 2025
415a113
Setup multiple dbt profiles
zschira Jan 10, 2025
ba32bd8
Merge remote-tracking branch 'refs/remotes/origin/dbt_setup' into dbt…
zaneselvans Jan 10, 2025
dc51c8f
Add all vcerare dbt tests
zschira Jan 10, 2025
590b02a
Add more example dbt tests
zschira Jan 13, 2025
63e663a
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zaneselvans Jan 13, 2025
d428b5d
Merge changes from main and revert to python 3.12
zaneselvans Jan 13, 2025
784cf96
Bump gdal to v3.10.1 bugfix release.
zaneselvans Jan 14, 2025
48a16e1
Merge branch 'main' into dbt_setup
zaneselvans Jan 14, 2025
6f45ba5
Merge branch 'main' into dbt_setup
zaneselvans Jan 15, 2025
ac41a41
Update to dagster 1.9.9
zaneselvans Jan 19, 2025
0ce1648
Merge branch 'main' into dbt_setup
zaneselvans Jan 20, 2025
c19cfd8
Merge branch 'main' into dbt_setup
zaneselvans Jan 20, 2025
6335e94
Reorganize dbt into multiple schema.yml files
zschira Jan 21, 2025
2585eca
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Jan 21, 2025
e24af8c
Move dbt project to top level of repo
zschira Jan 22, 2025
1ed85b3
Only set parquet path in dbt project once
zschira Jan 30, 2025
e92f5be
Standardize dbt maning scheme
zschira Jan 30, 2025
5de9ebe
Add more detail to README
zschira Jan 30, 2025
da9ae93
Add script to generate dbt scaffolding and row count tests
zschira Feb 5, 2025
7461786
Add documentation for dbt helper script
zschira Feb 5, 2025
0d120c6
Add out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 to yearly r…
zschira Feb 5, 2025
3666360
Add weighted quantile test (broken)
zschira Feb 5, 2025
a3579dc
Change row count test name
zschira Feb 5, 2025
c98219c
Update dbt initialization process
zschira Feb 5, 2025
f9b3fa7
Make dbt helper script work properly with non-yearly partitioned tables
zschira Feb 5, 2025
79e2153
Update dbt readme
zschira Feb 5, 2025
012ba4a
Regenerate ferc dbt schemas
zschira Feb 6, 2025
ff766b3
Merge branch 'main' into dbt_setup
zaneselvans Feb 10, 2025
94267a5
Improve dbt_helper command line usability
zschira Feb 10, 2025
8f660fd
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Feb 10, 2025
70e6895
Flesh out test migration command
zschira Feb 13, 2025
389c540
Add test migration tutorial
zschira Feb 14, 2025
eb0765a
Merge branch 'main' into dbt_setup
zschira Feb 14, 2025
Setup multiple dbt profiles
zschira committed Jan 10, 2025
commit 415a113c708dd84cd269f82c27fcd0c1454e5461
16 changes: 13 additions & 3 deletions src/pudl/dbt/models/schema.yml
@@ -1,12 +1,22 @@
 version: 2

 sources:
-  - name: pudl_nightly
+  - name: pudl
     meta:
-      external_location: "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet"
+      external_location: |
+        {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'

Member

Would it also work to point this directly at S3 rather than going through the HTTPS interface?

Member

I changed this to s3://pudl.catalyst.coop/nightly/{name}.parquet and it seems to work. I think going through s3:// directly will probably be more performant, won't it? E.g. in the case where there are efficiencies to be had in querying only small portions of the larger Parquet files.

Member

Interestingly, with the s3:// URL it didn't give me any error, but it also didn't seem to be making much progress. There was just a ton of data being downloaded. Not sure why.

+        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
+        {%- endif -%}
     tables:
       - name: out_vcerare__hourly_available_capacity_factor
         columns:
           - name: capacity_factor_solar_pv
-            tests:
+            data_tests:
               - not_null
+              - dbt_expectations.expect_column_max_to_be_between:
+                  max_value: 1.02
+          - name: capacity_factor_offshore_wind
+            data_tests:
+              - not_null
+              - dbt_expectations.expect_column_max_to_be_between:
+                  max_value: 1.00
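
The branch in `external_location` above picks a Parquet location per dbt target. As a rough illustration only, the same selection can be sketched in plain Python. The function name `external_location` and its `env` parameter are hypothetical and not part of this PR; the real logic is evaluated by dbt's Jinja renderer, and the `{name}` placeholder is filled in by the DuckDB adapter rather than by code like this.

```python
import os
from typing import Mapping, Optional

# URL copied from the schema.yml hunk above.
NIGHTLY_URL = (
    "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet"
)


def external_location(
    target_name: str, table: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Mimic the if/else in schema.yml for a single table (illustrative only)."""
    env = os.environ if env is None else env
    if target_name == "nightly":
        template = NIGHTLY_URL
    else:
        # Any non-nightly target (etl-full, etl-fast) reads local Parquet files.
        # A missing PUDL_OUTPUT raises KeyError here, standing in for the
        # error dbt reports when env_var() has no value and no default.
        template = env["PUDL_OUTPUT"] + "/parquet/{name}.parquet"
    # Substitute only the {name} placeholder, leaving any other braces alone.
    return template.replace("{name}", table)
```

Note that every target other than `nightly` falls into the local branch, so a misspelled target name would silently resolve to `PUDL_OUTPUT` rather than fail.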
8 changes: 8 additions & 0 deletions src/pudl/dbt/package-lock.yml
@@ -0,0 +1,8 @@
+packages:
+  - package: calogica/dbt_expectations
+    version: 0.10.4
+  - package: dbt-labs/dbt_utils
+    version: 1.3.0
+  - package: calogica/dbt_date
+    version: 0.10.1
+sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
5 changes: 5 additions & 0 deletions src/pudl/dbt/packages.yml
@@ -0,0 +1,5 @@
+packages:
+  - package: calogica/dbt_expectations
+    version: [">=0.10.0", "<0.11.0"]
+  - package: dbt-labs/dbt_utils
+    version: [">=1.3.0", "<1.4.0"]

Member

I see neither of these are available in conda-forge. It all works fine now with the dbt CLI, but will we need to get them packaged with conda to make dbt work from within an @asset_check as part of our Dagster pipeline?
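
For reference, the version ranges in `packages.yml` and the pins in `package-lock.yml` can be checked against each other with a few lines of Python. This is a simplified sketch: it assumes plain `MAJOR.MINOR.PATCH` versions and is not dbt's own semver resolver; the helper names `parse` and `satisfies` are made up for illustration.

```python
import operator
from typing import Iterable, Tuple

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse(version: str) -> Tuple[int, ...]:
    """Turn '0.10.4' into (0, 10, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, specs: Iterable[str]) -> bool:
    """Return True if `version` meets every spec such as '>=0.10.0'."""
    for spec in specs:
        for symbol, compare in _OPS:
            if spec.startswith(symbol):
                if not compare(parse(version), parse(spec[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unrecognized version spec: {spec!r}")
    return True


# Pins from package-lock.yml checked against the ranges in packages.yml.
print(satisfies("0.10.4", [">=0.10.0", "<0.11.0"]))  # dbt_expectations
print(satisfies("1.3.0", [">=1.3.0", "<1.4.0"]))     # dbt_utils
```

Comparing integer tuples rather than strings matters here: as strings, `"0.10.4"` would sort before `"0.9.0"`.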

12 changes: 10 additions & 2 deletions src/pudl/dbt/profiles.yml
@@ -1,9 +1,17 @@
 pudl_dbt:
   outputs:
-    dev:
+    # Define targets for nightly builds, and local ETL full/fast
+    # See models/schema.yml for further configuration
+    nightly:
       type: duckdb
       path: /tmp/pudl.duckdb
       filesystems:
         - fs: s3
+    etl-full:
+      type: duckdb
+      path: /tmp/pudl.duckdb
+    etl-fast:
+      type: duckdb
+      path: /tmp/pudl.duckdb

-  target: dev
+  target: nightly
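
With three targets defined and `nightly` as the default, the other two are selected with dbt's `--target` flag. Below is a minimal sketch of a wrapper that builds such a command and checks the one precondition visible in this diff, namely that the local targets need `PUDL_OUTPUT`. The helper `dbt_build_command` is hypothetical and not part of this PR.

```python
import os
from typing import List, Mapping, Optional

# Target names as defined in profiles.yml above.
TARGETS = ("nightly", "etl-full", "etl-fast")


def dbt_build_command(
    target: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Build a `dbt build` argument list for one of the configured targets."""
    env = os.environ if env is None else env
    if target not in TARGETS:
        raise ValueError(f"unknown dbt target: {target!r}")
    # The local targets resolve Parquet paths via env_var('PUDL_OUTPUT').
    if target != "nightly" and "PUDL_OUTPUT" not in env:
        raise RuntimeError("PUDL_OUTPUT must be set for local targets")
    return ["dbt", "build", "--target", target]
```

The returned list could be passed to `subprocess.run(cmd, check=True, cwd=...)`, with `cwd` pointing at the dbt project directory (`src/pudl/dbt` at this commit, judging by the file paths in the diff).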
5 changes: 0 additions & 5 deletions src/pudl/dbt/tests/schema.yml

This file was deleted.

5 changes: 0 additions & 5 deletions src/pudl/dbt/tests/vcerare_wind_cap_factor_upper_bound.sql

This file was deleted.