Update EIA-EPA crosswalk to include multiple years of crosswalk data #4056
src/pudl/etl/glue_assets.py
    year: f"camd-eia-crosswalk-latest-{year}/epa_eia_crosswalk.csv"
    for year in range(2019, 2024)
}
logger.info(csv_map)
Oops, this should get cleaned up.
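For context, the fragment under review appears to build a mapping from crosswalk year to the CSV path inside each archive. A minimal self-contained sketch, assuming the dictionary is named `csv_map` (as the `logger.info` call suggests) and using a hypothetical helper name, `build_csv_map`, that is not part of the PR:

```python
def build_csv_map(start_year: int = 2019, end_year: int = 2023) -> dict[int, str]:
    """Map each crosswalk year to its CSV path (hypothetical helper, for illustration)."""
    return {
        year: f"camd-eia-crosswalk-latest-{year}/epa_eia_crosswalk.csv"
        # range() excludes its end point, so add 1 to include end_year.
        for year in range(start_year, end_year + 1)
    }


csv_map = build_csv_map()
```

With the defaults this yields one entry per year from 2019 through 2023, matching `range(2019, 2024)` in the diff.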
@grgmiller A heads up, as one of the primary users of our crosswalk outputs - we've rerun the crosswalk with EIA data from 2019-2023, and we're pulling the new data into the ETL in this PR. I've done a fair amount of investigating (outlined above) to make sure all changes to the tables are expected based on changes in the underlying data pulled from the EPA CAMD FACT API in the process of running the crosswalk, but this will change a few hundred unit ID matches in our outputs. Let me know if you have any questions or concerns!
Okay, I think this is good. I appreciate the detailed exploration of the differences between the existing crosswalk and the new one. The diffs are alarming on the face of it, but they seem legit and are mostly adding new associations.
One thing I can't help thinking (and it feels way out of scope for this PR) is that we could update our crosswalk to treat it as annual, or at least time-varying like the SCDs. Again, super out of scope here, but it could have avoided some confusion in this integration.
docs/release_notes.rst
* Ran the crosswalk using 2019, 2020, 2022 and 2023 EIA data, and incorporated the new
  crosswalk data into the generation of :ref:`core_epa__assn_eia_epacamd` and
  :ref:`core_epa__assn_eia_epacamd_subplant_ids`. See :issue:`4039` and :pr:`4056`.
Nit, but I don't know if a casual user will know what "ran the crosswalk" means. I think even "Updated the crosswalk..." would be more accessible.
Thanks @e-belfer! I haven't followed this PR closely - is this PR essentially adapting the EPA's (R) code to update the PSDC with more recent years of data? Or is this a separate effort that builds on top of this? One thing I'll flag for you is that in our most recent OGE release (https://github.com/singularity-energy/open-grid-emissions/releases/tag/v0.6.0) we actually did a lot of manual updates to the crosswalk as well (see the "Expanded and enhanced EPA-EIA crosswalking" section of the release notes). One major change is that we added start and end years to the crosswalk, because we found that historically these mappings change over time and sometimes even switch back and forth. So if you look at https://github.com/singularity-energy/open-grid-emissions/blob/main/src/oge/reference_tables/epa_eia_crosswalk_manual.csv you'll now see how these mappings change over time. Not sure if this is helpful or if you've already tackled this in your PR.
That's exactly it - we re-ran the EPA R scripts with different years of EIA data, and are pulling the additional data into our pipeline. Because the scripts grab EPA data directly from the CAMD FACT API, there are also some updates to the EPA data that happen as a result of re-running the crosswalks.
Thanks for flagging this! This is out of scope for this PR but is noted as a TODO in the code itself and tracked in #3691. I think we're definitely interested in pulling the manual mappings into our ETL, but there's a bit of design work to do here to handle time-variant mappings, so I think the plan is to circle back to this in the future.
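To make the time-variant idea discussed in this thread concrete, here is a minimal sketch of resolving which mappings apply in a given report year when each crosswalk row carries a start and end year. The field names (`start_year`, `end_year`), the `matches_for_year` helper, and the sample rows are all hypothetical; they are not taken from the OGE or PUDL schemas and only illustrate the design question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrosswalkRow:
    """One EPA unit to EIA generator mapping with an optional validity window."""

    plant_id_epa: int
    emissions_unit_id_epa: str
    generator_id: str
    start_year: Optional[int] = None  # None: no known start (assumption)
    end_year: Optional[int] = None  # None: mapping is still valid (assumption)


def matches_for_year(rows, year):
    """Return the rows whose validity window includes `year` (bounds inclusive)."""
    return [
        row
        for row in rows
        if (row.start_year is None or row.start_year <= year)
        and (row.end_year is None or year <= row.end_year)
    ]


# Made-up example: the same EPA unit maps to different generators over time.
rows = [
    CrosswalkRow(9999, "U1", "GEN_A", start_year=2015, end_year=2019),
    CrosswalkRow(9999, "U1", "GEN_B", start_year=2020),
]
```

Under this design a single static table can no longer be joined on IDs alone; every join also needs the report year, which is part of the design work mentioned above.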
Overview
Closes #4039.
What problem does this address?
What did you change?
glue_assets.py
What changed in the outputs we produce?
"core_epa__assn_eia_epacamd"
The original dataset:
The new dataset:
Have we changed any of the 2018 or 2021 matches in this process?
What's up with these 14 changed records? Let's look at all the records with EPA plant IDs and 2021 report years:
We see a few new EPA plants (54271, 54350, 10776), a new generator ID (8A), and one dropped generator ID (GEN3).
"_core_epa__assn_eia_epacamd_unique"
When comparing the new and old deduplicated DFs, we see that there are only new rows of data. No generator IDs have been lost.
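A check like this can be expressed as an anti-join on the key columns: keys present only in the old table are losses, and keys present only in the new table are additions. The sketch below uses plain Python sets; the `diff_keys` helper, the key column names, and the sample rows are hypothetical stand-ins, not the actual tables or the code used in this PR.

```python
def diff_keys(old_rows, new_rows, key_cols):
    """Return (lost, added) sets of key tuples between two lists of dict rows."""
    old_keys = {tuple(row[col] for col in key_cols) for row in old_rows}
    new_keys = {tuple(row[col] for col in key_cols) for row in new_rows}
    return old_keys - new_keys, new_keys - old_keys


# Made-up rows for illustration only.
key_cols = ["plant_id_eia", "generator_id"]
old_rows = [
    {"plant_id_eia": 1, "generator_id": "A"},
    {"plant_id_eia": 1, "generator_id": "B"},
]
new_rows = [
    {"plant_id_eia": 1, "generator_id": "A"},
    {"plant_id_eia": 1, "generator_id": "B"},
    {"plant_id_eia": 2, "generator_id": "C"},
]

lost, added = diff_keys(old_rows, new_rows, key_cols)
```

An empty `lost` set corresponds to the "no generator IDs have been lost" result. With DataFrames, an outer `merge` with `indicator=True` gives the same left-only and right-only split.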
"core_epa__assn_eia_epacamd_subplant_ids"
The original dataset:
The new dataset:
What changed?
Take for example plant ID 3, Barry:
The last three rows were generated by merging on EIA generator and boiler data, and don't correspond to the originally existing crosswalk. In the new crosswalk, both generators are mapped to unit ID 8. This explains the shrinking of row counts between the two dataframes.
Documentation
Make sure to update relevant aspects of the documentation.
Tasks
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list