Update EIA-EPA crosswalk to include multiple years of crosswalk data #4056

Merged
merged 9 commits into from
Feb 12, 2025

Member

@e-belfer e-belfer commented Feb 11, 2025

Overview

Closes #4039.

What problem does this address?

What did you change?

  • Added all years of crosswalk data to glue_assets.py
  • Updated stale references that described the crosswalk as containing only 2018 data, or only 2018 and 2021 data
  • Added assertions concerning changing matches over time
  • Updated table descriptions
  • Added working partitions to the data
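
As an illustration of what an assertion about changing matches over time could look like, here is a minimal sketch. It is not the assertion added in this PR: the function name `count_changed_matches`, the `MAX_CHANGED` tolerance, and the toy data are all hypothetical. A unit that is simply absent in some years is not counted as changed.

```python
import pandas as pd


def count_changed_matches(crosswalk: pd.DataFrame) -> int:
    """Count EPA units whose set of matched EIA generators differs between years."""
    key = ["plant_id_epa", "emissions_unit_id_epa"]
    # Describe every unit-year by a sorted, joined string of its EIA matches.
    matches = (
        crosswalk.assign(
            match=crosswalk["plant_id_eia"].astype(str)
            + ":"
            + crosswalk["generator_id"].astype(str)
        )
        .groupby(["report_year", *key])["match"]
        .agg(lambda s: "|".join(sorted(set(s))))
        .reset_index()
    )
    # A unit has changed if it has more than one distinct match set across years.
    n_variants = matches.groupby(key)["match"].nunique()
    return int((n_variants > 1).sum())


# Toy data: unit (2, "B") switches from generator G2 to G3 between years.
toy = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2022, 2022],
        "plant_id_epa": [1, 2, 1, 2],
        "emissions_unit_id_epa": ["A", "B", "A", "B"],
        "plant_id_eia": [1, 2, 1, 2],
        "generator_id": ["G1", "G2", "G1", "G3"],
    }
)
MAX_CHANGED = 1  # hypothetical tolerance, chosen only for this example
n_changed = count_changed_matches(toy)
assert n_changed <= MAX_CHANGED, f"{n_changed} units changed matches over time"
```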

What changed in the outputs we produce?

"core_epa__assn_eia_epacamd"

The original dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12832 entries, 0 to 12831
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   report_year            12832 non-null  int32 
 1   plant_id_epa           12832 non-null  int32 
 2   emissions_unit_id_epa  12832 non-null  object
 3   generator_id_epa       12832 non-null  object
 4   plant_id_eia           12832 non-null  int32 
 5   boiler_id              5023 non-null   object
 6   generator_id           12832 non-null  object

The new dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38493 entries, 0 to 38492
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   report_year            38493 non-null  Int64 
 1   plant_id_epa           38493 non-null  Int64 
 2   emissions_unit_id_epa  38493 non-null  string
 3   generator_id_epa       38493 non-null  string
 4   plant_id_eia           38493 non-null  Int64 
 5   boiler_id              14988 non-null  string
 6   generator_id           38493 non-null  string
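
The dtype change between the two summaries (int32/object to Int64/string) is what you would get from casting to pandas' nullable extension types. A small sketch on made-up rows, assuming a plain `astype` call rather than whatever dtype enforcement the ETL actually uses:

```python
import pandas as pd

# Toy frame using the old dtypes: int32 IDs and an object column with a gap.
old_style = pd.DataFrame(
    {
        "report_year": pd.Series([2021, 2021], dtype="int32"),
        "plant_id_epa": pd.Series([52176, 52176], dtype="int32"),
        "boiler_id": pd.Series([None, "HRSG1"], dtype="object"),
    }
)

# Cast to nullable extension dtypes; the missing boiler_id becomes <NA>.
new_style = old_style.astype(
    {"report_year": "Int64", "plant_id_epa": "Int64", "boiler_id": "string"}
)
print(new_style.dtypes)
```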

Have we changed any of the 2018 or 2021 matches in this process?

test = core_df.merge(core_df_old, how = 'outer', indicator = True)
test.loc[test._merge != 'both'].report_year.value_counts()

report_year
2019    6443
2020    6427
2022    6410
2023    6371
2021      14
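
For anyone who wants to reproduce this kind of comparison, here is a self-contained sketch of the same outer-merge-with-indicator pattern on toy data. The frames `new` and `old` stand in for `core_df` and `core_df_old`, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the new and old crosswalk tables.
new = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2022],
        "plant_id_epa": [1167, 1167, 1167],
        "generator_id": ["8", "8A", "8"],
    }
)
old = pd.DataFrame(
    {"report_year": [2021], "plant_id_epa": [1167], "generator_id": ["8"]}
)

# With no `on=` argument, merge joins on every shared column, so a row is
# "both" only when it is identical in the two tables.
diff = new.merge(old, how="outer", indicator=True)
changed = diff.loc[diff["_merge"] != "both"]
counts = changed["report_year"].value_counts()
print(counts)
```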

What's up with these 14 changed records? Let's look at all of the 2021 records for the EPA plant IDs that have at least one mismatch:

mismatch = test.loc[(test._merge != 'both')&(test.report_year==2021)]
test = test.set_index(['report_year', 'plant_id_epa'])
mismatch = mismatch.set_index(['report_year', 'plant_id_epa'])
test[test.index.isin(mismatch.index)]

report_year  plant_id_epa  emissions_unit_id_epa  generator_id_epa  plant_id_eia  boiler_id  generator_id  _merge
2021         1167          8                      8                 1167          8          8             both
2021         1167          8                      8A                1167          8          8A            left_only
2021         1167          9                      9                 1167          9          9             both
2021         10776         GTG                    GTG               10776                    GTG           left_only
2021         52176         1                      GEN1              52176                    GEN1          both
2021         52176         1                      GEN3              52176         HRSG1      GEN3          right_only
2021         52176         2                      GEN2              52176                    GEN2          both
2021         52176         2                      GEN3              52176         HRSG2      GEN3          right_only
2021         54271         A01                    CTG1              54271                    CTG1          left_only
2021         54271         A01                    STG               54271         HRSG1      STG           left_only
2021         54271         A02                    CTG2              54271                    CTG2          left_only
2021         54271         A02                    STG               54271         HRSG2      STG           left_only
2021         54350         A001                   GTA               54350                    GTA           left_only
2021         54350         A001                   STM               54350                    STM           left_only
2021         54350         A002                   GTB               54350                    GTB           left_only
2021         54350         A002                   STM               54350                    STM           left_only
2021         54350         A003                   GTC               54350                    GTC           left_only
2021         54350         A003                   STM               54350                    STM           left_only

We see a few new EPA plants (54271, 54350, 10776), a new generator ID (8A), and one dropped generator ID (GEN3).
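
The lookup used above (pull every row that shares a year and plant with a mismatched row, so the changed records can be read in context) can be sketched on toy data like this. The rows are illustrative only.

```python
import pandas as pd

# Toy merge result with an indicator column, as produced by an outer merge.
diff = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2021, 2022],
        "plant_id_epa": [1167, 1167, 10, 1167],
        "generator_id": ["8", "8A", "1", "8"],
        "_merge": ["both", "left_only", "both", "left_only"],
    }
)
key = ["report_year", "plant_id_epa"]

# Rows that differ between the tables, restricted to 2021.
mismatch = diff.loc[(diff["_merge"] != "both") & (diff["report_year"] == 2021)]

# Keep every row whose (year, plant) pair appears among the mismatches,
# including the unchanged "both" rows for the same plant.
indexed = diff.set_index(key)
context = indexed[indexed.index.isin(mismatch.set_index(key).index)]
print(context)
```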

"_core_epa__assn_eia_epacamd_unique"

When comparing the new and old deduplicated DFs, we see that there are only new rows of data. No generator IDs have been lost.

deduped._merge.value_counts()
_merge
both          6542
left_only       94
right_only       0
Name: count, dtype: int64
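
A sketch of how a comparison like this can be run, assuming the deduplicated table is the crosswalk with `report_year` dropped and duplicate rows removed (that is an assumption about the asset, and the helper `dedupe` is hypothetical):

```python
import pandas as pd


def dedupe(crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Drop the year column and keep each remaining row once (assumed logic)."""
    return (
        crosswalk.drop(columns="report_year")
        .drop_duplicates()
        .reset_index(drop=True)
    )


# Toy new and old crosswalks; the new one adds a generator in 2022.
new = pd.DataFrame(
    {
        "report_year": [2021, 2022, 2022],
        "plant_id_epa": [1, 1, 2],
        "generator_id": ["G1", "G1", "G2"],
    }
)
old = pd.DataFrame(
    {"report_year": [2021], "plant_id_epa": [1], "generator_id": ["G1"]}
)

deduped = dedupe(new).merge(dedupe(old), how="outer", indicator=True)
merge_counts = deduped["_merge"].value_counts()

# right_only rows would be matches present in the old table but lost in the new.
assert merge_counts["right_only"] == 0, "Some old matches were lost"
```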

"core_epa__assn_eia_epacamd_subplant_ids"

The original dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41458 entries, 0 to 41457
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   plant_id_eia           41458 non-null  int32  
 1   plant_id_epa           40603 non-null  float64
 2   subplant_id            41458 non-null  int32  
 3   unit_id_pudl           5888 non-null   float64
 4   emissions_unit_id_epa  41458 non-null  object 
 5   generator_id           40603 non-null  object 
dtypes: float64(2), int32(2), object(2)

The new dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41320 entries, 0 to 41319
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   plant_id_eia           41320 non-null  Int64 
 1   plant_id_epa           40524 non-null  Int64 
 2   subplant_id            41320 non-null  Int64 
 3   unit_id_pudl           5894 non-null   Int64 
 4   emissions_unit_id_epa  41320 non-null  string
 5   generator_id           40524 non-null  string

What changed?

  • All new plant, generator and unit ID combos from the crosswalk are in the new subplant table.
  • None of the lost records appear in the new crosswalks.

Take for example plant ID 3, Barry:

    plant_id_epa  emissions_unit_id_epa  generator_id  plant_id_eia  subplant_id_x  unit_id_pudl_x  subplant_id_y  unit_id_pudl_y  _merge
0   3.0           1                      1             3             0              1               0.0            1.0             both
1   3.0           2                      2             3             1              2               1.0            2.0             both
2   3.0           3                      3             3             2              3               2.0            3.0             both
3   3.0           4                      4             3             3              4               3.0            4.0             both
4   3.0           5                      5             3             4              5               4.0            5.0             both
5   3.0           6A                     A1CT          3             5              6               5.0            6.0             both
6   3.0           6A                     A1ST          3             5              6               5.0            6.0             both
7   3.0           6B                     A1CT2         3             5              6               5.0            6.0             both
8   3.0           6B                     A1ST          3             5              6               5.0            6.0             both
9   3.0           7A                     A2C1          3             6              7               6.0            7.0             both
10  3.0           7A                     A2ST          3             6              7               6.0            7.0             both
11  3.0           7B                     A2C2          3             6              7               6.0            7.0             both
12  3.0           7B                     A2ST          3             6              7               6.0            7.0             both
13  3.0           8                      A3C1          3             7              8               NaN            NaN             left_only
14  3.0           8                      A3ST          3             7              8               NaN            NaN             left_only
15  3.0           A3C1                   A3C1          3                                            7.0            8.0             right_only
16  3.0           A3ST                   A3ST          3                                            7.0            8.0             right_only
17                8                      NaN           3                                            8.0            NaN             right_only

The last three rows were generated by merging on EIA generator and boiler data, and don't correspond to the originally existing crosswalk. In the new crosswalk, both generators are mapped to unit ID 8. This explains the shrinking of row counts between the two dataframes.
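
The check that none of the lost records appear in the new crosswalk can be sketched as an anti-join followed by an inner join. The rows below are toy data loosely modeled on the Barry example, and the variable names are hypothetical.

```python
import pandas as pd

key = ["plant_id_eia", "emissions_unit_id_epa", "generator_id"]

# Toy subplant tables: the old one carries rows that are not from the crosswalk.
new_subplants = pd.DataFrame(
    {
        "plant_id_eia": [3, 3, 3],
        "emissions_unit_id_epa": ["7A", "8", "8"],
        "generator_id": ["A2C1", "A3C1", "A3ST"],
    }
)
old_subplants = pd.DataFrame(
    {
        "plant_id_eia": [3, 3, 3],
        "emissions_unit_id_epa": ["7A", "A3C1", "A3ST"],
        "generator_id": ["A2C1", "A3C1", "A3ST"],
    }
)
# For this toy case the new crosswalk has the same keys as the new subplant table.
new_crosswalk = new_subplants.copy()

# Records present only in the old subplant table are the "lost" ones.
compare = new_subplants.merge(old_subplants, on=key, how="outer", indicator=True)
lost = compare.loc[compare["_merge"] == "right_only", key]

# None of the lost records should be found in the new crosswalk.
still_in_crosswalk = lost.merge(new_crosswalk[key].drop_duplicates(), on=key)
assert still_in_crosswalk.empty, "Lost records unexpectedly found in crosswalk"
```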

Documentation

Make sure to update relevant aspects of the documentation.

Tasks


Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list


@e-belfer e-belfer self-assigned this Feb 11, 2025
@e-belfer e-belfer requested a review from cmgosnell February 11, 2025 21:33
year: f"camd-eia-crosswalk-latest-{year}/epa_eia_crosswalk.csv"
for year in range(2019, 2024)
}
logger.info(csv_map)
Member Author

Oops, this should get cleaned up.

@e-belfer
Member Author

@grgmiller A heads up, as one of the primary users of our crosswalk outputs - we've rerun the crosswalk with EIA data from 2019-2023, and we're pulling the new data into the ETL in this PR. I've done a fair amount of investigating (outlined above) to make sure all changes to the tables are expected based on changes in the underlying data pulled from the EPA CAMD FACT API in the process of running the crosswalk, but this will change a few hundred unit ID matches in our outputs. Let me know if you have any questions or concerns!

Member

@cmgosnell cmgosnell left a comment

okay. i think this is good. i appreciate the detailed explorations of the difference between the existing crosswalk and the new one. the diffs are alarming on the face of it but seem to be legit good and adding new associations.

one thing I can't help thinking (and it feels way oos for this pr) is that we could update our crosswalk to treat it as annual, or at least time-varying like the scd's. feels super oos for this but it could have avoided some confusion for this integration

Comment on lines 47 to 49
* Ran the crosswalk using 2019, 2020, 2022 and 2023 EIA data, and incorporated the new
crosswalk data into the generation of :ref:`core_epa__assn_eia_epacamd` and
:ref:`core_epa__assn_eia_epacamd_subplant_ids`. See :issue:`4039` and :pr:`4056`.
Member

nit but idk if a casual user will know what "ran the crosswalk" means. I think "Updated the crosswalk.." even would be more accessible

@grgmiller
Collaborator

Thanks @e-belfer! I haven't followed this PR closely - is this PR essentially adapting the EPA's (R) code to update the PSDC with more recent years of data? Or is this a separate effort that builds on top of this?

One thing I'll flag for you is that in our most recent OGE release (https://github.com/singularity-energy/open-grid-emissions/releases/tag/v0.6.0) we actually did a lot of manual updates to the crosswalk as well (see the "Expanded and enhanced EPA-EIA crosswalking" section of the release notes). One major change is that we added start and end years to the crosswalk, because we found that historically these mappings change over time and sometimes even switch back and forth. So if you look at https://github.com/singularity-energy/open-grid-emissions/blob/main/src/oge/reference_tables/epa_eia_crosswalk_manual.csv you'll now see how these mappings change over time. Not sure if this is helpful or if you've already tackled this in your PR.

@e-belfer
Member Author

> Thanks @e-belfer! I haven't followed this PR closely - is this PR essentially adapting the EPA's (R) code to update the PSDC with more recent years of data? Or is this a separate effort that builds on top of this?

That's exactly it - we re-ran the EPA R scripts with different years of EIA data, and are pulling the additional data into our pipeline. Because the scripts grab EPA data directly from the CAMD FACT API, there are also some updates to the EPA data that happen as a result of re-running the crosswalks.

> One thing I'll flag for you is that in our most recent OGE release (https://github.com/singularity-energy/open-grid-emissions/releases/tag/v0.6.0) we actually did a lot of manual updates to the crosswalk as well (see the "Expanded and enhanced EPA-EIA crosswalking" section of the release notes). One major change is that we added start and end years to the crosswalk, because we found that historically these mappings change over time and sometimes even switch back and forth. So if you look at https://github.com/singularity-energy/open-grid-emissions/blob/main/src/oge/reference_tables/epa_eia_crosswalk_manual.csv you'll now see how these mappings change over time. Not sure if this is helpful or if you've already tackled this in your PR.

Thanks for flagging this! This is out of scope for this PR but is noted as a TODO in the code itself and tracked in #3691. I think we're definitely interested in pulling the manual mappings into our ETL, but there's a bit of design work to do here to handle time-variant mappings so I think the plan is to circle back to this in the future.
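
To make the design question concrete, here is a rough sketch of one way annual crosswalk rows could be collapsed into start/end year spans, including mappings that switch away and back. This is only an illustration under assumptions: it is neither the OGE implementation nor anything planned in #3691, and `to_year_spans` and the toy data are hypothetical.

```python
import pandas as pd


def to_year_spans(crosswalk: pd.DataFrame, match_cols: list[str]) -> pd.DataFrame:
    """Collapse annual rows into one row per contiguous run of years per match.

    Assumes the match columns contain no missing values, since groupby
    drops rows with missing keys by default.
    """
    df = crosswalk[["report_year", *match_cols]].drop_duplicates()
    df = df.sort_values([*match_cols, "report_year"])
    # A new span starts whenever the year does not directly follow the
    # previous year in which the same match was seen.
    prev_year = df.groupby(match_cols)["report_year"].shift()
    df["span"] = (df["report_year"] != prev_year + 1).cumsum()
    return (
        df.groupby([*match_cols, "span"], as_index=False)
        .agg(start_year=("report_year", "min"), end_year=("report_year", "max"))
        .drop(columns="span")
    )


# Toy data: unit A maps to G1, switches to G2 in 2021, then back to G1.
toy = pd.DataFrame(
    {
        "report_year": [2019, 2020, 2021, 2022],
        "plant_id_epa": [1, 1, 1, 1],
        "emissions_unit_id_epa": ["A", "A", "A", "A"],
        "generator_id": ["G1", "G1", "G2", "G1"],
    }
)
spans = to_year_spans(toy, ["plant_id_epa", "emissions_unit_id_epa", "generator_id"])
print(spans)
```

In this toy case G1 ends up with two separate spans, which is the "switch back and forth" situation described above.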

@e-belfer e-belfer marked this pull request as ready for review February 12, 2025 15:25
@e-belfer e-belfer added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@e-belfer e-belfer enabled auto-merge February 12, 2025 19:02
@e-belfer e-belfer added this pull request to the merge queue Feb 12, 2025
Merged via the queue into main with commit 40635ae Feb 12, 2025
19 checks passed
@e-belfer e-belfer deleted the update-crosswalk branch February 12, 2025 20:34
@zaneselvans zaneselvans added the epacems Integration and analysis of the EPA CEMS dataset. label Feb 13, 2025
Labels
epacems Integration and analysis of the EPA CEMS dataset.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

EPA CAMD Crosswalk update
4 participants