Update EIA-EPA crosswalk to include multiple years of crosswalk data #4056

e-belfer · 2025-02-11T14:59:03Z

Overview

Closes #4039.

What problem does this address?

In camd-eia-crosswalk-latest#3 ("Crosswalks for 2018 thru 2023") and camd-eia-crosswalk-latest#5 ("Fix inflated crosswalk files"), we produced crosswalk data for each year from 2018 to 2023. We then archived this data in pudl-archiver#478 ("Update epacamd_eia to properly use latest version").
This PR pulls the data into PUDL, using a crosswalk generated with multiple years of EIA data.

What did you change?

Added all years of crosswalk data to glue_assets.py
Updated stale references to the crosswalk containing only 2018 data, or only 2018 and 2021 data
Added assertions about how matches change over time
Updated table descriptions
Added working partitions to the data

What changed in the outputs we produce?

"core_epa__assn_eia_epacamd"

The original dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12832 entries, 0 to 12831
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   report_year            12832 non-null  int32 
 1   plant_id_epa           12832 non-null  int32 
 2   emissions_unit_id_epa  12832 non-null  object
 3   generator_id_epa       12832 non-null  object
 4   plant_id_eia           12832 non-null  int32 
 5   boiler_id              5023 non-null   object
 6   generator_id           12832 non-null  object

The new dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38493 entries, 0 to 38492
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   report_year            38493 non-null  Int64 
 1   plant_id_epa           38493 non-null  Int64 
 2   emissions_unit_id_epa  38493 non-null  string
 3   generator_id_epa       38493 non-null  string
 4   plant_id_eia           38493 non-null  Int64 
 5   boiler_id              14988 non-null  string
 6   generator_id           38493 non-null  string

Have we changed any of the 2018 or 2021 matches in this process?

test = core_df.merge(core_df_old, how = 'outer', indicator = True)
test.loc[test._merge != 'both'].report_year.value_counts()

report_year
2019    6443
2020    6427
2022    6410
2023    6371
2021      14

What's up with these 14 changed records? Let's look at every 2021 record for the EPA plant IDs that have at least one changed match:

mismatch = test.loc[(test._merge != 'both')&(test.report_year==2021)]
test = test.set_index(['report_year', 'plant_id_epa'])
mismatch = mismatch.set_index(['report_year', 'plant_id_epa'])
test[test.index.isin(mismatch.index)]

		emissions_unit_id_epa	generator_id_epa	plant_id_eia	boiler_id	generator_id	_merge
report_year	plant_id_epa
2021	1167	8	8	1167	8	8	both
	1167	8	8A	1167	8	8A	left_only
	1167	9	9	1167	9	9	both
	10776	GTG	GTG	10776		GTG	left_only
	52176	1	GEN1	52176		GEN1	both
	52176	1	GEN3	52176	HRSG1	GEN3	right_only
	52176	2	GEN2	52176		GEN2	both
	52176	2	GEN3	52176	HRSG2	GEN3	right_only
	54271	A01	CTG1	54271		CTG1	left_only
	54271	A01	STG	54271	HRSG1	STG	left_only
	54271	A02	CTG2	54271		CTG2	left_only
	54271	A02	STG	54271	HRSG2	STG	left_only
	54350	A001	GTA	54350		GTA	left_only
	54350	A001	STM	54350		STM	left_only
	54350	A002	GTB	54350		GTB	left_only
	54350	A002	STM	54350		STM	left_only
	54350	A003	GTC	54350		GTC	left_only
	54350	A003	STM	54350		STM	left_only

We see a few new EPA plants (54271, 54350, 10776), a new generator ID (8A), and one dropped generator ID (GEN3).
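The year-by-year comparison above can be reproduced on toy data. This is a minimal sketch, assuming both tables share the same columns; `diff_by_year` and the sample rows are hypothetical and are not part of PUDL.

```python
import pandas as pd


def diff_by_year(new: pd.DataFrame, old: pd.DataFrame) -> pd.Series:
    """Count rows per report_year that appear in only one of the two tables."""
    # With no `on` argument, merge joins on every column the frames share.
    merged = new.merge(old, how="outer", indicator=True)
    changed = merged.loc[merged["_merge"] != "both"]
    return changed["report_year"].value_counts().sort_index()


# Hypothetical toy rows, not real crosswalk records.
cols = ["report_year", "plant_id_epa", "emissions_unit_id_epa", "generator_id"]
old = pd.DataFrame([(2021, 1, "A", "G1"), (2021, 1, "B", "G3")], columns=cols)
new = pd.DataFrame(
    [(2021, 1, "A", "G1"), (2021, 1, "A", "G2"), (2022, 1, "A", "G1")],
    columns=cols,
)

counts = diff_by_year(new, old)
print(counts)
```

Rows tagged `left_only` exist only in the new table and rows tagged `right_only` exist only in the old one, which is how the new and dropped generator IDs above were told apart.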

"_core_epa__assn_eia_epacamd_unique"

When comparing the new and old deduplicated DFs, we see that there are only new rows of data. No generator IDs have been lost.

deduped._merge.value_counts()
_merge
both          6542
left_only       94
right_only       0
Name: count, dtype: int64
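A rough sketch of this kind of check, assuming "deduplicated" means dropping `report_year` and keeping unique ID combinations. The actual deduplication logic in PUDL may differ; `compare_deduped` and the toy rows are hypothetical.

```python
import pandas as pd

ID_COLS = ["plant_id_epa", "emissions_unit_id_epa", "plant_id_eia", "generator_id"]


def compare_deduped(new: pd.DataFrame, old: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge the unique ID combinations of two crosswalk tables."""
    new_unique = new[ID_COLS].drop_duplicates()
    old_unique = old[ID_COLS].drop_duplicates()
    return new_unique.merge(old_unique, how="outer", on=ID_COLS, indicator=True)


# Hypothetical toy rows: the same mapping in two years, plus one new generator.
cols = ["report_year"] + ID_COLS
old = pd.DataFrame(
    [(2018, 1, "A", 1, "G1"), (2021, 1, "A", 1, "G1")], columns=cols
)
new = pd.DataFrame(
    [(2018, 1, "A", 1, "G1"), (2021, 1, "A", 1, "G1"), (2023, 1, "A", 1, "G2")],
    columns=cols,
)

deduped = compare_deduped(new, old)
n_added = int((deduped["_merge"] == "left_only").sum())
n_lost = int((deduped["_merge"] == "right_only").sum())

# No ID combination from the old table should be missing from the new one.
assert n_lost == 0
```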

"core_epa__assn_eia_epacamd_subplant_ids"

The original dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41458 entries, 0 to 41457
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   plant_id_eia           41458 non-null  int32  
 1   plant_id_epa           40603 non-null  float64
 2   subplant_id            41458 non-null  int32  
 3   unit_id_pudl           5888 non-null   float64
 4   emissions_unit_id_epa  41458 non-null  object 
 5   generator_id           40603 non-null  object 
dtypes: float64(2), int32(2), object(2)

The new dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41320 entries, 0 to 41319
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   plant_id_eia           41320 non-null  Int64 
 1   plant_id_epa           40524 non-null  Int64 
 2   subplant_id            41320 non-null  Int64 
 3   unit_id_pudl           5894 non-null   Int64 
 4   emissions_unit_id_epa  41320 non-null  string
 5   generator_id           40524 non-null  string

What changed?

All new plant, generator and unit ID combos from the crosswalk are in the new subplant table
None of the lost records are in the new crosswalks.

Take for example plant ID 3, Barry:

	plant_id_epa	emissions_unit_id_epa	generator_id	plant_id_eia	subplant_id_x	unit_id_pudl_x	subplant_id_y	unit_id_pudl_y	_merge
0	3.0	1	1	3	0	1	0.0	1.0	both
1	3.0	2	2	3	1	2	1.0	2.0	both
2	3.0	3	3	3	2	3	2.0	3.0	both
3	3.0	4	4	3	3	4	3.0	4.0	both
4	3.0	5	5	3	4	5	4.0	5.0	both
5	3.0	6A	A1CT	3	5	6	5.0	6.0	both
6	3.0	6A	A1ST	3	5	6	5.0	6.0	both
7	3.0	6B	A1CT2	3	5	6	5.0	6.0	both
8	3.0	6B	A1ST	3	5	6	5.0	6.0	both
9	3.0	7A	A2C1	3	6	7	6.0	7.0	both
10	3.0	7A	A2ST	3	6	7	6.0	7.0	both
11	3.0	7B	A2C2	3	6	7	6.0	7.0	both
12	3.0	7B	A2ST	3	6	7	6.0	7.0	both
13	3.0	8	A3C1	3	7	8	NaN	NaN	left_only
14	3.0	8	A3ST	3	7	8	NaN	NaN	left_only
15	3.0	A3C1	A3C1	3			7.0	8.0	right_only
16	3.0	A3ST	A3ST	3			7.0	8.0	right_only
17		8	NaN	3			8.0	NaN	right_only

The last three rows were generated by merging on EIA generator and boiler data, and don't correspond to anything in the original crosswalk. In the new crosswalk, both generators are mapped to unit ID 8. This explains why the row count shrinks between the two DataFrames.
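The "none of the lost records are in the new crosswalks" check can be sketched as an anti-join followed by an inner join. This assumes the three key columns below identify a record; the helper names are hypothetical, and the toy rows are only loosely modeled on the Barry example.

```python
import pandas as pd

KEYS = ["plant_id_eia", "emissions_unit_id_epa", "generator_id"]


def lost_rows(old_sub: pd.DataFrame, new_sub: pd.DataFrame) -> pd.DataFrame:
    """Anti-join: key combinations in the old subplant table but not the new one."""
    merged = old_sub[KEYS].drop_duplicates().merge(
        new_sub[KEYS].drop_duplicates(), how="left", on=KEYS, indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", KEYS]


def still_in_crosswalk(lost: pd.DataFrame, new_xwalk: pd.DataFrame) -> pd.DataFrame:
    """Lost key combinations that the new crosswalk still contains."""
    return lost.merge(new_xwalk[KEYS].drop_duplicates(), how="inner", on=KEYS)


# Toy rows for illustration only.
old_sub = pd.DataFrame(
    [(3, "1", "1"), (3, "A3C1", "A3C1"), (3, "A3ST", "A3ST")], columns=KEYS
)
new_sub = pd.DataFrame(
    [(3, "1", "1"), (3, "8", "A3C1"), (3, "8", "A3ST")], columns=KEYS
)
new_xwalk = new_sub.copy()

lost = lost_rows(old_sub, new_sub)
unexpected = still_in_crosswalk(lost, new_xwalk)

# Every lost row should be absent from the new crosswalk.
assert unexpected.empty
```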

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Update the release notes: reference the PR and related issues.
Update relevant Data Source jinja templates (see docs/data_sources/templates).
Update relevant table or source description metadata (see src/metadata).
Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Compare the deduplicated crosswalk tables
Is the new crosswalk missing CEMS generators that used to exist?
Are there any unit IDs from CEMS that aren't in the new versions of our tables?
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.

e-belfer · 2025-02-11T21:34:05Z

src/pudl/etl/glue_assets.py

+        year: f"camd-eia-crosswalk-latest-{year}/epa_eia_crosswalk.csv"
+        for year in range(2019, 2024)
+    }
+    logger.info(csv_map)


Oops, this should get cleaned up.
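For context, a per-year path map like the one in this diff might be consumed roughly as follows. This is only a sketch: `build_csv_map`, `concat_years`, the in-memory CSV, and its column names are hypothetical stand-ins, and the real asset reads from the archived data instead.

```python
import io

import pandas as pd


def build_csv_map(years) -> dict[int, str]:
    """Map each year to the path of its crosswalk CSV inside the archive."""
    return {
        year: f"camd-eia-crosswalk-latest-{year}/epa_eia_crosswalk.csv"
        for year in years
    }


def concat_years(frames_by_year: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-year crosswalks, tagging each row with its report_year."""
    return pd.concat(
        [df.assign(report_year=year) for year, df in frames_by_year.items()],
        ignore_index=True,
    )


csv_map = build_csv_map(range(2019, 2024))

# Stand-in for reading each archived file; the columns here are made up.
fake_csv = "plant_id_epa,generator_id\n3,1\n"
frames = {year: pd.read_csv(io.StringIO(fake_csv)) for year in csv_map}
combined = concat_years(frames)
```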

e-belfer · 2025-02-11T21:39:57Z

@grgmiller A heads up, as one of the primary users of our crosswalk outputs - we've rerun the crosswalk with EIA data from 2019-2023, and we're pulling the new data into the ETL in this PR. I've done a fair amount of investigating (outlined above) to make sure all changes to the tables are expected based on changes in the underlying data pulled from the EPA CAMD FACT API in the process of running the crosswalk, but this will change a few hundred unit ID matches in our outputs. Let me know if you have any questions or concerns!

cmgosnell

okay. i think this is good. i appreciate the detailed exploration of the differences between the existing crosswalk and the new one. the diffs are alarming on the face of it, but they seem legit good and are adding new associations.

one thing I can't help thinking (and it feels way oos for this pr) is that we could update our crosswalk to treat it as annual, or at least time-varying like the scd's. feels super oos for this, but it could have avoided some confusion in this integration

cmgosnell · 2025-02-11T22:02:49Z

docs/release_notes.rst

+* Ran the crosswalk using 2019, 2020, 2022 and 2023 EIA data, and incorporated the new
+  crosswalk data into the generation of :ref:`core_epa__assn_eia_epacamd` and
+  :ref:`core_epa__assn_eia_epacamd_subplant_ids`. See :issue:`4039` and :pr:`4056`.


nit, but idk if a casual user will know what "ran the crosswalk" means. I think even "Updated the crosswalk..." would be more accessible

grgmiller · 2025-02-11T22:59:13Z

Thanks @e-belfer! I haven't followed this PR closely - is this PR essentially adapting the EPA's (R) code to update the PSDC with more recent years of data? Or is this a separate effort that builds on top of this?

One thing I'll flag for you is that in our most recent OGE release (https://github.com/singularity-energy/open-grid-emissions/releases/tag/v0.6.0) we actually did a lot of manual updates to the crosswalk as well (see the "Expanded and enhanced EPA-EIA crosswalking" section of the release notes). One major change is that we added start and end years to the crosswalk, because we found that historically these mappings change over time and sometimes even switch back and forth. So if you look at https://github.com/singularity-energy/open-grid-emissions/blob/main/src/oge/reference_tables/epa_eia_crosswalk_manual.csv you'll now see how these mappings change over time. Not sure if this is helpful or if you've already tackled this in your PR.

e-belfer · 2025-02-12T15:20:08Z

> Thanks @e-belfer! I haven't followed this PR closely - is this PR essentially adapting the EPA's (R) code to update the PSDC with more recent years of data? Or is this a separate effort that builds on top of this?

That's exactly it - we re-ran the EPA R scripts with different years of EIA data, and are pulling the additional data into our pipeline. Because the scripts grab EPA data directly from the CAMD FACT API, there are also some updates to the EPA data that happen as a result of re-running the crosswalks.

> One thing I'll flag for you is that in our most recent OGE release (https://github.com/singularity-energy/open-grid-emissions/releases/tag/v0.6.0) we actually did a lot of manual updates to the crosswalk as well (see the "Expanded and enhanced EPA-EIA crosswalking" section of the release notes). One major change is that we added start and end years to the crosswalk, because we found that historically these mappings change over time and sometimes even switch back and forth. So if you look at https://github.com/singularity-energy/open-grid-emissions/blob/main/src/oge/reference_tables/epa_eia_crosswalk_manual.csv you'll now see how these mappings change over time. Not sure if this is helpful or if you've already tackled this in your PR.

Thanks for flagging this! This is out of scope for this PR but is noted as a TODO in the code itself and tracked in #3691. I think we're definitely interested in pulling the manual mappings into our ETL, but there's a bit of design work to do here to handle time-variant mappings so I think the plan is to circle back to this in the future.
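To make the time-variant idea concrete (it is not implemented in this PR), annual crosswalk rows could be collapsed into start/end year ranges. This is only a sketch under assumed column names; `to_year_ranges` is hypothetical and is not how OGE or PUDL does it. A mapping that disappears and later returns produces two separate ranges.

```python
import pandas as pd

KEYS = ["plant_id_epa", "emissions_unit_id_epa", "generator_id"]


def to_year_ranges(annual: pd.DataFrame) -> pd.DataFrame:
    """Collapse annual mappings into runs of consecutive years."""
    df = (
        annual[KEYS + ["report_year"]]
        .drop_duplicates()
        .sort_values(KEYS + ["report_year"])
    )
    # A new run starts at the first year of a mapping or after a gap in years.
    gap = df.groupby(KEYS)["report_year"].diff()
    df["run"] = (gap != 1).cumsum()
    return (
        df.groupby(KEYS + ["run"], as_index=False)
        .agg(start_year=("report_year", "min"), end_year=("report_year", "max"))
        .drop(columns="run")
    )


# Hypothetical toy rows: G1 is mapped in 2019-2020, swapped for G2 in 2021,
# then mapped again in 2022.
annual = pd.DataFrame(
    [
        (1, "A", "G1", 2019),
        (1, "A", "G1", 2020),
        (1, "A", "G2", 2021),
        (1, "A", "G1", 2022),
    ],
    columns=KEYS + ["report_year"],
)
ranges = to_year_ranges(annual)
print(ranges)
```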

e-belfer added 2 commits February 10, 2025 17:39

Stash changes

c6f65e7

Integrate new data into crosswalk

e811255

e-belfer self-assigned this Feb 11, 2025

e-belfer and others added 2 commits February 11, 2025 16:31

Update release notes and table descriptions

a9e38fd

Merge branch 'main' into update-crosswalk

b27ef72

e-belfer requested a review from cmgosnell February 11, 2025 21:33

e-belfer commented Feb 11, 2025

cmgosnell approved these changes Feb 11, 2025

e-belfer added 2 commits February 11, 2025 17:35

Update release notes to drop jargon, clean up glue_assets

52bd18a

Merge branch 'update-crosswalk' of https://github.com/catalyst-cooper…

f1276a5

…ative/pudl into update-crosswalk

Update row counts for epacamd_eia test

938a103

e-belfer marked this pull request as ready for review February 12, 2025 15:25

e-belfer added this pull request to the merge queue Feb 12, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025

Update assertions to handle fast ETL

17f85d8

e-belfer enabled auto-merge February 12, 2025 19:02

Merge branch 'main' into update-crosswalk

cfe2db9

e-belfer added this pull request to the merge queue Feb 12, 2025

Merged via the queue into main with commit 40635ae Feb 12, 2025
19 checks passed

e-belfer deleted the update-crosswalk branch February 12, 2025 20:34

zaneselvans added the epacems Integration and analysis of the EPA CEMS dataset. label Feb 13, 2025
