Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource field description cleanup #3283

Merged
merged 20 commits into from
Jan 31, 2024
Merged

Conversation

aesharpe
Copy link
Member

@aesharpe aesharpe commented Jan 24, 2024

This PR addresses the fixes laid out in #3224

  • Remove unused fields
  • Add descriptions to all fields
  • Make descriptions mandatory for Field and Resource instances
  • Update unit tests to reflect mandatory descriptions
  • Update the docstring examples to reflect mandatory descriptions
  • Remove the x-fail from the test_defined_fields_are_used test so we make sure all fields are being used
  • Add a new alembic migration for altering the metadata

Notes on unused fields that were removed:

(I went through and checked when each description-less field was last edited to see what might have happened)

  • active no record of addition
  • country changed to owner_country out_eia860__yearly_ownership table from the ownership_eia860 table
  • credits_or_adjustments no record of addition
  • delivery_customers no record of addition
  • depreciation_amortization_value from f1_dacs_epda table - I think changed to "dollar_value" in the out_ferc1__yearly_depreciation_summary_sched336 table
  • electric_plant no record of addition
  • energy_source no record of addition
  • environmental_equipment_name from the raw_eia923__emissions_control table which hasn't been transformed yet
  • expense changed to dollar_value in the core_ferc1__yearly_operating_expenses_sched320 table
  • fuel_transportation_mode no record of addition
  • future_plant no record of addition
  • income changed to dollar_value in the core_ferc1__yearly_income_statements_sched114 table
  • is_total used to be used to identify total rows in ferc1 tables
  • leased_plant no record of addition
  • line_id no record of addition
  • month no record of addition
  • notes was used to show any notes extracted from ferc1 tables
  • operator_name not exactly sure, maybe dropped from eia860 ownership out_ table?
  • operator_state not exactly sure, maybe dropped from eia860 ownership out_ table?
  • operator_utility_id_eia not exactly sure, maybe dropped from eia860 ownership out_ table?
  • other no record of addition
  • other_total no record of addition
  • owner_name owner_name
  • peak_demand_summer_mw old ferc714 table description_pa_ferc714
  • peak_demand_winter_mw old ferc714 table description_pa_ferc714
  • period_nox from the raw_eia860__boiler_info table that gets pulled into the _core_eia860__boilers table but I guess the columns get dropped
  • period_particulate from the raw_eia860__boiler_info table that gets pulled into the _core_eia860__boilers table but I guess the columns get dropped idk if it's intentional?
  • period_so2 from the raw_eia860__boiler_info table that gets pulled into the _core_eia860__boilers table but I guess the columns get dropped idk if it's intentional?
  • prime_mover no record of addition
  • retail_sales no record of addition`
  • sales_for_resale no record of addition
  • status no record of addition
  • storage_capacity_mw no record of addition
  • storage_customers no record of addition
  • total no record of addition
  • total_meters no record of addition
  • transmission no record of addition
  • unbundled_revenues no record of addition
  • utility_attn no record of addition
  • utility_pobox no record of addition
  • virtual_capacity_mw no record of addition, probably from pre-normalized EIA861 data
  • virtual_customers no record of addition, probably from pre-normalized EIA861 data

Suggested fixes based on going through lots of field metadata:

High priority fixes:

  • business_model this is a weird column in core_eia861__yearly_sales which contains a letter I'm not sure if we've coded correctly.
  • fuel_pct kind of funky carve out.
  • All peak_reduction columns should be in MW not MWH! (this means fixing the actual column name which is a bit of a pain

Low priority fixes (probably out of scope):

  • demand_mwh Could be more specific: add "hourly_demand"?
  • incremental_energy_savings_mwh Appears to align with old column "energy_efficiency_incremental_effects_mwh" - maybe we want them to have the same name? (from 2012 split from DSM table to EE/DR tables).
  • incremental_peak_reduction_mw Appears to align with old column "energy_efficiency_incremental_actual_peak_reduction_mw" - maybe we want them to have the same name? (from 2012 split from DSM table to EE/DR tables).
  • respondent_type Maybe this should be more specific like "respondent_type_ferc714"
  • utc_datetime There's another field similar to this - investigate whether they are the same thing or not. On a similar note: it might be worth it to go through some of the fields and make sure there are no duplicates. I feel like there are a lot of categorical fields like "respondent type" etc. that are very similar across data sources. We should either make them more specific by adding data source suffixes or make them more generic and apply to all tables. There's definitely an opportunity to go through some of the fields and prune a bit, but it's not a small task.
  • utility_owned_capacity_mw Could make this col name more specific to non-net metered capacity

Suggested fixes based on going through lots of resource metadata:

Low priority fixes (probably out of scope)

  • Come up with standard format for resource descriptions (at least within a given source) so there is some standardization.
  • It looks like we currently don't integrate the EIA861 dispersed_generation tables (part of the raw distributed_generation spreadsheets). Is that something we want to do? (distributed = grid connected, dispersed = not grid connected).

* Remove fields that are not in any PUDL tables.
* Add descriptions to fields that do not have a description.
* Made field descriptions manditory.
* Update unit tests to have description fields when making test Resources
* Fix docstring examples so the builds don't fail
* Add a description to the RESOURCE_METADATA for tables missing a description.
* Make the description variable necessary for Resource() instances.
* Update unit tests to have description fields for Resource() instances.
* Fix docstring examples so the builds don't fail.
* Tack on the alembic update that should have gone with the previous commit.
* Remove the xfail for the test_defined_fields_are_used() function.
* Remove three fields that I missed before that aren't being used.
* Update the alembic file for those three removed fields.
@aesharpe aesharpe requested a review from zaneselvans January 24, 2024 09:29
@aesharpe aesharpe self-assigned this Jan 24, 2024
@aesharpe aesharpe linked an issue Jan 24, 2024 that may be closed by this pull request
Copy link

codecov bot commented Jan 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2615631) 92.7% compared to head (19080b8) 92.7%.
Report is 23 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3283     +/-   ##
=======================================
- Coverage   92.7%   92.7%   -0.0%     
=======================================
  Files        145     144      -1     
  Lines      13100   13091      -9     
=======================================
- Hits       12142   12133      -9     
  Misses       958     958             

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aesharpe
Copy link
Member Author

One question I have is how or whether we will create descriptions for the output tables (all tables with the out prefix). Right now it seems like the only table metadata we have exists in for core normalized tables.

@zaneselvans
Copy link
Member

I think given that we intend the out_ tables to the be the ones that are most commonly used / user friendly, we'll want to make sure they all have descriptions. Though these could be very similar or even identical to the descriptions of the corresponding core_ table if they're conceptually the same.

@zaneselvans zaneselvans added eia861 Anything having to do with EIA Form 861 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. docs Documentation for users and contributors. labels Jan 25, 2024
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly the resource descriptions look good! I made a few suggestions.

We should remove the xfail from the two unit tests that check whether all fields and resources have descriptions (they're currently XPASS in the unit tests, so seems like they should pass) or potentially remove them entirely since we're now requiring descriptions.

It does seem like there are some table name clarifications / simplifications that we would be good to do here while we're still cleaning up old names. @bendnorman what do you think about getting this into the Great Rename? The old EIA-861 table names were kind of just pulled from the EIA-861 forms / spreadsheets without too much intention since they were "interim" 😆.

@aesharpe would you be willing to put your detailed per-field "suggested fixes" in context in the fields.py file as a self-review? I don't have enough context to understand what's going on with all those fields and would end up trying to transfer all the comments myself to see what's going on, but you've got more context on them.

I have not yet reviewed fields.py

test/unit/metadata_test.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
@zaneselvans zaneselvans marked this pull request as draft January 25, 2024 03:01
@zaneselvans
Copy link
Member

(converting to draft so the ci-integration doesn't cost us money while there's some additional back-and-forth)

@aesharpe
Copy link
Member Author

We should remove the xfail from the two unit tests that check whether all fields and resources have descriptions (they're currently XPASS in the unit tests, so seems like they should pass) or potentially remove them entirely since we're now requiring descriptions.

I think it makes sense to remove them because of the required description field.

…cription fields because now they are required (#3283).
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
@zaneselvans
Copy link
Member

@aesharpe I think there's only a handful of things left that are unclear and that we want to try and address in this PR. I made some small wording suggestions / typo fixes, fixed the one truly wrong column name (MWh => MW) and gathered together the issues we're punting until later in #3322.

…mbine the energy_displaced_mwh column with the sold_to_utility_mwh column. The former only shows up in years 2007-2009 and upon further inspection seems analogous with the latter. Removed the former from the schema and updated the column map to point the old energy_displaced_mwh columns at the sold_to_utility_mwh column.
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel pretty confident about the "total energy displaced" and "sold to utility" re-org then I think this is good to go assuming all the tests pass.

@zaneselvans
Copy link
Member

I am playing "Human Merge Queue" and just kicked off the CI on #3294. Once it finishes and merges in (or fails) I'll kick it off here too and get this PR merged in.

@zaneselvans zaneselvans marked this pull request as ready for review January 31, 2024 00:43
@zaneselvans zaneselvans enabled auto-merge (squash) January 31, 2024 00:45
@zaneselvans zaneselvans merged commit 75ba576 into main Jan 31, 2024
12 checks passed
@zaneselvans zaneselvans deleted the resource-field-description-cleanup branch January 31, 2024 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation for users and contributors. eia861 Anything having to do with EIA Form 861 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add resource and field descriptions; purge unused fields
3 participants