From 949106eb5268e0b8e5c335e37892c8b49686fc60 Mon Sep 17 00:00:00 2001
From: Austen Sharpe
Date: Mon, 2 Oct 2023 11:44:45 -0300
Subject: [PATCH 1/2] Restructure intro docs page and README to accommodate
 data warehouse info. Add three components of PUDL description

---
 README.rst                      | 112 +++++++++++++++++++++++---------
 docs/dev/naming_conventions.rst | 106 ++++++++++++++++--------------
 docs/intro.rst                  | 105 ++++++++++++------------------
 3 files changed, 182 insertions(+), 141 deletions(-)

diff --git a/README.rst b/README.rst
index df5edcda3e..7ea885131c 100644
--- a/README.rst
+++ b/README.rst
@@ -52,30 +52,83 @@ What is PUDL?
-------------

The `PUDL `__ Project is an open source data processing
-pipeline that makes US energy data easier to access and use programmatically.
+pipeline created by `Catalyst Cooperative
+`__ that cleans, integrates, and standardizes some of the most
+widely used public energy datasets in the US. Hundreds of gigabytes of valuable data
+are published by US government agencies, but they are often difficult to work with.
+PUDL takes the original spreadsheets, CSV files, and databases and turns them into a
+unified resource.
+
+PUDL is composed of three core components:
+
+- **Raw Data Archives**
+
+  - We `archive `__ all the raw
+    data inputs on `Zenodo `__
+    to ensure permanent, versioned access to the data. In the event that an agency
+    changes how it publishes data or deletes old files, the ETL will still have access
+    to the original inputs. Each of the data inputs may have several different versions
+    archived, and all are assigned a unique DOI and made available through the REST API.
+- **ETL Pipeline**
+
+  - The ETL pipeline (this repo) ingests the raw archives, cleans them, integrates
+    them, and outputs them to a series of tables stored in SQLite databases, Parquet
+    files, and pickle files (the Data Warehouse). Each release of the PUDL Python
+    package is embedded with a set of DOIs to indicate which version of the raw
+    inputs it is meant to process. This helps ensure that the ETL and its
+    outputs are replicable.
+- **Data Warehouse**
+
+  - The outputs from the ETL, sometimes called "PUDL outputs", are stored in a data
+    warehouse so that users can access the data without having to run any code. The
+    majority of the outputs are stored in ``pudl.sqlite``, but CEMS data are stored
+    in separate Parquet files due to their large size. The warehouse also contains
+    pickled interim assets from the ETL process, should users want to access the data
+    at various stages of the cleaning process, and SQLite databases for the raw FERC
+    inputs.
+
+For more information about each of the components, read our
+`documentation `__.
+

-Hundreds of gigabytes of valuable data are published by US government agencies, but
-it's often difficult to work with. PUDL takes the original spreadsheets, CSV files,
-and databases and turns them into a unified resource. This allows users to spend more
-time on novel analysis and less time on data preparation.

What data is available?
-----------------------
PUDL currently integrates data from:

-* `EIA Form 860 `__: 2001-2022
-* `EIA Form 860m `__: 2023-06
-* `EIA Form 861 `__: 2001-2022
-* `EIA Form 923 `__: 2001-2022
-* `EPA Continuous Emissions Monitoring System (CEMS) `__: 1995-2022
-* `FERC Form 1 `__: 1994-2021
-* `FERC Form 714 `__: 2006-2020
-* `US Census Demographic Profile 1 Geodatabase `__: 2010
+* **EIA Form 860**: 2001-2022
+  - `Source `__
+  - `PUDL Docs `__
+* **EIA Form 860m**: 2023-06
+  - `Source `__
+* **EIA Form 861**: 2001-2022
+  - `Source `__
+  - `PUDL Docs `__
+* **EIA Form 923**: 2001-2022
+  - `Source `__
+  - `PUDL Docs `__
+* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
+  - `Source `__
+  - `PUDL Docs `__
+* **FERC Form 1**: 1994-2021
+  - `Source `__
+  - `PUDL Docs `__
+* **FERC Form 714**: 2006-2020
+  - `Source `__
+  - `PUDL Docs `__
+* **FERC Form 2**: 2021 (raw only)
+  - `Source `__
+* **FERC Form 6**: 2021 (raw only)
+  - `Source `__
+* **FERC Form 60**: 2021 (raw only)
+  - `Source `__
+* **US Census Demographic Profile 1 Geodatabase**: 2010
+  - `Source `__

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program `__, from
-2021 to 2024 we will be integrating the following data as well:
+2021 to 2024 we will be cleaning and integrating the following data as well:

* `EIA Form 176 `__
  (The Annual Report of Natural Gas Supply and Disposition)
@@ -83,7 +136,6 @@ Program `__, from
* `FERC Form 2 `__
  (Annual Report of Major Natural Gas Companies)
* `PHMSA Natural Gas Annual Report `__
-* Machine Readable Specifications of State Clean Energy Standards

Who is PUDL for?
----------------

resources and everyone in between!

How do I access the data?
-------------------------
-There are several ways to access PUDL outputs. For more details you'll want
-to check out `the complete documentation
+There are several ways to access the information in the PUDL Data Warehouse. For more
+details you'll want to check out `the complete documentation
`__, but here's a quick overview:

Datasette
^^^^^^^^^

This access mode is good for casual data explorers or anyone who just wants to grab a
small subset of the data. It also lets you share links to a particular subset of the
data and provides a REST API for querying the data from other applications.

+Nightly Data Builds
+^^^^^^^^^^^^^^^^^^^
+We automatically run the ETL every weeknight and upload the outputs to public S3
+storage buckets as part of the `AWS Open Data Registry
+`__. This data is based on
+the `dev branch `__ of PUDL, and
+is what we use to populate Datasette. Use this data access method if you want to
+download the SQLite files directly.
+
+You can download the outputs using the AWS CLI, the S3 API, or directly via the web.
+See `Accessing Nightly Builds `__
+for links to the individual SQLite, JSON, and Apache Parquet outputs.
+
Docker + Jupyter
^^^^^^^^^^^^^^^^
Want access to all the published data in bulk? If you're familiar with Python

most users. You should check out the `Development section `__
for more details.

-Nightly Data Builds
-^^^^^^^^^^^^^^^^^^^
-If you are less concerned with reproducibility and want the freshest possible data
-we automatically upload the outputs of our nightly builds to public S3 storage buckets
-as part of the `AWS Open Data Registry
-`__. This data is based on
-the `dev branch `__, of PUDL, and
-is updated most weekday mornings. It is also the data used to populate Datasette.
-
-The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
-directly via the web. See `Accessing Nightly Builds `__
-for links to the individual SQLite, JSON, and Apache Parquet outputs.
-
Contributing to PUDL
--------------------
Find PUDL useful? Want to help make it better? There are lots of ways to help!
diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 5becde4b9b..91f9ef3fec 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -15,9 +15,9 @@ Asset Naming Conventions
PUDL's data processing is divided into three layers of Dagster assets: Raw, Core and
Output. Dagster assets are the core unit of computation in PUDL. The outputs of
assets can be persisted to any type of storage though PUDL outputs are typically
-tables in a SQLite database, parquet files or pickle files. The asset name is used
-for the table or parquet file name. Asset names should generally follow this naming
-convention:
+tables in a SQLite database, parquet files or pickle files (read more about this here:
+:doc:`../intro`). The asset name is used for the table or parquet file name. Asset
+names should generally follow this naming convention:

.. code-block::

@@ -33,9 +33,11 @@ convention:

Raw layer
^^^^^^^^^
-* This layer contains assets that extract data from spreadsheets and databases
-  and are persisted as pickle files.
-* Naming convention: ``raw_{source}__{asset_name}``
+This layer contains assets that extract data from spreadsheets and databases
+and are persisted as pickle files.
+
+Naming convention: ``raw_{source}__{asset_name}``
+
* ``asset_name`` is typically copied from the source data.
* ``asset_type`` is not included in this layer because the data modeling does not
  yet conform to PUDL standards. Raw assets are typically just copies of the

Core layer
^^^^^^^^^^
-* This layer contains assets that typically break denormalized raw assets into
-  well-modeled tables that serve as building blocks for downstream wide tables
-  and analyses. Well-modeled means tables in the database have logical
-  primary keys, foreign keys, datatypes and generally follow
-  :ref:`Tidy Data standards `. Assets in this layer create
-  consistent categorical variables, decuplicate and impute data.
-  These assets are typically stored in parquet files or tables in a database.
-* Naming convention: ``core_{source}__{asset_type}_{asset_name}``
-* ``asset_type`` describes how the asset is modeled and its role in PUDL's
-  collection of core assets. There are a handful of table types in this layer:
-
+This layer contains assets that typically break denormalized raw assets into
+well-modeled tables that serve as building blocks for downstream wide tables
+and analyses. Well-modeled means tables in the database have logical
+primary keys, foreign keys, datatypes and generally follow
+:ref:`Tidy Data standards `. Assets in this layer create
+consistent categorical variables, deduplicate and impute data.
+These assets are typically stored in parquet files or tables in a database.
+
+Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+
+  * ``asset_type`` describes how the asset is modeled and its role in PUDL's
+    collection of core assets. There are a handful of table types in this layer:

 * ``assn``: Association tables provide connections between entities. This data can
   be manually compiled or extracted from data sources. Examples:

* ``core_ferc714__hourly_demand_pa``,
* ``core_ferc1__yearly_plant_in_service``.

-Output layer
-^^^^^^^^^^^^
-* Assets in this layer use the well modeled tables from the Core layer to construct
-  wide and complete tables suitable for users to perform analysis on. This layer
-  contains intermediate tables that bridge the core and user-facing tables.
-* Naming convention: ``out_{source}__{asset_type}_{asset_name}``
-* ``source`` is optional in this layer because there can be assets that join data from
-  multiple sources.
-* ``asset_type`` is also optional. It will likely describe the frequency at which
-  the data is reported (annual/monthly/hourly).

-Intermediate Assets
-^^^^^^^^^^^^^^^^^^^
-* Intermediate assets are logical steps towards a final well-modeled core or
-  user-facing output asset. These assets are not intended to be persisted in the
-  database or accessible to the user. These assets are denoted by a preceding
-  underscore, like a private python method. For example, the intermediate asset
-  ``_core_eia860__plants`` is a logical step towards the
-  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
-  ``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
-  asset but still contains duplicate plant entities. The computation intensive
-  harvesting process deduplicates ``_core_eia860__plants`` and outputs the
-  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets which
-  follow Tiny Data standards.
-* Limit the number of intermediate assets to avoid an extremely
-  cluttered DAG. It is appropriate to create an intermediate asset when:
+Core Layer (Intermediate Assets)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Intermediate assets are logical steps towards a final well-modeled core or
+user-facing output asset. These assets are not intended to be persisted in the
+database or accessible to the user. These assets are denoted by a preceding
+underscore, like a private Python method. For example, the intermediate asset
+``_core_eia860__plants`` is a logical step towards the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
+``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
+asset but still contains duplicate plant entities. The computationally intensive
+harvesting process deduplicates ``_core_eia860__plants`` and outputs the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets, which
+follow Tidy Data standards.
+
+Limit the number of intermediate assets to avoid an extremely
+cluttered DAG. It is appropriate to create an intermediate asset when:

* there is a short and long running portion of a process. It is convenient to separate
  the long and short-running processing portions into separate assets so debugging the

example, the pre harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
are frequently inspected when new years of data are added.

+Output layer
+^^^^^^^^^^^^
+This layer uses assets in the Core layer to construct wide and complete tables
+suitable for users to perform analysis on. This layer can contain intermediate
+tables that bridge the core and user-facing tables.
+
+Naming convention: ``out_{source}__{asset_type}_{asset_name}``
+
+* ``source`` is optional in this layer because there can be assets that join data from
+  multiple sources.
+* ``asset_type`` is also optional. It will likely describe the frequency at which
+  the data is reported (annual/monthly/hourly).
+
Columns and Field Names
-^^^^^^^^^^^^^^^^^^^^^^^
+------------------------------
+
If two columns in different tables record the same quantity in the same units,
give them the same name. That way if they end up in the same dataframe for
comparison it's easy to automatically rename them with suffixes indicating
where they came from. For example, net electricity generation is reported to
-both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
-<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
-each of those data sources. Similarly, give non-comparable quantities reported
-in different data sources **different** column names. This helps make it clear
-that the quantities are actually different.
+both :doc:`FERC Form 1 <../data_sources/ferc1>` and
+:doc:`EIA 923 <../data_sources/eia923>`, so we've named columns ``net_generation_mwh``
+in each of those data sources. Similarly, give non-comparable quantities reported in
+different data sources **different** column names. This helps make it clear that the
+quantities are actually different.

* ``total`` should come at the beginning of the name
  (e.g. ``total_expns_production``)

diff --git a/docs/intro.rst b/docs/intro.rst
index 7bf02258a8..0f750596d9 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -1,43 +1,15 @@
=======================================================================================
-Introduction
+What is PUDL?
=======================================================================================

-PUDL is a data processing pipeline created by `Catalyst Cooperative
-`__ that cleans, integrates, and standardizes some of the most
-widely used public energy datasets in the US. The data serve researchers, activists,
-journalists, and policy makers that might not have the technical expertise to access it
-in its raw form, the time to clean and prepare the data for bulk analysis, or the means
-to purchase it from existing commercial providers.
+Welcome to the Public Utility Data Liberation Project (PUDL)! Our README explains that
+PUDL has three core components:

---------------------------------------------------------------------------------------
-Available Data
---------------------------------------------------------------------------------------
-
-We focus primarily on poorly curated data published by the US government in
-semi-structured but machine readable formats.
For details on exactly what data is
-available from these data sources and what state it is in, see the individual
-pages for each source:
-
-* :doc:`data_sources/eia860`
-* :doc:`data_sources/eia861`
-* :doc:`data_sources/eia923`
-* :doc:`data_sources/epacems`
-* :doc:`data_sources/ferc1`
-* :doc:`data_sources/ferc714`
+
+- **Raw Data Archives** (raw, versioned inputs)
+- **ETL Pipeline** (code to process, clean, and organize the raw inputs)
+- **Data Warehouse** (location where ETL outputs, both interim and final, are stored)

-We also publish SQLite databases containing relatively pristine versions of our more
-difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and new XBRL
-data (2021+) published by FERC:
-
-* `FERC Form 1 (DBF) `__
-* `FERC Form 1 (XBRL) `__
-* `FERC Form 2 (XBRL) `__
-* `FERC Form 6 (XBRL) `__
-* `FERC Form 60 (XBRL) `__
-* `FERC Form 714 (XBRL) `__
-
-To get started using PUDL data, visit our :doc:`data_access` page, or continue reading
-to learn more about the PUDL data processing pipeline.
+Let's dig into each of these...

.. _raw-data-archive:

@@ -74,13 +46,43 @@ needed and organize them in a local :doc:`datastore `.

.. _etl-process:

---------------------------------------------------------------------------------------
-The Data Warehouse Design
+The ETL Pipeline
+---------------------------------------------------------------------------------------
+
+The ETL pipeline (this repository) ingests the raw archives and cleans, integrates,
+and standardizes the data using a series of Dagster assets. The pipeline mirrors the
+structure of the data warehouse described below: data moves through Raw, Core, and
+Output layers, and validation tests are run before each data release is published.
+
+Data Validation
+^^^^^^^^^^^^^^^
+
+We have a growing collection of data validation test cases that we run before
+publishing a data release to try to avoid publishing data with known issues. Most of
+these validations are described in the :mod:`pudl.validate` module. They check things
+like:
+
+* The heat content of various fuel types are within expected bounds.
+* Coal ash, moisture, mercury, sulfur, etc. content are within expected bounds.
+* Generator heat rates and capacity factors are realistic for the type of prime mover
+  being reported.
+
+Some data validations are currently only specified within our test suite, including:
+
+* The expected number of records within each table
+* The fact that there are no entirely N/A columns
+
+A variety of database integrity checks are also run either during the ETL process or
+when the data is loaded into SQLite.
+
+See our :doc:`dev/testing` documentation for more information.
+
+---------------------------------------------------------------------------------------
+The Data Warehouse
---------------------------------------------------------------------------------------
-PUDL's data processing produces a data warehouse that can be used for analytics.
-The processing happens within Dagster assets that are persisted to storage,
-typically pickle, parquet or SQLite files. The raw data moves through three
-layers of the data warehouse.
+The Data Warehouse contains all the cleaned data outputs and interim outputs from the
+ETL pipeline.
+
+Data passing through the ETL pipeline moves through three distinct phases or "layers",
+each of which is described below.

Raw Layer
^^^^^^^^^
@@ -182,26 +184,3 @@ integrate more analytical outputs into the library over time.
   Python, Pandas, and NumPy.

..
_test-and-validate: - ---------------------------------------------------------------------------------------- -Data Validation ---------------------------------------------------------------------------------------- -We have a growing collection of data validation test cases that we run before -publishing a data release to try and avoid publishing data with known issues. Most of -these validations are described in the :mod:`pudl.validate` module. They check things -like: - -* The heat content of various fuel types are within expected bounds. -* Coal ash, moisture, mercury, sulfur etc. content are within expected bounds -* Generator heat rates and capacity factors are realistic for the type of prime mover - being reported. - -Some data validations are currently only specified within our test suite, including: - -* The expected number of records within each table -* The fact that there are no entirely N/A columns - -A variety of database integrity checks are also run either during the ETL process or -when the data is loaded into SQLite. - -See our :doc:`dev/testing` documentation for more information. From c723a412e6d459c487eb7d29de8d8bd7d7829c8c Mon Sep 17 00:00:00 2001 From: Austen Sharpe Date: Wed, 4 Oct 2023 12:17:56 -0300 Subject: [PATCH 2/2] Update plant_function values in csv to match those in the table_dimensions_ferc1 table --- docs/dev/naming_conventions.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst index 91f9ef3fec..5e298d550d 100644 --- a/docs/dev/naming_conventions.rst +++ b/docs/dev/naming_conventions.rst @@ -88,8 +88,8 @@ Naming convention: ``core_{source}__{asset_type}_{asset_name}`` * ``core_ferc1__yearly_plant_in_service``. -Core Layer (Intermediate Assets) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Intermediate Assets +^^^^^^^^^^^^^^^^^^^ Intermediate assets are logical steps towards a final well-modeled core or user-facing output asset. These assets are not intended to be persisted in the database or accessible to the user. These assets are denoted by a preceding
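As a concrete companion to the data access documentation touched by these patches, the
following is a minimal sketch of how a locally downloaded copy of ``pudl.sqlite`` (for
example, one of the nightly build outputs that back Datasette) might be explored. It
assumes only a Python environment with ``pandas`` installed, and it discovers table
names at runtime rather than assuming any particular table exists:

.. code-block:: python

    import sqlite3

    import pandas as pd

    # Open a locally downloaded copy of the PUDL data warehouse database.
    conn = sqlite3.connect("pudl.sqlite")

    # List every table in the warehouse. Table names follow the
    # {layer}_{source}__{asset_type}_{asset_name} convention described in the
    # asset naming conventions documentation.
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
    )
    print(tables["name"].tolist())

    # Pull one of the tables into a DataFrame for analysis.
    first_table = tables["name"].iloc[0]
    df = pd.read_sql_query(f"SELECT * FROM {first_table} LIMIT 10", conn)
    print(df)

    conn.close()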