Commit
Merge pull request #3170 from catalyst-cooperative/nightly-2023-12-18
Merge dev into main for 2023-12-18
Showing 113 changed files with 6,461 additions and 7,008 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,23 +47,81 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil | |
and databases and turns them into a unified resource. This allows users to spend more | ||
time on novel analysis and less time on data preparation. | ||
|
||
The project is focused on serving researchers, activists, journalists, policy makers, | ||
and small businesses that might not otherwise be able to afford access to this data | ||
from commercial sources and who may not have the time or expertise to do all the | ||
data processing themselves from scratch. | ||
|
||
We want to make this data accessible and easy to work with for as wide an audience as | ||
possible: anyone from grassroots youth climate organizers working with Google | ||
Sheets to university researchers with access to scalable cloud computing | ||
resources and everyone in between! | ||
|
||
PUDL comprises three core components: | ||
|
||
- **Raw Data Archives** | ||
|
||
- PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__ | ||
all the raw data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__ | ||
to ensure permanent, versioned access to the data. In the event that an agency | ||
changes how they publish data or deletes old files, the ETL will still have access | ||
to the original inputs. Each of the data inputs may have several different versions | ||
archived, and all are assigned a unique DOI and made available through the REST API. | ||
You can read more about the Raw Data Archives in the | ||
`docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#raw-data-archives>`__. | ||
- **ETL Pipeline** | ||
|
||
- The ETL pipeline (this repo) ingests the raw archives, cleans them, | ||
integrates them, and outputs them to a series of tables stored in SQLite Databases, | ||
Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL | ||
Python package is embedded with a set of DOIs to indicate which version of the | ||
raw inputs it is meant to process. This process helps ensure that the ETL and its | ||
outputs are replicable. You can read more about the ETL in the | ||
`docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#the-etl-process>`__. | ||
- **Data Warehouse** | ||
|
||
- The outputs from the ETL, sometimes called "PUDL outputs", | ||
are stored in a data warehouse as a collection of SQLite and Parquet files so that | ||
users can access the data without having to run any code. Learn more about how to | ||
access the data `here <https://catalystcoop-pudl.readthedocs.io/en/dev/data_access.html>`__. | ||
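As a rough illustration of what working with the SQLite outputs can look like, the sketch below queries a table with Python's standard-library ``sqlite3`` module. The in-memory database, the ``plants_demo`` table, and its columns are hypothetical stand-ins so that the example runs without downloading anything. They are not actual PUDL table or column names.

```python
import sqlite3


def largest_plants(conn: sqlite3.Connection, limit: int = 2) -> list[tuple[str, float]]:
    """Return the `limit` largest plants by capacity from the demo table."""
    query = (
        "SELECT plant_name, capacity_mw FROM plants_demo "
        "ORDER BY capacity_mw DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Hypothetical stand-in for a downloaded SQLite output file.
# With a real file you would call sqlite3.connect("/path/to/file.sqlite").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants_demo (plant_name TEXT, capacity_mw REAL)")
conn.executemany(
    "INSERT INTO plants_demo VALUES (?, ?)",
    [("Alpha", 120.0), ("Beta", 450.5), ("Gamma", 75.25)],
)
conn.commit()

top = largest_plants(conn)
print(top)
```

Against a real output you would swap in the path to the downloaded file and a table name taken from the PUDL documentation. The parameterized ``LIMIT ?`` keeps the query safe if the limit comes from user input.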
|
||
What data is available? | ||
----------------------- | ||
|
||
PUDL currently integrates data from: | ||
|
||
* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001 - 2022 | ||
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06 | ||
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001 - 2022 | ||
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001 - 2023-08 | ||
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995 - 2022 | ||
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021 | ||
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020 | ||
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010 | ||
* **EIA Form 860**: 2001-2022 | ||
- `Source Docs <https://www.eia.gov/electricity/data/eia860/>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia860.html>`__ | ||
* **EIA Form 860m**: 2023-06 | ||
- `Source Docs <https://www.eia.gov/electricity/data/eia860m/>`__ | ||
* **EIA Form 861**: 2001-2022 | ||
- `Source Docs <https://www.eia.gov/electricity/data/eia861/>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia861.html>`__ | ||
* **EIA Form 923**: 2001-2022 | ||
- `Source Docs <https://www.eia.gov/electricity/data/eia923/>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia923.html>`__ | ||
* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022 | ||
- `Source Docs <https://campd.epa.gov/>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/epacems.html>`__ | ||
* **FERC Form 1**: 1994-2021 | ||
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1.html>`__ | ||
* **FERC Form 714**: 2006-2020 | ||
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__ | ||
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc714.html>`__ | ||
* **FERC Form 2**: 2021 (raw only) | ||
- `Source Docs <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__ | ||
* **FERC Form 6**: 2021 (raw only) | ||
- `Source Docs <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__ | ||
* **FERC Form 60**: 2021 (raw only) | ||
- `Source Docs <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__ | ||
* **US Census Demographic Profile 1 Geodatabase**: 2010 | ||
- `Source Docs <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__ | ||
|
||
Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment | ||
Program <https://sloan.org/programs/research/energy-and-environment>`__, from | ||
2021 to 2024 we will be integrating the following data as well: | ||
2021 to 2024 we will be cleaning and integrating the following data as well: | ||
|
||
* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__ | ||
(The Annual Report of Natural Gas Supply and Disposition) | ||
|
@@ -73,90 +131,37 @@ Program <https://sloan.org/programs/research/energy-and-environment>`__, from | |
* `PHMSA Natural Gas Annual Report <https://www.phmsa.dot.gov/data-and-statistics/pipeline/gas-distribution-gas-gathering-gas-transmission-hazardous-liquids>`__ | ||
* Machine Readable Specifications of State Clean Energy Standards | ||
|
||
Who is PUDL for? | ||
---------------- | ||
|
||
The project is focused on serving researchers, activists, journalists, policy makers, | ||
and small businesses that might not otherwise be able to afford access to this data | ||
from commercial sources and who may not have the time or expertise to do all the | ||
data processing themselves from scratch. | ||
|
||
We want to make this data accessible and easy to work with for as wide an audience as | ||
possible: anyone from grassroots youth climate organizers working with Google | ||
Sheets to university researchers with access to scalable cloud computing | ||
resources and everyone in between! | ||
|
||
How do I access the data? | ||
------------------------- | ||
|
||
There are several ways to access PUDL outputs. For more details you'll want | ||
to check out `the complete documentation | ||
<https://catalystcoop-pudl.readthedocs.io>`__, but here's a quick overview: | ||
|
||
Datasette | ||
^^^^^^^^^ | ||
We publish a lot of the data on https://data.catalyst.coop using a tool called | ||
`Datasette <https://datasette.io>`__ that lets us wrap our databases in a relatively | ||
friendly web interface. You can browse and query the data, make simple charts and | ||
maps, and download portions of the data as CSV files or JSON so you can work with it | ||
locally. For a quick introduction to what you can do with the Datasette interface, | ||
check out `this 17 minute video <https://simonwillison.net/2021/Feb/7/video/>`__. | ||
|
||
This access mode is good for casual data explorers or anyone who just wants to grab a | ||
small subset of the data. It also lets you share links to a particular subset of the | ||
data and provides a REST API for querying the data from other applications. | ||
|
||
Docker + Jupyter | ||
^^^^^^^^^^^^^^^^ | ||
Want access to all the published data in bulk? If you're familiar with Python | ||
and `Jupyter Notebooks <https://jupyter.org/>`__ and are willing to install Docker you | ||
can: | ||
|
||
* `Download a PUDL data release <https://zenodo.org/record/3653158>`__ from | ||
CERN's `Zenodo <https://zenodo.org>`__ archiving service. | ||
* `Install Docker <https://docs.docker.com/get-docker/>`__ | ||
* Run the archived image using ``docker-compose up`` | ||
* Access the data via the resulting Jupyter Notebook server running on your machine. | ||
|
||
If you'd rather work with the PUDL `SQLite <https://sqlite.org>`__ Databases and | ||
`Apache Parquet <https://parquet.apache.org>`__ files directly, they are accessible | ||
within the same Zenodo archive. | ||
|
||
The `PUDL Examples repository <https://github.com/catalyst-cooperative/pudl-examples>`__ | ||
has more detailed instructions on how to work with the Zenodo data archive and Docker | ||
image. | ||
|
||
The PUDL Development Environment | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
If you're more familiar with the Python data science stack and are comfortable working | ||
with git, ``conda`` environments, and the Unix command line, then you can set up the | ||
whole PUDL Development Environment on your own computer. This will allow you to run the | ||
full data processing pipeline yourself, tweak the underlying source code, and (we hope!) | ||
make contributions back to the project. | ||
|
||
This is by far the most involved way to access the data and isn't recommended for | ||
most users. You should check out the `Development section <https://catalystcoop-pudl.readthedocs.io/en/latest/dev/dev_setup.html>`__ | ||
of the main `PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__ for more | ||
details. | ||
|
||
Nightly Data Builds | ||
^^^^^^^^^^^^^^^^^^^ | ||
If you are less concerned with reproducibility and want the freshest possible data, | ||
we automatically upload the outputs of our nightly builds to public S3 storage buckets | ||
as part of the `AWS Open Data Registry | ||
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__. This data is based on | ||
the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__ of PUDL, and | ||
is updated most weekday mornings. It is also the data used to populate Datasette. | ||
|
||
The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded | ||
directly via the web. See `Accessing Nightly Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__ | ||
for links to the individual SQLite, JSON, and Apache Parquet outputs. | ||
For details on how to access PUDL data, see the `data access documentation | ||
<https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html>`__. A quick | ||
summary: | ||
|
||
* `Datasette <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#-access-datasette>`__ | ||
provides browsable and queryable data from our nightly builds on the web: | ||
https://data.catalyst.coop | ||
* `Kaggle <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-kaggle>`__ | ||
provides easy Jupyter notebook access to the PUDL data, updated weekly: | ||
https://www.kaggle.com/datasets/catalystcooperative/pudl-project | ||
* `Zenodo <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-zenodo>`__ | ||
provides stable long-term access to our versioned data releases with a citeable DOI: | ||
https://doi.org/10.5281/zenodo.3653158 | ||
* `Nightly Data Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__ | ||
push their outputs to the AWS Open Data Registry: | ||
https://registry.opendata.aws/catalyst-cooperative-pudl/ | ||
See `the nightly build docs <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__ | ||
for direct download links. | ||
* `The PUDL Development Environment <https://catalystcoop-pudl.readthedocs.io/en/latest/dev/dev_setup.html>`__ | ||
lets you run the PUDL data processing pipeline locally. | ||
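For querying the Datasette deployment from other applications, a minimal sketch of building a JSON query URL is shown below. It relies on Datasette's general ``/<database>.json?sql=...`` URL pattern. The database name ``pudl`` and the trivial SQL statement are assumptions for illustration, and the helper ``datasette_sql_url`` is hypothetical rather than part of PUDL. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Base URL taken from the README; everything appended to it is illustrative.
DATASETTE_BASE = "https://data.catalyst.coop"


def datasette_sql_url(database: str, sql: str, shape: str = "array") -> str:
    """Build a Datasette JSON API URL for an arbitrary read-only SQL query."""
    params = urlencode({"sql": sql, "_shape": shape})
    return f"{DATASETTE_BASE}/{database}.json?{params}"


# "pudl" is an assumed database name; check the Datasette home page for
# the databases that are actually published.
url = datasette_sql_url("pudl", "select 1 as x")
print(url)
```

The resulting URL can be fetched with any HTTP client. ``_shape=array`` asks Datasette to return a plain JSON list of row objects, which is usually the easiest form to load into a dataframe.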
|
||
Contributing to PUDL | ||
-------------------- | ||
|
||
Find PUDL useful? Want to help make it better? There are lots of ways to help! | ||
|
||
* First, be sure to read our `Code of Conduct <https://catalystcoop-pudl.readthedocs.io/en/latest/code_of_conduct.html>`__. | ||
* Check out our `contribution guide <https://catalystcoop-pudl.readthedocs.io/en/latest/CONTRIBUTING.html>`__ | ||
including our `Code of Conduct <https://catalystcoop-pudl.readthedocs.io/en/latest/code_of_conduct.html>`__. | ||
* You can file a bug report, make a feature request, or ask questions in the | ||
`GitHub issue tracker <https://github.com/catalyst-cooperative/pudl/issues>`__. | ||
* Feel free to fork the project and make a pull request with new code, better | ||
|
@@ -165,8 +170,6 @@ Find PUDL useful? Want to help make it better? There are lots of ways to help! | |
to support our work liberating public energy data. | ||
* `Hire us to do some custom analysis <https://catalyst.coop/hire-catalyst/>`__ and | ||
allow us to integrate the resulting code into PUDL. | ||
* For more information check out the Contributing section of the | ||
`PUDL Documentation <https://catalystcoop-pudl.readthedocs.io>`__ | ||
|
||
Licensing | ||
--------- | ||
|
@@ -193,10 +196,15 @@ Contact Us | |
* Want to schedule a time to chat with us one-on-one about your PUDL use case, ideas | ||
for improvement, or get some personalized support? Join us for | ||
`Office Hours <https://calend.ly/catalyst-cooperative/pudl-office-hours>`__ | ||
* `Follow us here on GitHub <https://github.com/catalyst-cooperative/>`__ | ||
* Follow us on Mastodon: `@CatalystCoop@mastodon.energy <https://mastodon.energy/@CatalystCoop>`__ | ||
* Follow us on BlueSky: `@catalyst.coop <https://bsky.app/profile/catalyst.coop>`__ | ||
* `Follow us on LinkedIn <https://www.linkedin.com/company/catalyst-cooperative/>`__ | ||
* `Follow us on HuggingFace <https://huggingface.co/catalystcooperative>`__ | ||
* Follow us on Twitter: `@CatalystCoop <https://twitter.com/CatalystCoop>`__ | ||
* `Follow us on Kaggle <https://www.kaggle.com/catalystcooperative/>`__ | ||
* More info on our website: https://catalyst.coop | ||
* To hire us to provide customized data | ||
extraction and analysis, you can email the maintainers: | ||
* Email us if you'd like to hire us to provide customized data extraction and analysis: | ||
`[email protected] <mailto:[email protected]>`__ | ||
|
||
About Catalyst Cooperative | ||
|