Add naming new naming convention to docs #2874

Merged
merged 14 commits into from
Nov 10, 2023
Changes from 2 commits
6 changes: 6 additions & 0 deletions docs/data_access.rst
@@ -8,6 +8,12 @@ PUDL data, so if you have a suggestion please `open a GitHub issue
<https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question you
can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.

PUDL's primary data output is the ``pudl.sqlite`` database. It contains a collection
of tables that follow :ref:`PUDL's asset naming convention <asset-naming>`. Tables
with the ``core_`` prefix are normalized tables that serve as building blocks for the
more denormalized and easy to work with ``output_`` tables. **We recommend only working
with ``output_`` tables.**

Member

I think we should consider the fact that many users may not know what normalized and denormalized data means in this context. It might make sense to get rid of the sentence

Tables with the ``core_`` prefix are normalized tables that serve as building blocks for the more denormalized and easy to work with ``output_`` tables.

and just say

We recommend working with tables with the ``output_`` prefix as these tables contain the most complete data. For more information about the different types of tables, read through the naming conventions.

Or something like that?

.. _access-modes:

---------------------------------------------------------------------------------------
49 changes: 2 additions & 47 deletions docs/dev/data_guidelines.rst
@@ -64,6 +64,8 @@ Examples of Unacceptable Changes
fuel heat content and net electricity generation. The heat rate would
be a derived value and not part of the original data.

.. _tidy-data:

-------------------------------------------------------------------------------
Make Tidy Data
-------------------------------------------------------------------------------
@@ -117,24 +119,6 @@ that M/Mega is a million in SI. And a `BTU
energy required to raise the temperature of one *avoirdupois pound* of water
by 1 degree *Fahrenheit*?! What century even is this?).

-------------------------------------------------------------------------------
Silo the ETL Process
-------------------------------------------------------------------------------
It should be possible to run the ETL process on each data source independently
and with any combination of data sources included. This allows users to include
only the data they need. In some cases, like the :doc:`EIA 860
<../data_sources/eia860>` and :doc:`EIA 923 <../data_sources/eia923>` data, two
data sources may be so intertwined that keeping them separate doesn't really
make sense. This should be the exception, however, not the rule.

-------------------------------------------------------------------------------
Separate Data from Glue
-------------------------------------------------------------------------------
The glue that relates different data sources to each other should be applied
after or alongside the ETL process and not as a mandatory part of ETL. This
makes it easy to pull individual data sources in and work with them even when
the glue isn't working or doesn't yet exist.

-------------------------------------------------------------------------------
Partition Big Data
-------------------------------------------------------------------------------
@@ -146,35 +130,6 @@ them to pull in only certain years, certain states, or other sensible partitions
data so that they don’t run out of memory or disk space or have to wait hours while data
they don't need is being processed.

-------------------------------------------------------------------------------
Naming Conventions
-------------------------------------------------------------------------------
*There are only two hard problems in computer science: caching,
naming things, and off-by-one errors.*

Use Consistent Names
^^^^^^^^^^^^^^^^^^^^
If two columns in different tables record the same quantity in the same units,
give them the same name. That way if they end up in the same dataframe for
comparison it's easy to automatically rename them with suffixes indicating
where they came from. For example, net electricity generation is reported to
both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
each of those data sources. Similarly, give non-comparable quantities reported
in different data sources **different** column names. This helps make it clear
that the quantities are actually different.

Follow Existing Conventions
^^^^^^^^^^^^^^^^^^^^^^^^^^^
We are trying to use consistent naming conventions for the data tables,
columns, data sources, and functions. Generally speaking PUDL is a collection
of subpackages organized by purpose (extract, transform, load, analysis,
output, datastore…), containing a module for each data source. Each data source
has a short name that is used everywhere throughout the project and is composed of
the reporting agency and the form number or another identifying abbreviation:
``ferc1``, ``epacems``, ``eia923``, ``eia861``, etc. See the :doc:`naming
conventions <naming_conventions>` document for more details.

-------------------------------------------------------------------------------
Complete, Continuous Time Series
-------------------------------------------------------------------------------
201 changes: 143 additions & 58 deletions docs/dev/naming_conventions.rst
@@ -1,6 +1,148 @@
===============================================================================
Naming Conventions
===============================================================================
*There are only two hard problems in computer science: caching,
naming things, and off-by-one errors.*

We try to use consistent naming conventions for the data tables, data assets,
columns, data sources, and functions.

.. _asset-naming:

Asset Naming Conventions
---------------------------------------------------

PUDL's data processing is divided into three layers of dagster assets: Raw, Core
and Output. Asset names should generally follow this naming convention:

.. code-block::

    {layer}_{source}__{asset_type}_{asset_name}

* ``layer`` is the processing layer of the asset. Acceptable values are:
``raw``, ``core`` and ``out``. ``layer`` is required for all assets in all layers.
* ``source`` is an abbreviation of the original source of the data. For example,
``eia860``, ``ferc1`` and ``epacems``.
* ``asset_type`` describes how the asset is modeled.
* ``asset_name`` should describe the entity, categorical code type, or measurement of
the asset.
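
The pattern above can be sketched as a small parser. This is a minimal illustration, not PUDL code: ``parse_asset_name``, ``AssetNameParts`` and the regular expression are hypothetical names, and the sketch assumes a fully specified name with all four parts present, so it would need relaxing for layers where a part is omitted.

```python
import re
from typing import NamedTuple

# Hypothetical pattern (an assumption, not taken from PUDL) for names of the
# form {layer}_{source}__{asset_type}_{asset_name}.
ASSET_NAME_PATTERN = re.compile(
    r"^(?P<layer>raw|core|out)_"
    r"(?P<source>[a-z0-9]+)__"
    r"(?P<asset_type>[a-z]+)_"
    r"(?P<asset_name>[a-z0-9_]+)$"
)


class AssetNameParts(NamedTuple):
    layer: str
    source: str
    asset_type: str
    asset_name: str


def parse_asset_name(name: str) -> AssetNameParts:
    """Split a fully specified asset name into its four components."""
    match = ASSET_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the asset naming convention")
    return AssetNameParts(**match.groupdict())


parts = parse_asset_name("core_eia860__scd_generators")
print(parts.layer, parts.source, parts.asset_type, parts.asset_name)
```

Splitting on the double underscore first keeps the source abbreviation separate from the rest of the name, which is why the convention uses ``__`` as the divider.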

Raw layer
^^^^^^^^^
* This layer contains assets that extract data from spreadsheets and databases
and are persisted as pickle files.
* Naming convention: ``raw_{source}__{asset_name}``
* ``asset_name`` is typically copied from the source data.
* ``asset_type`` is not included in this layer because the data modeling does not
yet conform to PUDL standards. Raw assets are typically just copies of the
source data.

Core layer
^^^^^^^^^^
* This layer contains well-modeled assets that serve as building blocks for downstream
wide tables and analyses. Well-modeled means tables in the database have logical
primary keys, foreign keys, datatypes and generally follow
:ref:`Tidy Data standards <tidy-data>`.
These assets are typically stored in parquet files or tables in a database.
* Naming convention: ``core_{source}__{asset_type}_{asset_name}``
* ``asset_type`` describes how the asset is modeled and its role in PUDL’s
collection of core assets. There are a handful of table types in this layer:

Member

I would add something about how these tables break the raw tables down into well-modeled assets...so it's clear they are like the amino acids of the process haha

Member Author
@bendnorman Sep 26, 2023

Love it. I'm ashamed to say I have a middle school understanding of biology.


  * ``assn``: Association tables provide connections between entities. This data
    can be manually compiled or extracted from data sources. Examples:
    ``core_pudl__assn_plants_eia``, ``core_eia861__assn_utility``.
  * ``codes``: Code tables contain more verbose descriptions of categorical codes,
    typically manually compiled from source data dictionaries. Examples:
    ``core_eia__codes_averaging_periods``, ``core_eia__codes_balancing_authorities``.
  * ``entity``: Entity tables contain static information about entities. For example,
    the state a plant is located in, or the plant a boiler is a part of. Examples:
    ``core_eia__entity_boilers``, ``core_eia923__entity_coalmine``.
  * ``scd``: Slowly changing dimension tables describe attributes of entities that
    rarely change. For example, the ownership or the capacity of a plant. Examples:
    ``core_eia860__scd_generators``, ``core_eia860__scd_plants``.
  * ``yearly/monthly/hourly``: Time series tables contain attributes about entities
    that are expected to change for each reported timestamp. Time series tables
    typically contain measurements of processes like net generation or CO2 emissions.
    Examples: ``core_ferc714__hourly_demand_pa``,
    ``core_ferc1__yearly_plant_in_service``.
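
As a rough sketch, the table types listed above could be checked like this. The ``CORE_ASSET_TYPES`` set and the ``core_asset_type`` helper are hypothetical names introduced for illustration; they are not part of the PUDL codebase.

```python
# Table types named in the Core layer description above.
CORE_ASSET_TYPES = {"assn", "codes", "entity", "scd", "yearly", "monthly", "hourly"}


def core_asset_type(name: str) -> str:
    """Return the asset type of a core asset name, or raise if it is not valid."""
    prefix, separator, remainder = name.partition("__")
    if not separator or not prefix.startswith("core_"):
        raise ValueError(f"{name!r} is not a core asset name")
    # The asset type is the first underscore-delimited token after "__".
    asset_type = remainder.split("_", 1)[0]
    if asset_type not in CORE_ASSET_TYPES:
        raise ValueError(f"{asset_type!r} is not a recognized core asset type")
    return asset_type


print(core_asset_type("core_ferc714__hourly_demand_pa"))
print(core_asset_type("core_pudl__assn_plants_eia"))
```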

Output layer
^^^^^^^^^^^^
* This layer uses assets in the Core layer to construct wide and complete tables
suitable for users to perform analysis on. This layer can contain intermediate
tables that bridge the core and user-facing tables.

Member

Add something about how this layer more closely mimics the raw data sources but with cleaned data. How it reconstructs the data from the components to ensure consistency for things like plant name / location etc. This is where we convince people that this is the best layer :)

I'm a little confused about these intermediate assets. I think you mean the _core tables. But you have a section below dedicated to Intermediate Assets. If intermediate assets are part of the _output layer, maybe they should be nested underneath that bullet rather than having their own section? But also it doesn't make sense to me that they would be part of the output layer and not the core layer given their name. Are they not their own distinct type of layer?

Also the wording here is a bit confusing. I would probably say this layer "contains" rather than this layer "can contain" because it should probably be fixed.

Member Author

Intermediate assets can be present in the core and output layers. They're just logical steps towards a final core or output asset. Should I move the section or make it a note so people understand it isn't a separate layer but a type of asset?

* Naming convention: ``out_{source}__{asset_type}_{asset_name}``
* ``source`` is optional in this layer because there can be assets that join data from
multiple sources.
* ``asset_type`` is also optional. It will likely describe the frequency at which
the data is reported (annual/monthly/hourly).

Intermediate Assets
^^^^^^^^^^^^^^^^^^^
* Intermediate assets are logical steps towards a final well-modeled core asset or
user-facing output asset. These assets are not intended to be persisted in the
database or accessible to the user. These assets are denoted by a preceding
underscore, like a private python method. For example, the intermediate asset
``_core_eia860__plants`` is a logical step towards the
``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.

Member

This is a good example, but I don't understand what a "logical step" is in this context. I think it would help to provide more detail. What is in __plants vs. __entity_plants and __scd_plants? Why is it important to have a _core table output before splitting the table into denormalized components and then performing transformations?

Member Author

I fleshed out the example a bit more.

* The number of intermediate assets should be limited to avoid an extremely
cluttered DAG. It is appropriate to create an intermediate asset when:

  * a process has both a short-running and a long-running portion. Splitting the two
    into separate assets means that debugging the short-running portion does not
    require rerunning the long-running one.
  * there is a logical step in a process that is frequently inspected for debugging. For
    example, the pre-harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
    are frequently inspected when new years of data are added.
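
The leading-underscore rule for intermediate assets can be illustrated with a short sketch. The helper ``is_intermediate_asset`` is a hypothetical name, not a PUDL function; the asset names are the ones used in the example above.

```python
def is_intermediate_asset(name: str) -> bool:
    """Intermediate assets are marked by a leading underscore."""
    return name.startswith("_")


asset_names = [
    "_core_eia860__plants",
    "core_eia860__entity_plants",
    "core_eia860__scd_plants",
]

# Keep only the assets that are meant to be persisted and user-accessible.
persisted = [name for name in asset_names if not is_intermediate_asset(name)]
print(persisted)
```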


Columns and Field Names
^^^^^^^^^^^^^^^^^^^^^^^
If two columns in different tables record the same quantity in the same units,
give them the same name. That way if they end up in the same dataframe for
comparison it's easy to automatically rename them with suffixes indicating
where they came from. For example, net electricity generation is reported to
both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
each of those data sources. Similarly, give non-comparable quantities reported
in different data sources **different** column names. This helps make it clear
that the quantities are actually different.
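
A minimal sketch of the suffix renaming described above, using pandas. The two dataframes, the ``report_year`` key and all values are made-up toy data for illustration only; they are not real FERC Form 1 or EIA 923 figures.

```python
import pandas as pd

# Toy data (invented values): the same quantity, in the same units, from two sources.
ferc1 = pd.DataFrame(
    {"report_year": [2020, 2021], "net_generation_mwh": [100.0, 110.0]}
)
eia923 = pd.DataFrame(
    {"report_year": [2020, 2021], "net_generation_mwh": [98.0, 112.0]}
)

# Because the column names match, pandas appends the suffixes automatically.
combined = ferc1.merge(eia923, on="report_year", suffixes=("_ferc1", "_eia923"))
print(list(combined.columns))
```

Had the two columns been named differently, the merge would leave both names untouched and nothing would indicate that the values are directly comparable.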

* ``total`` should come at the beginning of the name (e.g.
``total_expns_production``)
* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
``source`` is the agency or organization that has assigned the ID. (e.g.
``plant_id_eia``)
* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
is describing
* Units should be appended to field names where applicable (e.g.
``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
the type of unit varies, as in columns containing a heterogeneous collection
of fuels)

Member

Avoid passive voice when possible so that the wording is clearer and more direct.

Instead of:
Units should be appended to field names
Make it:
Append units to field names

Basically any time you see the verb "to be" at work it's passive and you can move the verb to the front of the sentence to make it active.

I think there is another example above where it says "Identifiers should be..."

* Financial values are assumed to be in nominal US dollars.

Member

(I.e., the suffix _usd is implied.) If they are not reported in USD, convert them to USD. If they must be kept in their original form for some reason, append a suffix that lets the user know they are not USD.

* ``_id`` indicates the field contains a usually numerical reference to
another table, which will not be intelligible without looking up the value in
that other table.
* The suffix ``_code`` indicates the field contains a short abbreviation from
a well defined list of values, that probably needs to be looked up if you
want to understand what it means.
* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
from a well defined list of values. Whenever possible we try to use these
longer descriptive names rather than codes.
* ``_name`` indicates a longer human readable name, that is likely not well
categorized into a small set of acceptable values.
* ``_date`` indicates the field contains a :class:`Date` object.
* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other
specific types of capacity are annotated.
* Regardless of what label utilities are given in the original data source
(e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
``utilities`` in PUDL.

Comment on lines +165 to +183
Member

When there's lots of information in a bullet like here, I tend to prefer a tabular format (kind of like below for general abbreviations)
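
The suffix conventions in the column list above can also be expressed as a lookup. This is only a sketch: ``SUFFIX_MEANINGS`` and ``describe_column_suffix`` are hypothetical names, and the descriptions are condensed from the bullets rather than taken from any PUDL module.

```python
from typing import Optional

# Condensed from the column naming bullets; not an exhaustive list.
SUFFIX_MEANINGS = {
    "_code": "short abbreviation from a well defined list of values",
    "_type": "human readable category from a well defined list of values",
    "_name": "longer human readable name",
    "_date": "Date object",
    "_datetime": "full Datetime object",
    "_year": "integer 4-digit year",
    "_pct": "percent",
    "_ppm": "parts per million",
}


def describe_column_suffix(column: str) -> Optional[str]:
    """Return the meaning of a column's suffix, or None if it is not listed."""
    for suffix, meaning in SUFFIX_MEANINGS.items():
        if column.endswith(suffix):
            return meaning
    return None


print(describe_column_suffix("fuel_type"))
print(describe_column_suffix("plant_id_eia"))
```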


Naming Conventions in Code
--------------------------

In the PUDL codebase, we aspire to follow the naming and other conventions
detailed in :pep:`8`.
@@ -21,11 +163,6 @@ as we come across them again in maintaining the code.
``eia`` or ``ferc1``). When outputs are built from a single table, simply use
the table name (e.g. ``core_eia923__monthly_boiler_fuel``).

.. _glossary:

Glossary of Abbreviations
-------------------------

General Abbreviations
^^^^^^^^^^^^^^^^^^^^^

@@ -76,61 +213,9 @@ Abbreviation Definition


Data Extraction Functions
-------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^

The lower level namespace uses an imperative verb to identify the action the
function performs followed by the object of extraction (e.g.
``get_eia860_file``). The upper level namespace identifies the dataset where
extraction is occurring.

Output Functions
-----------------

When dataframe outputs are built from multiple tables, identify the type of
information being pulled (e.g. ``plants``) and the source of the tables (e.g.
``eia`` or ``ferc1``). When outputs are built from a single table, simply use
the table name (e.g. ``core_eia923__monthly_boiler_fuel``).

Table Names
-----------

See `this article <http://www.vertabelo.com/blog/technical-articles/naming-conventions-in-database-modeling>`__ on database naming conventions.

* Table names in snake_case
* The data source should follow the thing it applies to e.g. ``plant_id_ferc1``

Columns and Field Names
-----------------------

* ``total`` should come at the beginning of the name (e.g.
``total_expns_production``)
* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
``source`` is the agency or organization that has assigned the ID. (e.g.
``plant_id_eia``)
* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
is describing
* Units should be appended to field names where applicable (e.g.
``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
the type of unit varies, as in columns containing a heterogeneous collection
of fuels)
* Financial values are assumed to be in nominal US dollars.
* ``_id`` indicates the field contains a usually numerical reference to
another table, which will not be intelligible without looking up the value in
that other table.
* The suffix ``_code`` indicates the field contains a short abbreviation from
a well defined list of values, that probably needs to be looked up if you
want to understand what it means.
* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
from a well defined list of values. Whenever possible we try to use these
longer descriptive names rather than codes.
* ``_name`` indicates a longer human readable name, that is likely not well
categorized into a small set of acceptable values.
* ``_date`` indicates the field contains a :class:`Date` object.
* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other
specific types of capacity are annotated.
* Regardless of what label utilities are given in the original data source
(e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
``utilities`` in PUDL.