catalyst-cooperative · bendnorman · Nov 10, 2023 · Sep 20, 2023 · Sep 20, 2023 · Sep 26, 2023
diff --git a/docs/data_access.rst b/docs/data_access.rst
@@ -8,6 +8,12 @@ PUDL data, so if you have a suggestion please `open a GitHub issue
 <https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question you
 can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.
 
+PUDL's primary data output is the ``pudl.sqlite`` database. It contains a collection
+of tables that follow :ref:`PUDL's asset naming convention <asset-naming>`. Tables
+with the ``core_`` prefix are normalized tables that serve as building blocks for the
+more denormalized and easier-to-use ``out_`` tables. **We recommend working only
+with** ``out_`` **tables.**
+
 .. _access-modes:
 
 ---------------------------------------------------------------------------------------

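A minimal sketch of how a reader might find the recommended tables in ``pudl.sqlite`` using Python's built-in ``sqlite3`` module. It assumes the output tables carry the ``out_`` layer prefix described in the asset naming convention. The in-memory database and the table name ``out_eia__yearly_plants`` are hypothetical stand-ins so the snippet runs without downloading PUDL data.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection, prefix: str = "") -> list[str]:
    """Return the sorted names of all tables whose name starts with ``prefix``."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    # Filter in Python: "_" is a single-character wildcard in SQL LIKE patterns.
    return sorted(name for (name,) in rows if name.startswith(prefix))


# Hypothetical stand-in for pudl.sqlite; with a real download you would call
# sqlite3.connect() on the path to your local copy instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE core_eia860__scd_plants (plant_id_eia INTEGER)")
conn.execute("CREATE TABLE out_eia__yearly_plants (plant_id_eia INTEGER)")

print(list_tables(conn, "out_"))
```

The prefix is passed as an argument, so the same helper can list ``core_`` tables when the normalized building blocks are needed.
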
diff --git a/docs/dev/data_guidelines.rst b/docs/dev/data_guidelines.rst
@@ -64,6 +64,8 @@ Examples of Unacceptable Changes
   fuel heat content and net electricity generation. The heat rate would
   be a derived value and not part of the original data.
 
+.. _tidy-data:
+
 -------------------------------------------------------------------------------
 Make Tidy Data
 -------------------------------------------------------------------------------
@@ -117,24 +119,6 @@ that M/Mega is a million in SI. And a `BTU
 energy required to raise the temperature of one *avoirdupois pound* of water
 by 1 degree *Fahrenheit*?! What century even is this?).
 
--------------------------------------------------------------------------------
-Silo the ETL Process
--------------------------------------------------------------------------------
-It should be possible to run the ETL process on each data source independently
-and with any combination of data sources included. This allows users to include
-only the data need. In some cases, like the :doc:`EIA 860
-<../data_sources/eia860>` and :doc:`EIA 923 <../data_sources/eia923>` data, two
-data sources may be so intertwined that keeping them separate doesn't really
-make sense. This should be the exception, however, not the rule.
-
--------------------------------------------------------------------------------
-Separate Data from Glue
--------------------------------------------------------------------------------
-The glue that relates different data sources to each other should be applied
-after or alongside the ETL process and not as a mandatory part of ETL. This
-makes it easy to pull individual data sources in and work with them even when
-the glue isn't working or doesn't yet exist.
-
 -------------------------------------------------------------------------------
 Partition Big Data
 -------------------------------------------------------------------------------
@@ -146,35 +130,6 @@ them to pull in only certain years, certain states, or other sensible partitions
 data so that they don’t run out of memory or disk space or have to wait hours while data
 they don't need is being processed.
 
--------------------------------------------------------------------------------
-Naming Conventions
--------------------------------------------------------------------------------
-    *There are only two hard problems in computer science: caching,
-    naming things, and off-by-one errors.*
-
-Use Consistent Names
-^^^^^^^^^^^^^^^^^^^^
-If two columns in different tables record the same quantity in the same units,
-give them the same name. That way if they end up in the same dataframe for
-comparison it's easy to automatically rename them with suffixes indicating
-where they came from. For example, net electricity generation is reported to
-both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
-<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
-each of those data sources. Similarly, give non-comparable quantities reported
-in different data sources **different** column names. This helps make it clear
-that the quantities are actually different.
-
-Follow Existing Conventions
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-We are trying to use consistent naming conventions for the data tables,
-columns, data sources, and functions. Generally speaking PUDL is a collection
-of subpackages organized by purpose (extract, transform, load, analysis,
-output, datastore…), containing a module for each data source. Each data source
-has a short name that is used everywhere throughout the project and is composed of
-the reporting agency and the form number or another identifying abbreviation:
-``ferc1``, ``epacems``, ``eia923``, ``eia861``, etc. See the :doc:`naming
-conventions <naming_conventions>` document for more details.
-
 -------------------------------------------------------------------------------
 Complete, Continuous Time Series
 -------------------------------------------------------------------------------

diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
@@ -1,6 +1,148 @@
 ===============================================================================
 Naming Conventions
 ===============================================================================
+    *There are only two hard problems in computer science: caching,
+    naming things, and off-by-one errors.*
+
+We try to use consistent naming conventions for the data tables, data assets,
+columns, data sources, and functions.
+
+.. _asset-naming:
+
+Asset Naming Conventions
+---------------------------------------------------
+
+PUDL's data processing is divided into three layers of dagster assets: Raw, Core
+and Output. Asset names should generally follow this naming convention:
+
+.. code-block::
+
+    {layer}_{source}__{asset_type}_{asset_name}
+
+* ``layer`` is the processing layer of the asset. Acceptable values are:
+  ``raw``, ``core`` and ``out``. ``layer`` is required for all assets in all layers.
+* ``source`` is an abbreviation of the original source of the data. For example,
+  ``eia860``, ``ferc1`` and ``epacems``.
+* ``asset_type`` describes how the asset is modeled.
+* ``asset_name`` should describe the entity, categorical code type, or measurement of
+  the asset.
+
+Raw layer
+^^^^^^^^^
+* This layer contains assets that extract data from spreadsheets and databases
+  and are persisted as pickle files.
+* Naming convention: ``raw_{source}__{asset_name}``
+* ``asset_name`` is typically copied from the source data.
+* ``asset_type`` is not included in this layer because the data modeling does not
+  yet conform to PUDL standards. Raw assets are typically just copies of the
+  source data.
+
+Core layer
+^^^^^^^^^^
+* This layer contains well-modeled assets that serve as building blocks for downstream
+  wide tables and analyses. Well-modeled means tables in the database have logical
+  primary keys, foreign keys, datatypes and generally follow
+  :ref:`Tidy Data standards <tidy-data>`.
+  These assets are typically stored in parquet files or tables in a database.
+* Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+* ``asset_type`` describes how the asset is modeled and its role in PUDL’s
+  collection of core assets. There are a handful of table types in this layer:
+
+  * ``assn``: Association tables provide connections between entities. This data
+    can be manually compiled or extracted from data sources. Examples:
+    ``core_pudl__assn_plants_eia``, ``core_eia861__assn_utility``.
+  * ``codes``: Code tables contain more verbose descriptions of categorical codes,
+    typically manually compiled from source data dictionaries. Examples:
+    ``core_eia__codes_averaging_periods``, ``core_eia__codes_balancing_authorities``.
+  * ``entity``: Entity tables contain static information about entities. For example,
+    the state a plant is located in, or the plant a boiler is a part of. Examples:
+    ``core_eia__entity_boilers``, ``core_eia923__entity_coalmine``.
+  * ``scd``: Slowly changing dimension tables describe attributes of entities that
+    rarely change. For example, the ownership or the capacity of a plant. Examples:
+    ``core_eia860__scd_generators``, ``core_eia860__scd_plants``.
+  * ``yearly/monthly/hourly``: Time series tables contain attributes about entities
+    that are expected to change for each reported timestamp. Time series tables
+    typically contain measurements of processes like net generation or CO2 emissions.
+    Examples: ``core_ferc714__hourly_demand_pa``,
+    ``core_ferc1__yearly_plant_in_service``.
+
+Output layer
+^^^^^^^^^^^^
+* This layer uses assets in the Core layer to construct wide and complete tables
+  suitable for users to perform analysis on. This layer can contain intermediate
+  tables that bridge the core and user-facing tables.
+* Naming convention: ``out_{source}__{asset_type}_{asset_name}``
+* ``source`` is optional in this layer because there can be assets that join data from
+  multiple sources.
+* ``asset_type`` is also optional. It will likely describe the frequency at which
+  the data is reported (yearly/monthly/hourly).
+
+Intermediate Assets
+^^^^^^^^^^^^^^^^^^^
+* Intermediate assets are logical steps towards a final well-modeled core asset or
+  user-facing output asset. These assets are not intended to be persisted in the
+  database or accessible to the user. These assets are denoted by a preceding
+  underscore, like a private Python method. For example, the intermediate asset
+  ``_core_eia860__plants`` is a logical step towards the
+  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
+* The number of intermediate assets should be limited to avoid an extremely
+  cluttered DAG. It is appropriate to create an intermediate asset when:
+
+  * a process has both a short-running and a long-running portion. Separating the two
+    portions into distinct assets means that debugging the short-running portion
+    doesn't require waiting on the long-running one.
+  * there is a logical step in a process that is frequently inspected for debugging. For
+    example, the pre-harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
+    are frequently inspected when new years of data are added.
+
+
+Columns and Field Names
+^^^^^^^^^^^^^^^^^^^^^^^
+If two columns in different tables record the same quantity in the same units,
+give them the same name. That way if they end up in the same dataframe for
+comparison it's easy to automatically rename them with suffixes indicating
+where they came from. For example, net electricity generation is reported to
+both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
+<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
+each of those data sources. Similarly, give non-comparable quantities reported
+in different data sources **different** column names. This helps make it clear
+that the quantities are actually different.
+
+* ``total`` should come at the beginning of the name (e.g.
+  ``total_expns_production``)
+* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
+  ``source`` is the agency or organization that has assigned the ID. (e.g.
+  ``plant_id_eia``)
+* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
+  is describing
+* Units should be appended to field names where applicable (e.g.
+  ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
+  for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
+  the type of unit varies, as in columns containing a heterogeneous collection
+  of fuels)
+* Financial values are assumed to be in nominal US dollars.
+* ``_id`` indicates the field contains a usually numerical reference to
+  another table, which will not be intelligible without looking up the value in
+  that other table.
+* The suffix ``_code`` indicates the field contains a short abbreviation from
+  a well defined list of values, that probably needs to be looked up if you
+  want to understand what it means.
+* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
+  from a well defined list of values. Whenever possible we try to use these
+  longer descriptive names rather than codes.
+* ``_name`` indicates a longer human readable name, that is likely not well
+  categorized into a small set of acceptable values.
+* ``_date`` indicates the field contains a :class:`Date` object.
+* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
+* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
+* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``); other
+  specific types of capacity are annotated.
+* Regardless of what label utilities are given in the original data source
+  (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
+  ``utilities`` in PUDL.
+
+Naming Conventions in Code
+--------------------------
 
 In the PUDL codebase, we aspire to follow the naming and other conventions
 detailed in :pep:`8`.
@@ -21,11 +163,6 @@ as we come across them again in maintaining the code.
   ``eia`` or ``ferc1``). When outputs are built from a single table, simply use
   the table name (e.g. ``core_eia923__monthly_boiler_fuel``).
 
-.. _glossary:
-
-Glossary of Abbreviations
--------------------------
-
 General Abbreviations
 ^^^^^^^^^^^^^^^^^^^^^
 
@@ -76,61 +213,9 @@ Abbreviation            Definition
 
 
 Data Extraction Functions
--------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The lower level namespace uses an imperative verb to identify the action the
 function performs followed by the object of extraction (e.g.
 ``get_eia860_file``). The upper level namespace identifies the dataset where
 extraction is occurring.
-
-Output Functions
------------------
-
-When dataframe outputs are built from multiple tables, identify the type of
-information being pulled (e.g. ``plants``) and the source of the tables (e.g.
-``eia`` or ``ferc1``). When outputs are built from a single table, simply use
-the table name (e.g. ``core_eia923__monthly_boiler_fuel``).
-
-Table Names
------------
-
-See `this article <http://www.vertabelo.com/blog/technical-articles/naming-conventions-in-database-modeling>`__ on database naming conventions.
-
-* Table names in snake_case
-* The data source should follow the thing it applies to e.g. ``plant_id_ferc1``
-
-Columns and Field Names
------------------------
-
-* ``total`` should come at the beginning of the name (e.g.
-  ``total_expns_production``)
-* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
-  ``source`` is the agency or organization that has assigned the ID. (e.g.
-  ``plant_id_eia``)
-* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
-  is describing
-* Units should be appended to field names where applicable (e.g.
-  ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
-  for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
-  the type of unit varies, as in columns containing a heterogeneous collection
-  of fuels)
-* Financial values are assumed to be in nominal US dollars.
-* ``_id`` indicates the field contains a usually numerical reference to
-  another table, which will not be intelligible without looking up the value in
-  that other table.
-* The suffix ``_code`` indicates the field contains a short abbreviation from
-  a well defined list of values, that probably needs to be looked up if you
-  want to understand what it means.
-* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
-  from a well defined list of values. Whenever possible we try to use these
-  longer descriptive names rather than codes.
-* ``_name`` indicates a longer human readable name, that is likely not well
-  categorized into a small set of acceptable values.
-* ``_date`` indicates the field contains a :class:`Date` object.
-* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
-* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
-* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other
-  specific types of capacity are annotated.
-* Regardless of what label utilities are given in the original data source
-  (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
-  ``utilities`` in PUDL.
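
The asset naming convention added to ``docs/dev/naming_conventions.rst`` in this change (``{layer}_{source}__{asset_type}_{asset_name}``, with a leading underscore for intermediate assets) can be sketched as a small parser. This is an illustration only, not part of PUDL: ``AssetName`` and ``parse_asset_name`` are hypothetical names, and the set of asset types is taken from the Core layer list in the document.

```python
from dataclasses import dataclass
from typing import Optional

# Values listed in the naming conventions document.
LAYERS = {"raw", "core", "out"}
ASSET_TYPES = {"assn", "codes", "entity", "scd", "yearly", "monthly", "hourly"}


@dataclass(frozen=True)
class AssetName:
    layer: str
    source: str  # may be empty; the document says it is optional in the out layer
    asset_type: Optional[str]  # None for raw assets or when no type is given
    asset_name: str
    intermediate: bool  # True when the name starts with an underscore


def parse_asset_name(name: str) -> AssetName:
    """Split an asset name into the parts described by the convention."""
    intermediate = name.startswith("_")
    # The double underscore separates {layer}_{source} from the rest.
    prefix, sep, rest = name.lstrip("_").partition("__")
    layer, _, source = prefix.partition("_")
    if not sep or not rest or layer not in LAYERS:
        raise ValueError(f"not a recognized asset name: {name!r}")
    first, _, remainder = rest.partition("_")
    # Raw assets have no asset_type; elsewhere it is the first token, if known.
    if layer != "raw" and first in ASSET_TYPES and remainder:
        return AssetName(layer, source, first, remainder, intermediate)
    return AssetName(layer, source, None, rest, intermediate)


print(parse_asset_name("core_eia860__scd_generators"))
print(parse_asset_name("_core_eia860__plants"))
```

The parser is deliberately lenient about a missing ``asset_type`` because the document's own intermediate example, ``_core_eia860__plants``, omits it.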