# Create design doc for naming convention of PUDL tables #2517

It sounds like users favor ripping out `PudlTabl` ASAP and accessing the tables directly from the database (#2503). Given we won't have `PudlTabl` methods to act as aliases for the actual table names, we need to decide on a naming convention for our denormalized tables. We'll likely have to update the normalized table names too, because the ideal denormalized table names might conflict with the existing normalized table names. This is a scary change!

### Scope

…

### Tasks

…

### Questions

…
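For concreteness, here is a rough sketch of the access-pattern change this implies. The table name and database path below are placeholders, since picking the real names is exactly what this issue is about:

```python
import pandas as pd
import sqlalchemy as sa

# Hypothetical path to the PUDL SQLite database.
engine = sa.create_engine("sqlite:///pudl.sqlite")

# Today: users reach tables through PudlTabl methods, which act as aliases
# for the underlying table names, e.g.:
#
#   from pudl.output.pudltabl import PudlTabl
#   gens = PudlTabl(pudl_engine=engine).gens_eia860()

# After this change: users read tables directly, so each table's own name
# becomes part of the user-facing API. "denorm_generators_eia" is a
# placeholder, not a decided name.
gens = pd.read_sql_table("denorm_generators_eia", engine)
```

---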
Some of the discussion in #2275 should be relevant here. Right now, IIRC, the normalized tables are the only tables which don't have some kind of qualifying suffix (e.g. the referent …). In terms of updating naming conventions, the name should convey the following four pieces of information: …

While doing this, we should aim for the following: …

I'd propose the following, which is a bit of a departure from our current naming protocol in the DAG, but I think it would be clearest: …

Pros of this approach: …
---
One amendment to the above: the denormalized tables are the ones that will be most widely used. We'll want to use the simplest and clearest naming convention for them, so this may mean we want to drop the …

---
This seems like an accurate assessment of our current naming issues @e-belfer; thank you! I agree our "normalized" tables are the only tables/assets lacking a suffix that describes the stage of the data. Our naming conventions will also affect how we categorize our assets, code, tables, and metadata. There are a few ways we are grouping assets: …

I could see us adopting a file structure organized by processing stage and source, sketched below.
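Something along these lines, for instance (a hypothetical sketch; the actual stage and source names would need to be worked out):

```text
pudl/
├── extract/
│   ├── eia860.py
│   └── ferc1.py
├── transform/
│   ├── eia860.py
│   └── ferc1.py
└── output/
    ├── eia860.py
    └── ferc1.py
```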
---
We should base our table naming convention on standard data modeling techniques. Here are a few questions we should answer: …
---
I think our normalized data roughly follows the popular star schema. The star schema organizes data into fact and dimension tables.

We have tables where each row measures a specific event (a report from a generator, plant, utility, or respondent), similar to a fact table. Our entity tables contain attributes that describe the events in the fact tables, similar to dimension tables (for example, the …). These fact and dimension tables can then be combined to create various derived tables that summarize and aggregate the data, as in the toy sketch below. I still need to find more examples/documentation on derived tables to recommend naming conventions for all of our "denormalized" tables. I'm trying to categorize our existing tables to confirm they map to star schema concepts. We discussed how these topics apply to PUDL in a GitHub discussion from September 2022.
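To make the mapping concrete, here is a toy sketch of the fact/dimension idea using PUDL-flavored but simplified, hypothetical tables and columns:

```python
import pandas as pd

# Fact table: one row per reporting "event" (generator + report date).
generation_fact = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["1", "2", "1"],
    "report_date": pd.to_datetime(["2020-01-01"] * 3),
    "net_generation_mwh": [1000.0, 1500.0, 700.0],
})

# Dimension table: one row per generator, holding descriptive attributes.
generators_dim = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["1", "2", "1"],
    "prime_mover_code": ["ST", "ST", "GT"],
    "capacity_mw": [50.0, 75.0, 30.0],
})

# A "denormalized" derived table is the fact table joined to its dimensions.
denorm_generation = generation_fact.merge(
    generators_dim, on=["plant_id_eia", "generator_id"], how="left"
)
print(denorm_generation)
```

---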
### Slowly Varying Dimensions

A lot of our dimensions are slowly changing: more dimension-like than fact-like, but not truly static, so we end up with almost all of the generator/boiler/plant attributes showing up in the annual rather than the entity tables. How do those kinds of tables typically fit into data warehouse schemas?

### Organizing Assets

I guess we're really trying to come up with an asset naming convention, with a subset of the assets also being database tables (and other assets possibly becoming database tables at some future date if they're clearly useful in that context). What is the intended method of organizing/grouping assets in Dagster? Initially I had thought that the …

### New Table Names

Alongside the 4 naming chunks in …, we also need to think a bit about how we're going to come up with legible and concise names for the tables themselves.

### Combining Skinny Derived Tables

The current work to translate our output layers into Dagster abstractions and the database isn't trying to change how we actually organize the processing, so that we have a limited scope to get the transition done. Looking forward, I think we might want to do some of that re-organization, so that we do a better job of organizing new fact and dimension tables that are produced through analysis, and only combine these building blocks into wide, denormalized tables intended for direct use by most people at the end of the process. Is there any way that we can prepare ourselves for doing that kind of disentanglement later without creating huge disruptions?

For example, right now the MCOE output table contains per-generator estimates of capacity factor, heat rate, and fuel cost per MWh. Each of those variables has its own output method and there are dependencies between them, but the …

So right now we have to define a long & wide MCOE asset to preserve the current processing arrangement and expected outputs. Maybe that MCOE table is part of the data API that we highlight for users, but in the future the way it is constructed can change to keep the data warehouse better organized, without dramatically changing what the user sees? (A rough sketch of that pattern is below.) I think a similar pattern may be useful in the cases where we are filling in or estimating missing values in the normalized fact tables (e.g. the EIA fuel price estimates in the FRC table). Being able to separately address the original facts and the augmented facts will help us label them appropriately and make it possible to use one or the other if need be in downstream applications.

I think this is mostly about having well-organized and modular tables within the data warehouse. My impression is that most users don't want to think too hard about these individual components, and will go straight for the most complete denormalized tables for day-to-day usage. In that context, I think we'd probably end up with tables that look like "monthly generator facts" or "annual generator facts": tables that include all of the time-varying values which are reported on the basis of generator ID, and which can easily be merged with any subset of the attributes which appear in the generator dimension table (generator entity, or generator annual). Maybe it also makes sense to just go ahead and do an exhaustive merge for folks, so they have the utility (operator), plant, and generator information all there in the denormalized time-varying generator fact table, but that would be a monster table, with something like 200 columns. Is that normal/desirable?
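As an entirely hypothetical sketch of that pattern in Dagster (asset names, columns, and values are all made up, not PUDL's actual code):

```python
import pandas as pd
from dagster import asset

# Skinny derived tables: each asset computes one per-generator estimate.

@asset
def capacity_factor() -> pd.DataFrame:
    return pd.DataFrame({"gen_id": ["a", "b"], "capacity_factor": [0.85, 0.42]})

@asset
def heat_rate() -> pd.DataFrame:
    return pd.DataFrame({"gen_id": ["a", "b"], "heat_rate_mmbtu_mwh": [10.5, 8.9]})

@asset
def fuel_cost(heat_rate: pd.DataFrame) -> pd.DataFrame:
    # Depends on heat_rate, mirroring the real dependency between estimates.
    out = heat_rate[["gen_id"]].copy()
    out["fuel_cost_per_mwh"] = [32.0, 41.5]
    return out

@asset
def mcoe(
    capacity_factor: pd.DataFrame,
    heat_rate: pd.DataFrame,
    fuel_cost: pd.DataFrame,
) -> pd.DataFrame:
    # The wide, user-facing table is just a join of the skinny building
    # blocks, so how it's assembled can change later without changing
    # what users see.
    return capacity_factor.merge(heat_rate, on="gen_id").merge(fuel_cost, on="gen_id")
```

---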
### Slowly Varying Dimensions

Oh, I didn't realize how many of the fields in our annual tables are mostly static. There are three common methods for handling slowly changing dimensions in data warehouses. Type 2 is the only option that preserves the entire history of a given entity: it adds a new row for each version of the entity, with start and end dates for each record. A surrogate key is added to uniquely identify the record, because the natural key will not be unique if there are multiple versions. The surrogate key is then used in the fact table. (A small sketch of this pattern is below.)

After learning more about the star schema model, I don't think we're actually that close to following it. Maybe I'm interpreting the definitions too strictly, but I think we'd have to change a lot about how we model our data to follow this pattern. For example, we'd have to move a bunch of columns from our annual tables to our entity tables and add surrogate keys to follow the slowly changing dimension pattern. Based on what I've read, it isn't common for dimensions to reference each other, yet some of our entity tables refer to other entity or encoding tables. Instead of having multiple annual tables as fact tables, we'd probably have a smaller set of fact tables where each row is a report for the most granular entity (generators)? We could spend some more hours thinking through how to adopt this common pattern for PUDL, or we could just focus on desired outputs and naming conventions given our existing modeling.
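Here is a minimal sketch of what a Type 2 slowly changing dimension could look like for our generators (all table, column, and key names are hypothetical):

```python
import pandas as pd

# Type 2 slowly changing dimension: one row per *version* of a generator,
# with validity dates and a surrogate key.
generators_dim = pd.DataFrame({
    "generator_sk": [101, 102],          # surrogate key, referenced by facts
    "plant_id_eia": [1, 1],              # natural key repeats across versions
    "generator_id": ["1", "1"],
    "prime_mover_code": ["ST", "CT"],    # the attribute that changed
    "valid_from": pd.to_datetime(["2015-01-01", "2019-01-01"]),
    "valid_to": pd.to_datetime(["2018-12-31", None]),  # NaT = current version
})

# Fact rows point at whichever version was current when the event occurred,
# so the full history is preserved.
generation_fact = pd.DataFrame({
    "generator_sk": [101, 102],
    "report_date": pd.to_datetime(["2017-06-01", "2020-06-01"]),
    "net_generation_mwh": [1200.0, 900.0],
})

# Joining on the surrogate key recovers the attributes as of each report.
print(generation_fact.merge(generators_dim, on="generator_sk"))
```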
### Organizing Assets

Yes! I think that's a good way to frame it. I'm also excited about inverting our project structure. I think it will make it clearer to folks how to add new data sources: "To integrate a new data source, add a directory for the source, add some assets in a sub-module, and load the assets into the main …" (see the sketch after this comment).

### Combining Skinny Derived Tables

WIP: Lots to think about! Will flesh out a response tomorrow!
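For the asset-grouping question above, a minimal sketch of how Dagster's asset groups could map to data sources (the module layout implied in the comments and all names here are assumptions, not PUDL's actual code):

```python
from dagster import Definitions, asset

# In the inverted layout, each data source would live in its own sub-package
# (e.g. pudl/eia860/, pudl/ferc1/: hypothetical paths), with the main module
# collecting the assets. Dagster's group_name keeps them grouped by source.

@asset(group_name="eia860")
def raw_generators_eia860():
    # Stand-in for an extraction step for one source.
    return [{"plant_id_eia": 1, "generator_id": "1"}]

@asset(group_name="eia860")
def generators_eia860(raw_generators_eia860):
    # Stand-in for a transform step, grouped with its source.
    return raw_generators_eia860

defs = Definitions(assets=[raw_generators_eia860, generators_eia860])
```

---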
It seems like we want to 1) define a naming convention and 2) decide how to structure the tables we want people to access. I'm not sure in what order we should tackle these. We could:

1. Apply a new naming convention to the tables as they exist today, and restructure them later.
2. Restructure the tables first, then apply the naming convention to the final structure.

Option 1 is probably the fastest path to distributing data, as Option 2 might require reworking a lot of our existing assets. However, Option 1 will mean we might need to go through two iterations of name changes.

---
Wow, there's a lot to this! One hot take is that we shouldn't rename tables twice. Also, have we come to any conclusions about where the data will live (i.e., in one big happy PUDL DB or in separate databases)? Ella mentioned this thread earlier, but I don't think we made any decisions about it. I feel like that might also impact how we decide to name these tables.

---
The full design doc can be found in this public Google Doc. The TL;DR was shared with users, and the implementation will be tracked in #2765.