-
-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sec10k metadata directly in PUDL #4035
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved
I think you may need to set the And if it's set in the context of both the unit and integration tests, then the SEC tables would get pulled when it's testing for the merge queue too (assuming the runner has permissions to pull from GCS) which would probably also be good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a bunch of comments about field names and our existing conventions mostly.
Do we not need to define the resources that these fields are going to be part of in pudl/metadata/resources/sec10k.py
src/pudl/metadata/fields.py
Outdated
"block": { | ||
"type": "string", | ||
"description": "Title of block of data.", | ||
}, | ||
"block_count": { | ||
"type": "integer", | ||
"description": "Some blocks are repeated, `block_count` defines the index of the data block.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a data block? More context in these descriptions would be helpful. Is block
a specific enough column name?
I think the single backticks will probably not render correctly, since that's markdown formatting, and this will be getting parse as RST in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically a narrow tidy table with a bunch of key value pairs that contain company information. These key values are organized in connected blocks. I've changed the naming to hopefully make this structure more clear. Let me know if it makes more sense now.
"date_filed": { | ||
"type": "datetime", | ||
"description": "Date filing was submitted.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we already have a field name that means this, maybe in the context of the FERC XBRL data, which could be re-used here rather than creating a new field name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to find any fields like that, but I can definitely change this to use an existing field if there's one that makes sense.
src/pudl/metadata/fields.py
Outdated
"former_conformed_name": { | ||
"type": "string", | ||
"description": "Former name of the company.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By analogy with company_name
and company_name_raw
might this be clearer as company_name_former
? What does "conformed" mean in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense and I've changed it. I think the conformed_name
here comes from the SEC wording, but I don't see any reason to strictly adhere to the terminology they used. @katie-lamb definitely let me know if you're aware of any specific reason to keep conformed
in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, former_confermed_name
comes directly from a field pulled from the 10-K filing. I don't see any reason to keep conformed
though. We don't use this field in any of the modeling bits.
src/pudl/metadata/fields.py
Outdated
"form_type": { | ||
"type": "string", | ||
"description": "Specific version of SEC 10k filed.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we have exhibit_21_version
which seems analogous, but there we say version
and here we say type
-- if both are versions, and this one is specific to the SEC 10K, it seems like it would be more consistent to use version
in both cases. So maybe exhibit21_version
and sec10k_version
In our naming conventions we also use the _type
suffix to indicate "a human readable category from a well defined list of values"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, fixed.
"state_of_incorporation": { | ||
"type": "string", | ||
"description": "Two letter state code where company is incorporated.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this / can this be constrained to match only valid state abbreviations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should and maybe can. Is there a place where we're applying such a constraint right now? The base state
field doesn't have any constraints on it.
src/pudl/metadata/fields.py
Outdated
"street_address_2": { | ||
"type": "string", | ||
"description": "Secondary street address.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have an address_2
field defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/pudl/metadata/fields.py
Outdated
@@ -4328,6 +4417,10 @@ | |||
"type": "integer", | |||
"description": "Sub-plant ID links EPA CEMS emissions units to EIA units.", | |||
}, | |||
"subsidiary": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subsidiary_company_name
or at least subsidiary_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/pudl/metadata/fields.py
Outdated
"value": { | ||
"type": "string", | ||
"description": "String value of data point.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very vague without more context. What does it mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I've changed the naming for this table to hopefully make it more clear.
src/pudl/metadata/fields.py
Outdated
"year_quarter": { | ||
"type": "string", | ||
"description": "Year quarter filing applies to.", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for all other columns with this kind of information in them, we are using report_date
-- sometimes with annual and other times with monthly frequency. In the FERC data I think it also describes quarterly data (although we only pull in Q4 information). Can we store a quarter using a date type, which conveys richer information about the contents, and can be used in conjunction with other report date data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converted to report_date
@zaneselvans yeah I had just accidentally not added the file in my commit, then got pulled away yesterday before I could fix it. Same with removing the |
I left some responses about naming convention stuff but all the updates seem good to me. @zaneselvans I'll let you weigh in to address requested changes. |
Add metadata for sec10k tables directly to PUDL to solve read-the-docs build issues with pulling metadata from gcs, and make the
deltalake
dependency unnecessary, which was behaving weirdly.