Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sec10k metadata directly in PUDL #4035

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Add sec10k metadata directly in PUDL #4035

wants to merge 8 commits into from

Conversation

zschira
Copy link
Member

@zschira zschira commented Jan 28, 2025

Add metadata for sec10k tables directly to PUDL to solve read-the-docs build issues with pulling metadata from gcs, and make the deltalake dependency unnecessary, which was behaving weirdly.

@zschira zschira requested a review from katie-lamb January 28, 2025 23:26
Copy link
Member

@katie-lamb katie-lamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

@zaneselvans
Copy link
Member

zaneselvans commented Jan 29, 2025

I think you may need to set the USE_PUDL_MODELS envvar on the GitHub runner to get the metadata unit tests to pass -- it's finding a bunch of fields which are not being used because the SEC tables aren't included in any of the resources we've defined.

And if it's set in the context of both the unit and integration tests, then the SEC tables would get pulled when it's testing for the merge queue too (assuming the runner has permissions to pull from GCS) which would probably also be good.

@zaneselvans zaneselvans added testing Writing tests, creating test data, automating testing, etc. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. github-actions Pull requests that update GitHub Actions code mozilla_sec_to_eia Mozilla AI for EJ grant to link SEC utility ownership data to EIA operational data labels Jan 29, 2025
@zaneselvans zaneselvans added sec10k Issues related to SEC 10K filing data. and removed mozilla_sec_to_eia Mozilla AI for EJ grant to link SEC utility ownership data to EIA operational data labels Jan 29, 2025
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a bunch of comments about field names and our existing conventions mostly.

Do we not need to define the resources that these fields are going to be part of in pudl/metadata/resources/sec10k.py

src/pudl/analysis/pudl_models.py Show resolved Hide resolved
Comment on lines 334 to 341
"block": {
"type": "string",
"description": "Title of block of data.",
},
"block_count": {
"type": "integer",
"description": "Some blocks are repeated, `block_count` defines the index of the data block.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a data block? More context in these descriptions would be helpful. Is block a specific enough column name?

I think the single backticks will probably not render correctly, since that's markdown formatting, and this will be getting parse as RST in the documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a narrow tidy table with a bunch of key value pairs that contain company information. These key values are organized in connected blocks. I've changed the naming to hopefully make this structure more clear. Let me know if it makes more sense now.

Comment on lines +1021 to +1024
"date_filed": {
"type": "datetime",
"description": "Date filing was submitted.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have a field name that means this, maybe in the context of the FERC XBRL data, which could be re-used here rather than creating a new field name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to find any fields like that, but I can definitely change this to use an existing field if there's one that makes sense.

Comment on lines 1646 to 1649
"former_conformed_name": {
"type": "string",
"description": "Former name of the company.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By analogy with company_name and company_name_raw might this be clearer as company_name_former? What does "conformed" mean in this context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense and I've changed it. I think the conformed_name here comes from the SEC wording, but I don't see any reason to strictly adhere to the terminology they used. @katie-lamb definitely let me know if you're aware of any specific reason to keep conformed in the name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, former_confermed_name comes directly from a field pulled from the 10-K filing. I don't see any reason to keep conformed though. We don't use this field in any of the modeling bits.

Comment on lines 1734 to 1737
"form_type": {
"type": "string",
"description": "Specific version of SEC 10k filed.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we have exhibit_21_version which seems analogous, but there we say version and here we say type -- if both are versions, and this one is specific to the SEC 10K, it seems like it would be more consistent to use version in both cases. So maybe exhibit21_version and sec10k_version

In our naming conventions we also use the _type suffix to indicate "a human readable category from a well defined list of values"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, fixed.

Comment on lines +4341 to +4344
"state_of_incorporation": {
"type": "string",
"description": "Two letter state code where company is incorporated.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this / can this be constrained to match only valid state abbreviations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should and maybe can. Is there a place where we're applying such a constraint right now? The base state field doesn't have any constraints on it.

Comment on lines 4387 to 4390
"street_address_2": {
"type": "string",
"description": "Secondary street address.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have an address_2 field defined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -4328,6 +4417,10 @@
"type": "integer",
"description": "Sub-plant ID links EPA CEMS emissions units to EIA units.",
},
"subsidiary": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subsidiary_company_name or at least subsidiary_name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 4907 to 4910
"value": {
"type": "string",
"description": "String value of data point.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very vague without more context. What does it mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I've changed the naming for this table to hopefully make it more clear.

Comment on lines 5044 to 5047
"year_quarter": {
"type": "string",
"description": "Year quarter filing applies to.",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for all other columns with this kind of information in them, we are using report_date -- sometimes with annual and other times with monthly frequency. In the FERC data I think it also describes quarterly data (although we only pull in Q4 information). Can we store a quarter using a date type, which conveys richer information about the contents, and can be used in conjunction with other report date data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to report_date

@zschira
Copy link
Member Author

zschira commented Jan 29, 2025

Do we not need to define the resources that these fields are going to be part of in pudl/metadata/resources/sec10k.py

@zaneselvans yeah I had just accidentally not added the file in my commit, then got pulled away yesterday before I could fix it. Same with removing the deltalake dependency. Those should both be fixed now and I'll work through the comments of the fields this morning.

@katie-lamb
Copy link
Member

I left some responses about naming convention stuff but all the updates seem good to me. @zaneselvans I'll let you weigh in to address requested changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
github-actions Pull requests that update GitHub Actions code metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. sec10k Issues related to SEC 10K filing data. testing Writing tests, creating test data, automating testing, etc.
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants