Restructure SEC ownership tables #4097
Another look into the linear increase in the number of subsidiary names
This is a follow-up on this comment and chart showing the linear nature of the increase in the number of unique subsidiary names, and on @zaneselvans's suggestion to look at the number of new names that appear and disappear per year. In the chart below, ignore what happens after 2017, when the number of unique subsidiary names is no longer accurate because we need to rerun the model. Here's a chart with three lines:
The green line supports the roughly linear increase in the number of subsidiary names that we're seeing. The mean of the green line (net increase/decrease) from 1994-2015 is 9,759, which seems like a good estimate of the slope of the increase in the number of subsidiary names.
Code for getting the number of newly observed subsidiary names
Code for getting the number of subsidiary names which disappear per year
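The code behind the two headings above did not survive on this page. As a rough sketch only, and not the code that produced the chart, the per-year counts could be computed along these lines. The function name and the `report_year` column are assumptions, and so are the definitions used here: a name counts as "new" in the first year it is observed and as "disappeared" in the year after it is last observed.

```python
import pandas as pd


def yearly_name_turnover(
    df: pd.DataFrame,
    name_col: str = "subsidiary_company_name",
    year_col: str = "report_year",  # assumed column name
) -> pd.DataFrame:
    """Count names that first appear / stop appearing in each year (hypothetical helper)."""
    by_name = df.groupby(name_col)[year_col]
    first_seen = by_name.min()
    last_seen = by_name.max()

    # A name is "new" in the first year it shows up.
    new_names = first_seen.value_counts()
    # A name "disappears" in the year after its last observation.
    disappeared = (last_seen + 1).value_counts()

    out = (
        pd.DataFrame({"new_names": new_names, "disappeared_names": disappeared})
        .fillna(0)
        .astype(int)
        .sort_index()
    )
    # Drop the year past the end of the data, where every remaining name
    # would otherwise be counted as having disappeared.
    out = out.loc[out.index <= df[year_col].max()]
    out["net_change"] = out["new_names"] - out["disappeared_names"]
    return out


example = pd.DataFrame(
    {
        "subsidiary_company_name": ["a", "a", "b", "c", "c"],
        "report_year": [2000, 2001, 2001, 2000, 2002],
    }
)
turnover = yearly_name_turnover(example)
```

Under these definitions the first year of data counts every name as new, so the earliest year should be read with the same caution as the post-2017 years. A name with a gap in its reporting years (like `c` in the example) is not counted as disappearing during the gap.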
Huh, so it seems as if there really are just a net of ~10K new subsidiary names that appear each year, at least up to like 2009. Interesting.
filer_info_df["subsidiary_company_name"] = _standardize_company_name(
    filer_info_df["company_name"]
)
ownership_df["subsidiary_company_name"] = _standardize_company_name(
    ownership_df["subsidiary_company_name"]
)
If I understand correctly, there are potentially many subsidiary companies that show up with multiple different subsidiary company IDs, since every time they appear as a subsidiary of a different parent company, they get a new ID that includes that parent company's CIK as part of it. If that's true, then these IDs don't really identify subsidiary companies. They're more like ownership slices of subsidiary companies, and there's no explicit way to aggregate the slices back together to get a more complete picture of the whole subsidiary company.
How does this situation differ for the subset of subsidiaries that are themselves identified as 10K filers? What fraction of subsidiaries show up as owned by multiple parent companies / have ownership fractions less than 1.0?
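One way to put numbers on the last question is sketched below. This is only an illustration: the function name is hypothetical, the `fraction_owned` column name is an assumption, and grouping on the standardized subsidiary name (rather than the per-parent subsidiary ID) is a choice made here so that the same company under different parents lands in one group.

```python
import pandas as pd


def multi_parent_summary(
    ownership: pd.DataFrame,
    name_col: str = "subsidiary_company_name",
    parent_col: str = "parent_company_central_index_key",
    frac_col: str = "fraction_owned",  # assumed column name
) -> dict[str, float]:
    """Share of subsidiary names with >1 parent or with any ownership fraction < 1.0."""
    grouped = ownership.groupby(name_col)
    n_parents = grouped[parent_col].nunique()
    # min() skips missing values; a name with no reported fraction yields NaN,
    # and NaN < 1.0 is False, so it is not counted as partially owned.
    min_fraction = grouped[frac_col].min()
    return {
        "frac_multi_parent": float((n_parents > 1).mean()),
        "frac_partially_owned": float((min_fraction < 1.0).mean()),
    }


example = pd.DataFrame(
    {
        "subsidiary_company_name": ["a", "a", "b", "c", "d"],
        "parent_company_central_index_key": ["1", "2", "1", "3", "2"],
        "fraction_owned": [0.5, 0.5, 1.0, None, 0.8],
    }
)
summary = multi_parent_summary(example)
```

Running the same summary separately on the subsidiaries that match a 10-K filer and on those that do not would address the first question as well.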
src/pudl/output/sec10k.py
# token of the filename.
core_sec10k__quarterly_exhibit_21_company_ownership.loc[
    :, "parent_company_central_index_key"
] = core_sec10k__quarterly_exhibit_21_company_ownership["filename_sec10k"].apply(
This helper function to get the CIK from a filename seems to appear in multiple places -- you might want to factor it out.
Pulled this out into a separate function in the transform.sec10k module, which means that I need to import that function into output.sec10k. This seemed better than adding a very dataset-specific helper function into the helpers.py module.
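For readers following along, a factored-out helper of this kind might look roughly like the sketch below. It is not the function from the PR: the name `cik_from_filename` is hypothetical, and the filename layout (`edgar/data/<CIK>/<accession>.txt`, with the `edgar/data/` prefix optional) is an assumption, since the diff snippet above is cut off before it shows which token holds the CIK.

```python
import re
from typing import Optional

import pandas as pd

# ASSUMPTION: the CIK is the numeric path segment that follows an optional
# "edgar/data/" prefix. The real filename layout is not shown in this thread.
_CIK_PATTERN = re.compile(r"^(?:edgar/data/)?(\d+)/")


def cik_from_filename(filename: str) -> Optional[str]:
    """Return the CIK segment of a filing filename, or None if it doesn't match."""
    match = _CIK_PATTERN.match(filename)
    return match.group(1) if match else None


filenames = pd.Series(
    [
        "edgar/data/1000229/0001000229-94-000010.txt",
        "1000229/0001000229-94-000010.txt",
    ]
)
ciks = filenames.apply(cik_from_filename)
```

Keeping the pattern in one module-level constant means the transform and output modules share a single definition of where the CIK lives in the filename.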
Summary of remaining questions/changes where the solution is still ambiguous:
Overview
Closes #4086. This is the second SEC table restructuring PR. It restructures the ownership tables using information pulled from Ex. 21 attachments.
What problem does this address?
Restructures the SEC ownership tables so that they are better normalized and easier to use.
What did you change?
core_sec10k__quarterly_exhibit_21_company_ownership
out_sec10k__parents_and_subsidiaries
subsidiary_utility_id_eia when the subsidiary doesn't file a 10-K.
Subsidiary to filer association table
subsidiary_company_id_sec10k and central_index_key
Subsidiary to EIA utility association table
subsidiary_company_id_sec10k and utility_id_eia
Questions
out_sec10k__parents_and_subsidiaries merges parent company info onto the Ex. 21 ownership information. Is there a good way to reconcile this many-to-many merge to be many-to-one at this point? See this comment.
To Do
out_sec10k__parents_and_subsidiaries by report date? What is standard here?
Documentation
Make sure to update relevant aspects of the documentation.
docs/data_sources/templates).
src/metadata).
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list
test_minmax_rows()).
make pytest-coverage locally to ensure that the merge queue will accept your PR.
make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
make pytest-validate.
build-deploy-pudl GitHub Action manually.