Restructure SEC ownership tables #4097

Open
wants to merge 146 commits into
base: main

Conversation

katie-lamb
Member

@katie-lamb katie-lamb commented Mar 1, 2025

Overview

Closes #4086. This is the second SEC table restructuring PR: it restructures the ownership tables using information pulled from Ex. 21 attachments.

What problem does this address?

Changes the SEC ownership table structures so that they are better normalized and more usable.

What did you change?

core_sec10k__quarterly_exhibit_21_company_ownership

  • The CIK of the parent company for each subsidiary listed in this table is the CIK associated with the filename that the information was pulled from (the associations between CIK and filename are in the core_sec10k__quarterly_filings table).
  • Include parent company CIK (and maybe parent company name) in this table
  • Ensure this is a complete time series of ownership information
  • Create an ID for each subsidiary.
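A minimal sketch of the first and last bullets above, assuming the filings table maps each filename_sec10k to exactly one central_index_key. The toy values and the ID format (parent CIK joined to the subsidiary name) are hypothetical illustrations, not the PR's implementation.

```python
import pandas as pd

# Toy stand-in for core_sec10k__quarterly_filings (all values are made up).
filings = pd.DataFrame({
    "filename_sec10k": ["edgar/data/111/a.txt", "edgar/data/222/b.txt"],
    "central_index_key": ["0000000111", "0000000222"],
})
# Toy stand-in for the Ex. 21 ownership table.
ownership = pd.DataFrame({
    "filename_sec10k": [
        "edgar/data/111/a.txt",
        "edgar/data/111/a.txt",
        "edgar/data/222/b.txt",
    ],
    "subsidiary_company_name": ["acme power llc", "acme gas inc", "acme power llc"],
})

# Each filename should belong to one filer, so validate a many-to-one merge.
ownership = ownership.merge(
    filings.rename(columns={"central_index_key": "parent_company_central_index_key"}),
    on="filename_sec10k",
    how="left",
    validate="m:1",
)
# Hypothetical ID format: parent CIK joined to the subsidiary name.
ownership["subsidiary_company_id_sec10k"] = (
    ownership["parent_company_central_index_key"]
    + "_"
    + ownership["subsidiary_company_name"]
)
```

With an ID built this way, the same subsidiary name listed under two different parents gets two different IDs.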

out_sec10k__parents_and_subsidiaries

  • Merge company information for the subsidiary and parent onto the core_sec10k__quarterly_exhibit_21_company_ownership table.
  • Add EIA utility ID?
  • Ensure this table gives a complete time series of ownership information.
  • Add on subsidiary_utility_id_eia when the subsidiary doesn't file a 10-K.
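A rough sketch of the merge described in the first bullet above, on toy data. The helper add_company_info, the subsidiary_company_central_index_key column, and the *_filer_company_name output columns are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy ownership records; the second subsidiary does not file a 10-K.
ownership = pd.DataFrame({
    "parent_company_central_index_key": ["111", "222"],
    "subsidiary_company_central_index_key": ["333", None],
    "subsidiary_company_name": ["acme power llc", "beta gas inc"],
})
# Toy company information, one row per filer.
company_info = pd.DataFrame({
    "central_index_key": ["111", "222", "333"],
    "company_name": ["acme holdings", "beta holdings", "acme power llc"],
})


def add_company_info(df, info, role):
    """Left-merge filer info onto df for the given role (parent or subsidiary)."""
    renamed = info.rename(columns={
        "central_index_key": f"{role}_company_central_index_key",
        "company_name": f"{role}_filer_company_name",
    })
    # validate="m:1" raises a MergeError if the info table has duplicate keys.
    return df.merge(
        renamed,
        on=f"{role}_company_central_index_key",
        how="left",
        validate="m:1",
    )


out = add_company_info(ownership, company_info, "parent")
out = add_company_info(out, company_info, "subsidiary")
```

Because the merge is validated as many-to-one, duplicate company info rows for a CIK would fail loudly instead of silently multiplying ownership records.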

Subsidiary to filer association table

  • Create an association table between subsidiary_company_id_sec10k and central_index_key
  • Maybe merge this subsidiary ID onto the company information table

Subsidiary to EIA utility association table

  • Create an association table between subsidiary_company_id_sec10k and utility_id_eia

Questions

  • The out_sec10k__parents_and_subsidiaries merges parent company info onto the Ex. 21 ownership information. Is there a good way to reconcile this many to many merge to be many to one at this point? See this comment.
  • Title case company names? This would better match EIA utility names

To Do

  • Apply name cleaning and location cleaning to core tables
  • Add in row counts for out parents and subsidiaries in dbt seeds
  • Add dbt schema for out parents and subs
  • Investigate the linear increase in the number of parent-subsidiary relationships over time
  • Sort out_sec10k__parents_and_subsidiaries by report date? What is standard here?
  • Update row counts
  • Poke more at the linear increase in the number of subsidiaries

Documentation

Make sure to update relevant aspects of the documentation.

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

  • If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows()).
  • Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  • For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
  • Alternatively, run the build-deploy-pudl GitHub Action manually.

@katie-lamb katie-lamb added the sec10k Issues related to SEC 10K filing data. label Mar 1, 2025
@katie-lamb katie-lamb self-assigned this Mar 1, 2025
@katie-lamb
Member Author

katie-lamb commented May 23, 2025

Another look into the linear increase in the number of subsidiary names

This is a follow-up to this comment and chart showing the linear increase in the number of unique subsidiary names, and to @zaneselvans's suggestion to look at the number of names that appear and disappear per year. In the chart below, ignore what happens after 2017, where the count of subsidiary names is no longer accurate because we need to rerun the model.

Here's a chart with three lines:

  • the number of newly observed subsidiary names per year (blue),
  • the number of subsidiary names which disappear (and never reappear) per year (orange),
  • the net change per year, i.e. the number of appearing names minus the number of disappearing names (green)

[Chart: sub_changes_per_year]

The green line supports the roughly linear increase in the number of subsidiary names that we're seeing. The mean of the green line (net increase/decrease) from 1994-2015 is 9,759, which seems like a good estimate of the slope of the increase in the number of subsidiary names.
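As a small illustration of how the green line and its mean could be derived, here is a sketch that assumes the per-year counts of new and disappeared names are available as dictionaries keyed by year. The numbers are made up and the variable names are hypothetical.

```python
import pandas as pd

# Made-up per-year counts, for illustration only.
new_per_year = {1994: 120, 1995: 150, 1996: 140}
gone_per_year = {1994: 20, 1995: 60, 1996: 30}

# Net change per year: names that appear minus names that disappear.
net_per_year = pd.Series(new_per_year) - pd.Series(gone_per_year)

# Mean net change over a label-based (inclusive) range of years.
mean_net = net_per_year.loc[1994:1996].mean()
```

If the growth really is linear, this mean approximates the slope of the total number of names over time.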

Code for getting the number of newly observed subsidiary names

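import pandas as pd

# pudl_engine is assumed to be a SQLAlchemy engine connected to the PUDL database.
out_df = pd.read_sql("out_sec10k__parents_and_subsidiaries", pudl_engine)
years = sorted(out_df.report_date.dropna().dt.year.unique().tolist())
seen_names = set()
new_names_count_per_year = {}
for year in years:
    # All names reported in this year, whether or not they were seen before.
    year_names = set(out_df[out_df.report_date.dt.year == year].subsidiary_company_name)
    # Count only the names that never appeared in an earlier year.
    new_names_count_per_year[year] = len(year_names - seen_names)
    seen_names |= year_names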

Code for getting the number of subsidiary names which disappear per year

# How many names disappear forever each year?
def get_year_that_names_disappear(df):
    """Map each year to the names present the previous year, absent that year, and never seen again."""
    years = sorted(df.report_date.dropna().dt.year.unique().tolist())
    prev_year_names = set()
    gone_names_per_year = {}
    seen_years = []
    for year in years:
        year_names = set(df[df.report_date.dt.year == year].subsidiary_company_name)
        # Names that were present in the previous year but are absent this year.
        gone_names_per_year[year] = prev_year_names - year_names
        # Names that are present this year but were absent in the previous year.
        names_that_appear = year_names - prev_year_names
        # A name that reappears did not disappear forever, so remove it from
        # every earlier year's set of disappeared names.
        for seen_year in seen_years:
            gone_names_per_year[seen_year] = gone_names_per_year[seen_year] - names_that_appear
        seen_years.append(year)
        prev_year_names = year_names
    # Each set now holds only the names that disappeared and never reappeared.
    return gone_names_per_year
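The same two quantities can also be sketched without explicit loops by taking the first and last year in which each name is observed. This is an alternative formulation on toy data, not the code used for the chart. Note that it keys disappearances by the last year a name was seen, whereas the loop above keys them by the following year.

```python
import pandas as pd

# Toy data: "a" appears every year, "b" skips 1995, "c" appears only in 1995.
df = pd.DataFrame({
    "report_date": pd.to_datetime([
        "1994-12-31", "1995-12-31", "1996-12-31",
        "1994-12-31", "1996-12-31", "1995-12-31",
    ]),
    "subsidiary_company_name": ["a", "a", "a", "b", "b", "c"],
})

df = df.assign(year=df.report_date.dt.year)
# First and last year in which each name is observed.
span = df.groupby("subsidiary_company_name")["year"].agg(
    first_year="min", last_year="max"
)

# Newly observed names, counted in the first year they appear.
new_per_year = span["first_year"].value_counts().sort_index()

# Names that disappear forever, counted in the last year they were seen.
# Names still present in the final year have not disappeared.
final_year = df["year"].max()
last_seen_per_year = (
    span.loc[span["last_year"] < final_year, "last_year"].value_counts().sort_index()
)
```

In the toy data, "b" is not counted as disappearing even though it is missing in 1995, because it reappears in 1996.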

@zaneselvans
Member

Huh, so it seems as if there really are just a net of ~10K new subsidiary names that appear each year, at least up to like 2009. Interesting.

Comment on lines +235 to +240
filer_info_df["subsidiary_company_name"] = _standardize_company_name(
    filer_info_df["company_name"]
)
ownership_df["subsidiary_company_name"] = _standardize_company_name(
    ownership_df["subsidiary_company_name"]
)
Member

If I understand correctly, there are potentially many subsidiary companies that show up with multiple different subsidiary company IDs, since every time they appear as a subsidiary of a different parent company, they get a new ID that includes that parent company's CIK. If that's true, then these IDs don't really identify subsidiary companies. They're more like ownership slices of subsidiary companies, and there's no explicit way to aggregate the slices back together to get a more complete picture of the whole subsidiary company.

How does this situation differ for the subset of subsidiaries that are themselves identified as 10K filers? What fraction of subsidiaries show up as owned by multiple parent companies / have ownership fractions less than 1.0?
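One rough way to start answering the second question, sketched on toy data. It groups by subsidiary name, which is only an approximation of company identity, and it assumes a fraction_owned column that does not appear in this thread.

```python
import pandas as pd

# Toy ownership records (all values are made up).
ownership = pd.DataFrame({
    "subsidiary_company_name": [
        "acme power llc", "acme power llc", "beta gas inc", "gamma wind lp",
    ],
    "parent_company_central_index_key": ["111", "222", "111", "333"],
    # Hypothetical column; missing values mean no fraction was reported.
    "fraction_owned": [0.5, 0.5, 1.0, None],
})

# Number of distinct parent CIKs reported for each subsidiary name.
n_parents = ownership.groupby("subsidiary_company_name")[
    "parent_company_central_index_key"
].nunique()
frac_multi_parent = (n_parents > 1).mean()

# Share of records with a reported ownership fraction below 1.0.
# Missing fractions compare as False, so they are not counted as partial.
frac_partial = (ownership["fraction_owned"] < 1.0).mean()
```

On real data it would probably make sense to do this within each report year, so that a subsidiary changing hands over time is not counted as having multiple simultaneous parents.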

# token of the filename.
core_sec10k__quarterly_exhibit_21_company_ownership.loc[
    :, "parent_company_central_index_key"
] = core_sec10k__quarterly_exhibit_21_company_ownership["filename_sec10k"].apply(
Member

This helper function to get the CIK from a filename seems to appear in multiple places -- you might want to factor it out.

Member Author

Pulled this out into a separate function in the transform.sec10k module, which means that I need to import that function into output.sec10k. This seemed better than adding a very dataset-specific helper function to the helpers.py module.
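For illustration, a factored-out helper might look something like the sketch below. The function name, the assumed filename layout (edgar/data/<CIK>/<accession>.txt), and the zero-padding to 10 digits are all assumptions, not the function that was actually added to transform.sec10k.

```python
import pandas as pd


def cik_from_filename(filename: str) -> str:
    """Hypothetical helper: pull the CIK out of an EDGAR-style path.

    Assumes the CIK is the third "/"-separated token of the filename and
    that CIKs are stored as zero-padded 10-character strings.
    """
    return filename.split("/")[2].zfill(10)


# Made-up filenames following the assumed layout.
filenames = pd.Series([
    "edgar/data/1234/0001234-95-000001.txt",
    "edgar/data/56789/0000056789-96-000002.txt",
])
parent_ciks = filenames.apply(cik_from_filename)
```

Keeping the parsing in one place means the transform and output modules cannot drift apart in how they interpret the filename.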

@katie-lamb
Member Author

katie-lamb commented May 27, 2025

Summary of remaining questions/changes where the solution is still ambiguous:

  • Here we're assuming that the filer company is the parent company associated with an attached Ex. 21
    • Write more documentation of this assumption/decision including referencing the 10-K general instructions?
    • Poke into why/when multiple company info blocks show up in a filing, i.e. when a group of companies are contained in one filing, are they wholly owned subsidiaries of the filer?
  • Here we discuss what to do if there are multiple company info blocks associated with a filer's CIK: how do we resolve the duplicates?
  • Here we discuss the limitations of subsidiary_company_id_sec10k and how to interpret the match between subsidiaries and their CIKs (if they file a 10-K)

Labels
schema-change Used to label any PR that changes table or column names, what columns appear in a table, etc. sec10k Issues related to SEC 10K filing data.
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Restructure SEC company ownership tables
3 participants