Restructure SEC ownership tables #4097

katie-lamb · 2025-03-01T18:46:49Z

Overview

Closes #4086. This is the second SEC table restructuring PR. It restructures the ownership tables built from information pulled from Ex. 21 attachments.

What problem does this address?

Changes the structure of the SEC ownership tables so that they are better normalized and more usable.

What did you change?

core_sec10k__quarterly_exhibit_21_company_ownership

The CIK of the parent company for each subsidiary listed in this table is the CIK associated with the filename that the information is pulled from (the associations between CIK and filename are in the core_sec10k__quarterly_filings table).
Include parent company CIK (and maybe parent company name) in this table
Ensure this is a complete time series of ownership information
Create an ID for each subsidiary.

out_sec10k__parents_and_subsidiaries

Merge company information for the subsidiary and parent onto the core_sec10k__quarterly_exhibit_21_company_ownership table.
Add EIA utility ID?
Ensure this table gives a complete time series of ownership information.
Add on subsidiary_utility_id_eia when the subsidiary doesn't file a 10-K.

Subsidiary to filer association table

Create an association table between subsidiary_company_id_sec10k and central_index_key
Maybe merge this subsidiary ID onto the company information table

Subsidiary to EIA utility association table

Create an association table between subsidiary_company_id_sec10k and utility_id_eia

Questions

The out_sec10k__parents_and_subsidiaries table merges parent company info onto the Ex. 21 ownership information. Is there a good way to reconcile this many-to-many merge into a many-to-one merge at this point? See this comment.
Title case company names? This would better match EIA utility names.
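
One way to at least detect the many-to-many merge raised in the first question is the `validate` argument of pandas' `DataFrame.merge`. The sketch below uses made-up toy frames. The column names mirror ones mentioned in this PR, but the data, the `merge_parent_info` helper, and the deduplication rule (keep the first company info block per CIK) are assumptions for illustration only, not a proposal for how the duplicates should actually be resolved.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy Ex. 21 ownership records: two subsidiaries reported by the same parent.
ownership = pd.DataFrame(
    {
        "parent_company_central_index_key": ["0000000001", "0000000001"],
        "subsidiary_company_name": ["sub a", "sub b"],
    }
)
# Toy company info with two info blocks for the same CIK (the duplicate case).
company_info = pd.DataFrame(
    {
        "central_index_key": ["0000000001", "0000000001"],
        "company_name": ["parent co", "parent company inc"],
    }
)


def merge_parent_info(own: pd.DataFrame, info: pd.DataFrame) -> pd.DataFrame:
    """Merge parent info onto ownership, requiring one info row per CIK."""
    return own.merge(
        info,
        how="left",
        left_on="parent_company_central_index_key",
        right_on="central_index_key",
        validate="many_to_one",  # raises MergeError if CIKs repeat in info
    )


try:
    merge_parent_info(ownership, company_info)
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

# Hypothetical resolution: keep the first info block per CIK, then merge again.
deduped_info = company_info.drop_duplicates(subset="central_index_key", keep="first")
merged = merge_parent_info(ownership, deduped_info)
```

With `validate` in place the merge fails loudly when the right-hand table has repeated keys, so whichever rule is chosen for resolving duplicates has to be applied explicitly before the merge.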

To Do

Apply name cleaning and location cleaning to core tables
Add in row counts for out parents and subsidiaries in dbt seeds
Add dbt schema for out parents and subs
Investigate the linear increase in the number of parent-subsidiary relationships over time
sort out_sec10k__parents_and_subsidiaries by report date? what is standard here?
update row counts
poke more at linear increase in n subsidiaries

Documentation

Make sure to update relevant aspects of the documentation.

Update the release notes: reference the PR and related issues.
Update relevant Data Source jinja templates (see docs/data_sources/templates).
Update relevant table or source description metadata (see src/metadata).
Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows()).
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.

src/pudl/transform/sec10k.py

katie-lamb · 2025-05-23T00:00:57Z

Another look into the linear increase in the number of subsidiary names

This is a follow-up on this comment and chart showing the linear increase in the number of unique subsidiary names, and on @zaneselvans's suggestion to look at the number of names that appear and disappear each year. In the chart below, ignore what happens after 2017, when the count of subsidiary names is no longer accurate because we need to rerun the model.

Here's a chart with three lines:

the number of newly observed subsidiary names per year (blue),
the number of subsidiary names which disappear (and never reappear) per year (orange),
the net change in the number of subsidiary names, i.e. names appearing minus names disappearing (green)

The green line supports the roughly linear increase in the number of subsidiary names that we're seeing. The mean of the green line (net increase/decrease) from 1994-2015 is 9,759, which seems like a good estimate of the slope of that increase.

Code for getting the number of newly observed subsidiary names

import pandas as pd

# pudl_engine is assumed to be an existing SQLAlchemy engine for the PUDL DB
out_df = pd.read_sql("out_sec10k__parents_and_subsidiaries", pudl_engine)
years = sorted(out_df.report_date.dropna().dt.year.unique().tolist())
seen_names = set()
new_names_count_per_year = {}
for year in years:
    # all names observed in this year
    year_names = set(out_df[out_df.report_date.dt.year == year].subsidiary_company_name)
    # count only the names never seen in an earlier year
    new_names_count_per_year[year] = len(year_names - seen_names)
    seen_names |= year_names

Code for getting the number of subsidiary names which disappear per year

# how many names disappear forever each year
def get_year_that_names_disappear(df):
    """Map each year to the set of names that vanish that year and never return.

    Relies on the sorted ``years`` list defined above.
    """
    prev_year_names = set()
    gone_names_per_year = {}
    seen_years = []
    for year in years:
        year_names = set(df[df.report_date.dt.year == year].subsidiary_company_name)
        # names present in the previous year but missing from this one
        gone_names_per_year[year] = prev_year_names - year_names
        # names present this year that were missing from the previous year
        names_that_appear = year_names - prev_year_names
        # a name that reappears did not disappear forever, so remove it
        # from the disappeared set of every earlier year
        for seen_year in seen_years:
            gone_names_per_year[seen_year] -= names_that_appear
        seen_years.append(year)
        prev_year_names = year_names
    # each set now holds only the names that never reappear
    return gone_names_per_year
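
For a quick sanity check of the three chart lines, the same quantities can also be derived from each name's first and last observed year. This is a minimal sketch on made-up toy data, not the code used for the chart. The `name_turnover` helper and the toy names are hypothetical, and it assumes that `report_date` is a datetime column and that a name counts as gone in the year after its final observation.

```python
import pandas as pd

# Toy data: hypothetical subsidiary names observed over three report years.
toy = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            [
                "2000-01-01", "2000-01-01",
                "2001-01-01", "2001-01-01",
                "2002-01-01", "2002-01-01", "2002-01-01",
            ]
        ),
        "subsidiary_company_name": ["a", "b", "b", "c", "a", "c", "d"],
    }
)


def name_turnover(df: pd.DataFrame) -> pd.DataFrame:
    """Count new names, permanently gone names, and their difference per year."""
    obs = df[["report_date", "subsidiary_company_name"]].dropna()
    obs = pd.DataFrame(
        {
            "year": obs["report_date"].dt.year.astype("int64"),
            "name": obs["subsidiary_company_name"],
        }
    )
    years = sorted(int(y) for y in obs["year"].unique())
    span = obs.groupby("name")["year"].agg(first="min", last="max")
    # A name is new in the first year it is observed.
    new = span["first"].value_counts().reindex(years, fill_value=0)
    # A name is gone for good in the observed year following its final
    # observation. Names still present in the last year are never counted.
    next_year = dict(zip(years[:-1], years[1:]))
    gone_year = span["last"].map(next_year).dropna().astype("int64")
    gone = gone_year.value_counts().reindex(years, fill_value=0)
    out = pd.DataFrame({"new": new, "gone": gone})
    out["net"] = out["new"] - out["gone"]
    return out


turnover = name_turnover(toy)
```

In the toy data, name "a" is missing in 2001 but returns in 2002, so it is not counted as gone. Averaging the `net` column over a range of years gives the kind of slope estimate quoted above.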

zaneselvans · 2025-05-26T20:03:44Z

Huh, so it seems as if there really are just a net of ~10K new subsidiary names that appear each year, at least up to like 2009. Interesting.

src/pudl/transform/sec10k.py

zaneselvans · 2025-05-26T21:58:34Z

src/pudl/transform/sec10k.py

+    filer_info_df["subsidiary_company_name"] = _standardize_company_name(
+        filer_info_df["company_name"]
+    )
+    ownership_df["subsidiary_company_name"] = _standardize_company_name(
+        ownership_df["subsidiary_company_name"]
+    )


If I understand correctly, there are potentially many subsidiary companies that show up with multiple different subsidiary company IDs -- since every time they appear as a subsidiary of a different parent company, they get a new ID that includes a different parent company CIK as part of it. If that's true, then these IDs don't really identify subsidiary companies -- they're more like ownership slices of subsidiary companies, and there's no explicit way to aggregate the slices back together to get a more complete picture of the whole subsidiary company.

How does this situation differ for the subset of subsidiaries that are themselves identified as 10K filers? What fraction of subsidiaries show up as owned by multiple parent companies / have ownership fractions less than 1.0?
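
A rough sketch of how the second question could be answered with a groupby. The toy records and the `fraction_owned` column name are assumptions for illustration (the real ownership fraction column may be named differently), and matching subsidiaries on their cleaned name alone is itself an assumption.

```python
import pandas as pd

# Toy ownership records with hypothetical values.
ownership = pd.DataFrame(
    {
        "parent_company_central_index_key": [
            "0000000001",
            "0000000002",
            "0000000001",
            "0000000003",
        ],
        "subsidiary_company_name": ["sub a", "sub a", "sub b", "sub c"],
        "fraction_owned": [0.5, 0.5, 1.0, None],
    }
)

# Number of distinct parent CIKs reporting each subsidiary name.
parents_per_sub = ownership.groupby("subsidiary_company_name")[
    "parent_company_central_index_key"
].nunique()
# Share of subsidiary names that appear under more than one parent.
frac_multi_parent = (parents_per_sub > 1).mean()

# Share of records with a reported ownership fraction below 1.0.
# Records with no reported fraction are excluded rather than treated as 1.0.
reported = ownership["fraction_owned"].dropna()
frac_partial = (reported < 1.0).mean()
```

Running the same two calculations separately on the subsidiaries that are matched to a CIK and on those that are not would show whether the filer subset behaves differently.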

src/pudl/transform/sec10k.py

test/unit/transform/sec10k_test.py

src/pudl/transform/sec10k.py

zaneselvans · 2025-05-26T23:03:30Z

src/pudl/output/sec10k.py

+    # token of the filename.
+    core_sec10k__quarterly_exhibit_21_company_ownership.loc[
+        :, "parent_company_central_index_key"
+    ] = core_sec10k__quarterly_exhibit_21_company_ownership["filename_sec10k"].apply(


This helper function to get the CIK from a filename seems to appear in multiple places -- you might want to factor it out.

katie-lamb · 2025-05-31

Pulled this out into a separate function in the transform.sec10k module, which means that I need to import that function into output.sec10k. This seemed better than adding a very dataset-specific helper function to the helpers.py module.
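
A rough sketch of what a factored-out helper of this kind could look like. The function name, the assumed `edgar/data/<CIK>/<accession number>.txt` filename layout, and the example filename are all assumptions for illustration and are not taken from the PUDL code. The zero-padding to ten digits follows the SEC convention for CIKs.

```python
def cik_from_filename(filename: str) -> str:
    """Return the zero-padded 10-digit CIK from an EDGAR-style filename.

    Assumes the layout ``edgar/data/<CIK>/<accession number>.txt``; the real
    filename format should be checked before relying on this.
    """
    parts = filename.split("/")
    if len(parts) != 4 or parts[:2] != ["edgar", "data"] or not parts[2].isdigit():
        raise ValueError(f"Unexpected filename format: {filename!r}")
    # The CIK is the third path token; pad it to the standard 10 digits.
    return parts[2].zfill(10)


example_cik = cik_from_filename("edgar/data/12345/0000012345-99-000001.txt")
```

Raising on an unexpected layout, rather than returning a missing value, keeps malformed filenames from silently producing rows with no parent CIK.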

src/pudl/output/sec10k.py

katie-lamb · 2025-05-27T16:13:32Z

Summary of remaining questions/changes where the solution is still ambiguous:

Here we're assuming that the filer company is the parent company associated with an attached Ex. 21
- Write more documentation of this assumption/decision including referencing the 10-K general instructions?
- Poke into why/when multiple company info blocks show up in a filing, i.e. when a group of companies are contained in one filing, are they wholly owned subsidiaries of the filer?
Here we discuss how to resolve duplicates when multiple company info blocks are associated with a filer's CIK.
Here we discuss the limitations of subsidiary_company_id_sec10k and how to interpret the match between subsidiaries and their CIKs (if they file a 10-K).

katie-lamb added 12 commits February 23, 2025 10:18

pad cik in quarterly filings table

775fc0b

update core company information table to be a raw table

5e5f5f9

add core company information table

54f7aea

fix fields md

0c4cd5f

update row counts of core company info table

117f928

wipe alembic migrations

fe9b1cc

add company information output table

38e76d5

add migration for output table

767a353

add changelog of sec company names table

f10d62d

Merge branch 'main' into sec-table-restructuring

02f5d33

fix migrations

c1d3768

update core ownership table

142b939

katie-lamb added the sec10k Issues related to SEC 10K filing data. label Mar 1, 2025

katie-lamb self-assigned this Mar 1, 2025

github-project-automation bot added this to Catalyst Megaproject Mar 1, 2025

github-project-automation bot moved this to New in Catalyst Megaproject Mar 1, 2025

zaneselvans and others added 14 commits March 3, 2025 22:04

Fix alembic migration for SEC 10-K tables.

7d86dff

Merge branch 'main' into sec-table-restructuring

c09c3ee

Update conda lockfiles.

174d95f

Sort row count seeds to minimize diffs.

42be34b

fix ownership table bugs

a183f86

clean up core table columns

016f4a3

Merge branch 'main' into sec-table-restructuring

be27371

Merge branch 'sec-table-restructuring' into sec-ownership-restructuring

43c5282

fix core table zip code errors

8947e5a

fix alembic migrations and remove ins from asset definitions

32f7c96

small fixes to clean up core table fields

b694694

schematic changes to core information tables

458625d

fix schema in core table

ae25019

add regex constraint on exhibit 21 version

57e6615

katie-lamb added 10 commits May 7, 2025 10:16

update field types and make fraction owned source specific

399d453

Merge branch 'main' into sec-ownership-restructuring

5cb55da

move name cleaning to instantiate within function

855c5dc

remove upstream core table dependency from core ex21 table

eabf942

Merge branch 'main' into sec-ownership-restructuring

4f2fbbd

update subs and filers assn table

3f1a3c7

simplify match function and add unit test

00b8b2e

fixes to output table

c39bd9b

Merge branch 'main' into sec-ownership-restructuring

94dc677

Merge branch 'main' into sec-ownership-restructuring

d7beccc

katie-lamb commented May 15, 2025

fix docs build

6f4ad97

Merge branch 'main' into sec-ownership-restructuring

0c857b8

zaneselvans added 4 commits May 26, 2025 14:14

Bring in conda lockfiles from main.

49eb3b9

Merge branch 'main' into sec-ownership-restructuring

0a76d09

Pull in pyproject.toml from main

f242ce3

Minor docstring, type annotation, readability tweaks.

0d40533

zaneselvans reviewed May 27, 2025

zaneselvans and others added 9 commits May 27, 2025 15:13

Merge branch 'main' into sec-ownership-restructuring

99de8b2

Merge branch 'main' into sec-ownership-restructuring

bada50a

make filename and cik the pk of company info tables

08042a2

Merge branch 'main' into sec-ownership-restructuring

47411d7

update unit test name

b72e78f

small fixes and merge validations

c39971c

update merge in transform

35fc146

Merge branch 'main' into sec-ownership-restructuring

b5ae913

update migrations

b785957
