
Add helper script for quarterlies #4074

Draft: wants to merge 1 commit into base: main

krivard
Contributor

@krivard krivard commented Feb 20, 2025

Overview

Currently just checks what columns were dropped or added; could be expanded to auto-generate candidate updates to column mappings.
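As a rough illustration of the "auto-generate candidate updates" idea, here is a minimal sketch that pairs each removed column with the added columns whose names look most similar. It is not part of this PR: `suggest_renames` and its `cutoff` parameter are hypothetical names, and plain string similarity from the standard library's `difflib` is only one possible heuristic.

```python
import difflib


def suggest_renames(removed, added, cutoff=0.6):
    """Map each removed column to up to three similar-looking added columns.

    Hypothetical helper: the ranking is pure string similarity, so every
    suggestion still needs a human to confirm it is a real rename.
    """
    suggestions = {}
    for old in removed:
        # get_close_matches returns the best matches first and drops
        # anything scoring below the cutoff.
        suggestions[old] = difflib.get_close_matches(old, added, n=3, cutoff=cutoff)
    return suggestions


removed = ["Net Generation (MW) from Hydropower and Pumped Storage"]
added = [
    "Net Generation (MW) from Hydropower Excluding Pumped Storage",
    "Net Generation (MW) from Pumped Storage",
    "Net Generation (MW) from Battery Storage",
]
for old, candidates in suggest_renames(removed, added).items():
    print(old)
    for new in candidates:
        print(f"  candidate: {new}")
```

Because the EIA 930 column names share a long common prefix, similarity scores will tend to be high across the board, so the output is best treated as a shortlist to review rather than a mapping to apply blindly.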

What problem does this address?

Quarterly updates for sources that frequently change the layout/schema/spelling of their raw files are tedious. The 2025Q1 update for EIA 930 dropped 9 and added 30 columns. I used a more-horrible version of this script to help identify and organize the necessary changes to column mappings.

What did you change?

  • Added scripts/ folder
  • Added quarterlies-helper.py, intended to be run from the command line
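For readers who want a picture of the command-line shape, the following is a sketch of how such an entry point might be laid out with `argparse`. The four positional arguments mirror the sample invocation in the Testing section; the argument names, the `--input-dir` option and its default, and the `resolve_paths` helper are all assumptions for illustration and may not match the script in this PR.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with four positional arguments (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Compare one raw zip file across two archive versions."
    )
    parser.add_argument("source", help="data source short name, e.g. eia930")
    parser.add_argument("zipfile", help="zip file name present in both versions")
    parser.add_argument("old_version", help="directory name of the older archive")
    parser.add_argument("new_version", help="directory name of the newer archive")
    # Hypothetical option: the real script may locate raw inputs differently.
    parser.add_argument("--input-dir", type=Path, default=Path("../store/pudl_input"))
    return parser


def resolve_paths(args):
    """Return the (old, new) zip paths implied by the parsed arguments."""
    base = args.input_dir / args.source
    return base / args.old_version / args.zipfile, base / args.new_version / args.zipfile


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["eia930", "eia930-2024half2.zip", "10.5281-zenodo.14026427", "10.5281-zenodo.14792697"]
)
old_zip, new_zip = resolve_paths(args)
print(old_zip)
print(new_zip)
```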

Documentation

Make sure to update relevant aspects of the documentation.

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

Sample usage:
$ pudl_datastore
[...]
$ diff -q ../store/pudl_input/eia930/10.5281-zenodo.14026427 ../store/pudl_input/eia930/10.5281-zenodo.14792697 |tail -2
Files ../store/pudl_input/eia930/10.5281-zenodo.14026427/eia930-2024half2.zip and ../store/pudl_input/eia930/10.5281-zenodo.14792697/eia930-2024half2.zip differ
Only in ../store/pudl_input/eia930/10.5281-zenodo.14792697: eia930-2025half1.zip
$ python scripts/quarterlies-helper.py eia930 eia930-2024half2.zip 10.5281-zenodo.14026427 10.5281-zenodo.14792697 |nl
     1	====
     2	eia930-2024half2-balance.csv
     3	Column mismatch
     4	  Removed columns:
     5	    Net Generation (MW) from Hydropower and Pumped Storage
     6	    Net Generation (MW) from Hydropower and Pumped Storage (Adjusted)
     7	    Net Generation (MW) from Hydropower and Pumped Storage (Imputed)
     8	    Net Generation (MW) from Solar
     9	    Net Generation (MW) from Solar (Adjusted)
    10	    Net Generation (MW) from Solar (Imputed)
    11	    Net Generation (MW) from Wind
    12	    Net Generation (MW) from Wind (Adjusted)
    13	    Net Generation (MW) from Wind (Imputed)
    14	  Added columns:
    15	    Net Generation (MW) from Battery Storage
    16	    Net Generation (MW) from Battery Storage (Adjusted)
    17	    Net Generation (MW) from Battery Storage (Imputed)
    18	    Net Generation (MW) from Geothermal
    19	    Net Generation (MW) from Geothermal (Adjusted)
    20	    Net Generation (MW) from Geothermal (Imputed)
    21	    Net Generation (MW) from Hydropower Excluding Pumped Storage
    22	    Net Generation (MW) from Hydropower Excluding Pumped Storage (Adjusted)
    23	    Net Generation (MW) from Hydropower Excluding Pumped Storage (Imputed)
    24	    Net Generation (MW) from Other Energy Storage
    25	    Net Generation (MW) from Other Energy Storage (Adjusted)
    26	    Net Generation (MW) from Other Energy Storage (Imputed)
    27	    Net Generation (MW) from Pumped Storage
    28	    Net Generation (MW) from Pumped Storage  (Adjusted)
    29	    Net Generation (MW) from Pumped Storage (Imputed)
    30	    Net Generation (MW) from Solar with Integrated Battery Storage
    31	    Net Generation (MW) from Solar with Integrated Battery Storage (Imputed)
    32	    Net Generation (MW) from Solar witho Integrated Battery Storage (Adjusted)
    33	    Net Generation (MW) from Solar without Integrated Battery Storage
    34	    Net Generation (MW) from Solar without Integrated Battery Storage (Adjusted)
    35	    Net Generation (MW) from Solar without Integrated Battery Storage (Imputed)
    36	    Net Generation (MW) from Unknown Energy Storage
    37	    Net Generation (MW) from Unknown Energy Storage (Adjusted)
    38	    Net Generation (MW) from Unknown Energy Storage (Imputed)
    39	    Net Generation (MW) from Wind with Integrated Battery Storage
    40	    Net Generation (MW) from Wind with Integrated Battery Storage (Adjusted)
    41	    Net Generation (MW) from Wind with Integrated Battery Storage (Imputed)
    42	    Net Generation (MW) from Wind without Integrated Battery Storage
    43	    Net Generation (MW) from Wind without Integrated Battery Storage (Adjusted)
    44	    Net Generation (MW) from Wind without Integrated Battery Storage (Imputed)
    45	====
    46	eia930-2024half2-interchange.csv
    47	====
    48	eia930-2024half2-subregion.csv
$
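For reviewers who want to reason about the output above without opening the script, here is a minimal sketch of the underlying comparison: read the header row of every CSV in each zip, then take set differences per file. The function names are hypothetical and the real script may differ in details such as encoding handling or how it treats files present in only one archive; the alphabetical sorting is an assumption based on the order seen in the sample output.

```python
import csv
import io
import zipfile


def read_headers(zip_path):
    """Return {member name: header row} for every CSV inside a zip archive."""
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # utf-8-sig strips a leading byte-order mark if one is present.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                headers[name] = next(csv.reader(text), [])
    return headers


def diff_columns(old_cols, new_cols):
    """Return (removed, added) column names, each sorted alphabetically."""
    old_set, new_set = set(old_cols), set(new_cols)
    return sorted(old_set - new_set), sorted(new_set - old_set)


def report(old_zip, new_zip):
    """Build a text report shaped like the sample output, one block per shared CSV."""
    old, new = read_headers(old_zip), read_headers(new_zip)
    lines = []
    for name in sorted(old.keys() & new.keys()):
        lines += ["====", name]
        removed, added = diff_columns(old[name], new[name])
        if removed or added:
            lines.append("Column mismatch")
            lines.append("  Removed columns:")
            lines += [f"    {col}" for col in removed]
            lines.append("  Added columns:")
            lines += [f"    {col}" for col in added]
    return "\n".join(lines)
```

Comparing headers as sets ignores column order and exact duplicates, which seems adequate for spotting renames but would miss a pure reordering.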

To-do list

  • If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows()).
  • Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  • For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
  • Alternatively, run the build-deploy-pudl GitHub Action manually.

Currently just checks what columns were dropped or added.
@zaneselvans
Member

The devtools/ directory is our existing scripty place which this might fit into. It's also overdue for a reorganization, and the code in there might benefit from being pulled into a cli or scripts subpackage under src/pudl, since none of the code in there can be imported or tested elsewhere as it is now.
