Import script should follow more of a pipeline style #669

@Mr0grog

The import script has gotten pretty crazy and messy over time, and we could remove a lot of the complexity. Some of it is there because it’s taken us a while to learn the ins and outs of the Wayback APIs and their peculiarities, and some is there just because something was expedient at the time.

A while back, I had a bunch of ideas about how this script could be clearer and more pipeline-y, with a series of generator-based tasks that run on threads connected by FiniteQueue. I’ve played that out somewhat in the task sheets script. We sort of do that here, but various filtering, summarization, and error-handling bits that should be separate workflow items are mixed together, and what’s actually happening isn’t always clear.
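To make the pipeline idea concrete, here is a rough sketch of generator-based tasks running on threads and connected by queues. It is not the import script's actual code: `EndableQueue` is a hypothetical stand-in for FiniteQueue (whose real interface may differ), and `only_ok`, `summarize`, and the record fields are made up for illustration.

```python
import queue
import threading

_END = object()  # sentinel marking the end of a stream


class EndableQueue(queue.Queue):
    """Hypothetical stand-in for FiniteQueue: a queue that can signal 'no more items'."""

    def end(self):
        self.put(_END)

    def __iter__(self):
        while True:
            item = self.get()
            if item is _END:
                # Re-post the sentinel so any other consumer also stops.
                self.put(_END)
                return
            yield item


def run_stage(task, source, output):
    """Run a generator-based task on its own thread, feeding results to `output`."""
    def worker():
        try:
            for result in task(source):
                output.put(result)
        finally:
            # Always close the output so downstream stages can finish.
            output.end()

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


# Each workflow item is a plain generator with one job (names are illustrative).
def only_ok(records):
    for record in records:
        if record['status'] == 200:
            yield record


def summarize(records):
    for record in records:
        yield f"{record['url']} ({record['status']})"


def run_pipeline(records):
    filtered = EndableQueue()
    summaries = EndableQueue()
    threads = [
        run_stage(only_ok, records, filtered),
        run_stage(summarize, filtered, summaries),
    ]
    results = list(summaries)  # drains until the final stage ends its queue
    for thread in threads:
        thread.join()
    return results
```

With this shape, filtering, summarization, and error handling can each live in their own stage, and the wiring in `run_pipeline` reads as a description of what actually happens.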

(There might also be some better tools for this now. Things like Databay and Prefect either didn’t exist or I didn’t know about them at the time. Bonobo looked to be in a messy total rewrite and didn’t have some of the facilities we needed, but may be better now.)

This probably isn’t high-priority enough to fit on the 2020 roadmap, but would be some nice cleanup to do if there’s time.
