Import script should follow more of a pipeline style #669

The import script has gotten messy and complicated over time, and we could remove a lot of that complexity. Some of it is there because it’s taken us a while to learn the ins and outs of the Wayback APIs and their peculiarities; the rest is there because something was expedient at the time.

A while back, I had a bunch of ideas about how this script could be clearer and more pipeline-y, with a series of generator-based tasks that run on threads and are connected by `FiniteQueue` instances. I’ve played that out somewhat in the [task sheets script](https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/683976bb50bd939bee7caaf2bda5d7ac01c40a74/generate_task_sheets.py#L167-L198). We sort of do that here, but the filtering, summarization, and error-handling bits that should be separate workflow items are mixed together, and what’s actually happening isn’t always clear.
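
To make the idea concrete, here is a rough sketch of what "generator-based tasks on threads connected by queues" could look like. Everything in it is an assumption made for illustration: `ClosableQueue` is a hypothetical stand-in, not the real `FiniteQueue` API, and the stage names and record fields (`status`, `url`, `body`) are made up rather than taken from the import script.

```python
import queue
import threading

_END = object()  # sentinel marking the end of a stream


class ClosableQueue(queue.Queue):
    """Hypothetical stand-in for FiniteQueue: iterable until end() is called."""

    def end(self):
        self.put(_END)

    def __iter__(self):
        while True:
            item = self.get()
            if item is _END:
                return
            yield item


def run_stage(task, source, sink):
    """Run one generator-based task on its own thread, feeding `sink`."""
    def work():
        try:
            for result in task(source):
                sink.put(result)
        finally:
            # Always signal downstream, even if the task raised.
            sink.end()

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


# Each concern is its own small stage (both are illustrative).
def keep_successful(records):
    for record in records:
        if record.get('status') == 200:
            yield record


def summarize(records):
    for record in records:
        yield {'url': record['url'], 'length': len(record['body'])}


def run_pipeline(records):
    raw, filtered, summarized = ClosableQueue(), ClosableQueue(), ClosableQueue()
    threads = [
        run_stage(keep_successful, raw, filtered),
        run_stage(summarize, filtered, summarized),
    ]
    for record in records:
        raw.put(record)
    raw.end()

    results = list(summarized)  # blocks until the last stage ends its queue
    for thread in threads:
        thread.join()
    return results
```

The point of this shape is that filtering, summarizing, and error handling would each live in a separate generator, and the wiring in `run_pipeline` reads top to bottom as the workflow. A real version would also need a deliberate policy for errors inside a stage, which this sketch does not attempt.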

(There might also be some better tools for this now. Things like [Databay](https://github.com/Voyz/databay) and [Prefect](https://www.prefect.io/) either didn’t exist or I didn’t know about them at the time. [Bonobo](https://www.bonobo-project.org/) looked to be in a messy total rewrite and didn’t have some of the facilities we needed, but may be better now.)

This probably isn’t high-priority enough to fit on the [2020 roadmap](https://github.com/edgi-govdata-archiving/web-monitoring/issues/158), but it would be nice cleanup to do if there’s time.
