Skip to content

faster file list ingestion for flat lists #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lgray
Copy link
Contributor

@lgray lgray commented Nov 27, 2019

Significant improvement 20-50% reduction in the time needed to ingest flat lists of files. The improvement is larger for larger lists, which we are likely to encounter from people transitioning from classical analysis platforms.

@lgray
Copy link
Contributor Author

lgray commented Dec 3, 2019

Any comments on this one? It's a noticeable improvement for people who are using it.

@PerilousApricot
Copy link
Collaborator

PerilousApricot commented Dec 3, 2019 via email

@PerilousApricot
Copy link
Collaborator

PerilousApricot commented Dec 3, 2019 via email

@lgray
Copy link
Contributor Author

lgray commented Dec 3, 2019

OK - I'll try out the second one. The first can lead to incorrect behavior in the case that people have multiple datasets in the same directory and keep track of the files with a json (yes people do this).

@lgray lgray force-pushed the ux/light_file_ingestion branch from 70767c6 to 447af06 Compare December 5, 2019 20:25
@lgray
Copy link
Contributor Author

lgray commented Dec 5, 2019

I'll test it on the cluster once we are done with upgrading arrow/numpy/etc there, and let you know the outcomes.

@lgray
Copy link
Contributor Author

lgray commented Dec 5, 2019

Figured out a better implementation, as well.

@lgray
Copy link
Contributor Author

lgray commented Dec 5, 2019

Or so I thought. Not as easy to turn into a flat map as thought.

@lgray
Copy link
Contributor Author

lgray commented Dec 10, 2019

OK - finally got to this - the bulk listStatus API is not faster than calling list status in a for loop. Looking into it, listStatus(Path[]) is just implemented as a for loop on listStatus(Path), and so hits the issues which led me to just copy paths that end in .root.

I'm gonna back out those changes, just ingesting things that end in .root directly and expanding paths, and keep the loop inside the interface you wanna present.

This OK with you?

@PerilousApricot
Copy link
Collaborator

PerilousApricot commented Dec 10, 2019 via email

@lgray
Copy link
Contributor Author

lgray commented Dec 10, 2019

OK - I have to deal with cases where files in multiple datasets are in the same directory path, but that seems manageable from the provided interface. Set intersection should be enough, if I can glob largest overlapping string that can work too. Will see what shakes out.

PerilousApricot added a commit that referenced this pull request Apr 7, 2020
Remove all uses of old path expansion implementation (expandPathToList)

Fixes #81 #77
PerilousApricot added a commit that referenced this pull request Apr 7, 2020
Remove all uses of old path expansion implementation (expandPathToList)

Fixes #81 #77
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants