faster file list ingestion for flat lists #77
b652adb to d1734ea
Any comments on this one? It's a noticeable improvement for people who are using it.
Instead of your patch, please push the loop at
https://github.com/spark-root/laurelin/blob/e20d8dafabed60443416620e405317cf24867b92/src/main/java/edu/vanderbilt/accre/laurelin/Root.java#L257
into IOFactory.expandPathToList() and do the `fileSystem.listFiles` for the
parent directory(ies) in the hadoop implementation to get the file list.
Also, can you get your editor set up to do the right indentation?
Or, actually, use the bulk listStatus API
https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path[])
OK - I'll try out the second one. The first can lead to incorrect behavior when people keep multiple datasets in the same directory and track the files with a JSON file (yes, people do this).
70767c6 to 447af06
I'll test it on the cluster once we are done upgrading arrow/numpy/etc. there, and let you know the outcome.
Figured out a better implementation, as well.
Or so I thought. It's not as easy to turn into a flat map as I expected.
OK - finally got to this. The bulk listStatus API is not faster than calling listStatus in a for loop. Looking into it, listStatus(Path[]) is just implemented as a for loop over listStatus(Path), and so hits the issues which led me to just copy paths that end in […]. I'm gonna back out those changes and just ingest things that end in […]. Is this OK with you?
No, it needs to be done right and not as a hack, which we agreed to not do.
If that particular interface is implemented weird, then do listStatus on
the (deduped) parent, and use that to sort out what is and isn't a
directory.
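The suggestion above (list each deduplicated parent once, then use the result to tell files from directories) might look roughly like the following. This is a minimal sketch, not Laurelin's code: `ParentListingSketch`, `parentOf`, `classify`, and the `lister` function are hypothetical names, and `lister` is an in-memory stand-in for whatever filesystem call returns a directory's children.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class ParentListingSketch {
    /** Parent of a slash-separated path, e.g. "/a/b/c.root" -> "/a/b"; falls back to "/". */
    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx <= 0 ? "/" : path.substring(0, idx);
    }

    /**
     * Maps each requested path to true (directory) or false (file),
     * calling the lister once per distinct parent directory.
     * The lister (an assumption of this sketch) returns, for a directory,
     * each child's full path mapped to whether that child is a directory.
     */
    static Map<String, Boolean> classify(List<String> requested,
            Function<String, Map<String, Boolean>> lister) {
        // Deduplicate parents, keeping first-seen order.
        Set<String> parents = new LinkedHashSet<>();
        for (String path : requested) {
            parents.add(parentOf(path));
        }

        // One listing call per unique parent.
        Map<String, Boolean> seen = new LinkedHashMap<>();
        for (String parent : parents) {
            seen.putAll(lister.apply(parent));
        }

        // Keep only the requested paths; paths absent from the listings are skipped.
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String path : requested) {
            Boolean isDirectory = seen.get(path);
            if (isDirectory != null) {
                result.put(path, isDirectory);
            }
        }
        return result;
    }
}
```

With Hadoop, the lister could plausibly wrap `FileSystem.listStatus(Path)` and read `FileStatus.isDirectory()` for each entry, so a flat list of N files in one directory costs one listing call instead of N.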
OK - I have to deal with cases where files from multiple datasets sit in the same directory, but that seems manageable with the provided interface. Set intersection should be enough; if I can glob on the largest overlapping string, that can work too. Will see what shakes out.
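The set-intersection idea can be sketched as follows, assuming the directory listing and the user's file list are both available as plain path strings. `DatasetFilterSketch` and `keepRequested` are hypothetical names used only for illustration; the actual change may differ.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DatasetFilterSketch {
    /**
     * Returns the requested paths that also appear in the directory listing,
     * in request order. Files from other datasets in the same directory are
     * dropped because they were never requested.
     */
    static Set<String> keepRequested(List<String> requested, Set<String> listed) {
        Set<String> kept = new LinkedHashSet<>(requested);
        kept.retainAll(listed); // set intersection
        return kept;
    }
}
```

For a directory holding `/data/A_1.root`, `/data/A_2.root`, and `/data/B_1.root`, requesting only the two `A` files yields just those two, so listing the shared parent does not pull in dataset `B`.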
6cb2aec to 7846c7d
Significant improvement: a 20-50% reduction in the time needed to ingest flat lists of files. The improvement is larger for larger lists, which we are likely to encounter from people transitioning from classical analysis platforms.