faster file list ingestion for flat lists #77

lgray · 2019-11-27T00:32:52Z

Significant improvement: a 20-50% reduction in the time needed to ingest flat lists of files. The improvement is larger for larger lists, which we are likely to encounter from people transitioning from classical analysis platforms.

lgray · 2019-12-03T20:29:55Z

Any comments on this one? It's a noticeable improvement for people who are using it.

PerilousApricot · 2019-12-03T22:27:16Z

Instead of your patch, please push the loop at https://github.com/spark-root/laurelin/blob/e20d8dafabed60443416620e405317cf24867b92/src/main/java/edu/vanderbilt/accre/laurelin/Root.java#L257 into IOFactory.expandPathToList() and do the `fileSystem.listFiles` for the parent directory (or directories) in the Hadoop implementation to get the file list. Also, can you get your editor set up to do the right indentation?

PerilousApricot · 2019-12-03T22:53:58Z

Or, actually, use the bulk listStatus API https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path[])

lgray · 2019-12-03T23:03:34Z

OK - I'll try out the second one. The first can lead to incorrect behavior when people have multiple datasets in the same directory and keep track of the files with a JSON file (yes, people do this).

lgray · 2019-12-05T20:30:04Z

I'll test it on the cluster once we are done with upgrading arrow/numpy/etc there, and let you know the outcomes.

lgray · 2019-12-05T23:01:42Z

Figured out a better implementation, as well.

lgray · 2019-12-05T23:12:04Z

Or so I thought. It's not as easy to turn into a flat map as I expected.

lgray · 2019-12-10T14:46:07Z

OK - finally got to this - the bulk listStatus API is not faster than calling listStatus in a for loop. Looking into it, listStatus(Path[]) is just implemented as a for loop over listStatus(Path), so it hits the same issues that led me to just copy paths that end in .root.

I'm gonna back out those changes, just ingesting things that end in .root directly and expanding paths, and keep the loop inside the interface you wanna present.

This OK with you?
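For readers following along, the ".root fast path" under discussion can be sketched roughly as below. This is only an illustration, not the code from the patch: the class name `FlatListSketch`, the method `resolve`, and the `expander` callback (standing in for whatever per-path expansion the real code does) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class FlatListSketch {
    /**
     * Hypothetical fast path: entries ending in ".root" are accepted as files
     * without any filesystem call; every other entry is handed to the
     * caller-supplied expander (assumed to be the expensive step).
     */
    static List<String> resolve(List<String> inputs,
                                Function<String, List<String>> expander) {
        List<String> out = new ArrayList<>();
        for (String in : inputs) {
            if (in.endsWith(".root")) {
                out.add(in);                    // taken as-is, no lookup
            } else {
                out.addAll(expander.apply(in)); // expanded by the callback
            }
        }
        return out;
    }
}
```

The shortcut trusts the file extension, so a directory whose name happens to end in `.root` would be misclassified; that is the kind of "hack" the review pushes back on.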

PerilousApricot · 2019-12-10T16:11:45Z

No, it needs to be done right and not as a hack, which we agreed to not do. If that particular interface is implemented weird, then do listStatus on the (deduped) parent, and use that to sort out what is and isn't a directory.
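A minimal sketch of the "list the deduped parents" idea, using plain strings rather than the Hadoop API so it runs standalone. The names `ParentListingSketch`, `parentOf`, and `classify` are made up for illustration, and the `lister` callback is an assumed stand-in for one directory listing call that reports, for each child path, whether it is a directory.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class ParentListingSketch {
    /** Parent of a slash-separated path, e.g. "/a/b/c.root" -> "/a/b". */
    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        if (i < 0) return ".";   // no slash: relative to current directory
        if (i == 0) return "/";  // direct child of the root
        return path.substring(0, i);
    }

    /**
     * Maps each requested path to true (directory) or false (file), calling
     * the lister once per distinct parent. Requested paths that do not appear
     * in their parent's listing are omitted from the result.
     */
    static Map<String, Boolean> classify(List<String> paths,
                                         Function<String, Map<String, Boolean>> lister) {
        // Dedupe parents so a flat list of N files in one directory costs one listing.
        Set<String> parents = new LinkedHashSet<>();
        for (String p : paths) parents.add(parentOf(p));

        Map<String, Boolean> listed = new LinkedHashMap<>();
        for (String parent : parents) listed.putAll(lister.apply(parent));

        Map<String, Boolean> out = new LinkedHashMap<>();
        for (String p : paths) {
            Boolean isDir = listed.get(p);
            if (isDir != null) out.put(p, isDir);
        }
        return out;
    }
}
```

For a flat list whose files share a few directories, the number of listing calls scales with the number of distinct parents instead of the number of files.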

lgray · 2019-12-10T16:49:06Z

OK - I have to deal with cases where files from multiple datasets are in the same directory, but that seems manageable with the provided interface. Set intersection should be enough; if I can glob on the largest overlapping string, that can work too. Will see what shakes out.
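The set-intersection step mentioned above could look something like the following. It is a sketch under assumptions, not what was eventually merged: `DatasetFilterSketch` and `requestedOnly` are hypothetical names, and paths are treated as plain strings.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DatasetFilterSketch {
    /**
     * Keeps only the entries of a directory listing that the dataset's own
     * file list asked for, preserving listing order. Files belonging to other
     * datasets that live in the same directory are dropped.
     */
    static Set<String> requestedOnly(List<String> directoryListing,
                                     List<String> requested) {
        Set<String> keep = new LinkedHashSet<>(directoryListing);
        keep.retainAll(new HashSet<>(requested)); // set intersection
        return keep;
    }
}
```

Wrapping the requested list in a `HashSet` keeps the `retainAll` membership checks constant-time, which matters for the large flat lists this PR targets.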

PerilousApricot force-pushed the master branch from b652adb to d1734ea Compare December 2, 2019 22:15

lgray added 4 commits December 5, 2019 14:03

faster file list ingestion for flat lists

9120dc2

use more hadoopy goodness

80635de

use LinkedList all the way through to avoid necessary loops when getting multiple paths

904f1a2

add tests

447af06

lgray force-pushed the ux/light_file_ingestion branch from 70767c6 to 447af06 Compare December 5, 2019 20:25

PerilousApricot force-pushed the master branch from 6cb2aec to 7846c7d Compare December 16, 2019 23:33

PerilousApricot added a commit that referenced this pull request Apr 7, 2020

Use new path expansion implementation

17d4aaa

Remove all uses of old path expansion implementation (expandPathToList) Fixes #81 #77

PerilousApricot added a commit that referenced this pull request Apr 7, 2020

Use new path expansion implementation

e4671e0

Remove all uses of old path expansion implementation (expandPathToList) Fixes #81 #77
