Update fsspec parameters for cloud reads #677

Open
wants to merge 2 commits into base: development

rwegener2
Contributor

What has been built?

This PR sets up fsspec to use the cloud-optimized parameters recommended in this whitepaper. It does not add the h5py parameters because of the benchmarking discussed in #675. fsspec is used for cloud reading (and only cloud reading), so this update should only affect reads of data in S3.
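For intuition about why a block cache with a large block size can help, here is a toy sketch. It is not fsspec's implementation: `CountingBlockReader` and `count_requests` are hypothetical names, the "remote" file is just an in-memory bytes object, and the 64 KiB comparison size and the read pattern are arbitrary choices for illustration. It only counts how many block fetches a set of small scattered reads would trigger.

```python
MIB = 1024 * 1024


class CountingBlockReader:
    """Serve byte-range reads from fixed-size cached blocks.

    Illustrative only; this is not how fsspec implements its block cache.
    """

    def __init__(self, data: bytes, block_size: int):
        self.data = data
        self.block_size = block_size
        self.blocks = {}    # block index -> cached bytes
        self.requests = 0   # number of simulated remote fetches

    def _block(self, index: int) -> bytes:
        # Fetch a block only the first time it is needed.
        if index not in self.blocks:
            start = index * self.block_size
            self.blocks[index] = self.data[start:start + self.block_size]
            self.requests += 1
        return self.blocks[index]

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, len(self.data))
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self.block_size)
            chunk = self._block(index)[within:within + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)


def count_requests(block_size, offsets, length=100, size=16 * MIB):
    """Count simulated fetches for small reads at the given offsets."""
    reader = CountingBlockReader(bytes(size), block_size)
    for offset in offsets:
        reader.read(offset, length)
    return reader.requests


# 128 small reads scattered across the first 8 MiB of a 16 MiB "file".
offsets = range(0, 8 * MIB, 64 * 1024)
small = count_requests(64 * 1024, offsets)   # arbitrary small block size
large = count_requests(8 * MIB, offsets)     # block size used in this PR
print(small, large)  # -> 128 1
```

Real S3 reads also depend on latency, bandwidth, and the file's internal layout, so this only illustrates the request-count side of the trade-off.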

Merging Note: I put this PR up for reference and visibility. I know there are other big changes and a v2 release coming soon, but this PR can wait until the appropriate time for merging.

Approximate timing results

Even on v006 data, the added fsspec parameters noticeably speed up data reads from the cloud (~12 times faster).

| Configuration | Run time (1 group, ATL03) |
| --- | --- |
| cloud v006 without optimized fsspec params | ~6 minutes |
| cloud v006 with optimized fsspec params | ~30 seconds |

Timing was done with the Jupyter magic %%timeit on the three lines below: (1) create the Read object, (2) append variables, (3) load data. This does not account for S3 caching.
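For anyone who wants to reproduce the comparison outside a notebook, a rough stand-in for %%timeit might look like the sketch below. `time_runs`, `speedup`, and `read_workflow` are hypothetical helpers, and the placeholder workload needs to be swapped for the three read lines. The ratio at the end just restates the approximate numbers from the table (about 360 s versus about 30 s).

```python
import time


def time_runs(func, repeats=3):
    """Return the wall-clock duration in seconds of each call to `func`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def speedup(before_seconds, after_seconds):
    """Ratio of the old run time to the new run time."""
    return before_seconds / after_seconds


def read_workflow():
    # Placeholder workload. Replace with the three lines being timed:
    # create the Read object, append variables, load data.
    sum(range(100_000))


durations = time_runs(read_workflow)
print(f"mean of {len(durations)} runs: {sum(durations) / len(durations):.4f} s")

# Approximate figures reported above: ~6 minutes vs ~30 seconds.
print(speedup(6 * 60, 30))  # -> 12.0
```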

How was it done?

The biggest decision, I think, was whether to expose this as an option to the user. This PR is the simplest possible implementation, in which the user is not given a choice. Given how low-level this change is, I think that makes the most sense, but others should jump in if they see it differently.
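For comparison, if we did want to expose the choice later, one possible shape is sketched here. This is not part of this PR: `DEFAULT_FSSPEC_PARAMS` and `resolve_fsspec_params` are hypothetical names, and the only values taken from the actual change are the two defaults.

```python
# Defaults taken from this PR; everything else here is a hypothetical sketch.
DEFAULT_FSSPEC_PARAMS = {
    "cache_type": "blockcache",
    "block_size": 8 * 1024 * 1024,  # 8 MiB
}


def resolve_fsspec_params(user_params=None):
    """Merge optional user overrides onto the defaults without mutating either."""
    params = dict(DEFAULT_FSSPEC_PARAMS)
    if user_params:
        params.update(user_params)
    return params


# No overrides: the defaults are used as-is.
print(resolve_fsspec_params())

# A user override replaces only the keys it names.
print(resolve_fsspec_params({"block_size": 4 * 1024 * 1024}))

# The result would then be passed through to the open call, e.g.
# s3.open(file, "rb", **resolve_fsspec_params(user_params))
```

The cost is one more keyword for users to learn and for us to document, which is why I left it out of this PR.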

How can it be tested?

The code below reads a small v006 file from s3.

import icepyx as ipx

s3url_006 = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2019/06/13/ATL03_20190613013940_11570313_006_02.h5'

reader = ipx.Read(s3url_006)
reader.vars.append(beam_list=['gt1l'], var_list=['h_ph'])
ds = reader.load()  # <- This call will now use the new parameters

github-actions bot commented Apr 28, 2025

Binder 👈 Launch a binder notebook on this branch for commit 2f482da

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit 21a5961

Comment on lines 630 to +635
  s3 = earthaccess.get_s3fs_session(daac="NSIDC")
- file = s3.open(file, "rb")
+ fsspec_params = {
+     "cache_type": "blockcache",
+     "block_size": 8 * 1024 * 1024,
+ }
+ file = s3.open(file, "rb", **fsspec_params)
Member

I'm wondering whether we could just call earthaccess.open here. We want to gradually delegate most of the granule-reading logic to earthaccess (xref #575), so it would be best if earthaccess could handle the default cache and block sizes somehow. Cc @betolink for thoughts (since you had some ideas at nsidc/earthaccess#527).
