Update fsspec parameters for cloud reads #677

rwegener2 · 2025-04-28T20:39:37Z

What has been built?

This PR sets up fsspec to use the cloud-optimized parameters recommended in this whitepaper. It does not add the h5py parameters, because of the benchmarking discussed in #675. fsspec is used for cloud reading (and only cloud reading), so this update should only affect reads of data in s3.
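
As a rough illustration of why a block cache with a larger block size can help, the sketch below counts how many fixed-size blocks a series of byte-range reads would touch. It is a simplified model and not fsspec's implementation: the `blocks_fetched` helper and the read pattern are hypothetical, and it assumes each block is fetched at most once and never evicted.

```python
def blocks_fetched(reads, block_size):
    """Count distinct fixed-size blocks touched by (offset, length) reads.

    Simplified model: each block is fetched once and never evicted.
    """
    touched = set()
    for offset, length in reads:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        touched.update(range(first, last + 1))
    return len(touched)


# Hypothetical access pattern: 1000 small reads of 512 bytes, 4 KiB apart.
reads = [(i * 4096, 512) for i in range(1000)]

small = blocks_fetched(reads, 64 * 1024)        # 64 KiB blocks
large = blocks_fetched(reads, 8 * 1024 * 1024)  # 8 MiB blocks, as in this PR
```

In this toy pattern the 8 MiB block size needs a single fetch where 64 KiB blocks need 63. Fewer requests are not automatically faster, since each one moves more bytes, so real behavior depends on the file layout and the network.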

Merging Note: I put this PR up for reference and visibility. I know there are other big changes and a v2 release coming soon, but this PR can wait until the appropriate time for merging.

Approximate timing results

Even on v006 data, adding the fsspec parameters noticeably speeds up data reads from the cloud (~12 times faster).

Run	Time (1 group ATL03)
cloud v006 without optimized fsspec params	~6 minutes
cloud v006 with optimized fsspec params	~30 seconds

Timing was done with the Jupyter magic %%timeit on the three lines in the test snippet below: 1) create the Read object, 2) append variables, 3) load data. This does not account for s3 caching.
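
Because %%timeit repeats the cell, a single-shot timer can be a useful cross-check. The sketch below is one way to do that and to compute the speedup ratio. `time_once` and `speedup` are hypothetical helper names, and the numbers are the approximate figures from the table above.

```python
import time


def time_once(fn):
    """Run fn() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline to optimized wall-clock time."""
    return baseline_seconds / optimized_seconds


# Approximate figures from the table above: ~6 minutes vs. ~30 seconds.
ratio = speedup(6 * 60, 30)
```

To time the actual read, one could pass a function that wraps the three steps (create the Read object, append variables, load data) to `time_once`.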

How was it done?

The biggest decision, I think, was whether or not to expose this as an option to the user. This PR is the simplest possible implementation, in which the user is not given a choice. Given how low-level this change is, I think this makes the most sense, but others should jump in if they see it differently.
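
For comparison, here is a minimal sketch of what exposing the option might look like. This is not what the PR implements: `resolve_fsspec_params` and its `overrides` argument are hypothetical, and only the default values come from this PR's diff.

```python
# Defaults taken from this PR's diff.
DEFAULT_FSSPEC_PARAMS = {
    "cache_type": "blockcache",
    "block_size": 8 * 1024 * 1024,  # 8 MiB
}


def resolve_fsspec_params(overrides=None):
    """Return a copy of the defaults with any user overrides applied."""
    params = dict(DEFAULT_FSSPEC_PARAMS)
    if overrides:
        params.update(overrides)
    return params
```

The result could then be unpacked into the existing call, e.g. `s3.open(file, "rb", **resolve_fsspec_params(user_params))`, where `user_params` is whatever the user passed in. The trade-off is a larger public API for a setting most users will likely never change.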

How can it be tested?

The code below reads a small v006 file from s3.

import icepyx as ipx

s3url_006 = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2019/06/13/ATL03_20190613013940_11570313_006_02.h5'

reader = ipx.Read(s3url_006)
reader.vars.append(beam_list=['gt1l'], var_list=['h_ph'])
ds = reader.load()  # <- This call will now use the new parameters

github-actions · 2025-04-28T20:39:47Z

👈 Launch a binder notebook on this branch for commit 2f482da

I will automatically update this comment whenever this PR is modified

👈 Launch a binder notebook on this branch for commit 21a5961

weiji14 · 2025-05-11T21:09:10Z

icepyx/core/read.py

                s3 = earthaccess.get_s3fs_session(daac="NSIDC")
-                file = s3.open(file, "rb")
+                fsspec_params = {
+                    "cache_type": "blockcache",
+                    "block_size": 8 * 1024 * 1024,
+                }
+                file = s3.open(file, "rb", **fsspec_params)


I'm wondering if there's a way to just call earthaccess.open here? We want to slowly delegate most of the granule reading logic to earthaccess (xref #575), so it would be best if we can have earthaccess handle the default cache/block sizes somehow. Cc @betolink for thoughts (since you had some ideas at nsidc/earthaccess#527).

rwegener2 and others added 2 commits April 28, 2025 20:06

change fsspec parameters

2f482da

[pre-commit.ci] auto fixes from pre-commit.com hooks

21a5961

for more information, see https://pre-commit.ci

weiji14 reviewed May 11, 2025
