More validation for partial read parameters #116

Open
wants to merge 8 commits into
base: main

Member

@krisfed krisfed commented Jul 14, 2025

Changes include:

  1. Preventing a crash when trying to read too much data (fixes #108: MATLAB crashed while using zarrread function to read a big (2317GB) array).

    a. The crash is caused by running out of memory, and the best we can do is error out instead of crashing. There isn't an easy way to determine the memory limit in MATLAB, so I am using a hacky workaround: try to create a zeros array of the same size and datatype and see whether that errors. Luckily, creating such an array is efficient and errors out quickly. At least this is better than accidentally crashing MATLAB altogether.

    b. To know the datatype before reading the data, I am using the results from zarrinfo (instead of introspecting already-read data). To determine the datatype from zarrinfo, I needed to add a fromZarrType method to ZarrDatatype.m.

    c. Added a test for this in tZarrRead - tooBigArray.

    d. Example of the proposed error when reading data that's too big:

>> d = zarrread('https://noaa-nwm-retro-v2-zarr-pds.s3.amazonaws.com/elevation/', Count=[100000,100000])
Error using Zarr/read (line 346)
Reading requested data (100000-by-100000 single matrix) might exceed available memory.

Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  2. Replacing TensorStore errors (which are a bit hard to interpret) with our own errors for when Start/Stride/Count are out of bounds (fixes #110: Partial read: Improvement to error messages). This also makes sure we error out when Stride is too big (fixes #113: Check behavior when using out-of-bound value for the "stride" parameter). For example (on a 3x4 Zarr array):

Start too big

Before:

>> d = zarrread("dataFiles/grp_v2/smallArr", Start=[5, 5])
Error using ZarrPy>readZarr (line 107)
Python Error: IndexError: Computing interval slice for dimension 0: (4, 2) do not
specify a valid closed index interval [source
locations='tensorstore/index_interval.cc:360\ntensorstore/index_interval.cc:403\ntensorstore/index_space/internal/numpy_indexing_spec.cc:506\ntensorstore/index_space/internal/numpy_indexing_spec.cc:506\ntensorstore/index_space.h:440']

Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After:

>> d = zarrread("dataFiles/grp_v2/smallArr", Start=[5, 5])
Error using Zarr.processPartialReadParams (line 107)
Elements in Start must not exceed the corresponding Zarr array dimensions.

Error in Zarr/read (line 317)
            start = Zarr.processPartialReadParams(start, info.shape,...
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);

Stride too big

Before:

>> d = zarrread("dataFiles/grp_v2/smallArr", Stride=[10, 10])

d =

     []

After:

>> d = zarrread("dataFiles/grp_v2/smallArr", Stride=[10, 10])
Error using Zarr.processPartialReadParams (line 107)
Elements in Stride must not exceed the corresponding Zarr array dimensions.

Error in Zarr/read (line 319)
            stride = Zarr.processPartialReadParams(stride, info.shape,...
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Combination of Count and Start is too big

Before:

>> d = zarrread("dataFiles/grp_v2/smallArr", Start=[1,2], Count=[3,4])
Error using ZarrPy>readZarr (line 107)
Python Error: ValueError: OUT_OF_RANGE: Propagated bounds [0, 4), with size=4, for
dimension 1 are incompatible with existing bounds [1, 5), with size=4. [transform='Rank
2 -> 2 index space transform:   Input domain:     0: [0, 3)     1: [1, 5)   Output index
maps:     out[0] = 0 + 1 * in[0]     out[1] = 0 + 1 * in[1] '] [domain='{origin={0, 0},
shape={3, 4}}'] [source
locations='tensorstore/index_space/internal/propagate_bounds.cc:287\ntensorstore/index_space/index_transform.h:994']

Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After:

>> d = zarrread("dataFiles/grp_v2/smallArr", Start=[1,2], Count=[3,4])
Error using Zarr/read (line 326)
Requested Count in combination with other parameters exceeds Zarr array dimensions.

Error in zarrread (line 36)
data = zarrObj.read(options.Start, options.Count, options.Stride);
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
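
The three bounds checks illustrated above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `checkPartialRead` and the `Sketch:*` error identifiers are hypothetical, Start is assumed to be 1-based, and all four inputs are assumed to have one element per array dimension. The Count check assumes the last element read in each dimension is `start + (count - 1) .* stride`.

```matlab
function checkPartialRead(start, count, stride, arrShape)
% checkPartialRead  Hypothetical helper (not the toolbox's actual code).
% Errors if a 1-based Start/Count/Stride request falls outside arrShape.
    arguments
        start    (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
        count    (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
        stride   (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
        arrShape (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
    end

    % Start must point at an existing element in every dimension.
    if any(start > arrShape)
        error("Sketch:startOutOfBounds", ...
            "Elements in Start must not exceed the corresponding Zarr array dimensions.");
    end

    % A stride larger than the dimension can never select a second element.
    if any(stride > arrShape)
        error("Sketch:strideOutOfBounds", ...
            "Elements in Stride must not exceed the corresponding Zarr array dimensions.");
    end

    % Index of the last element the request would touch in each dimension.
    lastIndex = start + (count - 1) .* stride;
    if any(lastIndex > arrShape)
        error("Sketch:countOutOfBounds", ...
            "Requested Count in combination with other parameters exceeds Zarr array dimensions.");
    end
end
```

For the 3x4 array used in the examples, `checkPartialRead([1 2], [3 4], [1 1], [3 4])` fails the last check because the second dimension would need index 2 + (4 - 1) * 1 = 5, while `checkPartialRead([1 1], [3 4], [1 1], [3 4])` passes silently.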
  3. Not allowing logical values for Start/Stride/Count. Previously you could get the following error:
>> d = zarrread("dataFiles/grp_v2/smallArr", Start="")
Error using zarrread (line 30)
 d = zarrread("dataFiles/grp_v2/smallArr", Start="")
                                                 ^^
Invalid value for 'Start' argument. Value must be numeric or logical.

But logical values do not make sense for Start/Stride/Count, so I added a mustBeNumeric validator. Now the behavior is clearer:

>> d = zarrread("dataFiles/grp_v2/smallArr", Start="")
Error using zarrread (line 30)
 d = zarrread("dataFiles/grp_v2/smallArr", Start="")
                                                 ^^
Invalid value for 'Start' argument. Value must be numeric.

>> d = zarrread("dataFiles/grp_v2/smallArr", Start=[true, true])
Error using zarrread (line 30)
 d = zarrread("dataFiles/grp_v2/smallArr", Start=[true, true])
                                                 ^^^^^^^^^^^^
Invalid value for 'Start' argument. Value must be numeric.
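
As a rough illustration of how a `mustBeNumeric` validator rejects logical input, here is a minimal sketch of an `arguments` block with name-value options. The function `readSketch` is a hypothetical stand-in, not the actual `zarrread` signature, and the additional `mustBeInteger` and `mustBePositive` validators are assumptions made for the example.

```matlab
function opts = readSketch(options)
% readSketch  Hypothetical stand-in for option parsing (not zarrread itself).
% Returns the validated name-value options as a struct.
    arguments
        % No class is declared, so inputs are not converted before validation
        % and mustBeNumeric sees the original logical or string value.
        options.Start  (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
        options.Count  (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
        options.Stride (1,:) {mustBeNumeric, mustBeInteger, mustBePositive}
    end
    opts = options;
end
```

With this sketch, `readSketch(Start=[2 3])` is accepted, while `readSketch(Start=[true true])` and `readSketch(Start="")` are both rejected by `mustBeNumeric`. One design point worth noting: MATLAB performs class conversion before it runs validation functions, so declaring a class such as `double` on these arguments would turn `true` into `1` and let logical input through.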

codecov bot commented Jul 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.80%. Comparing base (8898cee) to head (c7a55bf).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   97.09%   95.80%   -1.30%     
==========================================
  Files           8        8              
  Lines         241      262      +21     
==========================================
+ Hits          234      251      +17     
- Misses          7       11       +4     

@krisfed krisfed marked this pull request as ready for review July 16, 2025 17:47
@krisfed krisfed requested review from jm9176 and jhughes-mw July 16, 2025 17:47
try
zeros(count, obj.Datatype.MATLABType);
catch ME
if strcmp(ME.identifier, 'MATLAB:array:SizeLimitExceeded')
Member

Why did you have to use IF here?

Member Author

Here I only want to check for the error related to exceeding the size limit. I am choosing to silently swallow any other possible errors (rare/unlikely edge cases). We are trying to create an array of zeros here just as a way to see whether reading a similarly sized array from the Zarr file will exceed the memory limit (unfortunately, I couldn't find a better way), so my thinking is that any other errors would be unrelated, and we shouldn't throw them and interrupt the user's workflow. If something unrelated to the memory limit goes wrong while creating the array of zeros, we should just try reading from the Zarr file anyway.
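
To make the intent concrete, the selective catch could be wrapped up along these lines. This is only a sketch: the helper name `mightExceedMemory` is hypothetical, and the only thing taken from the snippet under review is the `zeros` probe and the `MATLAB:array:SizeLimitExceeded` identifier.

```matlab
function tf = mightExceedMemory(sz, matlabType)
% mightExceedMemory  Hypothetical helper illustrating the probe allocation.
% sz is the size vector of the requested data, matlabType a class name
% such as 'single'. Returns true only for the size-limit error.
    tf = false;
    try
        zeros(sz, matlabType);  % probe only; the result is discarded
    catch ME
        if strcmp(ME.identifier, 'MATLAB:array:SizeLimitExceeded')
            tf = true;
        end
        % Any other error is deliberately ignored, so an unrelated problem
        % with the probe never blocks the actual read.
    end
end
```

A small request such as `mightExceedMemory([3 4], 'single')` returns false. Whether a large request trips the limit depends on the machine's memory and MATLAB's array size limit setting.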

Member

Thanks for clarifying this.
