Conversation

@d-v-b d-v-b commented Aug 24, 2025

This PR adds a runtime type checker specifically for checking JSON-like data against a type definition. It's currently a draft while I get the test suite happy and refine the API, but it's also ready for people to look at and try out. I'm pretty convinced of its utility, but I also think we should have a good discussion about whether this feature is a good idea.

Demo

The basic API looks like this:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///

from zarr.core.type_check import check_type
from typing import TypedDict, Literal, NotRequired

class MyChunkGridConfig(TypedDict):
    prop_a: int
    prop_b: str
    prop_c: str | int
    optional_prop: NotRequired[bool]

class MyChunkGrid(TypedDict):
    name: Literal["awesome_chunk_grid"]
    configuration: MyChunkGridConfig

valid_data = [
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": 10}},
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": "10"}},
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": "10", "optional_prop": True}},
]

for d in valid_data:
    result = check_type(d, MyChunkGrid)
    print(result.success, result.errors)
"""
True []
True []
True []
"""

invalid_data = [
    # invalid type: prop_b should be a str, but it's an int
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": 10, "prop_c": 10}},
    # missing required key: prop_c
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "10"}},
]
for d in invalid_data:
    result = check_type(d, MyChunkGrid)
    print(result.success, result.errors)
"""
False ["value['configuration']['prop_b'] expected an instance of <class 'str'> but got 10 with type <class 'int'>"]
False ["value['configuration'] missing required key 'prop_c'"]
"""

Some aspects might evolve while this is a draft, like the nature of the error messages.

Supported types

This is not a general-purpose type checker. It is targeted at the types relevant for Zarr metadata documents, and so it supports the following narrow set of types:

  • int, bool, float, str
  • union
  • tuple
  • list
  • sequence
  • mapping
  • typeddict

cost

maintenance burden

The type checker itself is ~530 lines of commented code, broken up into functions which are mostly easy to understand. The TypedDict part, and the logic for resolving generic types, is convoluted and potentially sensitive to changes in how Python exposes type annotations at runtime. Many type annotation features were designed for static type checkers, not for use within a running Python program, so some of this is rather fiddly. But I don't think we are relying on any brittle or private APIs here.
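For context on what "resolving generic types at runtime" involves: the stdlib `typing` module exposes the introspection hooks this kind of checker has to lean on. This is an illustrative sketch of that machinery, not zarr's actual implementation:

```python
# Sketch of the stdlib introspection a runtime type checker relies on;
# the ChunkGrid TypedDict here is illustrative, not zarr's.
from typing import Literal, TypedDict, get_args, get_origin, get_type_hints

class ChunkGrid(TypedDict):
    name: Literal["regular"]
    chunk_shape: tuple[int, ...]

# get_type_hints resolves a TypedDict's field annotations at runtime.
hints = get_type_hints(ChunkGrid)
print(hints)

# get_origin / get_args decompose generic aliases like tuple[int, ...].
print(get_origin(hints["chunk_shape"]))  # <class 'tuple'>
print(get_args(hints["chunk_shape"]))    # (<class 'int'>, Ellipsis)

# TypedDicts also expose which keys are required vs NotRequired.
print(ChunkGrid.__required_keys__)
```

The fiddly part is that reprs, resolution rules, and attributes like `__required_keys__` have shifted across Python versions, which is why this code is sensitive to runtime changes even though none of these APIs are private.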

performance

As currently implemented, the type checker reports all detectable errors:

>>> check_type(tuple(range(4)), tuple[str, ...])
TypeCheckResult(success=False, errors=["value[0] expected an instance of <class 'str'> but got 0 with type <class 'int'>", "value[1] expected an instance of <class 'str'> but got 1 with type <class 'int'>", "value[2] expected an instance of <class 'str'> but got 2 with type <class 'int'>", "value[3] expected an instance of <class 'str'> but got 3 with type <class 'int'>"])

This is wasted compute when we don't care exactly how mismatched the data is, but it makes for a better user experience. We might need to tune this if performance becomes a problem, e.g. by introducing a "fail_fast" option that returns on the first error.
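To make the trade-off concrete, here is a toy sketch of what a hypothetical `fail_fast` option could look like. The names and the single-purpose checker are illustrative only, not zarr's API:

```python
# Toy sketch of a hypothetical fail_fast option; TypeCheckResult and
# check_tuple_of_str are illustrative names, not zarr's actual API.
from dataclasses import dataclass, field

@dataclass
class TypeCheckResult:
    success: bool
    errors: list[str] = field(default_factory=list)

def check_tuple_of_str(value: tuple, *, fail_fast: bool = False) -> TypeCheckResult:
    """Check that every element of `value` is a str."""
    errors: list[str] = []
    for i, item in enumerate(value):
        if not isinstance(item, str):
            errors.append(f"value[{i}] expected str, got {type(item).__name__}")
            if fail_fast:
                break  # stop at the first mismatch instead of scanning the rest
    return TypeCheckResult(success=not errors, errors=errors)

full = check_tuple_of_str(tuple(range(4)))
fast = check_tuple_of_str(tuple(range(4)), fail_fast=True)
print(len(full.errors), len(fast.errors))  # 4 1
```

Both modes agree on `success`; they differ only in how much diagnostic detail is collected, which is exactly the cost/UX knob described above.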

benefit

We can instantly remove a lot of special-purpose functions. Most of the functions named parse_* (~30+ functions) and essentially all of the functions named *check_json* (~30 functions) could be replaced or simplified with the check_type function.

We can also make our JSON loading routines type-safe:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///

from zarr.core.type_check import guard_type
from zarr.core.common import ArrayMetadataJSON_V3

unknown_data: dict[str, object] = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": (10,10),
    "data_type": "uint8", 
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": (10,10)}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "codecs": ({"name": "bytes"}, ),
    "fill_value": 0
    }

if guard_type(unknown_data, ArrayMetadataJSON_V3):
    reveal_type(unknown_data)

mypy could not infer the type correctly, but basedpyright does:

zarr-python git:(feat/type-checker) ✗ uvx basedpyright test.py
/Users/d-v-b/dev/zarr-python/test.py
  /Users/d-v-b/dev/zarr-python/test.py:25:17 - information: Type of "unknown_data" is "ArrayMetadataJSON_V3"
0 errors, 0 warnings, 1 note

We could write a bespoke function that specifically checks all the possibilities for Zarr v3 metadata, but then we would need to painfully modify that function by hand to support something like this:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///
from collections.abc import Mapping
from typing_extensions import TypedDict
from zarr.core.common import ArrayMetadataJSON_V3
from zarr.core.type_check import check_type

class GeoZarrAttrs(TypedDict):
    geozarr: Mapping[str, object]

class GeoZarrArray(ArrayMetadataJSON_V3):
    attributes: GeoZarrAttrs

unknown_data: dict[str, object] = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": (10,10),
    "data_type": "uint8", 
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": (10,10)}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "codecs": ({"name": "bytes"}, ),
    "fill_value": 0,
    "attributes": {"not_geozarr": "bar"}
    }

result = check_type(unknown_data, GeoZarrArray)
print(result)
"""
TypeCheckResult(success=False, errors=["value['attributes'] missing required key 'geozarr'"])
"""

alternatives

We could use an external JSON validation / type-checking library like pydantic, attrs, msgspec, beartype, etc. But I would rather not add a dependency. With the approach in this PR, we keep control in-house, and because this PR just adds functions, it composes with the rest of our codebase as it stands. (FWIW, right now this type checker doesn't do any parsing, it only validates. If you think we should parse instead of just validating, then IMO that's a job for our array metadata classes.)

We could also do nothing, and continue writing JSON parsing code by hand. But I would rather not do that, because this invites bugs and makes it hard to keep up with sneaky spec changes. Specifically, I'm planning on writing a lot of new types to model the codecs defined in #3376, and I would rather just write the type and get the type checking (and type safety) for free.

closes #3285

@d-v-b d-v-b changed the title add a runttime type checker for metadata objects add a runtime type checker for metadata objects Aug 24, 2025
@d-v-b d-v-b marked this pull request as ready for review August 25, 2025 13:09
d-v-b commented Aug 25, 2025

this is pretty substantial so I would appreciate a lot of eyes @zarr-developers/python-core-devs

if anyone has concerns about whether we should do any runtime type checking at all, maybe send those thoughts to the issue this PR closes

I'm going to keep working on tests for the type checker, but so far it's working great.

This PR violates the Liskov substitution principle for a few subclasses of our Metadata ABC, because that class requires that to_dict and from_dict use dict[str, JSON], which is not very accurate. After this PR we need to change that class so that it supports type safety; then we won't be violating Liskov any more.

Similarly, there are lots of # type: ignores added in various places. As you might guess, those were added because without them, mypy flags type errors. Many of these ignores will go away when we fix the lax typing of the Metadata ABC and other classes.

@TomAugspurger I think you in particular will appreciate some of the effects of this PR. Since we can annotate methods like ArrayV3Metadata.from_dict() as taking ArrayMetadataJSON_V3, we don't need to do any runtime validation inside from_dict. The assumption is that the caller has already checked the input. This is not possible without the types defined in this PR, and it's not practical without the type checker. Once we extend this type-safe style, we can push almost all our type checking to the IO boundary.

That being said, I think the ArrayMetadata class will still need to do some internal consistency checks, like ensuring that the number of dimension names matches the length of shape. I don't think we want our type checker to be smart enough to catch that kind of thing.

codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 88.94231% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.31%. Comparing base (c9509ee) to head (bbd8ba7).

Files with missing lines Patch % Lines
src/zarr/core/type_check.py 83.20% 43 Missing ⚠️
src/zarr/core/array.py 86.66% 2 Missing ⚠️
src/zarr/core/metadata/v3.py 96.15% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3400      +/-   ##
==========================================
- Coverage   94.70%   94.31%   -0.39%     
==========================================
  Files          79       80       +1     
  Lines        9532     9795     +263     
==========================================
+ Hits         9027     9238     +211     
- Misses        505      557      +52     
Files with missing lines Coverage Δ
src/zarr/abc/codec.py 98.50% <100.00%> (-0.22%) ⬇️
src/zarr/api/asynchronous.py 89.93% <100.00%> (-0.04%) ⬇️
src/zarr/core/common.py 95.40% <100.00%> (+2.06%) ⬆️
src/zarr/core/dtype/common.py 70.58% <100.00%> (-14.96%) ⬇️
src/zarr/core/dtype/npy/bool.py 100.00% <100.00%> (ø)
src/zarr/core/dtype/npy/bytes.py 99.49% <100.00%> (-0.01%) ⬇️
src/zarr/core/dtype/npy/complex.py 98.80% <100.00%> (+0.01%) ⬆️
src/zarr/core/dtype/npy/float.py 98.91% <100.00%> (+0.01%) ⬆️
src/zarr/core/dtype/npy/int.py 99.37% <100.00%> (+<0.01%) ⬆️
src/zarr/core/dtype/npy/string.py 97.79% <100.00%> (+0.01%) ⬆️
... and 10 more

Successfully merging this pull request may close these issues.

unified runtime type checking for our json data