
Use zstandard implementation from stdlib (PEP-784) #2034

Open
wants to merge 1 commit into
base: master


@Rogdham Rogdham commented Aug 3, 2025

Thanks to PEP-784, Zstandard support is included in the Python standard library starting from version 3.14, including in the tarfile module.

So for Python 3.14+, we don't need an external lib. For older versions of Python, I'm using backports.zstd.
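
As an illustration of the version split described above (a minimal sketch, not necessarily how this PR wires it up): the helper names `zstd_module_name` and `load_zstd` are hypothetical. The usual form is a plain `if sys.version_info >= (3, 14):` around the imports; `importlib` is used here only so the sketch degrades gracefully when neither module is installed.

```python
import importlib
import sys


def zstd_module_name(version_info=sys.version_info) -> str:
    """Return the module that provides Zstandard for the given Python version."""
    # Python 3.14+ ships compression.zstd (PEP-784); older versions need the backport.
    return "compression.zstd" if version_info >= (3, 14) else "backports.zstd"


def load_zstd():
    """Import the selected module, or return None if it is not available."""
    try:
        return importlib.import_module(zstd_module_name())
    except ImportError:
        # Assumption for this sketch: callers handle a missing backport themselves.
        return None


zstd = load_zstd()
if zstd is not None:
    # Both modules expose the same one-shot compress/decompress functions.
    assert zstd.decompress(zstd.compress(b"hello")) == b"hello"
```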

This also makes it possible to remove the Python 3.12+ version check: we can use the filter parameter in all cases (the implementation of the tarfile module in backports.zstd comes from the CPython 3.14 codebase).

This in turn allowed a small refactor that improves clarity and reduces redundancy.


I expect the mypy typing tests to fail when checked against Python 3.14+, because the type hints that allow the r:zst mode in the tarfile module are not yet in the latest released version of mypy (they are in typeshed and on the mypy master branch, so probably in the next release).

Also, backports.zstd does not currently support PyPy (but should before October), so if you want to support PyPy I suggest waiting a little before merging the PR. Edit: PyPy is now supported starting from backports.zstd version 0.5.0.

Note that I don't know the hatch codebase, but did my best to fit in. Feel free to suggest any changes!

Full disclosure: I'm the author and maintainer of backports.zstd, and the maintainer of pyzstd (whose code was used as a base for the integration into Python). I also helped with PEP-784 and its integration into CPython.

Fixes #1801
Fixes #2007

@ofek
Contributor

ofek commented Aug 3, 2025

It will take me some time to review this, but thanks so much for your efforts on standard library inclusion! Out of curiosity, have you run benchmarks compared to the zstandard package? https://pypi.org/project/zstandard/

@Rogdham
Author

Rogdham commented Aug 3, 2025

> Out of curiosity, have you run benchmarks compared to the zstandard package?

Not yet. This is something I would like to do, but building a meaningful benchmark is kind of hard.

So far I chose instead to invest my time in PEP-784, integration into CPython, and backports.zstd.

I believe that even if the implementation in CPython (or in the backport) is slower by a huge margin, it would not change much for most applications. Of course this would depend on the difference and the application.

@ofek
Contributor

ofek commented Aug 3, 2025

Please let me know whenever you come up with even a simple benchmark, I'm very interested!

@Rogdham
Author

Rogdham commented Aug 16, 2025

> Please let me know whenever you come up with even a simple benchmark

This benchmark is a work in progress, but I compared one-shot compression/decompression of bytes already in memory (in a variable), for different sizes (1kB/1MB/1GB) and different levels (the default of 3, as well as two higher compression levels: 10 and 17). The data comes from enwik9.

Each operation was run with timeit, using a value for number big enough so that each operation runs for about 10 seconds.
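
To make the methodology concrete, here is a minimal sketch of one way to pick `number` so that a measurement lasts roughly a target duration. It is not the actual benchmark script: zlib is used as a stand-in compressor so the example runs with the standard library alone, the helper names are hypothetical, and the target is shortened from 10 seconds to a fraction of a second.

```python
import timeit
import zlib


def calibrate_number(func, target_seconds: float, probe: int = 5) -> int:
    """Estimate how many calls of func fit in roughly target_seconds."""
    per_call = timeit.timeit(func, number=probe) / probe
    # Guard against a zero measurement for extremely fast callables.
    return max(1, round(target_seconds / max(per_call, 1e-9)))


def seconds_per_call(func, target_seconds: float) -> float:
    """Time func over a calibrated number of runs and return the average."""
    number = calibrate_number(func, target_seconds)
    return timeit.timeit(func, number=number) / number


# Stand-in payload and compressor (assumption: the real runs used enwik9 and zstd).
payload = b"some moderately repetitive sample text " * 2_000
for level in (1, 6, 9):
    avg = seconds_per_call(lambda: zlib.compress(payload, level), target_seconds=0.2)
    print(f"level {level}: {avg * 1e3:.3f} ms per call")
```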

[Image: benchmark2, a chart of the benchmark results; green means backports.zstd is faster]


For the use case of hatch, my guess is that you would decompress archives around 1MB in size, so switching from zstandard to backports.zstd would be around 3% slower when decompressing.

However, unless you are doing it numerous times, it does not change much, as the operation takes less than 2 milliseconds.

Edit: updated the benchmark results after finding an issue in the timeit invocation.

@ofek
Contributor

ofek commented Aug 16, 2025

Thanks! For extra context, the main/only use case for this currently in Hatch is decompressing Python distributions from this project: https://github.com/astral-sh/python-build-standalone

As an example, cpython-3.11.13+20250814-x86_64_v4-unknown-linux-musl-lto-full.tar.zst is 62.5 MB

@Rogdham
Author

Rogdham commented Aug 16, 2025

Using your .tar.zst file as an example, I measured the following timings (running each implementation 50 times):

| Implementation | Average time per run |
| --- | --- |
| zstandard | 1.5027s |
| backports.zstd | 1.5099s (0.5% slower) |

Benchmark code

import shutil
import tarfile as orig_tarfile
from pathlib import Path
from timeit import timeit

import zstandard
from backports.zstd import tarfile as new_tarfile

archive = "cpython-3.11.13+20250814-x86_64_v4-unknown-linux-musl-lto-full.tar.zst"
directory = Path("/tmp/ramdisk").absolute()  # mounted as tmpfs


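def clean():
    # Start from an empty extraction directory.
    shutil.rmtree(directory, ignore_errors=True)
    directory.mkdir()


def run_zstandard():
    # zstandard: wrap the file in a streaming decompressor and pass it to the
    # stdlib tarfile, which reads it as a sequential stream (mode "r|").
    with open(archive, "rb") as ifh:
        dctx = zstandard.ZstdDecompressor()
        with (
            dctx.stream_reader(ifh) as reader,
            orig_tarfile.open(mode="r|", fileobj=reader) as tf,
        ):
            tf.extractall(directory, filter="data")


def run_backportszstd():
    # backports.zstd: its tarfile module handles the "r:zst" mode directly.
    with new_tarfile.open(archive, "r:zst") as tf:
        tf.extractall(directory, filter="data")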


number = 50

# timeit executes `setup` once per call, then runs `stmt` `number` times.
timing = timeit(
    stmt="run()",
    setup="clean()",
    number=number,
    globals={"clean": clean, "run": run_zstandard},
)
print("zstandard", timing / number)

timing = timeit(
    stmt="run()",
    setup="clean()",
    number=number,
    globals={"clean": clean, "run": run_backportszstd},
)
print("backports.zstd", timing / number)
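
The relative difference reported in the timings above can be checked with a couple of lines of arithmetic. This is only a worked example using the two averages quoted in this comment; the function name `percent_slower` is hypothetical.

```python
def percent_slower(baseline: float, candidate: float) -> float:
    """Return how much slower candidate is than baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Averages quoted above: zstandard 1.5027s, backports.zstd 1.5099s.
slowdown = percent_slower(1.5027, 1.5099)
print(f"{slowdown:.1f}% slower")  # prints "0.5% slower"
```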

@ofek
Contributor

ofek commented Aug 16, 2025

That's impressive, and I'm sure there are optimizations yet to be made. Thank you very much!

I will try to review this soon, I've been very busy. Please know that every review I have yet to do and every email I have yet to respond to is a permanent weight on my psyche 😅

Successfully merging this pull request may close these issues.

hatch not using zstandard for python-3.14.0b3, for free-threading compatibility
pip install hatch fails on free-threaded build