Use zstandard implementation from stdlib (PEP-784) #2034
It will take me some time to review this, but thanks so much for your efforts on standard library inclusion! Out of curiosity, have you run benchmarks compared to the |
Not yet. This is something I would like to do, but having a meaningful benchmark is kind of hard. So far I have chosen instead to invest my time in PEP-784, its integration into CPython, and backports.zstd. I believe that even if the implementation in CPython (or in the backport) is slower by a huge margin, it would not change much for most applications. Of course, this would depend on the difference and the application. |
Please let me know whenever you come up with even a simple benchmark, I'm very interested! |
This benchmark is a work in progress, but I compared the one-shot compression/decompression of bytes already in memory (in a variable), for different sizes (1 kB/1 MB/1 GB) and different levels (the default of 3 as well as levels with higher compression: 10 and 17). Data used from Each operation was run with
[Image: benchmark results] green means For the use case of However, unless you are doing it numerous times, it does not change much, as the operation takes less than 2 milliseconds.
Edit: updated the benchmark results after finding an issue in the timeit invocation |
Thanks! For extra context, the main/only use case for this currently in Hatch is decompressing Python distributions from this project: https://github.com/astral-sh/python-build-standalone As an example, cpython-3.11.13+20250814-x86_64_v4-unknown-linux-musl-lto-full.tar.zst is 62.5 MB |
Using your
Benchmark code
import shutil
import tarfile as orig_tarfile
from pathlib import Path
from timeit import timeit

import zstandard
from backports.zstd import tarfile as new_tarfile

archive = "cpython-3.11.13+20250814-x86_64_v4-unknown-linux-musl-lto-full.tar.zst"
directory = Path("/tmp/ramdisk").absolute()  # mounted as tmpfs


def clean():
    # Start from an empty extraction directory.
    shutil.rmtree(directory, ignore_errors=True)
    directory.mkdir()


def run_zstandard():
    # Stream-decompress with zstandard and feed the result to the stdlib tarfile.
    with open(archive, "rb") as ifh:
        dctx = zstandard.ZstdDecompressor()
        with (
            dctx.stream_reader(ifh) as reader,
            orig_tarfile.open(mode="r|", fileobj=reader) as tf,
        ):
            tf.extractall(directory, filter="data")


def run_backportszstd():
    # Let the backported tarfile handle zstd decompression itself.
    with new_tarfile.open(archive, "r:zst") as tf:
        tf.extractall(directory, filter="data")


number = 50

# Note: timeit executes `setup` once, before the timed loop, not before each call.
timing = timeit(
    stmt="run()",
    setup="clean()",
    number=number,
    globals={"clean": clean, "run": run_zstandard},
)
print("zstandard", timing / number)

timing = timeit(
    stmt="run()",
    setup="clean()",
    number=number,
    globals={"clean": clean, "run": run_backportszstd},
)
print("backports.zstd", timing / number) |
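A side note on the timing harness above: `timeit.timeit` runs its `setup` statement once per call, so with `number=50` the directory is cleaned only before the first extraction. If a fresh directory per run is wanted, one option is `timeit.repeat` with `number=1`, which re-runs `setup` before every measurement. The sketch below is not what was run in this thread; `clean` and `run` are stand-in functions that only count calls, to show the behaviour without needing an archive.

```python
from timeit import repeat

# Stand-ins for the real clean()/run() functions: they only count calls.
calls = {"setup": 0, "run": 0}


def clean():
    calls["setup"] += 1


def run():
    calls["run"] += 1


# repeat=5, number=1: five measurements, each preceded by one setup call.
timings = repeat(
    stmt="run()",
    setup="clean()",
    repeat=5,
    number=1,
    globals={"clean": clean, "run": run},
)

# Report the mean of the individual measurements.
mean_seconds = sum(timings) / len(timings)
print("setup calls:", calls["setup"], "run calls:", calls["run"])
```

With real extraction functions substituted in, each timed run would start from an empty directory, at the cost of measuring only one extraction per sample.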
That's impressive, and I'm sure there are optimizations yet to be made. Thank you very much! I will try to review this soon, I've been very busy. Please know that every review I have yet to do and every email I have yet to respond to is a permanent weight on my psyche 😅 |
Thanks to PEP-784, Zstandard will be included in Python starting from version 3.14, and also in the `tarfile` module.

So for Python 3.14+, we don't need an external lib. For older versions of Python, I'm using `backports.zstd`.

This also allows removing the Python version check against 3.12+: we can use the `filter` parameter in all cases (the implementation of the `tarfile` module in `backports.zstd` comes from the CPython 3.14 codebase).

As a result, this allowed for a small refactor that improves clarity and reduces redundancy.

I'm expecting the typing tests with `mypy` to fail if checked against version 3.14+ of Python, because the proper type hints for the `tarfile` module (to allow `r:zst` mode) are not in the latest released version of `mypy` yet (they are in `typeshed` and on the `mypy` master branch, so probably in the next release).

Also, `backports.zstd` does not currently support PyPy (but should do before October), so if you do want to work with PyPy I suggest waiting a little bit before merging the PR.
Edit: now supported starting from `backports.zstd` version 0.5.0

Note that I don't know the `hatch` codebase, but did my best to fit in. Feel free to suggest any changes!

Full disclosure: I'm the author and maintainer of `backports.zstd`, and the maintainer of `pyzstd` (whose code was used as a base for the integration into Python). I also helped with PEP-784 and its integration into CPython.

Fixes #1801
Fixes #2007
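
For readers unfamiliar with the approach described above, here is a minimal sketch of the version-dependent import. It is not the code from this PR: the helper names `needs_backport`, `load_tarfile` and `extract_zst` are hypothetical, and it assumes `backports.zstd` is installed on Python versions older than 3.14.

```python
import sys


def needs_backport(version_info=sys.version_info):
    """Return True when the stdlib tarfile lacks zstd support (Python < 3.14)."""
    return tuple(version_info[:2]) < (3, 14)


def load_tarfile():
    """Return a tarfile module that understands the "r:zst" mode."""
    if needs_backport():
        # Assumption: backports.zstd is installed on older interpreters.
        from backports.zstd import tarfile
    else:
        import tarfile
    return tarfile


def extract_zst(archive, directory):
    """Extract a .tar.zst archive, applying the "data" filter in all cases."""
    tarfile = load_tarfile()
    with tarfile.open(archive, "r:zst") as tf:
        tf.extractall(directory, filter="data")
```

Because both modules expose the same `tarfile` API, the calling code does not need a separate branch for the `filter` parameter.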