Releases: ashvardanian/StringZilla

Release v3.12.5

18 Apr 20:43

Release: v3.12.5 [skip ci]

Patch

  • Make: Upgrade to newer CMake (25224bb)

Release v3.12.4

12 Apr 16:11

Release: v3.12.4 [skip ci]

Patch

  • Fix: Long input tail in sz_copy_avx512 (#221) (18f04e7)

Release v3.12.3

09 Mar 05:47

Release: v3.12.3 [skip ci]

Patch

  • Improve: C++ Lifetime bounds (caf5650)

Release v3.12.2

02 Mar 11:08

Release: v3.12.2 [skip ci]

Patch

Release v3.12.1

26 Feb 19:30

Release: v3.12.1 [skip ci]

Patch

GoLang support in StringZilla v3.12 🥳

23 Feb 16:32

Together with @MarkReedZ, we've added basic GoLang bindings to StringZilla, and they look surprisingly fast compared to native GoLang strings. We currently use the new cgo annotations available in Go 1.24:

Cgo has gained new capabilities in Go 1.24, supporting new C function annotations to improve runtime performance. Among them, #cgo noescape cFunctionName informs the compiler that the memory passed to cFunctionName will not escape; #cgo nocallback cFunctionName indicates that this C function will not call back into any Go functions. In addition, Cgo's inspection of multiple incompatible declarations of C functions has become more stringent. When different files contain incompatible declarations, the errors are now detected and reported more promptly and accurately.

I was using an Intel Sapphire Rapids machine on AWS for preliminary testing and benchmarking. I've precompiled StringZilla with dynamic dispatch enabled, linked to the thin GoLang binding layer:

~/StringZilla/golang$ CGO_CFLAGS="-I$(pwd)/../include" \
        CGO_LDFLAGS="-L$(pwd)/../build_golang -lstringzilla_shared" \
        LD_LIBRARY_PATH="$(pwd)/../build_golang:$LD_LIBRARY_PATH" \
        go run ../scripts/bench.go --input ../leipzig1M.txt --split lines --seed 42

... and compared to native GoLang strings on some key operations:

Benchmarking on `../leipzig1M.txt` with seed 42.
Total input length: 129644797
Total lines: 1000000
Average line length: 128.64
Running benchmark using `testing.Benchmark`.
strings.Contains              :      309           3818144 ns/op
sz.Contains                   :      664           1881251 ns/op
strings.Index                 :      325           3669081 ns/op
sz.Index                      :      624           1990093 ns/op
strings.LastIndex             :       12          85201713 ns/op
sz.LastIndex                  :      494           2306318 ns/op
strings.IndexAny              :  6321228             181.0 ns/op
sz.IndexAny                   : 10608960             112.6 ns/op
strings.Count                 :      156           8015292 ns/op
sz.Count (non-overlap)        :      285           4206698 ns/op
sz.Count (overlap)            :      284           4204370 ns/op
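To read those numbers as speedups, divide the `strings` ns/op figure by the matching `sz` one. The sketch below does just that with values copied from the output above; the `Result` type and `Speedup` helper are illustrative names, not part of the benchmark script.

```go
package main

import "fmt"

// Result pairs the two ns/op figures for one operation.
type Result struct {
	Operation string
	StdNs     float64 // ns/op for the standard `strings` package
	SzNs      float64 // ns/op for the StringZilla binding
}

// Speedup returns how many times less time the StringZilla call took.
func Speedup(r Result) float64 { return r.StdNs / r.SzNs }

// Values copied from the benchmark output; Count uses the non-overlapping row.
var results = []Result{
	{"Contains", 3818144, 1881251},
	{"Index", 3669081, 1990093},
	{"LastIndex", 85201713, 2306318},
	{"IndexAny", 181.0, 112.6},
	{"Count", 8015292, 4206698},
}

func main() {
	for _, r := range results {
		fmt.Printf("%-10s %6.2fx\n", r.Operation, Speedup(r))
	}
}
```

On this run, most operations land between roughly 1.6x and 2x, while `LastIndex` stands out at about 37x.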

So if you are processing a lot of text in Go, try doing so with StringZilla and stay tuned for the upcoming 4.0 release #201 🥳

Release v3.11.3

26 Dec 19:44

Release: v3.11.3 [skip ci]

Patch

  • Improve: Pointer casting rules (a3f2f00)
  • Docs: LLVM build instruction (89be0cb)

Release v3.11.2

19 Dec 11:30

Release: v3.11.2 [skip ci]

Patch

v3.11.1: Matching N3322 for `memcpy` UB in C2y

11 Dec 14:46

Release: v3.11.1 [skip ci]

Patch

v3.11.0: Checksums in AVX-512, AVX2, NEON

01 Dec 10:11
  • 🆕 sz_checksum(char const *, size_t) C99 interface
  • 🆕 sz::str().checksum() C++11 interface
  • 🆕 sz.checksum(str) Python interface

Database and other systems engineers, you can now use StringZilla to dynamically dispatch different checksum kernels for AVX2-capable Haswell+ CPUs, AVX-512BW-capable Ice Lake+ CPUs, and Arm NEON CPUs on mobile. The AVX-512 kernels use masked loads extensively, resulting in a 10% improvement even on typical English words averaging 5 bytes in length, and a 20x improvement over the serial code on longer strings.

On the technical side, on x86, the kernels use the well-known SAD(text, zeros) idiom to accumulate absolute differences between individual bytes into 64-bit words. They also use bidirectional traversal to saturate the core, which can perform 2 loads per CPU cycle. Moreover, on large inputs, they switch to streaming loads, handling the head and the tail separately, similar to our memcpy alternative, which also outperforms LibC on AVX-512-capable machines 😎
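For intuition, here is a scalar sketch of the same idea in Go. It assumes the checksum is a plain sum of unsigned byte values, which is what a SAD against zeros yields. The function names are illustrative, and this is neither the SIMD kernel nor the library's binding.

```go
package main

import "fmt"

// ChecksumSerial adds up all byte values front to back.
func ChecksumSerial(text []byte) uint64 {
	var sum uint64
	for _, b := range text {
		sum += uint64(b)
	}
	return sum
}

// ChecksumBidirectional walks from both ends toward the middle,
// keeping two independent accumulators so that the two loads per
// iteration do not depend on each other.
func ChecksumBidirectional(text []byte) uint64 {
	var head, tail uint64
	i, j := 0, len(text)-1
	for i < j {
		head += uint64(text[i])
		tail += uint64(text[j])
		i++
		j--
	}
	if i == j { // odd length: one byte remains in the middle
		head += uint64(text[i])
	}
	return head + tail
}

func main() {
	text := []byte("hello")
	fmt.Println(ChecksumSerial(text), ChecksumBidirectional(text)) // prints 532 532
}
```

The SIMD kernels apply the same two-ended pattern to whole vectors rather than single bytes; the scalar version only shows why both traversals give the same result.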

Minor

Patch

  • Docs: Simpler Python doc-strings (ad5fa2c)
  • Fix: sz_checksum visibility (9bec0eb)
  • Fix: Missing _mm_cvtsi128_si64x in Clang (c8c6c7c)