
Commit 6e38e60

Added notes on memory-bound multithreading
1 parent 92beead commit 6e38e60

6 files changed: +41 -6 lines changed


benchmark/sort.jl

Lines changed: 8 additions & 3 deletions
@@ -32,21 +32,26 @@ println("Using ArrayType: ", ArrayType)
 n = 1_000_000


+# Memory-bound, so not much improvement expected when multithreading
 println("\n===\nBenchmarking sort! on $n UInt32 - Base vs. AK")
 display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))
 display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))


-println("\n===\nBenchmarking sort! on $n Int64 - Base vs. AK")
-display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
-display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
+# Lexicographic sorting of tuples - more complex comparators
+ntup = 5
+println("\n===\nBenchmarking sort! on $n NTuple{$ntup, Int64} - Base vs. AK")
+display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))
+display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))


+# Memory-bound again
 println("\n===\nBenchmarking sort! on $n Float32 - Base vs. AK")
 display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(Float32, n))))


+# More complex by=sin
 println("\n===\nBenchmarking sort!(by=sin) on $n Float32 - Base vs. AK")
 display(@benchmark Base.sort!(v, by=sin) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sort!(v, by=sin) setup=(v = ArrayType(rand(Float32, n))))

benchmark/sortperm.jl

Lines changed: 8 additions & 3 deletions
@@ -33,21 +33,26 @@ n = 1_000_000
 ix = ArrayType(ones(Int, n))


+# Memory-bound, so not much improvement expected when multithreading
 println("\n===\nBenchmarking sortperm! on $n UInt32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))
 display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))


-println("\n===\nBenchmarking sortperm! on $n Int64 - Base vs. AK")
-display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
-display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
+# Lexicographic sorting of tuples - more complex comparators
+ntup = 5
+println("\n===\nBenchmarking sortperm! on $n NTuple{$ntup, Int64} - Base vs. AK")
+display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))
+display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))


+# Memory-bound again
 println("\n===\nBenchmarking sortperm! on $n Float32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(Float32, n))))


+# More complex by=sin
 println("\n===\nBenchmarking sortperm!(by=sin) on $n Float32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v, by=sin) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sortperm!($ix, v, by=sin) setup=(v = ArrayType(rand(Float32, n))))

src/accumulate/accumulate.jl

Lines changed: 5 additions & 0 deletions
@@ -82,6 +82,11 @@ recommend using the single-array interface (the first one above).
 ## CPU
 Use at most `max_tasks` threads with at least `min_elems` elements per task.

+Note that accumulation is typically a memory-bound operation, so multithreaded accumulation only
+becomes faster when the operation is compute-heavy enough to hide memory latency - that includes:
+- Accumulating more complex types, e.g. accumulation of tuples / structs / strings.
+- More complex operators, e.g. `op=custom_complex_function`.
+
 ## GPU
 For the 1D case (`dims=nothing`), the `alg` can be one of the following:
 - `DecoupledLookback()`: the default algorithm, using opportunistic lookback to reuse earlier
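
To illustrate the note above, here is a minimal sketch contrasting a memory-bound scan with a compute-heavier one. It is illustrative only: `logaddexp` is a hypothetical (but associative) operator, and the `init` keyword is assumed alongside the `max_tasks`/`min_elems` settings named in the docstring.

```julia
# Illustrative sketch only - the init/max_tasks/min_elems keywords are assumed from the
# docstring above, not taken from this commit.
import AcceleratedKernels as AK

n = 1_000_000
v = rand(Float32, n)

# Memory-bound prefix sum: little benefit expected from extra threads
s1 = AK.accumulate(+, v; init=zero(Float32))

# log-sum-exp combine: associative and compute-heavy, so the extra work per element
# can hide memory latency and multithreading has more to gain
logaddexp(a, b) = max(a, b) + log1p(exp(-abs(a - b)))
s2 = AK.accumulate(logaddexp, v; init=-Inf32,
                   max_tasks=Threads.nthreads(), min_elems=10_000)
```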

src/map.jl

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@
 Apply the function `f` to each element of `src` in parallel and store the result in `dst`. The
 CPU and GPU settings are the same as for [`foreachindex`](@ref).

+On CPUs, multithreading only improves performance when the computation is complex enough to hide
+the memory latency and the overhead of spawning tasks - that includes more complex functions and
+less cache-local array access patterns. For compute-bound tasks, performance scales linearly with
+the number of threads.
+
 # Examples
 ```julia
 import Metal
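
A small sketch of the contrast described in the new note, assuming the `AK.map!(f, dst, src)` form implied by the docstring; the anonymous functions are placeholders, not part of this commit.

```julia
# Illustrative sketch - assumes AK.map!(f, dst, src) as implied by the docstring above
import AcceleratedKernels as AK

n = 1_000_000
src = rand(Float32, n)
dst = similar(src)

# Cheap, memory-bound function: multithreading adds little on CPUs
AK.map!(x -> x + 1f0, dst, src)

# Compute-heavier function: enough work per element to hide memory latency
AK.map!(x -> sin(x) * exp(-abs(x)), dst, src)
```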

src/reduce/reduce.jl

Lines changed: 7 additions & 0 deletions
@@ -33,6 +33,13 @@ Use at most `max_tasks` threads with at least `min_elems` elements per task. For
 arrays (`dims::Int`) multithreading currently only becomes faster for `max_tasks >= 4`; all other
 cases are scaling linearly with the number of threads.

+Note that multithreaded reductions only improve performance for more compute-heavy operations,
+which hide the memory latency and thread launch overhead - that includes:
+- Reducing more complex types, e.g. reduction of tuples / structs / strings.
+- More complex operators, e.g. `op=custom_complex_op_function`.
+
+For non-memory-bound operations, reductions scale almost linearly with the number of threads.
+
 ## GPU settings
 The `block_size` parameter controls the number of threads per block.
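
A hedged sketch of the two regimes named in the note, assuming `AK.reduce(op, src; init, ...)` consistent with the CPU settings above; `tmin` is a hypothetical operator chosen to show a more complex reduced type.

```julia
# Illustrative sketch only - the init/max_tasks/min_elems keywords are assumed to match
# the CPU settings in the docstring above.
import AcceleratedKernels as AK

n = 1_000_000

# Memory-bound sum of Float32: limited gains from multithreading
s = AK.reduce(+, rand(Float32, n); init=0.0f0)

# Reducing a more complex type: elementwise minimum over tuples, more work per comparison
v = rand(NTuple{5, Int64}, n)
tmin(a, b) = ntuple(i -> min(a[i], b[i]), 5)
r = AK.reduce(tmin, v; init=ntuple(_ -> typemax(Int64), 5),
              max_tasks=Threads.nthreads(), min_elems=10_000)
```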

src/sort/sort.jl

Lines changed: 8 additions & 0 deletions
@@ -32,10 +32,18 @@ include("cpu_sample_sort.jl")
 Sorts the array `v` in-place using the specified backend. The `lt`, `by`, `rev`, and `order`
 arguments are the same as for `Base.sort`.

+## CPU
 CPU settings: use at most `max_tasks` threads to sort the array such that at least `min_elems`
 elements are sorted by each thread. A parallel [`sample_sort!`](@ref) is used, processing
 independent slices of the array and deferring to `Base.sort!` for the final local sorts.

+Note that the Base Julia `sort!` is mainly memory-bound, so multithreaded sorting only becomes
+faster when the operation is compute-heavy enough to hide memory latency - that includes:
+- Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
+- More complex comparators, e.g. `by=custom_complex_function` or `lt=custom_lt_function`.
+- Less cache-predictable data movement, e.g. `sortperm`.
+
+## GPU
 GPU settings: use `block_size` threads per block to sort the array. A parallel [`merge_sort!`](@ref)
 is used.
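
The benchmark changes in this commit exercise exactly these cases; below is a condensed, illustrative sketch using the `by` and `max_tasks` arguments named in the docstring (not a copy of the benchmark scripts).

```julia
# Illustrative sketch of the cases listed above; keyword names follow the docstring
import AcceleratedKernels as AK

n = 1_000_000

# Memory-bound: plain integer keys, modest multithreading gains expected
AK.sort!(rand(UInt32, n))

# Compute-heavier comparator: extra work per comparison gives threads something to hide
AK.sort!(rand(Float32, n); by=sin)

# Lexicographic tuple sort: more complex comparisons, so better thread scaling
AK.sort!(rand(NTuple{5, Int64}, n); max_tasks=Threads.nthreads())
```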
