
Commit 6e38e60

Added notes on memory-bound multithreading
1 parent 92beead commit 6e38e60

6 files changed: +41 -6 lines changed


benchmark/sort.jl

Lines changed: 8 additions & 3 deletions
@@ -32,21 +32,26 @@ println("Using ArrayType: ", ArrayType)
 n = 1_000_000


+# Memory-bound, so not much improvement expected when multithreading
 println("\n===\nBenchmarking sort! on $n UInt32 - Base vs. AK")
 display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))
 display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))


-println("\n===\nBenchmarking sort! on $n Int64 - Base vs. AK")
-display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
-display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
+# Lexicographic sorting of tuples - more complex comparators
+ntup = 5
+println("\n===\nBenchmarking sort! on $n NTuple{$ntup, Int64} - Base vs. AK")
+display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))
+display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))


+# Memory-bound again
 println("\n===\nBenchmarking sort! on $n Float32 - Base vs. AK")
 display(@benchmark Base.sort!(v) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sort!(v) setup=(v = ArrayType(rand(Float32, n))))


+# More complex by=sin
 println("\n===\nBenchmarking sort!(by=sin) on $n Float32 - Base vs. AK")
 display(@benchmark Base.sort!(v, by=sin) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sort!(v, by=sin) setup=(v = ArrayType(rand(Float32, n))))

benchmark/sortperm.jl

Lines changed: 8 additions & 3 deletions
@@ -33,21 +33,26 @@ n = 1_000_000
 ix = ArrayType(ones(Int, n))


+# Memory-bound, so not much improvement expected when multithreading
 println("\n===\nBenchmarking sortperm! on $n UInt32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))
 display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(UInt32(1):UInt32(1_000_000), n))))


-println("\n===\nBenchmarking sortperm! on $n Int64 - Base vs. AK")
-display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
-display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(Int64(1):Int64(1_000_000), n))))
+# Lexicographic sorting of tuples - more complex comparators
+ntup = 5
+println("\n===\nBenchmarking sortperm! on $n NTuple{$ntup, Int64} - Base vs. AK")
+display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))
+display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(NTuple{ntup, Int64}, n))))


+# Memory-bound again
 println("\n===\nBenchmarking sortperm! on $n Float32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sortperm!($ix, v) setup=(v = ArrayType(rand(Float32, n))))


+# More complex by=sin
 println("\n===\nBenchmarking sortperm!(by=sin) on $n Float32 - Base vs. AK")
 display(@benchmark Base.sortperm!($ix, v, by=sin) setup=(v = ArrayType(rand(Float32, n))))
 display(@benchmark AK.sortperm!($ix, v, by=sin) setup=(v = ArrayType(rand(Float32, n))))

src/accumulate/accumulate.jl

Lines changed: 5 additions & 0 deletions
@@ -82,6 +82,11 @@ recommend using the single-array interface (the first one above).
 ## CPU
 Use at most `max_tasks` threads with at least `min_elems` elements per task.

+Note that accumulation is typically a memory-bound operation, so multithreaded accumulation only
+becomes faster when the operation is compute-heavy enough to hide memory latency - that includes:
+- Accumulating more complex types, e.g. accumulation of tuples / structs / strings.
+- More complex operators, e.g. `op=custom_complex_function`.
+
 ## GPU
 For the 1D case (`dims=nothing`), the `alg` can be one of the following:
 - `DecoupledLookback()`: the default algorithm, using opportunistic lookback to reuse earlier
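
To illustrate the note above, here is a minimal sketch contrasting a memory-bound scan with a compute-heavier one. It is illustrative only: `logaddexp` is a hypothetical (but associative) operator, and the `init` keyword is assumed alongside the `max_tasks`/`min_elems` settings named in the docstring.

```julia
# Illustrative sketch only - the init/max_tasks/min_elems keywords are assumed from the
# docstring above, not taken from this commit.
import AcceleratedKernels as AK

n = 1_000_000
v = rand(Float32, n)

# Memory-bound prefix sum: little benefit expected from extra threads
s1 = AK.accumulate(+, v; init=zero(Float32))

# log-sum-exp combine: associative and compute-heavy, so the extra work per element
# can hide memory latency and multithreading has more to gain
logaddexp(a, b) = max(a, b) + log1p(exp(-abs(a - b)))
s2 = AK.accumulate(logaddexp, v; init=-Inf32,
                   max_tasks=Threads.nthreads(), min_elems=10_000)
```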

src/map.jl

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@
 Apply the function `f` to each element of `src` in parallel and store the result in `dst`. The
 CPU and GPU settings are the same as for [`foreachindex`](@ref).

+On CPUs, multithreading only improves performance when the computation is complex enough to hide
+the memory latency and the overhead of spawning tasks - that includes more complex functions and
+less cache-local array access patterns. For compute-bound tasks, performance scales linearly with
+the number of threads.
+
 # Examples
 ```julia
 import Metal
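
A small sketch of the contrast described in the new note, assuming the `AK.map!(f, dst, src)` form implied by the docstring; the anonymous functions are placeholders, not part of this commit.

```julia
# Illustrative sketch - assumes AK.map!(f, dst, src) as implied by the docstring above
import AcceleratedKernels as AK

n = 1_000_000
src = rand(Float32, n)
dst = similar(src)

# Cheap, memory-bound function: multithreading adds little on CPUs
AK.map!(x -> x + 1f0, dst, src)

# Compute-heavier function: enough work per element to hide memory latency
AK.map!(x -> sin(x) * exp(-abs(x)), dst, src)
```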

src/reduce/reduce.jl

Lines changed: 7 additions & 0 deletions
@@ -33,6 +33,13 @@ Use at most `max_tasks` threads with at least `min_elems` elements per task. For
 arrays (`dims::Int`) multithreading currently only becomes faster for `max_tasks >= 4`; all other
 cases are scaling linearly with the number of threads.

+Note that multithreaded reductions only improve performance for more compute-heavy operations,
+which hide the memory latency and thread launch overhead - that includes:
+- Reducing more complex types, e.g. reduction of tuples / structs / strings.
+- More complex operators, e.g. `op=custom_complex_op_function`.
+
+For non-memory-bound operations, reductions scale almost linearly with the number of threads.
+
 ## GPU settings
 The `block_size` parameter controls the number of threads per block.
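
A hedged sketch of the two regimes named in the note, assuming `AK.reduce(op, src; init, ...)` consistent with the CPU settings above; `tmin` is a hypothetical operator chosen to show a more complex reduced type.

```julia
# Illustrative sketch only - the init/max_tasks/min_elems keywords are assumed to match
# the CPU settings in the docstring above.
import AcceleratedKernels as AK

n = 1_000_000

# Memory-bound sum of Float32: limited gains from multithreading
s = AK.reduce(+, rand(Float32, n); init=0.0f0)

# Reducing a more complex type: elementwise minimum over tuples, more work per comparison
v = rand(NTuple{5, Int64}, n)
tmin(a, b) = ntuple(i -> min(a[i], b[i]), 5)
r = AK.reduce(tmin, v; init=ntuple(_ -> typemax(Int64), 5),
              max_tasks=Threads.nthreads(), min_elems=10_000)
```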

src/sort/sort.jl

Lines changed: 8 additions & 0 deletions
@@ -32,10 +32,18 @@ include("cpu_sample_sort.jl")
 Sorts the array `v` in-place using the specified backend. The `lt`, `by`, `rev`, and `order`
 arguments are the same as for `Base.sort`.

+## CPU
 CPU settings: use at most `max_tasks` threads to sort the array such that at least `min_elems`
 elements are sorted by each thread. A parallel [`sample_sort!`](@ref) is used, processing
 independent slices of the array and deferring to `Base.sort!` for the final local sorts.

+Note that the Base Julia `sort!` is mainly memory-bound, so multithreaded sorting only becomes
+faster when the operation is compute-heavy enough to hide memory latency - that includes:
+- Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
+- More complex comparators, e.g. `by=custom_complex_function` or `lt=custom_lt_function`.
+- Less cache-predictable data movement, e.g. `sortperm`.
+
+## GPU
 GPU settings: use `block_size` threads per block to sort the array. A parallel [`merge_sort!`](@ref)
 is used.
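
The benchmark changes in this commit exercise exactly these cases; below is a condensed, illustrative sketch using the `by` and `max_tasks` arguments named in the docstring (not a copy of the benchmark scripts).

```julia
# Illustrative sketch of the cases listed above; keyword names follow the docstring
import AcceleratedKernels as AK

n = 1_000_000

# Memory-bound: plain integer keys, modest multithreading gains expected
AK.sort!(rand(UInt32, n))

# Compute-heavier comparator: extra work per comparison gives threads something to hide
AK.sort!(rand(Float32, n); by=sin)

# Lexicographic tuple sort: more complex comparisons, so better thread scaling
AK.sort!(rand(NTuple{5, Int64}, n); max_tasks=Threads.nthreads())
```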
