Improve fib, prevent regression in parse_int #3

saethlin · 2018-05-22T18:41:23Z

Motivation:
There's no reason for the Rust version of fib to be any slower than the C one. The only difference I see is that there's a black_box in the Rust one, so let's try to get rid of it.

I modified the implementation of measure_best to put the benchmark results behind a memory allocation which should prevent them from being optimized away. It's janky and criterion would probably be better but it works. This should bring the Rust fib in line with the C one; I'd love to test personally but I don't currently have the time to figure out the system that the Julia repository is using to run them (trying to load libraries from inside the source tree and all that).

This change would cause a regression in parse_int, but switching the assert inside its hot loop to a debug_assert prevents the regression though I'm not entirely certain why.

Enet4

Thanks for the proposed changes. However, there are multiple things in consideration that are making me reluctant to merge this. Some of them depend on a general consensus in the main Julia micro-benchmarks repository, others not as much.

Can you provide the raw values which you obtained before and after the change? Not all improvements will be reproducible in another machine.

Here's what I got in mine, they don't seem to be very conclusive regarding the regressions that you mentioned. The speed-up in fib is also very minor.

Before:

rust,iteration_pi_sum,6.077312
rust,matrix_multiply,29.457496
rust,matrix_statistics,6.059977
rust,parse_integers,0.137050
rust,print_to_file,10.579359
rust,recursion_fibonacci,0.035762
rust,recursion_quicksort,0.315747
rust,userfunc_mandelbrot,0.060763

After:

rust,iteration_pi_sum,6.077004
rust,matrix_multiply,29.338795
rust,matrix_statistics,6.062510
rust,parse_integers,0.133108
rust,print_to_file,9.086284
rust,recursion_fibonacci,0.034280
rust,recursion_quicksort,0.311959
rust,userfunc_mandelbrot,0.061079

Enet4 · 2018-05-23T10:42:19Z

src/main.rs

        .map(move |_| {
            let t = Instant::now();
            op();
            t.elapsed()
-        }).min().unwrap()
+        }).collect();
+    *results.iter().min().unwrap()


I would seriously wish to know why this prevents loop-invariant code motion. 😮 Even when considering this part as a loop, I wonder what stops the compiler from just moving fib(fibarg) out of the inner loop declared in line 266.

I strongly suspect it's because there may be a syscall (to do the memory allocation), and LLVM stops trying to analyze at that point.

What concerns me more is that the black_box doesn't work at this place in the code... or doesn't seem to work. At least on my machine if I crank up the value of fibarg to something above 40, the benchmark starts taking longer and longer to run, but the function that attempt to report the fastest run still reports 0.0. Probably worth more investigation.

Can you show how you added black_box to this section of the code? I wouldn't be surprised it these optimizations cannot be avoided at this level. Only once measure_best is monomorphized can the compiler effectively know what can be done. Avoiding unwanted optimizations would require two black_boxes: one to assume nothing about fibarg (so as to call fib every time instead of just once), and another one to assume that f is used (otherwise everything is optimized out).

It might also be worth pointing out that if I fold the vector into the best measurement while manually avoiding the check for an empty iterator, the optimization reappears:

#[inline] fn measure_best<F: FnMut()>(niters: u32, mut op: F) -> Duration { let first = { let t = Instant::now(); op(); t.elapsed() }; let results: Vec<_> = (0..niters - 1) .map(move |_| { let t = Instant::now(); op(); t.elapsed() }).collect(); results.iter().fold(first, |a, &b| Duration::min(a,b)) }

This might suggest that min().unwrap() is preventing the optimizations, rather than the memory allocation. LLVM is even known to optimize allocations away in some cases, which only makes me more surprised about this.

There's more to it than just the measuring: rustc nightly and GCC 4.8.5 produce different machine codes for fib in isolation (Clang 6 provides the exact same outcome, which is reasonable). This might just be a matter of how well optimized the compiler is.

In the mean time, please try replacing *results.iter().min().unwrap() with results.into_iter().min.unwrap(), which provided a 1.06 speed-up on this end.

Enet4 · 2018-05-23T10:55:07Z

src/main.rs

@@ -275,7 +275,7 @@ fn main() {
            let n: u32 = rng.gen();
            let s = format!("{:x}", n);
            let m = u32::from_str_radix(&s, 16).unwrap();
-            assert_eq!(m, n);
+            debug_assert_eq!(m, n);


Hmm nope, we can't do this. At least while the issue 10 is unresolved. There is an inconsistency on how this is done across languages, but since the C version does these checks, it would currently be unfair to change this.

Enet4 · 2018-05-23T10:59:20Z

src/main.rs

@@ -287,7 +287,7 @@ fn main() {
            let m = mandelperf();
            if j == 0 {
                let mandel_sum: u32 = m.iter().sum();
-                assert_eq!(mandel_sum, 14791);
+                debug_assert_eq!(mandel_sum, 14791);


Again, this would also be unfair towards the other benchmarks. We'll keep track of issue 10, but this one should be reverted as well.

saethlin · 2018-06-06T02:38:09Z

old:

rust,recursion_fibonacci,0.038156
rust,parse_integers,0.123291
rust,userfunc_mandelbrot,0.054799
rust,recursion_quicksort,0.274005
rust,iteration_pi_sum,19.171578
rust,matrix_statistics,5.798091
rust,matrix_multiply,66.084760
rust,print_to_file,8.200426

new:

rust,recursion_fibonacci,0.028430
rust,parse_integers,0.121543
rust,userfunc_mandelbrot,0.051552
rust,recursion_quicksort,0.258179
rust,iteration_pi_sum,17.944624
rust,matrix_statistics,5.458947
rust,matrix_multiply,24.948180
rust,print_to_file,8.138551

I suggest you just try getting these changes run on the official benchmark machine. Who knows, maybe it will make a difference there.

Enet4 · 2018-06-06T10:23:16Z

Please consider running the program multiple times and showing the shortest times of each task. The rust,matrix_multiply,66.084760 entry suggests that the machine was spiking there.

saethlin · 2022-01-13T22:32:25Z

Doing some housekeeping on my open PRs. No idea how I lost track of this one, but I agree with the feedback. Nothing to see here.

perf cleanup- memory instead of black_box, assert->debug_assert

ce7bd56

Enet4 requested changes May 23, 2018

View reviewed changes

assert always

2f6bffd

saethlin closed this Jan 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve fib, prevent regression in parse_int #3

Improve fib, prevent regression in parse_int #3

saethlin commented May 22, 2018

Enet4 left a comment

Enet4 May 23, 2018

saethlin Jun 6, 2018 •

edited

Loading

Enet4 Jun 6, 2018

Enet4 May 23, 2018

Enet4 May 23, 2018

saethlin commented Jun 6, 2018

Enet4 commented Jun 6, 2018

saethlin commented Jan 13, 2022

Improve fib, prevent regression in parse_int #3

Improve fib, prevent regression in parse_int #3

Conversation

saethlin commented May 22, 2018

Enet4 left a comment

Choose a reason for hiding this comment

Enet4 May 23, 2018

Choose a reason for hiding this comment

saethlin Jun 6, 2018 • edited Loading

Choose a reason for hiding this comment

Enet4 Jun 6, 2018

Choose a reason for hiding this comment

Enet4 May 23, 2018

Choose a reason for hiding this comment

Enet4 May 23, 2018

Choose a reason for hiding this comment

saethlin commented Jun 6, 2018

Enet4 commented Jun 6, 2018

saethlin commented Jan 13, 2022

saethlin Jun 6, 2018 •

edited

Loading