Skip to content

Improve fib, prevent regression in parse_int #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Improve fib, prevent regression in parse_int #3

wants to merge 2 commits into from

Conversation

saethlin
Copy link

Motivation:
There's no reason for the Rust version of fib to be any slower than the C one. The only difference I see is that there's a black_box in the Rust one, so let's try to get rid of it.

I modified the implementation of measure_best to put the benchmark results behind a memory allocation which should prevent them from being optimized away. It's janky and criterion would probably be better but it works. This should bring the Rust fib in line with the C one; I'd love to test personally but I don't currently have the time to figure out the system that the Julia repository is using to run them (trying to load libraries from inside the source tree and all that).

This change would cause a regression in parse_int, but switching the assert inside its hot loop to a debug_assert prevents the regression though I'm not entirely certain why.

Copy link
Owner

@Enet4 Enet4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposed changes. However, there are multiple things in consideration that are making me reluctant to merge this. Some of them depend on a general consensus in the main Julia micro-benchmarks repository, others not as much.

Can you provide the raw values which you obtained before and after the change? Not all improvements will be reproducible in another machine.

Here's what I got in mine, they don't seem to be very conclusive regarding the regressions that you mentioned. The speed-up in fib is also very minor.

Before:

rust,iteration_pi_sum,6.077312
rust,matrix_multiply,29.457496
rust,matrix_statistics,6.059977
rust,parse_integers,0.137050
rust,print_to_file,10.579359
rust,recursion_fibonacci,0.035762
rust,recursion_quicksort,0.315747
rust,userfunc_mandelbrot,0.060763

After:

rust,iteration_pi_sum,6.077004
rust,matrix_multiply,29.338795
rust,matrix_statistics,6.062510
rust,parse_integers,0.133108
rust,print_to_file,9.086284
rust,recursion_fibonacci,0.034280
rust,recursion_quicksort,0.311959
rust,userfunc_mandelbrot,0.061079

.map(move |_| {
let t = Instant::now();
op();
t.elapsed()
}).min().unwrap()
}).collect();
*results.iter().min().unwrap()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would seriously wish to know why this prevents loop-invariant code motion. 😮 Even when considering this part as a loop, I wonder what stops the compiler from just moving fib(fibarg) out of the inner loop declared in line 266.

Copy link
Author

@saethlin saethlin Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly suspect it's because there may be a syscall (to do the memory allocation), and LLVM stops trying to analyze at that point.

What concerns me more is that the black_box doesn't work at this place in the code... or doesn't seem to work. At least on my machine if I crank up the value of fibarg to something above 40, the benchmark starts taking longer and longer to run, but the function that attempt to report the fastest run still reports 0.0. Probably worth more investigation.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show how you added black_box to this section of the code? I wouldn't be surprised it these optimizations cannot be avoided at this level. Only once measure_best is monomorphized can the compiler effectively know what can be done. Avoiding unwanted optimizations would require two black_boxes: one to assume nothing about fibarg (so as to call fib every time instead of just once), and another one to assume that f is used (otherwise everything is optimized out).

It might also be worth pointing out that if I fold the vector into the best measurement while manually avoiding the check for an empty iterator, the optimization reappears:

#[inline]
fn measure_best<F: FnMut()>(niters: u32, mut op: F) -> Duration {
    let first = {
        let t = Instant::now();
        op();
        t.elapsed()
    };
    let results: Vec<_> = (0..niters - 1)
        .map(move |_| {
            let t = Instant::now();
            op();
            t.elapsed()
        }).collect();
    results.iter().fold(first, |a, &b| Duration::min(a,b))
}

This might suggest that min().unwrap() is preventing the optimizations, rather than the memory allocation. LLVM is even known to optimize allocations away in some cases, which only makes me more surprised about this.

There's more to it than just the measuring: rustc nightly and GCC 4.8.5 produce different machine codes for fib in isolation (Clang 6 provides the exact same outcome, which is reasonable). This might just be a matter of how well optimized the compiler is.

In the mean time, please try replacing *results.iter().min().unwrap() with results.into_iter().min.unwrap(), which provided a 1.06 speed-up on this end.

src/main.rs Outdated
@@ -275,7 +275,7 @@ fn main() {
let n: u32 = rng.gen();
let s = format!("{:x}", n);
let m = u32::from_str_radix(&s, 16).unwrap();
assert_eq!(m, n);
debug_assert_eq!(m, n);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm nope, we can't do this. At least while the issue 10 is unresolved. There is an inconsistency on how this is done across languages, but since the C version does these checks, it would currently be unfair to change this.

src/main.rs Outdated
@@ -287,7 +287,7 @@ fn main() {
let m = mandelperf();
if j == 0 {
let mandel_sum: u32 = m.iter().sum();
assert_eq!(mandel_sum, 14791);
debug_assert_eq!(mandel_sum, 14791);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this would also be unfair towards the other benchmarks. We'll keep track of issue 10, but this one should be reverted as well.

@saethlin
Copy link
Author

saethlin commented Jun 6, 2018

old:

rust,recursion_fibonacci,0.038156
rust,parse_integers,0.123291
rust,userfunc_mandelbrot,0.054799
rust,recursion_quicksort,0.274005
rust,iteration_pi_sum,19.171578
rust,matrix_statistics,5.798091
rust,matrix_multiply,66.084760
rust,print_to_file,8.200426

new:

rust,recursion_fibonacci,0.028430
rust,parse_integers,0.121543
rust,userfunc_mandelbrot,0.051552
rust,recursion_quicksort,0.258179
rust,iteration_pi_sum,17.944624
rust,matrix_statistics,5.458947
rust,matrix_multiply,24.948180
rust,print_to_file,8.138551

I suggest you just try getting these changes run on the official benchmark machine. Who knows, maybe it will make a difference there.

@Enet4
Copy link
Owner

Enet4 commented Jun 6, 2018

Please consider running the program multiple times and showing the shortest times of each task. The rust,matrix_multiply,66.084760 entry suggests that the machine was spiking there.

@saethlin
Copy link
Author

Doing some housekeeping on my open PRs. No idea how I lost track of this one, but I agree with the feedback. Nothing to see here.

@saethlin saethlin closed this Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants