
Add Rust microbenchmark #1


Merged: 2 commits into JuliaLang:master, May 22, 2018

Conversation

@Enet4 (Contributor) commented Apr 17, 2018

This PR adds a Rust implementation of the benchmark. This contribution was originally in JuliaLang/julia#22433, but the merge was postponed until the microbenchmarks were migrated here. It should be ready to go now. 👍

The necessary preparation steps for the benchmark machine:

  1. Install Rustup
  2. Add the toolchain mentioned in the rust-toolchain file: rustup toolchain add nightly-2018-04-16

Then: make benchmark/rust.csv

@Enet4 force-pushed the rust_microbenchmark branch from b211a62 to f5167da on April 17, 2018 18:50
let k = k as f64;
1. / (k * k)
})
.sum();
Member

This code doesn't seem reasonable to me. Wouldn't it be faster and simpler to just write a loop to do the summation in Rust (since it is a compiled statically typed language, ordinary loops should be fast), rather than allocating a temporary array of summands and then calling .sum()?

Reply:

There is no allocation there: the abstractions in Rust have zero cost. This is the idiomatic way to write this code in Rust.

@Enet4 (Contributor, author) Apr 18, 2018

The Iterator API is a very idiomatic part of Rust, and it very often incurs no performance penalty. It is usually not worth rewriting this into plain loops (even index-range loops such as for i in 0..10 to traverse collections are discouraged). Also note that iterator adaptors such as map and flat_map are lazy, and do not allocate any extra memory.
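
To make the zero-cost claim concrete, here is a minimal sketch (not code from this PR) comparing the iterator pipeline with an explicit loop over a pisum-style kernel; with optimizations on, both compile to essentially the same machine code, and neither allocates:

```rust
fn sum_iter(n: u64) -> f64 {
    (1..=n)
        .map(|k| {
            let k = k as f64;
            1.0 / (k * k)
        })
        .sum() // folds the lazily produced values; no intermediate array
}

fn sum_loop(n: u64) -> f64 {
    let mut s = 0.0;
    for k in 1..=n {
        let k = k as f64;
        s += 1.0 / (k * k);
    }
    s
}

fn main() {
    // Both versions sum in the same order, so the results coincide.
    assert!((sum_iter(10_000) - sum_loop(10_000)).abs() < 1e-12);
}
```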

Member

Thanks for the clarification; I didn't realize that map was lazy in Rust.

rust/src/main.rs (outdated)
})
.take(80)
.take_while(|z| z.norm() <= 2.0)
.count() as u32
Member

Is this kind of iterator-composition code really the fastest way to do it in Rust? Again, wouldn't you just write a loop?

Member

I can't speak to this code in particular, or whether the style is ideal for performance, but I have noticed that this general style is very common in Rust.

@real-felix Apr 18, 2018

Functional-style iteration is the idiomatic way to do this. In some cases it is even faster than a procedural loop, because there are no bounds checks.

@Enet4 (Contributor, author)

As already mentioned by @Boiethios, this is pretty idiomatic and performant. Previously, I had a version of this function without extensive use of the Iterator API (here), and the subsequent changes to what we have now actually improved performance a tiny bit.
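
For illustration, here is a self-contained sketch of this escape-time style (assuming the num_complex crate; std::iter::successors is a modern stand-in, not necessarily what the PR uses):

```rust
use num_complex::Complex64;

// Count how many iterates of z -> z^2 + c stay within the escape
// radius, capped at 80. The pipeline is lazy: no collection is built.
fn mandel(c: Complex64) -> u32 {
    std::iter::successors(Some(c), |z| Some(z * z + c))
        .take(80)
        .take_while(|z| z.norm() <= 2.0)
        .count() as u32
}

fn main() {
    // The origin never escapes, so it hits the 80-iteration cap.
    assert_eq!(mandel(Complex64::new(0.0, 0.0)), 80);
}
```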

}

fn fib(n: i32) -> i32 {
let n = black_box(n); // prevent over-optimization
Member

What's going on here?

@Enet4 (Contributor, author)

That is there to prevent the compiler from doing invariant code motion: it would recognize that the function is called with the same argument on every iteration, and choose to memoize the output, which would distort the whole benchmark. black_box hints to LLVM that it should make no assumptions about the value, thus forcing it to always compute fib.
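
For context, a minimal sketch of the pattern (black_box lived in the unstable test crate at the time, hence the nightly toolchain; today std::hint::black_box plays the same role):

```rust
#![feature(test)] // nightly-only at the time of this PR
extern crate test;
use test::black_box;

fn fib(n: i32) -> i32 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    for _ in 0..100 {
        // black_box makes the argument opaque, so LLVM cannot prove that
        // fib is called with the same value on every iteration and hoist
        // the call out of the loop.
        let n = black_box(20);
        black_box(fib(n)); // also keeps the result observably "used"
    }
}
```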

@vks

#[inline(never)] might work as well.

@Enet4 (Contributor, author) May 22, 2018

@vks I'm afraid not; I just tried that too. What makes the compiler perform invariant code motion is the perception that the pure function is being called with the exact same argument value across the iterations, so the compiler can simply move the call out of the loop. This is orthogonal to whether the function is inlined.

Volatile argument reads would force fibarg to be read inside the loop without making additional assumptions about its value, thus preventing invariant code motion.
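
A sketch of what that volatile-read alternative might look like on a stable toolchain (hypothetical code, not part of this PR):

```rust
use std::ptr;

fn fib(n: i32) -> i32 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    let fibarg: i32 = 20;
    let mut sink = 0;
    for _ in 0..100 {
        // The volatile read forces fibarg to be re-read on each pass, so
        // the compiler cannot treat the argument as loop-invariant...
        let n = unsafe { ptr::read_volatile(&fibarg) };
        // ...and the volatile write keeps the result from being
        // dead-code-eliminated.
        unsafe { ptr::write_volatile(&mut sink, fib(n)) };
    }
    println!("{}", sink);
}
```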

let a = nrand((n, n), &mut rng);
let b = nrand((n, n), &mut rng);
let c = nrand((n, n), &mut rng);
let d = nrand((n, n), &mut rng);
Member

Wouldn't performance-conscious code allocate the arrays once and then call fill_rand in the loop?

@Enet4 (Contributor, author) Apr 18, 2018

Granted, this function can be slightly boosted. Moving the array allocations to the beginning improves the benchmark score (roughly from 6.15 to 5.85 in raw time measurements). If I recall correctly, I did this because the Julia implementation also declares these arrays inside the loop.
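
For reference, a sketch of the allocate-once restructuring under discussion (fill_rand is a hypothetical helper here, and the rand calls are illustrative):

```rust
use rand::Rng;

// Hypothetical helper: overwrite an existing buffer with fresh samples.
fn fill_rand<R: Rng>(buf: &mut [f64], rng: &mut R) {
    for x in buf.iter_mut() {
        *x = rng.gen();
    }
}

fn main() {
    let n = 5;
    let mut rng = rand::thread_rng();
    // Allocate the buffer once, outside the hot loop...
    let mut a = vec![0.0f64; n * n];
    for _ in 0..1000 {
        // ...and only refill it on each trial.
        fill_rand(&mut a, &mut rng);
        // ... run the randmatstat trial on `a` ...
    }
}
```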

Member

Okay.

This is kind of a silly benchmark anyway, because it mostly measures the speed of the random-number generation and matmul libraries.

@Enet4 (Contributor, author)

Indeed! The same can be said about parse_integers, which measures the RNG and number printing alongside the actual number parsing.

@Enet4 (Contributor, author) commented Apr 19, 2018

A small but important update: the mandel function was modified to compare the complex value's squared norm with 4, rather than the value's norm with 2. This seems to be a fair optimization, since the other implementations are doing the same thing. The performance of this particular task is now on par with the Julia implementation.
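
The change boils down to the equivalence |z| <= 2 iff |z|^2 <= 4, where norm_sqr avoids the square root that norm computes (sketch assuming num_complex):

```rust
use num_complex::Complex64;

// Same predicate as z.norm() <= 2.0, but without the square root.
fn still_bounded(z: &Complex64) -> bool {
    z.norm_sqr() <= 4.0
}

fn main() {
    let z = Complex64::new(1.2, 1.5);
    assert_eq!(still_bounded(&z), z.norm() <= 2.0);
}
```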

@johnfgibson (Contributor)

The necessary preparation steps for the benchmark machine:

  1. Install Rustup
  2. Add the toolchain mentioned in the rust-toolchain file: rustup toolchain add nightly-2018-04-16

How about rustup toolchain add stable instead? That would be more consistent with the Julia versioning. I've been running the benchmarks and posting the data only when Julia updates its stable version: 0.6.0, 0.6.1, 0.6.2, etc.

@Enet4 (Contributor, author) commented May 15, 2018

How about rustup toolchain add stable instead?

As previously mentioned here, the project relies on the test crate for the black_box function, which is currently only available in a nightly Rust compiler. Stabilization of this function is unlikely to happen soon; it is an ongoing concern without a trivial solution.

The only way to make this work with a stable toolchain is to replace black_box with something that also prevents undesirable optimizations (namely the invariant code motion seen in fib and a few others). From the experiments I made some months ago, this seemed to be possible with volatile reads and writes. I can look into this again soon.

I've been running the benchmarks and posting the data only when Julia updates its stable version: 0.6.0, 0.6.1, 0.6.2, etc.

Would the other microbenchmarks also be updated in the process? Rust has a regular 6-week release cycle, which means that pointing this project to the stable toolchain will also lead to iteratively updated versions of the compiler.

@StefanKarpinski (Member) commented May 15, 2018

I think we should go ahead and merge this soon without needing to debate all the fine details of how we get and update the rust compiler, otherwise I fear this may languish and never get merged. Of course, we can open a separate issue about how the rust compiler should be acquired and run.

@johnfgibson (Contributor)

Sounds good to me. This and #2 look good for merging. I'll use the Rust nightly for now and aim to include Rust data in the julialang.org table and plot when julia-0.7.0 drops.

@StefanKarpinski (Member)

@johnfgibson, I've given you write access to this repo, so feel free to merge PRs as they seem ready. Let me know if there are any permission issues and we'll get them sorted out.

@johnfgibson (Contributor)

The Rust code runs fine and produces sensible results. I don't understand Rust well enough to vouch for the code's correctness & consistency with other languages. I see @stevengj reviewed to some extent above. I'll plan to merge in a day or two if I don't hear any objections.

Rust results, with julia-0.7.0-DEV.5096 (7-day-old master):

[benchmarks plot]

cputime	lang
1.000	C
1.080	LuaJIT
1.089	Rust
1.186	Julia
1.450	Go
1.592	Fortran
2.929	Java
4.133	JavaScript
9.964	Matlab
14.072	Mathematica
16.794	Python
69.365	R
515.363	Octave

@Enet4 (Contributor, author) commented May 21, 2018

I haven't run the benchmarks with every language available, but they seem to be roughly in line with my local run:

perf lang
1.000 C
1.060 Rust
1.180 Julia

Feel free to ask about any other details of the code. I noticed that the readme also contains a list of languages, so maybe I could update that as well? The PR might also be updated in time for the decisions made in #10.

@waldyrious (Contributor)

@johnfgibson, just to be sure: will the issue with the Mathematica label, seen in your comment above, be present in the final rendered image?

@johnfgibson (Contributor)

@waldyrious: No, that Mathematica label problem arose in an Inkscape SVG -> PNG conversion needed to post the image to this thread. The SVG that'll appear on julialang.org is just fine.

@eira-fransham

Is there a reason that you use vectors instead of arrays on the stack? All the Vecs in directblas::randmatstat could be turned into fixed-size arrays.

@StefanKarpinski (Member)

@Vurich, if I'm understanding correctly (it's quite possible that I'm not, since I'm not a Rust programmer), that would be comparable to using StaticArrays in Julia. That's a legitimate optimization, but in Julia it goes beyond the basics. In Rust, how advanced of a change would that be?

@eira-fransham commented May 22, 2018

@StefanKarpinski I would actually consider using Vec an antipattern when you know the length of an array is constant; fixed-size arrays express intent much better. I think it would just be a case of turning let n = 5 into const N: usize = 5 and let foo = vec![val; something] into let foo = [val; something].
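
A minimal sketch of that suggestion (illustrative names, not the PR's code):

```rust
const N: usize = 5;

fn main() {
    // Heap-allocated: the length is a run-time value.
    let v: Vec<f64> = vec![0.0; N * N];

    // Stack-allocated: the length is part of the type and no allocator
    // call is involved.
    let a: [f64; N * N] = [0.0; N * N];

    // Both coerce to &[f64], so slice-taking code accepts either.
    assert_eq!(v.as_slice(), &a[..]);
}
```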

@Enet4 (Contributor, author) commented May 22, 2018

I went for dynamic allocations because the C version is doing the same thing. At first glance, the Julia benchmark isn't using StaticArrays either, right? I agree that the avoidance of heap allocations is a plus, but I would have to ensure that this is a fair optimization.

In practice, changing this should be trivial: fixed-size arrays deref to a slice just the same, and ndarray views can be created from arbitrary slices. However, since Rust doesn't have const generics yet, I would have to either hardcode the outputs of nrand to the intended dimensions or just construct the arrays inline.
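
For instance, a sketch of building an ndarray view over a stack-allocated array (assumes the ndarray crate; names are illustrative):

```rust
use ndarray::ArrayView2;

const N: usize = 5;

fn main() {
    let data: [f64; N * N] = [0.0; N * N];
    // The view borrows the existing memory; no heap allocation here.
    let view = ArrayView2::from_shape((N, N), &data).unwrap();
    assert_eq!(view.dim(), (N, N));
}
```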

@johnfgibson (Contributor) commented May 22, 2018

Hmm, I'd assumed the main point of randmatstat was to test the speed of tr((P'*P)^4) on large matrices. But it's n x n for n=5. Fixed-size arrays seem natural for that. Even so, the microbenchmarks aim to test identical algorithms in many languages. I'm inclined to think doing randmatstat with static arrays where possible (C, Fortran, Julia, Rust...) and heap arrays where not (Python, Matlab, Java, ...?) would not be in the spirit of identical algorithms.

Unless the benchmark code is the algorithm and the array implementation is just infrastructure...

@StefanKarpinski (Member) commented May 22, 2018

The point of the way these were written is that in library code you don't generally know that you're allocating a specific array size: you have to make your library code generic across different sizes and do dynamic allocations. Sure, in this specific case you can do static allocations, but you can also just replace the entire benchmark computation with a constant, so drawing some lines about what is or is not "allowed" is in the nature of benchmarking. The line is currently drawn at "everyone does dynamic memory allocation of non-fixed-size arrays". To a significant extent, this is to give the Pythons and Rs in the comparison an easier time, since that's all they can do and they're already getting clobbered; it seems unsporting not to at least give them a fighting chance.

@johnfgibson merged commit ffd7051 into JuliaLang:master on May 22, 2018
@Enet4 deleted the rust_microbenchmark branch on May 22, 2018 16:33
@Enet4 (Contributor, author) commented May 22, 2018

🎉

Just to let you know, I experimented with turning the heap allocations into static arrays in randmatstat (branch). This resulted in a minor speed-up of around 1.02x on my desktop (in raw values, from 6.10 to 5.97). Still, I agree that keeping the heap allocations is a fair decision.

@StefanKarpinski (Member) commented May 22, 2018

Good to hear—I kind of figured that would be the case. Dynamic allocation with a decently implemented allocator shouldn't be a bottleneck here. If we switched Julia to use StaticArrays, on the other hand, we might see a significant speedup since the package provides specialized fully unrolled matmul implementations for fixed matrix sizes. The speedup would be from code specialization, however, rather than avoiding dynamic memory allocation.

@Enet4 (Contributor, author) commented May 22, 2018

If we switched Julia to use StaticArrays, on the other hand, we might see a significant speedup since the package provides specialized fully unrolled matmul implementations for fixed matrix sizes. The speedup would be from code specialization, however, rather than avoiding dynamic memory allocation.

In a way, that is what the Rust community intends to achieve with the const generics proposal. :) The folks working on embedded systems are particularly concerned about this, since they usually cannot rely on heap allocations.
