Add Rust microbenchmark #1
```rust
        let k = k as f64;
        1. / (k * k)
    })
    .sum();
```
This code doesn't seem reasonable to me. Wouldn't it be faster and simpler to just write a loop to do the summation in Rust (since it is a compiled, statically typed language, ordinary loops should be fast), rather than allocating a temporary array of summands and then calling `.sum()`?
There is no allocation here: Rust's iterator abstractions are zero-cost. This is the idiomatic way to write this code in Rust.
The Iterator API is a very idiomatic part of Rust and very often incurs no performance penalty. It is usually not worth rewriting this into plain loops (even index-range loops such as `for i in 0..10` to traverse collections are discouraged). Also note that iterator adaptors such as `map` and `flat_map` are lazy and do not allocate any extra memory.
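To make the point concrete, here is a sketch (not the PR's exact code) of that summation written both as the iterator chain under review and as the explicit loop the reviewer suggested. The iterator version allocates nothing: `map` is lazy and `sum` consumes values one at a time.

```rust
// Iterator-chain version: no temporary array of summands is created.
fn sum_iter(n: u64) -> f64 {
    (1..=n)
        .map(|k| {
            let k = k as f64;
            1. / (k * k)
        })
        .sum()
}

// Explicit-loop version: same work, written procedurally.
fn sum_loop(n: u64) -> f64 {
    let mut total = 0.0;
    for k in 1..=n {
        let k = k as f64;
        total += 1. / (k * k);
    }
    total
}

fn main() {
    // Both compute the same partial Basel sum (approaching pi^2/6).
    assert!((sum_iter(10_000) - sum_loop(10_000)).abs() < 1e-12);
    println!("{}", sum_iter(10_000));
}
```

With optimizations enabled, both versions typically compile to equivalent machine code.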
Thanks for the clarification; I didn't realize that `map` was lazy in Rust.
rust/src/main.rs
```rust
    })
    .take(80)
    .take_while(|z| z.norm() <= 2.0)
    .count() as u32
```
Is this kind of iterator-composition code really the fastest way to do it in Rust? Again, wouldn't you just write a loop?
I can't speak to this code in particular, or whether the style is ideal for performance, but I have noticed that this general style is very common in Rust.
Functional-style iteration is the idiomatic way to do this. In some cases it is even faster than a procedural loop, because there are no bounds checks.
As already mentioned by @Boiethios, this is pretty idiomatic and performant. Previously, I had a version of this function without an extensive use of the Iterator API (here), and the subsequent changes to what we have now actually improved performance a tiny bit.
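For readers unfamiliar with the pattern, here is a hedged, stdlib-only sketch of the same escape-time composition (the PR's code uses the `num` crate's `Complex` and `.norm()`; this stand-in hand-rolls the complex arithmetic, and the iteration cap of 80 matches the snippet):

```rust
// Escape-time iteration for z = z^2 + c, counting how many of the first
// 80 iterates stay within |z| <= 2. Complex numbers are modeled as
// (re, im) tuples to avoid external crates.
fn escape_count(cr: f64, ci: f64) -> u32 {
    std::iter::successors(Some((0.0_f64, 0.0_f64)), |&(zr, zi)| {
        // z = z^2 + c
        Some((zr * zr - zi * zi + cr, 2.0 * zr * zi + ci))
    })
    .take(80)
    .take_while(|&(zr, zi)| (zr * zr + zi * zi).sqrt() <= 2.0)
    .count() as u32
}

fn main() {
    // c = 0 never escapes, so all 80 iterates are counted.
    assert_eq!(escape_count(0.0, 0.0), 80);
    // c = 2 escapes after a few steps.
    assert!(escape_count(2.0, 0.0) < 80);
}
```

Since `take` and `take_while` are lazy adaptors, only `count` drives the iteration, and no intermediate collection is built.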
```rust
}

fn fib(n: i32) -> i32 {
    let n = black_box(n); // prevent over-optimization
```
What's going on here?
That is to prevent the compiler from performing loop-invariant code motion: it would recognize that the pure function is called with the same argument on every iteration, hoist the call out of the loop (effectively memoizing the output), and distort the whole benchmark. `black_box` tells LLVM to make no assumptions about the "nature" of the value, thus forcing it to always recalculate `fib`.
`#[inline(never)]` might work as well.
@vks I'm afraid not, just tried that too. What makes the compiler perform invariant code motion is the perception that the pure function is being called with the exact same argument value across the multiple iterations, and so the compiler can just move it out of the loop. This is orthogonal to whether the function is inlined.
Volatile argument reads would force `fibarg` to be read inside the loop without making additional assumptions about its value, thus preventing invariant code motion.
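As a minimal sketch of the trick being discussed: at the time of this PR, `black_box` was the nightly-only `test::black_box`; stable Rust (1.66+) now exposes `std::hint::black_box`, which this sketch uses. The benchmark loop here is my own illustration, not the PR's code.

```rust
use std::hint::black_box; // stable stand-in for the nightly test::black_box

fn fib(n: i32) -> i32 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    let fibarg = 20;
    let mut last = 0;
    for _ in 0..100 {
        // Without black_box, LLVM can see that fib is pure and called with
        // the same argument every iteration, hoist the call out of the loop,
        // and leave the benchmark measuring an empty loop.
        last = fib(black_box(fibarg));
    }
    assert_eq!(last, 6765); // fib(20)
}
```

The optimizer treats the `black_box` argument as opaque, so the call cannot be hoisted or constant-folded away.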
```rust
    let a = nrand((n, n), &mut rng);
    let b = nrand((n, n), &mut rng);
    let c = nrand((n, n), &mut rng);
    let d = nrand((n, n), &mut rng);
```
Wouldn't performance-conscious code allocate the arrays once and then call `fill_rand` in the loop?
Granted, this function can be slightly boosted. Moving the array allocations to the beginning improves the benchmark score (roughly from 6.15 to 5.85 in raw time measurements). If I recall correctly, I did this because the Julia implementation also declares these arrays inside the loop.
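A stdlib-only sketch of the allocate-once pattern the reviewer suggested; `fill_rand`, the flat buffers, and the xorshift generator here are stand-ins of my own (the actual code uses `ndarray` matrices and the `rand` crate):

```rust
// Toy xorshift64 RNG standing in for the rand crate's generator.
struct XorShift64(u64);

impl XorShift64 {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the top 53 bits to produce a uniform value in [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Refill an existing buffer in place instead of allocating a new one.
fn fill_rand(buf: &mut [f64], rng: &mut XorShift64) {
    for x in buf.iter_mut() {
        *x = rng.next_f64();
    }
}

fn main() {
    let n = 5;
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    // Allocate the four n*n buffers once, outside the hot loop...
    let mut mats: Vec<Vec<f64>> = (0..4).map(|_| vec![0.0; n * n]).collect();
    for _trial in 0..1000 {
        // ...and only refill them on each iteration.
        for m in mats.iter_mut() {
            fill_rand(m, &mut rng);
        }
    }
    assert!(mats.iter().all(|m| m.iter().all(|&x| (0.0..1.0).contains(&x))));
}
```

This moves the allocation cost out of the measured loop, matching the roughly 6.15 to 5.85 improvement reported above.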
Okay.
This is kind of a silly benchmark anyway, because it mostly measures the speed of the random-number generation and matmul libraries.
Indeed! The same can be said about `parse_integers`, which measures the RNG and number printing alongside the actual number parsing.
A small but important update: the
How about
As previously mentioned here, the project relies on nightly-only functionality. The only way to make this work with a stable toolchain is to replace it.
Would the other microbenchmarks also be updated in the process? Rust has a regular 6-week release cycle, which means that pointing this project to the
I think we should go ahead and merge this soon without needing to debate all the fine details of how we get and update the Rust compiler; otherwise I fear this may languish and never get merged. Of course, we can open a separate issue about how the Rust compiler should be acquired and run.
Sounds good to me. This and #2 look good for merging. I'll use the Rust nightly for now and aim to include Rust data in the julialang.org table and plot when julia-0.7.0 drops.
@johnfgibson, I've given you write access to this repo, so feel free to merge PRs as they seem ready. Let me know if there are any permission issues and we'll get them sorted out.
The Rust code runs fine and produces sensible results. I don't understand Rust well enough to vouch for the code's correctness and consistency with other languages. I see @stevengj reviewed to some extent above. I'll plan to merge in a day or two if I don't hear any objections. Rust results, with julia-0.7.0-DEV.5096 (7-day-old master):
I haven't run the benchmarks with every language available, but they seem to be roughly in line with my local run:
Feel free to ask about any other details of the code. I noticed that the readme also contains a list of languages, so maybe I could update that as well? The PR might also be updated in time for the decisions made in #10.
@johnfgibson, just to be sure: will the issue with the Mathematica label, seen in your comment above, be present in the final rendered image?
@waldyrious: No, that Mathematica label problem arose in an Inkscape SVG -> PNG conversion needed to post the image to this thread. The SVG that'll appear on julialang.org is just fine.
Is there a reason that you use vectors instead of arrays on the stack? All the
@Vurich, if I'm understanding correctly (it's quite possible that I'm not, since I'm not a Rust programmer), that would be comparable to using StaticArrays in Julia. That's a legitimate optimization, but in Julia it goes beyond the basics. In Rust, how advanced of a change would that be?
@StefanKarpinski I would actually consider using
I went for dynamic allocations because the C version is doing the same thing. At first glance, the Julia benchmark isn't using StaticArrays either, right? I agree that avoiding heap allocations is a plus, but I would have to ensure that this is a fair optimization. In practice, changing this should be trivial: fixed-size arrays deref to a slice just the same, and ndarray views can be created from arbitrary slices. However, since Rust doesn't have const generics yet, I would have to either hardcode the outputs of
Hmm, I'd assumed the main point of
Unless the benchmark code is the algorithm and the array implementation is just infrastructure...
The point of the way these were written is that in library code you don't generally know that you're allocating a specific array size: you have to make your library code generic across different sizes and do dynamic allocations. Sure, in this specific case you can do static allocations, but you can also just replace the entire benchmark computation with a constant, so drawing some lines about what is or is not "allowed" is in the nature of benchmarking. The line is currently drawn at "everyone does dynamic memory allocation of non-fixed-sized arrays". To a significant extent, this is to give the Pythons and Rs in the comparison an easier time, since that's all they can do and they're already getting clobbered; it seems unsporting not to at least give them a fighting chance.
🎉 Just to let you know that I experimented with turning heap allocations into static arrays in randmatstat (branch). This resulted in a minor speed-up of around 1.02x on my desktop (in raw values, from 6.10 to 5.97). Still, I agree that keeping the heap allocations is a fair decision to make.
Good to hear; I kind of figured that would be the case. Dynamic allocation with a decently implemented allocator shouldn't be a bottleneck here. If we switched Julia to use StaticArrays, on the other hand, we might see a significant speedup, since the package provides specialized, fully unrolled matmul implementations for fixed matrix sizes. The speedup would come from code specialization, however, rather than from avoiding dynamic memory allocation.
In a way, that is what the Rust community intends to achieve with the const generics proposal. :) The folks working on embedded systems are particularly concerned about this, since they usually cannot rely on heap allocations.
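A small sketch of the point made above about fixed-size arrays dereffing to slices: code written against `&[f64]` works unchanged whether the storage lives on the stack or in a heap-allocated `Vec` (the function and values here are my own illustration):

```rust
// A function written against a slice is agnostic to the storage backing it.
fn total(data: &[f64]) -> f64 {
    data.iter().sum()
}

fn main() {
    let on_stack: [f64; 4] = [1.0, 2.0, 3.0, 4.0]; // fixed-size stack array
    let on_heap: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0]; // heap allocation

    // Both coerce to &[f64] at the call site.
    assert_eq!(total(&on_stack), 10.0);
    assert_eq!(total(&on_heap), 10.0);
}
```

This is why switching the benchmark between heap and stack storage is mostly a matter of where the buffers are declared, not of rewriting the algorithms.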
This PR adds a Rust implementation of the benchmark. This contribution was originally in JuliaLang/julia#22433, but the merge was postponed until the microbenchmarks were migrated here. It should be ready to go now. 👍
The necessary preparation steps for the benchmark machine:

```shell
rustup toolchain add nightly-2018-04-16
```

Then:

```shell
make benchmark/rust.csv
```