Use libc::memchr instead of iter().rposition(), closes #30076 #30151
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
```rust
         buf.as_ptr() as *const libc::c_void,
         b'\n' as libc::c_int,
         buf.len() as libc::size_t)
};
```
Best to have a safe wrapper function for memchr — and then the structure of fn write
doesn't need to change at all either.
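For reference, a safe wrapper along the suggested lines might look something like this (a sketch only: it calls the C library's `memchr` directly via an `extern` declaration rather than through the `libc` crate, and the name `memchr_safe` is made up for illustration):

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    // The C library's memchr; returns a pointer to the first match or NULL.
    fn memchr(s: *const c_void, c: c_int, n: usize) -> *const c_void;
}

/// Safe wrapper: index of the first occurrence of `needle` in `haystack`,
/// or `None` if it is absent.
fn memchr_safe(needle: u8, haystack: &[u8]) -> Option<usize> {
    if haystack.is_empty() {
        return None;
    }
    let p = unsafe {
        memchr(
            haystack.as_ptr() as *const c_void,
            needle as c_int,
            haystack.len(),
        )
    };
    if p.is_null() {
        None
    } else {
        // Pointer difference gives the byte offset of the match.
        Some(p as usize - haystack.as_ptr() as usize)
    }
}
```

With something like this in place, `fn write` can keep its structure and call the safe function instead of open-coding the unsafe block.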
Thanks. As exploration I've tested exactly this method, and it increased throughput of binary data to a LineWriter-buffered stdout hugely on my machine (from 0.9 GB/s to 1.6 GB/s in my test). So the solution is certainly good on a platform with a fast memchr (glibc/Linux in my case). Libc on Windows does not provide such a nice version. So I think we should ideally have memrchr (which does not exist on Windows at all), but we can provide a reasonably fast fallback. I wanted to ask you @alexcrichton, can we add such a custom memrchr to libstd for now? The code for a memrchr fallback already exists in the `memchr` crate, written by yours truly.
@alexcrichton And a trickier question: in libcore we'd like to have both memchr and memrchr available for byte, char and substring searching, and they would dramatically speed those up. Is it possible to have an optional libc dependency? Having just the Rust versions of those functions would still be a massive performance win, but the glibc version is much faster still. This is just more context for placement. Maybe we can have the Rust versions in libcore, and have libstd use either-or depending on what's available.
Thanks for the PR @fhahn! This looks good to me, although the semantics have changed due to searching from the front of the string instead of the end (which may not flush as much as it needs to). @bluss I'd prefer not to pick up a libc dep in core, but having different versions in core/std seems fine. This has been an open libs question for quite a while now and I'd rather not export this as a public API as a result, but having it as an implementation detail is fine. Along those lines I don't mind bringing in
They are using an SSE2 version in asm for x86 platforms, so we can't really copy that.
I think we'd need to have them as a public (unstable) API in libcore, so that we can have the pure Rust implementations there (autovectorized to some decent degree, as noted), and still use them as a fallback in libstd.
There is already a `memchr` crate by @BurntSushi, which provides wrappers and fallbacks for `memchr` and `memrchr`. I also created a benchmark to compare `iter().rposition()`, libc's `memrchr`, and two pure-Rust `memrchr` implementations.

Libc's optimized version with vectorization beats all other implementations by a very wide margin, probably because the Rust code is not vectorized. I am currently looking into building a Rust version that uses the
The code of the benchmark can be found here: https://gist.github.com/fhahn/7a7b46e75bc17f1b39a4
Hm, a 100+x increase in speed seems very surprising, even for vectorisation; there may be something else going on that's making the benchmark non-representative. Especially since 32 ns/iter (22 on my machine) implies a data transfer rate of 940 GB/s (1300 on my machine)... it's not entirely out of the realms of possibility, since all the data will be in L1 cache, but it does seem rather fast.
Ah, looks like this loop:

```rust
for i in 0..n {
    match get_memrchr(s.as_bytes()) {
        Some(x) => ret += x,
        None => ()
    };
    ret += i;
}
```

is being optimised to something equivalent to the following (which also allows the loop to be folded to O(1)):

```rust
let result = get_memrchr(s.as_bytes());
for i in 0..n {
    match result {
        Some(x) => ret += x,
        None => ()
    };
    ret += i;
}
```

Adjusting the benchmark so the call can't be hoisted out of the loop gives:

```
test bench_iter          ... bench: 13,253 ns/iter (+/- 504)
test bench_libc_memrchr  ... bench:    690 ns/iter (+/- 30)
test bench_rust_memrchr1 ... bench:  3,000 ns/iter (+/- 58)
test bench_rust_memrchr2 ... bench:  3,577 ns/iter (+/- 68)
```

which is along the lines of what I would expect.
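One way to defeat this kind of hoisting is an optimizer barrier such as today's `std::hint::black_box` (at the time of this discussion, the unstable `test::black_box` played the same role). A sketch of the benchmark body with the barrier in place (`bench_body` and `n` are illustrative names, not from the gist):

```rust
use std::hint::black_box;

/// Mimics the benchmark loop: black_box hides the input from the optimizer,
/// so LLVM cannot hoist the (pure) search out of the loop and fold it to O(1).
fn bench_body(s: &str, n: usize) -> usize {
    let mut ret = 0;
    for i in 0..n {
        if let Some(x) = black_box(s.as_bytes()).iter().rposition(|&b| b == b'\n') {
            ret += x;
        }
        ret += i;
    }
    ret
}
```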
Thanks @huonw for discovering that. I was also surprised by the result. Hm |
@alexcrichton do you think it makes sense to change
The crate is his, but I contributed this part. I think we should use memrchr to still write & flush as much as possible. I should have written
@bluss Shouldn't it be possible to flush as often as before even when using
It looks like the previous behavior is to find the last `\n`, write to the inner `Write` up to that point, and flush, so it only flushes once per `LineWriter::write` call. It's true that for binary data that contains no newlines at all, it doesn't matter which direction we search, but for 64KB of text with a newline near the end it does, both for performance and for flushing after the last newline.
Yeah we'll want to maintain the same semantics to ensure that stdout of programs is properly buffered, otherwise if you wrote 2 lines to stdout at once you may only see one and be waiting for the second until later on when the thread dies and fully flushes stdout. I'd certainly believe that an SSE and/or vectorized implementation of memrchr is faster than a hand-written one, but we do also have simd in rust (unstable), so it should at least in theory be possible!
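To make those semantics concrete, here is a sketch of the write path (not the real `LineWriter` code; `rposition` stands in for whichever `memrchr` implementation is chosen, and `line_buffered_write` is an illustrative name): everything up to and including the *last* newline is written and flushed, and only the tail stays buffered.

```rust
use std::io::{self, Write};

/// Simplified LineWriter-style write: flush through the last newline,
/// buffer the rest. Error handling is reduced to `?` for brevity.
fn line_buffered_write<W: Write>(
    inner: &mut W,
    buffer: &mut Vec<u8>,
    data: &[u8],
) -> io::Result<usize> {
    match data.iter().rposition(|&b| b == b'\n') {
        Some(i) => {
            // Everything up to and including the last '\n' goes out now...
            buffer.extend_from_slice(&data[..=i]);
            inner.write_all(buffer)?;
            inner.flush()?;
            buffer.clear();
            // ...and the tail after it stays buffered for later.
            buffer.extend_from_slice(&data[i + 1..]);
        }
        // No newline at all: just buffer.
        None => buffer.extend_from_slice(data),
    }
    Ok(data.len())
}
```

Writing `b"ab\ncd"` sends `b"ab\n"` to the inner writer (and flushes it) while `b"cd"` stays buffered; a search that stopped at the *first* newline of a multi-line chunk would leave complete lines sitting unflushed.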
It sounds like we should go ahead with a memrchr implemented in rust. As said, if this is a safe function, then the change to LineWriter::write will be simple & equivalent.
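A pure-Rust fallback along the lines of the `memchr` crate's could use the classic word-at-a-time trick: XOR each usize-sized chunk with the needle repeated in every byte, and only drop to a byte scan when a word contains a matching (now zero) byte. A simplified sketch (no alignment handling, and `memrchr_fallback` is an illustrative name):

```rust
const LO: usize = usize::MAX / 255; // 0x0101...01
const HI: usize = LO * 128;         // 0x8080...80

/// True if any byte of `x` is zero (the classic "haszero" bit trick).
fn contains_zero_byte(x: usize) -> bool {
    x.wrapping_sub(LO) & !x & HI != 0
}

/// Find the last occurrence of `needle`, scanning word-at-a-time from the end.
fn memrchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    const WORD: usize = std::mem::size_of::<usize>();
    let pat = LO * needle as usize; // needle broadcast into every byte
    let mut i = haystack.len();
    while i >= WORD {
        let chunk: [u8; WORD] = haystack[i - WORD..i].try_into().unwrap();
        // XOR zeroes out exactly the bytes equal to `needle`.
        if contains_zero_byte(usize::from_ne_bytes(chunk) ^ pat) {
            // Hit in this word: confirm the exact position with a byte scan.
            return haystack[..i].iter().rposition(|&b| b == needle);
        }
        i -= WORD;
    }
    // Leftover prefix shorter than a word.
    haystack[..i].iter().rposition(|&b| b == needle)
}
```

The byte-level `rposition` only runs once a word is known to contain a match, so the common case touches one word per 8 bytes instead of one comparison per byte, which is what LLVM can then autovectorize to some degree.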
LLVM has rules that tell it that
@huonw Thanks for looking into that! @alexcrichton @bluss so going forward I should add safe `memchr`/`memrchr` wrappers to libstd?

Regarding a SIMD implementation in Rust, the
I don't think we can put the libc fallback in libcore, since we don't want to depend on it. I want/need the pure Rust mem(r)chr in libcore, but maybe that can be a separate PR. Don't worry about obsoleting

Since @alexcrichton prefers this only as a private implementation detail, we can add it to just libstd for now? I just wanted to raise the flag that it will be needed in libcore as soon as we get around to it. ASCII byte searches and single-byte substring searches could be massively improved.
I forget how much byte searching we do in libcore, but if we do a lot in both std and core then it's fine to have it as a public-but-unstable detail in libcore for now; @bluss is right that libcore should not depend on libc (although libstd is allowed to).
I've created a new PR, which adds
This PR adds `memchr` and `memrchr` based on @BurntSushi's rust-memchr crate to libstd (as discussed in #30151). I've updated some places in libstd to use memchr/memrchr, but I am not sure if there are other places where they could be used as well. ref #30076
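One example of where such a helper slots in is a `read_until`-style loop over the `BufRead` API. This is an illustration built on `fill_buf`/`consume`, not the actual libstd code, and the naive `memchr` here stands in for the optimized version:

```rust
use std::io::{self, BufRead};

/// Naive stand-in for the optimized memchr.
fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// read_until-style loop: pull buffered data, search it for the delimiter,
/// and stop once it's found (or the reader is exhausted).
fn read_until<R: BufRead>(r: &mut R, delim: u8, buf: &mut Vec<u8>) -> io::Result<usize> {
    let mut read = 0;
    loop {
        let (done, used) = {
            let available = r.fill_buf()?;
            match memchr(delim, available) {
                Some(i) => {
                    // Take everything up to and including the delimiter.
                    buf.extend_from_slice(&available[..=i]);
                    (true, i + 1)
                }
                None => {
                    buf.extend_from_slice(available);
                    (false, available.len())
                }
            }
        };
        r.consume(used);
        read += used;
        if done || used == 0 {
            return Ok(read);
        }
    }
}
```

Because the search runs over whatever `fill_buf` returns, speeding up `memchr` directly speeds up every delimiter-scanning read.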
This is an initial PR for #30076 that uses `libc::memchr` instead of `iter().rposition()`. In the general case `memchr` should be much faster. `rposition` is only faster when an `\n` is near the end of a long string (because `memchr` searches from left to right). On Linux there is a `memrchr` which searches right to left, but it seems like it's not available on other platforms. I think it would also make sense to add some benchmarks, but I am not sure what the right place would be (and how to run them).