
Use libc::memchr instead of iter().rposition(), closes #30076 #30151

Closed

Conversation

@fhahn (Contributor) commented Dec 1, 2015

This is an initial PR for #30076 that uses libc::memchr instead of iter().rposition(). In the general case memchr should be much faster; rposition is only faster when a \n is near the end of a long string (because memchr searches from left to right).

On Linux there is memrchr, which searches from right to left, but it seems it is not available on other platforms. I think it would also make sense to add some benchmarks, but I am not sure what the right place for them would be (or how to run them).

@rust-highfive (Collaborator)

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

    unsafe {
        libc::memchr(buf.as_ptr() as *const libc::c_void,
                     b'\n' as libc::c_int,
                     buf.len() as libc::size_t)
    };

A reviewer (Member) commented on this hunk:

Best to have a safe wrapper function for memchr — and then the structure of fn write doesn't need to change at all either.
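
For illustration, a minimal sketch of such a safe wrapper, assuming the libc crate is in scope; the name and shape are illustrative, not the code that eventually landed:

    extern crate libc;

    // Safe wrapper: index of the first occurrence of `needle` in `haystack`,
    // or None if it does not occur. (Illustrative sketch only.)
    fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
        let p = unsafe {
            libc::memchr(haystack.as_ptr() as *const libc::c_void,
                         needle as libc::c_int,
                         haystack.len() as libc::size_t)
        };
        if p.is_null() {
            None
        } else {
            Some(p as usize - haystack.as_ptr() as usize)
        }
    }

With a helper like this, the body of fn write can keep its current structure and simply call memchr(b'\n', buf).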

@bluss (Member) commented Dec 1, 2015

Thanks. As an exploration I've tested exactly this approach, and it hugely increased the throughput of binary data to a LineWriter-buffered stdout on my machine (from 0.9 GB/s to 1.6 GB/s in my test). With it, my plain Rust reimplementation of cat behaved the same as the original, even when I was using the LineWriter-buffered stdout.

So the solution is certainly good on a platform with a fast memchr (glibc/linux in my case).

Libc on Windows does not provide such a nice version. So ideally I think we should use memrchr (which does not exist on Windows at all), but we can provide a reasonably fast fallback.

I wanted to ask you @alexcrichton: can we add such a custom memrchr to libstd for now? The code for a memrchr fallback already exists in the memchr crate (that part written by yours truly).
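
For reference, the simplest possible pure-Rust fallback is just a reverse scan; the sketch below shows the required behavior, while the crate's actual fallback is a faster word-at-a-time scan:

    // Naive right-to-left scan: index of the last occurrence of `needle`,
    // or None. Correct, but slower than an optimized fallback.
    fn memrchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
        haystack.iter().rposition(|&b| b == needle)
    }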

@bluss (Member) commented Dec 1, 2015

@alexcrichton And a trickier question: in libcore we'd like to have both memchr and memrchr available for byte, char and substring searching, where they would dramatically speed things up. Is it possible to have an optional libc dependency? Having just the Rust versions of those functions would still be a massive performance win, but the glibc version is still much faster.

This is just more context for placement ^

Maybe we can have the rust versions in libcore, and have libstd use either-or depending on what's available.

@alexcrichton (Member)

Thanks for the PR @fhahn! This looks good to me, although the semantics have changed due to searching from the front of the string instead of the end (which may not flush as much as it needs to).

@bluss I'd prefer to not pick up a libc dep in core, but having different versions in core/std seems fine. This has been an open libs question for quite a while now, so I'd rather not export this as a public API, but having it as an implementation detail is fine.

Along those lines I don't mind bringing in memrchr stuff to get it to work. Ideally we'd just copy whatever glibc does as I suspect that LLVM is at least capable of generating such code.

@bluss (Member) commented Dec 2, 2015

They are using an SSE2 version in asm for x86 platforms, so we can't really copy that.

@bluss (Member) commented Dec 2, 2015

I think we'd need to have them as a public (unstable) API in libcore, so that we can have the pure Rust implementations there (autovectorized to some decent degree, as noted) and still use them as a fallback in libstd.

@fhahn (Contributor, Author) commented Dec 2, 2015

There is already a crate by @BurntSushi which provides wrappers and fallbacks for memchr and memrchr, and it looks pretty good.

I also created a benchmark to compare the following things:

  • iter().rposition()

  • libc's memrchr

  • @BurntSushi's Rust implementation of memrchr (bench_rust_memrchr1)

  • my port of libc's C memrchr implementation (bench_rust_memrchr2)

The results on my machine:

    running 4 tests
    test bench_iter          ... bench:      20,100 ns/iter (+/- 1,046)
    test bench_libc_memrchr  ... bench:          32 ns/iter (+/- 2)
    test bench_rust_memrchr1 ... bench:       3,895 ns/iter (+/- 200)
    test bench_rust_memrchr2 ... bench:       5,832 ns/iter (+/- 404)

Libc's vectorized, optimized version beats all other implementations by a very wide margin, probably because the Rust code is not vectorized. I am currently looking into using the simd crate to build a vectorized Rust version.

@fhahn (Contributor, Author) commented Dec 2, 2015

The code of the benchmark can be found here: https://gist.github.com/fhahn/7a7b46e75bc17f1b39a4

@huonw (Member) commented Dec 2, 2015

Hm, a 100+x increase in speed seems very surprising, even for vectorisation; there may be something else going on that's making the benchmark non-representative. Especially since 32 ns/iter (22 on my machine) implies a data transfer rate of 940 GB/s (1300 on my machine)... it's not entirely out of the realms of possibility, since all the data will be in L1 cache, but it does seem rather fast.

@huonw (Member) commented Dec 2, 2015

Ah, looks like this loop:

            for i in 0..n {
                match get_memrchr(s.as_bytes()) {
                    Some(x) => ret += x,
                    None => ()
                };
                ret += i;
            }

is being optimised to something equivalent to the following (which also allows the loop to be folded to be O(1)):

            let result = get_memrchr(s.as_bytes());
            for i in 0..n {
                match result {
                    Some(x) => ret += x,
                    None => ()
                };
                ret += i;
            }

Turning the s.as_bytes() call into test::black_box(&s).as_bytes() (and all the other s.as_bytes() calls too, just in case) gives the following result on my computer:

test bench_iter          ... bench:      13,253 ns/iter (+/- 504)
test bench_libc_memrchr  ... bench:         690 ns/iter (+/- 30)
test bench_rust_memrchr1 ... bench:       3,000 ns/iter (+/- 58)
test bench_rust_memrchr2 ... bench:       3,577 ns/iter (+/- 68)

which is along the lines of what I would expect.
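
In other words, the fix is to route the input through black_box so the (pure) search call can no longer be hoisted out of the loop; using the names from the quoted loop, the benchmark body becomes:

    for i in 0..n {
        match get_memrchr(test::black_box(&s).as_bytes()) {
            Some(x) => ret += x,
            None => ()
        };
        ret += i;
    }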

@fhahn (Contributor, Author) commented Dec 2, 2015

Thanks @huonw for discovering that. I was also surprised by the result.

Hm, as_bytes() is used in all benchmarks to pass the buffer; strange that rustc only folded the loop for the libc version.

@fhahn (Contributor, Author) commented Dec 2, 2015

@alexcrichton do you think it makes sense to change LineWriter::write to use memchr, which should be available on all platforms? Or should we go with shipping our own memrchr implementations for platforms that do not have it?

@bluss (Member) commented Dec 2, 2015

> @BurntSushi's Rust implementation of memrchr (bench_rust_memrchr1)

The crate is his, but I contributed this part.

I think we should use memrchr so that we still write & flush as much as possible. I should have written memrchr in the original issue, had I read the code more carefully (it uses rposition).

@fhahn (Contributor, Author) commented Dec 2, 2015

@bluss Shouldn't it be possible to flush as often as before even when using memchr instead of memrchr? It should flush after each \n, right? But I have to admit I am not sure how LineWriter::write does that. So far it seems to write the buffer up to the last \n, then flush and write the rest. I am probably missing something, but shouldn't it call LineWriter::write again for the first part of the buffer, because that could still contain \ns?

@bluss (Member) commented Dec 2, 2015

It looks like the previous behavior is to find the last \n, write to the inner Write until that point, and flush, so it only flushes once per LineWriter::write call.

It's true that for binary data that contains no newlines at all it doesn't matter in which direction we search, but for 64KB of text with a newline near the end it does, both for performance and for flushing after the last newline.
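
To make the intended behavior concrete, here is a rough sketch of that write path using a safe memrchr helper (such as the fallback sketched earlier); it mirrors the description above and is not the exact libstd code:

    use std::io::{self, BufWriter, Write};

    struct LineWriter<W: Write> {
        inner: BufWriter<W>,
    }

    impl<W: Write> LineWriter<W> {
        fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
            match memrchr_fallback(b'\n', buf) {
                // Newline found: write up to and including the last `\n`,
                // flush once, then pass along the remainder.
                Some(i) => {
                    let n = try!(self.inner.write(&buf[..i + 1]));
                    if n != i + 1 { return Ok(n); }
                    try!(self.inner.flush());
                    self.inner.write(&buf[i + 1..]).map(|m| n + m)
                }
                // No newline at all: just hand the data to the buffer.
                None => self.inner.write(buf),
            }
        }
    }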

@alexcrichton (Member)

Yeah, we'll want to maintain the same semantics to ensure that a program's stdout is properly buffered; otherwise, if you wrote two lines to stdout at once, you might only see one and have to wait for the second until later, when the thread dies and fully flushes stdout.

I'd certainly believe that an SSE and/or vectorized implementation of memrchr is faster than a hand-written one, but we do also have simd in rust (unstable), so it should at least in theory be possible!

@bluss (Member) commented Dec 2, 2015

It sounds like we should go ahead with a memrchr implemented in rust. As said, if this is a safe function, then the change to LineWriter::write will be simple & equivalent.

@huonw (Member) commented Dec 2, 2015

> Hm, as_bytes() is used in all benchmarks to pass the buffer; strange that rustc only folded the loop for the libc version.

LLVM has rules that tell it that memrchr is pure (i.e. calling it once is the same as calling it multiple times), but it is apparently not inferring this for the Rust versions.

@fhahn (Contributor, Author) commented Dec 11, 2015

@huonw Thanks for looking into that!

@alexcrichton @bluss so going forward I should add safe memchr and memrchr functions to core, which reuse the C implementations where possible, similar to what @BurntSushi's rust-memchr does at the moment. This would mean that the crate would become obsolete. It seems quite popular (163,233 downloads on crates.io right now) and I want to make sure not to step on anybody's toes.

Regarding a SIMD implementation in Rust: the simd crate shipped with rustc is deprecated and the external crate should be used, so a hand-written SIMD version could not make it into core, right?

@bluss (Member) commented Dec 11, 2015

I don't think we can put the libc fallback in libcore, since we don't want to depend on it. I want/need the pure Rust mem(r)chr in libcore, but maybe that can be a separate PR. Don't worry about obsoleting rust-memchr if that should happen; it's also far in the future (it needs to be stabilized in libstd before that, and it hasn't even been proposed here).

Since @alexcrichton prefers this only as a private implementation detail, can we add it to just libstd for now? I just wanted to raise the flag that it will be needed in libcore as soon as we get around to it: ASCII byte searches and single-byte substring searches could be massively improved.

@alexcrichton (Member)

I forget how much byte searching we do in libcore, but if we do it a lot in both std and core then it's fine to have it as a public-but-unstable detail in libcore for now. @bluss is right, though, that libcore should not depend on libc (although libstd is allowed to).

@fhahn mentioned this pull request Dec 14, 2015

@fhahn (Contributor, Author) commented Dec 14, 2015

I've created a new PR, which adds memchr and memrchr based on rust-memchr.

@fhahn closed this Dec 14, 2015
bors added a commit that referenced this pull request Dec 19, 2015
This PR adds `memchr` and `memrchr` based on @BurntSushi's rust-memchr crate to libstd (as discussed in #30151).

I've updated some places in libstd to use memchr/memrchr, but I am not sure if there are other places where it could be used as well.

ref #30076