ARROW-10588: [Rust] Safe and parallel bit operations for Arrow #8664
Conversation
Force-pushed from ea6fe5b to 6131c02
@vertexclique did you work on this on top of #8663 from @jhorstmann (or at least both PRs remove the popcnt table)?
This is using the safe bitvec interface to manage bits. I worked on top of my existing PR; since that one didn't change all code to use safe operations, this one does. I squashed all my work into a single commit and closed PR #8598 with this comment: #8598 (comment). Bitvec also issues the popcnt instruction like the other PR that just got merged, so I didn't put effort into fixing those; the dependency handles them safely.
Force-pushed from b5d3219 to 4948aa0
alamb
left a comment
I plan to review this PR carefully tomorrow morning
alamb
left a comment
All in all I like this PR. Nice work @vertexclique -- it removes a bunch of unsafe calls and I think the structure is an improvement.
I carefully went through this PR and I think it would be OK to merge as is and address comments as a follow-on. However, given that it needs a rebase anyway, I would suggest looking at some of the comment suggestions, as they may help future readers / reviewers. Removing `bit_utils.rs` would be good too.
I also ran this code under valgrind and it did not find any errors 👍
As feedback, this PR might have been less intimidating (and thus I would have been more likely to be able to review it earlier and more quickly) if it had been broken into smaller pieces. Some potential ways to split it up could have been:
- Rename of `bit_utils.rs` to `utils.rs`
- Rename of `Buffer` functions like `count_set_bits` --> `count_ones`
- Introduction of bit_ops / rewrite to use `BitBufferSlice`
Super cool. Great work @vertexclique! 💯 I am all in on removing that. Just curious, do you know why it has a 120% hit in performance on a filtering op?
Co-authored-by: Andrew Lamb <[email protected]>
Weird; on my machine, when I pushed the initial implementation of this PR, I got the numbers above. Seems like it has regressed for me too.
So, any idea how to proceed with this? Based on that, I will close all related PRs. Though it seems like #8688 can use these changes to build slice alignment on top.
If this is not stalling any other PRs, you can leave it open. I've started looking into what's causing the regression, but I'll only finish over the weekend (I'm mostly working on the Parquet writer in my evenings). I'm also seeing big regressions of 400%+.
@nevi-me @alamb @jorgecarleitao Before (current master): After: Since it is a time-consuming task, I am not going to perform a full rewrite until we agree that this performance improvement is enough. Looking forward to receiving your feedback.
Thanks @vertexclique, that looks great. I'm happy with the approach.
OK, so the crucial operations are improved. I've updated the benchmarks. Since benchmarks with 512 elements fit in most caches, they produce unstable results. The kernels can only get better after this PR is merged and they are rewritten with parallel iterators. Feel free to benchmark this PR.
@nevi-me @jorgecarleitao @alamb Because bit-ops performance has improved, these benchmarks have improved significantly. The rest of the improvements are in the PR description:
nevi-me
left a comment
LGTM.
@alamb @andygrove @jorgecarleitao @vertexclique
I'd like to propose that we merge this (pending rebase and CI), and address any extra follow-up as separate PRs.
From my side:
- Addition of `rayon`: I think this might cause issues with `wasm32` (IIRC the threading might be an issue, but my knowledge might be dated). We've had some small interest in supporting `wasm32`, and there might be a few users of Arrow with wasm already; but as we don't yet have the target in CI, we can't reliably tell if something could affect that target, and thus it needs gating. We might have to put `rayon` behind a feature flag (see the sketch after this list), but we can worry about this later (or whoever wants to use `wasm32` can contribute this work?).
- Introducing thread parallelism at the compute-kernel level: my previous rudimentary experiments over a year ago weren't showing much improvement, but our codebase has changed a lot, so it's great that we're seeing the speed-ups. It'd be great if we can confirm whether the speedups in the microbenchmarks are also realised by `datafusion`. This is not a concern per se, just noting it.
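A minimal sketch of gating `rayon` behind a Cargo feature, as mentioned above; the feature name (`parallel`) and version are assumptions for illustration, not this PR's actual manifest:

```toml
[features]
default = []
# Hypothetical opt-in feature: pulls in rayon for parallel kernels.
parallel = ["rayon"]

[dependencies]
rayon = { version = "1.5", optional = true }
```

Code paths using rayon would then sit behind `#[cfg(feature = "parallel")]`, so a `wasm32` build can simply leave the feature off.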
Thanks for the great work @vertexclique!
I'm currently running the benchmarks; I'll post an update when they're done.
```diff
 [features]
-default = []
+default = ["simd"]
```
we probably shouldn't change the default to include `simd`, as we'd like the default features to allow users to compile on stable.
Oh, I forgot to set it back.
```rust
        accumulator + *value
    });
let total = data
    .par_iter()
```
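For context, a self-contained sketch of the rayon pattern this excerpt hints at; `par_sum` is a hypothetical helper, not the PR's actual kernel:

```rust
use rayon::prelude::*;

/// Hypothetical sketch: sum a slice in parallel with rayon.
/// Each worker folds its chunk locally; the partial sums are then reduced.
fn par_sum(data: &[i64]) -> i64 {
    data.par_iter()
        .fold(|| 0i64, |accumulator, value| accumulator + *value)
        .sum()
}

fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    assert_eq!(par_sum(&data), 499_999_500_000);
}
```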
It's interesting that this is yielding better results. I would have thought that rayon, being introduced at this level, would incur enough overhead to slow the kernels down. I've previously applied parallelism at an Array level instead.
This is what MKL suggests: Array-level, non-primitive parallelism has shared-pointer overhead; that's the reason. I've had a bad time with `Arc` overhead in my projects before.
rayon is pretty smart in this respect, as its execution model starts the first part of the iterator immediately.
I am thankful that @vertexclique introduced this, as I also believe that we greatly benefit from multi-threading at this level.
What I am a bit unsure about is whether there is any issue with nested spawning: we may want to fork in other places that are nested within this op. But I am confident that someone knows the answer (👀 to @vertexclique and @alamb).
Rayon initializes its global pool with 1.5 × the number of logical cores. Since those are all spawned-but-not-joined threads, this will work without issue for direct users of Arrow. Closures spawned onto the pool, though, execute fork-join style. From the process-forking point of view, using this library as is won't cause problems. In the nested-operations case, tasks are queued and spawned onto any free slot; all operations go through the global producer-consumer hub inside rayon (take a look at the bit-ops code in this PR). If at some point a contributor says "I want to configure the thread-level parallelism," we can expose the pool config via a method; but since the pool configures itself for the machine it's running on, we can skip that for now.
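A minimal sketch of the nested case described above, assuming nothing beyond rayon's public API: the inner `par_iter` reuses the global pool's work-stealing queues rather than spawning extra threads:

```rust
use rayon::prelude::*;

fn main() {
    // Outer and inner parallel iterators both run on rayon's global pool;
    // nested tasks are queued and stolen by idle workers, not new threads.
    let total: i64 = (0..8i64)
        .into_par_iter()
        .map(|i| (0..1_000i64).into_par_iter().map(|j| i * j).sum::<i64>())
        .sum();
    assert_eq!(total, 28 * 499_500); // sum(0..8) * sum(0..1000)
}
```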
```rust
where
    T::Native: Add<Output = T::Native>,
    T: ArrowNumericType,
    // T::Native: Add<Output = T::Native> + Sum + Sum<T::Simd>,
```
nit: should the comment be removed?
Yeah, eagle eyes.
Really impressive improvement, @vertexclique. @nevi-me, can I have 12h to review it?
Yup, I meant more that if someone else picks up things they'd like addressed, we could open JIRAs for them instead of trying to address them as part of this PR (unless they're clear blockers). In any case, we're still weeks away, or at least 2 months before the next release, so we still have time even for potential blockers.
I went through this without going line by line and I am already convinced:
- the CI passes
- new code contains good coverage
- the amount of unsafe removed ❤️
- the cleanness of the APIs
Impressive work, @vertexclique . Really well done. Thank you so much for this.
```rust
///
/// Get bit value at the given index in this bit view
#[inline]
pub fn get_bit(&self, index: usize) -> bool {
```
can't this go out of bounds? What do you think about making this unsafe and exposing a safe version with a check?
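A minimal sketch of the suggested split; the `BitView` struct and its `data`/`len` fields are hypothetical stand-ins, not the PR's actual types:

```rust
pub struct BitView {
    data: Vec<u8>, // hypothetical backing storage
    len: usize,    // number of valid bits
}

impl BitView {
    /// Safe accessor: checks bounds, then delegates to the unchecked version.
    #[inline]
    pub fn get_bit(&self, index: usize) -> bool {
        assert!(index < self.len, "bit index {} out of bounds", index);
        // SAFETY: the assert above guarantees `index` is in range.
        unsafe { self.get_bit_unchecked(index) }
    }

    /// # Safety
    /// The caller must guarantee that `index < self.len`.
    #[inline]
    pub unsafe fn get_bit_unchecked(&self, index: usize) -> bool {
        (self.data[index / 8] >> (index % 8)) & 1 != 0
    }
}
```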
Conceptually this is not a small change. Personally, I think parallelizing at the datafusion level and keeping kernels single-threaded is the better model. The benchmark is now also testing a much larger array (2^20 elements) than what is usually used as a chunk size, so in reality the speedup due to parallelism would be much smaller. I'm totally fine with a small slowdown if that leads to cleaner and safer code.
Personally, I don't think so; I am experimenting and working on this based on evidence. It is not a better model; even Intel says that. So there is still no conflict with further parallelizing in DataFusion (which is already done via the reactor). We are using Arrow at the company, and yet we still haven't parallelized chunks, which is also wrong in many directions.
There are no predetermined chunk sizes, and benchmarks shouldn't work on data that fits in cache at all; that's against the rules of benchmarking. Otherwise you don't see the actual processing time at all. Moreover, if this data format can't process large data on demand, there is no point in having this data format. (Users will think that way even if I don't, which is also the most valid view out there.) Also, please don't alter my comments/words or use them against the PR. Please also stop undermining others' work and be honest about what has been done. I had already told you to do it this way, but you preferred sharing raw pointers in an unsafe context, which won't ever be parallelized without a full rewrite. This is that full rewrite. Finally, it is entirely unethical and dishonest to open PRs based on what I told/taught you. I am not your rival; I don't see myself as a rival to anyone in any form, because there is no rivalry. This is an open-source environment. I don't see goodwill in your comments either. Also, some of your comments give false information (on which I stopped giving feedback). I would prefer productive comments from you instead of counterproductive ones. Even though I've said this plenty of times, it seems that's not going to happen. So, please don't comment on or review my PRs. Thanks.
jorgecarleitao
left a comment
Now that we have the remaining benchmarks, I think that we need to discuss this further:
All large-scale operations (take, filter, sort) show a 100-500% performance degradation. These are the most important kernels and are where we verify that the operations are in place.
I admit I was a bit blindsided by the statement
Up to 95% improvements on many kernels.
which makes no reference to the degradation on other (often more important) benchmarks.
I have already mentioned that here, I think: #8664 (comment)
I agree with @jhorstmann's opinion here -- I think we should keep arrow single-threaded (parallelizing invocations across record batches can be done at a higher level). One challenge of adding parallelism like rayon at a level as low as the kernels is that higher-level libraries or applications lose control of the resources (for example, if some app wanted a particular background operation to use only a single core on a multi-core machine, they couldn't do that easily with the code in this PR anymore).
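For reference, a sketch of the workaround an application would need, assuming only rayon's public API: the kernel call is run inside a dedicated one-thread pool via `ThreadPool::install`, so parallel iterators inside it use that pool instead of the global one:

```rust
use rayon::prelude::*;

fn main() {
    // Hypothetical stand-in for a rayon-parallel Arrow kernel.
    let kernel = |data: &[i64]| -> i64 { data.par_iter().sum() };

    let data: Vec<i64> = (0..1_000_000).collect();

    // A one-thread pool: parallel iterators invoked inside `install`
    // run on this pool instead of the global one.
    let single = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .expect("failed to build pool");

    let total = single.install(|| kernel(&data));
    assert_eq!(total, 499_999_500_000);
}
```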
alamb
left a comment
I am not a fan of introducing parallelism into the lowest level aggregate kernels.
I would like to request we remove the changes to the aggregate kernels out of this PR and keep it focused on increasing the safety of bit operations
Given the large variety of opinions regarding using rayon it is probably best to have those conversations in a separate PR.
@vertexclique This really isn't acceptable behavior. It is against the project's code of conduct [1] to insult or harass other contributors. We don't do that here.
Personally, I think the violations against me also fall under the code of conduct, in at least two occurrences:
I tried my best to assume good intentions. Even when there was no problem with the PRs, I received comments that were not meant to bring any productivity, but to change the entire aim of the PR.
Which I did:
Three months ago I thought the project was going in a nice direction; I think that got derailed somehow, and now it is not going very well from my point of view. Moreover, in general, the Apache Foundation's values described here are not quite upheld, in my opinion: https://www.apache.org/foundation/how-it-works.html#meritocracy So, finally, I would like to say that I am totally OK with not committing to this project. Feel free to close this PR.
Hi @vertexclique. All your contributions are very much appreciated. You are one of the most advanced contributors to the project, which means it's important for the other members of the community to be able to review and ask questions. This may be frustrating for you, but it is necessary. In general, smaller, more focused PRs will be easier to merge. Larger PRs, or ones that are low-level in nature, need to be reviewed in more depth, and we should be glad to have @jhorstmann and others provide constructive feedback. Also, you may be asked to make changes to your PRs, and you may not always agree; this can always happen in a community-run project. In general, you need to not take feedback so personally. I have re-read the interactions you mentioned and I'm sorry, I can't see the issues you are referring to, and I think you were wrong to call out @jhorstmann. On this PR I agreed with @jhorstmann's perspective, as did @alamb; I at least thought the question was a good one to ask. On the other PRs you linked to, it's clear that @jhorstmann is doing his best to be polite while providing feedback, for example:
Hopefully, you can move past this and keep contributing to the project.
I think there are some very interesting things in this PR:
Would love if this PR would be continued, maybe in a slimmed down form? |
I agree with @Dandandan's comments 👍
@vertexclique -- Given the imminent Arrow 3.0 release, I am trying to clean up older Rust PRs and see if the authors have plans to move them forward. Do you plan on working on this PR in the near future? If not, should we close this PR until there is time to make progress? Thanks again for your contributions so far.
@alamb Thanks for reaching out! I don't have time to work on these PRs. Closing.