
Parallelize Lagrange constraints evaluation #316

Conversation

@Al-Kindi-0 (Contributor) commented Sep 13, 2024

Improves Lagrange constraints evaluation. This improves the end-to-end LogUp-GKR benchmark by 17-18% on my machine.

@irakliyk (Collaborator):

Thank you! Not a review yet, but a couple of questions:

Improves Lagrange constraints evaluation by 17-18% on my machine.

What can I run to test the effect on my machine?

Also, this seems quite a bit less than what we get by parallelizing general constraint evaluation (I think it is something like 4x or 5x on your machine). Are there big parts still left to address for Lagrange kernel constraints? Or is there something inherently different here?

@Al-Kindi-0 (Contributor, Author) commented Sep 14, 2024

The commands I used were:

  1. cargo bench --bench logup_gkr
  2. cargo bench --features concurrent --bench logup_gkr

The improvement is for the overall benchmark, which is end-to-end. Since constraint evaluation is only a small percentage of the total time, the speedup to Lagrange constraint evaluation itself is probably on the order you gave.
I updated the original comment as it was misleading.

Edit: I did some back-of-the-envelope calculations, and the actual improvement is probably around 1.5x.

Comment on lines +84 to +88
#[cfg(feature = "concurrent")]
combined_evaluations_acc
.par_iter_mut()
.enumerate()
.fold(
Collaborator:

One potential issue with this approach is poor memory locality resulting from the essentially random assignment of data to various threads. An alternative would be to do something similar to what we do with regular constraint evaluation:

  1. First, we split up the trace into adjacent fragments. The number of fragments would be equal to the number of threads.
  2. Then we evaluate each fragment in a different thread (e.g., see here).

I'm not 100% sure this will yield a significant improvement, but given that we are seeing only 1.5x overall improvement, I think this may be worth a try.
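The fragment-based strategy described above can be sketched roughly as follows. This is a minimal, std-only illustration (real code would use rayon and the actual evaluator); `evaluate_row` and `evaluate_fragments` are hypothetical stand-ins, not functions from the codebase:

```rust
use std::thread;

// Hypothetical stand-in for the per-row Lagrange constraint evaluation.
fn evaluate_row(row: u64) -> u64 {
    row.wrapping_mul(3).wrapping_add(1)
}

// Split the accumulator into adjacent fragments, one per thread, so each
// thread walks a contiguous region of memory (better locality than
// scattering rows across threads).
fn evaluate_fragments(acc: &mut [u64], num_threads: usize) {
    let frag_len = (acc.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        for (frag_idx, fragment) in acc.chunks_mut(frag_len).enumerate() {
            s.spawn(move || {
                let offset = frag_idx * frag_len;
                for (i, slot) in fragment.iter_mut().enumerate() {
                    *slot = evaluate_row((offset + i) as u64);
                }
            });
        }
    });
}

fn main() {
    let mut acc = vec![0u64; 16];
    evaluate_fragments(&mut acc, 4);
    // Same result as a sequential pass, but computed in 4 contiguous chunks.
    let expected: Vec<u64> = (0..16).map(evaluate_row).collect();
    assert_eq!(acc, expected);
}
```

The key difference from the `par_iter_mut().enumerate().fold(...)` approach in the diff is that each thread owns one contiguous slice, rather than whatever rows the work-stealing scheduler hands it.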

Comment on lines +99 to +103
trace.read_lagrange_kernel_frame_into(
step << lde_shift,
l_col_idx,
&mut lagrange_frame,
);
Collaborator:

One potential thing to check is whether using Vec::push() inside this method is slowing things down for some reason. It shouldn't, but we usually use direct element assignment for things like this.
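For illustration, the two patterns being contrasted look roughly like this; `fill_with_push` and `fill_by_assignment` are hypothetical names, not methods from the codebase:

```rust
// Growing a fresh Vec: every push checks capacity and bumps the length.
fn fill_with_push(values: &[u64]) -> Vec<u64> {
    let mut frame = Vec::with_capacity(values.len());
    for &v in values {
        frame.push(v);
    }
    frame
}

// Direct element assignment into a buffer allocated once by the caller,
// which is the pattern typically used for frame reads like
// read_lagrange_kernel_frame_into.
fn fill_by_assignment(values: &[u64], frame: &mut [u64]) {
    for (slot, &v) in frame.iter_mut().zip(values) {
        *slot = v;
    }
}

fn main() {
    let values = [7u64, 11, 13];
    let pushed = fill_with_push(&values);
    let mut assigned = vec![0u64; values.len()];
    fill_by_assignment(&values, &mut assigned);
    assert_eq!(pushed, assigned);
}
```

Both produce the same contents; the second reuses one allocation across calls, which is why it is usually preferred in hot loops.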

@Al-Kindi-0 (Contributor, Author):

Subsumed by #317

@irakliyk closed this Sep 18, 2024