Skip to content

Conversation

@SteveBronder
Copy link
Collaborator

@SteveBronder SteveBronder commented May 3, 2023

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: Steve Bronder

Summary

When we first wrote the assignments for the new matrix type we didn't optimize very hard for multi_index assignments. The discourse thread here makes a case for using these so I decided to fix these up a bit.

Instead of using a set to keep track of unique indices being assigned to we just use a map with the key being the index to assign to and the value being the linear index position to access the right hand side vector / matrix. I haven't benchmarked this but it should take this from O(N * ln(N)) to O(N). It also cuts out an N sized vector we used to originally store indices. It also cuts out an if statement we had to do in the loop which should be much nicer for the CPU

Intended Effect

Make multi index assignment more performant

How to Verify

All tests still pass for multi index assignment

Side Effects

None hopefully

Documentation

No doc changes

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@WardBrian
Copy link
Member

I'd like to see some more benchmarks on this before merging. For the model @mike-lawrence was using on the forums, I was able to re-create the following numbers

Matrix mode Time (approximate)
AoS 3 seconds
SoA (2.32.1 release) 12 seconds
SoA (this branch) 8 seconds

So for a very large model with a lot of multi-indexing, this seems to be a win over the current implementation but still quite far off from the default (AoS) implementation.

There are really two questions:

  • Would alternative implementations (like Boost 1.82's flat_map or our own small implementation of the same) would be faster (curious, but not essential)
  • Are there models where this change makes multi-index assignment slower?

@SteveBronder
Copy link
Collaborator Author

Yes I'm going to writeup a google benchmark for this vs AoS soon. I think the big Q for me is at whether multi index assignment should always be turned off in the compiler. Those are like 3x slowdowns so I think there would need to be a lot of other work being done before SoA leads to an improvement here

@bob-carpenter
Copy link
Member

I think the big Q for me is at whether multi index assignment should always be turned off in the compiler. Those are like 3x slowdowns

Given that we have to support multi-index assignment in the language. Is the issue just getting it efficient for varmat relative to matvar? How are you measuring? Also given that the advantage of varmat is downstream operation memory locality, not the assignment itself, how are you measuring?

@SteveBronder
Copy link
Collaborator Author

Given that we have to support multi-index assignment in the language. Is the issue just getting it efficient for varmat relative to matvar?

Yep!

How are you measuring? Also given that the advantage of varmat is downstream operation memory locality, not the assignment itself, how are you measuring?

These are Qs I don't have answers to rn but I'm thinking about. It's hard! My first thought is varying sizes of multi-index assignments followed by a large multiplication then call to normal_lpdf.

@bob-carpenter
Copy link
Member

My first thought is varying sizes of multi-index assignments followed by a large multiplication then call to normal_lpdf.

Me, too. That sounds completely reasonable. Of course, based on size of the matrix, results will vary, so it'll be good to report for a range of sizes.

@SteveBronder
Copy link
Collaborator Author

So I've built a little repo for generally doing benchmarks with Stan and Stan math

https://github.com/SteveBronder/stan-perf

In there I have benchmarks for testing the current impl, using unordered_map, and a hash table with open addressing and linear probing.

I'm running a benchmark that does a little more work rn, but for just a plain assignment multi index is pretty slow for SoA. The current version (set) is horrific, unordered map is about twice as good, and the custom map about twice as good as unordered map.

raw_compare_plot

I took just the unordered map and custom map and looked at their average performance relative to matrix. It seems like it's about a 50% slowdown even with the custom map, but imo I think as long as the user is doing just a few operations the new memory scheme should still be an improvement. I'm running some benchmarks rn to test that.

map_to_matvar

@SteveBronder
Copy link
Collaborator Author

Okay so after updating the tests to also do a little extra work at the end (dot product and addition) I think we should add the custom hash table.

https://github.com/SteveBronder/stan-perf/tree/main/benchmarks/varmat_multi_assign

Here are the updated raw plots where custom map is much faster after only doing a little extra work after the assignment.

raw_compare_plot2

And here is the ratio of times which shows the custom map being about 30% faster than matrix<var>.

map_to_matvar2

@bob-carpenter are you okay with me adding a paired down version of the hashmap from here? I'm not sure what folder it would go in?

@bob-carpenter
Copy link
Member

In terms of bigger picture, we want matrices to work in the case where users set every entry. In fact, I think that's a very common use case for something like GPs (and why I want to get comprehensions in ASAP). I'm actually OK having struct of arrays (I thought that was what we'd settled on?) be constant and not user settable at all. That is, if a container gets modified, use array of structs.

Are there example use cases where a Stan program only sets a few elements of a larger matrix? Fiddling with the diagonal of a matrix maybe, but that's still N ops for an N x N matrix.

@bob-carpenter are you okay with me adding a paired down version of the hashmap from here?

Totally OK adding a custom map as long as it has sufficient testing.

I'm not sure what folder it would go in?

We can create something like a new utility class folder if there isn't something like that already. I guess that'd be inside prim since this won't do any explicit autodiff, right?

@SteveBronder
Copy link
Collaborator Author

In terms of bigger picture, we want matrices to work in the case where users set every entry. In fact, I think that's a very common use case for something like GPs (and why I want to get comprehensions in ASAP).

Yes I agree when @WardBrian gets back I think we will talk about how to do that

I'm actually OK having struct of arrays (I thought that was what we'd settled on?) be constant and not user settable at all. That is, if a container gets modified, use array of structs.

I think it's pretty easy to add the hash table and we should be able to allow for all the subsettting besides single index in a loop

Are there example use cases where a Stan program only sets a few elements of a larger matrix? Fiddling with the diagonal of a matrix maybe, but that's still N ops for an N x N matrix.

I think sometimes people will do things with contiguous slices of a matrix, like saying operating over each column. The speed for that stuff should be fine and this change should make the rest fine as well.

@bob-carpenter are you okay with me adding a paired down version of the hashmap from here?

Totally OK adding a custom map as long as it has sufficient testing.

Alrighty yeah the hashmap I used comes from this repo (MIT) licensed. It doesn't have any tests but is a well known table so I may kill two birds with one stone and add tests here and send a PR upstream to that PR with the tests

We can create something like a new utility class folder if there isn't something like that already. I guess that'd be inside prim since this won't do any explicit autodiff, right?

I think I'd keep the folder in the model folder since I'd rather not have to go back and forth between repos and this is the only place we use it atm. If that changes then we can always put it in the math library

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SoA optimization can over-promote objects used inside loops to produce slower code

3 participants