Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SIMD escape #160

Closed
wants to merge 4 commits into from
Closed

Add SIMD escape #160

wants to merge 4 commits into from

Conversation

zzau13
Copy link
Contributor

@zzau13 zzau13 commented Nov 16, 2018

askama_escape/src/lib.rs Outdated Show resolved Hide resolved
@zzau13
Copy link
Contributor Author

zzau13 commented Nov 17, 2018

Wait alignment solution, please comment unsafe solution.

@djc
Copy link
Collaborator

djc commented Nov 17, 2018

So I was thinking about this last night and I think we should separate out the long and short benchmarks so we can compare the effect of using AVX2 instructions on them separately (based on my reading, I think AVX2 stuff might actually slow down short string escaping). Can you do that? I'll also want to chew a bit on this code since I myself don't have any experience with AVX2 and I want to really understand it before we merge this.

Also I think next we might want to explore searching based on the PCMPESTRI instruction from SSE 4.2, which might be faster than AVX2 according to what's described here.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 17, 2018

Ok, I do the separation in the bench.
picohttpparser is implemented with that I look at it.

@zzau13 zzau13 changed the title Add avx escape Add SIMD escape Nov 19, 2018
@djc
Copy link
Collaborator

djc commented Nov 21, 2018

So, high level conclusions just from looking at your benchmark data:

  • The SSE implementation doesn't really pay for itself in terms of complexity / performance
  • AVX2 is faster even for the shortest strings you've tested (5 characters)
  • AVX2 is only neglibly faster if there's not a lot to escape

Very good stuff! In light of these findings, I'd like to change the benchmarks a little:

  • Lengths of 1 bytes, 10 bytes, and 5 MB (to have more granularity about smaller strings)
  • At 10 bytes and 5 MB, add strings that have 3% escaped characters

My question here is if the AVX2 stuff is even worth having, or if it's really only effective at levels of escaped characters that are unlikely to come up in the wild.

(I'm really sorry if we end up not using some of this painstakingly developed code, but I also want to be realistic about the complexity it adds and whether it's worth it in terms of real-world performance.)

@djc
Copy link
Collaborator

djc commented Nov 21, 2018

Also, can you squash the code prettification from the last commit into the related earlier commits?

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 21, 2018

If I always finish squash, if I leave the commit is so you can see the changes easily.

As I mentioned, it is less effective for string length less than the instruction, 16 or 32.
The result of the benchs: vectorized is log constant as a function of to_string(), so it is much more convenient for a production environment than the scalar.

If the e2e would stop escaping numbers or relocate string, duplicated to_string(), is very difficult to have the pointer aligned, would go much faster. But we need to reimplement From of the generic type Display for Num, String, str... with rust-lang/rust#31844
Let's put a bench of a forum or blog and the result will be spectacular. But in these cases it is not appreciated, it penalizes as I told you before starting. As I said, the easiest thing is to make an template operator to mark long strings, than more 10-16 characters.

My opinion is that we should improve MarkupDisplay and limit the extreme cases.

The benchs are already uploaded.
I do not know what happens to AppVeyor.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 22, 2018

We are comparing the efficiency between a truck and a motorcycle to carry a candy.

It does not make any sense and it is also that this result warned by gitter when you asked me. Now you can not tell that all the work does not work because what I said has happened.

@djc
Copy link
Collaborator

djc commented Nov 22, 2018

Yes, you were right. Just to make sure I understand, are you angry and/or frustrated with me for this?

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 22, 2018

I'm not angry nor frustrated, but is very unefficient that after talking and knowing about that possible issue you comment on it like a problem or something unexpected.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 22, 2018

Wtf, Bill, what do we do with that memory? I have icons but not in my memory, perfect.

@djc
Copy link
Collaborator

djc commented Nov 22, 2018

Wtf, Bill, what do we do with that memory? I have icons but not in my memory, perfect.

Huh? Not sure what you mean.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 22, 2018

In Windows, AppVeyor, it gives an equality error in two strings with the same bytes. But I still have 4k icons and these useful things. I was messing with Windows. I removed the simd support to windows.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 23, 2018

Well, what do you want to do with this pull request?
For my part, I do not have any more changes.

@djc
Copy link
Collaborator

djc commented Nov 23, 2018

I'm not angry nor frustrated, but is very unefficient that after talking and knowing about that possible issue you comment on it like a problem or something unexpected.

It's not a problem per se, but I still didn't know what to expect. That might be because I don't have a lot of experience doing actual performance optimization, even though I know quite a bit about what kinds of things can cause performance issues. So I had expected that AVX2 would be substantially faster at large strings even if there wasn't a lot to escape. Now that it isn't, I'm no longer sure if there's value in it.

What do you think we should do? Do you think adding the SSE and AVX versions of the code will lead to better performance on real-world workloads? For that matter, have you tried and compared this branch on an actual project you've used Askama for?

My inclination is to see if I can figure out a better benchmark, in the sense that it should be a more reflective mix of longer and shorter strings as well as the relative frequencies of characters from the different relative spaces. For example, it occurs to me now with the benchmark results we have that the trick of ignoring byte values from outside the range of escapable characters is very effective, but that means that digits and non-escaped interpunction will have a different effect on the benchmark results -- yet, the current benchmark data does not contain this class of character at all.

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 23, 2018

When I do not like sse, but the avx if I like it. It is still very green and does not give the expected performance.

I will continue investigating vectorized operations with @tizgafa and later we will try again on askama.

Viewing the results, prioritize askama in the recognition of elements not escapeable and the heuristic of output size.

How about?

@zzau13
Copy link
Contributor Author

zzau13 commented Nov 23, 2018

I close it. The experimentation of the escape with simd a new repo. In addition to Askama, we will devote ourselves to implement all kinds of algorithms with this tool, in case you want to raise another algorithms.

@zzau13 zzau13 closed this Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants