Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: use AVX2 in FastParseCommand #813

Open
mgravell opened this issue Nov 18, 2024 · 8 comments
Open

Suggestion: use AVX2 in FastParseCommand #813

mgravell opened this issue Nov 18, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@mgravell
Copy link
Contributor

mgravell commented Nov 18, 2024

Feature request type

Performance enhancement in command identification

Is your feature request related to a problem? Please describe

The [current implementation] (

private RespCommand FastParseCommand(out int count)
) splits by length then matches the last 8 bytes against a set of known values. This works, but is a lot of branches - effectively it is a bunch of:

if (lastChunk == SOME_NUMBER) return foo;
if (lastChunk == SOME_OTHER_NUMBER) return bar;
// ...

Describe the solution you'd like

It turns out that AVX2 is good at this - at least, if we limit ourselves to 4-byte chunks, which is a pretty good filter, especially if we ignore the trailing \r\n. The basic approach here is discussed here, and is fundamentally:

  • load known values vector (8 x 32-bit)
  • load and broadcast test operand _mm256_set1_epi32
  • compare for equality _mm256_cmpeq_epi32
  • pull the MSB of each 32-bit result _mm256_movemask_ps into a single value (8 effective bits)
  • find the index of the set bit, if any tzcnt

It would still need a final "were we right?" check, like the existing code does if it doesn't fit in just 8 bytes.

I did a quick demo, here, which so far only handles 8 commands. Obviously it is very incomplete, just to indicate "this is what is available". I also did a test against using static readonly uint values instead of Read<uint> each time, to look at all options (that didn't move the needle any, or is possibly worse somehow, or is just jitter).

Tests here include:

  • one value from the start of the test list (PING)
  • one value from the end of the test list (INCR)
  • one value that isn't in the test list (JUNK)

Results:

| Method            | Command | Mean      | Error     | StdDev    | Ratio | RatioSD |
|------------------ |-------- |----------:|----------:|----------:|------:|--------:|
| ComputePerCall    | INCR    | 1.0559 ns | 0.2056 ns | 0.0113 ns |  1.00 |    0.01 |
| PreComputedSwitch | INCR    | 1.0576 ns | 0.1848 ns | 0.0101 ns |  1.00 |    0.01 |
| PreComputedAvx2   | INCR    | 0.2417 ns | 0.0299 ns | 0.0016 ns |  0.23 |    0.00 |
|                   |         |           |           |           |       |         |
| ComputePerCall    | JUNK    | 1.0449 ns | 0.1534 ns | 0.0084 ns |  1.00 |    0.01 |
| PreComputedSwitch | JUNK    | 1.0495 ns | 0.5420 ns | 0.0297 ns |  1.00 |    0.03 |
| PreComputedAvx2   | JUNK    | 0.2480 ns | 0.0986 ns | 0.0054 ns |  0.24 |    0.00 |
|                   |         |           |           |           |       |         |
| ComputePerCall    | PING    | 0.3143 ns | 0.0095 ns | 0.0005 ns |  1.00 |    0.00 |
| PreComputedSwitch | PING    | 0.2359 ns | 0.0422 ns | 0.0023 ns |  0.75 |    0.01 |
| PreComputedAvx2   | PING    | 0.2422 ns | 0.0975 ns | 0.0053 ns |  0.77 |    0.01 |

The data shows that the existing switch code has time that extends with the number of tests, where-as the AVX2 approach is consistent and fast, giving basically constant time, rather than time that is the number of missed cases. I would expect this to grow proportional to the number of AVX2 tests performed, but it should still be faster.

Describe alternatives you've considered

I have not considered AVX512; in theory this would allow either 16 32-bit tests, or 8 64-bit tests, at a time - but AVX512 is still pretty under-served. It may not be worth the additional effort.

Additional context

The biggest question I can think of here is: which 4 bytes to test? the first 4? or the last 4? someone could probably do a collision analysis to see which narrows it down more. There are collisions either way (LPUSH vs BRPOPLPUSH, INCR vs INCRBY, FLUSHDB vs FLUSHALL, SSCAN vs SCAN vs HSCAN etc). My guess is that the first four bytes give more entropy, but heck: if you switch on length first, you could always do 2 AVX2 tests for collision scenarios, i.e. test the first four and the last four. Or something.

@mgravell
Copy link
Contributor Author

(borked the test; closing while I fix that, in case this is a nothing-burger)

@mgravell
Copy link
Contributor Author

@badrishc I closed this while I fixed the test/data (now done); feel free to reopen if it is of interest (I don't have access for that).

@badrishc badrishc reopened this Nov 18, 2024
@vazois
Copy link
Contributor

vazois commented Nov 19, 2024

This is a very cool idea! We have actually been thinking about doing this for a while, but we never got around to do it. The numbers look promising, I assume the speedup will be much greater for commands at the bottom of the switch statement. The bigger question: what is the expected end to end gain? So, for example, what is the speedup if we process a batch of commands with parameters instead of individual commands headers? We can use BDN for this.

@badrishc
Copy link
Contributor

badrishc commented Nov 19, 2024

@mgravell - this is indeed interesting but we need to see if this helps end-to-end. There are BDN benchmarks for Garnet operations at https://github.com/microsoft/garnet/tree/main/benchmark/BDN.benchmark/Operations -- you might want to prototype against that to check that there is an opportunity here.

@mgravell
Copy link
Contributor Author

mgravell commented Nov 19, 2024 via email

@mgravell
Copy link
Contributor Author

mgravell commented Nov 20, 2024

I did some more digging this morning; it turns out that if you're testing multiple values, ReadOnlySpan<uint> etc already do AVX-based searching when possible, so: you have the option of simply pre-calculating a static readonly uint[] or similar from the partial matches (I would suggest first-four or last-four), and you can do a very effective search on that. If you've already identified either the command-length or the argument-count, you could also further split on that, i.e. by having a separate uint[] for commands of different lengths. This would then allow, for example:

ReadOnlySpan<uint> candidates = commandLength switch {
    3 => __CommandPrecomputed_3, // note: for 3, include trailing \n from preamble, so GET is "\nGET"u8
    4 => __CommandPrecomputed_4,
    5 => __CommandPrecomputed_5,
    6 => __CommandPrecomputed_6,
    _ => [],
};
var index = candidates.IndexOf(ComputeUInt32FromLastFourBytesOfCommand());
if (index >= 0) // recognized pattern, fast-path
{
    switch (commandLength)
    {
        // ...
        case 4:
            switch (index) // fully validated
            {
                case 0: // TIME ....
                case 1: // SREM....
                case 2: // SADD....
                case 3: // MGET ....
                case 4: // SCAN ....
                // ...
            }
        case 5:
            switch (index) // need to check leading 1 byte
            {
                // 0 == ?CARD
                case 0 when Starts("S"\u8): // SCARD...
                case 0 when Starts("Z"\u8): // ZCARD...
                // 1 == ?PUSH
                case 1 when Starts("L"\u8): // LPUSH...
                case 1 when Starts("R"\u8): // RPUSH...
                // ...
            }
    }
}

The key thing here being: a switch over 0, 1, 2, 3, ... etc becomes a simple jump operation - very fast (with final validation in the when); a switch on either large precomputed constants, or where all the "real" testing is in the when is basically a giant list of if branches.

@badrishc
Copy link
Contributor

badrishc commented Nov 21, 2024

All of this sounds good and potentially is a huge win. But it does seem like a non-trivial amount of dev effort to translate the idea into Garnet's parser. Would this be of interest to you or anyone else in the community to actually try out in Garnet code? The starting point would likely be https://github.com/microsoft/garnet/blob/main/libs/server/Resp/Parser/RespCommand.cs#L594

@darrenge darrenge added the enhancement New feature or request label Nov 25, 2024
@mgravell
Copy link
Contributor Author

I will see what I can do with this shortly - hopefully I should have some useful numbers based on the real code (where "useful" can include "no meaningful impact, let's do nothing") this side of 2025.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants