Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorization of find_next_host_delimiter and find_next_host_delimiter_special #548

Merged
merged 7 commits into from
Nov 1, 2023

Conversation

lemire
Copy link
Member

@lemire lemire commented Oct 31, 2023

We vectorize manually for SSE2 and NEON two functions that relied on SWAR: find_next_host_delimiter and find_next_host_delimiter_special. To assess the performance effect, we use benchdata and the BasicBench_AdaURL_aggregator_href measure.

Apple M2, LLVM 14

Before: speed=484.976M/

After: speed=496.554M/s

GCC 12, Intel Ice Lake

Before: speed=375.019M/s

After: speed=401.595M/s

Simpler alternative

If we remove optimizations and switch back to the following (which compiles to a bitset lookup, character-by-character), our speed goes down slightly to 372.901M/s (GCC 12, Intel Ice Lake).

for (size_t i = location; i < view.size(); i++) {
      if (view[i] == ':' || view[i] == '/' || view[i] == '\\' ||
          view[i] == '?' || view[i] == '[') {
        return i;
      }
    }
    return size_t(view.size());
  }

This might compile to...

        movabs  rcx, 17592186112001
        mov     rax, rdx
.L2:
        cmp     rax, rdi
        jnb     .L7
        mov     dl, BYTE PTR [rsi+rax]
        sub     edx, 47
        cmp     dl, 44
        ja      .L3
        bt      rcx, rdx
        jc      .L1
.L3:
        inc     rax
        jmp     .L2

... which is likely highly competitive with a character-by-character table lookup (because of the latency of a table lookup).

Fixes #547

cc @the-moisrex

@the-moisrex
Copy link
Contributor

Seems like a simple lookup table is faster than the current main branch, and this PR is faster than a simplified version; but proposed benchmark does a ton of allocations and what not, and I saw a lot of noises in this benchmark (possibly) due to those.

Version Mean Median
Simplified speed=289.901M/s speed=290.272M/s
Main speed=267.224M/s speed=260.651M/s
Vectorized(SSE2) speed=307.408M/s speed=307.465M/s
./benchmarks/benchdata --benchmark_filter=BasicBench_AdaURL_aggregator_href --benchmark_color=true --benchmark_repetitions=30

Simplified:

// : / [ \\ ?
static constexpr std::array<bool, 128> special_host_delimiters{
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, true,  false, false, false, false, false, false, false,
    false, false, false, true,  false, false, false, false, true,  false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, true,  true,  false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false,
};

// : / [ ?
static constexpr std::array<bool, 128> host_delimiters{
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, true,  false, false, false, false, false, false, false,
    false, false, false, true,  false, false, false, false, true,  false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, true,  false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false, false, false, false, false,
    false, false, false, false, false, false, false,
};

// starting at index location, this finds the next location of a character
// :, /, \\, ? or [. If none is found, view.size() is returned.
// For use within get_host_delimiter_location.
// ['0x3a', '0x2f', '0x5c', '0x3f', '0x5b']
ada_really_inline size_t find_next_host_delimiter_special(
    std::string_view view, size_t location) noexcept {
  auto const str = view.substr(location);
  for (auto pos = str.begin(); pos != str.end(); ++pos) {
    if (special_host_delimiters[*pos]) {
      return pos - str.begin();
    }
  }
  return size_t(view.size());
}

// starting at index location, this finds the next location of a character
// :, /, ? or [. If none is found, view.size() is returned.
// For use within get_host_delimiter_location.

ada_really_inline size_t find_next_host_delimiter(std::string_view view,
                                                  size_t location) noexcept {
  auto const str = view.substr(location);
  for (auto pos = str.begin(); pos != str.end(); ++pos) {
    if (host_delimiters[*pos]) {
      return pos - str.begin();
    }
  }
  return size_t(view.size());
}

@anonrig
Copy link
Member

anonrig commented Nov 1, 2023

One of the tests is failing.

#if ADA_NEON
ada_really_inline size_t find_next_host_delimiter_special(
std::string_view view, size_t location) noexcept {
// first check for short strings in which case we do it naively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement seems like repetitive code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? We handle the short string (< 16 characters) naively. What is repetitive?

src/helpers.cpp Outdated Show resolved Hide resolved
src/helpers.cpp Outdated
@@ -197,9 +197,132 @@ ada_really_inline uint64_t swap_bytes_if_big_endian(uint64_t val) noexcept {
#endif
}

ada_really_inline int trailing_zeroes(uint32_t input_num) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment on top of this function? If we don't want to expose this, we should add private to comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did. In my view, an inline function defined in a source file and just used locally does not have to be declared in the header file (and indeed, doing so just adds noise, isn't it?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To be clearer: I did add a comment in a later commit.)

@the-moisrex
Copy link
Contributor

the-moisrex commented Nov 1, 2023

Let me know if you want the simplified version as a separate PR; also, I haven't tested the std::bitset way of doing this here, but that's generally slower than look up table.

@lemire
Copy link
Member Author

lemire commented Nov 1, 2023

@the-moisrex Please see #549

@lemire
Copy link
Member Author

lemire commented Nov 1, 2023

@the-moisrex I have updated this PR. We fallback on your proposal when SIMD is not available. I added credit to you in the comments.

@lemire
Copy link
Member Author

lemire commented Nov 1, 2023

I haven't tested the std::bitset way of doing this here, but that's generally slower than look up table.

By bitset, I did not mean std::bitset. The compiler will use a single 64-bit word as a bitset.

@the-moisrex
Copy link
Contributor

I haven't tested the std::bitset way of doing this here, but that's generally slower than look up table.

By bitset, I did not mean std::bitset. The compiler will use a single 64-bit word as a bitset.

The approach in character_sets-inl.h is almost the same way that std::bitset does it, that's why I mentioned it.

@lemire
Copy link
Member Author

lemire commented Nov 1, 2023

The approach in character_sets-inl.h is almost the same way that std::bitset does it, that's why I mentioned it.

Yep. The percent coding work can be accelerated.

We should vectorize it, at least on some systems. We have a PR on this front.

@lemire lemire requested a review from anonrig November 1, 2023 04:55
@anonrig anonrig merged commit 88d9b0a into main Nov 1, 2023
34 checks passed
@anonrig anonrig deleted the vectorize_find_next_host_delimiter_special branch November 1, 2023 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Questionable premature optimizations?
3 participants