
High spinlock overhead #146

Open
milianw opened this issue Jun 28, 2022 · 4 comments
Comments

@milianw
Contributor

milianw commented Jun 28, 2022

Hey there,

in one of our scenarios, we encounter quite a lot of contention that leads to significant CPU overhead in the libcuckoo spinlock implementation.

On the one hand, the spinlock could and should be improved based on the insights from https://rigtorp.se/spinlock/.

On the other hand, I did that locally (I will provide a pull request for it) and did not measure a huge improvement. Spinlocks are naturally CPU-intensive under contention. In my case, as outlined in #138, an insertion can take significant resources and time, so contention is to be expected. I would like to reduce the CPU overhead, but I don't see a way to do that with ordinary spinlocks.

Would it be acceptable to extend libcuckoo to allow finer-grained control over the lock, potentially even allowing me to provide my own lock class? See also https://www.youtube.com/watch?v=z7fPxjZKsBY&t=1550s for some inspiration on what could be done.

I'm open to discussion and insight here. Additionally, I'll try to find some time to create a benchmark for this scenario, so we can make better-informed decisions. Or do you happen to have a benchmark for the contended case already?

Thanks

milianw added a commit to KDABLabs/libcuckoo that referenced this issue Jun 28, 2022
Use the code from [1] to implement the spinlock based on
std::atomic<bool> instead of std::atomic_flag. While the former
is not necessarily lock-free, it is on the majority of platforms.
A static assert is added to catch platforms where that is not the
case - there, we could potentially fall back to the older
implementation instead.

WIP because I don't have a good scientific benchmark for this yet.
I tested it in our real-world application, and it seems to have
slightly reduced the load of the spinlock, but not by a large
margin.

See also: efficient#146

[1]: https://rigtorp.se/spinlock/
@BurningEnlightenment

See also https://www.youtube.com/watch?v=z7fPxjZKsBY&t=1550s for some inspiration of what could potentially be done.

Conditionally using C++20's futex support (i.e. wait() and notify()) should be quite easy. However, this should be a configurable option, because mixing C++17-or-earlier and C++20 TUs could otherwise lead to deadlocks.

@milianw
Contributor Author

milianw commented Jun 28, 2022

Sadly, we cannot use C++20 yet and are stuck with C++17 until macOS updates its toolchain :( So at least for my own purposes, I'm looking for a solution that is also compatible with C++17.

@manugoyal
Contributor

Hi Milian, thanks for bringing this up! I can certainly see how spinlocks in general may be suboptimal for use cases where the critical section is expensive, since spinning wastes CPU cycles. So from that standpoint, there is value in allowing users to specify their own lock implementation. (Separately, I'm happy to accept general-purpose improvements to the spinlock implementation :).)

On the other hand, I worry that fixing a specific locking interface may restrict the set of future optimization opportunities for the table. For instance, the original paper that libcuckoo is based on employed atomic version counters instead of actual locks to reduce contention (https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf, search for "optimistic locking"). We've also experimented with different locking implementations such as read-write locks. It would be great to maintain this flexibility in how we manage concurrency in the hashtable.

For the use-case you detailed in #138, would it be at all possible to try moving the expensive work outside the hashtable insert operation? I'm thinking something along these lines:

struct optional_index {
  std::atomic<bool> has_value{false}; // could be any signaling mechanism depending on your use-case
  std::size_t index = 0;
};

std::shared_mutex mutex;
std::vector<std::string> indexToString;
libcuckoo::cuckoohash_map<std::string, std::unique_ptr<optional_index>> stringToIndex;

auto indexForString = [&](const std::string &string)
{
    bool is_new;
    optional_index* opt_index;
    stringToIndex.upsert(string, [&](std::unique_ptr<optional_index>& val, UpsertContext context) {
      switch (context) {
        case UpsertContext::NEWLY_INSERTED:
          is_new = true;
          val = std::make_unique<optional_index>();
          break;
        case UpsertContext::ALREADY_INSERTED:
          is_new = false;
          assert(val != nullptr);
          break;
      }
      opt_index = val.get();
    });
    if (is_new) {
      {
        auto lock = std::unique_lock(mutex);
        opt_index->index = indexToString.size();
        indexToString.push_back(string);
      }
      opt_index->has_value.store(true, std::memory_order_release);
      return opt_index->index;
    } else {
      while (!opt_index->has_value.load(std::memory_order_acquire)) {}
      return opt_index->index;
    }
};

@milianw
Contributor Author

milianw commented Jul 4, 2022

I'm all for keeping things flexible on the implementation side. Instead of letting the user pass in a custom mutex/lock implementation, maybe for starters one could offer various locking schemes that are selectable via some enumerator? The current polling would be the default, and one could then add e.g. a scheme that waits instead. Then again, maybe a generic improvement to the spinlock, e.g. employing exponential backoff, would be a win-win and wouldn't need to be user-configurable. I clearly don't know enough about this topic yet :)

Regarding your suggestion for my example use case: thanks, that is a very interesting idea, and I will try it out for sure. I just fear that it will simply move the contention from one place to another, as we would then burn CPU time polling in the new while loop. Anyhow, I should really try it out and report some hard numbers back to you.

Cheers

BurningEnlightenment added a commit to BurningEnlightenment/libcuckoo that referenced this issue Jul 7, 2022
During the discussion of [1] it was suggested that the contention
could be reduced by relying on C++20 futex APIs. Additionally, we
incorporated the `try_lock()` design from [2].

[1]: efficient#146
[2]: https://rigtorp.se/spinlock/

Co-authored-by: Milian Wolff <[email protected]>