
High spinlock overhead #146

Open
milianw opened this issue Jun 28, 2022 · 4 comments
Comments

@milianw
Contributor

milianw commented Jun 28, 2022

Hey there,

in one of our scenarios, we encounter quite a lot of contention that leads to significant CPU overhead in the libcuckoo spinlock implementation.

On the one hand, the spinlock could and should be improved based on the insights from https://rigtorp.se/spinlock/.

On the other hand, I did that locally (I will provide a pull request for it) and did not measure a huge improvement. Spinlocks are naturally CPU-intensive under contention. In my case, as outlined in #138, an insertion can take significant resources and time, so contention is to be expected. I would like to reduce the CPU overhead, but I don't see a way to do that with ordinary spinlocks.

Would it be acceptable to extend libcuckoo to allow finer-grained control over the lock, potentially even allowing me to provide my own lock class? See also https://www.youtube.com/watch?v=z7fPxjZKsBY&t=1550s for some inspiration on what could be done.

I'm open to discussion and insight here. Additionally, I'll try to find some time to create a benchmark for this scenario, so we can make better-informed decisions. Or do you happen to have a benchmark for the contended case already?

Thanks

milianw added a commit to KDABLabs/libcuckoo that referenced this issue Jun 28, 2022
Use the code from [1] to implement the spinlock based on
std::atomic<bool> instead of std::atomic_flag. While the former
is not necessarily lock-free, it is on the majority of platforms.
A static assert is added to catch platforms where that is not the
case - there, we could potentially fall back to the older
implementation instead.

WIP because I don't have a good scientific benchmark for this yet.
I tested it in our real-world application, and it seems to have
slightly reduced the load of the spinlock, but not by a large
margin.

See also: efficient#146

[1]: https://rigtorp.se/spinlock/
@BurningEnlightenment

See also https://www.youtube.com/watch?v=z7fPxjZKsBY&t=1550s for some inspiration of what could potentially be done.

Conditionally using C++20's futex support (i.e. wait() and notify()) should be quite easy. However, this should be a configurable option, because mixing C++17-or-earlier and C++20 TUs could otherwise lead to deadlocks.

@milianw
Contributor Author

milianw commented Jun 28, 2022

Sadly, we cannot use C++20 yet and are stuck with C++17 until macOS updates its toolchain :( So at least for my own purposes, I'm looking for a solution that is also compatible with C++17.

@manugoyal
Contributor

Hi Milian, thanks for bringing this up! I can certainly see how spinlocks in general may be suboptimal for use cases where the critical section is expensive, since spinning wastes CPU cycles. So from that standpoint, there is value in allowing users to specify their own lock implementation. (Separately, I'm happy to accept general-purpose improvements to the spinlock implementation :).)

On the other hand, I worry that fixing a specific locking interface may restrict the set of future optimization opportunities for the table. For instance, the original paper that libcuckoo is based on employed atomic version counters instead of actual locks to reduce contention (https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf, search for "optimistic locking"). We've also experimented with different locking implementations such as read-write locks. It would be great to maintain this flexibility in how we manage concurrency in the hashtable.

For the use-case you detailed in #138, would it be at all possible to try moving the expensive work outside the hashtable insert operation? I'm thinking something along these lines:

struct optional_index {
  std::atomic<bool> has_value{false}; // could be any signaling mechanism depending on your use-case
  std::size_t index = 0;
};

std::shared_mutex mutex;
std::vector<std::string> indexToString;
libcuckoo::cuckoohash_map<std::string, std::unique_ptr<optional_index>> stringToIndex;

auto indexForString = [&](const std::string &string)
{
    bool is_new;
    optional_index* opt_index;
    stringToIndex.upsert(string, [&](std::unique_ptr<optional_index>& val, UpsertContext context) {
      switch (context) {
        case UpsertContext::NEWLY_INSERTED:
          is_new = true;
          val = std::make_unique<optional_index>();
          break;
        case UpsertContext::ALREADY_INSERTED:
          is_new = false;
          assert(val != nullptr);
          break;
      }
      opt_index = val.get();
    });
    if (is_new) {
      {
        auto lock = std::unique_lock(mutex);
        opt_index->index = indexToString.size();
        indexToString.push_back(string);
      }
      opt_index->has_value.store(true, std::memory_order_release);
      return opt_index->index;
    } else {
      while (!opt_index->has_value.load(std::memory_order_acquire)) {}
      return opt_index->index;
    }
};

@milianw
Contributor Author

milianw commented Jul 4, 2022

I'm all for keeping things flexible on the implementation side. Instead of letting the user pass in a custom mutex/lock implementation, maybe for starters one could offer various locking schemes that are selectable via some enumerator? The current polling would be the default, and one could then add e.g. a scheme that waits instead. Then again, maybe a generic improvement to the spinlock, e.g. employing exponential backoff, would be a win-win and wouldn't need to be user-configurable. I clearly don't know enough about this topic yet :)

Regarding your suggestion for my example use case: thanks, that is a very interesting idea, and I will try it out for sure. I just fear that it will simply move the contention from one place to another, as we would then burn CPU time polling in the new while loop. Anyhow, I should really try it out and report some hard numbers back to you.

Cheers

BurningEnlightenment added a commit to BurningEnlightenment/libcuckoo that referenced this issue Jul 7, 2022
During the discussion of [1] it was suggested that the contention
could be reduced by relying on C++20 futex APIs. Additionally, we
incorporated the `try_lock()` design from [2].

[1]: efficient#146
[2]: https://rigtorp.se/spinlock/

Co-authored-by: Milian Wolff <[email protected]>