Add WaitGroup synchronization primitive #14167

straight-shoota merged 17 commits into crystal-lang:master

Conversation
This is more efficient than creating a Channel(Nil) and looping to receive N messages: we don't need a queue, only a counter, and we can avoid spurious wakeups of the main fiber and resume it only once.
Note to self: this (or rather its MT equivalent) is also known as a latch in C++ and Java.
straight-shoota left a comment
I think we'll need a bunch more tests for this. E.g. for multiple #wait calls, fibers adding more fibers, #add with negative delta or #done called before #wait. The latter two could both result in @counter < 0 which is a relevant invariant to verify.
The Go implementation has some test cases that we could take inspiration from: https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/sync/waitgroup_test.go
As food for thought: https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/sync/waitgroup.go
Go merges the counter (i32) and the number of waiters (u32) into a single atomic (u64), which is a neat idea, then uses a semaphore to suspend the goroutines. What I'm curious about is: …

I'm afraid following Go won't be possible, because their implementation takes advantage of the atomics returning the new value (e.g. add to counter => returns new counter + waiters), but Crystal relies on LLVM atomics that always return the old value, which... is often pointless 😞 For example, to support …
```crystal
# the fiber will be resumed.
#
# Can be called from different fibers.
def wait : Nil
```
I like this addition very much! It will be very useful in many cases.
One proposal, which could probably be done later as a separate improvement, is to make the `wait` method compatible with `select` to support the following snippet:

```crystal
select
when wg.wait
  puts "All fibers done"
when timeout(X.seconds)
  puts "Some fiber stuck"
end
```
Or maybe just have `wg.wait(timeout: Time::Span | Nil)`?
@bararchy We're missing a generic mechanism for timeouts... but we could abstract how it's implemented for select so that could be doable.
That doesn't mean we can't also integrate with select: we could wait on channel(s) + waitgroup(s) + timeout. Now, I'm not sure how to do that.
@ysbaddaden I think @alexkutsan's idea is better, because then we don't need to handle a timeout exception in case the timeout happens; instead we handle it in the `select` context, which seems cleaner, like how a channel works when calling `receive`, etc.
So I think my idea is less clean tbh 😅
I have a commit to support WaitGroup in select expressions.
The integration wasn't complex once I understood how SelectAction and SelectContext work, but the current implementation is very isolated to Channel (on purpose). Maybe the integration is not a good idea, but if it proves to be a good one, we might want to extract the `select` logic from Channel into the Crystal namespace.
I'll open a pull request after this one is merged, so we can have a proper discussion.
@ysbaddaden now that it's in and merged, are you planning to make the followup PR? 👁️
Co-authored-by: Johannes Müller <straightshoota@gmail.com>
Co-authored-by: Jason Frey <fryguy9@gmail.com>
Disables the stress test when interpreted, as it takes forever to complete.
This failure consistently appears in #14122 as well.
And in #14257.
I can't reproduce the "can't resuming running fiber" anymore when I disable the thread specs. I think we likely need actual support from the interpreter to start threads in the interpreted code.
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
At the cost of using a CAS loop. The performance implications depend on whether, in practice, waitgroups will be protecting "large" operations (i.e. the time each thread spends maintaining the waitgroup is not dominant, and each CAS op is very likely to succeed) or "small" operations (i.e. the thread spends most of its time maintaining the waitgroup, and each CAS op is likely to fail). My intuition is that it's the former, and the performance implications of using a CAS loop would be tiny. Apologies for bringing this up at a "late" stage in the PR; this didn't occur to me until now.
@RX14 Interesting. I'm not concerned about performance (it's still lock-free, so it's fine to me), but: …

Or am I missing something?
Thanks @RX14. I'm noticing more edge cases. For example, we could reach a negative number (raises), then a concurrent fiber could increment back to a positive number and fail to raise because the new counter is positive, also impacting the resumed waiting fibers, which may continue despite the waitgroup being left in an invalid state. I'll likely add a CAS loop, just not one saturating at zero: it will keep the negative number to detect the invalid state.
It's now impossible for `#add` to increment a negative counter back into a positive one. `#wait` now checks for a negative counter in addition to a zero counter right after grabbing the lock.
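A minimal Go sketch of such a guarded CAS loop (hypothetical helper names; not the actual Crystal code) shows the invariant: once the counter has gone negative, no increment can "heal" it back to positive, so the misuse is always detected.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// add updates the counter with a CAS loop. Once the counter has gone
// negative (a misuse), it stays negative: any further add detects the
// invalid state and raises instead of silently repairing the counter.
func add(counter *int32, delta int32) int32 {
	for {
		old := atomic.LoadInt32(counter)
		if old < 0 {
			panic("waitgroup: counter is negative (misused)")
		}
		if atomic.CompareAndSwapInt32(counter, old, old+delta) {
			return old + delta
		}
		// CAS failed: another fiber/thread won the race;
		// retry against the fresh value.
	}
}

func main() {
	counter := int32(0)
	add(&counter, 2)               // 0 -> 2
	add(&counter, -1)              // 2 -> 1
	fmt.Println(add(&counter, -1)) // 1 -> 0, prints 0
}
```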
I smoothed out some corner cases: …

I'm a bit torn about the last one: the situation can happen when the counter reached zero, enqueued waiters, and continued to increment, which is invalid. Yet there is a race condition when reusing a WaitGroup with at least 2 waiters: fiber A reuses the WaitGroup (i.e. increments) before fiber B resumes (positive counter -> raise). Oh, the race condition would also trigger with a negative counter (lower probability, but it could happen), so the problem is reusing the object before all waiters are properly resumed. Ah, the joys of writing a synchronization primitive.
My highest concern is deadlocks: any condition where the counter remains at zero but fails to resume a waiter. There are race conditions when the waitgroup is used weirdly, but they do not cause a wrong count or a deadlock, so they can fail in a better way (raising).
AFAIK it should now be impossible to fail to wake the waiting fibers or to leave the waitgroup in a confusing state: the counter saturates at a negative number and can't return to a positive number anymore; waiting fibers are always resumed (once) when the counter reaches zero or below. I can't think of any scenario where we'd end up with a deadlock. I can still think of race conditions, though: depending on when the fibers are resumed, some may return successfully (zero counter) while some may raise (negative counter), yet at least one fiber will raise (the one decrementing the counter below zero), so the error shouldn't go unnoticed.
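The "resume waiters once the counter reaches zero or below" rule can be sketched in Go (hypothetical helper names, not the Crystal implementation; `wake` is assumed to drain the waiter list, so a repeated call on an already-drained list is harmless):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// done decrements the counter and, when it reaches zero or below,
// wakes the waiters. A negative result means done was called more
// times than add, which is reported by panicking, mirroring the
// "at least one fiber will raise" behaviour described above.
func done(counter *int32, wake func()) {
	n := atomic.AddInt32(counter, -1)
	if n <= 0 {
		wake()
	}
	if n < 0 {
		panic("waitgroup: negative counter")
	}
}

func main() {
	counter := int32(2)
	woken := 0
	wake := func() { woken++ }

	done(&counter, wake) // 2 -> 1: nothing happens
	done(&counter, wake) // 1 -> 0: waiters are woken

	fmt.Println(counter, woken) // prints: 0 1
}
```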
I cannot wait to use this over some Channels that I have.
Same! Very excited for this one 🎉
This pull request has been mentioned on Crystal Forum. There might be relevant details there: https://forum.crystal-lang.org/t/crystal-and-parallelism/7716/7
This is more efficient than creating a `Channel(Nil)` and looping to receive N messages: we don't need a queue, only a counter, and we can avoid spurious wakeups of the main fiber and resume it only once.

See the documentation for examples and more details.