
Use an AtomicPtr for PulseStream's drain_timer #72

Merged 1 commit on Oct 15, 2021

Conversation

@glandium (Contributor) commented Oct 15, 2021

There is a race condition between `drained_cb` and `PulseStream::stop` that happens reliably on Firefox CI with rust 1.56 beta (LLVM 13) and PGO instrumentation. Here's how it goes:

- In the Firefox AudioIPC Server RPC thread, `PulseStream::stop` is called.
- `PulseStream::stop` enters the loop waiting for drain, and blocks on `mainloop.wait`.
- Later, some other thread calls `drained_cb`, which resets `drain_timer` and signals the mainloop.
- Back in the AudioIPC Server RPC thread, `mainloop.wait` returns, looping back to the test for `drain_timer`... which this thread doesn't know has been updated, so it blocks on `mainloop.wait` again.
@glandium (Contributor, Author)

Note that the real problem here is that there are two live mutable references to the `PulseStream`: the one from `self` in `PulseStream::stop`, and the one created in the `unsafe` block in `drained_cb`. Technically speaking, there's still room for other race conditions with anything that mutates the `PulseStream`, like `state_change_callback`.

@kinetiknz (Contributor)

Nice find, thanks for fixing this!

@kinetiknz kinetiknz merged commit 9695281 into mozilla:dev Oct 15, 2021
@kotarou3

Why does `drain_timer` need to be atomic, when it's already protected by the `stm.context.mainloop` mutex? Are you sure there was a race condition in the first place, or is this code change just masking/working around some deeper underlying problem (e.g., a compiler bug)?

Both `PulseStream::stop()` and `drained_cb()`¹ have the mutex locked before accessing `drain_timer`, and the code uses the standard condition-variable pattern for waiting on a shared variable. The mutex locking and unlocking already provides the necessary memory barriers to ensure `drain_timer` accesses are well ordered.

1. `drained_cb()` is called from `pa_mainloop_iterate()` > `pa_mainloop_dispatch()`, and `pa_mainloop_iterate()` locks the mutex in `pa_mainloop_poll()` > `thread-mainloop.c:poll_func()`

@glandium (Contributor, Author)

> is this code change just masking/working around some deeper underlying problem

The deeper underlying problem is outlined in #72 (comment): it allows the compiler to assume that `drain_timer` doesn't change in the loop. This is not a miscompilation; it's a Rust code safety violation. The atomic papers over it.

@kotarou3

You're right, I wasn't aware that Rust makes that assumption. I confirmed in the disassembly that without this change, `stop()` just loops on `stm.context.mainloop.wait()` forever without ever checking `drain_timer` again:

```
/firefox-94.0/third_party/rust/cubeb-pulse/src/backend/stream.rs:
640	            while !self.drain_timer.is_null() {
   0x0000000007191796 <+598>:	cmpq   $0x0,0x40(%r14)
   0x000000000719179b <+603>:	je     0x71917b7 <_ZN94_$LT$cubeb_pulse..backend..stream..PulseStream$u20$as$u20$cubeb_backend..traits..StreamOps$GT$4stop17h888579641f57ae07E+631>
   0x000000000719179d <+605>:	lea    0x979c(%rip),%rbx        # 0x719af40 <_ZN5pulse17threaded_mainloop16ThreadedMainloop4wait17hf5e4f1c3711458a0E>
   0x00000000071917a4 <+612>:	nopw   %cs:0x0(%rax,%rax,1)
   0x00000000071917ae <+622>:	xchg   %ax,%ax

641	                self.context.mainloop.wait();
   0x00000000071917b0 <+624>:	mov    %r13,%rdi
   0x00000000071917b3 <+627>:	call   *%rbx

640	            while !self.drain_timer.is_null() {
   0x00000000071917b5 <+629>:	jmp    0x71917b0 <_ZN94_$LT$cubeb_pulse..backend..stream..PulseStream$u20$as$u20$cubeb_backend..traits..StreamOps$GT$4stop17h888579641f57ae07E+624>
```

@paulmenzel

How did you create the disassembly?

@kotarou3

1. Build Firefox with the `--enable-debug` configure or mozconfig option
2. Open `libxul.so` with gdb
3. `disas /s _ZN94_$LT$cubeb_pulse..backend..stream..PulseStream$u20$as$u20$cubeb_backend..traits..StreamOps$GT$4stop17h888579641f57ae07E`
