
Conversation

@jcaspes (Contributor) commented Oct 6, 2025

Reading the PhysicalBridge code, I see that the physical connection is held in a volatile class member so that it is "protected".
I think this was done to secure concurrent access to the physical connection via the volatile keyword, but it's not really safe.
This PR tries to move the physical connection into a thread-safe accessor class so that no conflicts can happen.

Some details:

Seeing this kind of unsafe code:
[screenshot]
or another example:
[screenshot]
and more examples where we only test the member for null and then work on it... but it can be set to null and/or shut down / disposed at any time by another thread.

And reading the Microsoft documentation for the volatile keyword here: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/volatile?devlangs=csharp
[screenshot]
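
For readers without the screenshots, here is a minimal sketch of the pattern being described, with hypothetical Connection/Bridge types standing in for PhysicalConnection/PhysicalBridge (an illustration, not the actual StackExchange.Redis code): volatile guarantees visibility and ordering of reads and writes of the reference itself, but it does not make a check-then-use sequence atomic.

```csharp
// Illustrative sketch only -- hypothetical types, not the actual PhysicalBridge code.
using System;

class Connection : IDisposable
{
    private bool _disposed;

    public void Write(string message)
    {
        if (_disposed) throw new ObjectDisposedException(nameof(Connection));
        // ... write to the underlying socket ...
    }

    public void Dispose() => _disposed = true;
}

class Bridge
{
    // volatile: the latest value of the reference is always observed,
    // but the check-then-use sequence below is still not atomic.
    private volatile Connection physical;

    public void OnConnected(Connection newConnection) => physical = newConnection;

    public void TryWrite(string message)
    {
        if (physical != null)           // thread A sees a non-null connection...
        {
            // ...but thread B may run OnDisconnected() right here...
            physical.Write(message);    // ...so this can throw (field reset to null) or write to a disposed connection.
        }
    }

    public void OnDisconnected()
    {
        physical?.Dispose();
        physical = null;                // the write is immediately visible, but not synchronized with TryWrite
    }
}
```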

This code protection + HighIntegrity mode seems to fix the two issues: #2919 and #2920.

@mgravell (Collaborator) commented Oct 6, 2025

I'm going to have to take a careful look here. Ultimately, we're not changing this value randomly - it is set once, and under lock; we could probably just remove the volatile entirely. I will need to read this one with much coffee to see what is going on.

@jcaspes (Contributor, Author) commented Oct 6, 2025

> I'm going to have to take a careful look here. Ultimately, we're not changing this value randomly - it is set once, and under lock; we could probably just remove the volatile entirely. I will need to read this one with much coffee to see what is going on.

This PR is more a proof of concept, to check whether something is wrong with the physical connection management that could cause the issues during OOM.

It seems that this change avoids the issues, but I do not know why :D

During the high memory pressure test, I've added some traces in the methods that create/dispose the physical connection, and they are called multiple times in some cases:

[screenshot]

We can imagine that one thread is writing to it while another disposes it... then the thread that was writing gets the wrong response from the new connection where other threads are posting messages...

@jcaspes (Contributor, Author) commented Oct 6, 2025

Another example, in the bridge's TryEnqueue method:
[screenshot]

physical is tested for null before the messages loop, but during the loop physical can be set to null or disposed.

The same applies in the two TryWriteSync methods.
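
A hedged illustration of that window, written as a method on the hypothetical Bridge sketch from the first comment (not the real TryEnqueue / TryWriteSync code):

```csharp
// Hypothetical method on the Bridge sketch above -- not the real TryEnqueue.
public bool TryEnqueue(System.Collections.Generic.IEnumerable<string> messages)
{
    var conn = physical;             // snapshot the volatile field into a local
    if (conn == null) return false;  // the null check happens once, before the loop

    foreach (var msg in messages)
    {
        // Between iterations another thread can dispose the connection or reset the
        // field. The local snapshot avoids a NullReferenceException from the field
        // going null, but it does not prevent writing to a disposed connection.
        conn.Write(msg);
    }
    return true;
}
```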

@NickCraver (Collaborator) commented:

Just noting here that we can't take this route - it adds a ton of locking in a critical path :)

@jcaspes (Contributor, Author) commented Oct 7, 2025

> Just noting here that we can't take this route - it adds a ton of locking in a critical path :)

Yes, sure. This PR is a proof of concept to point out the problem with the volatile physical connection class member (volatile PhysicalConnection physical), which is not protected and so is easily corrupted when it is disposed / shut down... and my tests prove it.

In some places I've locked a big portion of code to test this quickly, but we can certainly lock smaller portions more precisely. For example, in TryEnqueue I've locked the full loop over messages; we could instead lock only inside the loop and check for each message whether the connection is available, as in the sketch below...
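
A rough sketch of that finer-grained scope, again as a hypothetical method on the Bridge sketch (with an added _sync object for illustration, not the PR's actual guard):

```csharp
// Illustrative only: lock per message instead of around the whole loop.
private readonly object _sync = new object();

public bool TryEnqueueFineGrained(System.Collections.Generic.IEnumerable<string> messages)
{
    foreach (var msg in messages)
    {
        lock (_sync)
        {
            var conn = physical;            // re-check availability on every iteration
            if (conn == null) return false;
            conn.Write(msg);
        }
    }
    return true;
}
```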

@mgravell (Collaborator) commented Oct 7, 2025

"so easily corrupted"

That's what I'd want to focus in on here; until we have a viable description of a code-path that somehow corrupts/confuses this value, this is just arbitrary code addition.

@jcaspes (Contributor, Author) commented Oct 7, 2025

> "so easily corrupted"
>
> That's what I'd want to focus in on here; until we have a viable description of a code-path that somehow corrupts/confuses this value, this is just arbitrary code addition.

I'll try to find this

@jcaspes (Contributor, Author) commented Oct 8, 2025

The last commit lowers the amount of locking.

I've improved the locking by only using a write lock (blocking) when modifying the physical member: Dispose(), Shutdown(), setting it to null, etc.

A read lock does not block other threads taking read locks; only a call taking the write lock has to wait for all read locks to be released. So during a normal run (no connection timeout/recreation) only read locks are used, so theoretically there is no overhead and no "ton of locking in a critical path" ;-)
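
A minimal sketch of that read/write split, assuming a ReaderWriterLockSlim-style guard around the hypothetical Connection type from the first sketch (an illustration of the idea, not the PR's actual accessor class):

```csharp
// Illustration of the read/write split described above -- hypothetical names, not the PR's code.
using System.Threading;

class GuardedConnection
{
    private readonly ReaderWriterLockSlim _gate = new ReaderWriterLockSlim();
    private Connection _physical;    // the hypothetical Connection type from the first sketch

    // Hot path: any number of threads can hold the read lock at the same time,
    // so normal writes do not block each other.
    public bool TryWrite(string message)
    {
        _gate.EnterReadLock();
        try
        {
            if (_physical == null) return false;
            _physical.Write(message);
            return true;
        }
        finally { _gate.ExitReadLock(); }
    }

    // Reconnect/teardown path: the write lock waits for all readers to drain,
    // so no reader can observe a half-disposed connection.
    public void Replace(Connection newConnection)
    {
        _gate.EnterWriteLock();
        try
        {
            _physical?.Dispose();
            _physical = newConnection;
        }
        finally { _gate.ExitWriteLock(); }
    }
}
```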

I've done a quick benchmark with 200 threads on localhost using only HGETs. I cannot see any performance degradation so far.

[screenshot]

Now to try to find why these locks make the issue disappear.

@jcaspes (Contributor, Author) commented Oct 8, 2025

A first attempt to find the code path seems to point at connection management.
Using the code from the main branch, I've added some dirty traces wherever we touch the physical member, for example:
[screenshot]

Relaunching my memory test, so far each time I reproduce the issue we have traces indicating that the connection is recreated, and sometimes multiple recreations happen.

[screenshot]

Another example:

[screenshot]

I will dig in more tomorrow to see if I can add more information to the logs.

@mgravell Can you explain how we can activate traces / debug traces for StackExchange.Redis and where to see them? I've tried adding the VERBOSE build flag, but I don't see any traces :-(

@jcaspes (Contributor, Author) commented Oct 10, 2025

Hey,
I've added debug info so I can see the connection id used by Message instances, to check whether the leaking cases match a connection that is being "killed".

Conclusion: yes, each time I reproduce the issue, the connection being used is the one that is being killed. Here are some example screenshots:

[screenshots: reproduce_setToNull, reproduce_setToNull2, reproduce_setToNull3]

Locking the physical connection seems necessary when a reconnection occurs (here due to OutOfMemory exceptions).

@NickCraver (Collaborator) commented:

@jcaspes I appreciate the notion to benchmark as that is the right call, but I want to be up front: those call numbers are extremely small. We have folks operating at 1000x that load (where locking matters way more), and this adds performance overhead for sure. I'm not saying we don't have something to improve here, but adding an expensive lock around every call is not going to be the answer :)
