
Conversation

@jcaspes (Contributor) commented Oct 6, 2025

Reading the PhysicalBridge code, I see that the physical connection is held in a volatile class member so that it is "protected".
I think this was done to secure concurrent access to the physical connection via the volatile keyword, but it's not really safe.
This PR tries to move the physical connection into a thread-safe accessor class so that no conflicts can happen.

Some details:

Seeing this kind of unsafe code:
[screenshot]
or another example:
[screenshot]
and more examples where we only test the member for null and then work on it... but it can be set to null and/or shut down / disposed at any time by another thread.

And reading the Microsoft documentation for the volatile keyword here: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/volatile?devlangs=csharp
[screenshot]
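
For readers without the screenshots, here is a minimal sketch of the pattern being described, with hypothetical Connection/Bridge types standing in for PhysicalConnection/PhysicalBridge (an illustration, not the actual StackExchange.Redis code): volatile guarantees visibility and ordering of reads and writes of the reference itself, but it does not make a check-then-use sequence atomic.

```csharp
// Illustrative sketch only -- hypothetical types, not the actual PhysicalBridge code.
using System;

class Connection : IDisposable
{
    private bool _disposed;

    public void Write(string message)
    {
        if (_disposed) throw new ObjectDisposedException(nameof(Connection));
        // ... write to the underlying socket ...
    }

    public void Dispose() => _disposed = true;
}

class Bridge
{
    // volatile: the latest value of the reference is always observed,
    // but the check-then-use sequence below is still not atomic.
    private volatile Connection physical;

    public void OnConnected(Connection newConnection) => physical = newConnection;

    public void TryWrite(string message)
    {
        if (physical != null)           // thread A sees a non-null connection...
        {
            // ...but thread B may run OnDisconnected() right here...
            physical.Write(message);    // ...so this can throw (field reset to null) or write to a disposed connection.
        }
    }

    public void OnDisconnected()
    {
        physical?.Dispose();
        physical = null;                // the write is immediately visible, but not synchronized with TryWrite
    }
}
```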

This code protection + HighIntegrity mode seems to fix the two issues: #2919 and #2920.

@mgravell (Collaborator) commented Oct 6, 2025

I'm going to have to take a careful look here. Ultimately, we're not changing this value randomly - it is set once, and under lock; we could probably just remove the volatile entirely. I will need to read this one with much coffee to see what is going on.

@jcaspes (Contributor, Author) commented Oct 6, 2025

> I'm going to have to take a careful look here. Ultimately, we're not changing this value randomly - it is set once, and under lock; we could probably just remove the volatile entirely. I will need to read this one with much coffee to see what is going on.

This PR is more a proof of concept, to check whether something is wrong with the physical connection management that could cause the issues during OOM.

It seems that this change avoids the issues, but I do not know why :D

During the high memory pressure test, I've added some traces in the methods that create/dispose the physical connection, and they are called multiple times in some cases:

[screenshot]

We can imagine that one thread is writing to it while another disposes it... then the thread that was writing gets the wrong response from the new connection where other threads are posting messages...

@jcaspes (Contributor, Author) commented Oct 6, 2025

Another example, in the bridge's TryEnqueue method:
[screenshot]

physical is tested for null before the messages loop, but during the loop physical can be set to null or disposed.

The same applies in the two TryWriteSync methods.
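
A hedged illustration of that window, written as a method on the hypothetical Bridge sketch from the first comment (not the real TryEnqueue / TryWriteSync code):

```csharp
// Hypothetical method on the Bridge sketch above -- not the real TryEnqueue.
public bool TryEnqueue(System.Collections.Generic.IEnumerable<string> messages)
{
    var conn = physical;             // snapshot the volatile field into a local
    if (conn == null) return false;  // the null check happens once, before the loop

    foreach (var msg in messages)
    {
        // Between iterations another thread can dispose the connection or reset the
        // field. The local snapshot avoids a NullReferenceException from the field
        // going null, but it does not prevent writing to a disposed connection.
        conn.Write(msg);
    }
    return true;
}
```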

@NickCraver (Collaborator) commented:

Just noting here that we can't take this route - it adds a ton of locking in a critical path :)

@jcaspes (Contributor, Author) commented Oct 7, 2025

> Just noting here that we can't take this route - it adds a ton of locking in a critical path :)

Yes, sure. This PR is a proof of concept to point out the problem with the volatile physical connection class member (volatile PhysicalConnection physical), which is not protected and so is easily corrupted when it is disposed / shut down... and my tests prove it.

In some places I've locked a big portion of code to test this quickly, but we can certainly lock smaller portions more precisely. For example, in TryEnqueue I've locked the full loop over messages; we could instead lock only inside the loop and check for each message whether the connection is available, as in the sketch below...
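
A rough sketch of that finer-grained scope, again as a hypothetical method on the Bridge sketch (with an added _sync object for illustration, not the PR's actual guard):

```csharp
// Illustrative only: lock per message instead of around the whole loop.
private readonly object _sync = new object();

public bool TryEnqueueFineGrained(System.Collections.Generic.IEnumerable<string> messages)
{
    foreach (var msg in messages)
    {
        lock (_sync)
        {
            var conn = physical;            // re-check availability on every iteration
            if (conn == null) return false;
            conn.Write(msg);
        }
    }
    return true;
}
```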

@mgravell (Collaborator) commented Oct 7, 2025

"so easily corrupted"

That's what I'd want to focus in on here; until we have a viable description of a code-path that somehow corrupts/confuses this value, this is just arbitrary code addition.

@jcaspes (Contributor, Author) commented Oct 7, 2025

> "so easily corrupted"
>
> That's what I'd want to focus in on here; until we have a viable description of a code-path that somehow corrupts/confuses this value, this is just arbitrary code addition.

I'll try to find this

@jcaspes (Contributor, Author) commented Oct 8, 2025

The last commit lowers the amount of locking.

I've improved the locking by only using a write lock (blocking) when modifying the physical member: Dispose(), Shutdown(), setting it to null, etc.

A read lock does not block other threads taking read locks; only a call taking the write lock has to wait for all read locks to be released. So during a normal run (no connection timeout/recreation) only read locks are used, so theoretically there is no overhead and no "ton of locking in a critical path" ;-)
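
A minimal sketch of that read/write split, assuming a ReaderWriterLockSlim-style guard around the hypothetical Connection type from the first sketch (an illustration of the idea, not the PR's actual accessor class):

```csharp
// Illustration of the read/write split described above -- hypothetical names, not the PR's code.
using System.Threading;

class GuardedConnection
{
    private readonly ReaderWriterLockSlim _gate = new ReaderWriterLockSlim();
    private Connection _physical;    // the hypothetical Connection type from the first sketch

    // Hot path: any number of threads can hold the read lock at the same time,
    // so normal writes do not block each other.
    public bool TryWrite(string message)
    {
        _gate.EnterReadLock();
        try
        {
            if (_physical == null) return false;
            _physical.Write(message);
            return true;
        }
        finally { _gate.ExitReadLock(); }
    }

    // Reconnect/teardown path: the write lock waits for all readers to drain,
    // so no reader can observe a half-disposed connection.
    public void Replace(Connection newConnection)
    {
        _gate.EnterWriteLock();
        try
        {
            _physical?.Dispose();
            _physical = newConnection;
        }
        finally { _gate.ExitWriteLock(); }
    }
}
```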

I've done a quick benchmark with 200 threads on localhost using only HGETs. I cannot see any performance degradation so far.

[screenshot]

Now to try to find why these locks make the issue disappear.

@jcaspes (Contributor, Author) commented Oct 8, 2025

A first attempt to find the code path seems to point at connection management.
Using the code from the main branch, I've added some dirty traces wherever we touch the physical member, for example:
[screenshot]

Relaunching my memory test, so far each time I reproduce the issue we have traces indicating that the connection is recreated, and sometimes multiple recreations happen.

[screenshot]

Another example:

[screenshot]

I will dig in more tomorrow to see if I can add more information to the logs.

@mgravell Can you explain how we can activate traces / debug traces for StackExchange.Redis and where to see them? I've tried adding the VERBOSE build flag, but I don't see any traces :-(

@jcaspes (Contributor, Author) commented Oct 10, 2025

Hey,
I've added debug info so I can see the connection id used by Message instances, to check whether the leaking cases match a connection that is being "killed".

Conclusion: yes, each time I reproduce the issue, the connection being used is the one that is being killed. Here are some example screenshots:

[screenshots: reproduce_setToNull, reproduce_setToNull2, reproduce_setToNull3]

Locking the physical connection seems necessary when a reconnection occurs (here due to OutOfMemory exceptions).

@NickCraver (Collaborator) commented:

@jcaspes I appreciate the notion to benchmark as that is the right call, but I want to be up front: those call numbers are extremely small. We have folks operating at 1000x that load (where locking matters way more), and this adds performance overhead for sure. I'm not saying we don't have something to improve here, but adding an expensive lock around every call is not going to be the answer :)
