
Fix | Fixes wrong data errors due to exceptions not handled in time #894

Closed
wants to merge 3 commits

Conversation

@cheenamalhotra (Member) commented Jan 30, 2021

This change fixes the wrong data issue in #659

Root Cause

Ref: Case_Study.zip (RCA explained - #659 (comment))

Attempt 3 [The Fix - Updated 04-02-2021]
The checks that were missing from the SqlCommand methods (and are added here) allowed the wrong-data flow; adding them fixes it by capturing the exception in time and throwing it immediately.

These checks were always performed in the old async APIs, e.g. InternalEndExecuteReader and InternalEndExecuteNonQuery.

Fix applies to all platforms on all supported .NET SDKs. APIs fixed:

  • SqlCommand.ExecuteNonQueryAsync
  • SqlCommand.ExecuteReaderAsync
  • SqlCommand.ExecuteXmlReaderAsync

Old Attempts

Attempt 2 [Updated 02-02-2021]
[Discarded - Breaks Timeout Flows]
The first attempted solution was discarded after further investigation: the counter adjustment only caused the connection to be cleaned out of the connection pool, while the original issue remained.

After further investigation into the trigger point of Attention packets, I looked more closely at _networkPacketTimeout, which is used in:

_networkPacketTimeout = ADP.UnsafeCreateTimer(
    new TimerCallback(OnTimeout),
    null,
    Timeout.Infinite,
    Timeout.Infinite);

When this timer fires (after its duration is updated again), it triggers an Attention signal, which is likely causing some of the issues here. Updating the duration while time is still left kicks off the timeout on another thread, independently, while the read may still be going on. If we only time out when the remaining time hits 0 (next few lines), the timeout and the read are never in a position to occur simultaneously.
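For illustration, a rough self-contained sketch (hypothetical names, not the driver's actual code) of the race described above: re-arming a System.Threading.Timer with the remaining duration lets its callback run on another thread-pool thread while the read is still in flight.

using System;
using System.Threading;

class TimerRaceSketch
{
    private static Timer _timeoutTimer;          // stands in for _networkPacketTimeout
    private static int _readInProgress;

    static void Main()
    {
        _timeoutTimer = new Timer(OnTimeout, null, Timeout.Infinite, Timeout.Infinite);

        Interlocked.Exchange(ref _readInProgress, 1);
        // Re-arming with the remaining duration: OnTimeout may now fire while the
        // simulated read below is still running, which is the hazard described above.
        _timeoutTimer.Change(dueTime: 10, period: Timeout.Infinite);

        Thread.Sleep(50);                        // simulated read that finishes late
        Interlocked.Exchange(ref _readInProgress, 0);
        _timeoutTimer.Change(Timeout.Infinite, Timeout.Infinite);
        Console.WriteLine("Read completed.");
    }

    private static void OnTimeout(object state)
    {
        if (Volatile.Read(ref _readInProgress) == 1)
        {
            Console.WriteLine("Timeout fired while the read was still in progress (attention would be sent).");
        }
    }
}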

Attempt 1 [Discarded]

The repro apps created by the community in #659 were very helpful for reproducing this issue; thanks to everyone for helping us with that! We really appreciate your support and patience!

Repro Test Results

Below is a screenshot of the repro from #659 (comment) running in multiple parallel sessions all at once with the fix, which has never been possible before. Yes, there are network errors, but those are due to the thread-pool starvation issue #422, which we'll continue to investigate.

[screenshot: repro results with the fix]

[Updated 04-02-2021]
[Attempt 3 - no wrong data observed in multiple sessions, results similar to screenshot]

Without this fix, invalid data is easily captured in a single process

[screenshot: invalid data captured without the fix]

@saurabh500 (Contributor)

@cheenamalhotra it will be hard to confirm this purely by code inspection.
Is the repro reliable, and can it be run against Windows to test this? If this doesn't repro on Windows, then is a fix above the SNI layer the right place to put the changes?
My concern is that if this is non-Windows specific and we cannot repro the behavior on Windows, then the fix may cause side effects on native SNI under load. How will you verify that?

@saurabh500 (Contributor)

Alternatively, I think we should add a check for managed SNI being enabled and not enable the fix for native SNI unless we have a confirmed repro.

@@ -2060,6 +2060,10 @@ private IAsyncResult BeginExecuteReaderInternal(CommandBehavior behavior, AsyncC

if (!inRetry)
{
    if (asyncWrite)
    {
        _activeConnection.GetOpenTdsConnection().IncrementAsyncCount();
Contributor:

On line 2088, RunExecuteReader is called, which calls RunExecuteReaderTds, which already does something similar to what you are doing here for the async cases. Why is that not working? Basically we are ref-counting the connection before the execution of the command, and that seems to be happening for the code path you are changing as well. So why do we need this extra increment here?

// If we are in retry, don't increment the Async count. This should have already been set.
if (async && !inRetry)
{
    _activeConnection.GetOpenTdsConnection().IncrementAsyncCount();
    decrementAsyncCountOnFailure = true;
}

I think the code is already structured to do what you are doing here; the question is why the async count is getting messed up with the current code.

Member Author:

We'll need traces for this area to confirm that, so we can track the connection's activity.
I will work on that next week and update the findings here.

Member Author:

I captured traces for the failing scenario. We are setting these counters, but something goes terribly wrong when an exception from a previous query timeout is thrown to the user. Take this case, for example:

The logs show:

e414c538-cfba-41a2-b26e-b7495e5dc215    82 secs  44   Success
e414c538-cfba-41a2-b26e-b7495e5dc215    82 secs  529  Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

When "44" was queried, a timeout occurred but the user still got data: the two threads handling the "HandleReceiveError" callback and the completion callback were both blocked, and the completion callback finished first, which led to the Commit being performed. But an Attention signal was also sent once "HandleReceiveError" did its job, about which the user is unaware, and the error stayed in the connection's state when it was pushed to the pool.

When "529" was queried, the driver queried the database, the results came back on time, and the async command counter was decremented as part of the normal flow, but a pending exception was also found on this connection. Because of it, the data that was fetched was not consumed and the connection was pushed back to the pool as-is. This is where the wrong-data flow begins. The async counter has no bearing on whether the exception gets handled before the operation concludes, and by the time the exception was handled, it was too late.

[screenshot: trace of the failing scenario]
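To make that failure mode concrete, here is a minimal, hypothetical sketch (not the actual SqlCommand/TdsParser types) of the pattern the fix aims for: surface any error already recorded on the connection before handing fetched data back to the caller.

using System;

class FakeConnectionState
{
    public Exception PendingError { get; set; }   // e.g. a timeout recorded earlier by a callback
    public int Data { get; set; }
}

static class EndExecuteSketch
{
    public static int EndExecute(FakeConnectionState state)
    {
        // The equivalent of the added check: throw a stored error immediately instead
        // of returning data fetched on a connection that has already failed.
        if (state.PendingError != null)
        {
            throw state.PendingError;
        }
        return state.Data;
    }

    static void Main()
    {
        var bad = new FakeConnectionState
        {
            Data = 44,
            PendingError = new TimeoutException("Execution Timeout Expired.")
        };
        try { EndExecute(bad); }
        catch (TimeoutException ex) { Console.WriteLine("Surfaced: " + ex.Message); }
    }
}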

Member Author:

So it seems that with my changes I'm introducing another counter on the same method, but it also saved the connection from going back to the pool when "529" was queried.

Contributor:

and the error stayed in the connection's state when it was pushed to the pool.

I dumped the state of both good and bad objects from the pool and, apart from ids and muxing counters, saw no difference. So where is the state while it's in the pool?

@saurabh500 (Contributor) commented Jan 30, 2021

@cheenamalhotra what is the async count in failure cases? You mentioned that it is neither incremented nor decremented without the fix, but I am trying to understand why that is the case.
Is it possible that decrement is being called incorrectly? What is the execution result with Debug builds? I see there are asserts which check the count before decrementing. With @Wraith2's help I have chased down an issue in MARS where one of the ref counters was not being managed correctly in the managed SNI execution path, and the bug fix was in managed SNI code and not upstream in SqlCommand, TdsParser, or SqlInternalConnectionTds. This issue is definitely not MARS specific, but I wanted to lay down some of the earlier bug-investigation findings.
Based on the repro results, I strongly feel that we should be trying to narrow the problem down to managed SNI, since managed SNI has had some quirks around incrementing and decrementing these counters, which in past issues we were able to see in managed-SNI-specific code.
I understand that the connection was not being doomed, but why the counter is not being managed correctly, in spite of existing code which is supposed to manage the async count at seemingly the right places, is something I am not clear about.

@saurabh500 (Contributor)

Is it possible to modify the code to print the callers of SqlInternalConnectionTds.IncrementAsyncCount() and DecrementAsyncCount(), to see if they got called and what the subsequent results were?

@saurabh500 (Contributor)

I do want to acknowledge that getting to the point where you can say the async count is messed up, and that that is what needs to be investigated, is a huge achievement, especially for a non-deterministic repro. I think we need to bridge the gap around "why", since the existing code is already structured to take care of ref-counting the asyncCount.

@Wraith2 (Contributor) commented Jan 30, 2021

I've been working on debugging this issue as well. I added some code to identify internal connection instances that were in two states: known broken, because the connection has returned a clearly incorrect result, and known good, because it has returned a known good result after the broken connection was identified. I then added code to put the spids in environment variables and dump the details of those state objects as they left the pool, immediately prior to being connected to the external command object. Apart from ids, the state was identical.

I also added checks for data remaining in the network stream and the socket buffers on return from the pool. In both cases there was none. So I know that the poisoned state retained through the pool is not observable as it exits the pool, and it wasn't observable as data either. So there is a state hole somewhere: the client state isn't accurately reflecting something about the operations, representing everything as clean when it isn't. This seems a reasonable mitigation for that to me.

The SNI layer is where I started investigating, and @cheenamalhotra noted that she was seeing SNIPackets being finalized, which pointed to there being a flaw in it somewhere. I spent a couple of weeks tracking down the recycling and packet flows. I've rewritten the MARS demuxer and debugged through a lot of paths to get all packets correctly recycled, even in the repro app, which is a massive overload scenario and has tasks that don't complete all over the place. There are definitely areas of the managed SNI layer that I can improve, but there is no clear flaw, and as noted above the stored state remains consistent between known good and bad connections.

The other big difference between the managed and native implementations is how threading is used. Managed uses the thread pool, and the reproduction causes thread-pool exhaustion along with all the non-deterministic behaviours that come with it. I strongly suspect that the native component uses IOCP internally, so it won't suffer from the same timing issues, and even if it has problems they're limited to the native layer and can only escape as error codes; in managed mode the failures can occur absolutely everywhere.

@saurabh500 (Contributor)

@Wraith2 you are right about the native implementation using IOCP and leveraging IO threads rather than application thread-pool threads. That is the biggest problem in the managed SNI implementation and causes applications to suffer under stress, where timeouts may occur in scenarios where the Windows implementation works fine.
Essentially, on Windows the requests occur on an app thread (from the thread pool) and are sent to the native layer, where a single listener thread listens for the IOCP signals and dispatches them to the appropriate managed object that initiated the async call. This makes the situation particularly bad for MARS, because all executions in sync mode also lead to app-pool threads being used for networking, instead of relying on IO threads as with IOCP on Windows. I believe the Windows networking stack in .NET Core also leverages IOCP, so the problem "may be" less prevalent on Windows.

@cheenamalhotra (Member, Author)

Is the repro reliable, and can it be run against Windows to test this? If this doesn't repro on Windows, then is a fix above the SNI layer the right place to put the changes?

Yes, the screenshots above were taken on Windows using managed SNI enabled with the AppContext switch, so this is not a platform-specific bug. The repro is also reliable, as we immediately see the invalid packets in a single run.

@cheenamalhotra what is the async count in failure cases?

The count is always 0 for corrupted connections where exceptions occur or the end is not completed.
I will work on a standalone repro next week, since we now know the full story of how the driver can be made to yield wrong results without this change; I think that will give us a clear picture.

Is it possible that decrement is being called incorrectly? What is the execution result with Debug builds? I see there are asserts which check the count before decrementing. With @Wraith2's help I have chased down an issue in MARS where one of the ref counters was not being managed correctly in the managed SNI execution path, and the bug fix was in managed SNI code and not upstream in SqlCommand, TdsParser, or SqlInternalConnectionTds. This issue is definitely not MARS specific, but I wanted to lay down some of the earlier bug-investigation findings.

This could also be the reason. The issue is equally reproducible with Debug builds, and yes, it is not MARS specific. Our main focus was always the managed SNI, MARS and TDS parser layers, as that is where we were reproducing it, which is why we never captured enough traces for the SqlCommand counters. I will also work on adding EventSource traces for this area to be absolutely sure we are making the right fix.

Is it possible to modify the code to print the callers of SqlInternalConnectionTds.IncrementAsyncCount() and DecrementAsyncCount(), to see if they got called and what the subsequent results were?
I do want to acknowledge that getting to the point where you can say the async count is messed up, and that that is what needs to be investigated, is a huge achievement, especially for a non-deterministic repro. I think we need to bridge the gap around "why", since the existing code is already structured to take care of ref-counting the asyncCount.

Yes absolutely! I will gather more details to confirm this is the right fix. This is the first change we've made in the driver that stops wrong results from happening, so there's definitely room to polish it further, and we may find something more.

@saurabh500 (Contributor)

I spent about half an hour trying to repro this, but was not able to.
@cheenamalhotra and @Wraith2, any pointers here? I am running the repro on my Windows laptop at home, connecting to a SQL Server in the office over VPN. I also tried from the laptop at home to the Azure SQL DB instance in East US, and I don't get a failure even when I set the AppContext switch to use managed SNI on Windows.
From @cheenamalhotra's screenshot, I can see that she is probably doing this on Windows with managed SNI being used. Is that correct?
Anything else I can try to get a repro?
Should I put the repro in a Windows VM with really low memory, like 4 gigs?

@saurabh500 (Contributor)

Thanks @cheenamalhotra for the response; help me with the conditions for a reliable repro.
That would be awesome. I have a love/hate relationship with "ref count" bugs :) I hate them but love to investigate them.

@cheenamalhotra (Member, Author) commented Jan 30, 2021

AllManagedSNITests.zip

Please use this repro; I don't think we need a very low-memory machine. I'm able to reproduce on a laptop with 32 GB of RAM.
[Allow 1-2 mins as all threads get blocked initially]
(Ignore other test cases :))

cc @Wraith2 (for your ease, all major test cases are here)

@saurabh500 (Contributor)

OK, I will give this a shot.

@Wraith2 (Contributor) commented Jan 30, 2021

It can take a few seconds to get an invalid result. It's highly timing dependent, and the more you add to try to debug the issue, the less likely it is to occur, because you stay on one thread for longer. I've got mine set up for 4000 threads and have set the thread-pool max to that as well, to get things rolling with high concurrency quickly.
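For reference, a rough sketch (an assumed setup, not the exact repro code) of how a harness can configure the thread pool for that level of concurrency:

using System;
using System.Threading;

class ReproThreadPoolSetup
{
    static void Main()
    {
        const int workers = 4000;
        ThreadPool.GetMaxThreads(out _, out int completionPortThreads);

        // Raise the worker-thread ceiling so thousands of tasks can run at once,
        // then raise the minimum so the pool doesn't ramp up one thread at a time.
        ThreadPool.SetMaxThreads(workers, completionPortThreads);
        ThreadPool.SetMinThreads(workers, completionPortThreads);

        ThreadPool.GetMinThreads(out int minWorkers, out _);
        Console.WriteLine("Min worker threads now: " + minWorkers);
    }
}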

@saurabh500 (Contributor)

@cheenamalhotra I see Processed: 5781 - Network errors: 115 - Missing: 0 - Invalid: 0 on console.
In what cases do I know I have a bad data scenario?

@Wraith2 (Contributor) commented Jan 30, 2021

You get an invalid count. I can push my repro if you want, but it's derived from the same codebase. Start it and let it run for a while and you should get an invalid result before it hits a debug assert, most of the time.

@saurabh500 (Contributor)

I have had the repro running for 10 minutes without an error right now, connecting to a SQL Server over a reasonably latent network connection.
Let me wait for another 10.
@Wraith2, is there a fundamental change in your repro? It wouldn't hurt to have a zip file of the one you are using as well.

@cheenamalhotra (Member, Author)

Sorry, I updated the repro as I was testing with ExecuteReader; please update the 659_2.cs class at line 96:

            using (var reader = await command.ExecuteReaderAsync())
            {
                if (await reader.ReadAsync())
                {
                    var columnIndex = reader.GetOrdinal("Id");
                    result = reader.GetInt32(columnIndex);
                }
            }

@saurabh500 (Contributor)

Got it. Trying again now.

@saurabh500 (Contributor)

And I have it !!

λ dotnet run
Waiting for SQLServer...
Smoke test passed!
Starting 1000 tasks...
Processed:    238 - Network errors:      0 - Missing:      0 - Invalid:      0
Processed:    245 - Network errors:      4 - Missing:      0 - Invalid:      0
Processed:    247 - Network errors:      6 - Missing:      0 - Invalid:      0
Processed:    249 - Network errors:      7 - Missing:      0 - Invalid:      0
Processed:    250 - Network errors:      8 - Missing:      0 - Invalid:      0
Processed:    264 - Network errors:     19 - Missing:      0 - Invalid:      0
Processed:    264 - Network errors:     19 - Missing:      0 - Invalid:      0
Processed:    265 - Network errors:     20 - Missing:      0 - Invalid:      0
Processed:   1221 - Network errors:    951 - Missing:      0 - Invalid:      0
Processed:   1995 - Network errors:   1050 - Missing:      0 - Invalid:    327
Processed:   2504 - Network errors:   1096 - Missing:      0 - Invalid:    560
Processed:   7802 - Network errors:   1503 - Missing:      0 - Invalid:   3013
Processed:  14886 - Network errors:   2050 - Missing:      0 - Invalid:   6273
Processed:  21850 - Network errors:   2549 - Missing:      0 - Invalid:   9515
Processed:  28877 - Network errors:   3084 - Missing:      0 - Invalid:  12758
Processed:  35872 - Network errors:   3594 - Missing:      0 - Invalid:  16006
Processed:  42927 - Network errors:   4074 - Missing:      0 - Invalid:  19316
Processed:  49920 - Network errors:   4633 - Missing:      0 - Invalid:  22516
Processed:  56948 - Network errors:   5201 - Missing:      0 - Invalid:  25723

@Wraith2 (Contributor) commented Jan 30, 2021

No fundamental differences, just minor rearrangements and additional output. Mine identifies incorrect results with a (spid, expected, received) entry when it finds one, and there's some code for setting environment variables which are then picked up by code inside the client library.
Program.txt

I ran it for 5 minutes and got 1 incorrect result. Not unusual; there are some small changes in #889 which generally speed it up, so you get higher throughput and a better chance of hitting the error. It's non-deterministic though, so it may just not fail for you.

@saurabh500 (Contributor)

@cheenamalhotra is there a dev doc on how to build the NuGet package and consume it in a local project for MDS?

@saurabh500 (Contributor)

Or any quick instructions to refresh my memory?

@Wraith2 (Contributor) commented Jan 30, 2021

The default build packages the NuGet. You don't need it, though, since you can add a project to the sln, add a project reference, and run the library directly.
My build bat file:

msbuild /t:clean
msbuild /p:Configuration=Release
msbuild /t:BuildTestsNetCore /p:TargetNetCoreVersion=netcoreapp3.1 /p:Configuration=Release

@cheenamalhotra (Member, Author) commented Jan 30, 2021

I also updated the Case Study.zip file above (in the description); it contains an Excel sheet with all the details you may find helpful.

@cheenamalhotra is there a dev doc on how to build the NuGet package and consume it in a local project for MDS?

Just run msbuild /p:configuration=Release and you'll get a NuGet package in the "packages" folder. Alternatively, you can open the MDS solution, include the AllManagedSNITests project in the solution, and add a project reference to the netcore\src\MDS project.

@saurabh500 (Contributor)

How do I consume the package in the repro after building it?

@cheenamalhotra cheenamalhotra marked this pull request as ready for review February 4, 2021 19:41
@cheenamalhotra cheenamalhotra added this to In Progress in SqlClient v3.0 via automation Feb 4, 2021
@cheenamalhotra cheenamalhotra added this to the 3.0.0-preview1 milestone Feb 4, 2021
@cheenamalhotra cheenamalhotra moved this from In Progress to Review In Progress in SqlClient v3.0 Feb 4, 2021
@cheenamalhotra cheenamalhotra changed the title Fix | Fixes wrong data errors due to corrupted connections not cleaning up Fix | Fixes wrong data errors due to exceptions not handled in time Feb 4, 2021
@saurabh500 (Contributor) left a comment

wooohhooo

@Wraith2 (Contributor) commented Feb 4, 2021

It's rather disappointing that the thing that took us so much effort and time to track down is fixed with a simple call to a method that already exists. On the other hand it's safer than writing lots of new code so yay!

@@ -1973,6 +1975,7 @@ private SqlDataReader EndExecuteReaderInternal(IAsyncResult asyncResult)
int? sqlExceptionNumber = null;
try
{
    CheckThrowSNIException();
Contributor:

I'd love for this to be the final fix to this issue. But how come adding a call to CheckThrowSNIException() at these locations makes a difference when the same method is already being called not much later in the flow?

From here, -> InternalEndExecuteReader (about 4 statements into it) -> CheckThrowSNIException()

I wonder if VerifyEndExecuteState is stopping flow before it gets to CheckThrowSNIException.

Member Author:

That's true, but this flow actually completes and the reader is provided to the user.
I'm looking more into why CheckThrow is swallowing that exception and not throwing it in that particular case.

Manual stepping does not reproduce the error, and not all cases that time out are bypassed, as we have good exception cases too. I'll continue scanning and tracking. It's far denser than it seems 🙈

Contributor:

In InternalEndExecuteReader, before it checks for errors, it calls WaitForAsyncResults, which resets the _networkPacketSource, which is where the OnTimeout function puts the exception that we want to find and throw. It certainly changes the behaviour in one of the downstream callees of the poorly named CheckThrowSNIException (it doesn't throw...).

If the original author of all this isn't in an asylum at the moment I expect they're currently located in a hidden base at the base of a volcano planning world domination.

@cheenamalhotra (Member, Author) commented Feb 6, 2021

Unfortunately, after more digging, it turns out this change just got lucky and didn't capture any wrong data when we tested, but the failure case is still possible. So Attempt 3 is discarded too. 😞 There is no state change between the check we added here and the existing one.

It's basically a matter of when the OnTimeout event gets a thread to work on and can send attention to the server. In our repros, the consistent behavior observed is that for connections that get corrupted later, their timeout thread starts after the data was sent to the user and the timer was reset long ago. The timeout event (the "OnTimeout" function) did not even get a thread to do its job when it timed out. So, to conclude, this case also comes down to thread-pool starvation, since timer events are scheduled on the thread pool too and we cannot prioritize them over previously scheduled work. 😞

To confirm this is NOT a managed SNI/Linux issue, I was also able to reproduce it with native SNI on Windows; I just had to blow the counter up from 4000 to 12M and wait around 30 minutes. I'm not saying this is a realistic use-case, or that it would ever occur in client applications, but it proves the theory that there is a basic driver design issue here: the driver lacks synchronization between the OnTimeout event and the query results.

[screenshot: repro with native SNI on Windows]

So we have a design problem to solve, irrespective of thread starvation, and the options we've tried so far are not going to fix it, since it doesn't seem like we can do anything from the SqlCommand layer or the main thread where the reader is being prepared. We most probably need to double-check before sending attention whether the command for which the timeout event is going to raise attention is even active anymore, or stop the timeout event that's already scheduled on the thread pool. And because we have features like MARS, it's going to be extremely challenging to come up with a better design for this scenario.

I don't know how other applications/drivers handle this case, but it seems a pretty dirty problem to solve. No generic design "expects" that a timeout event could take a further 60-ish seconds to even get started. So whatever we were doing when the timeout didn't fire could eventually complete, while the things we do when the "timeout" event does fire could cause instability and serious consequences, since it's already scheduled on the thread pool, and can we even stop it from happening?

Closing for now, we need some brainstorming here!

SqlClient v3.0 automation moved this from Review In Progress to Done Feb 6, 2021
@Wraith2 (Contributor) commented Feb 6, 2021

Is it worth considering the same approach used with statistics and some of the other timeouts? Take a timestamp, calculate the expiry time, keep that value around, and later check whether it's been exceeded. Since DateTime.UtcNow is synchronous and doesn't rely on a thread running, you can know whether you should have timed out and change state appropriately.
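A minimal sketch of that idea (hypothetical names, not driver code): record the deadline when the command starts, then check it synchronously when the data arrives, without depending on a timer callback getting a thread.

using System;
using System.Threading;

class DeadlineSketch
{
    private DateTime _deadlineUtc;

    public void StartCommand(TimeSpan timeout)
    {
        _deadlineUtc = DateTime.UtcNow + timeout;   // captured once, no timer needed
    }

    // Called inline when the async read completes, on whichever thread completed it.
    public void OnReadCompleted()
    {
        if (DateTime.UtcNow > _deadlineUtc)
        {
            // The data arrived late: treat it as a timeout even though the timer
            // callback never got a chance to run.
            throw new TimeoutException("Execution timeout elapsed before the result was observed.");
        }
    }

    static void Main()
    {
        var sketch = new DeadlineSketch();
        sketch.StartCommand(TimeSpan.FromMilliseconds(1));
        Thread.Sleep(10);                           // simulate a late-arriving result
        try { sketch.OnReadCompleted(); }
        catch (TimeoutException ex) { Console.WriteLine(ex.Message); }
    }
}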

Perhaps @roji has some insight about how this would work in the postgres driver?

@roji (Member) commented Feb 6, 2021

I've read the above but am lacking some context here :) If you can give a quick/short description, I can try to help with ideas...

Is the problem that some timer code is firing too late, at a point where it causes some sort of corruption? If so, common practice is to cancel the timer (e.g. when the command ends). For example, you can dispose a System.Threading.Timer in a way which guarantees that all callbacks have completed when DisposeAsync returns (but watch out for the different sync Dispose overloads).

Another option would be for the timer callback to check some counter before doing its thing, and to increment that counter when the command completes; this effectively does the same thing, since the callback no longer does anything after the command is over.

Hopefully I've understood more or less what's going on...
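For example, a small sketch of the disposal guarantee mentioned above (requires .NET Core 3.0+, where System.Threading.Timer implements IAsyncDisposable):

using System;
using System.Threading;
using System.Threading.Tasks;

class TimerDisposalSketch
{
    static async Task Main()
    {
        var timer = new Timer(_ => Console.WriteLine("callback ran"), null,
                              dueTime: 5, period: Timeout.Infinite);

        // ... command work happens here ...
        await Task.Delay(20);

        // The ValueTask returned by DisposeAsync completes only after any in-flight
        // callbacks have finished, so nothing can fire after this await returns.
        await timer.DisposeAsync();
        Console.WriteLine("Timer disposed; no callback can run past this point.");
    }
}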

@Wraith2 (Contributor) commented Feb 6, 2021

The problem is that the timer never fires, because the process is in a state of massive thread contention on the thread pool. So the timer is set, work continues, data arrives late, but because the timer hasn't been able to fire, the fact that it's late isn't seen, and as a result the timer ends up setting the expiry flag after we've already proceeded to return data to the user.

So essentially we can't rely on anything that requires any form of threadpool or thread scheduling in general since they may happen in a totally unreliable order.

Are there any techniques you've used or seen in the postgres driver for dealing with threadpool unreliability?

@roji (Member) commented Feb 6, 2021

OK - it seems like what I wrote above is probably relevant then. Timers - like other types of concurrent operations - indeed cannot be relied upon to fire at particular points in time; the only reliable way to handle this is either to disable the timer when the command completes - and wait until all callbacks are guaranteed to have completed (the Timer class provides this) - or to make sure that if the timer fires after that point it does nothing.

Is there any reason why one of these options isn't feasible in this case?

@cheenamalhotra (Member, Author) commented Feb 6, 2021

Hi @roji

Thanks for chiming in!

In the current design, we don't dispose the packet timer for MARS.

[Flow: PutSession > SessionPool.PutSession does not call DisposeCounters (non-MARS)]. I think this is because multiple reads are possible for MARS in the same TdsParserStateObject, as there are multiple sessions and multiple events could have been triggered from different queries. We simply reset the timer to -1 (Infinite) when done with the read, to allow callbacks to complete. So even though we could Dispose the timer with a WaitHandle and wait for it to complete, we are not able to do that here. It doesn't seem like a very clean design, and we'll have to revalidate this flow.

I was thinking about whether the timer could be owned and controlled by the SqlCommand async APIs (assuming performance does not degrade for multi-packet reads) or by a MARS session, which could wait for its timer events to complete before completing the reader and could also dispose the timer safely. We would need to redesign this part, as currently the flow from SqlCommand to TdsParserStateObject is not clearly defined and "ReadSni" is called from various places. If we have to redesign this, we could ensure the owning objects own the timer and dispose of it safely, instead of the TdsParserStateObject leaving things unhandled.

Secondly, if we were to consider not doing the work on timeout after a flag/counter is reset, I think that will again interfere with MARS, but we'll have to look at its feasibility.

@roji (Member) commented Feb 6, 2021

Good luck @cheenamalhotra... Keep in mind that there's the second alternative: rather than disposing the timer (and waiting for callbacks to complete), you could allow the timer to fire but have its callback check whether it should still perform its job (that's the counter solution I mentioned above). That would let you continue reusing the same timer without refactoring too much for ownership, while still stopping the callback from doing any damage if it runs too late.
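A minimal sketch of that counter/generation guard (hypothetical fields, not the driver's state object): the timer is reused, but its callback bails out if the command it was armed for has already completed.

using System;
using System.Threading;

class GenerationGuardSketch
{
    private static int _generation;        // incremented each time a command completes
    private static int _armedGeneration;   // the generation the timer was armed for
    private static Timer _timer;

    static void Main()
    {
        _timer = new Timer(OnTimeout, null, Timeout.Infinite, Timeout.Infinite);

        // Arm the timer for the current command.
        Volatile.Write(ref _armedGeneration, Volatile.Read(ref _generation));
        _timer.Change(dueTime: 10, period: Timeout.Infinite);

        // The command completes (possibly racing the callback): bump the generation.
        Interlocked.Increment(ref _generation);
        Console.WriteLine("Command completed.");

        Thread.Sleep(50);   // leave time for a late callback, which should now do nothing
    }

    private static void OnTimeout(object state)
    {
        if (Volatile.Read(ref _generation) != Volatile.Read(ref _armedGeneration))
        {
            // The command this timeout was armed for has already finished:
            // bail out instead of sending attention on a connection that has moved on.
            return;
        }
        Console.WriteLine("Genuine timeout: attention would be sent here.");
    }
}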

@Wraith2 (Contributor) commented Feb 6, 2021

In the current design, we don't dispose the packet timer for MARS.

It's hard to track down, but I'm fairly certain that the packet timer is associated with the TdsParserStateObject and that there is an individual instance per virtual handle, so SNIMarsHandle will have an associated state and parser object of its own. That's why there are two constructors (with a ridiculously complicated call hierarchy) for the state object: one is a standalone object and the other has a reference to a physical state object, which is the one that holds the SNIMarsConnection multiplexor. So there is one timer per queryable-connection state object.

The problem is that there's so much state in the state object, and it interacts in so many complicated ways, that I'm very wary of trying to change how any of it is used. What we really want is an async query object that we can use to track all the info about the running async query, rather than keeping it all directly on the state object.

If we keep the timer: since we know the time that has to elapse, we can do an inline time check after an async read completes, and if the timeout has elapsed, try to invoke the OnTimeout method synchronously; as roji said, we then have OnTimeout check whether it's still needed (this might already work: if we set _attentionSent it should drop out safely). The problem I see is that OnTimeout uses a delay timer with more async stuff going on.

@cheenamalhotra (Member, Author) commented Feb 7, 2021

I thought so too, but _attentionSent actually resets to false after receiving the attention acknowledgement from the server (again to support MARS), so it's not reliable and we would end up sending attention anyway.

And we cannot set it to true on query completion, so we have to think of something else in the state object and/or the MARS session handle to be able to continue reliably.

@Wraith2 (Contributor) commented Feb 7, 2021

Then let's use a new variable. One more won't cause huge problems; we just need to know where to clear it (probably in begin-execute) and then interlock access to setting it.
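Something like the following (a hypothetical flag, not actual driver state): cleared at begin-execute, then set with an interlocked exchange by whichever side, completion or timeout, gets there first.

using System;
using System.Threading;

class CompletionFlagSketch
{
    // 0 = still running, 1 = finished (either completed or timed out)
    private int _completed;

    public void BeginExecute()
    {
        Volatile.Write(ref _completed, 0);   // cleared at the start of each execution
    }

    public void OnQueryCompleted()
    {
        if (Interlocked.Exchange(ref _completed, 1) == 0)
        {
            Console.WriteLine("Completed normally; a later timeout callback must not send attention.");
        }
    }

    public void OnTimeout()
    {
        if (Interlocked.Exchange(ref _completed, 1) == 0)
        {
            Console.WriteLine("Timed out first; safe to send attention.");
        }
        // If the exchange returned 1, the query already finished: do nothing.
    }

    static void Main()
    {
        var command = new CompletionFlagSketch();
        command.BeginExecute();
        command.OnQueryCompleted();
        command.OnTimeout();   // the late timeout callback is now a no-op
    }
}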

@cheenamalhotra cheenamalhotra removed this from the 3.0.0-preview1 milestone Feb 23, 2021
@cheenamalhotra cheenamalhotra deleted the fix659 branch February 24, 2021 23:28

Successfully merging this pull request may close these issues.

Async reader may return wrong results on MacOS, Linux, WSL, Docker
5 participants