Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception in test infrastructure - System.InvalidOperationException: Collection was modified after the enumerator was instantiated #11063

Closed
Tracked by #1613
4creators opened this issue Sep 11, 2018 · 85 comments
Labels
area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Milestone

Comments

@4creators
Copy link
Contributor

4creators commented Sep 11, 2018

For the second time I have encountered the following error during CI OSX test run.

https://ci.dot.net/job/dotnet_coreclr/job/master/job/x64_checked_osx10.12_innerloop_tst_prtest/5550/

02:51:26   Unhandled Exception: System.InvalidOperationException: Collection was modified after the enumerator was instantiated.
02:51:26      at System.Collections.Generic.Stack`1.Enumerator.MoveNext()
02:51:26      at Xunit.Sdk.DisposalTracker.Dispose() in C:\projects\xunit\src\xunit.execution\Sdk\DisposalTracker.cs:line 26
02:51:26      at Xunit.Sdk.ExtensibilityPointFactory.Dispose() in C:\projects\xunit\src\xunit.execution\Sdk\ExtensibilityPointFactory.cs:line 54
02:51:26      at Xunit.Sdk.TestFramework.Dispose() in C:\projects\xunit\src\xunit.execution\Sdk\Frameworks\TestFramework.cs:line 49
02:51:26      at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
02:51:26   --- End of stack trace from previous location where exception was thrown ---
02:51:26      at System.Threading.ThreadPoolWorkQueue.Dispatch()
@danmoseley
Copy link
Member

@ViktorHofer does this look familiar?

@ViktorHofer
Copy link
Member

that's something that dotnet/arcade#1613 should fix.

@ViktorHofer
Copy link
Member

@echesakov any numbers how often this happens? have somebody already root-caused the issue? Seems like this lives in xunit/xunit and not in the runner.

@danmoseley
Copy link
Member

xunit/xunit#1855

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@Anipik
Copy link
Contributor

Anipik commented Nov 2, 2020

happened again in #43651

@ViktorHofer
Copy link
Member

@stephentoub
Copy link
Member

I've been hitting this reasonably frequently, e.g.
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44409-merge-8a1a8bc2383a41f093/PayloadGroup0/console.e860a3c4.log?sv=2019-07-07&se=2020-11-29T17%3A18%3A38Z&sr=c&sp=rl&sig=0mS1HKsUz%2FgaWANt6pRwMCfnGXyaOIXCnwq327pVblg%3D

Unhandled exception.   Discovering: Exceptions.ForeignThread.XUnitWrapper
System.InvalidOperationException: Collection was modified after the enumerator was instantiated.
   at System.Collections.Generic.Stack`1.Enumerator.MoveNext()
   at Xunit.Sdk.DisposalTracker.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\DisposalTracker.cs:line 26
   at Xunit.Sdk.TestFramework.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\TestFramework.cs:line 52
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__140_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
Aborted (core dumped)

@stephentoub
Copy link
Member

I see @danmosemsft already linked to the xunit issue: xunit/xunit#1855

Which is closed.

@ViktorHofer
Copy link
Member

Which is closed.

As it was fixed in xunit v3 which hasn't shipped yet.

@naricc
Copy link
Contributor

naricc commented Nov 23, 2020

I am also seeing this intermittently on Mono runtime test lanes.

@safern safern added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Nov 23, 2020
@maryamariyan
Copy link
Member

@ericstj and I were looking at the linked test failures again and found an interesting pattern. All of the clr test failures shown here were using XUnitWrapper in the logs.

e.g. seen in here:

Unhandled exception.   Discovering: readytorun.r2rdump.XUnitWrapper (method display = ClassAndMethod, method display options = None)
  Discovered:  readytorun.r2rdump.XUnitWrapper (found 1 test case)
  Starting:    readytorun.r2rdump.XUnitWrapper (parallel test collections = on, max threads = 4)
System.InvalidOperationException: Collection was modified after the enumerator was instantiated.
   at System.Collections.Generic.Stack`1.Enumerator.MoveNext()
   at Xunit.Sdk.DisposalTracker.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\DisposalTracker.cs:line 26
   at Xunit.Sdk.ExtensibilityPointFactory.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\ExtensibilityPointFactory.cs:line 53
   at Xunit.Sdk.TestFramework.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\TestFramework.cs:line 52
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__127_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

Are we doing anything unusual in that test infrastructure?

@ChadNedzlek
Copy link
Member

I think this is just a bug in the XUnit execution engine, which I don't think is under active development anymore. I don't think it's caused by anything in the actual testing.

@ericstj
Copy link
Member

ericstj commented Sep 9, 2021

@maryamariyan is actually looking at fixing that bug, and as part of that fix we'd like to describe why it's happening. Xunit has a static API that will operate on this shared stack, so something is calling that static API from multiple threads. We were going to trace a test exectuion to see why that happens from multiple threads. To find a good test to trace we started looking at the existing repros and that's when we noticed this pattern. It seemed too curious not to ignore.

We'd also like to make sure that a fix for this InvalidOperationException doesn't just result in exposing a new exception because we disposed some shared state that the other thread needs. If we can identify root cause it would be good.

@BruceForstall
Copy link
Member

BruceForstall commented Sep 9, 2021

That's how the CoreCLR tests work: a set of "XUnitWrapper" assemblies are built with one class for each test with an xunit "Fact" for the test, which gets invoked by xunit, which then spawns an execution of the test. You can see the built wrapper source code in the artifacts, e.g., artifacts\tests\coreclr\windows.x64.Checked\TestWrappers\JIT.CodeGenBringUpTests\JIT.CodeGenBringUpTests.XUnitWrapper.cs

I think the code which creates these wrappers is src\tests\run.proj

@trylek can probably explain the process more

@ericstj
Copy link
Member

ericstj commented Sep 9, 2021

built with one class for each test

I see, so perhaps this pattern is appearing because we're stressing Xunit. It's interesting that we pass many test assemblies to a single invocation. That could be the reason this happens more often with these wrapper tests.

@BruceForstall
Copy link
Member

Note that we pass -parallel collections to xunit. I think (?) that allows tests invoked by different wrapper assemblies to be run in parallel, but those in a single wrapper assembly must be serialized. I can't remember why we did that; possibly to avoid hitting this issue (as much).

@agocke
Copy link
Member

agocke commented Sep 9, 2021

I believe -parallel collections means to parallelize every XUnit collection, which is controlled by the XUnit Collection attribute that you can put on test classes. If a class isn't attributed, I think all classes are by-default different collections, so effectively every test class is executed in parallel.

@ericstj
Copy link
Member

ericstj commented Sep 9, 2021

Interesting, that jives with "stressing" xunit. Another characteristic we noticed was that the tests failing were all PayloadGroup0. At first I thought this was just a reporting quirk, but now I noticed that it's actually the work item name:

<XUnitWrapperGrouping Include="$(TestBinDir)baseservices\compilerservices\*.XUnitWrapper.dll;
$(TestBinDir)baseservices\exceptions\*.XUnitWrapper.dll;
$(TestBinDir)baseservices\multidimmarray\*.XUnitWrapper.dll;
$(TestBinDir)baseservices\TieredCompilation\*.XUnitWrapper.dll;
$(TestBinDir)baseservices\typeequivalence\*.XUnitWrapper.dll;
$(TestBinDir)baseservices\varargs\*.XUnitWrapper.dll;
$(TestBinDir)CoreMangLib\**\*.XUnitWrapper.dll;
$(TestBinDir)Exceptions\**\*.XUnitWrapper.dll;
$(TestBinDir)GC\**\*.XUnitWrapper.dll;
$(TestBinDir)hosting\**\*.XUnitWrapper.dll;
$(TestBinDir)ilasm\**\*.XUnitWrapper.dll;
$(TestBinDir)Loader\**\*.XUnitWrapper.dll;
$(TestBinDir)managed\**\*.XUnitWrapper.dll;
$(TestBinDir)readytorun\**\*.XUnitWrapper.dll;
$(TestBinDir)reflection\**\*.XUnitWrapper.dll;
$(TestBinDir)Regressions\**\*.XUnitWrapper.dll;
$(TestBinDir)tracing\**\*.XUnitWrapper.dll"
Exclude="$(TestBinDir)Loader\classloader\Loader.classloader.XUnitWrapper.dll">
<PayloadGroup>PayloadGroup0</PayloadGroup>

@ViktorHofer
Copy link
Member

@ericstj and I were looking at the linked test failures again and found an interesting pattern. All of the clr test failures shown here were using XUnitWrapper in the logs.

Even though the failure occurs in runtime tests most of the time (which use the xunitwrapper infra) there are also hits in libraries tests.

@ericstj
Copy link
Member

ericstj commented Sep 15, 2021

I'd be very interested in seeing where this hits in the libraries tests. If you have any logs please share them. I couldn't find any in the linked builds.

After looking at the call-stacks here and the tests they repro in, it's all happening due to us running multiple test assemblies at once. The TestFramework object has a 1:1 relationship with test assembly and when that object is disposed it calls the static ExtensibilityFactory.Dispose which modifies a static DisposalTracker field without a lock. These runtime tests use a lot of assemblies with very few tests, so they have a high likelihood of concurrent execution. @bradwilson mentioned that .NETCore xunit runner isn't designed to handle running multiple assemblies at once and we might be the only folks doing it due to our special runner. It might be a good idea to avoid multiple assemblies long term, as it puts us off the happy path and might cause other issues. I can see us dealing with that by either merging the assemblies (they are all generated after all) or making one call per assembly.

We'll still pursue the xunit fix, but I wanted to point out we are doing something rather special here that seems to be causing/exacerbating this problem.

@agocke
Copy link
Member

agocke commented Sep 15, 2021

@bradwilson mentioned that .NETCore xunit runner isn't designed to handle running multiple assemblies at once

I don't understand this. https://xunit.net/docs/running-tests-in-parallel seems to document the option as the default for the XUnit 1 and 2 console runners.

Never mind, I misread -- the default is collections.

@ericstj
Copy link
Member

ericstj commented Sep 15, 2021

Test classes/cases work in parallel, but multiple assemblies in parallel does not (is unknown). Apparently it's only working for us because we have a fork of the runner.

@agocke
Copy link
Member

agocke commented Sep 15, 2021

Well, assemblies is a documented option to the runner, it's just not the default. Maybe it's just less robust?

@ViktorHofer
Copy link
Member

Apparently it's only working for us because we have a fork of the runner.

Our fork is exactly the same as the upstream xunit.console runner (unsure if it still exists upstream) except for the following changes which we documented here: https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.XUnitConsoleRunner/README.md.

@ericstj
Copy link
Member

ericstj commented Sep 15, 2021

My understanding is that even the upstream runner didn't support multiple test assemblies very well and was a motivating factor for a change in direction in v3. V3's console runner only supports .NETFramework. I believe the runner was in this state because it was originally designed for .NETFramework which provided AppDomain isolation and test-specific dependency control that could allow for isolation of runner:test and test:test dependencies. We didn't have parity for that in .NETCore. Now we have some support with ALC functionality but I suspect it's not equivalent and I believe the decision was already made to change direction.

Maybe it's just less robust?

Yes, though the 2.0 runner supports that commandline option, we might be the only ones using it, and likely are the only ones stressing it to the extent we do in dotnet/runtime. So if we could refactor to single-assembly per invocation we'd be more likely to be on the happy path and avoid these sort of bugs.

@bradwilson
Copy link

bradwilson commented Sep 15, 2021

The issue (in v2) is this:

We support running multiple assemblies in parallel, but the only runners that officially support this are the .NET Framework runners (our console and MSBuild runners). There is app domain isolation available there that helps prevent multiple test assemblies from stepping on each other, though there are edge cases inside .NET Framework itself that can still confound (like the fact that "current directory" is a process-wide setting, so calling Directory.SetCurrentDirectory affects all app domains in the process; by and large, app domain isolation does all the isolation heavy lifting).

Most teams--I want to say "everybody except the .NET team" but I don't have the data to back that up--who run .NET Core unit tests use the third party dotnet test functionality (or something that consumes that same plugin, like Visual Studio and VS Code). Whether those things support running multiple assemblies in parallel is up to them, and how they deal with isolation is up to them (the only reasonable option they'd have is process isolation, I assume). We don't provide that functionality because it's outside the scope of the VSTest plugin API.

What the .NET team appears to be doing is using a forked version of a deprecated runner that was a .NET Core version of our console runner (originally designed for the now removed dotnet xunit functionality), which is attempting to run multiple assemblies in the same address space. .NET Core does not have app domain isolation (or any other isolation, for that matter) that can help prevent the kinds of collisions that you've hit. This is why the bug your team is hitting is only being hit by you: because you're the only people trying to write your own runner in this way (or if anybody else is doing so, they've fixed or ignored the problem in their own way without raising it with xUnit.net).

As for v3, the entire design of test projects has changed from "compiles to library" to "compiles to application". The issues with isolation and dependency resolution in .NET Core were a big part of the forcing function. In v3, our supported console runner will allow you to run .NET Core tests in addition to .NET Framework tests (and even parallelize .NET Framework and .NET Core tests together, as well as having significantly better support for varying versions of .NET Framework or .NET Core, allow running 32- and 64-bit tests in parallel, etc.).

@ViktorHofer
Copy link
Member

According to runfo this isn't happening anymore in main or release/6.0: https://runfo.azurewebsites.net/tracking/issue/111/

As the fix was merged into release/6.0 but not release/6.0-rc2 there are still a few hits for the RC2 branch but those will just naturally go away when RC2 doesn't produce any builds anymore.

Closing.

@stephentoub
Copy link
Member

Woohoo

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Projects
None yet
Development

No branches or pull requests