Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EvaluateCacheFile is causing restore crashes on Linux #9199

Closed
jaredpar opened this issue Feb 20, 2020 · 19 comments · Fixed by NuGet/NuGet.Client#3264
Closed

EvaluateCacheFile is causing restore crashes on Linux #9199

jaredpar opened this issue Feb 20, 2020 · 19 comments · Fixed by NuGet/NuGet.Client#3264
Assignees
Labels
Functionality:Restore Platform:Xplat Product:MSBuildSDKResolver The NuGet powered SDK resolver. Owned by MSBuild team RegressionDuringThisVersion A regression which broke since last RTM, and was fixed before the next RTM. Style:PackageReference Tenet:Reliability Crashes, hangs etc. Type:Bug

Comments

@jaredpar
Copy link

jaredpar commented Feb 20, 2020

Details about Problem

Restore during build using dotnet msbuild is crashing with an unhandled exception on Unix operating systems. The problem appears to be a simple unhandled exception.

/__w/1/s/.dotnet/sdk/5.0.100-preview.1.20112.7/MSBuild.dll /nologo -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/1/s/artifacts/log/Release/ToolsetRestore.binlog /clp:Summary /clp:ErrorsOnly;NoSummary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:__ToolsetLocationOutputFile=/__w/1/s/artifacts/toolset/5.0.0-beta.20117.3.txt /t:__WriteToolsetLocation /warnaserror /__w/1/s/artifacts/toolset/restore.proj
Clearing NuGet HTTP cache: /home/helixbot_azpcontainer/.local/share/NuGet/v3-cache
Local resources cleared.
/__w/1/s/.dotnet/sdk/5.0.100-preview.1.20112.7/MSBuild.dll /nologo -logger:/__w/1/s/.packages/microsoft.dotnet.arcade.sdk/5.0.0-beta.20117.3/tools/netcoreapp2.1/Microsoft.DotNet.Arcade.Sdk.dll -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/1/s/artifacts/log/Release/Build.binlog /clp:Summary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:Configuration=Release /p:RepoRoot=/__w/1/s /p:Restore=true /p:Build=true /p:Rebuild=false /p:Test=false /p:Pack=false /p:IntegrationTest=false /p:PerformanceTest=false /p:Sign=false /p:Publish=false /p:SubsetCategory=mono /p:ConfigurationGroup=Release /p:ArchGroup=x64 /p:TargetArchitecture=x64 /p:CMakeArgs= /p:MonoEnableLLVM=false /warnaserror /__w/1/s/.packages/microsoft.dotnet.arcade.sdk/5.0.0-beta.20117.3/tools/Build.proj
  Restore completed in 16.48 sec for /__w/1/s/tools-local/tasks/installer.tasks/installer.tasks.csproj.
  installer.tasks -> /__w/1/s/artifacts/bin/installer.tasks/Debug/netstandard2.0/installer.tasks.dll
  Tool 'coverlet.console' (version '1.7.0') was restored. Available commands: coverlet
  Tool 'dotnet-reportgenerator-globaltool' (version '4.4.2') was restored. Available commands: reportgenerator
  Tools restore was successful.
  Restore completed in 5.11 sec for /__w/1/s/.packages/microsoft.dotnet.arcade.sdk/5.0.0-beta.20117.3/tools/Tools.proj.
  Restore completed in 11.44 ms for /__w/1/s/src/mono/netcore/System.Private.CoreLib/System.Private.CoreLib.csproj.
/__w/1/s/src/mono/netcore/nuget/packages.builds : error : System.AggregateException: One or more errors occurred. (Could not find a part of the path '/tmp/NuGetScratch/fddf3c7a91d54497a7b711c945fe60d3/b8178b951e2f4151a1758acd7ba7326f.proj.nuget.dgspec.json'.)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :  ---> System.IO.DirectoryNotFoundException: Could not find a part of the path '/tmp/NuGetScratch/fddf3c7a91d54497a7b711c945fe60d3/b8178b951e2f4151a1758acd7ba7326f.proj.nuget.dgspec.json'.
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at System.IO.FileStream..ctor(String path, FileMode mode)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.ProjectModel.DependencyGraphSpec.Save(String path)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.NoOpRestoreUtilities.PersistDGSpecFile(DependencyGraphSpec spec, String dgPath, ILogger log)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.RestoreCommand.EvaluateCacheFile()
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.RestoreCommand.ExecuteAsync(CancellationToken token)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.RestoreRunner.ExecuteAsync(RestoreSummaryRequest summaryRequest, CancellationToken token)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.RestoreRunner.CompleteTaskAsync(List`1 restoreTasks)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at NuGet.Commands.RestoreRunner.RunWithoutCommit(IEnumerable`1 restoreRequests, RestoreArgs restoreContext)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    --- End of inner exception stack trace ---
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
/__w/1/s/src/mono/netcore/nuget/packages.builds : error :    at Microsoft.Build.NuGetSdkResolver.NuGetSdkResolver.NuGetAbstraction.GetSdkResult(SdkReference sdk, Object nuGetVersion, SdkResolverContext context, SdkResultFactory factory)
##[error]src/mono/netcore/nuget/packages.builds(0,0): error : System.AggregateException: One or more errors occurred. (Could not find a part of the pa

Builds where this has occurred

Build Kind Timeline Record
534074 PR dotnet/runtime#32736 Restore
533781 PR dotnet/runtime#32716 Send tests to Helix
533729 PR dotnet/runtime#32697 Build product
533526 PR dotnet/runtime#32538 Build managed test components
533303 PR dotnet/runtime#32595 Build product
532918 Rolling Send to Helix
532832 Rolling Build product
531845 PR dotnet/runtime#32612 Mono LLVM Product Build Linux x64 debug
531845 PR dotnet/runtime#32612 Build product
531649 Rolling Build native test components
531649 Rolling CoreCLR Product Build OSX x64 checked
531548 PR dotnet/runtime#32603 Build product
531548 PR dotnet/runtime#32603 Mono LLVM Product Build OSX x64 release
531315 Rolling Libraries Test Build OSX x64 Release
530973 PR dotnet/runtime#32587 Libraries Build OSX x64 Debug
530946 PR dotnet/runtime#32636 Mono LLVM Product Build Linux x64 debug
530946 PR dotnet/runtime#32636 Build product
530196 PR dotnet/runtime#32609 Build product
530196 PR dotnet/runtime#32609 Mono Product Build Linux arm64 debug
529996 PR dotnet/runtime#32149 CoreCLR Product Build OSX x64 checked
529996 PR dotnet/runtime#32149 Build product
529965 PR dotnet/runtime#32612 Mono Product Build Linux x64 release
529965 PR dotnet/runtime#32612 Build product

This is starting to become a bit of a blocking issue for the dotnet/runtime repository. Started showing up yesterday and now I'm seeing it across a number of builds.

The runs do have binlogs available from the restores but they don't really seem to have helpful information in them. Example

https://dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_apis/build/builds/529965/artifacts?artifactName=BuildLogs_Mono_Linux_x64_release&fileId=50260E1C6BB8177752CB6690A24F7D636CC8A076F6FB6CC6FAB6F5DB3AED4BC502&fileName=Build.binlog&api-version=5.0-preview.3

@jaredpar
Copy link
Author

Updated to include more of the builds / jobs where this failed.

@JohnTortugo
Copy link

@rrelyea @dtivel @nkolev92 - Can you please chime in here? This is affecting a bunch of repos already and has the potential to become even more critical.

@dtivel
Copy link
Contributor

dtivel commented Feb 21, 2020

@nkolev92, this looks similar to #7341. Any ideas?

@nkolev92
Copy link
Member

nkolev92 commented Feb 21, 2020

Are all the failures coming from the SDK resolver?
If so, I have some ideas that I can look into.

Might be related to #8793
Problem probably here: https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/Microsoft.Build.NuGetSdkResolver/RestoreRunnerEx.cs#L48

@JohnTortugo
Copy link

@nkolev92 I think all builds listed by @jaredpar failed with the same error. Let me know if you need anything from our side to drill down this.

@nkolev92
Copy link
Member

I clicked on a couple of random ones and all of them had resolver stack traces.
Not sure 100% what exactly the sequence is that leads to these failures, but I'm pretty confident I know the general area where the issue is.

I'll be handling priority issues next week (conveniently), so I'll take an initial stab at it.

@nkolev92 nkolev92 added Product:MSBuildSDKResolver The NuGet powered SDK resolver. Owned by MSBuild team Type:Bug labels Feb 21, 2020
@nkolev92
Copy link
Member

fyi @jeffkl

@nkolev92 nkolev92 self-assigned this Feb 21, 2020
@nkolev92 nkolev92 added this to the Sprint 166 - 2020.02.10 milestone Feb 21, 2020
@jeffkl
Copy link
Contributor

jeffkl commented Feb 21, 2020

That seems weird, the directory should exist. I suppose DependencyGraphSpec.Save() should create the directory just to be safe

https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.ProjectModel/DependencyGraphSpec.cs#L267

@nkolev92
Copy link
Member

nkolev92 commented Feb 21, 2020

There's a permissions trick.

The tmp folder in linux is global, so the first operation that creates /tmp/NuGetScratch sets the permissions for that folder.

In every scenario that NuGet writes to the temp folder we usually call: https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.Common/PathUtil/DirectoryUtility.cs instead of Directory.CreateDirectory, because that creates the tmp folder with the correct permissions, allowing other machine users to write to the same folder, in turn allowing all users to do a restore themselves.

Now, we don't want the dgspec folder to have the same open permissions, we want to use the default one, because the dg spec folder is obj.

Since the resolver runs a preview restore, I think fixing #8793 should be enough, since it makes the temp project location irrelevant since we don't write anything to it.

@jaredpar
Copy link
Author

Added more affected builds

@nkolev92 nkolev92 added the RegressionDuringThisVersion A regression which broke since last RTM, and was fixed before the next RTM. label Feb 24, 2020
@nkolev92 nkolev92 changed the title EvaluateCacheFile is causing restore crashes on EvaluateCacheFile is causing restore crashes on Linux Feb 24, 2020
@jaredpar
Copy link
Author

Presently this is causing 6% of our builds to fail in dotnet/runtime.

Is there a step we can do before calling restore that would allow us to work around this problem until a real fix comes in? Waiting for NuGet means we'll likely stay at 6% failure for several weeks (at a minimum) given the state of code flow.

@markwilkie, @tmat

@nkolev92
Copy link
Member

nkolev92 commented Feb 24, 2020

It's a multi step thing but should unblock you:

  • Delete /tmp/NuGetScratch
  • Ensure that the first NuGet operation is one that doesn't go through the SDK resolver.
    • Literally any restore that leads to package download and does not have a custom SDK.
  • Profit

@jaredpar
Copy link
Author

@nkolev92

Would a chmod also work here? That is significantly simpler for us to implement as a work around than an extra restore (just realities of how the build is setup).

@nkolev92
Copy link
Member

Yeah, that'd work too

@jaredpar
Copy link
Author

@nkolev92 that work around of creating the /tmp/NuGetScratch directory ahead of time didn't work. Getting build failures on the change where we added the work around.

https://dev.azure.com/dnceng/public/_build/results?buildId=540894&view=logs

This is worisome for several reasons:

  1. Our repository continues to be blocked on this issue. 6% of our builds continue to fail because of this. It's unclear how long the actual fix will take to make it's way to the repositories.
  2. Are we sure the correct fix was added here? Given that the work around didn't work how confident are we that the correct fix was implemented?

@nkolev92
Copy link
Member

I'm fairly confident it's the correct fix...the dependency hasn't flown into dotnet yet dotnet/sdk#10726.

@ViktorHofer
Copy link

@nkolev92 we are still hitting this with 5.0.100-preview.2.20155.14:

/__w/2/s/.dotnet/sdk/5.0.100-preview.2.20155.14/MSBuild.dll /nologo -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/2/s/artifacts/log/Debug/ToolsetRestore.binlog /clp:Summary /clp:ErrorsOnly;NoSummary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:__ToolsetLocationOutputFile=/__w/2/s/artifacts/toolset/5.0.0-beta.20153.1.txt /t:__WriteToolsetLocation /warnaserror /__w/2/s/artifacts/toolset/restore.proj
/__w/2/s/artifacts/toolset/restore.proj : error : System.AggregateException: One or more errors occurred. (Could not find a part of the path '/tmp/NuGetScratch/02fd45654b454f9599090cac8c2dbf3c/3d2806c25a204fc38f7508f48b753d99.proj.nuget.dgspec.json'.)
/__w/2/s/artifacts/toolset/restore.proj : error :  ---> System.IO.DirectoryNotFoundException: Could not find a part of the path '/tmp/NuGetScratch/02fd45654b454f9599090cac8c2dbf3c/3d2806c25a204fc38f7508f48b753d99.proj.nuget.dgspec.json'.
/__w/2/s/artifacts/toolset/restore.proj : error :    at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
/__w/2/s/artifacts/toolset/restore.proj : error :    at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
/__w/2/s/artifacts/toolset/restore.proj : error :    at System.IO.FileStream.OpenHandle(FileMode mode, FileShare share, FileOptions options)
/__w/2/s/artifacts/toolset/restore.proj : error :    at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
/__w/2/s/artifacts/toolset/restore.proj : error :    at System.IO.FileStream..ctor(String path, FileMode mode)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.ProjectModel.DependencyGraphSpec.Save(String path)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.NoOpRestoreUtilities.PersistDGSpecFile(DependencyGraphSpec spec, String dgPath, ILogger log)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.RestoreCommand.EvaluateCacheFile()
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.RestoreCommand.ExecuteAsync(CancellationToken token)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.RestoreRunner.ExecuteAsync(RestoreSummaryRequest summaryRequest, CancellationToken token)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.RestoreRunner.CompleteTaskAsync(List`1 restoreTasks)
/__w/2/s/artifacts/toolset/restore.proj : error :    at NuGet.Commands.RestoreRunner.RunWithoutCommit(IEnumerable`1 restoreRequests, RestoreArgs restoreContext)
/__w/2/s/artifacts/toolset/restore.proj : error :    --- End of inner exception stack trace ---
/__w/2/s/artifacts/toolset/restore.proj : error :    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
/__w/2/s/artifacts/toolset/restore.proj : error :    at System.Threading.Tasks.Task`1.get_Result()
/__w/2/s/artifacts/toolset/restore.proj : error :    at Microsoft.Build.NuGetSdkResolver.NuGetSdkResolver.NuGetAbstraction.GetSdkResult(SdkReference sdk, Object nuGetVersion, SdkResolverContext context, SdkResultFactory factory)
/__w/2/s/artifacts/toolset/restore.proj : error MSB4236: The SDK 'Microsoft.DotNet.Arcade.Sdk' specified could not be found.

https://dev.azure.com/dnceng/public/_build/results?buildId=551633&view=logs

@ViktorHofer
Copy link

ViktorHofer commented Mar 8, 2020

cc @jaredpar @hoyosjs

@ViktorHofer
Copy link

Sorry I was wrong, we aren't using the right bits. The commits points to an old build from early February: 7bac015acc5b7e0161a058c8febc98abe2386d51

@kartheekp-ms kartheekp-ms reopened this Mar 18, 2020
@zkat zkat reopened this Mar 18, 2020
@zkat zkat closed this as completed Mar 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Functionality:Restore Platform:Xplat Product:MSBuildSDKResolver The NuGet powered SDK resolver. Owned by MSBuild team RegressionDuringThisVersion A regression which broke since last RTM, and was fixed before the next RTM. Style:PackageReference Tenet:Reliability Crashes, hangs etc. Type:Bug
Projects
None yet
8 participants