Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Too many open files" error on Linux (on s390x) #12410

Closed
uweigand opened this issue Feb 6, 2023 · 3 comments · Fixed by NuGet/NuGet.Client#5046
Closed

"Too many open files" error on Linux (on s390x) #12410

uweigand opened this issue Feb 6, 2023 · 3 comments · Fixed by NuGet/NuGet.Client#5046
Labels
Area:HttpCommunication Functionality:Restore Priority:3 Issues under consideration. With enough upvotes, will be reconsidered to be added to the backlog. Type:Bug

Comments

@uweigand
Copy link

uweigand commented Feb 6, 2023

NuGet Product Used

dotnet.exe

Product Version

dotnet 6.0, 7.0

Worked before?

n/a

Impact

It's more difficult to complete my work

Repro Steps & Context

The original use case where we ran into this problem was running Red Hat's regression test suite:
https://github.com/redhat-developer/dotnet-regular-tests
against the Red Hat dotnet RPMs on linux-s390x.

After some investigation, I was able to create a much simpler test case:

  1. Create a dotnet project with a non-trivial set of dependencies. The simplest I found is a dotnet new xunit hello world.
  2. Create a nuget.config that pulls in a large number of repo sources, e.g. along these lines: https://github.com/dotnet/installer/blob/release/6.0.1xx/NuGet.config
  3. Start from an empty nuget cache, i.e. removing ~/.nuget/packages or setting NUGET_PACKAGES to an empty directory.
  4. Run dotnet restore

This results in many many errors along the lines of:

  Determining projects to restore...
  Retrying 'FindPackagesByIdAsync' for source 'https://api.nuget.org/v3-flatcontainer/system.io.compression/index.json'.
  Too many open files in system (api.nuget.org:443)
    Too many open files in system
  Retrying 'FindPackagesByIdAsync' for source 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/1e329448-4305-4ce3-bd4c-fb5718f7a5b1/nuget/v3/flat2/system.reflection.extensions/index.json'.
  Too many open files in system (pkgs.dev.azure.com:443)
    Too many open files in system
  Retrying 'FindPackagesByIdAsync' for source 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/d1622942-d16f-48e5-bc83-96f4539e7601/nuget/v3/flat2/system.reflection/index.json'.
  Too many open files in system (pkgs.dev.azure.com:443)
    Too many open files in system
  Retrying 'FindPackagesByIdAsync' for source 'https://api.nuget.org/v3-flatcontainer/system.io.compression.zipfile/index.json'.
  Too many open files in system (api.nuget.org:443)
    Too many open files in system

Note that the test runner framework employed by dotnet-regular-tests implicitly performs steps 2) and 3) above; 2) is done to allow testing dotnet prereleases that may pull in dependencies not yet on nuget.org, and 3) is done to ensure reproducible results.

[ Note also that we're only seeing this on the s390x (IBM Z) architecture, not on Intel. I wasn't able to figure out any platform-specific code difference that might responsible for that - it seems to simply come down to different I/O timing behavior. ]

Initial investigation with strace confirms that the error message is correct: at the time of failure, we do indeed have 1024 open file descriptors, which is the default limit on Linux. (While it is possible to increase that limit, this is not done by default, since file handles larger than 1024 cannot be processed via select due to the fixed size of the fd_set data type, and therefore some applications may no longer work correctly if that limit is extended.)

Those 1024 open file descriptors for the most part appear to originate from about 500 simultaneous connections to various nuget sources, where each such connection uses two file descriptors - one for the TCP socket, and one for a local lock file. Those connections are (nearly) all about collecting package metadata, i.e. index.json files from many versions of many packages across many nuget sources.

In some cases, I was able to collect an exception backtrace, and it looks mostly similar to this:

at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
at Interop.CheckIo(Error error, String path, Boolean isDirectory, Func`2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.UnixFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.FileStreamHelpers.ChooseStrategy(FileStream fileStream, String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, Int64 preallocationSize)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, Int64 preallocationSize)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean useAsync)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize)
at NuGet.Protocol.FindPackagesByIdNupkgDownloader.<>c__DisplayClass13_0.<ProcessCacheEntryAsync>b__0(CancellationToken lockedToken)
at NuGet.Common.ConcurrencyUtilities.<ExecuteWithFileLockedAsync>d__4`1[[System.IO.FileStream, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
at NuGet.Common.ConcurrencyUtilities.<ExecuteWithFileLockedAsync>d__4`1[[System.IO.FileStream, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
at NuGet.Protocol.FindPackagesByIdNupkgDownloader.ProcessCacheEntryAsync(CacheEntry cacheEntry, Func`2 processStreamAsync, CancellationToken token)
at NuGet.Protocol.FindPackagesByIdNupkgDownloader.ProcessNupkgStreamAsync(PackageIdentity identity, String url, Func`2 processStreamAsync, SourceCacheContext cacheContext, ILogger logger, CancellationToken token)
at NuGet.Protocol.FindPackagesByIdNupkgDownloader.GetNuspecReaderFromNupkgAsync(PackageIdentity identity, String url, SourceCacheContext cacheContext, ILogger logger, CancellationToken token)
at NuGet.Protocol.HttpFileSystemBasedFindPackageByIdResource.GetDependencyInfoAsync(String id, NuGetVersion version, SourceCacheContext cacheContext, ILogger logger, CancellationToken cancellationToken)
at NuGet.Commands.SourceRepositoryDependencyProvider.GetDependenciesCoreAsync(LibraryIdentity match, NuGetFramework targetFramework, SourceCacheContext cacheContext, ILogger logger, CancellationToken cancellationToken)
at NuGet.Commands.SourceRepositoryDependencyProvider.<>c__DisplayClass21_0.<<GetDependenciesAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
at NuGet.Commands.SourceRepositoryDependencyProvider.GetDependenciesAsync(LibraryIdentity libraryIdentity, NuGetFramework targetFramework, SourceCacheContext cacheContext, ILogger logger, CancellationToken cancellationToken)
at NuGet.DependencyResolver.ResolverUtility.CreateGraphItemAsync(RemoteMatch match, NuGetFramework framework, SourceCacheContext cacheContext, ILogger logger, CancellationToken cancellationToken)
at NuGet.DependencyResolver.ResolverUtility.FindLibraryEntryAsync(LibraryRange libraryRange, NuGetFramework framework, String runtimeIdentifier, RemoteWalkContext context, CancellationToken cancellationToken)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGraph runtimeGraph, Func`2 predicate, GraphEdge`1 outerEdge, TransitiveCentralPackageVersions transitiveCentralPackageVersions)
at NuGet.DependencyResolver.RemoteDependencyWalker.CreateGraphNode(LibraryRange libraryRange, NuGetFramework framework, String runtimeName, RuntimeGr                        at NuGet.DependencyResolver.RemoteDependencyWalker.WalkAsync(LibraryRange library, NuGetFramework framework, String runtimeIdentifier, RuntimeGraph runtimeGraph, Boolean recursive)
at NuGet.Commands.ProjectRestoreCommand.WalkDependenciesAsync(LibraryRange projectRange, NuGetFramework framework, String runtimeIdentifier, RuntimeGraph runtimeGraph, RemoteDependencyWalker walker, RemoteWalkContext context, CancellationToken token)
at NuGet.Commands.ProjectRestoreCommand.TryRestoreAsync(LibraryRange projectRange, IEnumerable`1 frameworkRuntimePairs, NuGetv3LocalRepository userPackageFolder, IReadOnlyList`1 fallbackPackageFolders, RemoteDependencyWalker remoteWalker, RemoteWalkContext context, Boolean forceRuntimeGraphCreation, CancellationToken token, TelemetryActivity telemetryActivity, String telemetryPrefix)
at NuGet.Commands.RestoreCommand.ExecuteRestoreAsync(NuGetv3LocalRepository userPackageFolder, IReadOnlyList`1 fallbackPackageFolders, RemoteWalkContext context, CancellationToken token, TelemetryActivity telemetryActivity)
at NuGet.Commands.RestoreCommand.ExecuteAsync(CancellationToken token)
at NuGet.Commands.RestoreRunner.ExecuteAsync(RestoreSummaryRequest summaryRequest, CancellationToken token)
at NuGet.Commands.RestoreRunner.ExecuteAndCommitAsync(RestoreSummaryRequest summaryRequest, CancellationToken token)
at NuGet.Commands.RestoreRunner.CompleteTaskAsync(List`1 restoreTasks)
at NuGet.Commands.RestoreRunner.RunAsync(IEnumerable`1 restoreRequests, RestoreArgs restoreContext, CancellationToken token)
at NuGet.Commands.RestoreRunner.RunAsync(RestoreArgs restoreContext, CancellationToken token)
at NuGet.Build.Tasks.BuildTasksUtility.RestoreAsync(DependencyGraphSpec dependencyGraphSpec, Boolean interactive, Boolean recursive, Boolean noCache, Boolean ignoreFailedSources, Boolean disableParallel, Boolean force, Boolean forceEvaluate, Boolean hideWarningsAndErrors, Boolean restorePC, Boolean cleanupAssetsForUnsupportedProjects, ILogger log, CancellationToken cancellationToken)
at NuGet.Build.Tasks.RestoreTask.ExecuteAsync(ILogger log) (TaskId:36)

Now, looking at the nuget source code, this doesn't appear to be the result of any obvious bug, it's more of a design issue - in this scenario, nuget really does have to check all these nuget sources for different versions of the dependent packages, and it is indeed better to run those checks in parallel. That said, it still would seem preferable to at least limit concurrency to the extent that we don't exceed default system limits.

I also noticed that there are in fact two mechanisms in place that appear to have the intent to limit concurrency just so, but they don't work in this particular use case.

The first is a limit imposed by ./src/NuGet.Core/NuGet.Protocol/HttpSource/HttpSource.cs by the throttle argument to the HttpSource constructor: https://github.com/NuGet/NuGet.Client/blob/419ab86c016778ee3aeab40ea6fee879c09c7365/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpSource.cs#L41
However, the default for this throttle appears to be to provide no limitation at all - it gets used only when either the --disable-parallel option is given to dotnet restore, or if there is a maxHttpRequestsPerSource statement in nuget.config.
This actually works (sometimes) for my use case: --disable-parallel makes it succeed reliably, but takes are very long time (5 minutes to restore the hello world example); while setting maxHttpRequestsPerSource to a low value like 16 makes the restore succeed sometimes (after about 30 seconds), but still fails in other cases, apparently depending on random timing effects.

The second mechanism is an extra throttle in src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs which was apparently added to address a problem with MacOS only supporting 256 open files by default: https://github.com/NuGet/NuGet.Client/blob/419ab86c016778ee3aeab40ea6fee879c09c7365/src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs#L47
But this doesn't work either - it's only enabled on MacOS, not Linux, and even if I do enable it, it still doesn't work reliably. The latter appears to be due to the fact the throttle isn't applied consistently: while most uses of the _findPackagesByIdResource callbacks are guarded by the throttle, there is one call to _findPackagesByIdResource.DoesPackageExistAsync which isn't. This looks like an oversight to me ...

If I add a throttle guard to that call, and enable the throttle on Linux, dotnet restore now succeeds reliably, and takes only about 18 seconds for a full restore from an empty nuget cache.

I'd appreciate any comments or suggestions on whether this change looks reasonable and might be acceptable upstream, or whether this should be solved in a different way - happy to try other solutions! For reference, this is the patch I'm currently using - I'd be happy to submit a PR along those lines if you'd like me to.

diff --git a/src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs b/src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs
index d859f17fb..4913b1914 100644
--- a/src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs
+++ b/src/NuGet.Core/NuGet.Commands/RestoreCommand/SourceRepositoryDependencyProvider.cs
@@ -44,7 +44,7 @@ public class SourceRepositoryDependencyProvider : IRemoteDependencyProvider
         // Limiting concurrent requests to limit the amount of files open at a time on Mac OSX
         // the default is 256 which is easy to hit if we don't limit concurrency
         private readonly static SemaphoreSlim _throttle =
-            RuntimeEnvironmentHelper.IsMacOSX
+            RuntimeEnvironmentHelper.IsMacOSX || RuntimeEnvironmentHelper.IsLinux
                 ? new SemaphoreSlim(ConcurrencyLimit, ConcurrencyLimit)
                 : null;

@@ -211,8 +211,27 @@ public class SourceRepositoryDependencyProvider : IRemoteDependencyProvider
                 if (libraryRange.VersionRange?.MinVersion != null && libraryRange.VersionRange.IsMinInclusive && !libraryRange.VersionRange.IsFloating)
                 {
                     // first check if the exact min version exist then simply return that
-                    if (await _findPackagesByIdResource.DoesPackageExistAsync(
-                        libraryRange.Name, libraryRange.VersionRange.MinVersion, cacheContext, logger, cancellationToken))
+                    bool versionExists = false;
+                    try
+                    {
+                        if (_throttle != null)
+                        {
+                            await _throttle.WaitAsync();
+                        }
+
+                        versionExists = await _findPackagesByIdResource.DoesPackageExistAsync(
+                            libraryRange.Name,
+                            libraryRange.VersionRange.MinVersion,
+                            cacheContext,
+                            logger,
+                            cancellationToken);
+                    }
+                    finally
+                    {
+                        _throttle?.Release();
+                    }
+
+                    if (versionExists)
                     {
                         return new LibraryIdentity
                         {

Related issues for reference:
#2004
#2163
#8571

FYI @omajid @tmds @crummel

Resolution

In the meantime, the Mono runtime has been changed to raise the soft RLIMIT_NOFILE on Linux, making the original issue described here no longer appear. However, as a result of the ongoing discussion, the linked PR still implements two changes to the NuGet.client code base:

  • Add environment variable NUGET_CONCURRENCY_LIMIT to override OS-specific limit of concurrent connections in SourceRepositoryDependencyProvider.cs. This allows manually limiting concurrency just in case, e.g. when running in an environment where the OS limit cannot be raised high enough. Conversely, it would also allow manually lifting the automatic limit imposed by NuGet on MacOS, in case it isn't actually needed in some environments.
  • Guard _findPackagesByIdResource.DoesPackageExistAsync under _throttle. Note that all other _findPackagesByIdResource callbacks were already guarded under the same _throttle to limit concurrency, so the consensus was that this was an oversight. Adding this additional guard is required to have the concurrency limit work reliably in those cases where it is enabled, e.g. on MacOS or when using the new environment variable.

Documentation update

The documentation should be updated to describe the new NUGET_CONCURRENCY_LIMIT variable.

@tmds
Copy link

tmds commented Feb 6, 2023

we're only seeing this on the s390x (IBM Z) architecture, not on Intel.

And these builds use Mono runtime, not CoreCLR.

@nkolev92 nkolev92 added Priority:3 Issues under consideration. With enough upvotes, will be reconsidered to be added to the backlog. Pipeline:Icebox and removed Triage:NeedsTriageDiscussion labels Feb 9, 2023
@nkolev92
Copy link
Member

nkolev92 commented Feb 9, 2023

We'd be happy to review and consider a change in this area.

uweigand added a commit to uweigand/NuGet.Client that referenced this issue Feb 10, 2023
Avoid "too many open files" error in some Linux configuations.
Guard _findPackagesByIdResource.DoesPackageExistAsync under _throttle.

Addresses NuGet/Home#12410.
@uweigand
Copy link
Author

We'd be happy to review and consider a change in this area.

Thanks, I've now opened a PR.

uweigand added a commit to uweigand/NuGet.Client that referenced this issue Feb 21, 2023
Avoid "too many open files" error in some Linux configuations.
Guard _findPackagesByIdResource.DoesPackageExistAsync under _throttle.

Addresses NuGet/Home#12410.
uweigand added a commit to uweigand/NuGet.Client that referenced this issue Mar 2, 2023
Add environment variable to override OS-specific limit of
concurrent connections in SourceRepositoryDependencyProvider.cs.

Also guard _findPackagesByIdResource.DoesPackageExistAsync under
the concurrent connection _throttle.

Addresses NuGet/Home#12410.
zivkan pushed a commit to uweigand/NuGet.Client that referenced this issue Mar 3, 2023
Add environment variable to override OS-specific limit of
concurrent connections in SourceRepositoryDependencyProvider.cs.

Also guard _findPackagesByIdResource.DoesPackageExistAsync under
the concurrent connection _throttle.

Addresses NuGet/Home#12410.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area:HttpCommunication Functionality:Restore Priority:3 Issues under consideration. With enough upvotes, will be reconsidered to be added to the backlog. Type:Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants