Skip to content

Conversation

@nagilson
Copy link
Member

@nagilson nagilson commented Apr 9, 2025

Now that the mutex is implemented correctly, I can check that the code which uses the mutex is all correct.
An issue I discovered, is that on Linux, we use a sudo process folder to exchange files with a master sudo process.
This folder is shared across all instances of vscode. If you kill vscode, the sudo process does not go away.

When a new instance checks if it should spawn a master sudo process, it tries to communicate with the old one. But the old one may be busy, and the old logic only waited 1 second. The sudo process may be busy for multiple minutes. This can result in race conditions where the old sudo process and new one are doing some of the same tasks. Now, I've given each vscode instance its own folder to run sudo commands under.

The script that runs these commands is still in the same protected location. It will only run authorized and approved commands by us, so spawning a new directory is OK.

Resolves #2224

nagilson added 6 commits April 9, 2025 11:30
It seems like it dropped the execution lock (maybe it threw an error) when executing The command apt-get -o DPkg::Lock::Timeout=180 upgrade -y dotnet-sdk-8.0 was forwarded to the master process to run.

It waited about 2 seconds then freed the server before it was done.  It freed before it threw but free happens before throw, so it might have failed for some reason, then failed again since stdout.txt didn't exist because the process wasn't done running. then freed. When it freed, it was still going but then it allowed the sudo master process killer to run which caused this to fail later....

Still not sure why that'd happen, it might be if the locking was not implemented correctly.
This will help prevent an existing process from colliding with another.
this is basically like random int collision at this point, could probably remove.
@nagilson nagilson marked this pull request as ready for review April 9, 2025 22:25
Copilot AI review requested due to automatic review settings April 9, 2025 22:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

vscode-dotnet-runtime-library/src/Utils/CommandExecutor.ts:123

  • Locking now uses the script path instead of the communication directory, while subsequent operations continue to use the directory. Verify if this change is intentional to avoid potential race conditions or locking inconsistencies.
return executeWithLock(this.context.eventStream, false, RUN_UNDER_SUDO_LOCK(this.sudoProcessScript), SUDO_LOCK_PING_DURATION_MS, waitForLockTimeMs,

vscode-dotnet-runtime-library/src/Utils/CommandExecutor.ts:149

  • [nitpick] Reducing the timeout from 1000ms to 500ms for checking process liveness may cause premature failures. Confirm that this new timeout value meets the operational requirements.
if (await this.sudoProcIsLive(false, fullCommandString, 500))

@nagilson nagilson requested a review from a team April 10, 2025 16:40
catch (error: any)
{
// eslint-disable-next-line @typescript-eslint/no-unsafe-member-access
error.message = error.message + `\nFailed to create ${this.sudoProcessCommunicationDir}. Please check your permissions or install dotnet manually.`;
Copy link
Member Author

@nagilson nagilson Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may happen in the rare event of a collision where the dir exists, so its ok to keep going here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have fallback dirs? Like if dir exists, check dir1, then dir2, etc.? I don't feel great about catching any exception and not even verifying that it failed for the reason you mentioned. You could also have sudoProcessCommunication dir is unexpectedly null; that would be a bug, and we should care, I think

Copy link

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think any of these are too serious. I'm a little concerned about the undefined thing, but that might just be my lack of js knowledge

catch (error: any)
{
// eslint-disable-next-line @typescript-eslint/no-unsafe-member-access
error.message = error.message + `\nFailed to create ${this.sudoProcessCommunicationDir}. Please check your permissions or install dotnet manually.`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have fallback dirs? Like if dir exists, check dir1, then dir2, etc.? I don't feel great about catching any exception and not even verifying that it failed for the reason you mentioned. You could also have sudoProcessCommunication dir is unexpectedly null; that would be a bug, and we should care, I think

{
if (await this.sudoProcIsLive(false, fullCommandString, 1000)) // If the sudo process was spawned by another instance of code, we do not want to have 2 at once but also do not waste a lot of time checking
// As it should not be in the middle of an operation which may cause it to take a while.
if (await this.sudoProcIsLive(false, fullCommandString, 500)) // If the sudo process was spawned by another instance of code, we do not want to have 2 at once but also do not waste a lot of time checking
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure users would really notice the difference between 500ms and 1000ms. I don't know how long this is expected to take (IPC, right? I think speed then depends on OS), but I could imagine it taking longer than 500ms on a heavily loaded system

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. This check only exists for when there is an existing sudo process working in same the individualized folder.

That should only happen when we think we have not spawned a sudo process ourselves that is currently alive, or the sudo process spawned unsuccessfully while throwing an error. This check can probably be removed in its entirety.

stderr: (fs.readFileSync(stderrFile, 'utf8')).trim(),
status: (fs.readFileSync(statusFile, 'utf8')).trim()
stdout: (await (this.fileUtil as FileUtilities).read(stdoutFile)).trim(),
stderr: (await (this.fileUtil as FileUtilities).read(stderrFile)).trim(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defaults to utf8? I'm also wondering if you can have these three going in parallel then just 'join' them afterwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! The function does default to utf8. I like the parallelization idea, though I want to avoid adding that complexity here

@nagilson
Copy link
Member Author

Can you have fallback dirs? Like if dir exists, check dir1, then dir2, etc.? I don't feel great about catching any exception and not even verifying that it failed for the reason you mentioned. You could also have sudoProcessCommunication dir is unexpectedly null; that would be a bug, and we should care, I think

That's a good point. If we cant make the first dir, I doubt we can make the 2nd. Perhaps we could use the existing directory as a fallback, though that would introduce the race condition again if it happens multiple times. I could condition it to not fail on EEXIST. Though, I just don't know what other types of errors the kernel could throw at us :/

… kernels

EACCESS as well. This may be something that will not resolve, though if some admin thing is blocking and holding the file, perhaps this is a good resolution. I never encountered this when trying different permissions on my demos but it seems that some people are.
@nagilson
Copy link
Member Author

nagilson commented Apr 10, 2025

@Forgind, I think I resolved the 2 issues you put at the top by throwing the error in certain cases if we cant use mkdir or using a backupdir, and then by removing the extra 500 ms timeout check. Thanks for looking and let me know what you think!

Copy link

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy; thanks!

might as well do it in this PR
@nagilson nagilson enabled auto-merge (squash) April 10, 2025 20:46
@nagilson nagilson merged commit 902bb39 into dotnet:main Apr 10, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NETE2ESDK][Ubuntu24.04] Error occurs when re-install runtime or SDK after re-launch VSCode which is killed before

3 participants