This repository has been archived by the owner on Jan 2, 2021. It is now read-only.

FileExists: set one watcher instead of thousands #831

Merged
merged 2 commits into haskell:master from imp/less-file-watching
Sep 27, 2020

Conversation

michaelpj
Contributor

This is broken. It sends an illegal message before we've sent the initialize response. I got this far because (amusingly) it actually works in emacs, but the test suite complains.

I'm not sure quite how to fix this. @pepeiborra suggested sending the message in fileExistsRules as I'm doing, but this appears to happen too early.

I can think of a few gross ways to fix this (actually do the registration the first time GetFileExists runs, add some extra state to determine whether we've done it), but I thought I'd put it up to solicit ideas.


This prevents us from sending thousands of notifications to the client
on startup, which can lock up some clients like emacs. Instead we send
precisely one.

This has some consequences for the behaviour of the fast file existence
lookup, which I've noted in the code, alongside a description of how it
works (I spent a while figuring it out, I thought I might as well write
it down).

Fixes #776.

@pepeiborra
Collaborator

Yup, the spec is quite clear on this:

In addition the server is not allowed to send any requests or notifications to the client until it has responded with an InitializeResult

To fix this, installing the watcher on the first call to GetFileExists is ok. A nicer way would be to change the Rules () computation taken by initialise to Rules (IO ()), so that rules are allowed to return some delayed initialisation to be run on the LSP.onStartup handler.

https://github.com/haskell/ghcide/blob/master/src/Development/IDE/Core/Service.hs#L41-L44

https://github.com/haskell/ghcide/blob/master/src/Development/IDE/LSP/LanguageServer.hs#L135-L136
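The suggested `Rules (IO ())` refactor can be sketched in miniature. All names below are illustrative stand-ins, not ghcide's actual API: rule definitions accumulate delayed initialisation actions, and the server runs them only after it has responded to `initialize`.

```haskell
import Data.IORef

-- A much-simplified stand-in for Shake's Rules monad: rule setup may
-- register delayed initialisation actions instead of running them
-- eagerly (which would be illegal before the initialize response).
newtype Rules a = Rules (IORef [IO ()] -> IO a)

runRules :: Rules a -> IO (a, IO ())
runRules (Rules f) = do
  ref <- newIORef []
  x <- f ref
  acts <- readIORef ref
  -- The combined delayed initialisation, to run in the onStartup
  -- (post-initialize) handler.
  pure (x, sequence_ (reverse acts))

-- Register an action to run only after the client is initialised.
afterInit :: IO () -> Rules ()
afterInit act = Rules $ \ref -> modifyIORef ref (act :)

-- Hypothetical fileExistsRules: defer the watcher registration
-- instead of sending it during rule setup.
fileExistsRules :: IORef Int -> Rules ()
fileExistsRules registrationsSent =
  afterInit (modifyIORef registrationsSent (+ 1))
```

The point of the shape change is that rule setup stays pure of client traffic; only the returned `IO ()` talks to the client, at a time the server controls.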

@michaelpj michaelpj force-pushed the imp/less-file-watching branch 2 times, most recently from 26ed2a9 to 65b840c Compare September 25, 2020 18:52
@michaelpj
Contributor Author

To fix this, installing the watcher on the first call to GetFileExists is ok.

I did this. It seems to be okay.

A nicer way would be to change the Rules () computation taken by initialise to Rules (IO ()), so that rules are allowed to return some delayed initialisation to be run on the LSP.onStartup handler.

I agree this would be nicer, but it's not clear we can do this: the Rules () eventually gets passed to a function from Shake, and we don't get the return value. Or at least, it seemed non-trivial to make this work.

That said, I've got quite lost trying to follow how the LSP initialization works. So perhaps there is a place we could slip in a hook for something to run post-initialize.

@michaelpj michaelpj force-pushed the imp/less-file-watching branch from 65b840c to 4cc504b Compare September 25, 2020 19:06
@michaelpj
Contributor Author

Okay, now I have an issue where some of the tests are timing out. I'm somewhat mystified.

My initial guess was that this resulted from a TOCTOU issue with checking whether the watcher had been initialized and initializing it, which could potentially result in lots of notifications being sent. But I think I fixed that in the second commit, and things still time out. Any ideas for debugging?
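The TOCTOU concern can be sketched in isolation (illustrative names, not ghcide's actual code): the "have we initialised?" check and the flag update must happen atomically, e.g. inside `modifyMVar_`, so two threads racing on the first `GetFileExists` cannot both observe "not yet initialised" and both send the registration.

```haskell
import Control.Concurrent.MVar

-- TOCTOU-free one-shot initialisation: check-and-set is atomic
-- because it runs inside modifyMVar_, which holds the MVar for the
-- duration of the update.
ensureWatcherRegistered :: MVar Bool -> IO () -> IO ()
ensureWatcherRegistered initialised registerWatcher =
  modifyMVar_ initialised $ \done ->
    if done
      then pure True
      else do
        registerWatcher  -- e.g. send client/registerCapability, once
        pure True
```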

@pepeiborra
Collaborator

There's one test failing:

    add missing module:                                                                                   FAIL
      Exception: Timed out waiting to receive a message from the server.
      Last message received:
      {
          "tag": "NotLogMessage",
          "contents": {
              "jsonrpc": "2.0",
              "params": {
                  "type": 4,
                  "message": "finish: kick (took 0.03s)"
              },
              "method": "window/logMessage"
          }
      }

Tests are parsers that recognise a specific sequence of LSP messages. A timeout means that ghcide didn't send the expected message back within the allotted time.

This test is checking that ghcide notices that a new module was created in the file system. You probably need to update the test expectations to work with your changes.

By the way, I'm surprised that the tests below passed; they count the number of file watchers created for a file:

  watched files
    workspace files:                                                                                      OK (0.18s)
    non workspace file:                                                                                   OK (0.19s)

@michaelpj
Contributor Author

Indeed, I changed the expectations for the watched-files tests! I had a look at the failing test, but it wasn't obvious what it was doing differently (such that it would fail when the others don't). I'll have another stab!

@michaelpj michaelpj force-pushed the imp/less-file-watching branch from a605333 to 2b62893 Compare September 26, 2020 11:34
@michaelpj
Contributor Author

I think I tracked down the problem: in the case where we didn't have cached information for a file, we would delegate to the VFS lookup, but unlike fileExistsSlow, we wouldn't set alwaysRerun, so this result would not get invalidated.

In the old version, if we hit the slow case in fileExistsFast, we would register a listener, and so we would later get notifications from that, which would handle invalidation. Tricky.

I think the handling of invalidation here is maybe a bit more complicated than it needs to be. I might investigate just using alwaysRerun unconditionally, but I expect that could affect performance, so I'll do that in a follow-up PR.

@pepeiborra
Collaborator

The fileExistsFast rule is quite performance critical and a bit special - it doesn't need an alwaysRerun because modifyFileExists calls deleteValue to mutate the state used by defineEarlyCutoff. In other words, it bypasses the Shake machinery by abusing the implementation details of the rules defined by ghcide.

Why is this better than a cheap alwaysRerun with cutoff? Back in the day we didn't have a benchmark suite so it's entirely possible that it isn't any better.

pepeiborra
pepeiborra previously approved these changes Sep 26, 2020
newtype FileExistsMapVar = FileExistsMapVar (Var FileExistsMap)
-- | The state of our file existence cache. A pair of a boolean indicating whether we have initialized the
-- file watcher in the client yet, and a map tracking file existence.
data FileExistsState = FileExistsState { fileExistsWatcherInitialized :: Bool, fileExistsMap :: FileExistsMap }
Collaborator

Both fields should probably be strict

changesMap <- evaluate $ HashMap.fromList changes

-- Masked to ensure that the previous values are flushed together with the map update
mask $ \_ -> do
-- update the map
modifyVar_ var $ evaluate . HashMap.union changesMap
-- flush previous values
modifyVar_ var $ \st -> evaluate $ st{fileExistsMap=HashMap.union changesMap (fileExistsMap st)}
Collaborator

This is why the fileExistsMap field needs to be strict
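The strictness point can be demonstrated in isolation: `evaluate` only forces to WHNF, so with lazy fields the old map can survive inside the state as a retained thunk, while bang-annotated fields make forcing the constructor also force each field. Here `FileExistsMap` is stubbed as an association list; the real one is a `HashMap`.

```haskell
import Control.Exception (SomeException, evaluate, try)

type FileExistsMap = [(FilePath, Bool)]  -- stub for the real HashMap

-- Strict fields, as suggested in review: forcing the constructor to
-- WHNF (what `evaluate` does in modifyVar_) now also forces each field.
data FileExistsState = FileExistsState
  { fileExistsWatcherInitialized :: !Bool
  , fileExistsMap                :: !FileExistsMap
  }

-- Lazy counterpart, to show a thunk slipping through WHNF unnoticed.
data LazyState = LazyState Bool FileExistsMap
```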

@pepeiborra
Collaborator

The CI benchmarks show a 23% performance regression - could you run them locally in a stable environment and verify the results? I suspect this is because the fast path now calls the slow path, and that propagates the alwaysRerun.

@pepeiborra pepeiborra dismissed their stale review September 26, 2020 13:07

benchmarks

@michaelpj
Contributor Author

Yep, I can reproduce the benchmarks being bad. It's not yet clear to me why that is, I'll do some more digging.


On the functional issue: my understanding is that the problem is the invalidation in the fast path. On this branch, if we have a file that's a workspace file but not covered by our watcher, we will use fileExistsFast, do a slow lookup, and then cache that forever, since we're not using alwaysRerun, and we will never get a client notification about changes to that file (since we're not watching it).

A simple fix would be to change the isWorkspaceFile guard for using fileExistsFast to also check if the file matches our watch pattern. That would restore the invariant that "x is handled with fileExistsFast <=> x is watched". But it would add some more checks.
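The proposed guard can be sketched as follows; a toy extension check stands in for matching against the actual glob patterns sent in the watcher registration.

```haskell
import System.FilePath (takeExtension)

-- Toy stand-in for "does this path match the patterns we asked the
-- client to watch?"; the real check would consult the registered
-- glob patterns (e.g. "**/*.hs").
watchedExtensions :: [String]
watchedExtensions = [".hs", ".lhs"]

matchesWatched :: FilePath -> Bool
matchesWatched path = takeExtension path `elem` watchedExtensions

-- Guard for the fast path: take it only for files the client actually
-- watches, restoring "x handled with fileExistsFast <=> x is watched".
-- Everything else falls back to the slow lookup.
useFastPath :: Bool -> FilePath -> Bool
useFastPath isWorkspaceFile path =
  isWorkspaceFile && matchesWatched path
```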

However, it would be nice if we could use alwaysRerun, because the current logic seems racy even for files that are watched. Consider the following series of events:

  • We check for the existence of X.hs
  • This triggers us asking the client to watch all ".hs" files
  • We do a slow lookup for X.hs, it doesn't exist.
  • X.hs is created.
  • The client installs the watcher we ask for.
  • We check for X.hs, and get a cached answer that it doesn't exist... forever.

What I really want to say is that if we got the answer the slow way then we should recheck next time. But I don't know how to do that. alwaysRerun errs in the opposite direction of checking too often.

@michaelpj
Contributor Author

The performance degrades even before the commit which uses alwaysRerun :/

@wz1000
Collaborator

wz1000 commented Sep 26, 2020

Can't we register the watcher immediately after initialization?

@michaelpj
Contributor Author

Can't we register the watcher immediately after initialization?

I just came to this realization myself; in particular, I found https://microsoft.github.io/language-server-protocol/specification#initialized, which seems to be exactly for this. I'm in the process of rewriting to use that, let's see if it works.

Not sure that will help with the performance, but it should be neater.

@michaelpj michaelpj force-pushed the imp/less-file-watching branch from 2b62893 to d65fd03 Compare September 26, 2020 20:45
@pepeiborra
Collaborator

Yep, I can reproduce the benchmarks being bad. It's not yet clear to me why that is, I'll do some more digging.

On the functional issue: my understanding is that the problem is the invalidation in the fast path. On this branch, if we have a file that's a workspace file but not covered by our watcher, we will use fileExistsFast, do a slow lookup, and then cache that forever, since we're not using alwaysRerun, and we will never get a client notification about changes to that file (since we're not watching it).
A simple fix would be to change the isWorkspaceFile guard for using fileExistsFast to also check if the file matches our watch pattern. That would restore the invariant that "x is handled with fileExistsFast <=> x is watched". But it would add some more checks.

Yes. We were assuming that we never care about files not covered by the watcher. Unfortunately this is not the case for arbitrary cradle dependencies, like cabal descriptors and stack.yaml files; we'll need individual watchers for any file not covered by the glob pattern in the workspace watcher.

However, it would be nice if we could use alwaysRerun, because the current logic seems racy even for files that are watched. Consider the following series of events:

  • We check for the existence of X.hs
  • This triggers us asking the client to watch all ".hs" files
  • We do a slow lookup for X.hs, it doesn't exist.
  • X.hs is created.
  • The client installs the watcher we ask for.
  • We check for X.hs, and get a cached answer that it doesn't exist... forever.

What I really want to say is that if we got the answer the slow way then we should recheck next time. But I don't know how to do that. alwaysRerun errs in the opposite direction of checking too often.

It is a bit racy. What's the cost of using alwaysRerun? Let's measure it and make an informed decision: if the cost is small, let's plug the gap; if it's big, the choice is less clear.

@pepeiborra
Collaborator

pepeiborra commented Sep 27, 2020

I measured for the Cabal project using the branch https://github.com/pepeiborra/ghcide/tree/alwaysRerun-fileExists

The results are around 10% slower, so I think it's probably worth dropping the hack and switching to alwaysRerun.

My only concern is whether bigger projects like GHC will see a bigger impact, since the cost of alwaysRerun is linear in the number of possible import paths (the benchmark suite uses the Cabal codebase). I'll see about repeating the measurements with 10x the number of import source folders.

Results

HEAD - Use alwaysRerun
upstream - Use the values hack instead
Y axis - total time

[charts: total time per scenario, HEAD (alwaysRerun) vs upstream (values hack)]

Scenario                    HEAD         upstream
code actions                0.993193472  0.777636776
code actions after edit     36.5799817   32.86563162
completions after edit      26.90540779  22.91404066
documentSymbols after edit  5.872847811  6.360484811
edit                        34.66888288  31.02675419
getDefinition               1.381465592  1.322841643
hover                       0.99478149   1.07828978
hover after edit            54.36070061  51.61647408
Grand Total                 161.7572613  147.9621536

@pepeiborra
Collaborator

With 10 additional hs-source-dirs entries in the Cabal file the impact is much bigger, see the chart below.

This scenario is not unrealistic in codebases like GHC, where (unless things have changed recently) there are multiple src import dirs, or ours at Facebook where dozens of small projects are loaded into ghcide, each one with its own import dir.

I think this justifies the current hack to avoid the Shake overhead.

[chart: total time with 10 extra source dirs, alwaysRerun vs values hack]

@michaelpj
Contributor Author

Okay. I've attempted a version that filters based on whether the file in question would match our watcher, and removes the alwaysRerun.

However, I get the same timeout issue with that branch, so I need to investigate a bit more to figure out a way to get correct behaviour.

@michaelpj michaelpj force-pushed the imp/less-file-watching branch 2 times, most recently from f2549e7 to 114ebf4 Compare September 27, 2020 15:23
@michaelpj michaelpj force-pushed the imp/less-file-watching branch from 114ebf4 to f46a0e1 Compare September 27, 2020 15:24
@michaelpj michaelpj force-pushed the imp/less-file-watching branch from f46a0e1 to 9ac17c7 Compare September 27, 2020 15:34
@michaelpj
Contributor Author

Well, it turns out I was wrong about why the tests were failing; it was actually lukel97/lsp-test#77.

That was inadvertently exercising the failure mode of "the client doesn't send us change notifications even though we asked for them". This will result in us permanently caching the stale results, but this seems okay, since the whole point is to rely on the client for this and not do it ourselves, so there's not much we can do if the client lets us down. I've documented that, nonetheless.

That meant the tests passed, but I still needed to handle the case of workspace-but-not-watched files. I've implemented something like what I described above: we guard the fast path by a check whether the path matches the glob patterns that we are watching. This costs us a bit, but makes it more correct.

My benchmarks seem a bit inconsistent, particularly for "hover", but the general pattern seems to be that:

  • My version without the glob checking (first commit) is faster than master.
  • My version with glob checking (second commit) is slower than the previous version, but still faster than master.

I'd appreciate a check on this, since I'm not super-confident that I'm getting the right results here.

@michaelpj
Contributor Author

Otherwise, I think this is good to go!

@michaelpj
Contributor Author

Ugh, the Azure benchmarks show more of a regression on this PR. They look completely different to when I run them locally 🙄

@pepeiborra
Collaborator

The Azure benchmarks are not reliable, because they run in a shared environment. The only reliable metric in that setting is the total allocations

@pepeiborra
Collaborator

Performance is fine. Total time measurements for your branch (HEAD~2 is upstream):

[chart: total time for this branch (HEAD) vs upstream (HEAD~2)]

@pepeiborra pepeiborra left a comment
Collaborator

This is an awesome contribution, thanks.

@pepeiborra pepeiborra merged commit b279afb into haskell:master Sep 27, 2020
@michaelpj
Contributor Author

Thanks for helping me through the thorns :) And I'm super glad we have a benchmark suite!

@mrBliss

mrBliss commented Sep 28, 2020

Thanks @michaelpj!

pepeiborra pushed a commit to pepeiborra/ide that referenced this pull request Dec 29, 2020
* FileExists: set one watcher instead of thousands

This prevents us from sending thousands of notifications to the client
on startup, which can lock up some clients like emacs. Instead we send
precisely one.

This has some consequences for the behaviour of the fast file existence
lookup, which I've noted in the code, alongside a description of how it
works (I spent a while figuring it out, I thought I might as well write
it down).

Fixes haskell/ghcide#776.

* Use fast rules only if it matches our watcher spec

Successfully merging this pull request may close these issues.

didChangeWatchedFiles notifications lock up emacs on startup