-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
go/token: add FileSet.RemoveFile method #53200
Comments
Change https://go.dev/cl/410114 mentions this issue: |
A fourth approach (a variant on the first): FileSet.base is replaced by an actual global variable in the token package, shared by all FileSets; the base value provided to FileSet.AddFile is ignored. This ensures that all token.Files are inherently non-overlapping, permitting any subset of them to be collected in a transient FileSet. In other words, any collection of FileSets would be consistent in their (successful) Pos->Position mappings, even if they represent different subsets of files. This solves the "central management" problem. In theory, a downside of this approach is that the global counter would grow faster than each individual FileSet.base, but (a) in practice today there is only one FileSet in the application (see above!), and (b) even if that were not the case, we would still be in no danger of overflowing an int---not on a 64-bit machine at least. |
Another variant of the first could be to explicitly share a base value between FileSets. The FileSet could carry this around. The existing FileSet APIs could start from an unshared base. This is essentially just the fourth option but pass around a [thread safe?] *int on FileSet creation instead of using a global. Like the fourth option the really safe way to share a base with multiple competing parallel FileSets is to pass in a negative base to Not sure either this or the fourth option would be okay on 32-bit machine for long running processes like gopls. |
The value of approach 4 is it solves the "central management" problem without needing to update every caller of Parse, whereas the explicit-shared-based approach would still require each caller of Parse to participate.
On a 32-bit machine, gopls would run out of memory long before it could even attempt to parse 2GiB worth of Go source files. ;-) |
My understanding is that we are worried about 1 long running process (gopls) creating a File per keystroke and exhausting the available Pos values. If this understanding is wrong, the rest of this comment won't mean much. I did a back of the envelope estimate, which came out to about 16 hours of work to overflow a global base on a 32-bit machine. 1<<31 Positions * 1/(11074 Positions/AddFile)* 1 keystroke/AddFile * 1/(200 keystroke/minute) which is roughly 16 hours. 11074 comes from the average file size in the Go standard library. 1 keystroke/AddFile comes from my understanding of "parses a file at least once per keystroke of a Go developer's editor". 200 keystrokes/minute is a guess from CPM type speed estimates (https://www.toptenreviews.com/what-is-a-good-typing-speed). Unless the assumptions are +1 order of magnitude off, seems like not enough time for a gopls process. |
Your analysis is correct but, no, I'm not at all worried about running out of address space. No-one has ever worked at 200 keystrokes/minute for 16 hours straight, and I'm pretty sure gopls would crash at least once if they tried. It also needs more than 2GB of RAM for most workloads, and no laptop is a 32-bit machine. What I'm worried about is actual memory consumed by the newline index tables. |
For arguments sake, keep in mind that it's not the average file size that matters, but the size of the file that is changing. I am sure we could find a real world example of a very large file in an otherwise small workspace, that exhausts the 32 bit address space in under an hour (or perhaps a few minutes!). But as Alan suggests, this development scenario is unlikely for a variety of reasons, and in practice such a workflow would encounter other problems. |
Adding to minutes. It does seem like there are many many parts of the go/* packages that assume batch mode and are stressed near breaking by interactive use. I wonder if we should be looking at all of them together instead of just one API at a time. |
This proposal has been added to the active column of the proposals project |
That's definitely true, but I think we should decouple this proposal from larger changes. It will be a long time before we can stop using token.FileSet in gopls, and yet this unavoidable memory leak has very real consequences for our users today. (Consider that, while working on a large file with ~10k lines, gopls can leak 100kb per keystroke). |
Based on the discussion above, this proposal seems like a likely accept. |
No change in consensus, so accepted. 🎉 |
The design of FileSet encourages it to be used as a global variable. Each call to AddFile consumes about 3KB, that is never returned, even after an application no longer cares about the File. This change adds a RemoveFile method that a long-running application can use to release a File that is no longer needed, saving memory. Fixes golang/go#53200 Change-Id: Ifd34d650fe0d18b1395f922a4cd02a535afbe560 Reviewed-on: https://go-review.googlesource.com/c/go/+/410114 Auto-Submit: Alan Donovan <[email protected]> Run-TryBot: Alan Donovan <[email protected]> Reviewed-by: Robert Griesemer <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Robert Findley <[email protected]>
Background: A
token.FileSet
holds a collection of non-overlappingtoken.File
s and provides a mapping from token.Pos integers to position information (file/line/column/offset). This is analogous to the way an address space containing several non-overlapping file mappings defines the meaning of a numeric pointer value. Applications that parse Go files are expected to use a single FileSet to provide meaning to all the token.Pos values in all the ASTs; in this way, the AST nodes themselves needn't include a pointer to the mapping information, as it can be implicitly supplied by the contextual FileSet.The problem: this design leads applications to create a single FileSet that is effectively a ubiquitous global variable. A long-running application may encounter and parse an unbounded stream of files, calling AddFile for each one, causing its set of Files to grow without bound. The FileSet retains each File indefinitely, even after the application no longer cares about it. The memory footprint of a File is approximately one int per line of the file. Go source files in the standard library have an average of 371 lines, leading to 3KB of wasted space per file. Our application (gopls) parses a file at least once per keystroke of a Go developer's editor.
I can think of a few ways the memory usage of FileSet could be reduced.
The first approach is to move away from the concept of a single central FileSet in an application. Instead one would create a new FileSet for every parse, so that FileSets and Files are typically in 1:1 correspondence. An application could, as needed, create transient FileSets containing only the small subset of existing Files necessary for a given operation, such as invoking the type checker. However, this would still require all those Files to be non-overlapping, which requires central management of the "next free address" variable (like sbrk(2) in the mmap analogy) that should be provided in each call to
FileSet.AddFile(base)
. It would also require that every call to Parse go through this central management, which might be practical within an application but doesn't seem feasible if libraries with published APIs are considered.A second approach would be for the parser to save the token.File in every node of the AST so that there is never any need to maintain or consult an external FileSet data structure: each AST node's (*File, Pos) information would be complete. This approach however requires that every place that today uses a token.Pos to refer to a file position (relative to some implied FileSet) would need to be changed to a (*File, Pos) pair. For example, in errors from the type checker, and in diagnostics from the analysis framework. This seems like a major incompatible change to existing libraries.
The third, and simplest, approach, is to add a
(*FileSet).RemoveFile(*File)
method that removes a File from the set so that subsequent queries such as Pos, File, and Iterate return a negative result---likemunmap
in the mmap analogy. The liberated portion of the address space would not be re-used, but the File's memory could be reclaimed. As with munmap, an application would be responsible for remembering to call RemoveFile at an appropriate time, and not too soon. I propose we take this approach. The change to FileSet is very simple and low-risk.@findleyr @griesemer
The text was updated successfully, but these errors were encountered: