go/token: add FileSet.RemoveFile method #53200

adonovan · 2022-06-02T13:52:59Z

Background: A token.FileSet holds a collection of non-overlapping token.Files and provides a mapping from token.Pos integers to position information (file/line/column/offset). This is analogous to the way an address space containing several non-overlapping file mappings defines the meaning of a numeric pointer value. Applications that parse Go files are expected to use a single FileSet to provide meaning to all the token.Pos values in all the ASTs; in this way, the AST nodes themselves needn't include a pointer to the mapping information, as it can be implicitly supplied by the contextual FileSet.

The problem: this design leads applications to create a single FileSet that is effectively a ubiquitous global variable. A long-running application may encounter and parse an unbounded stream of files, calling AddFile for each one, causing its set of Files to grow without bound. The FileSet retains each File indefinitely, even after the application no longer cares about it. The memory footprint of a File is approximately one int per line of the file. Go source files in the standard library have an average of 371 lines, leading to 3KB of wasted space per file. Our application (gopls) parses a file at least once per keystroke of a Go developer's editor.

I can think of a few ways the memory usage of FileSet could be reduced.

The first approach is to move away from the concept of a single central FileSet in an application. Instead one would create a new FileSet for every parse, so that FileSets and Files are typically in 1:1 correspondence. An application could, as needed, create transient FileSets containing only the small subset of existing Files necessary for a given operation, such as invoking the type checker. However, this would still require all those Files to be non-overlapping, which requires central management of the "next free address" variable (like sbrk(2) in the mmap analogy) that should be provided in each call to FileSet.AddFile(base). It would also require that every call to Parse go through this central management, which might be practical within an application but doesn't seem feasible if libraries with published APIs are considered.

A second approach would be for the parser to save the token.File in every node of the AST so that there is never any need to maintain or consult an external FileSet data structure: each AST node's (*File, Pos) information would be complete. This approach however requires that every place that today uses a token.Pos to refer to a file position (relative to some implied FileSet) would need to be changed to a (*File, Pos) pair. For example, in errors from the type checker, and in diagnostics from the analysis framework. This seems like a major incompatible change to existing libraries.

The third, and simplest, approach, is to add a (*FileSet).RemoveFile(*File) method that removes a File from the set so that subsequent queries such as Pos, File, and Iterate return a negative result---like munmap in the mmap analogy. The liberated portion of the address space would not be re-used, but the File's memory could be reclaimed. As with munmap, an application would be responsible for remembering to call RemoveFile at an appropriate time, and not too soon. I propose we take this approach. The change to FileSet is very simple and low-risk.

@findleyr @griesemer

The text was updated successfully, but these errors were encountered:

gopherbot · 2022-06-02T13:56:48Z

Change https://go.dev/cl/410114 mentions this issue: go/token: add (*FileSet).RemoveFile(*File) method

adonovan · 2022-06-02T14:16:14Z

A fourth approach (a variant on the first): FileSet.base is replaced by an actual global variable in the token package, shared by all FileSets; the base value provided to FileSet.AddFile is ignored. This ensures that all token.Files are inherently non-overlapping, permitting any subset of them to be collected in a transient FileSet. In other words, any collection of FileSets would be consistent in their (successful) Pos->Position mappings, even if they represent different subsets of files. This solves the "central management" problem.

In theory, a downside of this approach is that the global counter would grow faster than each individual FileSet.base, but (a) in practice today there is only one FileSet in the application (see above!), and (b) even if that were not the case, we would still be in no danger of overflowing an int---not on a 64-bit machine at least.

timothy-king · 2022-06-02T17:27:26Z

Another variant of the first could be to explicitly share a base value between FileSets. The FileSet could carry this around. The existing FileSet APIs could start from an unshared base. This is essentially just the fourth option but pass around a [thread safe?] *int on FileSet creation instead of using a global. Like the fourth option the really safe way to share a base with multiple competing parallel FileSets is to pass in a negative base to AddFile.

Not sure either this or the fourth option would be okay on 32-bit machine for long running processes like gopls.

adonovan · 2022-06-02T17:59:13Z

Another variant of the first could be to explicitly share a base value between FileSets. The FileSet could carry this around. The existing FileSet APIs could start from an unshared base. This is essentially just the fourth option but pass around a [thread safe?] *int on FileSet creation instead of using a global. Like the fourth option the really safe way to share a base with multiple competing parallel FileSets is to pass in a negative base to AddFile.

The value of approach 4 is it solves the "central management" problem without needing to update every caller of Parse, whereas the explicit-shared-based approach would still require each caller of Parse to participate.

Not sure either this or the fourth option would be okay on 32-bit machine for long running processes like gopls.

On a 32-bit machine, gopls would run out of memory long before it could even attempt to parse 2GiB worth of Go source files. ;-)

timothy-king · 2022-06-02T19:32:55Z

My understanding is that we are worried about 1 long running process (gopls) creating a File per keystroke and exhausting the available Pos values. If this understanding is wrong, the rest of this comment won't mean much.

I did a back of the envelope estimate, which came out to about 16 hours of work to overflow a global base on a 32-bit machine. 1<<31 Positions * 1/(11074 Positions/AddFile)* 1 keystroke/AddFile * 1/(200 keystroke/minute) which is roughly 16 hours. 11074 comes from the average file size in the Go standard library. 1 keystroke/AddFile comes from my understanding of "parses a file at least once per keystroke of a Go developer's editor". 200 keystrokes/minute is a guess from CPM type speed estimates (https://www.toptenreviews.com/what-is-a-good-typing-speed). Unless the assumptions are +1 order of magnitude off, seems like not enough time for a gopls process.

adonovan · 2022-06-02T20:24:59Z

My understanding is that we are worried ... exhausting the available Pos values.

Your analysis is correct but, no, I'm not at all worried about running out of address space. No-one has ever worked at 200 keystrokes/minute for 16 hours straight, and I'm pretty sure gopls would crash at least once if they tried. It also needs more than 2GB of RAM for most workloads, and no laptop is a 32-bit machine.

What I'm worried about is actual memory consumed by the newline index tables.

findleyr · 2022-06-02T20:30:07Z

For arguments sake, keep in mind that it's not the average file size that matters, but the size of the file that is changing. I am sure we could find a real world example of a very large file in an otherwise small workspace, that exhausts the 32 bit address space in under an hour (or perhaps a few minutes!). But as Alan suggests, this development scenario is unlikely for a variety of reasons, and in practice such a workflow would encounter other problems.

rsc · 2022-06-15T17:47:33Z

Adding to minutes. It does seem like there are many many parts of the go/* packages that assume batch mode and are stressed near breaking by interactive use. I wonder if we should be looking at all of them together instead of just one API at a time.

rsc · 2022-06-15T18:35:53Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

findleyr · 2022-06-23T20:44:12Z

It does seem like there are many many parts of the go/* packages that assume batch mode and are stressed near breaking by interactive use.

That's definitely true, but I think we should decouple this proposal from larger changes. It will be a long time before we can stop using token.FileSet in gopls, and yet this unavoidable memory leak has very real consequences for our users today. (Consider that, while working on a large file with ~10k lines, gopls can leak 100kb per keystroke).

rsc · 2022-07-01T17:40:08Z

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

rsc · 2022-07-13T18:40:40Z

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The design of FileSet encourages it to be used as a global variable. Each call to AddFile consumes about 3KB, that is never returned, even after an application no longer cares about the File. This change adds a RemoveFile method that a long-running application can use to release a File that is no longer needed, saving memory. Fixes golang/go#53200 Change-Id: Ifd34d650fe0d18b1395f922a4cd02a535afbe560 Reviewed-on: https://go-review.googlesource.com/c/go/+/410114 Auto-Submit: Alan Donovan <[email protected]> Run-TryBot: Alan Donovan <[email protected]> Reviewed-by: Robert Griesemer <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Robert Findley <[email protected]>

adonovan assigned griesemer Jun 2, 2022

adonovan added the Proposal label Jun 2, 2022

ianlancetaylor changed the title ~~go/token: FileSet grows without bound~~ proposal: go/token: add FileSet.RemoveFile method Jun 8, 2022

gopherbot added this to the Proposal milestone Jun 8, 2022

rsc added the Proposal-FinalCommentPeriod label Jul 1, 2022

rsc changed the title ~~proposal: go/token: add FileSet.RemoveFile method~~ go/token: add FileSet.RemoveFile method Jul 13, 2022

rsc added the Proposal-Accepted label Jul 13, 2022

rsc modified the milestones: Proposal, Backlog Jul 13, 2022

rsc moved this to Accepted in Proposals Aug 10, 2022

rsc added this to Proposals Aug 10, 2022

adonovan self-assigned this Aug 15, 2022

gopherbot closed this as completed in a45bbea Aug 16, 2022

prattmic mentioned this issue Dec 6, 2022

api: audit for Go 1.20 #57126

Closed

golang locked and limited conversation to collaborators Aug 16, 2023

gopherbot added the FrozenDueToAge label Aug 16, 2023

rsc removed this from Proposals Aug 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

go/token: add FileSet.RemoveFile method #53200

go/token: add FileSet.RemoveFile method #53200

adonovan commented Jun 2, 2022 •

edited

Loading

gopherbot commented Jun 2, 2022

adonovan commented Jun 2, 2022 •

edited

Loading

timothy-king commented Jun 2, 2022

adonovan commented Jun 2, 2022

timothy-king commented Jun 2, 2022

adonovan commented Jun 2, 2022

findleyr commented Jun 2, 2022

rsc commented Jun 15, 2022

rsc commented Jun 15, 2022

findleyr commented Jun 23, 2022

rsc commented Jul 1, 2022

rsc commented Jul 13, 2022

go/token: add FileSet.RemoveFile method #53200

go/token: add FileSet.RemoveFile method #53200

Comments

adonovan commented Jun 2, 2022 • edited Loading

gopherbot commented Jun 2, 2022

adonovan commented Jun 2, 2022 • edited Loading

timothy-king commented Jun 2, 2022

adonovan commented Jun 2, 2022

timothy-king commented Jun 2, 2022

adonovan commented Jun 2, 2022

findleyr commented Jun 2, 2022

rsc commented Jun 15, 2022

rsc commented Jun 15, 2022

findleyr commented Jun 23, 2022

rsc commented Jul 1, 2022

rsc commented Jul 13, 2022

adonovan commented Jun 2, 2022 •

edited

Loading

adonovan commented Jun 2, 2022 •

edited

Loading