Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source file from in-memory string for incremental LSP analysis #1163

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

koliyo
Copy link
Contributor

@koliyo koliyo commented Nov 24, 2023

For LSP server we need to parse/analyze files interactively as they are edited, before being saved to disk. Therefore we need a mechanism to create source file representations from in-memory strings.

There is already support for synthesized source file, which is similar in mechanism, but is not really what we want here.

Additionally, the source files are continuously updated as the user edits the file, so the same filepath will be reused with updated content. Which is why I needed to introduce the replace parameter to the file storage.

Let me know if you think the design should be done differently, but the functionality is very much needed for LSP.

Copy link
Collaborator

@dabrahams dabrahams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, but you can't just add a parameter or a new init without documenting it. Otherwise, how is anyone supposed to review it for correctness?

Also, if you're going to add API, please add tests that exercise it.

@koliyo
Copy link
Contributor Author

koliyo commented Nov 25, 2023

Good point!

I have added some doc comments, and a unit test to verify that an update sourcefile will get the updated content instead of the previously cached content.

Comment on lines 36 to 39
/// Creates an instance representing the in-memory edited file at `filePath`.
/// `withContent` is the content of the in-memory representation of the file,
/// which may be different from what is actually stored on disk.
public init(at filePath: URL, withContent content: String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Creates an instance representing the in-memory edited file at `filePath`.
/// `withContent` is the content of the in-memory representation of the file,
/// which may be different from what is actually stored on disk.
public init(at filePath: URL, withContent content: String) {
/// Creates an instance representing the in-memory `contents` of the file at `filePath`.
///
/// `contents` is the text of the file currently in memory, which may be different from what
/// is actually stored on disk when the file is being edited.
public init(at filePath: URL, withInMemoryContents contents: String) {

Note: contents with a "s" is consistent with how other parameters are named.

Comment on lines 371 to 372
}
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
else {
} else {

I think swift-format won't like the newline anyway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sf allows discretionary newlines. As you know this style often creates harmful vertical density. Why push for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

swift-format will rewrite this. I just checked again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "respectsExistingLineBreaks" : true? I am surprised; that seems like a bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, yeah; this it's not the tool I thought it was.

Copy link
Contributor

@kyouko-taiga kyouko-taiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of inline comments. Except for those, LGTM.

@koliyo
Copy link
Contributor Author

koliyo commented Dec 1, 2023

PR comments have been addressed in last commit

Copy link
Collaborator

@dabrahams dabrahams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from my comments, I'm not convinced this approach is even viable. It introduces shared mutable state, which can cause both data and logical races. Unless there's a simple explanation of why that can't be the case here, we need a different strategy.

Comment on lines 36 to 43
/// Creates an instance representing the in-memory `contents` of the file at `filePath`.
///
/// `contents` is the text of the file currently in memory, which may be different from what
/// is actually stored on disk when the file is being edited.
public init(at filePath: URL, withInMemoryContents contents: String) {
let storage = Storage(filePath, replaceExisting: true) { contents[...] }
self.storage = storage
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what "The in-memory contents of a file" means. IMO a more appropriate API would be

Suggested change
/// Creates an instance representing the in-memory `contents` of the file at `filePath`.
///
/// `contents` is the text of the file currently in memory, which may be different from what
/// is actually stored on disk when the file is being edited.
public init(at filePath: URL, withInMemoryContents contents: String) {
let storage = Storage(filePath, replaceExisting: true) { contents[...] }
self.storage = storage
}
/// Creates an instance with the given `contents`, identified by `fileID`.
///
/// `contents` may not reflect what—if anything—is stored on disk at `fileID`.
public init(contents: String, fileID: URL) {
let storage = Storage(fileID, replaceExisting: true) { contents[...] }
self.storage = storage
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of "in-memory contents of a file" in this case is basically an "uncommited" draft of the next version of the file. But I also agree that withInMemoryContents by itself does not convey this meaning, and simply calling it contents is perhaps better. But not sure that fileID: URL is better than at filePath: URL, the previous one is more consistent with other initializers. Or maybe something like representing filePath: URL, or similar?

But if you think the suggested change is what is appropriate I am totally fine with it, as long as this functionality is provided.

Sources/Core/SourceFile.swift Show resolved Hide resolved
Comment on lines 358 to 363
/// Creates an alias to the instance with the given `url` if it exists, or creates a new
/// instance having the given `url` and the text resulting from `makeText()`.
///
/// Storage instances are cached based on url, where `replaceExisting` defines if an
/// existing entry should be replaced with new content, otherwise the storage get the
/// content from the previously cached entry.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original summary is no longer accurate. Rewriting the summary to make it accurate eliminates the need for any additional information:

Suggested change
/// Creates an alias to the instance with the given `url` if it exists, or creates a new
/// instance having the given `url` and the text resulting from `makeText()`.
///
/// Storage instances are cached based on url, where `replaceExisting` defines if an
/// existing entry should be replaced with new content, otherwise the storage get the
/// content from the previously cached entry.
/// Creates an alias to the instance with the given `url` if it exists, replacing the instance's text with the
/// result of `makeText()` iff `replaceExisting` is `true`, or creates a new
/// instance having the given `url` and the text resulting from `makeText()`.

Making an accurate summary is also instructive: this one is too complicated to be easily understood, which suggests the API is suboptimal. It's fileprivate, so maybe a little complexity is forgivable, but I would probably not try to cram all this functionality into a single initializer.

Actually now that I look, the original summary didn't document lineStarts, so it's worse than that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the suggested rewrite is definitely more concise and instructive. However regarding any major refactoring of the initializer I think could perhaps be done in another issue/PR, at least from my perspective where I do not have full insight into the original intents of the design, eg regarding the additional lineStarts parameter. You guys are also very welcome to push changes to the PR if you have clear ideas for updated implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point about "instructive" is that the activity of writing an accurate summary can tell you things.

any major refactoring of the initializer I think could perhaps be done in another issue/PR, at least from my perspective where I do not have full insight into the original intents of the design, eg regarding the additional lineStarts parameter.

From my perspective you regressed the understandability of this API, and clawing back that regression later is not the way things should be done. I admit the lack of documentation for lineStarts was a defect, so one question you might start by asking is, what would the correct documentation have been for the state you're modifying? If you need insight into the original intent, just ask questions. I'm happy to answer.

Copy link
Contributor Author

@koliyo koliyo Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point about "instructive" is that the activity of writing an accurate summary can tell you things

Ah, yes that is a very good point, I will definately keep this in mind going forward, not just in hylo context.

And sorry for PR lacking basic documentation from the start. The LSP project I'm working on is very much a prototype, and I did not properly switch mindset when preparing the PR for hylo, which is production codebase albeit in an early state.

And I very much appreciate that you are making the effort to uphold good coding standards in the hylo codebase!

but I would probably not try to cram all this functionality into a single initialize

Ok, let's make an attempt at improving the situation.

First of all, I am new to the swift language, and are not really familiar with some of the idioms and best practices. But this is a good opportunity to improve.

So, looking at this, I don't really see how to break up the initializer, at least if we want the Storage class members to have let bindings. One possibility is approaching the Storage class as an immutable struct, where modifications would return a new Storage instance, but not sure that is what we want here.

Also, what is it that we can break out of the initializer? makeText is dependent on replaceExisting, they can not be separated. Possibly lineStarts could be separated out?

I see two main alternatives:

  • Use some composite parameter instead of individual parameters
  • Use a builder pattern

A composite parameter could be some kind of protocol, eg

protocol StorageInitializer {
  var lineStarts: [SourceFile.Index]? { get }
  var replaceExisting: Bool { get }
  func makeText() throws -> Substring
}

with updated initializer

fileprivate convenience init(_ url: URL, initializer: StorageInitializer) rethrows {
  //...
}

The other main alternative is a builder pattern, something like:

let storage = StorageBuilder(replaceExisting: true)
    .withText(codeString)
    .build()

// Or 
let storage = StorageBuilder(replaceExisting: false)
    .withText { try String(contentsOf: filePath)[...] }
    .build()

@koliyo
Copy link
Contributor Author

koliyo commented Dec 8, 2023

Aside from my comments, I'm not convinced this approach is even viable. It introduces shared mutable state, which can cause both data and logical races. Unless there's a simple explanation of why that can't be the case here, we need a different strategy.

So the race condition is that if we do concurrent compilations interleaved with updated source file content we might get unexpected input to the compilation?

In the general case that may be true, but at least for the LSP case the updates are sequenced. In the same way the content of disk is a unique representation, (ie not multiple contents with the same filepath), the same is true in the LSP session, a filepath can not have two simultaneous in-memory content representations. The state of the file content is an ordered sequence of updates according to the LSP protocol.

Not really sure how to solve this for the generic case, or if the generic case would even be useful. So maybe constraining the API so it is less generic is the right way. Specifically perhaps we should bypass the storage all together for this "uncommited" file state, it should simply be a non-shared SourceFile instance, that is not backed by the storage cache at all?

@dabrahams
Copy link
Collaborator

So the race condition is that if we do concurrent compilations interleaved with updated source file content we might get unexpected input to the compilation?

In fairness, I misread the change so while there is a race, it's more subtle than I thought; the problems will manifest more clearly as a dangling reference. We'll get to that in a moment…

The general point is that everything in Hylo's source is written with the assumption of value semantics: each copy of an instance has an independent value that can't be changed except through that instance. Violating that property can result in races and spooky action at a distance in existing code, which that code hasn't been written to account for. Any sharing of mutable state among instances can cause havoc.

In the case of SourceFile, which gets copied pervasively by virtue of being embedded in SourceRange its storage is a shared reference, and we uphold value semantics by never mutating the shared data. Now, you aren't mutating the stored properties of any Storage instance, and no reference semantics is introduced simply by the replacement of a reference-to-immutable in a value type.

However, the shared reference to storage is unowned, and as the documentation says, the referenced instance is owned by Storage.allInstances. When you replace an instance of Storage in Storage.allInstances, that instance is destroyed (and that destruction can race with any read of a storage instance). More obviously, any extant unowned references are now dangling. Accesses through an unowned reference are unsafe and we're into undefined behavior—it's full-on nasal demon time!

In the general case that may be true, but at least for the LSP case the updates are sequenced.

One of the main reasons for building Hylo, and for building its source the way we have is to define away this kind of issue, among other things so we can freely parallelize anything without worrying about non-local properties. We're not willing to have any types in Hylo's sources that are not as threadsafe as Int.

Now you could say, "great, let's just get rid of the unowned qualification on storage." We would pay a performance penalty for the associated reference counting but that would certainly restore value semantics and thus eliminate the race and the dangling. However, I have a much more fundamental concern: the system is written with the assumption that every SourceFile instance having the same url is equivalent. That's why when we create instances that are fragments of actual files we go to the trouble to build a unique url for each one. I don't know how much code actually depends on this assumption, but we can't simply violate it and hope for the best. If you are going to weaken the invariants of the system, you need to do an investigation and show that the existing code doesn't depend on the former invariants or provide the adjustments to keep it working.

perhaps we should bypass the storage all together for this "uncommited" file state, it should simply be a non-shared SourceFile instance, that is not backed by the storage cache at all?

I'm not really sure, because I don't know where or how you're planning to use this thing. The data structures such as AST and TypedProgram resulting from parsing and typechecking, which I assume are what LSP needs to consume, are immutable once constructed—we don't have any facility for updating them in place and making changes to them without violating their integrity is a solvable-but-currently-unsolved problem that would mostly likely require some major data structure changes.

It wouldn't surprise me a bit if supporting LSP well requires some fundamental changes to our data structures—to which I'm not at all opposed. That's why I think we need to step back and take a collective look at the problems.

@kyouko-taiga
Copy link
Contributor

That's why I think we need to step back and take a collective look at the problems.

How shall we have this collective look? Do we want to setup an online meeting or would you like to prefer doing things asynchronously. I am fine either way but I don't really know where to start.

@dabrahams
Copy link
Collaborator

Maybe @koliyo could give us (or point us to) a concise description of what is needed for LSP support, and from there we can decide how to proceed?

@koliyo
Copy link
Contributor Author

koliyo commented Dec 15, 2023

Maybe @koliyo could give us (or point us to) a concise description of what is needed for LSP support, and from there we can decide how to proceed?

Summary:
The LSP needs to repeatedly analyze source files, interactively, with incremental updates to files from unsaved text buffers, identified by their original filepath.

Some additional context:
From the point of view of the IDE with interactive code feedback, the content of the text buffer is what defines the file, even if it is not saved to disk. And this content is identified by the filepath of the file being edited.

As an IDE user I expect the editor to respond to the current state of the buffer, and preferrably with as low latency as possible, eg on each keystroke. My main reference for this, is the Roslyn/C# LSP which I think does a great job, and is very responsive.

The LSP server receives textDocument/didChange updates with incremental changesets to the file. The LSP server can reconstruct the content of the IDE buffer by caching the original content, and incrementally applying the changesets as they arrive.

I think it is important that we can use the filepath as identifier for the un-saved buffer. This makes it much easier to deal with the results from hylo anaylsis, traversing AST etc. If we need to differentiate between buffers from disk and unsaved in-memory buffers, that will add additional complexity to the LSP analysis, which we do avoid when we can identify unsaved buffers by their original filepath.

The LSP protocol is designed so the server is active very passively and responds to requests. So after a didChange event there are additional requests explicitly from the LSP client/editor, eg textDocument/documentSymbol to get symbols in the document, and requests for diagnostics, etc. And this is based on the capabilities registered by the LSP server, what it provides in terms of LSP functionality.

However, the shared reference to storage is unowned, and as the documentation says, the referenced instance is owned by Storage.allInstances. When you replace an instance of Storage in Storage.allInstances, that instance is destroyed (and that destruction can race with any read of a storage instance). More obviously, any extant unowned references are now dangling. Accesses through an unowned reference are unsafe and we're into undefined behavior—it's full-on nasal demon time!

Ah, did not realize the unowned setup, and the possible implicit dangling pointers from replacing storage buffers in the storage cache. I definitely see this could be a problem! So, yes, I did underestimate the impact of this proposed change. And perhaps the LSP requirements does indeed necessitate some more fundamental structural changes as you say.

And I also think it is wise to, at this early point, make that the compiler framework is well suited for interactive LSP usage. I think this is a very important aspect of a programming language, from the user perspective.

@koliyo
Copy link
Contributor Author

koliyo commented Dec 15, 2023

The other main alternative I see is to not use the filepath to identify the usaved buffers. And instead use some kind of unique identifiers, updated for each change, eg with a running index:

file:///path/to/file.hylo // Original file on disk
file:///path/to/file.hylo@v1
file:///path/to/file.hylo@v2
file:///path/to/file.hylo@v3
file:///path/to/file.hylo@v4

This could then be managed by one additional indirection in the LSP, using a simple lookup table, from filepath to the current "head" version. eg file:///path/to/file.hylo -> file:///path/to/file.hylo@v4.

That is definitely possible.

The upside is that this would not require any direct changes to the current hylo structure and this PR is no longer needed.

Edit: Hmm, but this would perhaps not work? Would we get the right content in the hylo parsing? If eg one hylo file has import of another hylo file/module, the import would then find the original disk-based storage and not the versioned in-memory storage? Or would it work? Please enlighten me

@dabrahams
Copy link
Collaborator

The LSP server receives textDocument/didChange updates with incremental changesets to the file. The LSP server can reconstruct the content of the IDE buffer by caching the original content, and incrementally applying the changesets as they arrive.

Please keep in mind that I don't understand how a whole LSP-based system is supposed to fit together. Are you implementing an LSP server for Hylo, or are you implementing some part that plugs into an existing thing that is an LSP server? Is the thing you have to implement responsible for responding to these changesets, or is something else updating a cached in-memory representation of complete source files and passing those representations to your component? These are the kinds of questions implied by “what is needed for LSP support.” I want to know in general, if I was going to supply LSP support (as you are trying to do) for an arbitrary new language, what would I build and what specification would it have to satisfy? Am I building a library? An executable? More than that? How does that component interact with an IDE?

If there's a simple introductory article somewhere that I should read to get my bearings, I'd be happy to do that.

@maurobalbi
Copy link

maurobalbi commented Dec 19, 2023

I think this post breaks it down pretty nicely: https://nshipster.com/language-server-protocol/.

@koliyo
Copy link
Contributor Author

koliyo commented Dec 22, 2023

Please keep in mind that I don't understand how a whole LSP-based system is supposed to fit together. Are you implementing an LSP server for Hylo, or are you implementing some part that plugs into an existing thing that is an LSP server? Is the thing you have to implement responsible for responding to these changesets, or is something else updating a cached in-memory representation of complete source files and passing those representations to your component? These are the kinds of questions implied by “what is needed for LSP support.” I want to know in general, if I was going to supply LSP support (as you are trying to do) for an arbitrary new language, what would I build and what specification would it have to satisfy? Am I building a library? An executable? More than that? How does that component interact with an IDE?

I am building an LSP server for Hylo, ie a server that conforms to the LSP protocol and does analysis of Hylo code using the Hylo compiler. So this is specifically for Hylo, and not generic language handling. Here is the repository:

https://github.com/koliyo/hylo-lsp

This code is responding to changesets, builds a in-memory representation of the edited file (which should be identical to what is inside the IDE buffer), and then use Hylo compiler to analyze the code.

The release artifact is a binary executable, that communicates with the LSP client, which normally is an IDE, via JSON RPC messages, where the data communication channel typically is stdio, or local unix socket (pipe).

This is then integrated into VSCode via a pretty thin wrapper in this repo:

https://github.com/koliyo/hylo-vscode-extension

So when I say “what is needed for LSP support.” in this PR, I mean specifically for an LSP server that process hylo files.

The generic LSP protocol handling for Swift that I am using is implemented here:

https://github.com/ChimeHQ/LanguageServer

This is not for analyzing Swift source code files, it is a generic LSP protocol handler implemented in Swift.

And this repo does not have anything to do with hylo, and is not affected by this PR.

Sidenote, Apples sourcekit-lsp for Swift analysis is completely unrelated to all of this (although related in the sense that it is an LSP server implementation)

@koliyo
Copy link
Contributor Author

koliyo commented Dec 22, 2023

Also, I realize the usage of the LSP terminlogy is a bit misused in this context. If we are strict about it then LSP is the protocol, which is language agnostic. But when I say hylo-lsp, i mean the LSP server that analyze hylo code.

Perhaps this is unnecessary confusion due to poor naming on my part. Better project name might be:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants