Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP parser explorations #15818

Closed
mousetraps opened this issue Nov 21, 2016 · 14 comments
Closed

PHP parser explorations #15818

mousetraps opened this issue Nov 21, 2016 · 14 comments
Assignees
Labels
php PHP support issues

Comments

@mousetraps
Copy link
Contributor

I recently looked into what ongoing PHP efforts we might want to contribute to, and was especially excited by the ongoing language service efforts from extensions like PHP IntelliSense and Crane. Eventually I zeroed into the parsers underlying these language services, and began investigating what it might look like to have a PHP parser optimized for IDE scenarios (e.g. error tolerance, performance, tree representation, etc.) I've made some good progress, and hope to reach out to the community with more details soon.

This issue tracks progress on that work.

@mousetraps mousetraps self-assigned this Nov 21, 2016
@mousetraps mousetraps added the php PHP support issues label Nov 21, 2016
@felixfbecker
Copy link
Contributor

felixfbecker commented Dec 7, 2016

Does that mean you plan to contribute to https://github.com/nikic/PHP-Parser? I have a couple of issues open (and many already closed) regarding error recovery and indeed every bit there makes PHP IntelliSense work in more cases.

@mousetraps
Copy link
Contributor Author

@felixfbecker yep, that was where I started, but it's exceedingly difficult to retrofit a parser, let alone an auto-generated one, with things like "error tolerance". Over time, you end up with a long tail of esoteric bugs.

There are additional challenges, such as how the tree is represented and performance, that are a bit difficult to manage (I'll enumerate them further if you're interested). So, I began working on a handspun parser (also written in PHP). Right now we're heads down with the VS Code endgame, but I plan to share more details (and there are a lot of details 😉 ) later this month. The code currently lives in a private repo. I'll make it public after a little bit more cleanup, which is needed so that people are able to read the code, understand it, and contribute back to it.

Just to set expectations - it's a proof-of-concept. It should, however, be enough to decide whether there's enough interest to invest further and bring it to production quality or whether we want to change direction.

Once it's public - should be soon - we should check it against the error-tolerance issues you mentioned.

@felixfbecker
Copy link
Contributor

I just don't know if the effort is worth to build a completely new parser from scratch and support it long-term with new syntax etc. when there is already a pretty good one out there that has a lot of support and is used by other big libraries. The improvements nikic did for error recovery where all very small patches in the grammar and I think doing improvements there is a lot easier than writing a completely new parser.

Crane's author stated that his motiviation for not using PHP but TypeScript to parse was that the extension "should work out of the box" and have no dependencies and not PHP runtime requirement. I on the other hand chose PHP to make use of great PHP packages that help the LS. So in what language your new parser is written in will decide who of Crane/PHP IntelliSense is able to use it.
If you write it in PHP, and therefor target my LS, I would much more appreciate some help through PRs to the LS maybe or improvements in nikic's parser. The LS is very coupled to it, and switching the parser will be non-trivial, especially if it would mean a regression in features.

But anyway, appreciate that PHP gets some love from you 👍

@egamma egamma mentioned this issue Dec 20, 2016
56 tasks
@mattacosta
Copy link
Contributor

Well, as someone that has already done a lot of exploration into an IDE extension for PHP back in February/March, I actually agree with @mousetraps on needing a custom implementation.

Initially, I too tried to start with a generated parser. My "little" concept extension needed to work across all systems and not have any dependencies on the installed PHP version, so I needed a JS/TS implementation. That led me to the glayzzle/php-parser project (which the Crane extension uses a fork of). Unfortunately, it didn't support PHP 7 at the time, and there were no other suitable projects to build on. So I went to the source, figuratively and literally.

I looked at how PHP itself worked, and thankfully, PHP 7 used a new generated parser and abstract syntax tree (AST). This meant that it would be relatively simple to use its own grammar file and some of the other packages that the glayzzle project was built on, and then integrate it into vscode. In reality however, I ended up with a TON of shift-reduce, and worse, reduce-reduce conflict warnings from the generator. These are ambiguities that could affect which actions are run, and thereby possibly introduce very difficult bugs to track down later on. After a bit of work, I eventually ended up with a grammar where I had replaced most of the offending rules regarding PHP's expressions. More importantly it had 0 conflicts and by then several hundred tests to show that it was still working as expected (which still isn't full coverage).

Now, having any lexer, parser, and abstract syntax tree is generally enough to implement relatively simple IDE features like IntelliSense on a small scale (although really, even IntelliSense has quirks, like determining the correct type of $this). This was fine for my concept extension, but as you try and implement more and more features like semantic analysis, you inherently need more and more compute time. Unfortunately, you are restricted by the fact that everything needs to ideally finish and update before the next keypress. So if I were to try and go from concept to actual extension, I would have immediately run into several problems.

The first is performance. Neither the generated lexer, nor the generated parser supported methods to only tokenize and parse partial segments of a file. In a large project, you shouldn't parse entire files when someone does a "find and replace all" or "rename symbol". After each change, diagnostics must also be run, and there are a lot of them. Syntax errors are only the beginning.

Speaking of syntax errors, the next major issue would be error tolerance. If a parser choked at the first syntax error it saw as you typed, you would lose any symbols found later in the file and potentially trigger a lot of diagnostic errors. Errors are typically solved in one of two ways. The text can simply be ignored, or an additional fake token can be added to make it valid. For example, take the following text: isVariableOrFunction;. The easiest way for the parser to get around this syntax error is to simply ignore it, just like a comment or whitespace. If an IDE wanted to find the best solution however, it could take into account the cursor position, adding a fake '$' token if the cursor is at the start to make it a variable, or fake '(' and ')' tokens at the end to turn it into a function. Making the right choice for this and other situations can be difficult for generated parsers.

Now, remember how I said comments and whitespace could essentially be ignored? Well, not really. To fully take advantage of an IDE's formatting capabilities, your analyzers need contextual information. Which means that they should really be represented in the tree. Take the semicolons in a for-block. A generic formatting rule of "nothing after a semi-colon except a newline or line comment" would fail here without knowledge of the for statement. Now try making a generated parser account for all the possible combinations of whitespace in various expressions (like a=b, or a = b, or a= b, etc)...

TL;DR: Generated parsers are very fast and excellent for compilers that only need a single pass, but bad for IDEs where files constantly change. Either way, I am also very interested in seeing how your parser works and how applicable concepts used by Roslyn are 😃 .

@bmewburn
Copy link

My story is a lot like @mattacosta. I looked into using Jison to generate a php7 parser by modifying the official PHP zend_language_parser.y grammar file only to be met with many shift-reduce conflicts. I ironed out all those and then started experimenting with using error tokens to recover from parse errors but found this approach very limited.

I ditched the generated parser and wrote a recursive descent parser and regex lexer for php7 and am making progress towards providing another PHP intellisense offering in Javascript/Typescript. It's slow going and in a private repo for now.

I think a fast, error tolerant parser is key to having good code completion and static code analysis and am interested in @mousetraps efforts.

@felixfbecker
Copy link
Contributor

felixfbecker commented Dec 23, 2016

My "little" concept extension needed to work across all systems and not have any dependencies on the installed PHP version, so I needed a JS/TS implementation.

I am thinking bigger scale. A language server should work with many IDEs, not just VS Code. Now, what is a PHP developer more likely to have installed, Node or PHP? Definitely the latter. That is also the language he/she knows and can contribute too. My language server is already used by Eclipse Che, Eclipse (LSP4E) and more in the pipeline, which are all not Node-based.

I looked at how PHP itself worked, and thankfully, PHP 7 used a new generated parser and abstract syntax tree (AST).

The problem with PHP's on parser is that it is only able to parse code of the version it is running.

(although really, even IntelliSense has quirks, like determining the correct type of $this). This was fine for my concept extension, but as you try and implement more and more features like semantic analysis, you inherently need more and more compute time. Unfortunately, you are restricted by the fact that everything needs to ideally finish and update before the next keypress.

My language server has no problem with these

The first is performance. Neither the generated lexer, nor the generated parser supported methods to only tokenize and parse partial segments of a file. In a large project, you shouldn't parse entire files when someone does a "find and replace all" or "rename symbol".

I have not met any performance issues with reparsing the document on every change.

After each change, diagnostics must also be run, and there are a lot of them. Syntax errors are only the beginning.
Speaking of syntax errors, the next major issue would be error tolerance. If a parser choked at the first syntax error it saw as you typed, you would lose any symbols found later in the file and potentially trigger a lot of diagnostic errors. Errors are typically solved in one of two ways. The text can simply be ignored, or an additional fake token can be added to make it valid. For example, take the following text: isVariableOrFunction;. The easiest way for the parser to get around this syntax error is to simply ignore it, just like a comment or whitespace. If an IDE wanted to find the best solution however, it could take into account the cursor position, adding a fake '$' token if the cursor is at the start to make it a variable, or fake '(' and ')' tokens at the end to turn it into a function. Making the right choice for this and other situations can be difficult for generated parsers.

nikic/PHP-Parser has error recovery. It can meet a statement that is not terminated by a semicolon for example, and unfinished variable, etc and still give you an AST. The statement is not simply ignored, you would get a Expr\Variable node with an Error node as the name. There is no need to insert any fake tokens.
You can collect all errors in the document as diagnostics, not just the first one.


Why are you all trying to build new extensions and parsers, instead of contributing to existing open-source efforts (crane, glayzzle/php-parser, nikic/PHP-Parser, vscode-php-intellisense)?

@mattacosta
Copy link
Contributor

@felixfbecker There's nothing wrong with those projects, it's just that we may have different design goals and/or requirements to meet.

Also, it's not that generated parsers can't collect errors, it's that from a user's perspective, some of those errors might show up in inappropriate locations or have confusing messages when shown within an IDE. With custom parsers, it's easier to apply custom logic in order to provide a better user experience. Now that I have read that paragraph again, I probably should have made that a lot clearer.

@darcome
Copy link

darcome commented Jan 13, 2017

Hello!

Just a comment!
I love the work done by felix! But it only works for PHP7 and higher! Would it be possible to make it work also for PHP5? Otherwise a LOT of people will be left out...

Thanks in advance!

@felixfbecker
Copy link
Contributor

@darcome Please note that my extension is fully capable of providing IntelliSense for files written in PHP 5. Just the runtime for the language server needs to be PHP 7. You should be able to install versions side-by-side if you need to and point the LS to use PHP 7.

@mousetraps
Copy link
Contributor Author

Hey all! As promised, the parser is now available here:
https://github.com/Microsoft/tolerant-php-parser

As I mentioned, it's still in its early stages, and we're hungry for feedback, so please please please try it out and let us know what you think!

cc @mattacosta, @bmewburn, @felixfbecker

@darcome
Copy link

darcome commented Jan 18, 2017

Hello,
I am sorry but I don't understand how can I test it on vscode... What is the name of the extension to use? What are its requirements? For example, does it need PHP7?

Thanks in advance,
Luca

@mattacosta
Copy link
Contributor

mattacosta commented Jan 19, 2017

@darcome It's not a full extension on the marketplace yet. It's just under-the-hood code that may eventually power a future extension. As of right now, you can only debug the parser using VSCode. If you want to test out its implementation, you can also play around with the Syntax Visualizer mini-extension to view the abstract syntax tree it creates as well. Also, for performance/memory reasons it does require PHP 7.

@darcome
Copy link

darcome commented Jan 20, 2017

@mattacosta Sorry to bother you again, but how can I do it?
Is there a guide I can follow? Or can you tell me some hints?

Thanks in advance

@mousetraps
Copy link
Contributor Author

mousetraps commented Jan 20, 2017

Hey @darcome - as @mattacosta mentioned, the parser is under-the-hood code that produces a syntax tree from a source code string. This syntax tree represents the lexical and syntactic structure of source code, which enables services (like hover, navigations, code-completions, formatting, refactoring, etc.) to intelligently process it.

There are lots of ways to contribute:

  1. Try building an app that uses the information provided by the parser (e.g. a PHP script or VSCode extension). Sample code is available in the README under the Getting Started section, and you can configure your machine using the instructions here. This will help because you can give us feedback on the parse trees that are generated and also the API. You can also take a look at some of the scripts in the tools folder, which use the parser to perform various tasks (e.g. one of them auto-generates API documentation for the parser using the parser)
  2. Install the VSCode Syntax Visualizer Extension to get a more tangible feel for the AST and also see error squigglies in the editor (you'll also need to run through the steps to install PHP7 as outlined in Getting Started. If you see error squigglies for valid code, then please file an issue because we want to know about it, and that's something we should address. e.g. issue parsing base class-base-clause in object-creation-expression tolerant-php-parser#30.
  3. Make a PR - See the Contributing page for more details. I also included some example PRs to help people ramp up with the codebase.
  4. Your idea here 😃

@kieferrm kieferrm assigned roblourens and unassigned mousetraps May 1, 2017
@vscodebot vscodebot bot locked and limited conversation to collaborators Nov 17, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
php PHP support issues
Projects
None yet
Development

No branches or pull requests

6 participants