Offline spell checker for VSCode #20266
I'd personally like to see spell checking using |
Do I understand correctly that by native you mean implemented in JavaScript/TypeScript? I understand your point of view, and I would probably prefer it too. But please have a look at the Hunspell GitHub page and consider how many years of active development it took to get it where it is now. I doubt that someone could just "rewrite" it. Anyway, for the time being the existing spell checkers are problematic, and this may push away, e.g., people who use VSCode to write technical docs, LaTeX papers etc. Having node-spellchecker accessible with compatible binaries, which can only realistically be achieved by bundling it with VSCode, would cure not just most but all of the above-mentioned issues. Later, when a native API of comparable quality appears, you can switch, and it will not cause much trouble for extension developers because it is only a handful of calls. |
I mean native as in platform; talking to macOS, Windows, Linux services if available, and falling back to another implementation if not. |
That's EXACTLY what node-spellchecker module does! |
Somehow I got the impression that you tend to have everything in pure JavaScript and that native module support is not going to appear anytime soon. Sorry about the misinterpretation. |
Badly worded on my part 😄 |
First of all, sincere thanks to @bartosz-antosik: your work is really thorough and an excellent guide to spell checking. I spent some time investigating this feature this iteration, and here are my thoughts and todo items.
How we ship it
Firstly, spell checking should run in a separate process, without blocking the core or extension host pipeline. Secondly, there are two ways to talk to native code: a Node native module, or a standalone script with an interactive console. The former is easy to do; the only catch is that you have to recompile every time the Node/V8 version changes. To fix that, we just need to put the extension into Code's folder, either in core or in our builtin extension folder. The benefit is obvious: we don't need to talk to C++ code painfully and can always stay inside NodeJS. But there are several issues that should be taken care of before we do that
The second way to solve this problem is running a standalone interactive script, which talks to the system API and is compiled for each architecture/platform. Then our NodeJS code, either core or an extension, can talk to it through standard IO or, even better, a socket. The script will run in a new NodeJS process, and we can easily make all the spell checking async. I start with the second solution.
Even if this problem is solved perfectly, we still get quite a few issues around the experience and maturity of spell checking on different platforms, including but not limited to:
Spell Check API
On macOS and Windows (8 and above), the system provides builtin spell check support. Their behaviors vary, but they both support the following common functionalities
In addition to the above features, macOS, Hunspell and Windows disagree with each other on several APIs:
Ignore word
Conclusion: On Windows and Hunspell, ignore words temporarily, and each time we initialize a spell check process, set the ignore list on the fly. As on macOS you can always remove words from the dictionary, let's trust it.
Builtin language support
What's the experience of setting up dictionaries for another language which has no builtin support?
How to spell check text which contains multiple languages, automatically? System already has some native support, but they behave differently and the experience is not charming.
Dictionaries
Both Chrome and Firefox ship with an en-US dictionary (for English users). Chrome will download any dictionary users require (see https://cs.chromium.org/chromium/src/chrome/browser/spellchecker/spellcheck_hunspell_dictionary.cc?dr=C&q=chrome/dict&l=238 ), and Firefox fetches dictionaries from https://dxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#77.
Conclusion: Ship with en-US (because most of the time you are coding in English) and maybe ship with one user-preferred language (for example, maybe one day users can get a Chinese version of VS Code directly and it has a Chinese dictionary builtin). For other requests, provide a stable, highly available dictionary download service. Atom now downloads dictionaries from Google's service (which is used by Chrome); however, that service is not available in some countries and regions.
Exception list/known Words
Spell Checker
We can ship Hunspell on all platforms, and users can choose whether to use Hunspell or not.
Settings
Open questions about how we define the settings for spell checking.
|
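The second approach above (a helper process that NodeJS code talks to over standard IO) needs some framing for requests and responses. Below is a minimal sketch, assuming a line-delimited JSON protocol. The message shape (`id`, `op`, `word`) and the names `encodeRequest` and `ResponseRouter` are hypothetical and are not taken from VS Code, Atom or node-spellchecker.

```typescript
// Hypothetical wire format: one JSON object per line in each direction.
type SpellRequest = { id: number; op: "check" | "suggest"; word: string };
type SpellResponse = { id: number; ok: boolean; suggestions?: string[] };

// Serialize a request as a single newline-terminated line.
function encodeRequest(req: SpellRequest): string {
  return JSON.stringify(req) + "\n";
}

// Buffers raw output chunks from the helper process and resolves pending
// requests as complete lines arrive. A chunk may end in the middle of a line.
class ResponseRouter {
  private buffer = "";
  private pending = new Map<number, (res: SpellResponse) => void>();

  // Register interest in the response for a given request id.
  waitFor(id: number): Promise<SpellResponse> {
    return new Promise<SpellResponse>((resolve) => {
      this.pending.set(id, resolve);
    });
  }

  // Feed a raw chunk; returns how many pending requests were resolved.
  feed(chunk: string): number {
    this.buffer += chunk;
    let dispatched = 0;
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      const line = this.buffer.slice(0, newline);
      this.buffer = this.buffer.slice(newline + 1);
      if (line.trim().length > 0) {
        const res = JSON.parse(line) as SpellResponse;
        const resolve = this.pending.get(res.id);
        if (resolve) {
          this.pending.delete(res.id);
          resolve(res);
          dispatched++;
        }
      }
      newline = this.buffer.indexOf("\n");
    }
    return dispatched;
  }
}
```

In a real setup the chunks would presumably come from the `data` events of the child process's stdout, and `encodeRequest` output would be written to its stdin. Because every call is matched by `id`, the caller never blocks while the helper is busy.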
Thanks @rebornix for the kind words and the analysis, which I like a lot. I would like to respond to a few points of your analysis, as it looks like I may not understand one or more things. Excuse me if I am far off at points; I know very little about node.js and the whole environment.
Synchronous/Asynchronous Interface
About this sync/async interface: are events (e.g. onDidOpenTextDocument, onDidChangeTextDocument, onDidChangeVisibleTextEditors) asynchronous or not? If they are, then why bother whether node-spellchecker's interface is or is not? If they are not, then not only the spell checking engine should be asynchronous, but also all the extension code that reacts to events, parses text, selects the parts to spell check and calls the engine, should it not? What takes time in spell checking is parsing a document, possibly a large one, and eliminating parts that should not be checked (think of LaTeX commands, or parts of code that should be skipped in order to check only comments and strings etc.). And I reckon it should be left up to the extension, not the speller, to decide what to do with a particular document type.
There is one more thing to consider here: word lookup is quick, suggestions are slow. The spell checkers that I have used are quick to look a word up to test whether it is spelled correctly, and slow (over 10 times slower on average) to produce suggestions. The current approach, e.g. in my spell checker extension, is to spell check and feed the diagnostic collection with suggestions, plus there is an option to just signal misspelled words and look up suggestions on the provideCodeActions event.
So do I understand correctly that either all parts of the process should be async, or it does not matter much whether the spelling engine is?
Ignoring Words
About custom/known/ignored words: I would consider offloading this to the extension! I don't know about the rest of the world, but I would love them to be manageable like the rest of VSCode's configuration. All three (MS/iOS/Hunspell) place them no one knows where, and it is an additional pain to transfer them to another location or manage them in the context of the document type.
Language Scope in a Document
I like the idea of multiple languages inside one document a lot. It seemed crazy to me at first, but the more I think about it, the more doable it seems. The only way I can think of, though, is content/comment-driven language switching. Again, the extension should decide about this, as this information can, for instance, be extracted from a LaTeX document in quite a different way than from another document type. |
@bartosz-antosik thanks for your reply. About the async/sync problem: I'm referring to function calls into native code, which are not async right now. But it's not a problem, as in NodeJS we can always use
Word Lookup/Suggestions
I like your idea of separating word lookup from generating suggestions, and thanks again for your perf testing. Postponing suggestion lookup to the code action provider makes sure we only do the minimal calculation. And you are right, this can be an option; the only catch is that users can't get a general view of suggestions for misspelled words in the Problems view. Another thing about perf is where to do the calculation: doing all the math in native code can be faster, but sending a large portion of data to native code can cost time as well. We need good testing to find the balance.
Ignoring words
The system spell checker stores the ignored words on the fly, and yes, we'll hide them from users.
Multi language
macOS has its in-house language detection, which works reasonably well for me, but Windows doesn't. Comments, strings and technical documents are the most likely cases where users may need multi-language support. We can either switch languages automatically, or maybe even spawn multiple spell check processes for different languages. |
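The split discussed above (cheap lookup while scanning, expensive suggestions only on demand) can be sketched independently of the VS Code API. In this sketch the `SpellEngine` interface, the `LazySpellChecker` class and the word-splitting regular expression are all assumptions made for illustration; a real extension would call `suggestionsFor` from its code action provider.

```typescript
// Hypothetical engine interface; a real one might wrap a native module
// or a helper process.
interface SpellEngine {
  isCorrect(word: string): boolean; // cheap: dictionary lookup
  suggest(word: string): string[];  // expensive: candidate generation
}

interface Misspelling {
  word: string;
  start: number; // offset of the first character
  end: number;   // offset just past the last character
}

class LazySpellChecker {
  private engine: SpellEngine;
  private cache = new Map<string, string[]>();

  constructor(engine: SpellEngine) {
    this.engine = engine;
  }

  // Fast pass: flags misspelled words and never asks for suggestions.
  scan(text: string): Misspelling[] {
    const found: Misspelling[] = [];
    const wordPattern = /[A-Za-z']+/g; // simplistic tokenizer, an assumption
    let match: RegExpExecArray | null;
    while ((match = wordPattern.exec(text)) !== null) {
      const word = match[0];
      if (!this.engine.isCorrect(word)) {
        found.push({ word, start: match.index, end: match.index + word.length });
      }
    }
    return found;
  }

  // Slow pass: run only when the user asks for fixes; results are cached
  // so a repeated request for the same word costs nothing.
  suggestionsFor(word: string): string[] {
    let cached = this.cache.get(word);
    if (cached === undefined) {
      cached = this.engine.suggest(word);
      this.cache.set(word, cached);
    }
    return cached;
  }
}
```

The trade-off is the one already noted in the thread: a list built from `scan` alone can show where the misspellings are, but not what the suggested replacements would be.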
Hello, I'm the author of the Code Spell Checker extension and the cspell linter (used by the extension).
Why
I did not intend to write a spell checker. I wrote it because I needed one that worked with source code and didn't find a built-in checker. So the fact that you are considering having a spell checker built in is wonderful. It would have saved me a bunch of effort. :-) To be honest, it was a fun exercise. It needed to load fast and execute fast. It needed to limit memory consumption and work with very large dictionaries. Spelling suggestions needed to be quick and applicable. Importantly, I wanted it to run on all platforms. I was able to achieve all of these things.
How it works
I did not choose any of the Hunspell solutions due to speed and memory concerns. The Hunspell format is designed for compact representation of words with common prefix and suffix patterns. The Hunspell .dic and .aff files are deliberately easy to extend with words by hand. The format is not designed for easy lookup or searching, which is why the open source JavaScript solutions are very slow and use a lot of memory. Instead, I wrote a Hunspell file reader that outputs all the word combinations. This list of words is compiled into a compact format designed for lookup speed and for calculating suggestions. At its core is a Trie, which is optimized into a Deterministic Acyclic Finite State Automaton. This process of compiling is rather expensive, which is why it is done offline and only the compiled dictionaries are shipped with the extension.
Word Lookup and Suggestions
Word lookup is O(m), where m is the length of the word. It is a very simple process of walking the Trie. Suggestions are done using a modified Levenshtein algorithm that minimizes recalculation and culls candidates by not walking down branches in the Trie whose minimum possible error is greater than the allowed error threshold.
Things to consider
Most of the work was not writing the spell checker. Checking words and making spelling suggestions is rather easy. Most of the work came from the configuration options. Where possible, the system is configuration driven. Each programming language has its own combination of dictionaries and settings. In the linter fashion, the spell checker also allows for in-code flags and settings.
Programming Language Dictionaries
I ended up creating dictionaries that included keywords and common symbols for several programming languages. These dictionaries can be combined based upon the context. For example, a .cpp file will use the following dictionaries: cpp, companies, softwareTerms, misc, filetypes, and wordsEn. As you can see, I even needed a dictionary for common software terms, because standard Hunspell dictionaries do not include most software terms.
Programming Language Grammar awareness
I did not make my spell checker aware of the programming language grammar or syntax. There are some really cool things that are possible, like having strings be in French while the code is in English and the comments are in Spanish, or not spell checking 3rd party imports. Yet, I found this more work than I had time to spend. As an extension writer, I was wishing for access to the language grammar used by the colorizers.
Linter Style
I think it is worth noting that a spell checker is usable in a Continuous Integration environment. Anyplace you might want to use tslint, a spell checker might be useful too. |
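The lookup and suggestion scheme described under "Word Lookup and Suggestions" can be illustrated with a small sketch. It makes two simplifying assumptions: it uses a plain trie rather than the optimized automaton the comment describes, and a textbook row-by-row Levenshtein computation rather than the author's modified algorithm. The class and method names are hypothetical and are not cspell's API.

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

class Trie {
  private root = new TrieNode();

  insert(word: string): void {
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  // O(m) lookup: one step down the trie per character of the word.
  has(word: string): boolean {
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const next = node.children.get(word.charAt(i));
      if (!next) return false;
      node = next;
    }
    return node.isWord;
  }

  // All words within maxDistance edits, closest first (ties alphabetical).
  suggest(word: string, maxDistance: number): string[] {
    const results: { word: string; distance: number }[] = [];
    const firstRow: number[] = [];
    for (let i = 0; i <= word.length; i++) firstRow.push(i);
    this.root.children.forEach((child, ch) => {
      this.walk(child, ch, ch, word, firstRow, maxDistance, results);
    });
    results.sort((a, b) =>
      a.distance - b.distance || (a.word < b.word ? -1 : a.word > b.word ? 1 : 0));
    return results.map((r) => r.word);
  }

  // Extends the edit-distance table by one row per trie edge, so rows are
  // shared by every word under the same prefix.
  private walk(node: TrieNode, ch: string, prefix: string, target: string,
               prevRow: number[], maxDistance: number,
               results: { word: string; distance: number }[]): void {
    const row = [prevRow[0] + 1];
    for (let col = 1; col <= target.length; col++) {
      const insertCost = row[col - 1] + 1;
      const deleteCost = prevRow[col] + 1;
      const replaceCost = prevRow[col - 1] + (target.charAt(col - 1) === ch ? 0 : 1);
      row.push(Math.min(insertCost, deleteCost, replaceCost));
    }
    if (node.isWord && row[target.length] <= maxDistance) {
      results.push({ word: prefix, distance: row[target.length] });
    }
    // Prune: if every cell already exceeds the threshold, no word below
    // this node can come back within it.
    if (Math.min(...row) <= maxDistance) {
      node.children.forEach((child, nextCh) => {
        this.walk(child, nextCh, prefix + nextCh, target, row, maxDistance, results);
      });
    }
  }
}
```

With a word list such as `spell`, `shell`, `smell`, a call like `trie.suggest("spel", 1)` only has to descend the branches whose rows still contain a value of 1 or less, which is the culling the comment refers to.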
Questions
|
Note that this is programming-language dependent, and, for this reason, it makes sense to make the spell checker itself part of the platform and expose the language-dependent parts via LSP. Here's a list of things which could be handled by a language server but can't reasonably be handled by a spell checker extension alone:
For markup languages, dealing with subword markup. For example, in asciidoctor I can write
For all languages, the language server needs to unescape string literals and strip
For all languages, there should be a language-specific built-in dictionary
For statically typed languages, spell checking should be done only for definitions, and not for references: catching misspellings in the references is the job of the compiler and code completion. |
I'm sorry, but why not use Chrome's internal spell checker? |
Electron 8 includes support for the built-in Chromium spellchecker. Maybe now this feature would be easier? |
This looks like a primary issue for built-in spell checking in VSCode so if it's going to happen with the new Electron 8.0 capabilities, I'd like to add a few notes:
|
I guess the improved spell-checking capabilities of Electron v9.0 would be an ideal basis for VS-Code built-in spell-checking? I would love to have that - haven't found a reliable spell-checking extension yet that works under VS-code remote development. |
Microsoft also has the "Microsoft Editor Service", which works for both browser and desktop. Is there any way to use it in VSCode? |
The discussion above about how to ship a spell checker appears not concluded. What about WASM? All major engines have been supporting WASM since 2017 according to the MDN compatibility data. Someone has successfully compiled Hunspell as WASM: https://github.com/kwonoj/hunspell-asm . The Base64-encoded WASM binary of Hunspell is only about 780 kB, so there should be little difficulty in bundling. |
+1 I came here to say this. Just a selectable spelling dictionary would do for me, even. I'd use it for text files, markdown files, and most especially for files that are of the "git commit" language type. |
Interesting discussion! |
A mild +1 for at least rudimentary VSCode spell checking out of the box, if it seems reasonable given overall user asks. An office-like app has great spell checking but won't start due to license check requirements if it has been offline for a long time. I prefer simple text files to avoid heavy client issues like that. VSCode supports this, but without spell checking out of the box. For certain note-taking cases, I look elsewhere... or perhaps copy/paste to an office app with spell checking at the next chance. While I +1 this, it is not a push as though I'm waiting with anticipation for it... VSCode gets tons of usage in so many areas... I'd hardly complain about where it is today... so a mild +1 if there happen to be tons of others who +1 and it makes overall sense. Hope this helps, thanks. |
Hello (first time contributing here)
There are a few offline spell checkers among VSCode extensions, but they are based on seriously faulty JavaScript implementations of the Hunspell spell checker.
Hunspell is nowadays probably the most widespread standard for the spell check layer. It is used on macOS, on Linux and in some software (e.g. LibreOffice) on Windows. It is also used by both Atom and Sublime Text. There is an enormous collection of polished dictionaries for Hunspell.
There exist some JavaScript implementations that refer to Hunspell's name but in fact do not implement its critical functionality: the lexical parser. I have verified these three:
hunspell-spellchecker
Typo.js
nspell
All three work more or less by following the simple idea of loading the dictionary into memory (into an associative table, a.k.a. dictionary; an object, to be precise). They use Hunspell's affixes (.aff file) to create ALL variants of the words found in the dictionary (.dic file) and then store them in memory. When checking spelling, the dictionary is simply asked whether the word exists or not. Simple, but it has these implications:
For example, when running hunspell-spellchecker (there is a SpellChecker extension based on it) with the English dictionary ("en_US", 62K+ words in the dictionary), memory consumption peaks at 500 MB and stays constantly above 250 MB. It crashes with the Polish dictionary ("pl_PL", 300K+ words in the dictionary) after reaching about 1.5 GB of memory consumed (there are reports about other dictionaries doing the same), with a "JavaScript heap out of memory" message hidden well under the hood. Hunspell has a lexical parser which allows it to use these two sets (dictionary and affixes) "on the fly", without the need to merge them and thus explode memory consumption and load time.
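The difference between the two strategies can be sketched as follows. This is a toy model under stated assumptions: the rule format (`SuffixRule`, suffix-only, without the strip and condition fields that real .aff files have) and both function names are hypothetical, and the on-the-fly check is only an illustration of the idea, not Hunspell's actual algorithm.

```typescript
// Hypothetical, simplified affix rule: a flag letter and a suffix to append.
interface SuffixRule {
  flag: string;
  suffix: string; // assumed non-empty
}

// A dictionary entry: a stem plus the flags of the rules that apply to it.
interface Entry {
  stem: string;
  flags: string;
}

// Strategy of the pure JavaScript checkers described above:
// materialize every word form up front and keep them all in memory.
function expandAll(entries: Entry[], rules: SuffixRule[]): Set<string> {
  const forms = new Set<string>();
  for (const entry of entries) {
    forms.add(entry.stem);
    for (const rule of rules) {
      if (entry.flags.indexOf(rule.flag) !== -1) {
        forms.add(entry.stem + rule.suffix);
      }
    }
  }
  return forms;
}

// On-the-fly strategy: keep only stems (mapped to their flags) in memory
// and try to strip each suffix from the word at lookup time.
function checkOnTheFly(word: string, stems: Map<string, string>,
                       rules: SuffixRule[]): boolean {
  if (stems.has(word)) return true;
  for (const rule of rules) {
    const cut = word.length - rule.suffix.length;
    if (cut > 0 && word.slice(cut) === rule.suffix) {
      const flags = stems.get(word.slice(0, cut));
      if (flags !== undefined && flags.indexOf(rule.flag) !== -1) return true;
    }
  }
  return false;
}
```

In the first strategy the stored set grows with the number of stems times the number of applicable rules; in the second, memory stays proportional to the number of stems and each lookup costs a few extra probes.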
There is a good spell checker component for node.js, which is actually a set of bindings for the native spell checkers on macOS (NSSpellChecker), Linux (Hunspell) and Windows (Spell Check API in Windows 8+, Hunspell in earlier versions):
https://github.com/atom/node-spellchecker
It is alas a native module.
I have built a spell checker using this module. I would rather not publish it because it is quite pointless:
So I would like you to consider doing something about it.
There are a few paths I can imagine; among them, two are most obvious:
I am probably not the right person to weigh the pros and cons of these alternatives, and there may be other alternatives that I cannot see, but I think that with the evidence provided it is clear that, unless something changes, the answer to the question in the title is MOST PROBABLY NOT!