Guesslang in VS Code #29

Closed
isidorn opened this issue Feb 18, 2021 · 19 comments

@isidorn

isidorn commented Feb 18, 2021

Hi there,

My name is Isidor and I work on VS Code.
We have the following problem:

Novice users create a new untitled file and start typing, and they have no clue that they have to set the language mode to get all the language smartness that VS Code provides.
So we were thinking of using some smart language detection so that we could automatically set the language for the user.

I was doing a bit of online research and I came across this project - looks very cool!

Is it possible to somehow have this work as a Node module instead of Python? We could then consume it easily in VS Code and things might just work. Even cooler would be if it worked in the browser.

Let me know if you are interested; we can also set up a meeting where I could explain our use case in more detail.
Thanks!

@yoeo
Owner

yoeo commented May 26, 2021

Hello @isidorn

Actually, I tried to convert the Guesslang model to JavaScript two years ago because I wanted to create an Atom extension, a VS Code extension and a JavaScript front-end library.

But it was a real challenge to convert the model to TensorFlow.js because several elements of the model were not implemented in TensorFlow.js at that time. I tried different options like simplifying the Guesslang model, splitting the model, compiling the missing elements from C++ to WebAssembly, etc. While some solutions "worked", the result was way too buggy and dirty to ship.

By the way, sorry for the late answer.

@yoeo
Owner

yoeo commented May 27, 2021

It would be nice to have the insight of someone who managed to convert a canned TensorFlow model to TensorFlow.js.

@isidorn
Author

isidorn commented May 27, 2021

@yoeo thanks for your answer.
We are actually working with the TensorFlow team so that they support some of the missing functionality and we can convert this model to JS. Please check out this issue: tensorflow/tfjs#4838
Let me also ping the TensorFlow team again so we can try to make some progress here. It would be super cool to have this in VS Code.

@isidorn
Author

isidorn commented May 31, 2021

@yoeo the TensorFlow team has just updated TensorFlow.js, so it is now possible to run your model in the browser. For more details, check out this comment: tensorflow/tfjs#4838 (comment)

We will look into adopting this in VS Code next milestone in June.
In the meantime, is it possible for the model to be updated to also classify JSON? This is a very common language for our users and it would be great if the model could support it.

@yoeo
Owner

yoeo commented May 31, 2021

Hi @isidorn that's really good news.

Yes it is possible to add JSON.
However, it will take some time to generate a new training dataset that includes JSON and other requested languages like Visual Basic, Pascal, Kotlin, XML, YAML, etc. By the way, the part that takes the most time is actually downloading ~1TB of repositories from GitHub.

@isidorn
Author

isidorn commented Jun 1, 2021

@yoeo cool, it would be really useful for us to add JSON when possible.
Makes sense that the 1TB download is the slowest part...
Thanks a lot and I will provide more feedback in a couple of weeks when I try to integrate all of this into VS Code.

@isidorn
Author

isidorn commented Jun 11, 2021

There's progress on our side and we are looking into adding this to our VS Code product. More details can be found here tensorflow/tfjs#4838 (comment) and microsoft/vscode#118455

Two things we would still really like to improve in order to ship this:

  • more languages support (as we discussed above)
  • Is it possible to further compress the model? Current size is 1.1 MB and we want to ship this as part of every VS Code.

@yoeo if you do not have time, can you give us some instructions on how to do the first? We might also put a help-wanted label on this in the VS Code repository; since we have lots of contributors, maybe somebody will volunteer.

Thanks a lot for this great model

@isidorn
Author

isidorn commented Jun 11, 2021

We were able to compress the model with gzip down to 20 KB, so we are good regarding the size.
However improving the model for more languages would be fantastic.

@yoeo
Owner

yoeo commented Jun 11, 2021

Wow I wonder how you were able to compress the model that much, that's insane!

more languages support (as we discussed above)

I'll try experimenting with that this weekend.

can you give us some instructions on how to do the first

Of course. I use this tool to build the dataset: https://github.com/yoeo/guesslangtools
You can find some documentation in the README but here's a quick guide on how to add new languages with this tool:

# Install Guesslang & GuesslangTools inside your virtualenv in developer mode
git clone git@github.com:yoeo/guesslang.git guesslang
cd guesslang/
pip install -e .
cd ..

git clone git@github.com:yoeo/guesslangtools.git guesslangtools
cd guesslangtools/
pip install -e .
cd ..

# Add the new languages to the language mapping
vi guesslang/guesslang/data/languages.json
cp guesslang/guesslang/data/languages.json guesslangtools/guesslangtools/data/languages.json

# Build the dataset (might take a few days, depending on your Internet connection)
DESTINATION_PATH=...  # dataset directory, 1TB of free space recommended
gltool $DESTINATION_PATH

# Train the model (might take a few hours, depending on your computer speed)
guesslang --train $DESTINATION_PATH --model ./new_model/

# Play with the new model
echo '
#include <stdio.h>

int main(int argc, char* argv[])
{
  printf("Hello world");
}
' | guesslang --model ./new_model/  # Should output "Programming language: C"
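
The "add the new languages to the language mapping" step above can also be scripted instead of edited by hand. This is only a minimal sketch: it assumes languages.json is a JSON object that maps each language name to a list of file extensions (check the file in your checkout, this layout is an assumption), and `add_languages` plus the sample entries are hypothetical names used for illustration.

```python
import json
from pathlib import Path


def add_languages(mapping_path, new_languages):
    """Merge new entries into a languages.json-style mapping.

    Assumption: the file is a JSON object shaped like
    {"Language name": ["ext1", "ext2"], ...}. Verify this against the
    real file before relying on it.
    """
    path = Path(mapping_path)
    mapping = json.loads(path.read_text(encoding="utf-8"))

    for name, extensions in new_languages.items():
        known = mapping.setdefault(name, [])
        for ext in extensions:
            if ext not in known:  # avoid duplicate extensions
                known.append(ext)

    # Sort by language name so the resulting diff stays readable.
    ordered = dict(sorted(mapping.items()))
    path.write_text(json.dumps(ordered, indent=2) + "\n", encoding="utf-8")
    return ordered


# Hypothetical usage; the path matches the one in the shell guide above:
# add_languages("guesslang/guesslang/data/languages.json",
#               {"JSON": ["json"], "YAML": ["yaml", "yml"]})
```

The updated file still has to be copied to guesslangtools/guesslangtools/data/languages.json, as in the shell steps.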

Thanks for the updates @isidorn.

@yoeo
Owner

yoeo commented Jun 18, 2021

Hello, just an update.

I'm trying to add the following languages to Guesslang:

  • Assembly
  • CSV
  • Dart
  • Fortran
  • Groovy
  • INI
  • JSON
  • Julia
  • Kotlin
  • Pascal
  • TOML
  • TypeScript
  • VBA
  • XML
  • YAML

I built the list according to the requests #24, #23 and #19, the TIOBE language index and Stack Overflow's popular languages.

But there are a few issues:

  • some of the language datasets that I'm generating are too specific (e.g. many of the TOML files that I'm getting are Cargo.toml files).
  • and I mostly work on it during weekends, so things may go slowly.

Therefore my current strategy is to add the "simplest" languages first then bump Guesslang and take some time to work on the more tricky languages.
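
One possible way to reduce the "too specific" problem (for example, too many Cargo.toml files among the TOML examples) is to cap how many files with the same base name are kept. The sketch below is not what GuesslangTools does; `cap_per_basename` and `max_share` are hypothetical names, and the 20% default is an arbitrary assumption.

```python
import os
import random
from collections import defaultdict


def cap_per_basename(paths, max_share=0.2, seed=0):
    """Keep at most `max_share` of the input files per base name.

    The limit is computed from the size of the *input* list, and at
    least one file per base name is always kept.
    """
    limit = max(1, int(len(paths) * max_share))

    groups = defaultdict(list)
    for path in paths:
        groups[os.path.basename(path)].append(path)

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    kept = []
    for name in sorted(groups):
        group = groups[name]
        if len(group) > limit:
            group = rng.sample(group, limit)
        kept.extend(group)
    return kept


# Hypothetical file list: 8 Cargo.toml files and 2 other TOML files.
toml_files = [f"repo{i}/Cargo.toml" for i in range(8)]
toml_files += ["a/pyproject.toml", "b/config.toml"]
print(len(cap_per_basename(toml_files)))
```

With the default `max_share`, the ten sample paths allow two files per base name, so the Cargo.toml group is sampled down while the other files pass through untouched.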

Thanks.

@isidorn
Author

isidorn commented Jun 22, 2021

@yoeo Thanks a lot for looking into this and for providing an update.
Starting with the simplest languages makes good sense to me.

As for the compression: the model, after being converted for TensorFlow.js, was a .json file, and that seems to be easily compressible.
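
To illustrate why a JSON file tends to gzip well, here is a small sketch that measures the compression ratio of a repetitive JSON payload. The payload is a synthetic stand-in invented for this example, not the actual converted Guesslang model, so the ratio it prints says nothing about the real numbers quoted in this thread.

```python
import gzip
import json


def gzip_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (payload must be non-empty)."""
    return len(gzip.compress(payload)) / len(payload)


# Synthetic stand-in: many near-identical nodes, as a text-based model
# description might contain. All field names here are made up.
fake_model = {
    "nodes": [
        {"name": f"dense_{i}", "op": "MatMul", "attr": {"T": "DT_FLOAT"}}
        for i in range(2000)
    ]
}
raw = json.dumps(fake_model).encode("utf-8")

print(f"raw size: {len(raw)} bytes")
print(f"gzip ratio: {gzip_ratio(raw):.3f}")
```

Repeated keys and values are what gzip exploits here; binary weight data would usually compress far less.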

If needed I can put help-wanted on the vscode issues, and somebody from the community can also help here. Just let me know...
We would love to ship this feature in July / August, but at the end of the day there is no rush: we would like to get it right, and there are other things we need to look at.

@yoeo
Owner

yoeo commented Jun 29, 2021

Hi @isidorn

I've made some progress over the last couple of weekends. There is now a development version of the Guesslang model that supports most of the languages listed above (including JSON).

You can try it at #33

There are still issues that I need to solve before merging it:

  1. the dataset download was taking forever ⏳, so I had to refactor GuesslangTools to speed things up a little. This work is still ongoing... See Git based download guesslangtools#4
  2. the "Pascal" language training dataset is broken; I spotted the issue too late and now I have to generate a new one
  3. the dataset is currently skewed: some languages have way more example files than others (e.g. 27k examples for Kotlin versus 9k examples for TOML). I'll have to find more examples to balance the dataset.
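
The skew in item 3 can be quantified with a quick count per language. The sketch below also shows downsampling to the smallest class, which is a different technique from the plan stated above (finding more examples) and is included only as an illustration; `imbalance_ratio` and `balance_by_downsampling` are hypothetical helpers, not part of Guesslang or GuesslangTools.

```python
import random
from collections import Counter


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size."""
    return max(counts.values()) / min(counts.values())


def balance_by_downsampling(examples, seed=0):
    """Downsample every language to the size of the smallest one.

    `examples` is a list of (language, file_path) pairs.
    """
    if not examples:
        return []

    by_language = {}
    for language, path in examples:
        by_language.setdefault(language, []).append(path)

    target = min(len(paths) for paths in by_language.values())
    rng = random.Random(seed)  # fixed seed for reproducible sampling

    balanced = []
    for language, paths in sorted(by_language.items()):
        for path in rng.sample(paths, target):
            balanced.append((language, path))
    return balanced


# Figures quoted above: 27k Kotlin examples versus 9k TOML examples.
print(imbalance_ratio(Counter({"Kotlin": 27000, "TOML": 9000})))  # 3.0
```

Downsampling throws data away, so collecting more examples for the small classes, as planned above, keeps more training signal.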

@isidorn
Author

isidorn commented Jun 30, 2021

@yoeo this is great work, thanks a lot for the update 👏
I think @TylerLeonhardt plans to jump on this next milestone (July), so the timings seem to align. I think we still have to streamline the conversion to TensorFlowJS as captured here.

I will be on vacation for the next three weeks, so expect slow responses from me.

@yoeo
Owner

yoeo commented Jul 25, 2021

Hi @isidorn , @TylerLeonhardt

I just finished adding support for the new languages to Guesslang: #33
Guesslang now supports 54 programming languages (24 more than before):

Languages
Assembly      Batchfile     C             C#            C++
Clojure       CMake         COBOL         CoffeeScript  CSS
CSV           Dart          DM            Dockerfile    Elixir
Erlang        Fortran       Go            Groovy        Haskell
HTML          INI           Java          JavaScript    JSON
Julia         Kotlin        Lisp          Lua           Makefile
Markdown      Matlab        Objective-C   OCaml         Pascal
Perl          PHP           PowerShell    Prolog        Python
R             Ruby          Rust          Scala         Shell
SQL           Swift         TeX           TOML          TypeScript
Verilog       Visual Basic  XML           YAML

Feel free to tell me if you have any feedback about this new model.

@ghost

ghost commented Jul 25, 2021

Being able to tell the difference between Matlab, Objective-C and Julia is killer, thanks*e(6)!

@isidorn
Author

isidorn commented Jul 26, 2021

@yoeo this is amazing, thanks a lot 👏 We really appreciate your help.
I just came back from vacation, but @TylerLeonhardt was working on this and there is already a prototype of this in VS Code. More details can be found here microsoft/vscode#129004
In short if you want to try it out you should set workbench.editor.untitled.experimentalLanguageDetection in vs code settings.
We are about to pick up your latest model.

@TylerLeonhardt

I think we can go ahead and close this issue now :) The Guesslang model now ships in VS Code and, beyond miscellaneous improvements to the model, there are no more action items here.

@yoeo
Owner

yoeo commented Jul 26, 2021

In short if you want to try it out you should set workbench.editor.untitled.experimentalLanguageDetection in vs code settings.

Thanks for the info. You've got a new beta-tester 🙂

I'm closing this issue; do not hesitate to create new ones for improvement requests.

@yoeo yoeo closed this as completed Jul 26, 2021
@isidorn
Author

isidorn commented Jul 27, 2021

For reference, here's the test plan item on the VS Code side, which has good steps on how to set this up: microsoft/vscode#129436
