Add LLM token classification example #4541
Conversation
Works great, looks great.
Please don't spawn debugging shells without my consent though 😛
Also: at least on my machine, there's a bunch of wait time before the first logging calls arrive and again while computing the embeddings: [video: 23-12-15_09.24.23.patched.mp4] It'd be nice if the script mentioned what it was doing in its standard output during those. (Man, I really wish we could log a spinner thing...)
Added a print. Regarding runtime, the embeddings are currently computed twice: once for logging and once as part of the whole pipeline. Not sure if it's worth changing this. In the pipeline there is a bit of extra stuff going on beyond just another function call passing the embeddings, so it'd add some complexity to the example. I added a note for now.
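For illustration, computing the embeddings once and threading them through to both consumers could look roughly like this. This is a hedged sketch only: `embed_tokens`, `log_embeddings`, and `classify_tokens` are hypothetical stand-ins, not the example's actual functions.

```python
# Sketch: compute embeddings once, reuse them for logging and the pipeline.
# All names here are illustrative stand-ins for the example's real code.

def embed_tokens(tokens):
    # Stand-in for the real embedding model: one fixed-size vector per token.
    return [[float(len(t)), float(i)] for i, t in enumerate(tokens)]

def log_embeddings(embeddings):
    # Stand-in for the logging call (e.g. to Rerun); returns count for demo.
    return len(embeddings)

def classify_tokens(tokens, embeddings):
    # Stand-in classifier that consumes the precomputed embeddings.
    return ["MISC" for _ in tokens]

tokens = ["Ada", "Lovelace", "lived", "in", "London"]
embeddings = embed_tokens(tokens)             # computed once
log_embeddings(embeddings)                    # reused for logging
labels = classify_tokens(tokens, embeddings)  # reused in the pipeline
```

The trade-off the comment describes is real: in the actual example, the pipeline does more than a single function call, so passing precomputed embeddings through would complicate it.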
Any reason we're not merging this @roym899 ?
Making some small adjustments after talking to @nikolausWest |
What
Adds an example that tokenizes a text, visualizes the embedding of each token (as a 3D UMAP projection), logs the text tokens linked to their corresponding embeddings, and classifies each token into named entities (person, location, organization, and misc). The unique named entities found are also logged.
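The flow described above can be sketched end-to-end with toy stand-ins. This is not the example's code: a real run would use a transformer tokenizer/model, UMAP for the 3D projection, and Rerun for logging; every function and the `ENTITY_LOOKUP` table below are hypothetical placeholders.

```python
# Hedged sketch of the flow: tokenize -> embed -> 3D projection ->
# per-token NER labels -> unique named entities. All stand-ins.

def tokenize(text):
    # Toy whitespace tokenizer (stand-in for a real subword tokenizer).
    return text.split()

def embed(tokens):
    # Toy deterministic per-token vectors (stand-in for model embeddings).
    return [[float(len(t)), float(sum(map(ord, t)) % 7), float(i)]
            for i, t in enumerate(tokens)]

def project_3d(embeddings):
    # Stand-in for a 3D UMAP projection; these toy vectors are already 3D.
    return embeddings

# Hypothetical lookup standing in for a token-classification model.
ENTITY_LOOKUP = {"Ada": "PER", "Lovelace": "PER", "London": "LOC"}

def classify(tokens):
    # Label each token; "O" marks tokens that are not named entities.
    return [ENTITY_LOOKUP.get(t, "O") for t in tokens]

text = "Ada Lovelace lived in London"
tokens = tokenize(text)
points = project_3d(embed(tokens))  # one 3D point per token, for logging
labels = classify(tokens)
named = sorted({t for t, l in zip(tokens, labels) if l != "O"})
# `named` holds the unique named entities found in the text.
```

In the actual example, each logged token links back to its corresponding point in the 3D embedding view, which this sketch does not attempt to reproduce.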
Also removed some newlines in manifest.yml to make it more consistent.
Checklist