Topics classification #1468
Comments
There are many techniques:

1 - Unsupervised algorithms that create topics themselves as groups of words. They usually have efficient implementations, and the number of documents per topic is well balanced. But the output is fairly random, you have to name the underlying topics yourself based on the output words, and the resulting topics may not correspond to what you want. I tried Latent Dirichlet Allocation: it's fast and the results are pretty interesting. There is also top2vec.

2 - Unsupervised algorithms that take a list of topics as input and output the topic(s) corresponding to a video. I tried lbl2vec, Lbl2TransformerVec and GPT-3 (curie (= level 3) and davinci (= level 4)). To compare the solutions, I asked them to predict the "category id" of English YouTube videos, which has 15 possible values, such as "Entertainment", "Science & Technology" or "Education". Many of these labels are questionable because several answers could be valid (I tried it myself and only got 5 correct responses out of 10, even though I knew which categories appear frequently). I will search for a better benchmark if I have the time, but here are the accuracy results:

I didn't add the title and description of the videos; adding them would probably improve performance.

3 - Supervised algorithms that output the topic(s) corresponding to a video. But you first need some labelled data.

There is also the problem of handling multiple languages. Some pretrained language models may not have a French version.

The best results (about as good as a human annotator) were obtained by automatically prompting GPT-3 DaVinci through the API and asking which topic best corresponds to the caption. But it's expensive (3 € for 100 labels), so it could be used to annotate only part of the captions. With these labels, we might consider training a supervised algorithm.

I would like to have your opinions on this. We also need to discuss which topics we want to have, or how to generate them.
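For anyone who wants to reproduce the category-id comparison, here is a minimal sketch of how the accuracy could be scored. It is not the script used for the results above; the function names are hypothetical, and the majority-class baseline is an added suggestion, since some categories appear much more often than others.

```python
from collections import Counter


def accuracy(predicted, expected):
    """Fraction of videos whose predicted category equals the YouTube label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty benchmark")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def majority_baseline(expected):
    """Accuracy obtained by always answering the most frequent category."""
    if not expected:
        raise ValueError("cannot compute a baseline on an empty benchmark")
    _, count = Counter(expected).most_common(1)[0]
    return count / len(expected)


if __name__ == "__main__":
    # Toy labels, for illustration only (not real benchmark data).
    labels = ["Education", "Entertainment", "Education", "Music"]
    guesses = ["Education", "Education", "Education", "Music"]
    print(accuracy(guesses, labels), majority_baseline(labels))
```

Comparing each model against the majority baseline helps to tell whether it does better than simply guessing the most frequent category.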
Just a few comments about this issue:
Another strategy that I didn't think of is to use the tags: the tags that appear frequently sometimes represent topics that we want to include. Here are the lists of the tags in French and English, sorted by the number of times they appear: fr_tags.csv, en_tags.csv

This could help to determine which topics to use, and could be used to train a supervised model. I think that's the best solution. Now we need to determine the list of topics that we want to have. If these topics are important, there will probably be videos with this topic as a tag.
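As a rough sketch of the tag-based idea, the snippet below counts in how many videos each tag appears and keeps the frequent ones as candidate topics. It does not read fr_tags.csv or en_tags.csv (their exact layout is not shown here); the function names and the `min_videos` threshold are hypothetical.

```python
from collections import Counter


def count_tags(videos_tags):
    """Count the number of videos in which each tag appears.

    videos_tags is an iterable with one list of tag strings per video.
    Tags are lowercased and stripped, and counted once per video.
    """
    counts = Counter()
    for tags in videos_tags:
        counts.update({t.strip().lower() for t in tags if t.strip()})
    return counts


def candidate_topics(counts, min_videos):
    """Return the tags used by at least min_videos videos, most frequent first."""
    return [tag for tag, n in counts.most_common() if n >= min_videos]


if __name__ == "__main__":
    # Toy tags, for illustration only.
    videos = [["Climate", "science"], ["Science", "physics"], ["climate"]]
    print(candidate_topics(count_tags(videos), min_videos=2))
```

The resulting list would still need a manual pass, since frequent tags are not always topics that we want to keep.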
Are you referring to this API? It's restricted to OAuth authentication with specific scopes. So in practice it can be used to fetch captions from your own channels, but not from arbitrary videos. Am I missing something?
You must be right. If so, can we even use the captions? Can we use JST's tools in production?
No, I think we would prefer to stick with methods that are not legally ambiguous to get the information we use in production.
I understand. So do we give up on using the transcripts? We might still be able to assign topics to videos. But much of #1475 relies on the transcripts, because it doesn't seem wise to assign scores to videos based on superficial criteria, without even analysing the content of the videos.
I have been working on something else since then, but maybe I could give it a try now.

The first step is to define a list of topics. YouTube already provides a categoryId, and each video is classified with a label from youtube_topics.txt. Assuming that this is not sufficient, we can add other topics for which we need to do the classification ourselves. Here is a list of additional topics that I suggest: topics.txt

The classification method that gave the best results was OpenAI's API. Using gpt-3.5-turbo to classify all of Tournesol's videos would likely cost around 0.001 € per video (0.002 € per 1K tokens), so maybe two dozen euros for all of Tournesol's videos (assuming we don't use the transcripts). You can propose other APIs if you want, but I don't know how effective or expensive they will be.

I was about to propose a more sophisticated solution, but I think this one is simpler and gives better results than fine-tuning our own transformer encoder. There is a bit of prompt engineering (example.txt) and response processing, but it's not complicated.
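To make the "prompt engineering and response processing" part more concrete, here is a minimal sketch in plain Python. It is not the content of example.txt: the prompt wording, the function names and the matching rules are all assumptions, and the call to the API itself is deliberately left out. The cost helper only reproduces the arithmetic above; the 500 tokens per video and 24,000 videos used in the example are back-derived from the quoted figures, not measured.

```python
import string


def build_prompt(title, description, topics):
    """Build a classification prompt (hypothetical wording, not example.txt)."""
    topic_lines = "\n".join(f"- {t}" for t in topics)
    return (
        "Which one of the following topics best corresponds to this video?\n"
        f"{topic_lines}\n\n"
        f"Title: {title}\n"
        f"Description: {description}\n\n"
        "Answer with the topic name only."
    )


def parse_topic(response, topics):
    """Map a free-text model answer back to one known topic, or None."""
    # Drop surrounding whitespace, quotes and punctuation, ignore case.
    cleaned = response.strip(string.punctuation + string.whitespace).lower()
    by_name = {t.lower(): t for t in topics}
    if cleaned in by_name:
        return by_name[cleaned]
    # Fall back to substring matching, accepted only when unambiguous.
    matches = [t for name, t in by_name.items() if name in cleaned]
    return matches[0] if len(matches) == 1 else None


def estimate_cost(n_videos, tokens_per_video, price_per_1k_tokens):
    """Total price in euros for classifying n_videos videos."""
    return n_videos * tokens_per_video / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    topics = ["Education", "Entertainment", "Science & Technology"]
    print(build_prompt("Black holes explained", "A short introduction.", topics))
    print(parse_topic("The topic is Education.", topics))
    # Assumed values: 24,000 videos, 500 tokens each, 0.002 EUR per 1K tokens.
    print(round(estimate_cost(24_000, 500, 0.002), 2))
```

Returning None for ambiguous or unknown answers makes it easy to retry those videos or to review them manually.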
In order to improve the diversity of our recommendations, or to allow users to filter on specific topics, we need to be able to automatically assign topics to YouTube videos.
Sources of information include the captions, titles, descriptions, and category id of the videos. The category id is a type of topic that may not be sufficient for our purpose.