
Topics classification #1468

Open
glerzing opened this issue Mar 27, 2023 · 9 comments
Assignees
glerzing
Labels
Backend (Back-end code of Tournesol) · Discussion (Debating a proposal)

Comments

@glerzing
Collaborator

In order to improve the diversity of our recommendations, or to allow users to filter on specific topics, we need to be able to automatically assign topics to YouTube videos.

Sources of information include the captions, titles, descriptions, and category id of the videos. The category id is itself a kind of topic, but it may not be sufficient for our purpose.

@glerzing glerzing self-assigned this Mar 27, 2023
@glerzing glerzing added the Backend (Back-end code of Tournesol) and Discussion (Debating a proposal) labels Mar 27, 2023
@glerzing
Collaborator Author

glerzing commented Mar 27, 2023

There are a lot of techniques:

1 - Unsupervised algorithms that create topics themselves as groups of words. They usually have efficient implementations, and the number of documents per topic tends to be well balanced. But the results are quite random: you have to name the underlying topics yourself based on the output words, and the output topics may not correspond to what you want. I tried Latent Dirichlet Allocation; it's fast and the results are pretty interesting (see the sketch after this list). There is also top2vec.

2 - Unsupervised algorithms that take a list of topics as input and output the topic(s) corresponding to a video. I tried lbl2vec, Lbl2TransformerVec and GPT-3 (curie (= level 3) and davinci (= level 4)). To compare the solutions, I asked them to predict the "category id" of English YouTube videos, which has 15 possible values, like "Entertainment", "Science & Technology" or "Education". A lot of these labels are questionable because multiple answers could apply (I tried myself and only got 5 correct responses out of 10, even though I knew which categories appear frequently). I will search for a better benchmark if I have the time, but here are the accuracy results:
- Curie, caption truncated at 1000 characters (≈ 300 tokens; transformers' input is limited in size): 5%, less than random chance (7%)!
- Curie, caption truncated at 2000 characters: still 5%!
- lbl2vec, based on doc2vec: 15%
- Lbl2TransformerVec, caption truncated at 1500 characters: 21%
- DaVinci, caption truncated at 5000 characters: 39%!! But it can become expensive if done on a large batch (0.03 € per 1000 input tokens); it cost me around 3 € for 100 API calls.

I didn't include the title and description of the videos; adding them would probably improve performance.

3 - Supervised algorithms that output the topic(s) corresponding to a video. But you first need some labelled data.
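
To make technique 1 concrete, here is a minimal sketch of topic discovery with Latent Dirichlet Allocation using gensim. The `captions` list is a placeholder, and the number of topics and preprocessing are illustrative choices, not the exact setup I used:

```python
# Minimal LDA sketch with gensim. The captions list is a placeholder;
# num_topics and the preprocessing are illustrative choices.
from gensim import corpora, models
from gensim.utils import simple_preprocess

captions = [
    "how to train a neural network from scratch",
    "the history of the roman empire explained",
    # ... one string per video (caption, title, description)
]

# Tokenize and build the dictionary / bag-of-words corpus.
tokenized = [simple_preprocess(text) for text in captions]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit LDA; each document gets a distribution over num_topics topics.
lda = models.LdaModel(bow_corpus, num_topics=15, id2word=dictionary, passes=5)

# Each topic is a weighted list of words; naming the topic is still manual.
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```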

There is also the problem of handling multiple languages. Some pretrained language models may not have a French version.

The best results (about as good as a human annotator) were obtained by automatically prompting GPT-3 DaVinci through the API, asking which topic best corresponds to the caption (sketched below). But it's expensive (3 € for 100 labels), so it could be used just to annotate part of the captions. With these labels, we might then train a supervised algorithm.
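
For reference, a rough sketch of what this prompting looks like with the legacy OpenAI completion endpoint (openai Python package, 0.x API). The model name, category list and prompt wording are illustrative, not the exact ones used for the benchmark above:

```python
import openai  # openai 0.x; requires openai.api_key to be set

# Illustrative category list; the benchmark used the 15 YouTube category ids.
CATEGORIES = ["Entertainment", "Science & Technology", "Education"]

def classify_caption(caption: str, max_chars: int = 5000) -> str:
    """Ask the model which category best matches a (truncated) caption."""
    prompt = (
        "Here is the transcript of a YouTube video:\n\n"
        f"{caption[:max_chars]}\n\n"
        f"Among the following categories: {', '.join(CATEGORIES)},\n"
        "which one best describes this video? Answer with the category only."
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # assumed model name; the tests used davinci
        prompt=prompt,
        max_tokens=10,
        temperature=0,  # deterministic output for classification
    )
    return response.choices[0].text.strip()
```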

I would like to have your opinions on this. We also need to discuss which topics we want to have, or how to generate them.

@aidanjungo
Collaborator

Just a few comments about this issue:

  • If we want to use the transcripts in production, we must find out whether there is a proper way to get them (through the YouTube API? at what price?).
  • We probably won't want to use GPT-x or other OpenAI models, as we have spent some time criticizing the way they release unsafe models, and all the ethical issues that go with it -> e.g. https://twitter.com/le_science4all/status/1490014328254349323

@glerzing
Collaborator Author

glerzing commented Mar 27, 2023

  • There is a captions YouTube API. It's free but there is a quota, and fetching captions quickly depletes it: a captions call costs 50 times more quota than a metadata call, so with the default daily quota you can only fetch about 200 captions per day. There is a procedure for companies to get a higher quota, but it may be complicated (https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits?hl=fr).

  • There are not that many great LLMs available out there. And I personally think OpenAI would probably make better use of that money than other tech giants. But I understand that people here don't like OpenAI. So if you don't want to use an OpenAI key, you can call it the 😎 GPT-glerzing approach: you get some annotated data, and you don't need to know where it comes from.

@glerzing
Collaborator Author

glerzing commented Mar 28, 2023

Another strategy that I didn't think of is to use the tags: the tags that appear frequently sometimes represent topics that we want to include. Here are the lists of the tags in French and English, sorted by the number of times they appear: fr_tags.csv, en_tags.csv

This could help determine which topics to use, and the tags could be used to train a supervised model.

I think that's the best solution. Now we need to determine the list of topics that we want. If a topic is important, there will probably be videos with it as a tag. (A sketch of how such tag lists can be produced follows below.)
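
As an illustration, here is roughly how the tag frequency lists can be produced, assuming the video metadata (with its tags field) has already been fetched; the field and file names are illustrative:

```python
from collections import Counter
import csv

# Placeholder: one dict per video, as returned by the metadata API.
videos = [
    {"tags": ["ai", "machine learning", "tutorial"]},
    {"tags": ["history", "rome"]},
]

# Count tag occurrences across all videos (case-insensitive).
tag_counts = Counter(tag.lower() for v in videos for tag in v.get("tags", []))

# Write the tags sorted by frequency, like en_tags.csv / fr_tags.csv.
with open("en_tags.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tag", "count"])
    for tag, count in tag_counts.most_common():
        writer.writerow([tag, count])
```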

@amatissart
Member

Do you refer to this API? It's restricted to OAuth authentication with specific scopes. So in practice it can be used to fetch captions from your own channels, but not from arbitrary videos. Am I missing something?

@glerzing
Collaborator Author

You must be right. If so, can we even use the captions? Can we use JST's tools in production?

@aidanjungo
Collaborator

> You must be right. If so, can we even use the captions? Can we use JST's tools in production?

No, I think we would prefer to stick with methods that are not legally blurry for getting the information we use in production.

@glerzing
Collaborator Author

glerzing commented Mar 29, 2023

I understand. So do we give up on using the transcripts? We might still be able to assign topics to videos. But much of #1475 relies on the transcripts, because it doesn't seem wise to assign scores to videos based on superficial criteria, without even analysing the content of the videos.

@glerzing
Collaborator Author

glerzing commented May 14, 2023

I have been working on something else since then. But maybe I could give it a try now.

The first step is to define a list of topics. YouTube already provides a categoryId, and each video is classified with a label in youtube_topics.txt. But assuming that this is not sufficient, we can add other topics for which we need to do the classification ourselves. Here is a list of additional topics that I suggest: topics.txt

The classification method that gave the best results was to use OpenAI's API. Using gpt-3.5-turbo to classify all of Tournesol's videos would likely cost around 0.001 € per video (at 0.002 € per 1000 tokens, that is roughly 500 tokens per video), so maybe a couple dozen euros for every Tournesol video (assuming we don't use the transcripts). You can propose other APIs if you want, but I don't know how effective / expensive they would be. I was about to propose a more sophisticated solution, but I think this one is simpler and gives better results than fine-tuning our own transformer encoder. There is a bit of prompt engineering (example.txt) and response processing, but it's not complicated; a sketch of the call is below.
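
For the record, a minimal sketch of such a call with the openai 0.x chat API; the topic list and prompt stand in for topics.txt and example.txt and are not the exact ones:

```python
import openai  # openai 0.x; requires openai.api_key to be set

TOPICS = ["Climate", "Health", "Politics"]  # stand-in for topics.txt

def classify_video(title: str, description: str) -> str:
    """Ask gpt-3.5-turbo which topics best match a video's metadata."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You classify YouTube videos into topics."},
            {"role": "user",
             "content": (f"Title: {title}\nDescription: {description}\n\n"
                         f"Pick the best matching topics among: "
                         f"{', '.join(TOPICS)}. "
                         "Answer with a comma-separated list of topics only.")},
        ],
        temperature=0,  # deterministic output for classification
    )
    return response.choices[0].message["content"].strip()
```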
