[VN-MTEB] Vietnamese clustering version 1 -> 2 #3788
BaoLocPham wants to merge 3 commits into embeddings-benchmark:main
Conversation
Can you calculate statistics of your tasks?

Done.
    adapted_from=["TwentyNewsgroupsClustering-VN"],
)

def dataset_transform(self):
I don't see a difference in this task. I think you can remove `dataset_transform`.
class RedditFastClusteringP2PVN(AbsTaskClustering):
    metadata = TaskMetadata(
        name="RedditClusteringP2P-VN.v2",
        description="A translated dataset from Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Cohere's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Uses LLM-as-a-judge to score the quality of the samples based on multiple criteria.",
Can you add to all v2 versions a note on what's different? E.g.:
The v2 version of the tasks uses `AbsTaskClustering` instead of `AbsTaskClusteringLegacy`.
Apparently the old version doesn't work anymore.
After debugging, I found out that clustering now has a fast version, used by some tasks such as RedditClustering.
I followed the pattern of those tasks (RedditClustering, TwentyNewsgroupsClustering, StackExchange, etc.).
If I remove the `dataset_transform`, the task simply cannot run.
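To make the point concrete, here is a minimal sketch (field names are illustrative, not the actual MTEB column names) of what such a `dataset_transform` has to do: flatten legacy rows, where one row bundles many documents, into the one-document-per-row layout the fast clustering evaluator expects.

```python
def flatten_legacy_rows(rows):
    """Flatten legacy clustering rows (lists of documents per row)
    into one document + one label per row."""
    flat = []
    for row in rows:
        for sentence, label in zip(row["sentences"], row["labels"]):
            flat.append({"sentences": sentence, "labels": label})
    return flat


# One legacy row holding two documents becomes two per-document rows.
legacy = [{"sentences": ["doc A", "doc B"], "labels": [0, 1]}]
print(flatten_legacy_rows(legacy))
# [{'sentences': 'doc A', 'labels': 0}, {'sentences': 'doc B', 'labels': 1}]
```

Without a transform like this, the fast evaluator sees lists where it expects strings, so the task cannot run.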
> Apparently the old version doesn't work anymore.
Which tasks are not working?
- "RedditClusteringP2PVN"
- "RedditClusteringVN"
- "StackExchangeClusteringP2PVN"
- "StackExchangeClusteringVN"
- "TwentyNewsgroupsClusteringVN"
LegacyClustering and Clustering have slightly different data schemes. In Clustering, each row is one input example with one cluster label (e.g. https://huggingface.co/datasets/Uri-ka/ClusTREC-Covid), whereas in Legacy one row holds multiple input examples. Because of that, TwentyNewsgroupsClustering wasn't processed correctly: `self.max_fraction_of_documents_to_embed * len(data_split)` truncated to 0 (took `0.2 * 10 -> 0`).
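The failure mode described above can be sketched as follows. This is an illustrative reconstruction, not the real MTEB internals: the field names and row counts are assumptions, chosen only to show how a legacy-shaped split with very few rows makes the subsample size truncate to zero.

```python
# Fast/Clustering scheme: one document and one cluster label per row,
# so len(data_split) equals the number of documents.
fast_split = [{"sentences": f"doc {i}", "labels": i % 2} for i in range(100)]

# Legacy scheme: one row bundles many documents, so len(data_split) is tiny
# even when the split contains many documents overall.
legacy_split = [
    {"sentences": ["doc 0", "doc 1", "doc 2"], "labels": [0, 1, 0]},
    {"sentences": ["doc 3", "doc 4", "doc 5"], "labels": [1, 0, 1]},
]

max_fraction_of_documents_to_embed = 0.2

# On the fast scheme the subsample size is sensible ...
n_fast = int(max_fraction_of_documents_to_embed * len(fast_split))      # 20
# ... but on the legacy scheme it truncates to zero, so nothing is embedded.
n_legacy = int(max_fraction_of_documents_to_embed * len(legacy_split))  # int(0.4) == 0
```

With `n_legacy == 0`, the evaluator has no documents to embed, which matches the "cannot run" behavior reported above.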
Do you have plans to update your benchmark? Otherwise, I don't think it's necessary to create new versions of the tasks.
I updated the version for all Vietnamese clustering tasks.