[VN-MTEB] Vietnamese clustering version 1 -> 2 #3788
BaoLocPham wants to merge 3 commits into embeddings-benchmark:main
Conversation
Can you calculate statistics of your tasks?

Done.
    adapted_from=["TwentyNewsgroupsClustering-VN"],
)

def dataset_transform(self):
I don't see a difference in this task. I think you can remove `dataset_transform`.
class RedditFastClusteringP2PVN(AbsTaskClustering):
    metadata = TaskMetadata(
        name="RedditClusteringP2P-VN.v2",
        description="A translated dataset from Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Cohere's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Uses LLM-as-a-judge to score the quality of the samples based on multiple criteria.",
Can you add to all v2 versions a note on what's different? E.g.:
The v2 version of the tasks uses `AbsTaskClustering` instead of `AbsTaskClusteringLegacy`.
Apparently the old version doesn't work anymore.
After debugging, I found out that clustering now has a fast version, used by some tasks such as RedditClustering.
I followed the pattern of those tasks (RedditClustering, TwentyNewsgroupsClustering, StackExchange, etc.).
If I remove the `dataset_transform`, the task simply cannot run.
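To make the point concrete, here is a minimal sketch (field names are illustrative, not the actual MTEB column names) of what such a `dataset_transform` has to do: flatten legacy rows, where one row bundles many documents, into the one-document-per-row layout the fast clustering evaluator expects.

```python
def flatten_legacy_rows(rows):
    """Flatten legacy clustering rows (lists of documents per row)
    into one document + one label per row."""
    flat = []
    for row in rows:
        for sentence, label in zip(row["sentences"], row["labels"]):
            flat.append({"sentences": sentence, "labels": label})
    return flat


# One legacy row holding two documents becomes two per-document rows.
legacy = [{"sentences": ["doc A", "doc B"], "labels": [0, 1]}]
print(flatten_legacy_rows(legacy))
# [{'sentences': 'doc A', 'labels': 0}, {'sentences': 'doc B', 'labels': 1}]
```

Without a transform like this, the fast evaluator sees lists where it expects strings, so the task cannot run.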
> Apparently the old version doesn't work anymore.
Which tasks are not working?
- "RedditClusteringP2PVN"
- "RedditClusteringVN"
- "StackExchangeClusteringP2PVN"
- "StackExchangeClusteringVN"
- "TwentyNewsgroupsClusteringVN"
LegacyClustering and Clustering have slightly different data schemes. In Clustering, each row is one input example with one cluster label (e.g. https://huggingface.co/datasets/Uri-ka/ClusTREC-Covid), whereas in Legacy one row holds multiple input examples. Because of that, TwentyNewsgroupsClustering wasn't processed correctly: `self.max_fraction_of_documents_to_embed * len(data_split)` truncated to 0 (took `0.2 * 10 -> 0`).
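The failure mode described above can be sketched as follows. This is an illustrative reconstruction, not the real MTEB internals: the field names and row counts are assumptions, chosen only to show how a legacy-shaped split with very few rows makes the subsample size truncate to zero.

```python
# Fast/Clustering scheme: one document and one cluster label per row,
# so len(data_split) equals the number of documents.
fast_split = [{"sentences": f"doc {i}", "labels": i % 2} for i in range(100)]

# Legacy scheme: one row bundles many documents, so len(data_split) is tiny
# even when the split contains many documents overall.
legacy_split = [
    {"sentences": ["doc 0", "doc 1", "doc 2"], "labels": [0, 1, 0]},
    {"sentences": ["doc 3", "doc 4", "doc 5"], "labels": [1, 0, 1]},
]

max_fraction_of_documents_to_embed = 0.2

# On the fast scheme the subsample size is sensible ...
n_fast = int(max_fraction_of_documents_to_embed * len(fast_split))      # 20
# ... but on the legacy scheme it truncates to zero, so nothing is embedded.
n_legacy = int(max_fraction_of_documents_to_embed * len(legacy_split))  # int(0.4) == 0
```

With `n_legacy == 0`, the evaluator has no documents to embed, which matches the "cannot run" behavior reported above.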
Do you have plans to update your benchmark? Otherwise, I don't think it's necessary to create new versions of the tasks.
I updated the version for all Vietnamese clustering tasks.