[VN-MTEB] Vietnamese clustering version 1 -> 2#3788

Closed
BaoLocPham wants to merge 3 commits intoembeddings-benchmark:mainfrom
BaoLocPham:main

Conversation

@BaoLocPham
Contributor

I updated the version for all Vietnamese clustering tasks.

@BaoLocPham BaoLocPham closed this Dec 23, 2025
@BaoLocPham BaoLocPham reopened this Dec 23, 2025
@Samoed
Member

Samoed commented Dec 23, 2025

Can you calculate statistics of your tasks? task.calculate_descriptive_statistics()

@BaoLocPham
Contributor Author

BaoLocPham commented Dec 23, 2025

Can you calculate statistics of your tasks? task.calculate_descriptive_statistics()

Done.

adapted_from=["TwentyNewsgroupsClustering-VN"],
)

def dataset_transform(self):
Member

I don't see a difference in this task. I think you can remove `dataset_transform`.

class RedditFastClusteringP2PVN(AbsTaskClustering):
metadata = TaskMetadata(
name="RedditClusteringP2P-VN.v2",
        description="A translated dataset from Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Cohere's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Uses an LLM-as-a-judge to score the quality of the samples based on multiple criteria.",
Member

Can you add to all v2 versions a note on what's different? E.g.

V2 version of tasks uses `AbsTaskClustering` instead of `AbsTaskClusteringLegacy`

Contributor Author

Apparently the old version doesn't work anymore.
After debugging, I found out that clustering now has a fast version for some tasks, such as RedditClustering.
I followed the concept of those tasks (RedditClustering, TwentyNewsgroupsClustering, StackExchange, etc.).
If I remove the `dataset_transform`, the task simply cannot run.
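To illustrate what such a migration transform does (a hypothetical sketch, not the PR's actual code; the field names `sentences`/`labels` are assumptions): the legacy scheme bundles many documents into one row with parallel label lists, while `AbsTaskClustering` expects one document per row, so the transform flattens the bundles.

```python
# Hypothetical sketch: flatten the legacy clustering layout
# (one row = many sentences with a parallel list of labels)
# into the flat layout (one row = one sentence, one label).
legacy_split = [
    {"sentences": ["doc a", "doc b"], "labels": [0, 1]},
    {"sentences": ["doc c"], "labels": [0]},
]

# Pair each sentence with its label and emit one row per document.
flat_split = [
    {"sentences": sent, "labels": lab}
    for row in legacy_split
    for sent, lab in zip(row["sentences"], row["labels"])
]

print(len(flat_split))  # one row per document
```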

Member

Apparently the old version doesn't work anymore.

Which tasks are not working?

Contributor Author

  • "RedditClusteringP2PVN"
  • "RedditClusteringVN"
  • "StackExchangeClusteringP2PVN"
  • "StackExchangeClusteringVN"
  • "TwentyNewsgroupsClusteringVN"

Member

Yeah, strange issue. Fixed in #3791

Member

LegacyClustering and Clustering have slightly different data schemes. In Clustering, each row is one input example with one cluster label (e.g. https://huggingface.co/datasets/Uri-ka/ClusTREC-Covid), but in the legacy scheme one row holds multiple input examples. Because of that, TwentyNewsgroupsClustering wasn't processed correctly: `self.max_fraction_of_documents_to_embed * len(data_split)` took 0.2 * 10 -> 0.
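The failure mode described above can be sketched in isolation (the row counts below are hypothetical; only the fraction 0.2 comes from the discussion): in the legacy scheme `len(data_split)` counts bundled rows rather than documents, so the integer-truncated sample size can collapse to zero.

```python
max_fraction_of_documents_to_embed = 0.2  # fraction from the discussion

# Legacy scheme: one row bundles a whole set of documents, so the row
# count is tiny even when the split holds thousands of documents.
legacy_row_count = 4      # hypothetical: 4 bundles
flat_row_count = 10_000   # same data, one row per document

n_legacy = int(max_fraction_of_documents_to_embed * legacy_row_count)  # int(0.8) -> 0
n_flat = int(max_fraction_of_documents_to_embed * flat_row_count)      # -> 2000

print(n_legacy, n_flat)
```

With zero documents selected to embed, the legacy task silently has nothing to cluster, matching the broken behavior reported for the VN tasks.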

Member

Do you have plans to update your benchmark? Otherwise, I don't think it's necessary to create new versions of these tasks.

Contributor Author

Hi @Samoed, thanks for the quick fix in #3791. I think this version 2 is no longer necessary for clustering.
Thank you so much for the support.

@Samoed Samoed closed this Dec 23, 2025