add AbsTaskSpectralClustering #2430
OnAnd0n wants to merge 19 commits into embeddings-benchmark:main
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
I have implemented Spectral Clustering to reflect cosine similarity, as we previously discussed. However, I noticed that the create_task_list() function collects task categories using a two-level iteration. Could you review whether this approach is reasonable and applicable?
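For readers following the thread, here is a minimal sketch of how cosine similarity could be turned into an affinity matrix for spectral clustering. It is not the code from this PR: the helper name `cosine_affinity` and the choice to clip negative similarities to zero are assumptions made for illustration only.

```python
import math


def cosine_affinity(embeddings):
    """Return a symmetric, non-negative cosine-similarity matrix.

    Hypothetical helper: negative similarities are clipped to 0 so the
    result can serve as graph edge weights (an assumption of this sketch).
    """
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    n = len(embeddings)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            denom = norms[i] * norms[j]
            # Zero vectors have no direction, so treat their similarity as 0.
            sim = dot / denom if denom else 0.0
            sim = max(sim, 0.0)
            affinity[i][j] = affinity[j][i] = sim
    return affinity


affinity = cosine_affinity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
```

A matrix of this shape could then be handed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`.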
    logger = logging.getLogger(__name__)


    class SpectralClusteringEvaluator(Evaluator):
This should be moved to evaluators
    labels,
    task_name: str | None = None,
    clustering_batch_size: int = 500,
    limit: int | None = None,
This will be removed in 2.0
    if limit is not None:
        sentences = sentences[:limit]
        labels = labels[:limit]
This will be removed in 2.0
    return {"v_measure": v_measure}


    class AbsTaskSpectralClustering(AbsTask):
I think your clustering is almost 1-to-1 with the original clustering. Maybe it would be better to make the evaluator a property of the task, so that the evaluation loop uses

    for cluster_set in tqdm.tqdm(dataset, desc="Clustering"):
        evaluator = self.evaluator(...)

and your tasks will use

    class Task(AbsClustering):
        evaluator = SpectralClusteringEvaluator
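The suggestion above can be sketched roughly as follows. All class names here (`BaseEvaluator`, `SpectralEvaluator`, `ClusteringTask`, `SpectralClusteringTask`) are hypothetical stand-ins rather than the actual mteb classes, and the evaluators are stubs that only report which one ran.

```python
class BaseEvaluator:
    """Stand-in for the default clustering evaluator (hypothetical)."""

    name = "kmeans"

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __call__(self):
        # A real evaluator would embed, cluster and score the data;
        # this stub only reports which evaluator was used.
        return {"evaluator": self.name, "n_samples": len(self.sentences)}


class SpectralEvaluator(BaseEvaluator):
    """Stand-in for a spectral clustering evaluator (hypothetical)."""

    name = "spectral"


class ClusteringTask:
    # The evaluator is a class attribute, so subclasses can swap it
    # without copying the evaluation loop.
    evaluator = BaseEvaluator

    def evaluate(self, dataset):
        results = []
        for cluster_set in dataset:
            evaluator = self.evaluator(
                cluster_set["sentences"], cluster_set["labels"]
            )
            results.append(evaluator())
        return results


class SpectralClusteringTask(ClusteringTask):
    evaluator = SpectralEvaluator


dataset = [{"sentences": ["a", "b", "c"], "labels": [0, 0, 1]}]
base_results = ClusteringTask().evaluate(dataset)
spectral_results = SpectralClusteringTask().evaluate(dataset)
```

With this layout the spectral variant is a two-line subclass, and the shared loop stays in one place.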
@Samoed
Thanks for your advice!
I will revise it again at 'evaluation' level (if it is deemed meaningful).
Additionally, I will apply try/except as well.
We should build this on the fast clustering task, not AbsTaskClustering (it is much slower and gives less consistent estimates)
    from collections import Counter
    from typing import Any

    import networkx as nx
This import needs to be wrapped in try...except. I think it should be moved inside __call__.
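One way to read this request is the lazy-import pattern sketched below. It is only an illustration: the class name `LazyImportEvaluator` and the error message are assumptions, and `importlib.import_module` is used instead of a plain `import networkx as nx` so the behaviour can be shown with any module name.

```python
import importlib


class LazyImportEvaluator:
    """Hypothetical evaluator that imports an optional dependency on use."""

    def __init__(self, module_name="networkx"):
        # Nothing is imported here, so constructing the evaluator never
        # fails when the optional dependency is missing.
        self.module_name = module_name

    def __call__(self):
        try:
            module = importlib.import_module(self.module_name)
        except ImportError as exc:
            raise ImportError(
                f"'{self.module_name}' is required for this evaluator. "
                "Install it to run this task."
            ) from exc
        # A real evaluator would go on to use the module; the sketch
        # simply returns it.
        return module


available = LazyImportEvaluator("json")()
```

Importing inside `__call__` means users who never run this task do not need the extra package installed.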
@OnAnd0n I will close this PR as it seems to have gotten stale - do feel free to reopen if you have the time to address the comments.
Code Quality
make lint to maintain consistent style.
Documentation
Testing
make test-with-coverage.
make test or make test-with-coverage to ensure no existing functionality is broken.
Adding datasets checklist
Reason for dataset addition: ...
mteb -m {model_name} -t {task_name} command.
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test.
make lint.