Add step to deduplicate records based on embeddings #946

Merged
plaguss merged 6 commits into develop from dedup-embeddings on Sep 6, 2024

@plaguss (Contributor) commented Sep 4, 2024

Description

This PR adds an EmbeddingDedup step that deduplicates texts based on their embeddings, using the nearest neighbours detected by a previous step. The typical workflow is the following:

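import faiss

from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup, FaissNearestNeighbour, LoadDataFromDicts

with Pipeline() as pipeline: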
    loader = LoadDataFromDicts(
        data=...,  # Records containing the texts and embeddings
    )

    # NOTE: Guide to choose an index: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
    nn = FaissNearestNeighbour(
        k=3,
        metric_type=faiss.METRIC_INNER_PRODUCT,
        search_batch_size=50,
        # These fields are needed when training an index (for a big dataset)
        # string_factory="IVF300_HNSW32,Flat",
        # train_size=len(dataset),
    )

    embedding_dedup = EmbeddingDedup(
        threshold=0.99,
    )
    loader >> nn >> embedding_dedup

distiset = pipeline.run(use_cache=False)

ds = distiset["default"]["train"]
ds_dedup = ds.filter(lambda x: x["keep_row_after_embedding_filtering"])
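
To illustrate the idea behind the workflow above, here is a minimal pure-Python sketch of threshold-based deduplication over nearest neighbours. It is not the implementation in this PR: the helper names (`normalize`, `nearest_neighbours`, `mark_duplicates`), the `nn_indices` / `nn_scores` field names, and the greedy "keep the first occurrence" rule are all assumptions made for illustration. Embeddings are normalized first so that the inner product equals cosine similarity, which is what makes a threshold such as 0.99 meaningful.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_neighbours(embeddings, k):
    """Brute-force stand-in for a nearest-neighbour step (hypothetical helper).

    For each row, return the k most similar *other* rows by inner product.
    The output field names are assumptions for this sketch.
    """
    unit = [normalize(e) for e in embeddings]
    results = []
    for i, query in enumerate(unit):
        scored = [
            (sum(a * b for a, b in zip(query, other)), j)
            for j, other in enumerate(unit)
            if j != i
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = scored[:k]
        results.append(
            {
                "nn_indices": [j for _, j in top],
                "nn_scores": [s for s, _ in top],
            }
        )
    return results


def mark_duplicates(neighbours, threshold):
    """Greedy rule (an assumption): drop a row if an earlier, kept row is
    among its neighbours with a similarity at or above the threshold."""
    keep = []
    for i, row in enumerate(neighbours):
        is_duplicate = any(
            score >= threshold and j < i and keep[j]
            for j, score in zip(row["nn_indices"], row["nn_scores"])
        )
        keep.append(not is_duplicate)
    return keep


# Toy data: rows 1 and 3 point in almost the same direction as row 0.
embeddings = [[1.0, 0.0], [0.999, 0.001], [0.0, 1.0], [1.0, 0.0001]]
neighbours = nearest_neighbours(embeddings, k=2)
keep = mark_duplicates(neighbours, threshold=0.99)
deduplicated = [e for e, flag in zip(embeddings, keep) if flag]
```

In this sketch the `keep` list plays the same role as the `keep_row_after_embedding_filtering` column used in the filter above: rows are flagged rather than removed, so the final filtering stays a separate, explicit step.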

@plaguss plaguss requested a review from gabrielmbmb September 4, 2024 07:42
@plaguss plaguss self-assigned this Sep 4, 2024
@plaguss plaguss added the enhancement New feature or request label Sep 4, 2024
github-actions bot commented Sep 4, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-946/

codspeed-hq bot commented Sep 4, 2024

CodSpeed Performance Report

Merging #946 will not alter performance

Comparing dedup-embeddings (3e317af) with develop (973e0fa)

Summary

✅ 1 untouched benchmarks

@gabrielmbmb (Member) left a comment

LGTM! Much needed step :)

src/distilabel/steps/filtering/embedding.py (outdated, resolved)
src/distilabel/steps/generators/huggingface.py (outdated, resolved)
@plaguss plaguss added this to the 1.4.0 milestone Sep 6, 2024
@plaguss plaguss merged commit de2bed0 into develop Sep 6, 2024
7 checks passed
@plaguss plaguss deleted the dedup-embeddings branch September 6, 2024 15:08