Added better functionality for label selection #713

Merged
merged 6 commits into from
Feb 16, 2024

@Vaibhav2001 Vaibhav2001 commented Feb 13, 2024

Pull Review Summary

Continued from #561. This closes #712

Description

  • We can now select the top K labels to pass to the LLM by enabling label_selection. The number of labels is set through label_selection_count, and a score cutoff through label_selection_threshold.
  • Enabled caching for storing embeddings.
  • Few-shot examples will only contain the correct examples that have one of the provided labels (this only works with the "fixed" algorithm for now).
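
The caching and few-shot filtering described above could look roughly like the sketch below. This is not the code from this PR: `CachedEmbedder`, `filter_few_shot`, the `embed_fn` callable, and the `"label"` key are hypothetical names chosen for illustration. The cache here is a plain in-memory dict, which is an assumption, since the description does not say where embeddings are stored. The filter assumes "the provided label" means the labels chosen by label selection.

```python
from typing import Callable, Dict, List, Sequence


class CachedEmbedder:
    """Wraps an embedding function so each distinct text is embedded once.

    Hypothetical helper: `embed_fn` stands in for whatever embedding model
    is configured; the in-memory dict is an assumed storage choice.
    """

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    def embed(self, text: str) -> List[float]:
        # Only call the (potentially expensive) model on a cache miss.
        if text not in self._cache:
            self._cache[text] = list(self._embed_fn(text))
        return self._cache[text]


def filter_few_shot(
    examples: List[dict],
    selected_labels: Sequence[str],
    label_key: str = "label",  # assumed name of the label column
) -> List[dict]:
    """Keep only the seed examples whose label is among the selected labels."""
    allowed = set(selected_labels)
    return [ex for ex in examples if ex.get(label_key) in allowed]
```

Label embeddings are a natural fit for such a cache, because the same label texts are compared against every input row.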

Benefits:

  • Accuracy is similar to base classification, but the completion rate is better when num_labels is huge. Without label selection, the model prediction often does not map exactly to any label (mostly because of the large context length).
  • Cost reduction as the number of tokens passed to the final model can be small.

Usage:

config = {
    "task_name": "BankingComplaintsClassification",
    "task_type": "classification",
        ...
        "labels": label_descriptions,
        "label_selection": True,
        "label_selection_count": 10,
        "label_selection_threshold": 0.5,
        ...
    },
}
  • Can use label descriptions to embed the labels.
  • "label_selection_count": -1 sets the maximum number of selected labels to len(labels). Defaults to min(10, num_labels).
  • label_selection_threshold keeps only the labels where label_score / topScore > threshold, for more effective filtering. This value defaults to 0.95 (based on experimentation).
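
The selection rules in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in label_selector.py: the function names are hypothetical, cosine similarity is an assumed scoring function, and the embeddings are assumed to be precomputed. Only the count default, the meaning of -1, and the ratio test come from the description.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_labels(
    input_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],  # label -> embedding (assumed precomputed)
    count: Optional[int] = None,             # mirrors label_selection_count
    threshold: float = 0.95,                 # mirrors label_selection_threshold
) -> List[str]:
    num_labels = len(label_vecs)
    if count is None:
        count = min(10, num_labels)          # default from the description
    elif count == -1:
        count = num_labels                   # -1 means "up to all labels"

    # Score every label against the input and rank from best to worst.
    scored = sorted(
        ((cosine(input_vec, vec), label) for label, vec in label_vecs.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored:
        return []

    top_score = scored[0][0]
    top_k = scored[:count]
    if top_score <= 0:
        # Assumption: the ratio is not meaningful here, so skip the filter.
        return [label for _, label in top_k]

    # Keep labels whose score relative to the best score exceeds the threshold.
    return [label for score, label in top_k if score / top_score > threshold]


label_vecs = {"card_lost": [1.0, 0.0], "card_stolen": [0.9, 0.1], "loan": [0.0, 1.0]}
print(select_labels([1.0, 0.0], label_vecs))  # ['card_lost', 'card_stolen']
```

With the default threshold of 0.95, only labels scoring within about 5% of the best match survive, so the count acts as an upper bound rather than a fixed size.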

Type of change

  • Improving existing feature
  • This change requires a documentation update

Tests

Ran benchmarking tests for different datasets.

Future Work

  • Do not classify if there is only one selected_label. This requires changing the annotation processing in dataset.process_labels.
  • Cost tracking for the embedding model. However, this cost is typically upper bounded by that of the base model and is cheaper by a factor of up to 1000.

@Vaibhav2001 Vaibhav2001 self-assigned this Feb 13, 2024
@rajasbansal rajasbansal left a comment

lgtm - after small fixes

@Vaibhav2001 Vaibhav2001 merged commit 0cbc022 into main Feb 16, 2024
2 checks passed
@Vaibhav2001 Vaibhav2001 deleted the benchmark branch February 16, 2024 22:41
Successfully merging this pull request may close these issues.

Handle larger taxonomies with top k label selection to pass to the LLM
2 participants