
Support for multiple embedding providers (Huggingface, etc.) #404

Merged
merged 8 commits into from
Jun 28, 2023

Conversation

Abhinav-Naikawadi
Contributor

No description provided.

pyproject.toml Outdated
```diff
@@ -64,10 +65,12 @@ anthropic = [
     "anthropic >= 0.2.6"
 ]
 huggingface = [
-    "transformers >= 4.25.0"
+    "transformers >= 4.25.0",
+    "accelerate == 0.20.3"
```
Contributor Author

This dependency is needed for Hugging Face pipelines support.

Contributor

@Abhinav-Naikawadi is this only for GPU inference?

```diff
@@ -35,6 +35,7 @@ dependencies = [
     "torch >= 1.10.0",
     "matplotlib >= 3.5.0",
     "wget >= 3.2",
+    "ipywidgets == 8.0.6",
```
Contributor Author

This dependency is needed for Jupyter notebook support for sentence-transformers progress bars.

```diff
@@ -108,6 +118,15 @@ def confidence(self) -> bool:
         """Returns true if the model is able to return a confidence score along with its predictions"""
         return self._model_config.get(self.COMPUTE_CONFIDENCE_KEY, False)

+    # Embedding config
+    def embedding_provider(self) -> str:
```
Contributor Author

We use the LLM provider when an embedding model provider is not specified.
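A minimal sketch of that fallback behavior (class and key names here are illustrative, not the merged autolabel code): if the config has no embedding provider, the LLM provider is reused.

```python
# Sketch of the fallback described above; ConfigSketch and its key names
# are hypothetical stand-ins for the real AutolabelConfig implementation.
class ConfigSketch:
    EMBEDDING_CONFIG_KEY = "embedding"

    def __init__(self, config: dict):
        self._config = config

    def provider(self) -> str:
        """Provider of the LLM itself."""
        return self._config["model"]["provider"]

    def embedding_provider(self) -> str:
        """Fall back to the LLM provider when no embedding provider is set."""
        embedding = self._config.get(self.EMBEDDING_CONFIG_KEY, {})
        return embedding.get("provider", self.provider())
```

With only a model section, embedding_provider() returns the LLM provider; an explicit embedding.provider overrides it.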

@nihit
Contributor

nihit commented Jun 28, 2023

Addresses #370

Contributor

@nihit nihit left a comment

@Abhinav-Naikawadi

  1. Please add example configs in the PR description for using HuggingFace and VertexAI embeddings on one of the benchmark tasks.
  2. Add tests verifying that the embedding config is correctly used in the few-shot initialization class.
  3. Make the relevant changes to the config schema for the new embedding key - https://github.com/refuel-ai/autolabel/blob/main/src/autolabel/configs/schema.py
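For item 3, one possible shape for the new key is a fragment like the following (hypothetical sketch; the merged contents of schema.py may differ), mirroring the provider/model structure of the example configs later in this thread:

```python
# Hypothetical JSON-schema fragment for the new "embedding" key;
# not the actual merged contents of src/autolabel/configs/schema.py.
EMBEDDING_SCHEMA = {
    "type": "object",
    "properties": {
        "provider": {"type": "string"},
        "model": {"type": "string"},
    },
    "required": ["provider"],
    "additionalProperties": False,
}
```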


```python
ALGORITHM_TO_IMPLEMENTATION: Dict[FewShotAlgorithm, BaseExampleSelector] = {
    FewShotAlgorithm.FIXED: FixedExampleSelector,
    FewShotAlgorithm.SEMANTIC_SIMILARITY: SemanticSimilarityExampleSelector,
    FewShotAlgorithm.MAX_MARGINAL_RELEVANCE: MaxMarginalRelevanceExampleSelector,
}

PROVIDER_TO_MODEL: Dict[ModelProvider, Embeddings] = {
    ModelProvider.ANTHROPIC: OpenAIEmbeddings,
```
Contributor

We should do this more transparently instead of mapping anthropic and refuel to OpenAIEmbeddings under the hood:

  1. Let's only have entries here for providers that actually offer embedding endpoints: openai, google, huggingface pipelines.
  2. Define a "default" provider - this can be OpenAIEmbeddings() - to be used when the input provider does not offer embeddings.
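The suggestion above could look roughly like this (a dependency-free sketch; strings stand in for the langchain Embeddings classes, and all names are assumptions rather than the merged code):

```python
from typing import Dict

# Only providers that actually expose embedding endpoints get an entry;
# every other provider falls back to the default transparently.
PROVIDER_TO_EMBEDDINGS: Dict[str, str] = {
    "openai": "OpenAIEmbeddings",
    "google": "VertexAIEmbeddings",
    "huggingface_pipeline": "HuggingFaceEmbeddings",
}

DEFAULT_EMBEDDINGS = "OpenAIEmbeddings"

def resolve_embeddings(provider: str) -> str:
    """Return the embedding backend for a provider, defaulting explicitly."""
    return PROVIDER_TO_EMBEDDINGS.get(provider, DEFAULT_EMBEDDINGS)
```

This keeps the fallback visible in one place instead of hiding anthropic or refuel behind an OpenAI entry in the map itself.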

```python
from autolabel.configs import AutolabelConfig
from autolabel.schema import FewShotAlgorithm, ModelProvider
from langchain.embeddings import (
    HuggingFaceEmbeddings,
```
Contributor

What is the default model from sentence-transformers that is used here?

Contributor Author

The default for Hugging Face is sentence-transformers/all-mpnet-base-v2. The default for Vertex AI is textembedding-gecko@001.

pyproject.toml Outdated

```diff
 ]
 google = [
-    "google-cloud-aiplatform>=1.25.0"
+    "google-cloud-aiplatform>=1.25.0",
+    "google-generativeai"
```
Contributor

Do we still need this if we're using VertexAIEmbeddings?

@Abhinav-Naikawadi
Contributor Author

Example config for huggingface embeddings:

```json
{
  "task_name": "BankingComplaintsClassification",
  "task_type": "classification",
  "dataset": {
    "label_column": "label",
    "delimiter": ","
  },
  "model": {
    "provider": "huggingface_pipeline",
    "name": "google/flan-t5-small"
  },
  "embedding": {
    "provider": "huggingface_pipeline",
    "model": "sentence-transformers/all-mpnet-base-v2"
  },
  ...
```

@Abhinav-Naikawadi
Contributor Author

Example config with google (vertexai) embeddings:

```json
{
  "task_name": "BankingComplaintsClassification",
  "task_type": "classification",
  "dataset": {
    "label_column": "label",
    "delimiter": ","
  },
  "model": {
    "provider": "google",
    "name": "gpt-3.5-turbo"
  },
  "embedding": {
    "provider": "google"
  },
  ...
```
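Since the google example above omits embedding.model, the Vertex AI default quoted earlier in this thread (textembedding-gecko@001) would apply. A dependency-free sketch of that resolution (helper and constant names are assumptions, not the merged code):

```python
from typing import Optional

# Per-provider default embedding models quoted in this thread.
DEFAULT_EMBEDDING_MODELS = {
    "huggingface_pipeline": "sentence-transformers/all-mpnet-base-v2",
    "google": "textembedding-gecko@001",
}

def embedding_model(config: dict) -> Optional[str]:
    """Resolve the embedding model for a config, applying per-provider defaults."""
    embedding = config.get("embedding", {})
    # Reuse the LLM provider when no embedding provider is specified
    provider = embedding.get("provider", config["model"]["provider"])
    return embedding.get("model", DEFAULT_EMBEDDING_MODELS.get(provider))
```

An explicit embedding.model (as in the huggingface example) always wins over the default.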

@nihit
Contributor

nihit commented Jun 28, 2023

Can merge once the tests are passing and validated.

@Abhinav-Naikawadi Abhinav-Naikawadi merged commit f332713 into main Jun 28, 2023
@Abhinav-Naikawadi Abhinav-Naikawadi deleted the huggingface_embeddings branch June 28, 2023 22:42