community[minor]: Opensearch hybridsearch implementation #25375

@karthikbharadhwajKB karthikbharadhwajKB commented Aug 14, 2024

community: add hybrid search in opensearch

Langchain OpenSearch Hybrid Search Implementation

Implementation of Hybrid Search:

I have taken LangChain's OpenSearch integration to the next level by adding hybrid search capabilities. Building on the existing OpenSearchVectorSearch class, I have implemented Hybrid Search functionality (which combines the best of both keyword and semantic search). This new functionality allows users to harness the power of OpenSearch's advanced hybrid search features without leaving the familiar LangChain ecosystem. By blending traditional text matching with vector-based similarity, the enhanced class delivers more accurate and contextually relevant results. It's designed to seamlessly fit into existing LangChain workflows, making it easy for developers to upgrade their search capabilities.

In implementing the hybrid search for OpenSearch within the LangChain framework, I also incorporated filtering capabilities. It's important to note that according to the OpenSearch hybrid search documentation, only post-filtering is supported for hybrid queries. This means that the filtering is applied after the hybrid search results are obtained, rather than during the initial search process.
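To make the post-filtering point concrete, here is a minimal sketch of what a hybrid query body with a post-filter could look like. The helper name build_hybrid_query, the field names text and vector_field, and the example filter are assumptions for illustration only; the request body built by this PR may differ.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(
    query_text: str,
    query_vector: List[float],
    k: int = 4,
    text_field: str = "text",            # assumed field name
    vector_field: str = "vector_field",  # assumed field name
    post_filter: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a hybrid (keyword + k-NN) search body with an optional post-filter."""
    body: Dict[str, Any] = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Sub-query 1: lexical match on the text field.
                    {"match": {text_field: {"query": query_text}}},
                    # Sub-query 2: k-NN search on the vector field.
                    {"knn": {vector_field: {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }
    if post_filter is not None:
        # The filter sits outside the hybrid query, so it is applied to the
        # hits after the hybrid search has produced them.
        body["post_filter"] = post_filter
    return body
```

Because the filter runs after retrieval, fewer than k documents may come back when some of the top hits are filtered out.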

Note: For the implementation of hybrid search, I strictly followed the official OpenSearch Hybrid search documentation and I took inspiration from https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search
Thanks Mate!

Experiments

I conducted a few experiments to verify that the hybrid search implementation is accurate and capable of reproducing the results of both plain keyword search and vector search.

Experiment - 1
Hybrid Search
Keyword_weight: 1, vector_weight: 0

I conducted an experiment to verify the accuracy of my hybrid search implementation by comparing it to a plain keyword search. For this test, I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid search, effectively giving full weight to the keyword component. The results from this hybrid search configuration matched those of a plain keyword search, confirming that my implementation can reproduce keyword-only search results when needed. Note that while the returned documents were the same, the scores differed between the two methods. This difference is expected: the plain keyword search in OpenSearch reports raw BM25 scores, whereas the hybrid search still runs both the keyword and vector searches and normalizes their scores before weighting, even when the vector component is given zero weight. This experiment shows that my hybrid search solution handles the keyword search component correctly and applies the weights as intended.

Experiment - 2
Hybrid Search
keyword_weight = 0.0, vector_weight = 1.0

For experiment-2, I took the inverse approach to further validate my hybrid search implementation. I set the keyword_weight to 0 and the vector_weight to 1, effectively giving full weight to the vector search component (KNN search). I then compared these results with a pure vector search. The outcome was consistent with my expectations: the documents returned by the hybrid search with these settings exactly matched those from a standalone vector search. This confirms that my implementation reproduces vector search results when configured to do so. As with the first experiment, the returned documents were identical but the scores differed between the two methods. This difference is expected and can be attributed to the normalization step in hybrid search, which still processes both components even when one is given zero weight. This experiment shows that my hybrid search solution can emulate pure vector search when needed while keeping the underlying hybrid search structure.

Experiment - 3
Hybrid Search - balanced

keyword_weight = 0.5, vector_weight = 0.5

For experiment-3, I adopted a balanced approach to further evaluate the effectiveness of my hybrid search implementation. In this test, I set both the keyword_weight and vector_weight to 0.5, giving equal importance to keyword-based and vector-based search components. This configuration aims to leverage the strengths of both search methods simultaneously. By setting both weights to 0.5, I intended to create a scenario where the hybrid search would consider lexical matches and semantic similarity equally. This balanced approach is often ideal for many real-world applications, as it can capture both exact keyword matches and contextually relevant results that might not contain the exact search terms.

Kindly verify the notebook for the experiments conducted!

Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

Instructions to follow for Performing Hybrid Search:

Step-1: Instantiating OpenSearchVectorSearch Class:

import os

from langchain_community.vectorstores import OpenSearchVectorSearch

# embedding_model is assumed to be defined elsewhere in the code.
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"), os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

Parameters:

  1. index_name: The name of the OpenSearch index to use.
  2. embedding_function: The function or model used to generate embeddings for the documents. It's assumed that embedding_model is defined elsewhere in the code.
  3. opensearch_url: The URL of the OpenSearch instance.
  4. http_auth: A tuple containing the username and password for authentication.
  5. use_ssl: Set to False, indicating that the connection to OpenSearch is not using SSL/TLS encryption.
  6. verify_certs: Set to False, which means the SSL certificates are not being verified. This is often used in development environments but is not recommended for production.
  7. ssl_assert_hostname: Set to False, disabling hostname verification in SSL certificates.
  8. ssl_show_warn: Set to False, suppressing SSL-related warnings.

Step-2: Configure Search Pipeline:

To use hybrid search, you first need to configure a search pipeline.

Implementation Details:

This method configures a search pipeline in OpenSearch that:

  1. Normalizes the scores from both keyword and vector searches using the min-max technique.
  2. Applies the specified weights to the normalized scores.
  3. Calculates the final score using an arithmetic mean of the weighted, normalized scores.
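The three steps above can be sketched in plain Python. This is a simplified illustration, not the code OpenSearch runs: the function names min_max and combine are hypothetical, a document missing from one result list is simply treated as scoring 0 there, and OpenSearch's normalization processor has its own edge-case handling that is not reproduced here.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: assumption for this sketch, map them all to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    keyword_weight: float,
    vector_weight: float,
) -> Dict[str, float]:
    """Weighted arithmetic mean of the min-max normalized scores."""
    total = keyword_weight + vector_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    return {
        doc: (keyword_weight * kw.get(doc, 0.0) + vector_weight * vec.get(doc, 0.0)) / total
        for doc in set(kw) | set(vec)
    }
```

With keyword_weight=1 and vector_weight=0, the ranking produced by combine follows the keyword scores alone, but the values are normalized rather than raw BM25 scores, which is consistent with the score differences observed in the experiments.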

Parameters:

  • pipeline_name (str): A unique identifier for the search pipeline. It's recommended to use a descriptive name that indicates the weights used for keyword and vector searches.
  • keyword_weight (float): The weight assigned to the keyword search component. This should be a float value between 0 and 1. In this example, 0.3 gives 30% importance to traditional text matching.
  • vector_weight (float): The weight assigned to the vector search component. This should be a float value between 0 and 1. In this example, 0.7 gives 70% importance to semantic similarity.

opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
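For readers who want to see what such a pipeline looks like on the OpenSearch side, the sketch below builds a search pipeline body using the normalization processor described in the OpenSearch hybrid search documentation. The helper name build_search_pipeline_body and the description text are assumptions for illustration; the body that configure_search_pipelines actually sends may differ.

```python
from typing import Any, Dict


def build_search_pipeline_body(keyword_weight: float, vector_weight: float) -> Dict[str, Any]:
    """Build a search pipeline body: min-max normalization + weighted arithmetic mean."""
    for name, weight in (("keyword_weight", keyword_weight), ("vector_weight", vector_weight)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {weight}")
    return {
        "description": "Post-processor for hybrid search",  # free-text, assumed wording
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Order follows the sub-queries of the hybrid query:
                        # keyword first, vector second (assumed ordering).
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }
```

A body like this is registered with a PUT request to /_search/pipeline/<pipeline_name>. The OpenSearch documentation also appears to expect the weights to sum to 1.0, which every configuration in the experiments above respects.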

Step-3: Performing Hybrid Search:

After creating the search pipeline, you can perform a hybrid search using similarity_search(), similarity_search_with_score(), or any other search method supported by LangChain. The search combines keyword-based and semantic similarity searches on your OpenSearch index, leveraging the strengths of both traditional information retrieval and vector embedding techniques.

Parameters:

  • query: The search query string.
  • k: The number of top results to return (in this case, 3).
  • search_type: Set to hybrid_search to use both keyword and vector search capabilities.
  • search_pipeline: The name of the previously created search pipeline.

query = "what are the country named in our database?"

top_k = 3

pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
    query=query,
    k=top_k,
    search_type="hybrid_search",
    search_pipeline=pipeline_name,
)

matched_docs
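In LangChain, similarity_search_with_score returns a list of (document, score) tuples. As a small sketch of how the result could be inspected, the helper below prints one line per hit. The Doc dataclass is a stand-in for LangChain's Document so the snippet runs on its own, and summarize_hits is a hypothetical helper, not part of this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Stand-in for LangChain's Document class (illustration only)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize_hits(hits: List[Tuple[Doc, float]], preview_chars: int = 60) -> List[str]:
    """Format (document, score) pairs as ranked, one-line summaries."""
    lines = []
    for rank, (doc, score) in enumerate(hits, start=1):
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. score={score:.4f} | {preview}")
    return lines
```

With a real vector store, calling summarize_hits(matched_docs) should work the same way, since Document also exposes page_content.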

twitter handle: @iamkarthik98


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. langchain Related to the langchain package Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 14, 2024
@ccurme ccurme added community Related to langchain-community and removed langchain Related to the langchain package labels Aug 14, 2024
@eyurtsev eyurtsev self-assigned this Aug 14, 2024
@eyurtsev
Collaborator

Hi @karthikbharadhwajKB, thank you for the contribution!

We need the following changes to get this code merged:

  1. Add integration tests that show the code is working as expected
  2. Use the opensearch client instead of issuing separate network requests via the requests library

@karthikbharadhwajKB
Contributor Author

karthikbharadhwajKB commented Aug 15, 2024

Hey @eyurtsev,

  • [✅] Use the OpenSearch client instead of issuing separate network requests via the requests library - I have removed the requests library and now use the OpenSearch client API for configuring the search pipeline and performing hybrid search.

  • [⚒️] Add integration tests that show the code is working as expected - I'm working on it...!

Here you can see the working hybrid search feature and some experiments, in which I tried to reproduce the exact keyword search and approximate_search results by tweaking keyword_weight and vector_weight.
Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb
I also tested Hybrid_Search, Hybrid_Search with Post filter & default Approximate_Search functionalities.
Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Tests.ipynb

@karthikbharadhwajKB
Contributor Author

Hey buddy @eyurtsev

  • [✅] Add integration tests that show the code is working as expected - I have added integration tests for configuring_search_pipeline, hybrid_search, hybrid_search_with_score & hybrid_search_with_post_filter.

Please have a look. If everything seems fine, please merge it! 🚀

@karthikbharadhwajKB
Contributor Author

Hey @eyurtsev, I have made the changes that you requested. Please check and let me know.

@eyurtsev eyurtsev changed the title Opensearch hybridsearch implementation community[minor]: Opensearch hybridsearch implementation Oct 8, 2024
@karthikbharadhwajKB
Contributor Author

Hey @eyurtsev, I have made the changes that you requested.

Please merge if everything looks fine 🚀

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Oct 17, 2024
@eyurtsev eyurtsev enabled auto-merge (squash) October 17, 2024 22:06
@eyurtsev eyurtsev disabled auto-merge October 17, 2024 22:06
@eyurtsev eyurtsev merged commit 498f024 into langchain-ai:master Dec 13, 2024
19 checks passed
Labels
community Related to langchain-community 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
Status: Done
4 participants