
UPDATED: Add caching to quickly answer repeated questions #62

Open
r-i-c-e-b-o-y opened this issue Mar 25, 2025 · 2 comments
Assignees: r-i-c-e-b-o-y
Labels: enhancement (New feature or request)

r-i-c-e-b-o-y commented Mar 25, 2025

Feature Request: Hybrid Caching Strategy for Similar Questions

Overview

This feature proposal introduces a hybrid caching strategy for Maeser that leverages both fuzzy string matching and embedding-based semantic similarity. The goal is to efficiently cache answers for questions asked by many students—even when the wording varies slightly—without requiring an excessively large cache.

Key Objectives:

  • Reduce API Costs: Avoid unnecessary API calls by caching similar questions.
  • Improve Response Time: Quickly return answers for similar or near-identical queries.
  • Maintain Accuracy: Use semantic analysis when fuzzy matching indicates a moderate similarity.
  • Ease of Management: Implement an eviction and update policy to keep the cache current and within size limits.

Proposed Approach

When a new question is asked, the following three-step process is used:

  1. Fuzzy Matching (Lexical Check):

    • High Threshold (≥95%):
      If a cached question scores 95% or higher using fuzzy matching (e.g., with the fuzzywuzzy library), we assume it is the same and return the cached answer immediately.
    • Moderate Threshold (80% to 95%):
      If the fuzzy match score is between 80% and 95%, use embedding-based semantic comparison for further verification.
    • Low Threshold (<80%):
      If the fuzzy match score is less than 80%, the question is considered too different, and a fresh API call will be made.
  2. Embedding-Based Semantic Comparison:

    • For questions with moderate fuzzy scores, generate embeddings (using, for example, OpenAI's text-embedding-ada-002).
    • Compare the new question's embedding with that of the cached question using cosine similarity.
    • If the cosine similarity is above a set threshold (e.g., around 0.9), treat it as a cache hit.
  3. Fresh API Call:

    • If no cached answer is found through the above checks, call the API, cache the new question and answer, and return the result.

Caching Process

1. Adding to the Cache

When a new answer is generated, cache the following:

  • The question text
  • The generated answer
  • The embedding for the question
  • A timestamp for tracking purposes (for TTL or update policies)

Below is the Python code to generate embeddings and cache an answer:

import time
import numpy as np
import openai

# Set your OpenAI API key.
# Note: this snippet uses the pre-1.0 OpenAI Python SDK interface (openai<1.0);
# newer SDK versions expose embeddings via client.embeddings.create() instead.
openai.api_key = "YOUR_OPENAI_API_KEY"

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> np.ndarray:
    """
    Generate an embedding for the provided text using the OpenAI API.
    """
    response = openai.Embedding.create(input=[text], model=model)
    embedding = response['data'][0]['embedding']
    return np.array(embedding)

# In-memory cache: {question: {"embedding": np.ndarray, "answer": str, "timestamp": float}}
embedding_cache = {}

def cache_answer(question: str, answer: str):
    """
    Cache the answer along with the generated embedding and a timestamp.
    """
    embedding = get_embedding(question)
    embedding_cache[question] = {
        "embedding": embedding,
        "answer": answer,
        "timestamp": time.time()
    }

2. Retrieving from the Cache

A. Fuzzy Matching Code (Using fuzzywuzzy)

Below, the fuzzy matching code compares the new question to each cached question and returns a hit if the score is high enough:

from fuzzywuzzy import fuzz

def get_cached_answer_with_fuzzy(new_question: str, threshold: int = 95):
    """
    Return a cached answer if any cached question has a fuzzy match score
    above the given threshold.
    """
    for cached_question, data in embedding_cache.items():
        score = fuzz.token_set_ratio(new_question.lower(), cached_question.lower())
        if score >= threshold:
            print(f"Fuzzy match hit: '{cached_question}' with score {score}")
            return data["answer"]
    return None

B. Combined Retrieval Using Fuzzy Matching and Embedding-Based Validation

For cases where the fuzzy match score is moderate (80%-95%), we use embeddings to verify semantic similarity:

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Calculate the cosine similarity between two vectors.
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_cached_answer(new_question: str, fuzzy_high: int = 95, fuzzy_low: int = 80, cosine_threshold: float = 0.9):
    """
    Retrieve a cached answer by first using fuzzy matching to find the best candidate.
    If the best fuzzy match score is:
      - ≥ fuzzy_high: Return the cached answer immediately.
      - Between fuzzy_low and fuzzy_high: Validate with embedding cosine similarity.
      - < fuzzy_low: Consider it too different, and return None.
    """
    best_score = 0
    best_candidate = None

    # Step 1: Fuzzy matching to find the best candidate
    for cached_question, data in embedding_cache.items():
        score = fuzz.token_set_ratio(new_question.lower(), cached_question.lower())
        if score > best_score:
            best_score = score
            best_candidate = (cached_question, data)

    if best_candidate is None:
        return None

    # If fuzzy score is very high, return immediately.
    if best_score >= fuzzy_high:
        print(f"Immediate fuzzy match: '{best_candidate[0]}' with score {best_score}")
        return best_candidate[1]["answer"]

    # For moderate fuzzy scores, verify with embeddings.
    if fuzzy_low <= best_score < fuzzy_high:
        new_embedding = get_embedding(new_question)
        candidate_embedding = best_candidate[1]["embedding"]
        cosine_sim = cosine_similarity(new_embedding, candidate_embedding)
        if cosine_sim >= cosine_threshold:
            print(f"Embedding match: '{best_candidate[0]}' with cosine similarity {cosine_sim:.2f}")
            return best_candidate[1]["answer"]

    # If no candidate is sufficiently similar, return None.
    return None

C. Usage Example: Combining Both Methods

Below is an example function that demonstrates how to combine fuzzy matching and embedding-based verification. It first checks for a strong fuzzy match, then falls back on embedding-based comparison if needed:

def retrieve_answer(new_question: str):
    """
    Retrieve an answer from the cache using a hybrid approach:
    1. Try a direct fuzzy match (threshold ≥ 95%).
    2. If not, try combined fuzzy and embedding-based verification.
    3. If still no match, return None (to trigger an API call).
    """
    # First, try a direct fuzzy match.
    answer = get_cached_answer_with_fuzzy(new_question, threshold=95)
    if answer:
        return answer

    # Next, try the combined approach.
    answer = get_cached_answer(new_question, fuzzy_high=95, fuzzy_low=80, cosine_threshold=0.9)
    return answer

# Example usage:
question = "How do I implement caching for similar questions?"
cached_answer = retrieve_answer(question)
if cached_answer:
    print("Cached answer found:", cached_answer)
else:
    print("No cached answer found. Proceeding with API call.")

3. Removing and Updating Cache Entries

A. Cache Eviction (TTL-Based)

Remove entries older than a specified Time-To-Live (TTL):

CACHE_TTL = 3600  # 1 hour in seconds

def remove_expired_entries():
    """
    Remove cache entries that have exceeded their TTL.
    """
    current_time = time.time()
    keys_to_remove = [key for key, data in embedding_cache.items()
                      if current_time - data["timestamp"] > CACHE_TTL]
    for key in keys_to_remove:
        del embedding_cache[key]
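
TTL eviction bounds staleness but not size. The Key Objectives also call for keeping the cache within size limits; below is a minimal sketch of a size cap, assuming the oldest entries are the best eviction candidates (MAX_CACHE_SIZE is an illustrative value):

MAX_CACHE_SIZE = 1000  # illustrative cap on the number of cached questions

def enforce_size_limit():
    """
    Evict the oldest entries once the cache grows past MAX_CACHE_SIZE.
    """
    excess = len(embedding_cache) - MAX_CACHE_SIZE
    if excess <= 0:
        return
    # Sort keys by timestamp so the stalest entries are dropped first.
    oldest_keys = sorted(embedding_cache, key=lambda k: embedding_cache[k]["timestamp"])
    for key in oldest_keys[:excess]:
        del embedding_cache[key]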

B. Updating an Existing Cache Entry

If the answer for a cached question changes or needs refreshing:

def update_cache(question: str, new_answer: str):
    """
    Update an existing cache entry with a new answer and a refreshed embedding.
    """
    embedding = get_embedding(question)
    embedding_cache[question] = {
        "embedding": embedding,
        "answer": new_answer,
        "timestamp": time.time()
    }

Overall Process Flow

User Query:

  • Attempt to retrieve an answer from the cache using retrieve_answer(new_question).
  • If a cached answer is found (either via direct fuzzy match or combined verification), return it.

API Call & Caching:

  • If no cached answer is found, perform an API call to obtain the answer.
  • Cache the new question and answer using cache_answer(question, answer), as sketched below.
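
Putting these pieces together, the end-to-end flow might look like the following sketch, where call_llm is a hypothetical placeholder for whatever completion call Maeser already makes:

def answer_question(question: str) -> str:
    """
    End-to-end flow: try the cache first, then fall back to a fresh API call.
    """
    cached = retrieve_answer(question)
    if cached is not None:
        return cached
    # Cache miss: call_llm is a hypothetical stand-in for the existing
    # completion call and should be replaced with the real one.
    answer = call_llm(question)
    cache_answer(question, answer)
    return answer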

Maintenance:

  • Regularly run remove_expired_entries() to keep the cache manageable (see the sketch after this list).
  • Monitor cache hit rates and adjust thresholds (for both fuzzy matching and cosine similarity) as needed.
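
One minimal way to run this maintenance regularly, assuming a single long-running process, is a self-rearming daemon timer (a production deployment would likely also want a lock around cache mutations):

import threading

def start_cache_maintenance(interval_seconds: float = 600.0):
    """
    Periodically run remove_expired_entries() on a background daemon thread.
    """
    def _run():
        remove_expired_entries()
        # Re-arm the timer for the next pass.
        timer = threading.Timer(interval_seconds, _run)
        timer.daemon = True
        timer.start()
    _run()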

Effectiveness and Tuning

Fuzzy Matching Thresholds:

  • ≥95%: Treat as an exact or nearly identical match.
  • 80%–95%: Trigger embedding-based semantic checks.
  • <80%: Consider queries different, leading to a fresh API call.

Embedding Similarity:

  • A cosine similarity threshold of around 0.9 is a good starting point but may require tuning with actual data, as in the sketch below.
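
A minimal tuning harness, assuming a small hand-labeled set of question pairs (pairs below is illustrative data), could sweep candidate thresholds and report how often each one classifies the pairs correctly:

def tune_cosine_threshold(pairs, thresholds=(0.80, 0.85, 0.90, 0.95)):
    """
    pairs: list of (question_a, question_b, should_match) tuples, where
    should_match is True when both questions deserve the same cached answer.
    """
    scored = [(cosine_similarity(get_embedding(a), get_embedding(b)), label)
              for a, b, label in pairs]
    for threshold in thresholds:
        correct = sum(1 for sim, label in scored if (sim >= threshold) == label)
        print(f"threshold={threshold:.2f}: accuracy={correct / len(scored):.2%}")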

Benefits:

  • Speed: Fuzzy matching is fast and handles near-identical queries quickly.
  • Accuracy: Embedding verification ensures semantically similar queries are recognized.
  • Cost Efficiency: Reduces expensive API calls by caching similar questions intelligently.
  • Scalability: Regular eviction and update processes keep the cache size manageable.

Conclusion

Implementing this hybrid caching strategy—combining fuzzy string matching with embedding-based semantic checks—will enable Maeser to quickly and efficiently respond to similar questions from students. This approach minimizes API costs, improves response times, and maintains answer accuracy. Feedback and fine-tuning based on real-world usage will be key to achieving optimal performance.

@r-i-c-e-b-o-y r-i-c-e-b-o-y added the enhancement New feature or request label Mar 25, 2025
@r-i-c-e-b-o-y r-i-c-e-b-o-y changed the title Add caching of used context Add caching to quickly answer repeated questions Mar 25, 2025
@r-i-c-e-b-o-y r-i-c-e-b-o-y self-assigned this Mar 25, 2025
@r-i-c-e-b-o-y r-i-c-e-b-o-y changed the title Add caching to quickly answer repeated questions UPDATED: Add caching to quickly answer repeated questions Mar 28, 2025
@r-i-c-e-b-o-y (Contributor, Author) commented:
After I am done with the pipeline RAG, this will be my last project before I go on my mission.

@wirthlin (Contributor) commented:
It would be interesting if you could keep track of how many times each question is asked. That way, frequently asked questions would carry more weight than questions that are asked only once.
