Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: destination vector db add option for embedding without field names #91

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

frankiebromage
Copy link

@frankiebromage frankiebromage commented Nov 27, 2024

Description

Responding to issue #48627

Including field names when embedding text changes the semantic meaning of a vector, so it is useful to have the option of embedding only the text without the field names.

This is my first contribution, so please let me know what else is needed!

Changes

Adds an additional config to allow user to choose whether or not to embed field names with a vector db destination.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Introduced a new configuration option to control the inclusion of field names in embeddings, enhancing flexibility in document processing.
  • Bug Fixes

    • Improved the document generation logic to conditionally omit field names based on the new configuration setting.
  • Tests

    • Added tests for the new omit_field_names_from_embeddings field in the configuration model to ensure proper functionality.
  • Chores

    • Updated the CI workflow to include the new connector and enhance job execution conditions and output messages.

Resolves airbytehq/airbyte#48627 (comment)

Copy link
Contributor

coderabbitai bot commented Nov 27, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes introduce a new boolean field, omit_field_names_from_embeddings, to the ProcessingConfigModel class, allowing control over whether field names are included in embeddings. This new field is integrated into the DocumentProcessor class, affecting how text is generated from records. Additionally, the unit tests have been updated to reflect this new configuration option, ensuring the functionality is covered. Overall, the modifications enhance the flexibility of text processing within the Airbyte CDK.

Changes

File Path Change Summary
airbyte_cdk/destinations/vector_db_based/config.py Added field omit_field_names_from_embeddings: bool to ProcessingConfigModel.
airbyte_cdk/destinations/vector_db_based/document_processor.py Added variable omit_field_names_from_embeddings to DocumentProcessor and methods _generate_text_from_fields and _extract_values_from_dict.
unit_tests/destinations/vector_db_based/config_test.py Added field omit_field_names_from_embeddings to ProcessingConfigModel in the test file.
.github/workflows/connector-tests.yml Updated CI workflow to include new connector destination-pgvector and enhanced job conditions.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Config
    participant Processor

    User->>Config: Set omit_field_names_from_embeddings to true
    Config-->>Processor: Pass configuration
    Processor->>Processor: Check omit_field_names_from_embeddings
    alt If true
        Processor->>Processor: Call _extract_values_from_dict
    else If false
        Processor->>Processor: Call stringify_dict
    end
    Processor-->>User: Return processed document
Loading

Assessment against linked issues

Objective Addressed Explanation
Allow embedding without field names to improve semantic meaning (#48627)

Possibly related PRs

Suggested labels

enhancement, ci

Suggested reviewers

  • aaronsteers

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

166-169: Consider simplifying the boolean check and fixing indentation.

The logic is correct, but we could make it more Pythonic. What do you think about these small improvements? 🤔

-        if self.omit_field_names_from_embeddings == False:
-            text = stringify_dict(relevant_fields)
-        else:
-            text = self._extract_values_from_dict(relevant_fields)
+        if self.omit_field_names_from_embeddings:
+            text = self._extract_values_from_dict(relevant_fields)
+        else:
+            text = stringify_dict(relevant_fields)

This suggestion:

  1. Uses a more idiomatic Python boolean check
  2. Puts the new feature case first (positive condition)
  3. Fixes indentation consistency

wdyt? 🙂

🧰 Tools
🪛 Ruff (0.8.0)

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)


173-180: Consider enhancing the helper method with type hints and edge cases.

The recursive implementation looks clean! Would you be interested in adding some enhancements for better maintainability and robustness? 🛠️

-    def _extract_values_from_dict(self, data):
+    def _extract_values_from_dict(self, data: Any) -> str:
+        if data is None:
+            return ""
         if isinstance(data, dict):
-            return "\n".join(self._extract_values_from_dict(value) for value in data.values())
+            return "\n".join(filter(None, (self._extract_values_from_dict(value) for value in data.values())))
         elif isinstance(data, list):
-            return "\n".join(self._extract_values_from_dict(item) for item in data)
+            return "\n".join(filter(None, (self._extract_values_from_dict(item) for item in data)))
         else:
             return str(data)

This suggestion adds:

  1. Type hints for better code maintainability
  2. Explicit handling of None values
  3. Filtering of empty strings to avoid unnecessary newlines

How does this look to you? 🤔

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: The implementation looks great! Would you consider some small enhancements? 🤔

The new field is well-documented and has a sensible default. A couple of suggestions to make it even better:

  1. Maybe we could add an example to the description to make it super clear? Something like:
     description="Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.",
+    examples=[True, False],
  1. Since this field works in conjunction with text_fields, should we add a note about their interaction? For instance, when text_fields is empty (meaning all fields are considered), how does omit_field_names_from_embeddings affect the behavior? wdyt?
unit_tests/destinations/vector_db_based/config_test.py (1)

245-251: LGTM! The new configuration field looks well-structured 🎉

The implementation aligns perfectly with the PR objectives. The field is properly integrated with appropriate metadata and maintains backward compatibility with the default value of False.

Hey @frankiebromage, what do you think about enhancing the description slightly to include an example of the embedding format? Something like:

-        "description": "Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.",
+        "description": "Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe, user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe, [email protected]').",

The suggested description uses more realistic examples that better illustrate the difference. Wdyt? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a8b1b2b and 9313f6d.

📒 Files selected for processing (4)
  • airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (2 hunks)
  • unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
  • unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)
🔥 Files not summarized due to errors (1)
  • unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff (0.8.0)
airbyte_cdk/destinations/vector_db_based/document_processor.py

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)

🔇 Additional comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

128-128: LGTM! Clean integration of the new config option.

The new instance variable follows the established pattern of other configuration options in the class.


166-169: Verify impact on existing vector collections.

Just wanted to flag something important - enabling this feature on existing collections would create embeddings with a different structure (without field names) compared to previous ones. This might affect similarity search results.

Would you like help documenting this potential impact in the README or adding tests to verify the behavior? 🤝

🧰 Tools
🪛 Ruff (0.8.0)

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: Shall we verify the test coverage? 🧪

The change looks great! Since this affects how text is processed for embeddings, it would be good to verify that we have comprehensive test coverage. Could you confirm if:

  1. The DocumentProcessor tests include cases for both True and False values of this field?
  2. The integration tests (if any) verify that the embeddings are correctly generated with and without field names?
✅ Verification successful

Looking good! Test coverage is in place ✅

The code is well-tested with both True and False cases for omit_field_names_from_embeddings in document_processor_test.py. The implementation in document_processor.py correctly handles both cases, and the config schema is properly defined in config_test.py.

Great job on maintaining test coverage while adding this new feature! 🎯

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for test coverage of the new field
rg -l "omit_field_names_from_embeddings.*=.*(?:True|False)" "test_*.py"
rg -l "ProcessingConfigModel.*omit_field_names_from_embeddings" "test_*.py"

Length of output: 251


Script:

#!/bin/bash
# Let's first find the test files in the correct directory
fd "test.*\.py" --type f

# Then search for the DocumentProcessor tests
rg -l "DocumentProcessor" --type py

# And look for any tests related to field names in embeddings
rg -l "field.*names.*embed" --type py

Length of output: 11491


Script:

#!/bin/bash
# Let's check the test coverage in document_processor_test.py
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py -B 2 -A 2

# Also check config_test.py for any related tests
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/config_test.py -B 2 -A 2

# Let's see how this field is used in document_processor.py
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py -B 2 -A 2

Length of output: 1598

unit_tests/destinations/vector_db_based/config_test.py (1)

245-251: Verify test coverage for the new configuration option

While the schema test looks good, we might want to ensure we have proper test coverage for this functionality. Would you like me to help verify if we have test cases covering:

  1. The behavior when omit_field_names_from_embeddings is True
  2. The behavior when omit_field_names_from_embeddings is False (default case)
✅ Verification successful

Test coverage looks good for the new configuration option! 🎉

I found test cases in document_processor_test.py that explicitly verify both scenarios:

  • When omit_field_names_from_embeddings = False (default case), confirming field names are included in the embedded text
  • When omit_field_names_from_embeddings = True, verifying only values are embedded

The assertions check the exact expected output format in both cases, which is quite thorough. Nice work on the test coverage! 👍

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Look for test files that might contain tests for the new configuration option

# Search for test files that might contain relevant test cases
echo "Searching for relevant test files..."
rg -l "omit_field_names_from_embeddings" "unit_tests"

# Search for test cases specifically testing this functionality
echo "Searching for specific test cases..."
rg -A 5 "def test.*embed.*field.*name" "unit_tests"

Length of output: 998


Script:

#!/bin/bash
# Let's check the content of document_processor_test.py for test cases related to this feature
echo "Checking document_processor_test.py for relevant test cases..."
rg -C 10 "omit_field_names_from_embeddings" "unit_tests/destinations/vector_db_based/document_processor_test.py"

# Let's also look at the actual implementation to understand what we should be testing
echo "Checking implementation details..."
rg -C 10 "omit_field_names_from_embeddings" "jitsu/destinations/vector_db_based/"

Length of output: 1501

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

128-128: Consider adding documentation for the new parameter

Would you consider adding a docstring explanation for omit_field_names_from_embeddings to help users understand its purpose and impact? Something like:

# Controls whether field names should be included when generating embeddings.
# When True, only field values are used, potentially improving semantic similarity.

wdyt? 🤔


166-169: Consider extracting text generation logic

The conditional text generation logic is clean, but what do you think about extracting it into a dedicated method for better readability? Something like:

-        if not self.omit_field_names_from_embeddings:
-            text = stringify_dict(relevant_fields)
-        else:
-            text = self._extract_values_from_dict(relevant_fields)
+        text = self._generate_text_from_fields(relevant_fields)

With a new method:

def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
    return stringify_dict(fields) if not self.omit_field_names_from_embeddings else self._extract_values_from_dict(fields)

This would make the main method more focused and easier to read. What are your thoughts? 🤔


173-180: Great implementation! Here are some suggestions to make it even better

The recursive implementation looks solid! Would you consider these enhancements?

  1. Add return type hint:
-    def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str:
+    def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str:
  1. Add docstring explaining the recursion strategy:
"""
Recursively extracts values from nested data structures.
Handles dictionaries, lists, and scalar values, joining them with newlines.
"""
  1. For large nested structures, consider adding a join_char parameter with a default value of '\n':
def _extract_values_from_dict(self, data: Union[dict, list, Any], join_char: str = '\n') -> str:
    if isinstance(data, dict):
        return join_char.join(self._extract_values_from_dict(value, join_char) for value in data.values())
    elif isinstance(data, list):
        return join_char.join(self._extract_values_from_dict(item, join_char) for item in data)
    else:
        return str(data)

This would make the method more flexible for different use cases. What do you think? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9313f6d and 42be95f.

📒 Files selected for processing (2)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
  • unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)
🔥 Files not summarized due to errors (1)
  • unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🔇 Additional comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

8-8: LGTM! Clean import addition.

The addition of Union to the typing imports is appropriate for the new method's type hints.


166-180: Verify test coverage for the new functionality

The changes look good! Would you mind verifying that we have test coverage for:

  1. Different combinations of nested structures with omit_field_names_from_embeddings
  2. Edge cases like empty dicts/lists
  3. Performance with large nested structures

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

128-128: Consider updating the docstring to reflect the new behavior?

The _generate_document method now supports optional field name omission, but this isn't documented. Would you consider adding this to the docstring? Something like:

    def _generate_document(self, record: AirbyteRecordMessage) -> Optional[Document]:
+       """
+       Generate a Document from an AirbyteRecordMessage.
+       
+       The text content can be generated either with or without field names based on
+       the omit_field_names_from_embeddings configuration.
+       """

wdyt? 🤔

Also applies to: 166-168


170-174: The implementation looks clean! Quick suggestion about type hints.

The method implementation is clear and follows the single responsibility principle. Would you consider adding a return type hint to make it even more explicit? Like this:

-    def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
+    def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:

176-184: The implementation looks solid! A few enhancement ideas to consider.

The recursive implementation is clean and handles different data types well. Would you consider these enhancements?

  1. Adding a docstring to explain the recursive nature and the join_char parameter:
def _extract_values_from_dict(self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = '\n') -> str:
    """
    Recursively extracts values from nested data structures, joining them with the specified character.
    
    Args:
        data: The data structure to process (dict, list, or scalar value)
        join_char: The character to use when joining extracted values (default: newline)
    
    Returns:
        A string containing all values joined by join_char
    """
  1. Optimizing empty collection handling:
     elif isinstance(data, dict):
+        if not data:
+            return ""
         return join_char.join(self._extract_values_from_dict(value) for value in data.values())
     elif isinstance(data, list):
+        if not data:
+            return ""
         return join_char.join(self._extract_values_from_dict(item) for item in data)

wdyt about these suggestions? 🤔

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: LGTM! Clean implementation of the field names toggle 🎯

The implementation looks solid with good documentation and sensible defaults. Just wondering if we could make the description even more helpful by adding an example of when users might want to enable this option? For instance, when doing semantic similarity searches where field names could skew the results? wdyt?

         description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]').",

Could be enhanced to:

         description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]'). This is particularly useful when doing semantic similarity searches where field names could skew the vector representation.",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 42be95f and 5f8b8a3.

📒 Files selected for processing (4)
  • airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
  • unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
  • unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)
🔥 Files not summarized due to errors (1)
  • unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/destinations/vector_db_based/config_test.py
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

8-8: LGTM! Clean import addition.

The Union type import is appropriately added to support the new type hints.


Line range hint 128-184: Verify test coverage for the new embedding option

The changes look well-implemented! Let's verify that we have adequate test coverage for the new functionality.

✅ Verification successful

Test coverage looks good for the new embedding option! 🎉

The new omit_field_names_from_embeddings feature is well-tested:

  • It's covered in the document processor tests with both True and False cases
  • The config schema is properly defined with a clear description and default value
  • The implementation is properly validated through test assertions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and test cases related to the new embedding option
echo "Searching for related test files..."
fd -e py test_ -e py _test | grep -i "document\|process"

echo "Searching for test cases covering the new option..."
rg -l "omit_field_names_from_embeddings" -t python

Length of output: 427


Script:

#!/bin/bash
# Search for test files related to document processor
find . -type f -name "*test*.py" | grep -i "document\|process"

# Search for test cases covering the new option with proper file type
rg -l "omit_field_names_from_embeddings" --type py

# Look for actual test cases using ast-grep
ast-grep --pattern 'def test_$_($$$):
  $$$
  omit_field_names_from_embeddings
  $$$'

# Search for test class definitions related to document processor
ast-grep --pattern 'class $_Test($$_):
  $$$
  document
  $$$'

Length of output: 647


Script:

#!/bin/bash
# Let's check the actual test content for the new option
echo "=== Document Processor Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/document_processor_test.py

echo "=== Config Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/config_test.py

Length of output: 1721

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: Verify test coverage for the new configuration option

Let's make sure we have good test coverage for this new field.

✅ Verification successful

Test coverage looks good for the new configuration option

I found that the omit_field_names_from_embeddings field is well tested in document_processor_test.py with both True and False cases, verifying the expected behavior for both including and excluding field names from embeddings. The implementation in DocumentProcessor class correctly uses this configuration to control the text formatting.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for test coverage of the new field
# Look for test files that might need updates
fd "test.*\.py" | while read -r file; do
  echo "=== Checking $file ==="
  rg -A 5 "ProcessingConfigModel|omit_field_names_from_embeddings" "$file"
done

Length of output: 60192


Script:

#!/bin/bash
# Let's check for test coverage of the DocumentProcessor class which uses this field
rg -A 10 "class DocumentProcessor" airbyte_cdk/destinations/vector_db_based/document_processor.py

# Also check for tests of DocumentProcessor
rg -A 10 "def test.*omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py

Length of output: 908


Script:

#!/bin/bash
# Let's check the implementation of the DocumentProcessor class to see how the field is used
rg -A 5 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py

# And check for any test cases that verify this functionality
rg -A 5 "test_.*field_names.*embedding" unit_tests/destinations/vector_db_based/document_processor_test.py

Length of output: 800


Script:

#!/bin/bash
# Let's check for test cases that verify the omit_field_names_from_embeddings functionality
rg -B 5 -A 10 "omit_field_names_from_embeddings = (True|False)" unit_tests/destinations/vector_db_based/document_processor_test.py

# Also check if there are any other test files that might need updates
fd "test_.*\.py" airbyte_cdk/destinations/vector_db_based/ unit_tests/destinations/vector_db_based/

Length of output: 1117

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

170-175: Consider adding a docstring for the new method?

The method looks good, but adding a docstring would help explain its purpose and parameters, wdyt? Something like:

def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
    """
    Generate text from fields with optional field name omission.
    
    Args:
        fields: Dictionary of field names and their values
        
    Returns:
        A string representation of the fields, with or without field names
        based on omit_field_names_from_embeddings configuration
    """

176-187: The implementation looks solid! A few suggestions to consider:

  1. Would you consider adding a docstring to explain the recursive nature and join_char parameter?
  2. For large nested structures, we might want to consider using a generator to optimize memory usage, wdyt?

Here's a potential enhancement:

def _extract_values_from_dict(
    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n"
) -> str:
    """
    Recursively extracts values from nested structures without field names.
    
    Args:
        data: The data structure to extract values from
        join_char: The character used to join multiple values
        
    Returns:
        A string containing all values joined by join_char
    """
    def _generate_values():
        if data is None:
            yield ""
        elif isinstance(data, dict):
            for value in data.values():
                yield from (v for v in _generate_values(value))
        elif isinstance(data, list):
            for item in data:
                yield from (v for v in _generate_values(item))
        else:
            yield str(data)
            
    return join_char.join(_generate_values())
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5f8b8a3 and 6f82849.

📒 Files selected for processing (1)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

8-8: LGTM! Clean import addition.

The Union type import is properly organized with other typing imports and is necessary for the new method parameter typing.


128-128: LGTM! Clean configuration addition.

The new configuration parameter is properly initialized following the existing pattern.


170-187: Verify test coverage for the new functionality

The implementation looks great and meets the PR objectives! Let's verify the test coverage:

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

171-175: Add docstring for clarity?

The method looks clean and follows good practices. Would you like to add a docstring explaining the purpose and behavior of this method? Something like:

 def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
+    """
+    Generates text from fields either with or without field names based on configuration.
+    
+    Args:
+        fields: Dictionary of field names and their values
+    Returns:
+        A string representation of the fields
+    """
     if self.omit_field_names_from_embeddings:

wdyt? 🤔


177-187: Consider adding a depth limit?

The recursive implementation looks solid! However, for safety against potential stack overflow with deeply nested structures, maybe we could add a depth limit? Here's a thought:

 def _extract_values_from_dict(
-    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n"
+    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n", depth: int = 100
 ) -> str:
     if data is None:
         return ""
+    if depth <= 0:
+        return str(data)
     elif isinstance(data, dict):
-        return join_char.join(self._extract_values_from_dict(value) for value in data.values())
+        return join_char.join(self._extract_values_from_dict(value, join_char, depth - 1) for value in data.values())

This would prevent potential issues with circular references or extremely deep structures. wdyt? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6f82849 and 61f68a8.

📒 Files selected for processing (4)
  • airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
  • unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
  • unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)
🔥 Files not summarized due to errors (1)
  • unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/destinations/vector_db_based/config_test.py
  • airbyte_cdk/destinations/vector_db_based/config.py
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

8-8: LGTM! Clean import addition

The Union type import is well-placed with other typing imports and properly supports the new type hints.


171-187: Verify test coverage

The new functionality looks solid! Shall we verify that it's properly covered by tests?


129-129: LGTM! Let's verify the config integration

The new config parameter is properly integrated. Would you like me to verify that the ProcessingConfigModel properly defines this field?

✅ Verification successful

Config integration looks perfect! 🎯

The omit_field_names_from_embeddings field is properly defined in ProcessingConfigModel with:

  • Correct type (bool)
  • Sensible default (False)
  • Clear title and descriptive help text
  • Follows the same pattern as other config parameters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if ProcessingConfigModel properly defines the new field
rg -A 5 "class ProcessingConfigModel" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 241


Script:

#!/bin/bash
# Let's search for the field definition in the entire file
rg -U "omit_field_names_from_embeddings.*Field" airbyte_cdk/destinations/vector_db_based/config.py

# Also check for any other potential locations of this field
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 293


Script:

#!/bin/bash
# Let's get more context around the field definition
rg -A 3 -B 3 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 581

@aaronsteers
Copy link
Contributor

@frankiebromage - Thanks for creating this!

Just a heads up: one challenge with rolling this out will be that most of the vector store destinations are several versions behind in their CDK pin. This may be related to the reason the Pinecone test failed in CI.

@aaronsteers
Copy link
Contributor

aaronsteers commented Dec 4, 2024

I added both of these issues to track discussions around bumping one of either Pinecone or Milvus.

  • airbytehq/airbyte-internal-issues#11130
  • airbytehq/airbyte-internal-issues#11129

Getting one or both of these to the latest CDK version would give us more confidence in changing CDK functionality around vector destinations.

I've also updated (for now at least) the default connector to test here in this PR - setting it to PGVector, since that is the most recent vector-based connector that we have updated.

@natikgadzhi
Copy link
Contributor

Kicking the tires on CI again

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/connector-tests.yml (1)

78-85: Well-documented matrix changes! 🎉

The comments clearly explain why certain connectors are disabled, and the addition of PGVector aligns well with the current needs. Would you consider adding TODO comments with links to the tracking issues for source-s3 and destination-pinecone? This could help track when they should be re-enabled, wdyt?

Example format:

 # This is behind in CDK updates and can't be used in tests until updated:
+# TODO: Re-enable once updated - tracking issue: #<issue_number>
 # - connector: source-s3
 #   cdk_extra: file-based
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f68a8 and e0fcacb.

📒 Files selected for processing (1)
  • .github/workflows/connector-tests.yml (1 hunks)
🔇 Additional comments (1)
.github/workflows/connector-tests.yml (1)

81-82: Verify destination-pgvector configuration

Let's ensure destination-pgvector is properly set up in the monorepo.

@natikgadzhi
Copy link
Contributor

@aaronsteers while it's low(er) priority, I think we can make an epic with all our vector destinations + openai and unstructured and langchain dependency upgrades and run a project with community devs to update them to the latest versions, and add integration tests (use docker to standup DB where needed).

Gauging your appetite for something like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[destination-milvus] Field names are included as text to embed changing semantic meaning of the vector
3 participants