Feat: destination vector db add option for embedding without field names #91

frankiebromage · 2024-11-27T00:01:32Z

Description

Responding to issue #48627

Including field names when embedding text changes the semantic meaning of a vector, so it is useful to have the option of embedding only the text without the field names.

This is my first contribution, so please let me know what else is needed!

Changes

Adds an additional config to allow user to choose whether or not to embed field names with a vector db destination.

Summary by CodeRabbit

New Features
- Introduced a new configuration option to control the inclusion of field names in embeddings, enhancing flexibility in document processing.
Bug Fixes
- Improved the document generation logic to conditionally omit field names based on the new configuration setting.
Tests
- Added tests for the new omit_field_names_from_embeddings field in the configuration model to ensure proper functionality.
Chores
- Updated the CI workflow to include the new connector and enhance job execution conditions and output messages.

Resolves airbytehq/airbyte#48627 (comment)

coderabbitai · 2024-11-27T00:04:05Z

📝 Walkthrough

Walkthrough

The changes introduce a new boolean field, omit_field_names_from_embeddings, to the ProcessingConfigModel class, allowing control over whether field names are included in embeddings. This new field is integrated into the DocumentProcessor class, affecting how text is generated from records. Additionally, the unit tests have been updated to reflect this new configuration option, ensuring the functionality is covered. Overall, the modifications enhance the flexibility of text processing within the Airbyte CDK.

Changes

File Path	Change Summary
airbyte_cdk/destinations/vector_db_based/config.py	Added field `omit_field_names_from_embeddings: bool` to `ProcessingConfigModel`.
airbyte_cdk/destinations/vector_db_based/document_processor.py	Added variable `omit_field_names_from_embeddings` to `DocumentProcessor` and methods `_generate_text_from_fields` and `_extract_values_from_dict`.
unit_tests/destinations/vector_db_based/config_test.py	Added field `omit_field_names_from_embeddings` to `ProcessingConfigModel` in the test file.
.github/workflows/connector-tests.yml	Updated CI workflow to include new connector `destination-pgvector` and enhanced job conditions.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Config
    participant Processor

    User->>Config: Set omit_field_names_from_embeddings to true
    Config-->>Processor: Pass configuration
    Processor->>Processor: Check omit_field_names_from_embeddings
    alt If true
        Processor->>Processor: Call _extract_values_from_dict
    else If false
        Processor->>Processor: Call stringify_dict
    end
    Processor-->>User: Return processed document

Assessment against linked issues

Objective	Addressed	Explanation
Allow embedding without field names to improve semantic meaning (#48627)	✅

Possibly related PRs

CI: update test workflows #26: This PR is related as it involves updates to CI workflows, which may indirectly affect the testing of changes made in the main PR.
Feat: Incorporate SDM in CDK and add publish workflow #58: This PR modifies the connector testing workflow, which is relevant since the main PR introduces a new field that affects how documents are processed, potentially impacting tests.
ci: run airbyte connector tests from master branch #99: This PR focuses on running connector tests from the master branch, which is relevant as it may influence the testing of the changes introduced in the main PR.
ci: skip downstream connector tests if no source code is changed #101: This PR enhances the logic for skipping connector tests based on source code changes, which is relevant to the testing of the changes made in the main PR.
ci: reorder job steps #102: This PR reorders job steps in the connector testing workflow, which could impact how the changes in the main PR are validated during CI.

Suggested labels

enhancement, ci

Suggested reviewers

aaronsteers

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
166-169: Consider simplifying the boolean check and fixing indentation.

The logic is correct, but we could make it more Pythonic. What do you think about these small improvements? 🤔
-        if self.omit_field_names_from_embeddings == False:
-            text = stringify_dict(relevant_fields)
-        else:
-            text = self._extract_values_from_dict(relevant_fields)
+        if self.omit_field_names_from_embeddings:
+            text = self._extract_values_from_dict(relevant_fields)
+        else:
+            text = stringify_dict(relevant_fields)
This suggestion:

Uses a more idiomatic Python boolean check

Puts the new feature case first (positive condition)

Fixes indentation consistency

wdyt? 🙂

🧰 Tools

🪛 Ruff (0.8.0)

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)

173-180: Consider enhancing the helper method with type hints and edge cases.

The recursive implementation looks clean! Would you be interested in adding some enhancements for better maintainability and robustness? 🛠️
-    def _extract_values_from_dict(self, data):
+    def _extract_values_from_dict(self, data: Any) -> str:
+        if data is None:
+            return ""
         if isinstance(data, dict):
-            return "\n".join(self._extract_values_from_dict(value) for value in data.values())
+            return "\n".join(filter(None, (self._extract_values_from_dict(value) for value in data.values())))
         elif isinstance(data, list):
-            return "\n".join(self._extract_values_from_dict(item) for item in data)
+            return "\n".join(filter(None, (self._extract_values_from_dict(item) for item in data)))
         else:
             return str(data)
This suggestion adds:

Type hints for better code maintainability

Explicit handling of None values

Filtering of empty strings to avoid unnecessary newlines

How does this look to you? 🤔
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116: The implementation looks great! Would you consider some small enhancements? 🤔

The new field is well-documented and has a sensible default. A couple of suggestions to make it even better:

Maybe we could add an example to the description to make it super clear? Something like:
     description="Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.",
+    examples=[True, False],
Since this field works in conjunction with text_fields, should we add a note about their interaction? For instance, when text_fields is empty (meaning all fields are considered), how does omit_field_names_from_embeddings affect the behavior? wdyt?
unit_tests/destinations/vector_db_based/config_test.py (1)
245-251: LGTM! The new configuration field looks well-structured 🎉

The implementation aligns perfectly with the PR objectives. The field is properly integrated with appropriate metadata and maintains backward compatibility with the default value of False.

Hey @frankiebromage, what do you think about enhancing the description slightly to include an example of the embedding format? Something like:
-        "description": "Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.",
+        "description": "Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe, user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe, [email protected]').",
The suggested description uses more realistic examples that better illustrate the difference. Wdyt? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a8b1b2b and 9313f6d.

📒 Files selected for processing (4)

airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2 hunks)
unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)

🔥 Files not summarized due to errors (1)

unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message

🧰 Additional context used

🪛 Ruff (0.8.0)

airbyte_cdk/destinations/vector_db_based/document_processor.py

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)

🔇 Additional comments (4)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

128-128: LGTM! Clean integration of the new config option.

The new instance variable follows the established pattern of other configuration options in the class.

166-169: Verify impact on existing vector collections.

Just wanted to flag something important - enabling this feature on existing collections would create embeddings with a different structure (without field names) compared to previous ones. This might affect similarity search results.

Would you like help documenting this potential impact in the README or adding tests to verify the behavior? 🤝

🧰 Tools

🪛 Ruff (0.8.0)

166-166: Avoid equality comparisons to False; use if not self.omit_field_names_from_embeddings: for false checks

Replace with not self.omit_field_names_from_embeddings

(E712)

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: Shall we verify the test coverage? 🧪

The change looks great! Since this affects how text is processed for embeddings, it would be good to verify that we have comprehensive test coverage. Could you confirm if:

The DocumentProcessor tests include cases for both True and False values of this field?
The integration tests (if any) verify that the embeddings are correctly generated with and without field names?

✅ Verification successful

Looking good! Test coverage is in place ✅

The code is well-tested with both True and False cases for omit_field_names_from_embeddings in document_processor_test.py. The implementation in document_processor.py correctly handles both cases, and the config schema is properly defined in config_test.py.

Great job on maintaining test coverage while adding this new feature! 🎯

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for test coverage of the new field
rg -l "omit_field_names_from_embeddings.*=.*(?:True|False)" "test_*.py"
rg -l "ProcessingConfigModel.*omit_field_names_from_embeddings" "test_*.py"

Length of output: 251

Script:

#!/bin/bash
# Let's first find the test files in the correct directory
fd "test.*\.py" --type f

# Then search for the DocumentProcessor tests
rg -l "DocumentProcessor" --type py

# And look for any tests related to field names in embeddings
rg -l "field.*names.*embed" --type py

Length of output: 11491

Script:

#!/bin/bash
# Let's check the test coverage in document_processor_test.py
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py -B 2 -A 2

# Also check config_test.py for any related tests
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/config_test.py -B 2 -A 2

# Let's see how this field is used in document_processor.py
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py -B 2 -A 2

Length of output: 1598

unit_tests/destinations/vector_db_based/config_test.py (1)

245-251: Verify test coverage for the new configuration option

While the schema test looks good, we might want to ensure we have proper test coverage for this functionality. Would you like me to help verify if we have test cases covering:

The behavior when omit_field_names_from_embeddings is True
The behavior when omit_field_names_from_embeddings is False (default case)

✅ Verification successful

Test coverage looks good for the new configuration option! 🎉

I found test cases in document_processor_test.py that explicitly verify both scenarios:

When omit_field_names_from_embeddings = False (default case), confirming field names are included in the embedded text
When omit_field_names_from_embeddings = True, verifying only values are embedded

The assertions check the exact expected output format in both cases, which is quite thorough. Nice work on the test coverage! 👍

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Look for test files that might contain tests for the new configuration option

# Search for test files that might contain relevant test cases
echo "Searching for relevant test files..."
rg -l "omit_field_names_from_embeddings" "unit_tests"

# Search for test cases specifically testing this functionality
echo "Searching for specific test cases..."
rg -A 5 "def test.*embed.*field.*name" "unit_tests"

Length of output: 998

Script:

#!/bin/bash
# Let's check the content of document_processor_test.py for test cases related to this feature
echo "Checking document_processor_test.py for relevant test cases..."
rg -C 10 "omit_field_names_from_embeddings" "unit_tests/destinations/vector_db_based/document_processor_test.py"

# Let's also look at the actual implementation to understand what we should be testing
echo "Checking implementation details..."
rg -C 10 "omit_field_names_from_embeddings" "jitsu/destinations/vector_db_based/"

Length of output: 1501

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
128-128: Consider adding documentation for the new parameter

Would you consider adding a docstring explanation for omit_field_names_from_embeddings to help users understand its purpose and impact? Something like:
# Controls whether field names should be included when generating embeddings.
# When True, only field values are used, potentially improving semantic similarity.
wdyt? 🤔

166-169: Consider extracting text generation logic

The conditional text generation logic is clean, but what do you think about extracting it into a dedicated method for better readability? Something like:
-        if not self.omit_field_names_from_embeddings:
-            text = stringify_dict(relevant_fields)
-        else:
-            text = self._extract_values_from_dict(relevant_fields)
+        text = self._generate_text_from_fields(relevant_fields)
With a new method:
def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
    return stringify_dict(fields) if not self.omit_field_names_from_embeddings else self._extract_values_from_dict(fields)
This would make the main method more focused and easier to read. What are your thoughts? 🤔

173-180: Great implementation! Here are some suggestions to make it even better

The recursive implementation looks solid! Would you consider these enhancements?

Add return type hint:
-    def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str:
+    def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str:
Add docstring explaining the recursion strategy:
"""
Recursively extracts values from nested data structures.
Handles dictionaries, lists, and scalar values, joining them with newlines.
"""
For large nested structures, consider adding a join_char parameter with a default value of '\n':
def _extract_values_from_dict(self, data: Union[dict, list, Any], join_char: str = '\n') -> str:
    if isinstance(data, dict):
        return join_char.join(self._extract_values_from_dict(value, join_char) for value in data.values())
    elif isinstance(data, list):
        return join_char.join(self._extract_values_from_dict(item, join_char) for item in data)
    else:
        return str(data)
This would make the method more flexible for different use cases. What do you think? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9313f6d and 42be95f.

📒 Files selected for processing (2)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)

🔥 Files not summarized due to errors (1)

unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message

🔇 Additional comments (2)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

8-8: LGTM! Clean import addition.

The addition of Union to the typing imports is appropriate for the new method's type hints.

166-180: Verify test coverage for the new functionality

The changes look good! Would you mind verifying that we have test coverage for:

Different combinations of nested structures with omit_field_names_from_embeddings
Edge cases like empty dicts/lists
Performance with large nested structures

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
128-128: Consider updating the docstring to reflect the new behavior?

The _generate_document method now supports optional field name omission, but this isn't documented. Would you consider adding this to the docstring? Something like:
    def _generate_document(self, record: AirbyteRecordMessage) -> Optional[Document]:
+       """
+       Generate a Document from an AirbyteRecordMessage.
+       
+       The text content can be generated either with or without field names based on
+       the omit_field_names_from_embeddings configuration.
+       """
wdyt? 🤔

Also applies to: 166-168

170-174: The implementation looks clean! Quick suggestion about type hints.

The method implementation is clear and follows the single responsibility principle. Would you consider adding a return type hint to make it even more explicit? Like this:
-    def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
+    def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
176-184: The implementation looks solid! A few enhancement ideas to consider.

The recursive implementation is clean and handles different data types well. Would you consider these enhancements?

Adding a docstring to explain the recursive nature and the join_char parameter:
def _extract_values_from_dict(self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = '\n') -> str:
    """
    Recursively extracts values from nested data structures, joining them with the specified character.
    
    Args:
        data: The data structure to process (dict, list, or scalar value)
        join_char: The character to use when joining extracted values (default: newline)
    
    Returns:
        A string containing all values joined by join_char
    """
Optimizing empty collection handling:
     elif isinstance(data, dict):
+        if not data:
+            return ""
         return join_char.join(self._extract_values_from_dict(value) for value in data.values())
     elif isinstance(data, list):
+        if not data:
+            return ""
         return join_char.join(self._extract_values_from_dict(item) for item in data)
wdyt about these suggestions? 🤔
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116: LGTM! Clean implementation of the field names toggle 🎯

The implementation looks solid with good documentation and sensible defaults. Just wondering if we could make the description even more helpful by adding an example of when users might want to enable this option? For instance, when doing semantic similarity searches where field names could skew the results? wdyt?
         description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]').",
Could be enhanced to:
         description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]'). This is particularly useful when doing semantic similarity searches where field names could skew the vector representation.",

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 42be95f and 5f8b8a3.

📒 Files selected for processing (4)

airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)

🔥 Files not summarized due to errors (1)

unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message

🚧 Files skipped from review as they are similar to previous changes (1)

unit_tests/destinations/vector_db_based/config_test.py

🔇 Additional comments (3)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

8-8: LGTM! Clean import addition.

The Union type import is appropriately added to support the new type hints.

Line range hint 128-184: Verify test coverage for the new embedding option

The changes look well-implemented! Let's verify that we have adequate test coverage for the new functionality.

✅ Verification successful

Test coverage looks good for the new embedding option! 🎉

The new omit_field_names_from_embeddings feature is well-tested:

It's covered in the document processor tests with both True and False cases
The config schema is properly defined with a clear description and default value
The implementation is properly validated through test assertions

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and test cases related to the new embedding option
echo "Searching for related test files..."
fd -e py test_ -e py _test | grep -i "document\|process"

echo "Searching for test cases covering the new option..."
rg -l "omit_field_names_from_embeddings" -t python

Length of output: 427

Script:

#!/bin/bash
# Search for test files related to document processor
find . -type f -name "*test*.py" | grep -i "document\|process"

# Search for test cases covering the new option with proper file type
rg -l "omit_field_names_from_embeddings" --type py

# Look for actual test cases using ast-grep
ast-grep --pattern 'def test_$_($$$):
  $$$
  omit_field_names_from_embeddings
  $$$'

# Search for test class definitions related to document processor
ast-grep --pattern 'class $_Test($$_):
  $$$
  document
  $$$'

Length of output: 647

Script:

#!/bin/bash
# Let's check the actual test content for the new option
echo "=== Document Processor Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/document_processor_test.py

echo "=== Config Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/config_test.py

Length of output: 1721

airbyte_cdk/destinations/vector_db_based/config.py (1)

111-116: Verify test coverage for the new configuration option

Let's make sure we have good test coverage for this new field.

✅ Verification successful

Test coverage looks good for the new configuration option

I found that the omit_field_names_from_embeddings field is well tested in document_processor_test.py with both True and False cases, verifying the expected behavior for both including and excluding field names from embeddings. The implementation in DocumentProcessor class correctly uses this configuration to control the text formatting.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for test coverage of the new field
# Look for test files that might need updates
fd "test.*\.py" | while read -r file; do
  echo "=== Checking $file ==="
  rg -A 5 "ProcessingConfigModel|omit_field_names_from_embeddings" "$file"
done

Length of output: 60192

Script:

#!/bin/bash
# Let's check for test coverage of the DocumentProcessor class which uses this field
rg -A 10 "class DocumentProcessor" airbyte_cdk/destinations/vector_db_based/document_processor.py

# Also check for tests of DocumentProcessor
rg -A 10 "def test.*omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py

Length of output: 908

Script:

#!/bin/bash
# Let's check the implementation of the DocumentProcessor class to see how the field is used
rg -A 5 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py

# And check for any test cases that verify this functionality
rg -A 5 "test_.*field_names.*embedding" unit_tests/destinations/vector_db_based/document_processor_test.py

Length of output: 800

Script:

#!/bin/bash
# Let's check for test cases that verify the omit_field_names_from_embeddings functionality
rg -B 5 -A 10 "omit_field_names_from_embeddings = (True|False)" unit_tests/destinations/vector_db_based/document_processor_test.py

# Also check if there are any other test files that might need updates
fd "test_.*\.py" airbyte_cdk/destinations/vector_db_based/ unit_tests/destinations/vector_db_based/

Length of output: 1117

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)

170-175: Consider adding a docstring for the new method?

The method looks good, but adding a docstring would help explain its purpose and parameters, wdyt? Something like:

def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
    """
    Generate text from fields with optional field name omission.
    
    Args:
        fields: Dictionary of field names and their values
        
    Returns:
        A string representation of the fields, with or without field names
        based on omit_field_names_from_embeddings configuration
    """

176-187: The implementation looks solid! A few suggestions to consider:

Would you consider adding a docstring to explain the recursive nature and join_char parameter?
For large nested structures, we might want to consider using a generator to optimize memory usage, wdyt?

Here's a potential enhancement:

def _extract_values_from_dict(
    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n"
) -> str:
    """
    Recursively extracts values from nested structures without field names.
    
    Args:
        data: The data structure to extract values from
        join_char: The character used to join multiple values
        
    Returns:
        A string containing all values joined by join_char
    """
    def _generate_values():
        if data is None:
            yield ""
        elif isinstance(data, dict):
            for value in data.values():
                yield from (v for v in _generate_values(value))
        elif isinstance(data, list):
            for item in data:
                yield from (v for v in _generate_values(item))
        else:
            yield str(data)
            
    return join_char.join(_generate_values())

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5f8b8a3 and 6f82849.

📒 Files selected for processing (1)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)

🔇 Additional comments (3)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

8-8: LGTM! Clean import addition.

The Union type import is properly organized with other typing imports and is necessary for the new method parameter typing.

128-128: LGTM! Clean configuration addition.

The new configuration parameter is properly initialized following the existing pattern.

170-187: Verify test coverage for the new functionality

The implementation looks great and meets the PR objectives! Let's verify the test coverage:

…n_to_vector_db

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
171-175: Add docstring for clarity?

The method looks clean and follows good practices. Would you like to add a docstring explaining the purpose and behavior of this method? Something like:
 def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
+    """
+    Generates text from fields either with or without field names based on configuration.
+    
+    Args:
+        fields: Dictionary of field names and their values
+    Returns:
+        A string representation of the fields
+    """
     if self.omit_field_names_from_embeddings:
wdyt? 🤔

177-187: Consider adding a depth limit?

The recursive implementation looks solid! However, for safety against potential stack overflow with deeply nested structures, maybe we could add a depth limit? Here's a thought:
 def _extract_values_from_dict(
-    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n"
+    self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n", depth: int = 100
 ) -> str:
     if data is None:
         return ""
+    if depth <= 0:
+        return str(data)
     elif isinstance(data, dict):
-        return join_char.join(self._extract_values_from_dict(value) for value in data.values())
+        return join_char.join(self._extract_values_from_dict(value, join_char, depth - 1) for value in data.values())
This would prevent potential issues with circular references or extremely deep structures. wdyt? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6f82849 and 61f68a8.

📒 Files selected for processing (4)

airbyte_cdk/destinations/vector_db_based/config.py (1 hunks)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3 hunks)
unit_tests/destinations/vector_db_based/config_test.py (1 hunks)
unit_tests/destinations/vector_db_based/document_processor_test.py (2 hunks)

🔥 Files not summarized due to errors (1)

unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message

🚧 Files skipped from review as they are similar to previous changes (2)

unit_tests/destinations/vector_db_based/config_test.py
airbyte_cdk/destinations/vector_db_based/config.py

🔇 Additional comments (3)

airbyte_cdk/destinations/vector_db_based/document_processor.py (3)

8-8: LGTM! Clean import addition

The Union type import is well-placed with other typing imports and properly supports the new type hints.

171-187: Verify test coverage

The new functionality looks solid! Shall we verify that it's properly covered by tests?

129-129: LGTM! Let's verify the config integration

The new config parameter is properly integrated. Would you like me to verify that the ProcessingConfigModel properly defines this field?

✅ Verification successful

Config integration looks perfect! 🎯

The omit_field_names_from_embeddings field is properly defined in ProcessingConfigModel with:

Correct type (bool)
Sensible default (False)
Clear title and descriptive help text
Follows the same pattern as other config parameters

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if ProcessingConfigModel properly defines the new field
rg -A 5 "class ProcessingConfigModel" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 241

Script:

#!/bin/bash
# Let's search for the field definition in the entire file
rg -U "omit_field_names_from_embeddings.*Field" airbyte_cdk/destinations/vector_db_based/config.py

# Also check for any other potential locations of this field
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 293

Script:

#!/bin/bash
# Let's get more context around the field definition
rg -A 3 -B 3 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py

Length of output: 581

aaronsteers · 2024-12-02T22:07:01Z

@frankiebromage - Thanks for creating this!

Just a heads up: one challenge with rolling this out will be that most of the vector store destinations are several versions behind in their CDK pin. This may be related to the reason the Pinecone test failed in CI.

aaronsteers · 2024-12-04T18:48:25Z

I added both of these issues to track discussions around bumping one of either Pinecone or Milvus.

airbytehq/airbyte-internal-issues#11130
airbytehq/airbyte-internal-issues#11129

Getting one or both of these to the latest CDK version would give us more confidence in changing CDK functionality around vector destinations.

I've also updated (for now at least) the default connector to test here in this PR - setting it to PGVector, since that is the most recent vector-based connector that we have updated.

…n_to_vector_db

natikgadzhi · 2024-12-20T04:32:56Z

Kicking the tires on CI again

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

.github/workflows/connector-tests.yml (1)
78-85: Well-documented matrix changes! 🎉

The comments clearly explain why certain connectors are disabled, and the addition of PGVector aligns well with the current needs. Would you consider adding TODO comments with links to the tracking issues for source-s3 and destination-pinecone? This could help track when they should be re-enabled, wdyt?

Example format:
 # This is behind in CDK updates and can't be used in tests until updated:
+# TODO: Re-enable once updated - tracking issue: #<issue_number>
 # - connector: source-s3
 #   cdk_extra: file-based

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f68a8 and e0fcacb.

📒 Files selected for processing (1)

.github/workflows/connector-tests.yml (1 hunks)

🔇 Additional comments (1)

.github/workflows/connector-tests.yml (1)

81-82: Verify destination-pgvector configuration

Let's ensure destination-pgvector is properly set up in the monorepo.

natikgadzhi · 2025-01-17T23:55:36Z

@aaronsteers while it's low(er) priority, I think we can make an epic with all our vector destinations + openai and unstructured and langchain dependency upgrades and run a project with community devs to update them to the latest versions, and add integration tests (use docker to standup DB where needed).

Gauging your appetite for something like this.

Frankie Bromage added 2 commits November 26, 2024 10:55

add option to not embed field names for vector db destination

0f33b05

add unit test

9313f6d

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

coderabbitai bot approved these changes Nov 27, 2024

View reviewed changes

reformat according to standards

42be95f

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

increase test coverage and add typing

5f8b8a3

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

reformat according to standards

6f82849

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

Merge branch 'main' into frankiebromage1/feature_add_embed_text_optio…

61f68a8

…n_to_vector_db

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

replace pinecone connector test with pgvector test

d55f599

Merge branch 'main' into frankiebromage1/feature_add_embed_text_optio…

3e82367

…n_to_vector_db

frankiebromage mentioned this pull request Dec 5, 2024

Chore: update the airbyte cdk version for milvus airbytehq/airbyte#48823

Draft

2 tasks

Merge branch 'main' into frankiebromage1/feature_add_embed_text_optio…

e0fcacb

…n_to_vector_db

coderabbitai bot reviewed Dec 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: destination vector db add option for embedding without field names #91

Feat: destination vector db add option for embedding without field names #91

frankiebromage commented Nov 27, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 27, 2024 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Assessment against linked issues

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

aaronsteers commented Dec 2, 2024

aaronsteers commented Dec 4, 2024 •

edited

Loading

natikgadzhi commented Dec 20, 2024

coderabbitai bot left a comment

natikgadzhi commented Jan 17, 2025

Feat: destination vector db add option for embedding without field names #91

Are you sure you want to change the base?

Feat: destination vector db add option for embedding without field names #91

Conversation

frankiebromage commented Nov 27, 2024 • edited by coderabbitai bot Loading

Description

Changes

Summary by CodeRabbit

Summary by CodeRabbit

coderabbitai bot commented Nov 27, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Assessment against linked issues

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

aaronsteers commented Dec 2, 2024

aaronsteers commented Dec 4, 2024 • edited Loading

natikgadzhi commented Dec 20, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

natikgadzhi commented Jan 17, 2025

frankiebromage commented Nov 27, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 27, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

aaronsteers commented Dec 4, 2024 •

edited

Loading