-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: destination vector db add option for embedding without field names #91
base: main
Are you sure you want to change the base?
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe changes introduce a new boolean field, Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Config
participant Processor
User->>Config: Set omit_field_names_from_embeddings to true
Config-->>Processor: Pass configuration
Processor->>Processor: Check omit_field_names_from_embeddings
alt If true
Processor->>Processor: Call _extract_values_from_dict
else If false
Processor->>Processor: Call stringify_dict
end
Processor-->>User: Return processed document
Assessment against linked issues
Possibly related PRs
Suggested labels
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
166-169
: Consider simplifying the boolean check and fixing indentation.The logic is correct, but we could make it more Pythonic. What do you think about these small improvements? 🤔
- if self.omit_field_names_from_embeddings == False: - text = stringify_dict(relevant_fields) - else: - text = self._extract_values_from_dict(relevant_fields) + if self.omit_field_names_from_embeddings: + text = self._extract_values_from_dict(relevant_fields) + else: + text = stringify_dict(relevant_fields)This suggestion:
- Uses a more idiomatic Python boolean check
- Puts the new feature case first (positive condition)
- Fixes indentation consistency
wdyt? 🙂
🧰 Tools
🪛 Ruff (0.8.0)
166-166: Avoid equality comparisons to
False
; useif not self.omit_field_names_from_embeddings:
for false checksReplace with
not self.omit_field_names_from_embeddings
(E712)
173-180
: Consider enhancing the helper method with type hints and edge cases.The recursive implementation looks clean! Would you be interested in adding some enhancements for better maintainability and robustness? 🛠️
- def _extract_values_from_dict(self, data): + def _extract_values_from_dict(self, data: Any) -> str: + if data is None: + return "" if isinstance(data, dict): - return "\n".join(self._extract_values_from_dict(value) for value in data.values()) + return "\n".join(filter(None, (self._extract_values_from_dict(value) for value in data.values()))) elif isinstance(data, list): - return "\n".join(self._extract_values_from_dict(item) for item in data) + return "\n".join(filter(None, (self._extract_values_from_dict(item) for item in data))) else: return str(data)This suggestion adds:
- Type hints for better code maintainability
- Explicit handling of None values
- Filtering of empty strings to avoid unnecessary newlines
How does this look to you? 🤔
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116
: The implementation looks great! Would you consider some small enhancements? 🤔The new field is well-documented and has a sensible default. A couple of suggestions to make it even better:
- Maybe we could add an example to the description to make it super clear? Something like:
description="Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.", + examples=[True, False],
- Since this field works in conjunction with
text_fields
, should we add a note about their interaction? For instance, whentext_fields
is empty (meaning all fields are considered), how doesomit_field_names_from_embeddings
affect the behavior? wdyt?unit_tests/destinations/vector_db_based/config_test.py (1)
245-251
: LGTM! The new configuration field looks well-structured 🎉The implementation aligns perfectly with the PR objectives. The field is properly integrated with appropriate metadata and maintains backward compatibility with the default value of
False
.Hey @frankiebromage, what do you think about enhancing the description slightly to include an example of the embedding format? Something like:
- "description": "Do not include the field names in the text that gets embedded. By default field names are embedded e.g 'user: name, user.email: [email protected]'. If set to true, only the values are embedded e.g. 'name, [email protected]'.", + "description": "Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe, user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe, [email protected]').",The suggested description uses more realistic examples that better illustrate the difference. Wdyt? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
airbyte_cdk/destinations/vector_db_based/config.py
(1 hunks)airbyte_cdk/destinations/vector_db_based/document_processor.py
(2 hunks)unit_tests/destinations/vector_db_based/config_test.py
(1 hunks)unit_tests/destinations/vector_db_based/document_processor_test.py
(2 hunks)
🔥 Files not summarized due to errors (1)
- unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff (0.8.0)
airbyte_cdk/destinations/vector_db_based/document_processor.py
166-166: Avoid equality comparisons to False
; use if not self.omit_field_names_from_embeddings:
for false checks
Replace with not self.omit_field_names_from_embeddings
(E712)
🔇 Additional comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
128-128
: LGTM! Clean integration of the new config option.
The new instance variable follows the established pattern of other configuration options in the class.
166-169
: Verify impact on existing vector collections.
Just wanted to flag something important - enabling this feature on existing collections would create embeddings with a different structure (without field names) compared to previous ones. This might affect similarity search results.
Would you like help documenting this potential impact in the README or adding tests to verify the behavior? 🤝
🧰 Tools
🪛 Ruff (0.8.0)
166-166: Avoid equality comparisons to False
; use if not self.omit_field_names_from_embeddings:
for false checks
Replace with not self.omit_field_names_from_embeddings
(E712)
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116
: Shall we verify the test coverage? 🧪
The change looks great! Since this affects how text is processed for embeddings, it would be good to verify that we have comprehensive test coverage. Could you confirm if:
- The DocumentProcessor tests include cases for both
True
andFalse
values of this field? - The integration tests (if any) verify that the embeddings are correctly generated with and without field names?
✅ Verification successful
Looking good! Test coverage is in place ✅
The code is well-tested with both True
and False
cases for omit_field_names_from_embeddings
in document_processor_test.py
. The implementation in document_processor.py
correctly handles both cases, and the config schema is properly defined in config_test.py
.
Great job on maintaining test coverage while adding this new feature! 🎯
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for test coverage of the new field
rg -l "omit_field_names_from_embeddings.*=.*(?:True|False)" "test_*.py"
rg -l "ProcessingConfigModel.*omit_field_names_from_embeddings" "test_*.py"
Length of output: 251
Script:
#!/bin/bash
# Let's first find the test files in the correct directory
fd "test.*\.py" --type f
# Then search for the DocumentProcessor tests
rg -l "DocumentProcessor" --type py
# And look for any tests related to field names in embeddings
rg -l "field.*names.*embed" --type py
Length of output: 11491
Script:
#!/bin/bash
# Let's check the test coverage in document_processor_test.py
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py -B 2 -A 2
# Also check config_test.py for any related tests
rg "omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/config_test.py -B 2 -A 2
# Let's see how this field is used in document_processor.py
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py -B 2 -A 2
Length of output: 1598
unit_tests/destinations/vector_db_based/config_test.py (1)
245-251
: Verify test coverage for the new configuration option
While the schema test looks good, we might want to ensure we have proper test coverage for this functionality. Would you like me to help verify if we have test cases covering:
- The behavior when
omit_field_names_from_embeddings
is True - The behavior when
omit_field_names_from_embeddings
is False (default case)
✅ Verification successful
Test coverage looks good for the new configuration option! 🎉
I found test cases in document_processor_test.py
that explicitly verify both scenarios:
- When
omit_field_names_from_embeddings = False
(default case), confirming field names are included in the embedded text - When
omit_field_names_from_embeddings = True
, verifying only values are embedded
The assertions check the exact expected output format in both cases, which is quite thorough. Nice work on the test coverage! 👍
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Look for test files that might contain tests for the new configuration option
# Search for test files that might contain relevant test cases
echo "Searching for relevant test files..."
rg -l "omit_field_names_from_embeddings" "unit_tests"
# Search for test cases specifically testing this functionality
echo "Searching for specific test cases..."
rg -A 5 "def test.*embed.*field.*name" "unit_tests"
Length of output: 998
Script:
#!/bin/bash
# Let's check the content of document_processor_test.py for test cases related to this feature
echo "Checking document_processor_test.py for relevant test cases..."
rg -C 10 "omit_field_names_from_embeddings" "unit_tests/destinations/vector_db_based/document_processor_test.py"
# Let's also look at the actual implementation to understand what we should be testing
echo "Checking implementation details..."
rg -C 10 "omit_field_names_from_embeddings" "jitsu/destinations/vector_db_based/"
Length of output: 1501
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
128-128
: Consider adding documentation for the new parameterWould you consider adding a docstring explanation for
omit_field_names_from_embeddings
to help users understand its purpose and impact? Something like:# Controls whether field names should be included when generating embeddings. # When True, only field values are used, potentially improving semantic similarity.wdyt? 🤔
166-169
: Consider extracting text generation logicThe conditional text generation logic is clean, but what do you think about extracting it into a dedicated method for better readability? Something like:
- if not self.omit_field_names_from_embeddings: - text = stringify_dict(relevant_fields) - else: - text = self._extract_values_from_dict(relevant_fields) + text = self._generate_text_from_fields(relevant_fields)With a new method:
def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str: return stringify_dict(fields) if not self.omit_field_names_from_embeddings else self._extract_values_from_dict(fields)This would make the main method more focused and easier to read. What are your thoughts? 🤔
173-180
: Great implementation! Here are some suggestions to make it even betterThe recursive implementation looks solid! Would you consider these enhancements?
- Add return type hint:
- def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str: + def _extract_values_from_dict(self, data: Union[dict, list, Any]) -> str:
- Add docstring explaining the recursion strategy:
""" Recursively extracts values from nested data structures. Handles dictionaries, lists, and scalar values, joining them with newlines. """
- For large nested structures, consider adding a join_char parameter with a default value of '\n':
def _extract_values_from_dict(self, data: Union[dict, list, Any], join_char: str = '\n') -> str: if isinstance(data, dict): return join_char.join(self._extract_values_from_dict(value, join_char) for value in data.values()) elif isinstance(data, list): return join_char.join(self._extract_values_from_dict(item, join_char) for item in data) else: return str(data)This would make the method more flexible for different use cases. What do you think? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py
(3 hunks)unit_tests/destinations/vector_db_based/document_processor_test.py
(2 hunks)
🔥 Files not summarized due to errors (1)
- unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🔇 Additional comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
8-8
: LGTM! Clean import addition.
The addition of Union
to the typing imports is appropriate for the new method's type hints.
166-180
: Verify test coverage for the new functionality
The changes look good! Would you mind verifying that we have test coverage for:
- Different combinations of nested structures with
omit_field_names_from_embeddings
- Edge cases like empty dicts/lists
- Performance with large nested structures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
128-128
: Consider updating the docstring to reflect the new behavior?The _generate_document method now supports optional field name omission, but this isn't documented. Would you consider adding this to the docstring? Something like:
def _generate_document(self, record: AirbyteRecordMessage) -> Optional[Document]: + """ + Generate a Document from an AirbyteRecordMessage. + + The text content can be generated either with or without field names based on + the omit_field_names_from_embeddings configuration. + """wdyt? 🤔
Also applies to: 166-168
170-174
: The implementation looks clean! Quick suggestion about type hints.The method implementation is clear and follows the single responsibility principle. Would you consider adding a return type hint to make it even more explicit? Like this:
- def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str: + def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str:
176-184
: The implementation looks solid! A few enhancement ideas to consider.The recursive implementation is clean and handles different data types well. Would you consider these enhancements?
- Adding a docstring to explain the recursive nature and the join_char parameter:
def _extract_values_from_dict(self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = '\n') -> str: """ Recursively extracts values from nested data structures, joining them with the specified character. Args: data: The data structure to process (dict, list, or scalar value) join_char: The character to use when joining extracted values (default: newline) Returns: A string containing all values joined by join_char """
- Optimizing empty collection handling:
elif isinstance(data, dict): + if not data: + return "" return join_char.join(self._extract_values_from_dict(value) for value in data.values()) elif isinstance(data, list): + if not data: + return "" return join_char.join(self._extract_values_from_dict(item) for item in data)wdyt about these suggestions? 🤔
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116
: LGTM! Clean implementation of the field names toggle 🎯The implementation looks solid with good documentation and sensible defaults. Just wondering if we could make the description even more helpful by adding an example of when users might want to enable this option? For instance, when doing semantic similarity searches where field names could skew the results? wdyt?
description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]').",Could be enhanced to:
description="Do not include the field names in the text that gets embedded. By default field names are embedded (e.g., 'user.name: John Doe \n user.email: [email protected]'). If set to true, only the values are embedded (e.g., 'John Doe \n [email protected]'). This is particularly useful when doing semantic similarity searches where field names could skew the vector representation.",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
airbyte_cdk/destinations/vector_db_based/config.py
(1 hunks)airbyte_cdk/destinations/vector_db_based/document_processor.py
(3 hunks)unit_tests/destinations/vector_db_based/config_test.py
(1 hunks)unit_tests/destinations/vector_db_based/document_processor_test.py
(2 hunks)
🔥 Files not summarized due to errors (1)
- unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (1)
- unit_tests/destinations/vector_db_based/config_test.py
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
8-8
: LGTM! Clean import addition.
The Union type import is appropriately added to support the new type hints.
Line range hint 128-184
: Verify test coverage for the new embedding option
The changes look well-implemented! Let's verify that we have adequate test coverage for the new functionality.
✅ Verification successful
Test coverage looks good for the new embedding option! 🎉
The new omit_field_names_from_embeddings
feature is well-tested:
- It's covered in the document processor tests with both
True
andFalse
cases - The config schema is properly defined with a clear description and default value
- The implementation is properly validated through test assertions
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files and test cases related to the new embedding option
echo "Searching for related test files..."
fd -e py test_ -e py _test | grep -i "document\|process"
echo "Searching for test cases covering the new option..."
rg -l "omit_field_names_from_embeddings" -t python
Length of output: 427
Script:
#!/bin/bash
# Search for test files related to document processor
find . -type f -name "*test*.py" | grep -i "document\|process"
# Search for test cases covering the new option with proper file type
rg -l "omit_field_names_from_embeddings" --type py
# Look for actual test cases using ast-grep
ast-grep --pattern 'def test_$_($$$):
$$$
omit_field_names_from_embeddings
$$$'
# Search for test class definitions related to document processor
ast-grep --pattern 'class $_Test($$_):
$$$
document
$$$'
Length of output: 647
Script:
#!/bin/bash
# Let's check the actual test content for the new option
echo "=== Document Processor Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/document_processor_test.py
echo "=== Config Test Content ==="
rg "omit_field_names_from_embeddings" -A 5 -B 5 unit_tests/destinations/vector_db_based/config_test.py
Length of output: 1721
airbyte_cdk/destinations/vector_db_based/config.py (1)
111-116
: Verify test coverage for the new configuration option
Let's make sure we have good test coverage for this new field.
✅ Verification successful
Test coverage looks good for the new configuration option
I found that the omit_field_names_from_embeddings
field is well tested in document_processor_test.py
with both True and False cases, verifying the expected behavior for both including and excluding field names from embeddings. The implementation in DocumentProcessor
class correctly uses this configuration to control the text formatting.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test coverage of the new field
# Look for test files that might need updates
fd "test.*\.py" | while read -r file; do
echo "=== Checking $file ==="
rg -A 5 "ProcessingConfigModel|omit_field_names_from_embeddings" "$file"
done
Length of output: 60192
Script:
#!/bin/bash
# Let's check for test coverage of the DocumentProcessor class which uses this field
rg -A 10 "class DocumentProcessor" airbyte_cdk/destinations/vector_db_based/document_processor.py
# Also check for tests of DocumentProcessor
rg -A 10 "def test.*omit_field_names_from_embeddings" unit_tests/destinations/vector_db_based/document_processor_test.py
Length of output: 908
Script:
#!/bin/bash
# Let's check the implementation of the DocumentProcessor class to see how the field is used
rg -A 5 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/document_processor.py
# And check for any test cases that verify this functionality
rg -A 5 "test_.*field_names.*embedding" unit_tests/destinations/vector_db_based/document_processor_test.py
Length of output: 800
Script:
#!/bin/bash
# Let's check for test cases that verify the omit_field_names_from_embeddings functionality
rg -B 5 -A 10 "omit_field_names_from_embeddings = (True|False)" unit_tests/destinations/vector_db_based/document_processor_test.py
# Also check if there are any other test files that might need updates
fd "test_.*\.py" airbyte_cdk/destinations/vector_db_based/ unit_tests/destinations/vector_db_based/
Length of output: 1117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
170-175
: Consider adding a docstring for the new method?The method looks good, but adding a docstring would help explain its purpose and parameters, wdyt? Something like:
def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str: """ Generate text from fields with optional field name omission. Args: fields: Dictionary of field names and their values Returns: A string representation of the fields, with or without field names based on omit_field_names_from_embeddings configuration """
176-187
: The implementation looks solid! A few suggestions to consider:
- Would you consider adding a docstring to explain the recursive nature and join_char parameter?
- For large nested structures, we might want to consider using a generator to optimize memory usage, wdyt?
Here's a potential enhancement:
def _extract_values_from_dict( self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n" ) -> str: """ Recursively extracts values from nested structures without field names. Args: data: The data structure to extract values from join_char: The character used to join multiple values Returns: A string containing all values joined by join_char """ def _generate_values(): if data is None: yield "" elif isinstance(data, dict): for value in data.values(): yield from (v for v in _generate_values(value)) elif isinstance(data, list): for item in data: yield from (v for v in _generate_values(item)) else: yield str(data) return join_char.join(_generate_values())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
airbyte_cdk/destinations/vector_db_based/document_processor.py
(3 hunks)
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
8-8
: LGTM! Clean import addition.
The Union type import is properly organized with other typing imports and is necessary for the new method parameter typing.
128-128
: LGTM! Clean configuration addition.
The new configuration parameter is properly initialized following the existing pattern.
170-187
: Verify test coverage for the new functionality
The implementation looks great and meets the PR objectives! Let's verify the test coverage:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (2)
171-175
: Add docstring for clarity?The method looks clean and follows good practices. Would you like to add a docstring explaining the purpose and behavior of this method? Something like:
def _generate_text_from_fields(self, fields: Dict[str, Any]) -> str: + """ + Generates text from fields either with or without field names based on configuration. + + Args: + fields: Dictionary of field names and their values + Returns: + A string representation of the fields + """ if self.omit_field_names_from_embeddings:wdyt? 🤔
177-187
: Consider adding a depth limit?The recursive implementation looks solid! However, for safety against potential stack overflow with deeply nested structures, maybe we could add a depth limit? Here's a thought:
def _extract_values_from_dict( - self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n" + self, data: Union[Dict[Any, Any], List[Any], Any], join_char: str = "\n", depth: int = 100 ) -> str: if data is None: return "" + if depth <= 0: + return str(data) elif isinstance(data, dict): - return join_char.join(self._extract_values_from_dict(value) for value in data.values()) + return join_char.join(self._extract_values_from_dict(value, join_char, depth - 1) for value in data.values())This would prevent potential issues with circular references or extremely deep structures. wdyt? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
airbyte_cdk/destinations/vector_db_based/config.py
(1 hunks)airbyte_cdk/destinations/vector_db_based/document_processor.py
(3 hunks)unit_tests/destinations/vector_db_based/config_test.py
(1 hunks)unit_tests/destinations/vector_db_based/document_processor_test.py
(2 hunks)
🔥 Files not summarized due to errors (1)
- unit_tests/destinations/vector_db_based/document_processor_test.py: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/destinations/vector_db_based/config_test.py
- airbyte_cdk/destinations/vector_db_based/config.py
🔇 Additional comments (3)
airbyte_cdk/destinations/vector_db_based/document_processor.py (3)
8-8
: LGTM! Clean import addition
The Union type import is well-placed with other typing imports and properly supports the new type hints.
171-187
: Verify test coverage
The new functionality looks solid! Shall we verify that it's properly covered by tests?
129-129
: LGTM! Let's verify the config integration
The new config parameter is properly integrated. Would you like me to verify that the ProcessingConfigModel properly defines this field?
✅ Verification successful
Config integration looks perfect! 🎯
The omit_field_names_from_embeddings
field is properly defined in ProcessingConfigModel
with:
- Correct type (bool)
- Sensible default (False)
- Clear title and descriptive help text
- Follows the same pattern as other config parameters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if ProcessingConfigModel properly defines the new field
rg -A 5 "class ProcessingConfigModel" airbyte_cdk/destinations/vector_db_based/config.py
Length of output: 241
Script:
#!/bin/bash
# Let's search for the field definition in the entire file
rg -U "omit_field_names_from_embeddings.*Field" airbyte_cdk/destinations/vector_db_based/config.py
# Also check for any other potential locations of this field
rg "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py
Length of output: 293
Script:
#!/bin/bash
# Let's get more context around the field definition
rg -A 3 -B 3 "omit_field_names_from_embeddings" airbyte_cdk/destinations/vector_db_based/config.py
Length of output: 581
@frankiebromage - Thanks for creating this! Just a heads up: one challenge with rolling this out will be that most of the vector store destinations are several versions behind in their CDK pin. This may be related to the reason the Pinecone test failed in CI. |
I added both of these issues to track discussions around bumping one of either Pinecone or Milvus.
Getting one or both of these to the latest CDK version would give us more confidence in changing CDK functionality around vector destinations. I've also updated (for now at least) the default connector to test here in this PR - setting it to PGVector, since that is the most recent vector-based connector that we have updated. |
Kicking the tires on CI again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/connector-tests.yml (1)
78-85
: Well-documented matrix changes! 🎉The comments clearly explain why certain connectors are disabled, and the addition of PGVector aligns well with the current needs. Would you consider adding TODO comments with links to the tracking issues for source-s3 and destination-pinecone? This could help track when they should be re-enabled, wdyt?
Example format:
# This is behind in CDK updates and can't be used in tests until updated: +# TODO: Re-enable once updated - tracking issue: #<issue_number> # - connector: source-s3 # cdk_extra: file-based
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/connector-tests.yml
(1 hunks)
🔇 Additional comments (1)
.github/workflows/connector-tests.yml (1)
81-82
: Verify destination-pgvector configuration
Let's ensure destination-pgvector is properly set up in the monorepo.
@aaronsteers while it's low(er) priority, I think we can make an epic with all our vector destinations + openai and unstructured and langchain dependency upgrades and run a project with community devs to update them to the latest versions, and add integration tests (use docker to standup DB where needed). Gauging your appetite for something like this. |
Description
Responding to issue #48627
Including field names when embedding text changes the semantic meaning of a vector, so it is useful to have the option of embedding only the text without the field names.
This is my first contribution, so please let me know what else is needed!
Changes
Adds an additional config to allow user to choose whether or not to embed field names with a vector db destination.
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Tests
omit_field_names_from_embeddings
field in the configuration model to ensure proper functionality.Chores
Resolves airbytehq/airbyte#48627 (comment)