-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adds ZipfileDecoder
component
#169
base: main
Are you sure you want to change the base?
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces several new components to the Changes
Assessment against linked issues
Possibly related PRs
Suggested labels
Suggested reviewers
Wdyt about the changes and the suggested reviewers? Are there any other aspects you would like to explore further? 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1)
30-35
: Simplify test assertions for clarity?In your test function, would it be cleaner to compare the entire list of parsed results with the expected output directly? This could make the test more straightforward and easier to read. Wdyt?
Here's how you might adjust it:
def test_json_parser_with_valid_data(raw_data, expected): - for i, actual in enumerate(JsonParser().parse(raw_data)): - if isinstance(expected, list): - assert actual == expected[i] - else: - assert actual == expected + actual = list(JsonParser().parse(raw_data)) + assert actual == expected if isinstance(expected, list) else [expected]unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (3)
17-23
: Consider adding error handling to the helper function?The
create_zip_from_dict
function looks good but might benefit from some error handling for invalid inputs. What do you think about adding try-catch blocks to handle potential serialization errors? wdyt?def create_zip_from_dict(data: Union[dict, list]): try: zip_buffer = BytesIO() with zipfile.ZipFile(zip_buffer, mode="w") as zip_file: zip_file.writestr("data.json", data) zip_buffer.seek(0) return zip_buffer.getvalue() except (TypeError, ValueError) as e: raise ValueError(f"Failed to create zip from data: {e}")
35-36
: Clarify the commented code's purpose?There's a commented line that seems to show an alternative implementation. Consider either removing it if it's no longer needed or adding a comment explaining why it's kept for reference. wdyt?
25-44
: Consider adding more test scenarios?The test covers basic happy path scenarios but could benefit from additional test cases. What do you think about adding tests for:
- Error cases (malformed zip files)
- Empty files
- Large datasets
- Invalid JSON data
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)
1900-1916
: Documentation could be more detailed, wdyt?The ZipfileDecoder component looks good, but the documentation could be enhanced to help users better understand its capabilities. Consider adding:
- How multiple files within a zip are handled
- Example usage showing parser configuration
- What happens if no parser is specified (default behavior)
ZipfileDecoder: title: Zipfile Decoder - description: Decoder for response data that is returned as zipfile(s). + description: | + Decoder for response data that is returned as zipfile(s). When multiple files are present in the zip, + they are processed sequentially. If no parser is specified, JsonParser is used by default. + + Example usage: + ```yaml + decoder: + type: ZipfileDecoder + parser: + type: JsonParser + ```
1917-1927
: Would adding error handling details be helpful?The JsonParser component looks good for basic use cases, but users might benefit from understanding error scenarios. Consider enhancing the description with:
- How parsing errors are handled
- Expected input/output format examples
JsonParser: title: JsonParser - description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format. + description: | + Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format. + Raises JsonParseError if the input cannot be parsed as valid JSON. + + Example input/output: + Input: '{"key": "value"}' + Output: {"key": "value"} + + Input: b'{"key": "value"}' + Output: {"key": "value"}
1928-1949
: Should we use a more practical example?The CustomParser component looks good, but the example might be too whimsical. Consider:
- Using a real-world example that demonstrates practical usage
- Adding information about the base Parser class requirements
CustomParser: title: Custom Parser - description: Use this to implement custom parser logic. + description: | + Use this to implement custom parser logic by extending the base Parser class. + The custom implementation must override the parse method with signature: + def parse(self, response: Union[str, bytes, bytearray]) -> Dict[str, Any] type: object additionalProperties: true required: - type - class_name properties: class_name: examples: - - "source_rivendell.components.ElvishParser" + - "source_amplitude.components.GzipJsonParser" + - "source_salesforce.components.CsvParser"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(1 hunks)airbyte_cdk/sources/declarative/decoders/__init__.py
(1 hunks)airbyte_cdk/sources/declarative/decoders/parsers/__init__.py
(1 hunks)airbyte_cdk/sources/declarative/decoders/parsers/parsers.py
(1 hunks)airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(2 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(6 hunks)unit_tests/sources/declarative/decoders/parsers/__init__.py
(1 hunks)unit_tests/sources/declarative/decoders/parsers/test_parsers.py
(1 hunks)unit_tests/sources/declarative/decoders/test_zipfile_decoder.py
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- unit_tests/sources/declarative/decoders/parsers/init.py
- airbyte_cdk/sources/declarative/decoders/parsers/init.py
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/decoders/__init__.py (1)
10-10
: ZipfileDecoder added to exports correctly!
The inclusion of ZipfileDecoder
in the module imports and __all__
list looks great. Nice work!
Also applies to: 12-12
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)
744-750
: LGTM! Clean implementation of JsonParser
The JsonParser class follows the established pattern and correctly allows extra fields through Config.extra.
1492-1502
: LGTM! Well-structured ZipfileDecoder implementation
The ZipfileDecoder class is well-designed with:
- Proper inheritance from BaseModel
- Flexible parser configuration through Union type
- Clear field descriptions
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
69-69
: LGTM! Clean import additions
The new imports are properly organized and follow the existing import structure.
Also applies to: 71-71
481-481
: LGTM! Well-organized constructor mapping
The new entries in PYDANTIC_MODEL_TO_CONSTRUCTOR are properly placed and maintain alphabetical ordering.
Also applies to: 517-517
1695-1704
: LGTM! Well-implemented ZipfileDecoder factory method
The create_zipfile_decoder method follows the established pattern and properly handles the optional parser configuration.
1705-1708
: LGTM! Clean JsonParser factory method
The create_json_parser method is simple and follows the established pattern for basic components.
try: | ||
zip_file = zipfile.ZipFile(io.BytesIO(response.content)) | ||
except zipfile.BadZipFile as e: | ||
logger.exception(e) | ||
logger.error( | ||
f"Received an invalid zip file in response to URL: {response.request.url}. " | ||
f"The size of the response body is: {len(response.content)}" | ||
) | ||
yield {} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return after handling a bad ZIP file exception?
When a BadZipFile
exception occurs, we're yielding an empty dictionary but continuing execution. Should we add a return
statement after yielding to prevent further processing since the ZIP file is invalid? This might avoid unnecessary iterations. Wdyt?
Here's a suggested change:
except zipfile.BadZipFile as e:
logger.exception(e)
logger.error(
f"Received an invalid zip file in response to URL: {response.request.url}. "
f"The size of the response body is: {len(response.content)}"
)
yield {}
+ return
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try: | |
zip_file = zipfile.ZipFile(io.BytesIO(response.content)) | |
except zipfile.BadZipFile as e: | |
logger.exception(e) | |
logger.error( | |
f"Received an invalid zip file in response to URL: {response.request.url}. " | |
f"The size of the response body is: {len(response.content)}" | |
) | |
yield {} | |
try: | |
zip_file = zipfile.ZipFile(io.BytesIO(response.content)) | |
except zipfile.BadZipFile as e: | |
logger.exception(e) | |
logger.error( | |
f"Received an invalid zip file in response to URL: {response.request.url}. " | |
f"The size of the response body is: {len(response.content)}" | |
) | |
yield {} | |
return | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No thanks
def __post_init__(self, parameters: Mapping[str, Any]) -> None: | ||
self._parser = ( | ||
self.parser(parameters=parameters) if self.parser else JsonParser(parameters=parameters) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Check if parser is already an instance?
In the __post_init__
method, should we verify if self.parser
is already an instance of Parser
before attempting to instantiate it? This might prevent unintended behavior if an instance is passed instead of a class. Wdyt?
Consider modifying the initialization:
def __post_init__(self, parameters: Mapping[str, Any]) -> None:
- self._parser = (
- self.parser(parameters=parameters) if self.parser else JsonParser(parameters=parameters)
- )
+ if isinstance(self.parser, Parser):
+ self._parser = self.parser
+ elif self.parser:
+ self._parser = self.parser(parameters=parameters)
+ else:
+ self._parser = JsonParser(parameters=parameters)
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input is based on the model_to_component.py
factory method and therefore we can assume it will be an uninstantiated parser.
for gzip_filename in zip_file.namelist(): | ||
with zip_file.open(gzip_filename) as file: | ||
try: | ||
for data in gzip.open(file): | ||
yield from self._parser.parse(data) | ||
except gzip.BadGzipFile as e: | ||
logger.exception(e) | ||
logger.error(f"Fail to read contents of zipped response: {e}") | ||
yield {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Handle non-gzip files within the ZIP archive?
I noticed that the code assumes each file inside the ZIP archive is gzip-compressed. Would it be beneficial to check the file type before attempting to decompress with gzip.open
, in case some files aren't gzip-compressed? This could prevent potential errors when processing the ZIP contents. Wdyt?
Consider updating the code like this:
for filename in zip_file.namelist():
with zip_file.open(filename) as file:
+ if filename.endswith('.gz'):
try:
+ with gzip.open(file) as decompressed_file:
+ data = decompressed_file.read()
+ yield from self._parser.parse(data)
except gzip.BadGzipFile as e:
logger.exception(e)
logger.error(f"Fail to read gzip file {filename}: {e}")
yield {}
+ else:
+ data = file.read()
+ yield from self._parser.parse(data)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for gzip_filename in zip_file.namelist(): | |
with zip_file.open(gzip_filename) as file: | |
try: | |
for data in gzip.open(file): | |
yield from self._parser.parse(data) | |
except gzip.BadGzipFile as e: | |
logger.exception(e) | |
logger.error(f"Fail to read contents of zipped response: {e}") | |
yield {} | |
for filename in zip_file.namelist(): | |
with zip_file.open(filename) as file: | |
if filename.endswith('.gz'): | |
try: | |
with gzip.open(file) as decompressed_file: | |
data = decompressed_file.read() | |
yield from self._parser.parse(data) | |
except gzip.BadGzipFile as e: | |
logger.exception(e) | |
logger.error(f"Fail to read gzip file {filename}: {e}") | |
yield {} | |
else: | |
data = file.read() | |
yield from self._parser.parse(data) |
try: | ||
body_json = json.loads(data) | ||
except json.JSONDecodeError: | ||
logger.warning(f"Data cannot be parsed into json: {data=}") | ||
yield {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prevent NameError after JSON decoding fails?
When a JSONDecodeError
occurs, body_json
isn't defined, but the code continues and may reference it later. Should we add a return
statement after yielding {}
to avoid potential NameError
exceptions? Wdyt?
Consider updating the exception handling:
except json.JSONDecodeError:
logger.warning(f"Data cannot be parsed into json: {data=}")
yield {}
+ return
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try: | |
body_json = json.loads(data) | |
except json.JSONDecodeError: | |
logger.warning(f"Data cannot be parsed into json: {data=}") | |
yield {} | |
try: | |
body_json = json.loads(data) | |
except json.JSONDecodeError: | |
logger.warning(f"Data cannot be parsed into json: {data=}") | |
yield {} | |
return |
Can you create recommend reviewing order of the modified files? Please skip any code related to the JsonParser. |
Tip For best results, initiate chat on the files or code changes. @pnilan: Sure, here's a recommended order for reviewing the modified files, excluding any code related to the
Does this reviewing order help? Let me know if you have any questions or need further adjustments. Wdyt? |
What
ZipfileDecoder
component to decode responses containing compressed zip files (e.g.source-amplitude
events stream)How
ZipFileDecoder
class and component. Enables connector dev to specify underlying data parser component (or defaults to theJsonParser
component)Recommended Reviewing Order
Summary by CodeRabbit
Summary by CodeRabbit
New Features
ZipfileDecoder
,JsonParser
, andCustomParser
for enhanced data handling and custom parsing capabilities.ZipfileDecoder
andJsonParser
through factory methods.Bug Fixes
ZipfileDecoder
for invalid zip and gzip files.Tests
JsonParser
andZipfileDecoder
to ensure correct parsing and decoding functionality.