Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds ZipfileDecoder component #169

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

pnilan
Copy link
Contributor

@pnilan pnilan commented Dec 12, 2024

What

How

  • Creates new ZipFileDecoder class and component. Enables connector dev to specify underlying data parser component (or defaults to the JsonParser component)

Recommended Reviewing Order

  1. zipfile_decoder.py
  2. model_to_component_factory.py
  3. declarative_component_schema.yaml
  4. test_zipfile_decoder.py

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Introduced new components: ZipfileDecoder, JsonParser, and CustomParser for enhanced data handling and custom parsing capabilities.
    • Added functionality for creating ZipfileDecoder and JsonParser through factory methods.
  • Bug Fixes

    • Improved error handling in ZipfileDecoder for invalid zip and gzip files.
  • Tests

    • Added tests for JsonParser and ZipfileDecoder to ensure correct parsing and decoding functionality.

@github-actions github-actions bot added the enhancement New feature or request label Dec 12, 2024
Copy link
Contributor

coderabbitai bot commented Dec 12, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The pull request introduces several new components to the declarative_component_schema.yaml file, including ZipfileDecoder, JsonParser, and CustomParser, enhancing the schema for data extraction and transformation. It also adds a new zipfile_decoder.py file containing the ZipfileDecoder class, which processes zip file responses using specified parsers. The JsonParser and CustomParser classes are defined for parsing JSON and custom formats, respectively. Additionally, new methods are added to the ModelToComponentFactory for creating these components, alongside corresponding tests to validate their functionality.

Changes

File Path Change Summary
airbyte_cdk/sources/declarative/declarative_component_schema.yaml Added ZipfileDecoder, JsonParser, and CustomParser components with properties for handling various data formats.
airbyte_cdk/sources/declarative/decoders/__init__.py Updated __all__ to include ZipfileDecoder.
airbyte_cdk/sources/declarative/decoders/parsers/__init__.py Created new __init__.py file to export Parser and JsonParser.
airbyte_cdk/sources/declarative/decoders/parsers/parsers.py Added Parser and JsonParser classes with parsing logic for JSON data.
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py Introduced ZipfileDecoder class for decoding zip file responses, handling errors in zip and gzip processing.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py Added JsonParser, CustomParser, and ZipfileDecoder classes extending BaseModel.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py Added methods create_zipfile_decoder and create_json_parser to facilitate component creation.
unit_tests/sources/declarative/decoders/parsers/__init__.py Added copyright notice; no functional changes.
unit_tests/sources/declarative/decoders/parsers/test_parsers.py Introduced tests for JsonParser to validate parsing functionality with various input data.
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py Added tests for ZipfileDecoder, including helper functions for creating zip files from dictionaries.

Assessment against linked issues

Objective Addressed Explanation
Create a ZipfileDecoder class that handles zip file responses (163)
The ZipfileDecoder should use an underlying parser (163)
The decoder should yield data in the same format as other decoders (163)

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • maxi297
  • aldogonzalez8

Wdyt about the changes and the suggested reviewers? Are there any other aspects you would like to explore further?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49d0ec8 and 96ec874.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (7)
unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1)

30-35: Simplify test assertions for clarity?

In your test function, would it be cleaner to compare the entire list of parsed results with the expected output directly? This could make the test more straightforward and easier to read. Wdyt?

Here's how you might adjust it:

def test_json_parser_with_valid_data(raw_data, expected):
-    for i, actual in enumerate(JsonParser().parse(raw_data)):
-        if isinstance(expected, list):
-            assert actual == expected[i]
-        else:
-            assert actual == expected
+    actual = list(JsonParser().parse(raw_data))
+    assert actual == expected if isinstance(expected, list) else [expected]
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (3)

17-23: Consider adding error handling to the helper function?

The create_zip_from_dict function looks good but might benefit from some error handling for invalid inputs. What do you think about adding try-catch blocks to handle potential serialization errors? wdyt?

def create_zip_from_dict(data: Union[dict, list]):
    try:
        zip_buffer = BytesIO()
        with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
            zip_file.writestr("data.json", data)
        zip_buffer.seek(0)
        return zip_buffer.getvalue()
    except (TypeError, ValueError) as e:
        raise ValueError(f"Failed to create zip from data: {e}")

35-36: Clarify the commented code's purpose?

There's a commented line that seems to show an alternative implementation. Consider either removing it if it's no longer needed or adding a comment explaining why it's kept for reference. wdyt?


25-44: Consider adding more test scenarios?

The test covers basic happy path scenarios but could benefit from additional test cases. What do you think about adding tests for:

  • Error cases (malformed zip files)
  • Empty files
  • Large datasets
  • Invalid JSON data
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1900-1916: Documentation could be more detailed, wdyt?

The ZipfileDecoder component looks good, but the documentation could be enhanced to help users better understand its capabilities. Consider adding:

  1. How multiple files within a zip are handled
  2. Example usage showing parser configuration
  3. What happens if no parser is specified (default behavior)
   ZipfileDecoder:
     title: Zipfile Decoder
-    description: Decoder for response data that is returned as zipfile(s).
+    description: |
+      Decoder for response data that is returned as zipfile(s). When multiple files are present in the zip,
+      they are processed sequentially. If no parser is specified, JsonParser is used by default.
+      
+      Example usage:
+      ```yaml
+      decoder:
+        type: ZipfileDecoder
+        parser:
+          type: JsonParser
+      ```

1917-1927: Would adding error handling details be helpful?

The JsonParser component looks good for basic use cases, but users might benefit from understanding error scenarios. Consider enhancing the description with:

  1. How parsing errors are handled
  2. Expected input/output format examples
   JsonParser:
     title: JsonParser
-    description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+    description: |
+      Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+      Raises JsonParseError if the input cannot be parsed as valid JSON.
+      
+      Example input/output:
+      Input: '{"key": "value"}'
+      Output: {"key": "value"}
+      
+      Input: b'{"key": "value"}'
+      Output: {"key": "value"}

1928-1949: Should we use a more practical example?

The CustomParser component looks good, but the example might be too whimsical. Consider:

  1. Using a real-world example that demonstrates practical usage
  2. Adding information about the base Parser class requirements
   CustomParser:
     title: Custom Parser
-    description: Use this to implement custom parser logic.
+    description: |
+      Use this to implement custom parser logic by extending the base Parser class.
+      The custom implementation must override the parse method with signature:
+      def parse(self, response: Union[str, bytes, bytearray]) -> Dict[str, Any]
     type: object
     additionalProperties: true
     required:
       - type
       - class_name
     properties:
       class_name:
         examples:
-          - "source_rivendell.components.ElvishParser"
+          - "source_amplitude.components.GzipJsonParser"
+          - "source_salesforce.components.CsvParser"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceebfda and 49d0ec8.

📒 Files selected for processing (10)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/parsers.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
  • unit_tests/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • unit_tests/sources/declarative/decoders/parsers/init.py
  • airbyte_cdk/sources/declarative/decoders/parsers/init.py
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/decoders/__init__.py (1)

10-10: ZipfileDecoder added to exports correctly!

The inclusion of ZipfileDecoder in the module imports and __all__ list looks great. Nice work!

Also applies to: 12-12

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

744-750: LGTM! Clean implementation of JsonParser

The JsonParser class follows the established pattern and correctly allows extra fields through Config.extra.


1492-1502: LGTM! Well-structured ZipfileDecoder implementation

The ZipfileDecoder class is well-designed with:

  • Proper inheritance from BaseModel
  • Flexible parser configuration through Union type
  • Clear field descriptions
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

69-69: LGTM! Clean import additions

The new imports are properly organized and follow the existing import structure.

Also applies to: 71-71


481-481: LGTM! Well-organized constructor mapping

The new entries in PYDANTIC_MODEL_TO_CONSTRUCTOR are properly placed and maintain alphabetical ordering.

Also applies to: 517-517


1695-1704: LGTM! Well-implemented ZipfileDecoder factory method

The create_zipfile_decoder method follows the established pattern and properly handles the optional parser configuration.


1705-1708: LGTM! Clean JsonParser factory method

The create_json_parser method is simple and follows the established pattern for basic components.

Comment on lines +36 to +45
try:
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
except zipfile.BadZipFile as e:
logger.exception(e)
logger.error(
f"Received an invalid zip file in response to URL: {response.request.url}. "
f"The size of the response body is: {len(response.content)}"
)
yield {}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Return after handling a bad ZIP file exception?

When a BadZipFile exception occurs, we're yielding an empty dictionary but continuing execution. Should we add a return statement after yielding to prevent further processing since the ZIP file is invalid? This might avoid unnecessary iterations. Wdyt?

Here's a suggested change:

            except zipfile.BadZipFile as e:
                logger.exception(e)
                logger.error(
                    f"Received an invalid zip file in response to URL: {response.request.url}. "
                    f"The size of the response body is: {len(response.content)}"
                )
                yield {}
+               return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
except zipfile.BadZipFile as e:
logger.exception(e)
logger.error(
f"Received an invalid zip file in response to URL: {response.request.url}. "
f"The size of the response body is: {len(response.content)}"
)
yield {}
try:
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
except zipfile.BadZipFile as e:
logger.exception(e)
logger.error(
f"Received an invalid zip file in response to URL: {response.request.url}. "
f"The size of the response body is: {len(response.content)}"
)
yield {}
return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No thanks

Comment on lines +25 to +28
def __post_init__(self, parameters: Mapping[str, Any]) -> None:
self._parser = (
self.parser(parameters=parameters) if self.parser else JsonParser(parameters=parameters)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Check if parser is already an instance?

In the __post_init__ method, should we verify if self.parser is already an instance of Parser before attempting to instantiate it? This might prevent unintended behavior if an instance is passed instead of a class. Wdyt?

Consider modifying the initialization:

        def __post_init__(self, parameters: Mapping[str, Any]) -> None:
-           self._parser = (
-               self.parser(parameters=parameters) if self.parser else JsonParser(parameters=parameters)
-           )
+           if isinstance(self.parser, Parser):
+               self._parser = self.parser
+           elif self.parser:
+               self._parser = self.parser(parameters=parameters)
+           else:
+               self._parser = JsonParser(parameters=parameters)

Committable suggestion skipped: line range outside the PR's diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input is based on the model_to_component.py factory method and therefore we can assume it will be an uninstantiated parser.

Comment on lines +46 to +54
for gzip_filename in zip_file.namelist():
with zip_file.open(gzip_filename) as file:
try:
for data in gzip.open(file):
yield from self._parser.parse(data)
except gzip.BadGzipFile as e:
logger.exception(e)
logger.error(f"Fail to read contents of zipped response: {e}")
yield {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Handle non-gzip files within the ZIP archive?

I noticed that the code assumes each file inside the ZIP archive is gzip-compressed. Would it be beneficial to check the file type before attempting to decompress with gzip.open, in case some files aren't gzip-compressed? This could prevent potential errors when processing the ZIP contents. Wdyt?

Consider updating the code like this:

        for filename in zip_file.namelist():
            with zip_file.open(filename) as file:
+               if filename.endswith('.gz'):
                    try:
+                       with gzip.open(file) as decompressed_file:
+                           data = decompressed_file.read()
+                           yield from self._parser.parse(data)
                    except gzip.BadGzipFile as e:
                        logger.exception(e)
                        logger.error(f"Fail to read gzip file {filename}: {e}")
                        yield {}
+               else:
+                   data = file.read()
+                   yield from self._parser.parse(data)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for gzip_filename in zip_file.namelist():
with zip_file.open(gzip_filename) as file:
try:
for data in gzip.open(file):
yield from self._parser.parse(data)
except gzip.BadGzipFile as e:
logger.exception(e)
logger.error(f"Fail to read contents of zipped response: {e}")
yield {}
for filename in zip_file.namelist():
with zip_file.open(filename) as file:
if filename.endswith('.gz'):
try:
with gzip.open(file) as decompressed_file:
data = decompressed_file.read()
yield from self._parser.parse(data)
except gzip.BadGzipFile as e:
logger.exception(e)
logger.error(f"Fail to read gzip file {filename}: {e}")
yield {}
else:
data = file.read()
yield from self._parser.parse(data)

Comment on lines +38 to +42
try:
body_json = json.loads(data)
except json.JSONDecodeError:
logger.warning(f"Data cannot be parsed into json: {data=}")
yield {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Prevent NameError after JSON decoding fails?

When a JSONDecodeError occurs, body_json isn't defined, but the code continues and may reference it later. Should we add a return statement after yielding {} to avoid potential NameError exceptions? Wdyt?

Consider updating the exception handling:

            except json.JSONDecodeError:
                logger.warning(f"Data cannot be parsed into json: {data=}")
                yield {}
+               return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
body_json = json.loads(data)
except json.JSONDecodeError:
logger.warning(f"Data cannot be parsed into json: {data=}")
yield {}
try:
body_json = json.loads(data)
except json.JSONDecodeError:
logger.warning(f"Data cannot be parsed into json: {data=}")
yield {}
return

@pnilan
Copy link
Contributor Author

pnilan commented Dec 12, 2024

@coderabbitai

Can you create recommend reviewing order of the modified files? Please skip any code related to the JsonParser.

Copy link
Contributor

coderabbitai bot commented Dec 12, 2024

Tip

For best results, initiate chat on the files or code changes.

@pnilan: Sure, here's a recommended order for reviewing the modified files, excluding any code related to the JsonParser:

  1. airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

    • Begin with the implementation of the ZipfileDecoder class, which introduces the core functionality for decoding zip file responses.
  2. airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

    • Next, examine how the ZipfileDecoder is integrated into the component factory via the create_zipfile_decoder method.
  3. airbyte_cdk/sources/declarative/declarative_component_schema.yaml

    • Then, check the updates to the declarative component schema to include the ZipfileDecoder component.
  4. airbyte_cdk/sources/declarative/models/declarative_component_schema.py

    • Review the addition of the ZipfileDecoder class in the Pydantic models to ensure it aligns with the schema definitions.
  5. airbyte_cdk/sources/declarative/decoders/init.py

    • Ensure that the ZipfileDecoder is properly exported in the module's __all__ list.
  6. unit_tests/sources/declarative/decoders/test_zipfile_decoder.py

    • Finally, verify the unit tests for the ZipfileDecoder to confirm that the new functionality works as expected.

Does this reviewing order help? Let me know if you have any questions or need further adjustments. Wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(ZipfileDecoder) - Create new zip decoder that takes an underlying parser to decoder and parse a response
1 participant