Add padding to base64 before decoding #769

Merged
merged 39 commits into from
Jul 26, 2024


NolanTrem
Collaborator

@NolanTrem NolanTrem commented Jul 26, 2024

Fixes an issue where base64 strings whose length was not a multiple of four characters caused "invalid start byte" errors during document ingestion.


🚀 This description was created by Ellipsis for commit 89b0afb

Summary:

Added base64 padding and normalization in Document class to prevent decoding errors.

Key points:

  • Added Document.decode_base64 method to handle base64 padding and normalization in r2r/base/abstractions/document.py.
  • Updated Document.__init__ to use decode_base64 for decoding base64 strings.
  • Ensures proper handling of base64 strings whose length is not a multiple of four characters, preventing "invalid start byte" errors.
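
As a rough illustration of why padding matters (not code from this PR): Python's standard `base64.b64decode` rejects input whose length is not a multiple of four, and appending `=` characters up to the next multiple of four lets the same input decode. The sample string is an assumption chosen for the demo.

```python
import base64
import binascii

# "hello" encodes to "aGVsbG8="; here the trailing "=" has been stripped,
# leaving 7 characters (not a multiple of four).
unpadded = "aGVsbG8"

try:
    base64.b64decode(unpadded)
except binascii.Error as exc:
    print(f"decode failed: {exc}")

# Pad with "=" up to the next multiple of four, then decode.
padded = unpadded + "=" * (-len(unpadded) % 4)
print(base64.b64decode(padded))  # b'hello'
```

The expression `-len(s) % 4` yields 0 when the length is already a multiple of four, so strings that are already padded are left unchanged.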


Contributor

@ellipsis-dev ellipsis-dev bot left a comment


👍 Looks good to me! Reviewed everything up to 89b0afb in 32 seconds

More details
  • Looked at 72 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. r2r/base/abstractions/document.py:46
  • Draft comment:
    The decode_base64 method correctly adds the necessary padding to base64 strings before decoding, which should resolve the issue with strings whose length is not a multiple of four. Good use of regex for cleaning the input and of exception handling for robust error feedback.
  • Reason this comment was not posted:
    Confidence changes required: 0%
    The PR introduces a method to add padding to base64 strings before decoding. This is generally a good practice as base64 encoding requires the string length to be a multiple of 4. The method first checks if the data is a string and encodes it to ASCII, then removes any non-base64 characters, and finally adds the necessary padding before attempting to decode it. The exception handling is appropriate, raising a ValueError if the decoding fails. Overall, the implementation seems correct and should resolve the issue described in the PR.
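
The steps described above could be sketched roughly as follows. This is a hypothetical reconstruction from the review text, not the merged code in r2r/base/abstractions/document.py: the standalone function form, the exact regex, the choice to strip existing `=` before re-padding, and the error message are all assumptions.

```python
import base64
import binascii
import re
from typing import Union


def decode_base64(data: Union[str, bytes]) -> bytes:
    """Sketch of padding-tolerant base64 decoding (assumed behavior)."""
    # Step 1: accept str input by encoding it to ASCII bytes.
    if isinstance(data, str):
        data = data.encode("ascii")

    # Step 2: drop everything outside the standard base64 alphabet.
    # Assumption: existing "=" padding is stripped too and re-added below.
    data = re.sub(rb"[^a-zA-Z0-9+/]", b"", data)

    # Step 3: pad with "=" up to the next multiple of four.
    data += b"=" * (-len(data) % 4)

    # Step 4: decode, surfacing failures as ValueError.
    try:
        return base64.b64decode(data)
    except binascii.Error as exc:
        raise ValueError(f"Invalid base64 content: {exc}") from exc


print(decode_base64("aGVsbG8"))    # missing padding
print(decode_base64("aGVs\nbG8="))  # embedded newline, already padded
```

One consequence of stripping non-alphabet characters first is that whitespace and line breaks in the input are tolerated. Padding cannot rescue every input: a cleaned string with one character more than a multiple of four is still rejected by the decoder and ends up as a `ValueError` in this sketch.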

Workflow ID: wflow_vxBupECWIxy4sCpb



@emrgnt-cmplxty emrgnt-cmplxty changed the base branch from main to dev July 26, 2024 14:15
@emrgnt-cmplxty emrgnt-cmplxty merged commit b3657ee into dev Jul 26, 2024
3 of 4 checks passed
@NolanTrem NolanTrem deleted the Nolan/FixBase64Decoding branch July 26, 2024 23:14

2 participants