-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: Added regex to sub special characters #545
Fix: Added regex to sub special characters #545
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe changes made in this pull request involve modifying the Changes
Assessment against linked issues
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
airbyte/_writers/file_writers.py (3)
65-69
: Add safety checks for edge cases? 🛡️The current implementation might produce an empty string if the stream name only contains special characters. Maybe we could add some validation? Something like:
cleaned_stream_name = re.sub(r"[^a-zA-Z0-9\s]", "", stream_name) + if not cleaned_stream_name.strip(): + raise ValueError(f"Stream name '{stream_name}' contains no valid characters for filename")What are your thoughts on this additional safety check?
65-69
: Would you like help with test coverage? 🧪It would be great to add some unit tests to verify the behavior with:
- Stream names containing various special characters
- Edge cases like empty strings or whitespace-only names
- Names that would collide after cleaning
Would you like me to help draft some test cases or open an issue to track this?
65-67
: Consider enhancing the comment? 📝The current comment is good but could be more specific about which OS restrictions we're addressing. Maybe something like:
- # If a stream contains a special Character, the temporary jsonl.gz - # file can't be created, because of OS restrictions. Therefore, we - # remove the special characters. + # Windows and certain Linux filesystems (especially on WSL) restrict specific characters + # in filenames. Common restrictions include: <>:"/\|?* and control characters. + # We sanitize the stream name to ensure compatibility across different OS environments.What do you think about this more detailed explanation?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
airbyte/_writers/file_writers.py
(2 hunks)
🔇 Additional comments (2)
airbyte/_writers/file_writers.py (2)
7-7
: LGTM on the import!
Clean addition of the re
module, properly placed with other standard library imports.
65-69
: Verify impact on existing tests and related code 🔍
Let's check if this change might affect existing tests or if similar sanitization might be needed elsewhere:
✅ Verification successful
Stream name sanitization looks safe, but how about adding a test? 🤔
From the verification results, I can see that the existing tests don't rely on special characters in stream names - they use simple alphanumeric names like "stream1", "products", "users", etc. Also, this is the only place where we're sanitizing stream names for file paths.
However, would you be interested in adding a test case that explicitly verifies this sanitization behavior? Something like testing that a stream name with special characters still works correctly? wdyt?
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for tests that might rely on special characters in stream names
echo "Checking for tests using special characters in stream names..."
rg -l "stream_name.*[^a-zA-Z0-9\s]" tests/
# Check for other places that might need similar filename sanitization
echo "Checking for other file path construction patterns..."
ast-grep --pattern 'Path($1)'
Length of output: 728
Script:
#!/bin/bash
# Let's check the content of these test files to see how they use stream names
echo "Checking test_progress.py for stream name usage:"
rg "stream_name" tests/unit_tests/test_progress.py -B 2 -A 2
echo "\nChecking test_source_faker_integration.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_source_faker_integration.py -B 2 -A 2
echo "\nChecking test_state_handling.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_state_handling.py -B 2 -A 2
# Let's also check for similar file path sanitization patterns
echo "\nChecking for similar file path sanitization patterns:"
rg "re\.sub.*stream_name" -B 2 -A 2
Length of output: 2984
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tinomerl - This is a valuable update - thank you!
Should we include similar logic also when designating a table name?
Hey @aaronsteers, thanks, that's great to hear. I think there is already some kind of logic implemented for table names. Just looked at the table name for my file test and it looks like this, in the default cache: Maybe i should adapt the same replacement pattern for filenames. So instead of removing any of the characters with nothing, using the underscore could be better. I think that would also cover some edge cases better. Like for example if we have something like this:
They would currently be named the same. But with the underscores they would differ. What do you think? |
@tinomerl - thanks for confirming that the table names are working well already. The method used for table name normalization is in the LowerCaseNormalizer class. You may be able to use the same here for normalizing the stream name prior to creating the path. I'm ok with either approach but if the lower case normalizer class works well, that might be slightly easier to maintain, since the same code would be applied for both cases. Want to try and let me know what you find? If for any reason that doesn't work well, I think your approach here is perfectly sound. |
@aaronsteers - thanks for the Infos. I'll try it out tomorrow and let you know the results. |
@aaronsteers - So i just changed it to use the |
@tinomerl - This is great. I'll run a few more tests and merge/release later today if all is passing! Thanks again for this contribution. 🙏 |
@aaronsteers - thanks for the kind words. I'm always happy to help out. :) |
/test-pr
|
I meant to merge earlier but ran into some test failures. All is looking good now so I'll go ahead and merge+release. |
Fixes #544.
I've tested the fix with three different sources.
Source File
PokeApi
It also worked on the custom Connector mentioned in the Bug Issue.
Summary by CodeRabbit