Fix: Added regex to sub special characters #545

tinomerl · 2024-12-02T07:59:14Z

Fixes #544.

I've tested the fix with three different sources.

Source File

import airbyte as ab

file_content = "col_1,col_2\nfoo,bar\n"
with open("test \ special chars.csv", "w") as f:
    f.write(file_content)

source: ab.Source = ab.get_source("source-file")
config = {
    "dataset_name": "test \ special chars",
    "format": "csv",
    "url": "test \ special chars.csv",
    "provider": {"storage": "local"}
}

source.set_config(config)
source.select_all_streams()
records = source.read()
records.streams.get("test \\ special chars").to_pandas()

PokeApi

import airbyte as ab

source: ab.Source = ab.get_source(    "source-pokeapi",
    config={"pokemon_name": "bulbasaur"},
    source_manifest=True,
)

source.check()
source.get_available_streams()
source.select_all_streams()
records = source.read()

It also worked on the custom Connector mentioned in the Bug Issue.

Summary by CodeRabbit

Bug Fixes
- Improved file path generation by cleaning stream names to remove special characters, ensuring compatibility with operating system restrictions.

coderabbitai · 2024-12-02T08:04:56Z

📝 Walkthrough

Walkthrough

The changes made in this pull request involve modifying the _get_new_cache_file_path method in the FileWriterBase class located in airbyte/_writers/file_writers.py. A new dependency, LowerCaseNormalizer, is introduced to sanitize the stream_name parameter by removing special characters that are not allowed in file names, specifically <>:"/\|?* and ASCII control characters (0-31). The sanitized normalized_stream_name is then utilized to construct the cache file path, which includes the batch_id and the default cache file suffix.

Changes

File Path	Change Summary
airbyte/_writers/file_writers.py	Modified `_get_new_cache_file_path` to sanitize the `stream_name` by removing special characters and ASCII control characters.

Assessment against linked issues

Objective	Addressed	Explanation
Remove special characters in stream names to prevent file creation issues (#544)	✅

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bff1359 and ee76045.

📒 Files selected for processing (1)

airbyte/_writers/file_writers.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

airbyte/_writers/file_writers.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)

airbyte/_writers/file_writers.py (3)
65-69: Add safety checks for edge cases? 🛡️

The current implementation might produce an empty string if the stream name only contains special characters. Maybe we could add some validation? Something like:
         cleaned_stream_name = re.sub(r"[^a-zA-Z0-9\s]", "", stream_name)
+        if not cleaned_stream_name.strip():
+            raise ValueError(f"Stream name '{stream_name}' contains no valid characters for filename")
What are your thoughts on this additional safety check?

65-69: Would you like help with test coverage? 🧪

It would be great to add some unit tests to verify the behavior with:

Stream names containing various special characters

Edge cases like empty strings or whitespace-only names

Names that would collide after cleaning

Would you like me to help draft some test cases or open an issue to track this?

65-67: Consider enhancing the comment? 📝

The current comment is good but could be more specific about which OS restrictions we're addressing. Maybe something like:
-        # If a stream contains a special Character, the temporary jsonl.gz
-        # file can't be created, because of OS restrictions. Therefore, we
-        # remove the special characters.
+        # Windows and certain Linux filesystems (especially on WSL) restrict specific characters
+        # in filenames. Common restrictions include: <>:"/\|?* and control characters.
+        # We sanitize the stream name to ensure compatibility across different OS environments.
What do you think about this more detailed explanation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f2529a6 and e19b95e.

📒 Files selected for processing (1)

airbyte/_writers/file_writers.py (2 hunks)

🔇 Additional comments (2)

airbyte/_writers/file_writers.py (2)

7-7: LGTM on the import!

Clean addition of the re module, properly placed with other standard library imports.

65-69: Verify impact on existing tests and related code 🔍

Let's check if this change might affect existing tests or if similar sanitization might be needed elsewhere:

✅ Verification successful

Stream name sanitization looks safe, but how about adding a test? 🤔

From the verification results, I can see that the existing tests don't rely on special characters in stream names - they use simple alphanumeric names like "stream1", "products", "users", etc. Also, this is the only place where we're sanitizing stream names for file paths.

However, would you be interested in adding a test case that explicitly verifies this sanitization behavior? Something like testing that a stream name with special characters still works correctly? wdyt?

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for tests that might rely on special characters in stream names
echo "Checking for tests using special characters in stream names..."
rg -l "stream_name.*[^a-zA-Z0-9\s]" tests/

# Check for other places that might need similar filename sanitization
echo "Checking for other file path construction patterns..."
ast-grep --pattern 'Path($1)'

Length of output: 728

Script:

#!/bin/bash
# Let's check the content of these test files to see how they use stream names
echo "Checking test_progress.py for stream name usage:"
rg "stream_name" tests/unit_tests/test_progress.py -B 2 -A 2

echo "\nChecking test_source_faker_integration.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_source_faker_integration.py -B 2 -A 2

echo "\nChecking test_state_handling.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_state_handling.py -B 2 -A 2

# Let's also check for similar file path sanitization patterns
echo "\nChecking for similar file path sanitization patterns:"
rg "re\.sub.*stream_name" -B 2 -A 2

Length of output: 2984

airbyte/_writers/file_writers.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

airbyte/_writers/file_writers.py

aaronsteers

@tinomerl - This is a valuable update - thank you!

Should we include similar logic also when designating a table name?

tinomerl · 2024-12-03T07:04:25Z

Hey @aaronsteers,

thanks, that's great to hear. I think there is already some kind of logic implemented for table names. Just looked at the table name for my file test and it looks like this, in the default cache:

Maybe i should adapt the same replacement pattern for filenames. So instead of removing any of the characters with nothing, using the underscore could be better. I think that would also cover some edge cases better. Like for example if we have something like this:

myStreamDataFromA\B
myStreamDataFromAB

They would currently be named the same. But with the underscores they would differ. What do you think?

aaronsteers · 2024-12-03T15:04:26Z

@tinomerl - thanks for confirming that the table names are working well already. The method used for table name normalization is in the LowerCaseNormalizer class. You may be able to use the same here for normalizing the stream name prior to creating the path.

I'm ok with either approach but if the lower case normalizer class works well, that might be slightly easier to maintain, since the same code would be applied for both cases.

Want to try and let me know what you find? If for any reason that doesn't work well, I think your approach here is perfectly sound.

tinomerl · 2024-12-03T16:34:34Z

@aaronsteers - thanks for the Infos. I'll try it out tomorrow and let you know the results.

tinomerl · 2024-12-04T07:17:50Z

@aaronsteers - So i just changed it to use the LowerCaseNormalizer instead of just the regex pattern. And it works just fine. I like it this way more because, like you said, it can now be maintained at just one spot.

aaronsteers · 2024-12-04T14:52:35Z

@tinomerl - This is great. I'll run a few more tests and merge/release later today if all is passing! Thanks again for this contribution. 🙏

tinomerl · 2024-12-04T15:03:29Z

@aaronsteers - thanks for the kind words. I'm always happy to help out. :)

aaronsteers · 2024-12-05T23:41:45Z

/test-pr

PR test job started... Check job output.

❌ Tests failed.

aaronsteers · 2024-12-10T01:24:12Z

I meant to merge earlier but ran into some test failures. All is looking good now so I'll go ahead and merge+release.

tinomerl added 2 commits December 2, 2024 08:38

added regex to sub special characters

Loading
Loading status checks…

dbffc27

added comment to explain behaviour

Loading
Loading status checks…

e19b95e

tinomerl changed the title ~~🐛Bugfix: added regex to sub special characters~~ Fix: added regex to sub special characters Dec 2, 2024

coderabbitai bot requested changes Dec 2, 2024

View reviewed changes

airbyte/_writers/file_writers.py Outdated Show resolved Hide resolved

tinomerl changed the title ~~Fix: added regex to sub special characters~~ Fix: Added regex to sub special characters Dec 2, 2024

Add Suggestions

Loading
Loading status checks…

bff1359

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot approved these changes Dec 2, 2024

View reviewed changes

aaronsteers reviewed Dec 2, 2024

View reviewed changes

airbyte/_writers/file_writers.py Outdated Show resolved Hide resolved

expand comment

Loading
Loading status checks…

fddfa1b

aaronsteers reviewed Dec 3, 2024

View reviewed changes

used lowercasenormalizer

Loading
Loading status checks…

ee76045

aaronsteers approved these changes Dec 4, 2024

View reviewed changes

aaronsteers merged commit ea54711 into airbytehq:main Dec 10, 2024
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Added regex to sub special characters #545

Fix: Added regex to sub special characters #545

tinomerl commented Dec 2, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 2, 2024 •

edited

Loading

Walkthrough

Changes

Assessment against linked issues

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

aaronsteers left a comment

tinomerl commented Dec 3, 2024

aaronsteers commented Dec 3, 2024

tinomerl commented Dec 3, 2024

tinomerl commented Dec 4, 2024

aaronsteers commented Dec 4, 2024

tinomerl commented Dec 4, 2024

aaronsteers commented Dec 5, 2024 •

edited by github-actions bot

Loading

aaronsteers commented Dec 10, 2024

Fix: Added regex to sub special characters #545

Fix: Added regex to sub special characters #545

Conversation

tinomerl commented Dec 2, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 2, 2024 • edited Loading

Walkthrough

Changes

Assessment against linked issues

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

aaronsteers left a comment

Choose a reason for hiding this comment

tinomerl commented Dec 3, 2024

aaronsteers commented Dec 3, 2024

tinomerl commented Dec 3, 2024

tinomerl commented Dec 4, 2024

aaronsteers commented Dec 4, 2024

tinomerl commented Dec 4, 2024

aaronsteers commented Dec 5, 2024 • edited by github-actions bot Loading

aaronsteers commented Dec 10, 2024

tinomerl commented Dec 2, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 2, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

aaronsteers commented Dec 5, 2024 •

edited by github-actions bot

Loading