[pyupgrade] Fix Unicode escape sequence handling in format string parsing#19774
[pyupgrade] Fix Unicode escape sequence handling in format string parsing#19774danparizher wants to merge 3 commits intoastral-sh:mainfrom
pyupgrade] Fix Unicode escape sequence handling in format string parsing#19774Conversation
|
ntBre
left a comment
There was a problem hiding this comment.
Thanks for looking into this!
I'm a bit wary of changing this parsing code, but I had a few nits if we do go this route. We'd probably also want @MichaReiser to take a look here.
This code isn't super widely used, only occurring in ~3 rules or so, so it's not as foundational as I thought initially, but we should still be careful.
| let chars: Vec<char> = cur_text.chars().collect(); | ||
|
|
||
| for (i, &c) in chars.iter().enumerate().skip(3) { |
There was a problem hiding this comment.
I don't think we need to collect here:
| let chars: Vec<char> = cur_text.chars().collect(); | |
| for (i, &c) in chars.iter().enumerate().skip(3) { | |
| for (i, c) in cur_text.chars().iter().enumerate().skip(3) { |
| if c == '{' { | ||
| brace_count += 1; | ||
| } else if c == '}' { |
There was a problem hiding this comment.
I don't think we need to count the braces, but we need to return some kind of error if we encounter another open brace, based on a quick test in Python. Maybe there's a way to make this work, but it seems to me that braces aren't allowed inside of a \N{...} escape:
>>> "\N{{angle}}".format(angle="angle")
File "<python-input-0>", line 1
"\N{{angle}}".format(angle="angle")
^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: unknown Unicode character name| if brace_count == 0 { | ||
| // Add the entire Unicode escape sequence as literal text | ||
| result_string.push_str(&cur_text[..end_pos]); | ||
| cur_text = &cur_text[end_pos..]; |
There was a problem hiding this comment.
If we push_str("\\N") at the start, we can just push individual characters as we go instead of bumping end_pos.
| assert_eq!(format_string.format_parts.len(), 1); | ||
| match &format_string.format_parts[0] { | ||
| FormatPart::Literal(literal) => { | ||
| assert_eq!(literal, "\\N{LATIN {SMALL} LETTER A}"); |
There was a problem hiding this comment.
Ah yeah, this is an error in CPython, so this is the case I think we should return an Err in.
>>> "\N{LATIN {SMALL} LETTER A}"
File "<python-input-2>", line 1
"\N{LATIN {SMALL} LETTER A}"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-15: unknown Unicode character name|
I don't think I'll have time anytime soon to review this in detail and I'm also not familiar with this code or the logic around it. @ntBre you're right to be wary. There have been plenty of bugs around f-string handling, https://github.com/astral-sh/ruff/issues?q=sort%3Aupdated-desc%20is%3Aissue%20state%3Aclosed%20UP032 |
| result_string.push_str("\\N"); | ||
| let rest = &cur_text[2..]; | ||
| let mut chars = rest.chars(); | ||
| if let Some('{') = chars.next() { |
There was a problem hiding this comment.
These seems guaranteed by the starts_with call on line 597`
| result_string.truncate(result_string.len() - 2); | ||
| result_string.push_str(cur_text); | ||
| cur_text = ""; |
There was a problem hiding this comment.
It feels a bit silly to update result_string only to undo the changes right after. Can we avoid that?
| result_string.push(ch); | ||
| } | ||
| None => { | ||
| result_string.push_str(chars.as_str()); |
There was a problem hiding this comment.
Isn't chars.as_str always guaranteed to be empty if chars.next() is None?
|
@danparizher
# ruff 0.12.8
print("\\N{snowman} {}".format(1)) # doesn't apply UP032# With the fix, UP032 will be applied
print("\\N{snowman} {}".format(1)) # -> print(f"\\N{snowman} {1}")
# ruff 0.12.8
print("\N{snowman} {snowman}".format(snowman=3)) # -> print(f"\N{snowman} {3}") # (1)# With the fix, here first {snowman} must also be substituted with {1}.
print("\\N{snowman} {snowman}".format(snowman=1)) # -> print(f"\\N{snowman} {1}") # (2)
|
|
Closing due to inactivity. I can re-open this PR if you plan to work on this. If anyone else is interested to work in this, feel free to submit a new PR that takes the above feedback into account. |
Summary
Fixes #19771