Skip to content

Comments

[pyupgrade] Fix Unicode escape sequence handling in format string parsing#19774

Closed
danparizher wants to merge 3 commits intoastral-sh:mainfrom
danparizher:fix-19771
Closed

[pyupgrade] Fix Unicode escape sequence handling in format string parsing#19774
danparizher wants to merge 3 commits intoastral-sh:mainfrom
danparizher:fix-19771

Conversation

@danparizher
Copy link
Contributor

Summary

Fixes #19771

@github-actions
Copy link
Contributor

github-actions bot commented Aug 5, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

@ntBre ntBre added bug Something isn't working fixes Related to suggested fixes for violations labels Aug 6, 2025
@ntBre ntBre self-requested a review August 6, 2025 00:58
Copy link
Contributor

@ntBre ntBre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this!

I'm a bit wary of changing this parsing code, but I had a few nits if we do go this route. We'd probably also want @MichaReiser to take a look here.

This code isn't super widely used, only occurring in ~3 rules or so, so it's not as foundational as I thought initially, but we should still be careful.

Comment on lines 601 to 603
let chars: Vec<char> = cur_text.chars().collect();

for (i, &c) in chars.iter().enumerate().skip(3) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to collect here:

Suggested change
let chars: Vec<char> = cur_text.chars().collect();
for (i, &c) in chars.iter().enumerate().skip(3) {
for (i, c) in cur_text.chars().iter().enumerate().skip(3) {

Comment on lines 604 to 606
if c == '{' {
brace_count += 1;
} else if c == '}' {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to count the braces, but we need to return some kind of error if we encounter another open brace, based on a quick test in Python. Maybe there's a way to make this work, but it seems to me that braces aren't allowed inside of a \N{...} escape:

>>> "\N{{angle}}".format(angle="angle")
  File "<python-input-0>", line 1
    "\N{{angle}}".format(angle="angle")
    ^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: unknown Unicode character name

if brace_count == 0 {
// Add the entire Unicode escape sequence as literal text
result_string.push_str(&cur_text[..end_pos]);
cur_text = &cur_text[end_pos..];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we push_str("\\N") at the start, we can just push individual characters as we go instead of bumping end_pos.

assert_eq!(format_string.format_parts.len(), 1);
match &format_string.format_parts[0] {
FormatPart::Literal(literal) => {
assert_eq!(literal, "\\N{LATIN {SMALL} LETTER A}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, this is an error in CPython, so this is the case I think we should return an Err in.

>>> "\N{LATIN {SMALL} LETTER A}"
  File "<python-input-2>", line 1
    "\N{LATIN {SMALL} LETTER A}"
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-15: unknown Unicode character name

@danparizher danparizher requested a review from ntBre August 7, 2025 19:52
@MichaReiser
Copy link
Member

I don't think I'll have time anytime soon to review this in detail and I'm also not familiar with this code or the logic around it. @ntBre you're right to be wary. There have been plenty of bugs around f-string handling, https://github.com/astral-sh/ruff/issues?q=sort%3Aupdated-desc%20is%3Aissue%20state%3Aclosed%20UP032

result_string.push_str("\\N");
let rest = &cur_text[2..];
let mut chars = rest.chars();
if let Some('{') = chars.next() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seems guaranteed by the starts_with call on line 597`

Comment on lines 604 to 606
result_string.truncate(result_string.len() - 2);
result_string.push_str(cur_text);
cur_text = "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit silly to update result_string only to undo the changes right after. Can we avoid that?

result_string.push(ch);
}
None => {
result_string.push_str(chars.as_str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't chars.as_str always guaranteed to be empty if chars.next() is None?

@IDrokin117
Copy link
Contributor

@danparizher
I've been looking into the same problem while working on the issue #19792
Please examine the potential problems this PR could cause

  1. Consider the example below. ruff 0.12.8 doesn't fix it due to missing format keyword snowman (playground). However, with the proposed changes, the format will be replaced by f-string. I suppose it is wrong behavior.
# ruff 0.12.8 
print("\\N{snowman} {}".format(1)) # doesn't apply UP032
# With the fix, UP032 will be applied
print("\\N{snowman} {}".format(1)) # -> print(f"\\N{snowman} {1}")
  1. While the fix will resolve the first (1) case, it still has a problem with the second (2) case.
# ruff 0.12.8 
print("\N{snowman} {snowman}".format(snowman=3)) # -> print(f"\N{snowman} {3}") # (1)
# With the fix, here first {snowman} must also be substituted with {1}.
print("\\N{snowman} {snowman}".format(snowman=1)) # -> print(f"\\N{snowman} {1}") # (2)


Here
you can find my solution that correctly addresses the cases mentioned above. Feel free to reuse it in your PR if it makes sense.

@MichaReiser
Copy link
Member

Closing due to inactivity. I can re-open this PR if you plan to work on this. If anyone else is interested to work in this, feel free to submit a new PR that takes the above feedback into account.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fixes Related to suggested fixes for violations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UP032 fix misinterprets \N escape sequence as interpolation

4 participants