Add support for the new f-string tokens per PEP 701 #6659
dhruvmanila merged 9 commits into dhruv/pep-701
Conversation
@MichaReiser Thanks for your initial review even though the PR is still in draft :)
```rust
if kind.is_any_fstring() {
    return Ok(self.lex_fstring_start(quote, triple_quoted, kind.is_raw()));
}
```
I'm thinking of removing the `StringKind::FString` and `StringKind::RawFString` variants, as they're an invalid representation now that we'll be emitting different tokens for f-strings. Instead, I'm thinking of updating `lex_identifier` to directly check for `f` (and the related `F`, `r`, `R`) and call `lex_fstring_start` directly.
This will be done after the linter changes are complete.
(This is ready for review, but not to be merged)
MichaReiser left a comment:
This is impressive work. I left a few follow-up questions, and I want to wait to approve this PR until we have the first benchmarks in.
```rust
},
/// Token value for the start of an f-string. This includes the `f`/`F`/`fr` prefix
/// and the opening quote(s).
FStringStart,
```
We'll need to go through all logical line rules to make sure they handle f-strings correctly
```rust
parentheses_count: u32,
format_spec_count: u32,
```
Can we document these fields and why tracking them in the state is necessary? E.g., why do we need to track both `parentheses_count` and `format_spec_count` (should it be `format_spec_depth`?)? What are the different possible states?
There are 3 reasons to track these fields:

1. To check if we're in an f-string expression: `f"{<here>}"`
2. To check if we're in a format spec within an f-string expression: `f"{foo:<here>}"`
3. To check if the `:` starts a format spec or is part of, for example, a named expression (`:=`):
   - For `f"{x:=1}"`, the colon indicates the start of the format spec.
   - For `f"{(x:=1)}"`, the colon is part of the `:=` named expression because it is not at the same level of parentheses. Another example is `f"{x,{y:=1}}"`, where the colon is part of the `:=` named expression.

From PEP 701, on how to produce these new tokens:

> [..] This mode tokenizes as the "Regular Python tokenization" until a `:` or a `}` character is encountered with the same level of nesting as the opening bracket token that was pushed when we enter the f-string part. [..]

With this explanation, do you think `_depth` is a better suffix? A minimal sketch of that nesting rule follows below.
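To make point 3 concrete, here is a minimal, self-contained sketch of the PEP 701 nesting rule (hypothetical names, not the actual lexer code):

```rust
// Sketch of the PEP 701 rule: a `:` or `}` only terminates the expression
// part of an f-string when it appears at the same nesting level as the
// opening `{`; otherwise it belongs to the expression (e.g. a walrus `:=`).
struct ExprPart {
    nesting: u32, // open `(`/`[`/`{` seen since the f-string's `{`
}

impl ExprPart {
    fn on_char(&mut self, c: char) -> Option<&'static str> {
        match c {
            '(' | '[' | '{' => {
                self.nesting += 1;
                None
            }
            ')' | ']' | '}' if self.nesting > 0 => {
                self.nesting -= 1;
                None
            }
            ':' if self.nesting == 0 => Some("format spec starts here"),
            '}' => Some("expression part ends here"),
            _ => None,
        }
    }
}

fn main() {
    // In `f"{x:=1}"` the `:` is at nesting level 0, so it starts the format
    // spec; in `f"{(x:=1)}"` it sits inside `(...)`, so it's the walrus `:=`.
    for (label, expr) in [(r#"f"{x:=1}""#, "x:=1}"), (r#"f"{(x:=1)}""#, "(x:=1)}")] {
        let mut part = ExprPart { nesting: 0 };
        for c in expr.chars() {
            if let Some(what) = part.on_char(c) {
                println!("{label}: at {c:?}, {what}");
                break;
            }
        }
    }
}
```

This only illustrates the counting; the actual fields also need to distinguish being inside a format spec (hence `format_spec_count`), since a format spec can itself contain nested `{...}` replacement fields.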
```rust
// This is the main entry point. Call this function to retrieve the next token.
// This function is used by the iterator implementation.
pub fn next_token(&mut self) -> LexResult {
    if let Some(fstring_context) = self.fstring_stack.last() {
```
I dislike adding a stack lookup into the hot path, but I don't see a way to avoid it :(
Maybe we could utilize `State`? The problem here is that it'll get updated when inside an f-string expression, as it's not persistent.
```rust
if !fstring_context.is_in_expression()
    // Avoid lexing f-string middle/end if we're sure that this is
    // the start of a f-string expression i.e., `f"{foo}"` and not
    // `f"{{foo}}"`.
    && (self.cursor.first() != '{' || self.cursor.second() == '{')
    // Avoid lexing f-string middle/end if we're sure that this is
    // the end of a f-string expression. This is only for when
    // the `}` is after the format specifier i.e., `f"{foo:.3f}"`
    // because the `.3f` is lexed as `FStringMiddle` and thus not
    // in a f-string expression.
    && (!fstring_context.is_in_format_spec() || self.cursor.first() != '}')
{
```
I don't really like this that much; I think we're sacrificing readability a bit. Without this, the `lex_fstring_middle_or_end` function would return `Result<Option<Tok>, ...>` and this condition would be identified in the function loop itself.
I've reverted this change although I'm open to suggestions.
```rust
('f' | 'F', quote @ ('\'' | '"')) => {
    self.cursor.bump();
    return Ok(self.lex_fstring_start(quote, false));
}
('r' | 'R', 'f' | 'F') | ('f' | 'F', 'r' | 'R') if is_quote(self.cursor.second()) => {
    self.cursor.bump();
    let quote = self.cursor.bump().unwrap();
    return Ok(self.lex_fstring_start(quote, true));
}
```
This separation is because the `StringKind::FString` and `StringKind::RawFString` variants will be removed after all the linter changes.
```rust
};
Ok(Some(Tok::FStringMiddle {
    value,
    is_raw: context.is_raw_string,
```
This is required for the parser changes.
```rust
// When we are in an f-string, check whether the initial quote matches the
// f-string's quotes; if it does, then this must be a missing '}' token,
// so raise the proper error.
```
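A minimal sketch of that check (hypothetical helper; the real lexer does this inline while lexing the nested string):

```rust
// While lexing a string literal inside an f-string expression, a quote that
// matches the enclosing f-string's quote means the `}` was never written,
// e.g. `f"{foo"` — so report the missing `}` instead of an unterminated string.
fn is_missing_rbrace(fstring_quote: char, nested_quote: char) -> bool {
    nested_quote == fstring_quote
}

fn main() {
    // `f"{foo"`: the second `"` matches the f-string's quote -> missing `}`.
    assert!(is_missing_rbrace('"', '"'));
    // `f"{'foo'}"`: `'` differs from `"` -> a normal nested string literal.
    assert!(!is_missing_rbrace('"', '\''));
}
```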
## Summary
This PR adds support in the lexer for the newly added f-string tokens as
per PEP 701. The following new tokens are added:
* `FStringStart`: Token value for the start of an f-string. This
includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside
the f-string that's not part of the expression part and isn't an opening
or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes
the closing quote.
Additionally, a new `Exclamation` token is added for conversion
(`f"{foo!s}"`) as that's part of an expression.
## Test Plan
New test cases are added for various possibilities using snapshot
testing. The output has been verified using python/cpython@f2cc00527e.
## Benchmarks
_I've put the number of f-strings for each of the following files after
the file name_
```
lexer/large/dataset.py (1) 1.05 612.6±91.60µs 66.4 MB/sec 1.00 584.7±33.72µs 69.6 MB/sec
lexer/numpy/ctypeslib.py (0) 1.01 131.8±3.31µs 126.3 MB/sec 1.00 130.9±5.37µs 127.2 MB/sec
lexer/numpy/globals.py (1) 1.02 13.2±0.43µs 222.7 MB/sec 1.00 13.0±0.41µs 226.8 MB/sec
lexer/pydantic/types.py (8) 1.13 285.0±11.72µs 89.5 MB/sec 1.00 252.9±10.13µs 100.8 MB/sec
lexer/unicode/pypinyin.py (0) 1.03 32.9±1.92µs 127.5 MB/sec 1.00 31.8±1.25µs 132.0 MB/sec
```
Overall, the lexer seems to have regressed. I profiled every file
mentioned above and found one improvement, which was made in 098ee5d,
but otherwise I don't see anything else. A few notes from isolating the
f-string part in the profile:
* As we're adding new tokens and functionality to emit them, I expect
the lexer to take more time because of more code.
* `lex_fstring_middle_or_end` takes the most time, followed by the
`current_mut` line when lexing the `:` token. The latter checks whether
we're at the start of a format spec.
* In an f-string-heavy file such as
https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py
[^1] (293 f-strings), most of the time in `lex_fstring_middle_or_end` is
accounted for by the string allocation for the string literal part of the
`FStringMiddle` token (https://share.firefox.dev/3ErEa1W).
I don't see anything out of the ordinary in the `pydantic/types` profile
(https://share.firefox.dev/45XcLRq).
fixes: #7042
[^1]: We could add this file to the lexer and parser benchmarks.