Add support for the new f-string tokens per PEP 701 #6659
dhruvmanila merged 9 commits into dhruv/pep-701
Conversation
@MichaReiser Thanks for your initial review even though the PR is still in draft :)
```rust
if kind.is_any_fstring() {
    return Ok(self.lex_fstring_start(quote, triple_quoted, kind.is_raw()));
}
```
I'm thinking of removing the `StringKind::FString` and `StringKind::RawFString` variants, as they're an invalid representation now that we'll be emitting different tokens for f-strings. Instead, I'm thinking of updating `lex_identifier` to directly check for `f` (and the related `F`, `r`, `R`) and call `lex_fstring_start` directly.
This will be done after the linter changes are complete.
(This is ready for review, but not to be merged)
MichaReiser left a comment:
This is impressive work. I left a few follow-up questions, and I want to wait to approve this PR until we have the first benchmarks in.
```rust
},
/// Token value for the start of an f-string. This includes the `f`/`F`/`fr` prefix
/// and the opening quote(s).
FStringStart,
```
We'll need to go through all logical line rules to make sure they handle f-strings correctly
```rust
parentheses_count: u32,
format_spec_count: u32,
```
Can we document these fields and why tracking them in the state is necessary? E.g., why do we need to track both `parentheses_count` and `format_spec_count` (should it be `format_spec_depth`?)? What are the different possible states?
There are 3 reasons to track these fields:

1. To check if we're in an f-string expression: `f"{<here>}"`
2. To check if we're in a format spec within an f-string expression: `f"{foo:<here>}"`
3. To check if the `:` starts a format spec or is part of, for example, a named expression (`:=`):
   - For `f"{x:=1}"`, the colon indicates the start of the format spec.
   - For `f"{(x:=1)}"`, the colon is part of the `:=` named expression because it is not at the same level of parentheses. Another example is `f"{x,{y:=1}}"`, where the colon is part of the `:=` named expression.

From PEP 701, on how to produce these new tokens:

> [..] This mode tokenizes as the "Regular Python tokenization" until a `:` or a `}` character is encountered with the same level of nesting as the opening bracket token that was pushed when we enter the f-string part. [..]

With this explanation, do you think `_depth` is a better suffix? A minimal sketch of that nesting rule follows below.
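To make point 3 concrete, here is a minimal, self-contained sketch of the PEP 701 nesting rule (hypothetical names, not the actual lexer code):

```rust
// Sketch of the PEP 701 rule: a `:` or `}` only terminates the expression
// part of an f-string when it appears at the same nesting level as the
// opening `{`; otherwise it belongs to the expression (e.g. a walrus `:=`).
struct ExprPart {
    nesting: u32, // open `(`/`[`/`{` seen since the f-string's `{`
}

impl ExprPart {
    fn on_char(&mut self, c: char) -> Option<&'static str> {
        match c {
            '(' | '[' | '{' => {
                self.nesting += 1;
                None
            }
            ')' | ']' | '}' if self.nesting > 0 => {
                self.nesting -= 1;
                None
            }
            ':' if self.nesting == 0 => Some("format spec starts here"),
            '}' => Some("expression part ends here"),
            _ => None,
        }
    }
}

fn main() {
    // In `f"{x:=1}"` the `:` is at nesting level 0, so it starts the format
    // spec; in `f"{(x:=1)}"` it sits inside `(...)`, so it's the walrus `:=`.
    for (label, expr) in [(r#"f"{x:=1}""#, "x:=1}"), (r#"f"{(x:=1)}""#, "(x:=1)}")] {
        let mut part = ExprPart { nesting: 0 };
        for c in expr.chars() {
            if let Some(what) = part.on_char(c) {
                println!("{label}: at {c:?}, {what}");
                break;
            }
        }
    }
}
```

This only illustrates the counting; the actual fields also need to distinguish being inside a format spec (hence `format_spec_count`), since a format spec can itself contain nested `{...}` replacement fields.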
```rust
// This is the main entry point. Call this function to retrieve the next token.
// This function is used by the iterator implementation.
pub fn next_token(&mut self) -> LexResult {
    if let Some(fstring_context) = self.fstring_stack.last() {
```
I dislike adding a stack lookup into the hot path, but I don't see a way to avoid it :(
Maybe we could utilize `State`? The problem here is that it'll get updated when inside an f-string expression, as it's not persistent.
```rust
if !fstring_context.is_in_expression()
    // Avoid lexing f-string middle/end if we're sure that this is
    // the start of a f-string expression i.e., `f"{foo}"` and not
    // `f"{{foo}}"`.
    && (self.cursor.first() != '{' || self.cursor.second() == '{')
    // Avoid lexing f-string middle/end if we're sure that this is
    // the end of a f-string expression. This is only for when
    // the `}` is after the format specifier i.e., `f"{foo:.3f}"`
    // because the `.3f` is lexed as `FStringMiddle` and thus not
    // in a f-string expression.
    && (!fstring_context.is_in_format_spec() || self.cursor.first() != '}')
{
```
I don't really like this that much; I think we're sacrificing readability a bit. Without this, the `lex_fstring_middle_or_end` function would return `Result<Option<Tok>, ...>` and this condition would be identified in the function loop itself.
I've reverted this change although I'm open to suggestions.
```rust
('f' | 'F', quote @ ('\'' | '"')) => {
    self.cursor.bump();
    return Ok(self.lex_fstring_start(quote, false));
}
('r' | 'R', 'f' | 'F') | ('f' | 'F', 'r' | 'R') if is_quote(self.cursor.second()) => {
    self.cursor.bump();
    let quote = self.cursor.bump().unwrap();
    return Ok(self.lex_fstring_start(quote, true));
}
```
This separation is because the `StringKind::FString` and `StringKind::RawFString` variants will be removed after all the linter changes.
```rust
};
Ok(Some(Tok::FStringMiddle {
    value,
    is_raw: context.is_raw_string,
```
This is required for the parser changes.
```rust
// When we are in an f-string, check whether the initial quote matches the
// f-string's quotes; if it does, then this must be a missing '}' token,
// so raise the proper error.
```
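A minimal sketch of that check (hypothetical helper; the real lexer does this inline while lexing the nested string):

```rust
// While lexing a string literal inside an f-string expression, a quote that
// matches the enclosing f-string's quote means the `}` was never written,
// e.g. `f"{foo"` — so report the missing `}` instead of an unterminated string.
fn is_missing_rbrace(fstring_quote: char, nested_quote: char) -> bool {
    nested_quote == fstring_quote
}

fn main() {
    // `f"{foo"`: the second `"` matches the f-string's quote -> missing `}`.
    assert!(is_missing_rbrace('"', '"'));
    // `f"{'foo'}"`: `'` differs from `"` -> a normal nested string literal.
    assert!(!is_missing_rbrace('"', '\''));
}
```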
## Summary
This PR adds support in the lexer for the newly added f-string tokens as
per PEP 701. The following new tokens are added:
* `FStringStart`: Token value for the start of an f-string. This
includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside
the f-string that's not part of the expression part and isn't an opening
or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes
the closing quote.
Additionally, a new `Exclamation` token is added for conversion
(`f"{foo!s}"`) as that's part of an expression.
## Test Plan
New test cases are added for various possibilities using snapshot
testing. The output has been verified using python/cpython@f2cc00527e.
## Benchmarks
_I've put the number of f-strings for each of the following files after
the file name_
```
lexer/large/dataset.py (1) 1.05 612.6±91.60µs 66.4 MB/sec 1.00 584.7±33.72µs 69.6 MB/sec
lexer/numpy/ctypeslib.py (0) 1.01 131.8±3.31µs 126.3 MB/sec 1.00 130.9±5.37µs 127.2 MB/sec
lexer/numpy/globals.py (1) 1.02 13.2±0.43µs 222.7 MB/sec 1.00 13.0±0.41µs 226.8 MB/sec
lexer/pydantic/types.py (8) 1.13 285.0±11.72µs 89.5 MB/sec 1.00 252.9±10.13µs 100.8 MB/sec
lexer/unicode/pypinyin.py (0) 1.03 32.9±1.92µs 127.5 MB/sec 1.00 31.8±1.25µs 132.0 MB/sec
```
Overall, the lexer seems to have regressed. I profiled every file
mentioned above and found one improvement, which was made in 098ee5d,
but otherwise I don't see anything else. A few notes from isolating the
f-string part in the profile:
* As we're adding new tokens and functionality to emit them, I expect
the lexer to take more time because of more code.
* `lex_fstring_middle_or_end` takes the most time, followed by the
`current_mut` line when lexing the `:` token. The latter checks whether
we're at the start of a format spec.
* In an f-string-heavy file such as
https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py
[^1] (293 f-strings), most of the time in `lex_fstring_middle_or_end` is
accounted for by the string allocation for the string literal part of the
`FStringMiddle` token (https://share.firefox.dev/3ErEa1W).
I don't see anything out of the ordinary in the `pydantic/types` profile
(https://share.firefox.dev/45XcLRq).
fixes: #7042
[^1]: We could add this file to the lexer and parser benchmarks.