Delay literal unescaping #118699
Conversation
Force-pushed from 536b970 to 8f12cac
Rollup merge of rust-lang#118734 - nnethercote:literal-cleanups, r=fee1-dead: Unescaping cleanups. Minor improvements I found while working on rust-lang#118699. r? `@fee1-dead`
Force-pushed from 8f12cac to 9f7951b
Force-pushed from 9f7951b to 4314dff
Force-pushed from 4314dff to 13e9c76
Force-pushed from 13e9c76 to a748526
@rust-lang/lang: This PR proposes a backward-compatible language change. It changes some errors related to literals from syntactic to semantic.

Here is example code showing how things change. This covers some of the relevant errors, but not all; the missing ones would be affected in the same way.

```rust
fn main() {
    '';            //~ error: empty character literal
    b'ab';         //~ error: character literal may only contain one codepoint
    "\a";          //~ error: unknown character escape: `a`
    b"\xzz";       //~ error: invalid character in numeric character escape
    "\u20";        //~ error: incorrect unicode escape sequence
    c"\u{999999}"; //~ error: invalid unicode character escape
}

sink! {
    '';            // was an error, now allowed
    b'ab';         // was an error, now allowed
    "\a";          // was an error, now allowed
    b"\xzz";       // was an error, now allowed
    "\u20";        // was an error, now allowed
    c"\u{999999}"; // was an error, now allowed
};

#[cfg(FALSE)]
fn configured_out() {
    '';            // was an error, now allowed
    b'ab';         // was an error, now allowed
    "\a";          // was an error, now allowed
    b"\xzz";       // was an error, now allowed
    "\u20";        // was an error, now allowed
    c"\u{999999}"; // was an error, now allowed
}
```

This means a macro can assign meaning to arbitrary escape sequences. This change is consistent with a general trend of delayed literal checking.
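The `sink!` macro used in the snippet is not defined there. A minimal sketch of such a token-discarding macro (the name and shape are assumed for illustration, not taken from the PR) could be:

```rust
// A macro that matches any token stream and expands to nothing.
// Under this PR, literals passed to it would only need to lex;
// semantic unescaping checks would never run, because the tokens
// are discarded during expansion.
macro_rules! sink {
    ($($tt:tt)*) => {};
}

fn demo() -> &'static str {
    // Valid literals are used here so the example compiles on
    // current Rust; after this PR, invalid escapes would also be
    // accepted in this position.
    sink! { "valid" 123 'x' };
    "sink! discarded its tokens"
}

fn main() {
    println!("{}", demo());
}
```

A macro like this could instead capture the raw tokens (for example via `stringify!`) and assign its own meaning to otherwise-invalid escape sequences.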
cc @m-ou-se
Some parts of this are 💯 agreed. I do wonder, though, whether we want to leave some of the lexical space intentionally reserved here.

Loosening the lexer restricts the possibilities of new lexically-meaningful escape sequences in the future; we've discussed using parts of that space for new syntax before. TL/DR: I agree with delayed semantic checking, but my instinct is that restrictions expressible in the straightforward lexing regex are different from semantic checks.
For completeness, here is a program exhibiting all the relevant literal errors. The comments describe the current behaviour.

```rust
fn main() {
    // ---- SYNTACTIC (LEXING) ERRORS ----
    '';             // ZeroChars
    'ab';           // MoreThanOneChar
    "\x\\";         // InvalidCharInHexEscape + LoneSlash (a weird one)
    "\a";           // InvalidEscape
    "a^Mb";         // (literal CR) BareCarriageReturn
    r"a^Mb";        // (literal CR) BareCarriageReturnInRawString
    '	';           // (TAB) EscapeOnlyChar
    "\x1";          // TooShortHexEscape
    "\xz";          // InvalidCharInHexEscape
    "\xff";         // OutOfRangeHexEscape
    "\u1234";       // NoBraceInUnicodeEscape
    "\u{xyz}";      // InvalidCharInUnicodeEscape
    "\u{}";         // EmptyUnicodeEscape
    "\u{1234";      // UnclosedUnicodeEscape
    "\u{_123}";     // LeadingUnderscoreUnicodeEscape
    "\u{1234567}";  // OverlongUnicodeEscape
    "\u{dfff}";     // LoneSurrogateUnicodeEscape
    "\u{123456}";   // OutOfRangeUnicodeEscape
    b"\u{1234}";    // UnicodeEscapeInByte
    b"🦀";          // NonAsciiCharInByte

    // ---- SEMANTIC (DELAYED) ERRORS ----
    "abc"xyz;       // InvalidSuffix
    1xyz;           // InvalidIntSuffix
    1u20;           // InvalidIntSuffix
    1.0xyz;         // InvalidFloatSuffix
    1.0f20;         // InvalidFloatSuffix
    0b10f32;        // NonDecimalFloat
    0o10f32;        // NonDecimalFloat
    0x100000000000000000000000000000000; // IntTooLarge
    c"a \0 b";      // NulInCStr
}
```

This PR would make all of them delayed. I'm going to argue for one of two outcomes here.
I don't want to end up in a situation where some string literal errors are lexer-level and some are delayed.
Force-pushed from a748526 to f9e3632
I'm not convinced by this everywhere. If I look at https://spec.ferrocene.dev/lexical-elements.html#character-literals, for example, the core of the character-literal grammar seems entirely reasonable, and loosening it seems like a loss.

Whereas there are other rules where I do agree with this PR that having the lexer accept more, and checking the details later, would be fine.
It would simplify the lexical specification, at the cost of requiring an additional semantic check later. Overall that's less simple, IMO. Likewise, from the implementation POV, it's nice if all the char/byte/string literal checking is in one place, be it the lexer or HIR lowering. Having some checks in the lexer and others in HIR lowering is worse.
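For illustration, a single-pass unescaper that validates and constructs the result in one place might look like the sketch below. The function and error names are made up for the example; this is not rustc's actual `rustc_lexer::unescape` API.

```rust
// Illustrative single-pass unescaper: error checking and string
// construction happen together, rather than checking once in the
// lexer and constructing again later.
#[derive(Debug, PartialEq)]
enum EscapeError {
    LoneSlash,           // literal ends with a bare `\`
    InvalidEscape(char), // unknown character after `\`
}

fn unescape(src: &str) -> Result<String, EscapeError> {
    let mut out = String::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // Handle a small set of simple escapes; anything else errors.
        match chars.next() {
            None => return Err(EscapeError::LoneSlash),
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some('0') => out.push('\0'),
            Some(other) => return Err(EscapeError::InvalidEscape(other)),
        }
    }
    Ok(out)
}

fn main() {
    assert_eq!(unescape(r"a\nb"), Ok("a\nb".to_string()));
    assert_eq!(unescape(r"\a"), Err(EscapeError::InvalidEscape('a')));
}
```

The point of the sketch is the shape: whichever phase calls `unescape` gets both the value and the diagnostics, so the check cannot drift out of sync with the construction.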
Rollup merge of …, r=fee1-dead: More unescaping cleanups. More minor improvements I found while working on rust-lang#118699. r? `@fee1-dead`
Force-pushed from 1e49114 to 96e5189
By making it an `EscapeError` instead of a `LitError`. This makes it more like the other errors produced during unescaping.

NOTE: this means these errors are issued earlier, before expansion, which changes behaviour. The next commit will delay the issuing of this error and others, reverting the behaviour change for this particular error.

One nice thing about this: the old approach had some code in `report_lit_error` to calculate the span of the nul char from a range. This code used a hardwired `+2` to account for the `c"` at the start of a C string literal, but this should have been a `+3` for raw C string literals to account for the `cr"`, which meant that the caret in `cr"` nul error messages was one short of where it should have been. The new approach doesn't need any of this and avoids the off-by-one error.
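The off-by-one described here comes from hardwiring the prefix length. A sketch of deriving the offset from the literal text instead (a hypothetical helper, not the actual `report_lit_error` code):

```rust
// Compute the source offset of the literal's content by measuring
// the opening delimiter instead of hardwiring `+2`: the prefix of
// `c"..."` is 2 bytes, while `cr"..."` is 3.
fn lit_content_offset(literal: &str) -> usize {
    // Everything up to and including the first `"` is the prefix.
    literal.find('"').map(|i| i + 1).unwrap_or(0)
}

fn main() {
    assert_eq!(lit_content_offset(r#"c"a\0b""#), 2);    // plain C string
    assert_eq!(lit_content_offset(r##"cr"a\0b""##), 3); // raw C string
}
```

Measuring the actual prefix makes both forms fall out of the same code path, which is what eliminates the off-by-one.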
It's a more logical spot for it, and will be a big help for the next commit. Doing this creates a new dependency from `rustc_ast_lowering` on `rustc_parse`, but `rustc_ast_lowering` is clearly higher up the crate graph, so this isn't a big deal.

One thing in favour of this change is that two fluent labels were duplicated across `rustc_session` and `rustc_parse`: `invalid_literal_suffix` and `parse_not_supported`. This duplication is now gone, which is nice evidence that this is a reasonable change.
Currently string literals are unescaped twice.

- Once during lexing in `cook_quoted`/`cook_c_string`/`cook_common`. This one just checks for errors.
- Again in `LitKind::from_token_lit`, which is called when lowering AST to HIR, and also in a few other places during expansion. This one actually constructs the unescaped string. It also has error checking code, but that part of the code is actually dead (and has several bugs) because the check during lexing catches all errors!

This commit removes the error-check-only unescaping during lexing, and fixes up `LitKind::from_token_lit` so it properly does both checking and construction.

This is a backwards-compatible language change: some programs now compile that previously did not. For example, it is now possible for macros to consume "invalid" string literals like "\a\b\c". This is a continuation of a trend of delaying semantic error checking of literals to after expansion:

- rust-lang#102944 did this for some cases for numeric literals.
- The detection of NUL chars in C string literals is already delayed in this way.

Notes about test changes:

- `ignore-block-help.rs`: this requires a parse error for the test to work. The error used was an unescaping error, which is now delayed to after parsing. So the commit changes it to an "unterminated character literal" error which still occurs during parsing.
- `tests/ui/lexer/error-stage.rs`: this shows the newly allowed cases, due to delayed literal unescaping.
- Several tests had unescaping errors combined with unterminated literal errors. The former are now delayed but the latter remain as lexing errors. So the unterminated literal part needed to be split into a separate test file, otherwise compilation would end before the other errors were reported.
- `issue-62913.rs`: the structure and output changed a bit. Issue rust-lang#62913 was about an ICE due to an unterminated string literal, so the new version should be good enough.
- `literals-are-validated-before-expansion.rs`: this tests exactly the behaviour that has been changed, and so was removed.
- A couple of other tests produce the same errors, just in a different order.
Force-pushed from 96e5189 to ee19d52
@rustbot labels -I-lang-nominated

We discussed this in the T-lang triage meeting on 2023-12-20. There were parts of this that the members liked, and parts that gave us pause. Our impression was that @nnethercote would prefer to go fully in one direction or the other here. In our discussion, we were somewhat open to and curious about the direction of going the other way and rejecting syntactically more of those things that are semantically invalid. @nnethercote, what are your thoughts on the desirability and feasibility of this? Alternatively, if we were to go in the direction of relaxing the syntactic rules, we were interested in perhaps breaking these changes out into categories. @nnethercote, is this something you'd consider doing, and if so, what categories might you propose? Please renominate this for us with your answers.
☔ The latest upstream changes (presumably #119097) made this pull request unmergeable. Please resolve the merge conflicts.
I'm going to switch terminology from syntactic/semantic to lex-time/post-expansion, because I think the definition of what's syntactic vs. semantic is open to interpretation, and what's important is when the check occurs, not which conceptual category each check belongs to.

In general, you can't move post-expansion checks to lex-time, because that could break code that currently compiles. The one exception is NulInCStr, because that hasn't yet stabilized. I am advocating for moving that earlier in #119172 and this Zulip thread, at least in the short term, for consistency with all the other byte/char/string literal checks. That needs to be done before stabilization occurs.

Moving a check from lex-time to post-expansion can be done at any time, because it's a backwards compatible change.
Do you mean break them into categories? I don't particularly want to do that, because it would encourage a conclusion where some checks are done at lex-time and some post-expansion. (The attraction of this PR is that it moves all the checks to post-expansion, for maximum consistency.) But if you put a gun to my head I'd look at this comment above and come up with:
We do have an edition coming up. If we could do it, what things would we want to change to achieve consistency?
There's a category of artificial errors, like "this can compile just fine, but we don't want to accept it to avoid confusing human readers". If we want them reported at all, then we want them reported in the lexer, because human eyes look at code regardless of whether it's inside a macro call or cfg'd-out code.
I don't think moving all errors either to post-expansion or to the lexer is a goal in itself.
There's also a question of when a "not a literal" turns into a "literal with escaping problems".
Basically, I'm currently of the opinion that what we have now is fine, but we should probably remove some artificial errors.
I have a partial implementation of RFC3349. There is no PR yet. It's blocked on two things:
I know I wrote this PR, but I think I will close it. Why?
r? @ghost