Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 JSON Parser: Unterminated string literal can make inner_string_text panic #2357

Closed
1 task done
Sec-ant opened this issue Apr 8, 2024 · 7 comments · Fixed by #2621
Closed
1 task done

🐛 JSON Parser: Unterminated string literal can make inner_string_text panic #2357

Sec-ant opened this issue Apr 8, 2024 · 7 comments · Fixed by #2621
Assignees
Labels
A-Parser Area: parser L-JSON Language: JSON and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug S-Help-wanted Status: you're familiar with the code base and want to help the project

Comments

@Sec-ant
Copy link
Member

Sec-ant commented Apr 8, 2024

Environment information

Unrelated, the issue is reproducible when nursery/noDuplicateJsonKeys is enabled.

What happened?

Our JSON parser allows JSON_STRING_LITERAL to include an unterminated string literal:

let unterminated =
ParseDiagnostic::new("Missing closing quote", start..self.text_position())
.with_detail(self.position..self.position + 1, "line breaks here");
self.diagnostics.push(unterminated);
return JSON_STRING_LITERAL;

let unterminated =
ParseDiagnostic::new("Missing closing quote", start..self.text_position())
.with_detail(
self.source.text_len()..self.source.text_len(),
"file ends here",
);
self.diagnostics.push(unterminated);
JSON_STRING_LITERAL

This will make the function inner_string_text panic when the unterminated string literal is just a single doublequote ":

if token.kind() == JsonSyntaxKind::JSON_STRING_LITERAL {
// remove string delimiters
// SAFETY: string literal token have a delimiters at the start and the end of the string
let range = TextRange::new(1.into(), text.len() - TextSize::from(1));
text = text.slice(range);
}

The function inner_string_text is called in the JSON nursery rule noDuplicateJsonKeys:

let text = name.inner_string_text().ok()?;

And this nursery rule is enabled by default when debugging or in a nightly build. So when changing the VS Code extension lspBin to a local debug build, and when a JSON file containing content like this is scanned by the extension, Biome will panic and keep reloading until it's killed permanently:

{
  "a": """
}

The above erroneous JSON can appear quite frequently when using code completion so it will kill Biome often.

Expected result

Unterminated string literal in an AnyJsonValue place should be a JsonBogusValue instead of a JsonStringLiteral.

However I don't know how to treat it when the unterminated string literal appears in the JsonMemberName place. The JSON grammar may need changing. And I don't have enough knowledge to modify the grammar.

Code of Conduct

  • I agree to follow Biome's Code of Conduct
@Sec-ant Sec-ant added S-Help-wanted Status: you're familiar with the code base and want to help the project A-Parser Area: parser L-JSON Language: JSON and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug labels Apr 8, 2024
@Sec-ant Sec-ant changed the title 🐛 JSON Parser: Unterminated string literal will make inner_string_text panic 🐛 JSON Parser: Unterminated string literal can make inner_string_text panic Apr 8, 2024
@ematipico
Copy link
Member

Wow, that's surprising!

@Conaclos
Copy link
Member

Some ideas:

  • We could add a JsonBogusMemberName and AnyJsonMemberName = JsonMemberName | JsonBogusMemberName.
  • or, we could handle { "a } like a value with a missing key and colon. This could avoid introducing a new node.

Any opinion?

@Sec-ant
Copy link
Member Author

Sec-ant commented Apr 27, 2024

Some ideas:

  • We could add a JsonBogusMemberName and AnyJsonMemberName = JsonMemberName | JsonBogusMemberName.
  • or, we could handle { "a } like a value with a missing key and colon. This could avoid introducing a new node.

Any opinion?

I myself prefer the first option because I was expecting the parser to be strict and conformant to the spec, and leave ambiguous tokens as bogus nodes just as what Biome claims to do. But I also think the bogus part should be as small as possible so to not propagate the bogus state to the outer tree, and the parsing can be recoverable as soon as possible.

#2606 also reports other fail cases to trigger the error, that's also another reason why I think option 2 is not ideal because we'll have to evaluate all the possible fail cases and draw a line somewhere to decide whether something is unrecoverably malformed, which I think can be error-prone.

@Conaclos
Copy link
Member

Same there. The first approach seems better because I think it is closer to the user's intention.
I suggested the second approach to limit the number of bogus nodes: It is unpleasant to work with bogus nodes.

@Conaclos Conaclos self-assigned this Apr 27, 2024
@Conaclos
Copy link
Member

I have started to work on this.

However, I wonder: why is it actually wrong to treat an underlined string as a string?
We consider unterminated objects and arrays to be objects and arrays. Should unterminated string be different?

The lexer gives an error. Could we just change inner_string_text to handle unterminated strings?

@Sec-ant
Copy link
Member Author

Sec-ant commented Apr 27, 2024

However, I wonder: why is it actually wrong to treat an unterminated string as a string?

Maybe because it can cause som unstable or unexpected behavior?

I haven't try if this would be an issue, but an example I can think of is when the string is ended with a newline without the quote, it is regarded as an unterminated string, but a formatter will still format it and it may remove the new line, and that unterminated string may then become something else and so the tree may be changed accidentally?

@Conaclos
Copy link
Member

Conaclos commented Apr 27, 2024

I haven't try if this would be an issue, but an example I can think of is when the string is ended with a newline without the quote, it is regarded as an unterminated string, but a formatter will still format it and it may remove the new line, and that unterminated string may then become something else and so the tree may be changed accidentally?

I am not familiar with the JSON formatter. However, the newlime is part of the string. Thus, the formatter should not remove ir?

EDIT: Ok, I was wrong, the lexer doesn't include the newline in the string. See an example in the playground.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Parser Area: parser L-JSON Language: JSON and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug S-Help-wanted Status: you're familiar with the code base and want to help the project
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants