Add a strict parsing mode to the JsonReader and improve its javadoc #1609

marcus-h · 2019-11-06T17:07:42Z

By default, the JsonReader accepts unescaped control characters that
are embedded in a string or name (which is technically just a string).
According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden.
From RFC 8259 Section 7 "Strings":

"All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F)."

Accepting unescaped control characters can at least cause some confusion,
as the following naive program demonstrates:

marcus@linux:~> cat NaiveJsonProcessor.java
import java.io.IOException;
import java.io.StringReader;

import com.google.gson.stream.JsonReader;

public class NaiveJsonProcessor {
  public static void parseAndLog(String jsonInput) throws IOException {
    JsonReader reader = new JsonReader(new StringReader(jsonInput));
    String parsed = reader.nextString();
    if (parsed.equals("foo")) {
        throw new IllegalStateException("foo is forbidden");
    }
    /*
     * According to the JsonReader's documentation "[...] this parser
     * is strict and only accepts JSON as specified by RFC 4627" (see
     * documentation of setLenient). Hence, we can safely log the
     * raw jsonInput to stdout because it contains no unescaped control
     * characters, which could be interpreted by a terminal.
     * Oops... wrong assumption:)
     */
    System.out.println("Processed: " + jsonInput);
  }

  public static void main(String[] args) {
    String jsonInput = "\"foobar\u001b[3D\u001b[K\"";
    try {
        // the log entry might confuse the user...
        parseAndLog(jsonInput);
    } catch (IOException e) {
        e.printStackTrace();
    }
  }
}
marcus@linux:~> javac -cp /path/to/gson/classes NaiveJsonProcessor.java
marcus@linux:~> java -cp /path/to/gson/classes:. NaiveJsonProcessor
Processed: "foo"
marcus@linux:~>

Since the unescaped control characters of the raw jsonInput are interpreted
by the terminal, it looks as if we processed the JSON text "foo" even
though this string should result in an IllegalStateException (of course in
reality we did not process "foo").

This PR fixes this by adding an optional strict mode to the JsonReader.
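
Independently of the parser's mode, a caller that logs raw input could make control characters visible before printing. The following is only a minimal sketch of that idea; `LogSanitizer` and `escapeControlChars` are hypothetical names used for illustration and are not part of Gson or of this PR.

```java
final class LogSanitizer {
  private LogSanitizer() {}

  /**
   * Returns a copy of raw in which every control character in the range
   * U+0000 through U+001F is replaced by a visible six-character escape.
   * Other characters (including DEL and the C1 controls) are left as-is.
   */
  static String escapeControlChars(String raw) {
    StringBuilder out = new StringBuilder(raw.length());
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      if (c < 0x20) {
        // Emit a backslash, the letter u, and four lowercase hex digits.
        out.append(String.format("\\u%04x", (int) c));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Same input as in NaiveJsonProcessor: two ANSI escape sequences
    // follow the text "foobar".
    String jsonInput = "\"foobar\u001b[3D\u001b[K\"";
    System.out.println("Processed: " + escapeControlChars(jsonInput));
  }
}
```

With this input, the ESC characters are printed as the literal text `\u001b` instead of being interpreted by the terminal, so the log line shows the full `foobar` value.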

If preferred, I can also split commit 473a9b3
("Make JsonReader tests reusable in the com.google.gson.stream package") into two commits.

Make the JsonReader tests reusable so that we can run the same set of tests with differently configured JsonReaders. Note that this change does not expose a new API that could be subclassed by third-party classes, because the AbstractJsonReaderTest has package-private visibility and the JsonReaderTest is final.
Signed-off-by: Marcus Huewe <[email protected]>

Document the current behavior of the JsonReader if it encounters an unescaped control character in a string or a name. The JsonReader accepts unescaped control characters in a string or a name even though they are forbidden according to RFC 8259, RFC 7159, and RFC 4627.
Signed-off-by: Marcus Huewe <[email protected]>

googlebot · 2019-11-06T17:07:48Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

marcus-h · 2019-11-06T17:34:38Z

@googlebot I signed it!

googlebot · 2019-11-06T17:34:48Z

CLAs look good, thanks!

Marcono1234 · 2020-06-16T23:47:50Z

gson/src/main/java/com/google/gson/stream/JsonReader.java

+ *
+ * <p>In contrast to the strict mode, the semi-strict mode allows non-lowercase
+ * literals (like TRUE, fAlSe, NUlL etc.) and unescaped control characters in
+ * strings (and names). For the latter, consider the two java strings


Suggested change

* strings (and names). For the latter, consider the two java strings

* strings (and names). For the latter, consider the two Java strings

Fixed. (I just did a forced push.)

Marcono1234 · 2020-06-16T23:58:27Z

gson/src/main/java/com/google/gson/stream/JsonReader.java

-   *   <li>Top-level values of any type. With strict parsing, the top-level
-   *       value must be an object or an array.


Why has this been removed? Was it incorrect, and does the current non-lenient mode also allow top-level values other than an object or an array?

Yes, other top-level elements like a number, boolean etc. were already supported (which
conforms to RFC 8259) in the non-lenient mode.

Marcono1234 · 2020-06-17T00:00:12Z

gson/src/main/java/com/google/gson/stream/JsonReader.java

-   * this parser is strict and only accepts JSON as specified by <a
-   * href="http://www.ietf.org/rfc/rfc4627.txt">RFC 4627</a>. Setting the
+   * this parser is semi-strict and only accepts JSON as specified by <a
+   * href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>. Setting the


This is slightly misleading because it is more lenient (="semi-strict") than RFC 8259, though this might not matter much here.

Ugh. Good catch! I removed the "only" part. Now, it says that it "accepts RFC 8259
(including some slight variations)".
I just did a forced push.

Thanks a lot for your (initial) review! Much appreciated!

By default, the JsonReader accepts unescaped control characters that are embedded in a string or name (which is technically just a string). According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden. From RFC 8259 Section 7 "Strings":

"All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)."

Accepting unescaped control characters can at least cause some confusion, as the NaiveJsonProcessor program in the pull request description above demonstrates. Since the unescaped control characters of the raw jsonInput are interpreted by the terminal, it _looks_ as if we processed the JSON text "foo" even though this string should result in an IllegalStateException (of course in reality we did _not_ process "foo").

Apart from this, the JsonReader accepts non-lowercase literals (like tRuE, falSE, NULl). According to the previously mentioned RFCs, this is forbidden. From RFC 8259 Section 3 "Values":

"[...] or one of the following three literal names:

   false
   null
   true

The literal names MUST be lowercase."

To cope with this, a strict mode is added to the JsonReader. In strict mode, the JsonReader does not accept unescaped control characters in strings and names. For this, the JsonReader raises an exception if it encounters an unescaped control character in nextQuotedValue and skipQuotedValue. Also, it does not accept non-lowercase literals. For this, peekKeyword raises an exception if a non-lowercase literal is encountered.

In order to avoid regressions, the strict mode is disabled by default and the old behavior is retained. In strict mode, the JsonReader behaves exactly as before (except in case of an unescaped control character or non-lowercase literal, of course). For the details, see the new JsonReaderStrictTest testcase.

The javadoc of the JsonReader is updated accordingly. As part of this update, all references to a JSON RFC are changed to RFC 8259 (that's what the JsonReader conforms to in strict mode).

Signed-off-by: Marcus Huewe <[email protected]>
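
The two checks described in this commit message can be illustrated without Gson. The sketch below is not the code from this PR and does not reproduce nextQuotedValue, skipQuotedValue, or peekKeyword; `StrictModeSketch` and its method names are assumptions made up for illustration. It only shows the difference between case-insensitive and exact literal matching, and how a raw control character inside a string body can be located.

```java
final class StrictModeSketch {
  private StrictModeSketch() {}

  /** Case-insensitive match, mirroring the lenient behavior described above. */
  static boolean isLiteralIgnoringCase(String token) {
    return token.equalsIgnoreCase("true")
        || token.equalsIgnoreCase("false")
        || token.equalsIgnoreCase("null");
  }

  /** Exact match: RFC 8259 requires the three literal names to be lowercase. */
  static boolean isStrictLiteral(String token) {
    return token.equals("true") || token.equals("false") || token.equals("null");
  }

  /**
   * Returns the index of the first raw control character (U+0000 through
   * U+001F) in the body of a JSON string (the text between the quotation
   * marks), or -1 if there is none. A properly escaped control character is
   * written with a backslash and printable characters, so it never matches.
   */
  static int indexOfRawControlChar(String stringBody) {
    for (int i = 0; i < stringBody.length(); i++) {
      if (stringBody.charAt(i) < 0x20) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(isLiteralIgnoringCase("tRuE")); // accepted when case is ignored
    System.out.println(isStrictLiteral("tRuE"));       // rejected by the exact match
    System.out.println(indexOfRawControlChar("foobar\u001b[3D"));
  }
}
```

A strict reader would raise an exception where this sketch returns `false` or a non-negative index.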

Marcono1234

You might have to adjust JsonReader.readEscapeCharacter() for strict mode as well. Currently it accepts:

\'
\LF

Neither is allowed by RFC 8259.
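
For reference, RFC 8259 only permits a backslash to be followed by `"`, `\`, `/`, `b`, `f`, `n`, `r`, `t`, or `u` plus four hexadecimal digits. A minimal sketch of such a check, independent of Gson (the class `EscapeCheckSketch` and its methods are hypothetical and do not mirror readEscapeCharacter()):

```java
final class EscapeCheckSketch {
  private EscapeCheckSketch() {}

  private static boolean isAsciiHexDigit(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
  }

  /**
   * Scans the body of a JSON string (the text between the quotation marks)
   * and returns the index of the first backslash that does not start an
   * escape sequence permitted by RFC 8259, or -1 if all escapes are valid.
   */
  static int indexOfInvalidEscape(String body) {
    for (int i = 0; i < body.length(); i++) {
      if (body.charAt(i) != '\\') {
        continue;
      }
      if (i + 1 >= body.length()) {
        return i; // dangling backslash at the end of the body
      }
      char next = body.charAt(i + 1);
      if (next == 'u') {
        // A unicode escape needs exactly four hex digits after the 'u'.
        if (i + 6 > body.length()) {
          return i;
        }
        for (int j = i + 2; j < i + 6; j++) {
          if (!isAsciiHexDigit(body.charAt(j))) {
            return i;
          }
        }
        i += 5; // skip the rest of the escape
      } else if ("\"\\/bfnrt".indexOf(next) >= 0) {
        i += 1; // skip the escaped character
      } else {
        return i; // e.g. backslash followed by ' or by a raw line feed
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(indexOfInvalidEscape("it\\'s"));  // backslash + apostrophe
    System.out.println(indexOfInvalidEscape("a\\\nb"));  // backslash + raw line feed
    System.out.println(indexOfInvalidEscape("a\\nb"));   // valid escape
  }
}
```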

Also maybe it would be useful to cross-link in the javadoc between the corresponding methods, e.g.

setStrict has @see #isStrict()
isStrict has @see #setStrict(boolean)

Though the existing documentation sadly does not really do that either.
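
A minimal sketch of what such cross-links could look like. `CrossLinkedReader` is a hypothetical stand-in class used only to show the `@see` tags; the wording of the javadoc is an assumption and is not taken from the PR.

```java
/** Hypothetical stand-in for a reader with a strict flag; not Gson code. */
final class CrossLinkedReader {
  private boolean strict = false;

  /**
   * Configures whether this reader rejects input that RFC 8259 forbids.
   *
   * @param strict true to enable strict parsing
   * @see #isStrict()
   */
  void setStrict(boolean strict) {
    this.strict = strict;
  }

  /**
   * Returns true if strict parsing is enabled.
   *
   * @see #setStrict(boolean)
   */
  boolean isStrict() {
    return strict;
  }
}
```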

Marcono1234 · 2023-08-04T22:52:38Z

#2437 has been merged, which adds a new Strictness mode API. This should hopefully address the issues here. In the meantime, the documentation has also been adjusted to be more explicit about Gson not being strict by default and about how to enable strict parsing.

I will therefore close this pull request, but thanks nonetheless for your work on this! If you notice anything which is missing in the current implementation or think the documentation can be improved, then any pull request for this is appreciated.

marcus-h added 2 commits November 6, 2019 17:45

googlebot added the cla: no label Nov 6, 2019

googlebot added cla: yes and removed cla: no labels Nov 6, 2019

marcus-h mentioned this pull request Jun 16, 2020

Consistently refer to RFC 7159 #1710

Closed

Marcono1234 reviewed Jun 17, 2020

marcus-h force-pushed the strict branch from 69c9658 to 2ab2763 Compare June 17, 2020 11:29

Marcono1234 reviewed Aug 23, 2020

Marcono1234 mentioned this pull request Nov 6, 2021

Make it possible to allow comments in JSON #212

Closed

Marcono1234 mentioned this pull request May 22, 2022

Improve lenient mode documentation #2122

Merged

Marcono1234 closed this Aug 4, 2023
