Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a strict parsing mode to the JsonReader and improve its javadoc #1609

Closed
wants to merge 3 commits into from

Conversation

marcus-h
Copy link

@marcus-h marcus-h commented Nov 6, 2019

By default, the JsonReader accepts unescaped control characters that
are embedded in a string or name (which is technically just a string).
According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden.
From RFC 8259 Section 7 "Strings":

"All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F)."

Accepting unescaped control characters can at least cause some confusion
as the following naive program demonstrates:

marcus@linux:~> cat NaiveJsonProcessor.java
import java.io.IOException;
import java.io.StringReader;

import com.google.gson.stream.JsonReader;

public class NaiveJsonProcessor {
  public static void parseAndLog(String jsonInput) throws IOException {
    JsonReader reader = new JsonReader(new StringReader(jsonInput));
    String parsed = reader.nextString();
    if (parsed.equals("foo")) {
        throw new IllegalStateException("foo is forbidden");
    }
    /*
     * According to the JsonReader's documentation "[...] this parser
     * is strict and only accepts JSON as specified by RFC 4627" (see
     * documentation of setLenient). Hence, we can safely log the
     * raw jsonInput to stdout because it contains no unescaped control
     * characters, which could be interpreted by a terminal.
     * Oops... wrong assumption:)
     */
    System.out.println("Processed: " + jsonInput);
  }

  public static void main(String[] args) {
    String jsonInput = "\"foobar\u001b[3D\u001b[K\"";
    try {
        // the log entry might confuse the user...
        parseAndLog(jsonInput);
    } catch (IOException e) {
        e.printStackTrace();
    }
  }
}
marcus@linux:~> javac -cp /path/to/gson/classes NaiveJsonProcessor.java
marcus@linux:~> java -cp /path/to/gson/classes:. NaiveJsonProcessor
Processed: "foo"
marcus@linux:~>

Since the unescaped control characters of the raw jsonInput are interpreted
by the terminal, it looks as if we processed the JSON text "foo" even
though this string should result in an IllegalStateException (of course in
reality we did not process "foo").

This PR fixes this by adding an optional strict mode to the JsonReader.

If preferred, I can also split commit 473a9b3
("Make JsonReader tests reusable in the com.google.gson.stream package") into two commits.

Make the JsonReader tests reusable so that we can run the same set of tests
with differently configured JsonReaders.
Note that this change does not expose a new API, which could be subclassed
by 3rd party classes, because the AbstractJsonReaderTest has package-private
visibility and the JsonReaderTest is final.

Signed-off-by: Marcus Huewe <[email protected]>
Document the current behavior of the JsonReader if it encounters an
unescaped control character in a string or a name. The JsonReader
accepts unescaped control characters in a string or a name even though
they are forbidden according to RFC 8259, RFC 7159, and RFC 4627.

Signed-off-by: Marcus Huewe <[email protected]>
@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@marcus-h
Copy link
Author

marcus-h commented Nov 6, 2019

@googlebot I signed it!

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

*
* <p>In contrast to the strict mode, the semi-strict mode allows non-lowercase
* literals (like TRUE, fAlSe, NUlL etc.) and unescaped control characters in
* strings (and names). For the latter, consider the two java strings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* strings (and names). For the latter, consider the two java strings
* strings (and names). For the latter, consider the two Java strings

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. (I just did a forced push.)

Comment on lines -306 to -307
* <li>Top-level values of any type. With strict parsing, the top-level
* value must be an object or an array.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this been removed? Is this incorrect and current non-lenient mode also allows non-object or array top-level values?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, other top-level elements like a number, boolean etc. were already supported (which
conforms to RFC 8259) in the non-lenient mode.

* this parser is strict and only accepts JSON as specified by <a
* href="http://www.ietf.org/rfc/rfc4627.txt">RFC 4627</a>. Setting the
* this parser is semi-strict and only accepts JSON as specified by <a
* href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>. Setting the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly misleading because it is more lenient (="semi-strict") than RFC 8259, though this might not matter here much

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. Good catch! I removed the "only" part. Now, it says that it "accepts RFC 8259
(including some slight variations)".
I just did a forced push.

Thanks a lot for your (initial) review! Much appreciated!

By default, the JsonReader accepts unescaped control characters that
are embedded in a string or name (which is technically just a string).
According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden.
From RFC 8259 Section 7 "Strings":

"All Unicode characters may be placed within the
 quotation marks, except for the characters that MUST be escaped:
 quotation mark, reverse solidus, and the control characters (U+0000
 through U+001F)."

Accepting unescaped control characters can at least cause some confusion
as the following naive program demonstrates:

marcus@linux:~> cat NaiveJsonProcessor.java
import java.io.IOException;
import java.io.StringReader;

import com.google.gson.stream.JsonReader;

public class NaiveJsonProcessor {
  public static void parseAndLog(String jsonInput) throws IOException {
    JsonReader reader = new JsonReader(new StringReader(jsonInput));
    String parsed = reader.nextString();
    if (parsed.equals("foo")) {
        throw new IllegalStateException("foo is forbidden");
    }
    /*
     * According to the JsonReader's documentation "[...] this parser
     * is strict and only accepts JSON as specified by RFC 4627" (see
     * documentation of setLenient). Hence, we can safely log the
     * raw jsonInput to stdout because it contains no unescaped control
     * characters, which could be interpreted by a terminal.
     * Oops... wrong assumption:)
     */
    System.out.println("Processed: " + jsonInput);
  }

  public static void main(String[] args) {
    String jsonInput = "\"foobar\u001b[3D\u001b[K\"";
    try {
        // the log entry might confuse the user...
        parseAndLog(jsonInput);
    } catch (IOException e) {
        e.printStackTrace();
    }
  }
}
marcus@linux:~> javac -cp /path/to/gson/classes NaiveJsonProcessor.java
marcus@linux:~> java -cp /path/to/gson/classes:. NaiveJsonProcessor
Processed: "foo"
marcus@linux:~>

Since the unescaped control characters of the raw jsonInput are interpreted
by the terminal, it _looks_ as if we processed the JSON text "foo" even
though this string should result in an IllegalStateException (of course in
reality we did _not_ process "foo").

Apart from this, the JsonReader accepts non-lowercase literals (like tRuE,
falSE, NULl). According to the previously mentioned RFCs, this is forbidden.
From RFC 8259 Section 3 "Values":

"[...]or one of
 the following three literal names:

    false
    null
    true

 The literal names MUST be lowercase."

To cope with this a strict mode is added to the JsonReader. In strict mode,
the JsonReader does not accept unescaped control characters in strings and
names. For this, the JsonReader raises an exception if it encounters an
unescaped control character in nextQuotedValue and skipQuotedValue.
Also, it does not accept non-lowercase literals. For this, peekKeyword
raises an exception if a non-lowercase literal is encountered.

In order to avoid regressions, the strict mode is disabled by default and
the old behavior is retained. In strict mode, the JsonReader behaves exactly
as before (except in case of an unescaped control character or non-lowercase
literal, of course). For the details, see the new JsonReaderStrictTest
testcase.

The javadoc of the JsonReader is updated accordingly. As part of this update,
all references to a JSON RFC are changed to RFC 8259 (that's what the
JsonReader conforms to (in strict mode)).

Signed-off-by: Marcus Huewe <[email protected]>
Copy link
Collaborator

@Marcono1234 Marcono1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have to adjust JsonReader.readEscapeCharacter() for strict mode as well. Currently it accepts:

  • \'
  • \LF

Both are not allowed by RFC 8259.

Also maybe it would be useful to cross-link in the javadoc between the corresponding methods, e.g.

  • setStrict has @see #isStrict()
  • isStrict has @see #setStrict(boolean)

Though the existing documentation sadly does not really do that either.

@Marcono1234
Copy link
Collaborator

#2437 was been merged which adds a new Strictness mode API. This should hopefully address the issues here, and also in the meantime the documentation was adjusted to be more explicit about Gson not being strict by default and how to enable strict parsing.

I will therefore close this pull request, but thanks nonetheless for your work on this! If you notice anything which is missing in the current implementation or think the documentation can be improved, then any pull request for this is appreciated.

@Marcono1234 Marcono1234 closed this Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants