-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a strict parsing mode to the JsonReader and improve its javadoc #1609
Conversation
Make the JsonReader tests reusable so that we can run the same set of tests with differently configured JsonReaders. Note that this change does not expose a new API, which could be subclassed by 3rd party classes, because the AbstractJsonReaderTest has package-private visibility and the JsonReaderTest is final. Signed-off-by: Marcus Huewe <[email protected]>
Document the current behavior of the JsonReader if it encounters an unescaped control character in a string or a name. The JsonReader accepts unescaped control characters in a string or a name even though they are forbidden according to RFC 8259, RFC 7159, and RFC 4627. Signed-off-by: Marcus Huewe <[email protected]>
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
@googlebot I signed it! |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
* | ||
* <p>In contrast to the strict mode, the semi-strict mode allows non-lowercase | ||
* literals (like TRUE, fAlSe, NUlL etc.) and unescaped control characters in | ||
* strings (and names). For the latter, consider the two java strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* strings (and names). For the latter, consider the two java strings | |
* strings (and names). For the latter, consider the two Java strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. (I just did a forced push.)
* <li>Top-level values of any type. With strict parsing, the top-level | ||
* value must be an object or an array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this been removed? Is this incorrect and current non-lenient mode also allows non-object or array top-level values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, other top-level elements like a number, boolean etc. were already supported (which
conforms to RFC 8259) in the non-lenient mode.
* this parser is strict and only accepts JSON as specified by <a | ||
* href="http://www.ietf.org/rfc/rfc4627.txt">RFC 4627</a>. Setting the | ||
* this parser is semi-strict and only accepts JSON as specified by <a | ||
* href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>. Setting the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly misleading because it is more lenient (="semi-strict") than RFC 8259, though this might not matter here much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. Good catch! I removed the "only" part. Now, it says that it "accepts RFC 8259
(including some slight variations)".
I just did a forced push.
Thanks a lot for your (initial) review! Much appreciated!
By default, the JsonReader accepts unescaped control characters that are embedded in a string or name (which is technically just a string). According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden. From RFC 8259 Section 7 "Strings": "All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)." Accepting unescaped control characters can at least cause some confusion as the following naive program demonstrates: marcus@linux:~> cat NaiveJsonProcessor.java import java.io.IOException; import java.io.StringReader; import com.google.gson.stream.JsonReader; public class NaiveJsonProcessor { public static void parseAndLog(String jsonInput) throws IOException { JsonReader reader = new JsonReader(new StringReader(jsonInput)); String parsed = reader.nextString(); if (parsed.equals("foo")) { throw new IllegalStateException("foo is forbidden"); } /* * According to the JsonReader's documentation "[...] this parser * is strict and only accepts JSON as specified by RFC 4627" (see * documentation of setLenient). Hence, we can safely log the * raw jsonInput to stdout because it contains no unescaped control * characters, which could be interpreted by a terminal. * Oops... wrong assumption:) */ System.out.println("Processed: " + jsonInput); } public static void main(String[] args) { String jsonInput = "\"foobar\u001b[3D\u001b[K\""; try { // the log entry might confuse the user... parseAndLog(jsonInput); } catch (IOException e) { e.printStackTrace(); } } } marcus@linux:~> javac -cp /path/to/gson/classes NaiveJsonProcessor.java marcus@linux:~> java -cp /path/to/gson/classes:. NaiveJsonProcessor Processed: "foo" marcus@linux:~> Since the unescaped control characters of the raw jsonInput are interpreted by the terminal, it _looks_ as if we processed the JSON text "foo" even though this string should result in an IllegalStateException (of course in reality we did _not_ process "foo"). Apart from this, the JsonReader accepts non-lowercase literals (like tRuE, falSE, NULl). According to the previously mentioned RFCs, this is forbidden. From RFC 8259 Section 3 "Values": "[...]or one of the following three literal names: false null true The literal names MUST be lowercase." To cope with this a strict mode is added to the JsonReader. In strict mode, the JsonReader does not accept unescaped control characters in strings and names. For this, the JsonReader raises an exception if it encounters an unescaped control character in nextQuotedValue and skipQuotedValue. Also, it does not accept non-lowercase literals. For this, peekKeyword raises an exception if a non-lowercase literal is encountered. In order to avoid regressions, the strict mode is disabled by default and the old behavior is retained. In strict mode, the JsonReader behaves exactly as before (except in case of an unescaped control character or non-lowercase literal, of course). For the details, see the new JsonReaderStrictTest testcase. The javadoc of the JsonReader is updated accordingly. As part of this update, all references to a JSON RFC are changed to RFC 8259 (that's what the JsonReader conforms to (in strict mode)). Signed-off-by: Marcus Huewe <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have to adjust JsonReader.readEscapeCharacter()
for strict mode as well. Currently it accepts:
\'
\LF
Both are not allowed by RFC 8259.
Also maybe it would be useful to cross-link in the javadoc between the corresponding methods, e.g.
setStrict
has@see #isStrict()
isStrict
has@see #setStrict(boolean)
Though the existing documentation sadly does not really do that either.
#2437 was been merged which adds a new I will therefore close this pull request, but thanks nonetheless for your work on this! If you notice anything which is missing in the current implementation or think the documentation can be improved, then any pull request for this is appreciated. |
By default, the JsonReader accepts unescaped control characters that
are embedded in a string or name (which is technically just a string).
According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden.
From RFC 8259 Section 7 "Strings":
"All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F)."
Accepting unescaped control characters can at least cause some confusion
as the following naive program demonstrates:
Since the unescaped control characters of the raw jsonInput are interpreted
by the terminal, it looks as if we processed the JSON text "foo" even
though this string should result in an IllegalStateException (of course in
reality we did not process "foo").
This PR fixes this by adding an optional strict mode to the JsonReader.
If preferred, I can also split commit 473a9b3
("Make JsonReader tests reusable in the com.google.gson.stream package") into two commits.