Emit warnings when dumping binary strings #643

casperisfine · 2024-10-24T14:13:44Z

Because of it's Ruby 1.8 heritage, the C extension doesn't care much about strings encoding. We should get stricter over time.

casperisfine · 2024-10-24T14:15:07Z

ext/json/ext/parser/parser.rl

+    VALUE utf8_string = rb_enc_associate_index(rb_str_dup(source), utf8_encindex);
+    switch (rb_enc_str_coderange(utf8_string)) {
+      case ENC_CODERANGE_7BIT:
+      case ENC_CODERANGE_VALID:
+        return utf8_string;
+    }


This one is #138. I'm having second thoughts about deprecating / removing it, as I fear lots of people depend on it, and looking at other parsers they all seem to accept it. So maybe we should just support it in the Java parser?

WDYT @headius @eregon ?

Whoops, I commented too early and thought this was a YAML issue.

After reexamining the code, it does appear that the json implementation for JRuby works directly with bytes via the ragel parser. It would probably be possible to make it accept invalid characters.

My position remains the same, however. I'm not sure what other parsers you tested, but all json specifications I looked at explicitly require content to be in one of the main unicode UTF encodings. That says to me that providing invalid characters is off spec and should produce an error.

The original RFC from 2006 even says:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

https://www.ietf.org/rfc/rfc4627.txt

The character stream cannot be parsed as Unicode, then it is not up to spec.

I'm not sure what other parsers you tested,

oj and Python's stdlib.

I totally get your argument, and generally agree, the concern is really about breaking existing (misbehaving) code.

Maybe I'm crazy, but I would break it in a minor release so it's easy to revert with another minor release, rather than break it in a major release and have to go back. I doubt the impact will be large. If we put out a minor release that starts erroring on bad input and it doesn't get reported for a few months, it's probably fine.

I think it's also important to remember that Ruby has become much more strict about character encodings over the years so the chance of getting bad content has reduced significantly since the original issue.

A common case where this still happens a lot today is JSON POST requests handled by Rack.

The Rack spec clearly state the body should be ASCII_8BIT, so if you do the common JSON.parse(request.body) you hit this case.

That's why I'm worried about breaking this.

Yeah that's a decent point.

Just yesterday someone at work told me they had an outage because they have a JSON based logger and were logging binary strings (some avro stuff or whatever).

So perhaps you are right and should just keep this behavior, unless strict: true is set.

unless strict: true is set.

json/lib/json/pure/generator.rb

Lines 227 to 228 in 6d3b3ac

# Strict mode only allow serializing JSON native types: Hash, Array,

# String, Integer, Float, true, false and nil.

I think even strict: true should accept it given the documented meaning of it.
It could be another option, but not sure there is any value to prohibit broken strings if anyway it works fine.

Would you be willing to make a PR with a test of what you have in mind? I can make it pass, just want to be 100% sure we're talking about the same thing.

What I'm thinking is:

diff --git a/test/json/json_parser_test.rb b/test/json/json_parser_test.rb index 59cfcfa..a7ec04d 100644 --- a/test/json/json_parser_test.rb +++ b/test/json/json_parser_test.rb @@ -196,6 +196,19 @@ class JSONParserTest < Test::Unit::TestCase ) end + def test_parse_broken_string + # https://github.com/ruby/json/issues/138 + s = parse(%{["\x80"]})[0] + assert_equal("\x80", s) + assert_equal Encoding::UTF_8, s.encoding + assert_equal false, s.valid_encoding? + + s = parse(%{["\x80"]}.b)[0] + assert_equal("\x80", s) + assert_equal Encoding::UTF_8, s.encoding + assert_equal false, s.valid_encoding? + end + def test_parse_big_integers json1 = JSON(orig = (1 << 31) - 1) assert_equal orig, parse(json1)

And that passes on master, because of the rb_str_conv_enc() below which if there is an error during conversion returns the source string as-is (https://github.com/ruby/ruby/blob/22abcce704cf3d82aaa9af5b8ae4e6bf628502ea/spec/ruby/optional/capi/string_spec.rb#L884). And I guess the C extension doesn't really care whether the input String is internally in UTF-8 or BINARY and just works on bytes (although that could lead to subtle bugs as e.g. ENC_CODERANGE_VALID means something different for both encodings).
If I replace the rb_str_conv_enc() with return rb_funcall(source, rb_intern("encode"), 1, rb_const_get(rb_path2class("Encoding"), rb_intern("UTF_8"))); then I get Encoding::UndefinedConversionError.
If on top of that I undo the change in this file then it works again.

BTW, the pure-Ruby parser actually uses the input String as BINARY internally.

Because of it's Ruby 1.8 heritage, the C extension doesn't care much about strings encoding. We should get stricter over time.

byroot · 2024-10-30T11:10:52Z

Merging as is for now, we can always decide to be even more strict in a followup.

casperisfine commented Oct 24, 2024

View reviewed changes

byroot force-pushed the master branch from 1af38e5 to d54063a Compare October 29, 2024 13:38

Emit warnings when dumping binary strings

42402fc

Because of it's Ruby 1.8 heritage, the C extension doesn't care much about strings encoding. We should get stricter over time.

casperisfine force-pushed the deprecate-broken-encoding branch from 4de4399 to 42402fc Compare October 30, 2024 11:01

byroot merged commit 0a94596 into ruby:master Oct 30, 2024
36 checks passed

casperisfine deleted the deprecate-broken-encoding branch October 30, 2024 11:13

fschwahn mentioned this pull request Nov 12, 2024

Deprecation warning with json gem 2.8.0+ getsentry/sentry-ruby#2462

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Emit warnings when dumping binary strings #643

Emit warnings when dumping binary strings #643

casperisfine commented Oct 24, 2024 •

edited

Loading

casperisfine Oct 24, 2024

headius Oct 24, 2024 •

edited

Loading

byroot Oct 24, 2024

headius Oct 24, 2024

casperisfine Oct 30, 2024

byroot Oct 30, 2024

eregon Oct 30, 2024

byroot Oct 30, 2024

eregon Oct 30, 2024 •

edited

Loading

eregon Oct 30, 2024

byroot commented Oct 30, 2024 •

edited by eregon

Loading

	# Strict mode only allow serializing JSON native types: Hash, Array,
	# String, Integer, Float, true, false and nil.

Emit warnings when dumping binary strings #643

Emit warnings when dumping binary strings #643

Conversation

casperisfine commented Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

headius Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eregon Oct 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

byroot commented Oct 30, 2024 • edited by eregon Loading

casperisfine commented Oct 24, 2024 •

edited

Loading

headius Oct 24, 2024 •

edited

Loading

eregon Oct 30, 2024 •

edited

Loading

byroot commented Oct 30, 2024 •

edited by eregon

Loading