Decouple encodings from JSON parsing #6
Why would you have to skip a variable number of bytes looking for the closing quotation mark with UTF-8? A UTF-8 encoded string can always be interpreted as a valid CP-1252 encoded string. The `\` and `"` symbols don't change, so you'd still just zip along looking for a `"` not preceded by a `\`. It's also not necessary to escape anything besides 0x00-0x1F, 0x22, and 0x5C, which are all encoded the same in UTF-8 and CP-1252, as are all ASCII characters. So all strings could be encoded in UTF-8 before parsing, then serialized as UTF-8 before being converted to whatever format is desired for the output (possibly escaping higher Unicode values when encoding to CP-1252 or ASCII).

While converting before and after is ever so slightly more complicated with UTF-8 than with UTF-32, UTF-8 would likely take up far less memory than UTF-32. So it's a trade-off between speed and space+speed (assuming any string copy operations take longer with more bytes).
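For illustration, a minimal sketch of that byte-scanning loop (the names are invented, not jsoncpp's): because `"` (0x22) and `\` (0x5C) never occur inside a multi-byte UTF-8 sequence, the same loop works for ASCII, CP-1252, and UTF-8 input without decoding anything.

```cpp
#include <cstddef>

// Return the index of the closing '"' of a JSON string body (the bytes after
// the opening quote), or kNotFound if the input is truncated.
static const std::size_t kNotFound = static_cast<std::size_t>(-1);

std::size_t findClosingQuote(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (c == '\\') {
            ++i;              // skip whatever is escaped (e.g. \" or \\)
        } else if (c == '"') {
            return i;         // closing quote found
        }
        // Any other byte -- including bytes >= 0x80 from UTF-8 or CP-1252 --
        // is passed over without decoding.
    }
    return kNotFound;         // unterminated string
}
```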
You're right: UTF-8 is just as easy to scan for the closing quotation mark. The question is how to store the strings. I wanted to load the entire input JSON into memory and store references into it, but that means we cannot decode the escape sequences during parsing.

We need some dynamic memory anyway -- for storing arrays and tables -- but that requires only linked lists of fixed-size elements. (See gason, which I've recently transcribed to Nim.) Linked-list nodes can come from super-efficient chunk allocators. Strings, however, require contiguous memory (or more sophisticated data structures). This could turn into a long discussion. In order to avoid the costs of memory management during parsing, I want the low-level API to return the strings as read. (The lower level is really little more than a lexer.) A higher level can decode them.

But yes, you're right: we could accept either UTF-32 or UTF-8 input and still store the input without decoding. One caveat:

By the way, though gason is extremely fast, I would like to modify it to record array and table lengths to speed up later processing. My plan is to use the Nim gason implementation to generate C code, which would then be distributed. It generates readable, efficient C code. Then the whole thing can be implemented in Nim; C++ would be for convenient wrappers.
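A rough sketch of the storage scheme described above (all names invented here; these are not gason's or jsoncpp's actual declarations): scalars are stored as slices into the caller's input buffer, arrays/objects are linked lists of fixed-size nodes, and the nodes come from a simple chunk allocator.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

enum class Kind { Null, Bool, Number, String, Array, Object };

struct RawSlice {
    const char* begin;   // points into the original JSON buffer
    std::size_t length;  // no copying, no escape decoding at this layer
};

struct Node {            // fixed size, so it can come from a chunk allocator
    Kind kind;
    RawSlice raw;        // valid for Number and String
    Node* firstChild;    // valid for Array and Object
    Node* nextSibling;   // siblings are threaded as a linked list
};

// A tiny bump allocator: nodes are carved out of large chunks, so parsing
// performs no per-node heap management and nothing is freed individually.
class NodeArena {
public:
    Node* allocate() {
        if (chunks_.empty() || used_ == kChunkSize) {
            chunks_.emplace_back(new Node[kChunkSize]);
            used_ = 0;
        }
        return &chunks_.back()[used_++];
    }
private:
    static const std::size_t kChunkSize = 1024;
    std::vector<std::unique_ptr<Node[]>> chunks_;
    std::size_t used_ = 0;
};
```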
I will try to pull the other changes from Chromium as well. This passed Travis.
A. For reading, we should first parse JSON into nested tables, arrays, and strings. We should then interpret the strings only as needed.
B. For writing, the caller should generate the string, and we should simply store it into the nested data structures. Utility libraries could help the caller convert numbers to strings.
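A hypothetical sketch of what A and B could look like at the API level, assuming a value type (here called `LazyValue`) that is not part of the existing jsoncpp interface:

```cpp
#include <cstdlib>
#include <string>
#include <utility>

class LazyValue {
public:
    // A: the reader stores the token text exactly as it appeared in the input...
    explicit LazyValue(std::string rawText) : raw_(std::move(rawText)) {}

    // ...and interpretation happens only when, and if, the caller asks for it.
    double asDouble() const { return std::strtod(raw_.c_str(), nullptr); }
    const std::string& asRawText() const { return raw_; }

    // B: for writing, the caller supplies already-formatted text and the
    // library simply stores it; a utility helper could do the formatting.
    static LazyValue fromFormatted(std::string text) {
        return LazyValue(std::move(text));
    }

private:
    std::string raw_;  // the characters that appeared (or will appear) in the JSON
};
```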
As an experiment, I intend to separate these layers. The jsoncpp API would remain, for convenience, but under the covers there can be an extremely fast, efficient reader (maybe based on gason), which can be used directly by anyone who wants unlimited-length numbers. One thing people rarely notice about the JSON standard is that it says nothing about how long numbers can be. The whole issue of converting between ints/floats and strings is implementation-specific.
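For example, assuming the low-level reader exposes each number as its raw text, a caller could attempt an ordinary `double` conversion and fall back to higher precision only when needed (sketch only; the helper name is invented):

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

bool toDoubleIfRepresentable(const std::string& rawNumberToken, double& out) {
    errno = 0;
    char* end = nullptr;
    out = std::strtod(rawNumberToken.c_str(), &end);
    const bool consumedAll = (end == rawNumberToken.c_str() + rawNumberToken.size());
    // On overflow/underflow (ERANGE) or leftover characters, the caller can
    // fall back to an arbitrary-precision library or simply keep the string.
    return consumedAll && errno == 0;
}
```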
The other thing is that, in my opinion, we can simplify matters by having two versions: one which reads/writes ASCII, and one which reads/writes UTF-32. rapidjson is an example of a library which goes overboard on Unicode support: the encoding is threaded through the entire library as a template parameter. Way too complicated!
The problem is in parsing a JSON String. We have to support both standard Unicode characters and special JSON escapes. With UTF-8, we would have to skip a variable number of bytes while looking for the closing quotation mark, unless we restrict ourselves to ASCII. With UTF-16, we have to worry about "surrogate pairs". But why bother? If someone needs real Unicode strings, let them pre- and post-process in UTF-32, which is the easiest to deal with. If they want efficiency, let them be restricted to ASCII.
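A sketch of the UTF-32 case, assuming pre-converted `std::u32string` input (this helper is not an actual jsoncpp function): every code point is exactly one `char32_t`, so there are no multi-byte sequences to skip and no surrogate pairs to reassemble.

```cpp
#include <cstddef>
#include <string>

std::size_t findClosingQuote32(const std::u32string& s, std::size_t start) {
    for (std::size_t i = start; i < s.size(); ++i) {
        if (s[i] == U'\\') ++i;            // skip the escaped code point
        else if (s[i] == U'"') return i;   // one comparison per code point
    }
    return std::u32string::npos;           // unterminated string
}
```

Callers needing other encodings would convert to UTF-32 before parsing and back afterwards, keeping the parser itself encoding-agnostic rather than threading the encoding through as a template parameter.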
Those are the basic ideas. An example will make things clearer.