Decouple encodings from JSON parsing #6

Open
cdunn2001 opened this issue Jul 9, 2014 · 2 comments

Comments

@cdunn2001
Contributor

A. For reading, we should first parse JSON into nested tables, arrays, and strings. We should then interpret the strings only as needed.
B. For writing, the caller should generate the string, and we should simply store it into the nested data structures. Utility libraries could help the caller convert numbers to strings.

As an experiment, I intend to separate these layers. The jsoncpp API would remain, for convenience, but under the covers there can be an extremely fast, efficient reader (maybe based on gason), which can be used directly by anyone who wants unlimited-length numbers. One thing people rarely notice about the JSON standard is that it says nothing about how long numbers can be. The whole issue of converting between ints/floats and strings is implementation-specific.
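As a rough illustration of the "interpret strings only as needed" idea, here is a minimal C++ sketch. The RawValue type is hypothetical (it is not part of jsoncpp or gason); it keeps the unparsed token as a slice of the input and converts it only when the caller asks, so a caller who wants unlimited-length numbers can take the raw text instead.

```cpp
#include <string>

// Hypothetical type for this sketch: a number (or string) token kept as raw
// text from the JSON input, decoded only on demand.
struct RawValue {
    const char* begin;   // points into the original JSON buffer
    const char* end;     // one past the last byte of the token

    std::string text() const { return std::string(begin, end); }

    // Conversion happens here, not during parsing. Callers that need
    // arbitrary-precision numbers can take text() and parse it themselves.
    double asDouble() const { return std::stod(text()); }
    long long asInt64() const { return std::stoll(text()); }
};
```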

The other thing is that, in my opinion, we can simplify matters by having two versions: one which reads/writes ASCII, and one which reads/writes UTF-32. rapidjson is an example of a library which goes overboard on Unicode support. The encoding is threaded through the entire library as a template parameter. Way too complicated!

The problem is in parsing a JSON String. We have to support both standard Unicode characters and special JSON escapes. With UTF-8, we would have to skip variable numbers of bytes while looking for the closing quotation mark, unless we restrict ourselves to ASCII. With UTF-16, we have to worry about "surrogate pairs". But why bother? If someone needs real Unicode strings, let them pre- and post-process in UTF-32, which is the easiest to deal with. If they want efficiency, let them be restricted to ASCII.

Those are the basic ideas. An example will make things clearer.

@EphDoering

Why would you have to skip a variable number of bytes while looking for the closing quotation mark with UTF-8? A UTF-8 encoded string can always be interpreted as a valid CP-1252 encoded string. The \ and " symbols don't change, so you'd still just zip along looking for a " not preceded by a \. It's also not necessary to escape anything besides 0x00-0x1F, 0x22, and 0x5C, which are all encoded the same in UTF-8 and CP-1252, as are all ASCII characters.

So all strings could be encoded in UTF-8 before parsing, and then serialized as UTF-8 before being converted to whatever output format is desired (possibly escaping higher Unicode values if encoding to CP-1252 or ASCII). While converting before and after is ever so slightly more complicated with UTF-8 than with UTF-32, UTF-8 would likely take up far less memory than UTF-32. So it's a trade-off between speed and space+speed (assuming any string copy operations take longer with more bytes).

@cdunn2001
Contributor Author

You're right: UTF-8 is just as easy to scan for " as UTF-32. So we could store the JSON strings (the part between quotation marks) as either UTF-8 or UTF-32, depending on the input.

The question is how to store the strings. I wanted to load the entire input JSON into memory and store references into it, but that means we cannot decode the escape sequences during parsing. We need some dynamic memory anyway -- for storing arrays and tables -- but that requires only linked lists of fixed-size elements. (See gason, which I've recently transcribed to Nim.) Linked-list nodes can come from super-efficient chunk allocators. Strings, however, require contiguous memory (or more sophisticated data structures). This could turn into a long discussion.
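To make the fixed-size-node idea concrete, here is a rough C++ sketch loosely inspired by gason's layout (the names and details are illustrative, not gason's or jsoncpp's actual definitions). Array and object members form a singly linked list of uniform nodes, so they can come from a simple chunk allocator, while strings remain (offset, length) slices of the input buffer rather than owning contiguous memory.

```cpp
#include <cstddef>

// Illustrative value node: every node has the same size, so arrays and
// objects are just linked lists of these, and no per-string allocation
// happens during parsing.
enum class Tag { Null, Bool, Number, String, Array, Object };

struct Node {
    Tag tag;
    Node* next;            // next sibling in the enclosing array/object
    union {
        bool boolean;
        struct Slice { std::size_t offset, length; } slice;  // raw text in the input buffer
        Node* firstChild;  // first member of an Array/Object
    };
};

// A toy chunk allocator: fixed-size nodes can be handed out from large
// blocks, avoiding a malloc per node. (Real code would also keep the chunk
// pointers so the memory can be released.)
struct NodePool {
    static const std::size_t kChunk = 1024;
    Node* chunk = nullptr;
    std::size_t used = kChunk;

    Node* allocate() {
        if (used == kChunk) {
            chunk = new Node[kChunk];
            used = 0;
        }
        return &chunk[used++];
    }
};
```

With strings and numbers stored only as (offset, length) pairs, escape decoding and number conversion can be deferred to a higher layer, as described above.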

In order to avoid the costs of memory management during parsing, I want the low-level API to return the strings as read. (The lower level is really little more than a lexer.) A higher level can decode them. But yes, you're right: We could accept either UTF-32 or UTF-8 input and still store the input without decoding.

One caveat: \\" might end a string -- the backslash right before the quotation mark could itself be escaped, so "a quote not preceded by a backslash" isn't quite the whole rule. We have to keep a 2-state machine to handle escapes. But yes, that's fast and easy.
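For what it's worth, here is a minimal sketch of that two-state scan in C++ (illustrative only, not jsoncpp or gason code). It walks the bytes of an ASCII or UTF-8 string body looking for the closing quote, treating the byte after a backslash as escaped, which handles the \\" case above.

```cpp
#include <cstddef>

// 'p' points just past the opening quote; 'len' is the number of bytes left.
// Returns the index of the closing quote, or 'len' if the string is unterminated.
std::size_t findClosingQuote(const char* p, std::size_t len) {
    bool escaped = false;                // the second state of the 2-state machine
    for (std::size_t i = 0; i < len; ++i) {
        if (escaped) {
            escaped = false;             // the byte after '\' is consumed as-is, so
        } else if (p[i] == '\\') {       // "...\\" followed by '"' still ends
            escaped = true;              // the string correctly
        } else if (p[i] == '"') {
            return i;
        }
        // Continuation bytes of multi-byte UTF-8 sequences are >= 0x80 and can
        // never equal 0x22 ('"') or 0x5C ('\'), so byte-wise scanning is safe
        // without decoding.
    }
    return len;                          // unterminated string
}
```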

By the way, though gason is extremely fast, I would like to modify it to record array and table lengths, to speed up later processing. My plan is to use the Nim gason implementation to generate C code, which would then be distributed; Nim generates readable, efficient C code. Then the whole thing can be implemented in Nim, with C++ only for convenient wrappers.
