Decouple encodings from JSON parsing #6
Why would you have to skip a variable number of bytes looking for the closing quotation mark with UTF-8? A UTF-8 encoded string can always be interpreted as a valid CP-1252 encoded string. The `\` and `"` symbols don't change, so you'd still just zip along looking for a `"` not preceded by a `\`. It's also not necessary to escape anything besides 0x00-0x1F, 0x22, and 0x5C, which are all encoded the same in UTF-8 and CP-1252, as are all ASCII characters. So all strings could be encoded in UTF-8 before parsing, then serialized as UTF-8 before being converted to whatever format is desired for the output (possibly escaping higher Unicode values when encoding to CP-1252 or ASCII).

While converting before and after is ever so slightly more complicated with UTF-8 than with UTF-32, UTF-8 would likely take up far less memory than UTF-32. So it's a trade-off between speed and space+speed (assuming any string copy operations take longer with more bytes).
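For illustration, a minimal sketch of that byte-scanning loop (the names are invented, not jsoncpp's): because `"` (0x22) and `\` (0x5C) never occur inside a multi-byte UTF-8 sequence, the same loop works for ASCII, CP-1252, and UTF-8 input without decoding anything.

```cpp
#include <cstddef>

// Return the index of the closing '"' of a JSON string body (the bytes after
// the opening quote), or kNotFound if the input is truncated.
static const std::size_t kNotFound = static_cast<std::size_t>(-1);

std::size_t findClosingQuote(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (c == '\\') {
            ++i;              // skip whatever is escaped (e.g. \" or \\)
        } else if (c == '"') {
            return i;         // closing quote found
        }
        // Any other byte -- including bytes >= 0x80 from UTF-8 or CP-1252 --
        // is passed over without decoding.
    }
    return kNotFound;         // unterminated string
}
```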
You're right: UTF-8 is just as easy to scan for the closing quotation mark. The question is how to store the strings. I wanted to load the entire input JSON into memory and store references into it, but that means we cannot decode the escape sequences during parsing.

We need some dynamic memory anyway -- for storing arrays and tables -- but that requires only linked lists of fixed-size elements. (See gason, which I've recently transcribed to Nim.) Linked-list nodes can come from super-efficient chunk allocators. Strings, however, require contiguous memory (or more sophisticated data structures). This could turn into a long discussion. In order to avoid the costs of memory management during parsing, I want the low-level API to return the strings as read. (The lower level is really little more than a lexer.) A higher level can decode them.

But yes, you're right: we could accept either UTF-32 or UTF-8 input and still store the input without decoding. One caveat:

By the way, though gason is extremely fast, I would like to modify it to record array and table lengths to speed up later processing. My plan is to use the Nim gason implementation to generate C code, which would then be distributed. It generates readable, efficient C code. Then the whole thing can be implemented in Nim; C++ would be for convenient wrappers.
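A rough sketch of the storage scheme described above (all names invented here; these are not gason's or jsoncpp's actual declarations): scalars are stored as slices into the caller's input buffer, arrays/objects are linked lists of fixed-size nodes, and the nodes come from a simple chunk allocator.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

enum class Kind { Null, Bool, Number, String, Array, Object };

struct RawSlice {
    const char* begin;   // points into the original JSON buffer
    std::size_t length;  // no copying, no escape decoding at this layer
};

struct Node {            // fixed size, so it can come from a chunk allocator
    Kind kind;
    RawSlice raw;        // valid for Number and String
    Node* firstChild;    // valid for Array and Object
    Node* nextSibling;   // siblings are threaded as a linked list
};

// A tiny bump allocator: nodes are carved out of large chunks, so parsing
// performs no per-node heap management and nothing is freed individually.
class NodeArena {
public:
    Node* allocate() {
        if (chunks_.empty() || used_ == kChunkSize) {
            chunks_.emplace_back(new Node[kChunkSize]);
            used_ = 0;
        }
        return &chunks_.back()[used_++];
    }
private:
    static const std::size_t kChunkSize = 1024;
    std::vector<std::unique_ptr<Node[]>> chunks_;
    std::size_t used_ = 0;
};
```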
I will try to pull the other changes from Chromium as well. This passed Travis.
A. For reading, we should first parse JSON into nested tables, arrays, and strings. We should then interpret the strings only as needed.
B. For writing, the caller should generate the string, and we should simply store it into the nested data structures. Utility libraries could help the caller convert numbers to strings.
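A hypothetical sketch of what A and B could look like at the API level, assuming a value type (here called `LazyValue`) that is not part of the existing jsoncpp interface:

```cpp
#include <cstdlib>
#include <string>
#include <utility>

class LazyValue {
public:
    // A: the reader stores the token text exactly as it appeared in the input...
    explicit LazyValue(std::string rawText) : raw_(std::move(rawText)) {}

    // ...and interpretation happens only when, and if, the caller asks for it.
    double asDouble() const { return std::strtod(raw_.c_str(), nullptr); }
    const std::string& asRawText() const { return raw_; }

    // B: for writing, the caller supplies already-formatted text and the
    // library simply stores it; a utility helper could do the formatting.
    static LazyValue fromFormatted(std::string text) {
        return LazyValue(std::move(text));
    }

private:
    std::string raw_;  // the characters that appeared (or will appear) in the JSON
};
```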
As an experiment, I intend to separate these layers. The jsoncpp API would remain, for convenience, but under the covers there can be an extremely fast, efficient reader (maybe based on gason), which can be used directly by anyone who wants unlimited-length numbers. One thing people rarely notice about the JSON standard is that it says nothing about how long numbers can be. The whole issue of converting between ints/floats and strings is implementation-specific.
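For example, assuming the low-level reader exposes each number as its raw text, a caller could attempt an ordinary `double` conversion and fall back to higher precision only when needed (sketch only; the helper name is invented):

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

bool toDoubleIfRepresentable(const std::string& rawNumberToken, double& out) {
    errno = 0;
    char* end = nullptr;
    out = std::strtod(rawNumberToken.c_str(), &end);
    const bool consumedAll = (end == rawNumberToken.c_str() + rawNumberToken.size());
    // On overflow/underflow (ERANGE) or leftover characters, the caller can
    // fall back to an arbitrary-precision library or simply keep the string.
    return consumedAll && errno == 0;
}
```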
The other thing is that, in my opinion, we can simplify matters by having two versions: one which reads/writes ASCII, and one which reads/writes UTF-32. rapidjson is an example of a library which goes overboard on Unicode support: the encoding is threaded through the entire library as a template parameter. Way too complicated!
The problem is in parsing a JSON String. We have to support both standard Unicode characters and special JSON escapes. With UTF-8, we would have to skip a variable number of bytes while looking for the closing quotation mark, unless we restrict ourselves to ASCII. With UTF-16, we have to worry about "surrogate pairs". But why bother? If someone needs real Unicode strings, let them pre- and post-process in UTF-32, which is the easiest to deal with. If they want efficiency, let them be restricted to ASCII.
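A sketch of the UTF-32 case, assuming pre-converted `std::u32string` input (this helper is not an actual jsoncpp function): every code point is exactly one `char32_t`, so there are no multi-byte sequences to skip and no surrogate pairs to reassemble.

```cpp
#include <cstddef>
#include <string>

std::size_t findClosingQuote32(const std::u32string& s, std::size_t start) {
    for (std::size_t i = start; i < s.size(); ++i) {
        if (s[i] == U'\\') ++i;            // skip the escaped code point
        else if (s[i] == U'"') return i;   // one comparison per code point
    }
    return std::u32string::npos;           // unterminated string
}
```

Callers needing other encodings would convert to UTF-32 before parsing and back afterwards, keeping the parser itself encoding-agnostic rather than threading the encoding through as a template parameter.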
Those are the basic ideas. An example will make things clearer.