Speedup parsing large JSON data files #9

Closed
tik0 opened this issue Sep 8, 2015 · 21 comments

Comments

@tik0

tik0 commented Sep 8, 2015

Dear fangq,

with just a few tweaks, we significantly sped up the parsing of our big JSON data sets (>> 300 MB).
Do you have some test data sets with which we can evaluate our improvements?
If it works fine with yours, we would like to contribute.

Greetings,
tik0

@fangq
Member

fangq commented Sep 19, 2015

@tik0
sorry for the delay, I was in the process of moving to a new office; my work computers were down for some time.

here is a sample JSON data file, shared by one of the users. The file is about 2MB in size, containing about 150k small unstructured objects.

http://kwafoo.coe.neu.edu/temp/419138.json.gz

please try your speedup approach and see how much it outperforms my latest code. If it works better, I would be happy to help you incorporate your changes into jsonlab, and I appreciate your contribution in advance.

@fangq changed the title from "Speedup" to "Speedup parsing large JSON data files" Sep 19, 2015
@okomarov

okomarov commented Dec 7, 2015

Removing globals and as many regexp() as possible will help a lot. There is one regexp in particular that takes ages.

[image: capture1]

20 whole seconds are wasted in creating valid_fields. That is something that can be done much more efficiently.

[image: capture2]

@okomarov

okomarov commented Dec 8, 2015

@fangq Why so much complexity for creating valid Matlab fieldnames? For instance, when does the need for unicode2native() arise? Why not simply use the same format for Octave and Matlab?

I have simplified that part and basically shaved off the whole 20 seconds from that subfunction, but maybe I'm missing something.

@fangq
Member

fangq commented Dec 11, 2015

@okomarov thanks for looking into this

> Why not simply use the same format for Octave and Matlab.

unicode2native does not exist in Octave. I call unicode2native in loadjson and native2unicode in savejson because I want to maintain a round-trip translation between a Unicode string and a hex string. I believe translating it byte by byte instead of character by character would break the integrity of the string.

However, this feature has not been well tested. The only example I have is this script

https://github.com/fangq/jsonlab/blob/master/examples/demo_jsonlab_basic.m#L159

@okomarov

As a side note, to get proper conversion, you need to specify the encoding:

unicode2native('绝密')
ans =
   26   26

This happens because my default character set does not cover Chinese ideograms:

feature('DefaultCharacterSet')
ans =
windows-1252

If you use 'UTF-8', you get the expected result:

native2unicode(unicode2native('绝密','UTF-8'),'UTF-8')

Note, however, that each Chinese ideogram is 3 bytes long in UTF-8 in this case:

sprintf('%X',unicode2native('绝密','UTF-8'))
ans =
E7BB9DE5AF86

@fangq
Member

fangq commented Dec 14, 2015

@okomarov yes, I was aware of the dependency on DefaultCharacterSet; see item 2 under Known Issues and TODOs in the README

http://iso2mesh.sourceforge.net/cgi-bin/index.cgi?jsonlab/Doc/README#Known_Issues_and_TODOs

The JSON spec requires all strings to be Unicode strings. I am not sure whether I should force UTF-8, as I assume other Unicode encodings may also be valid.

Frankly, I am surprised by the overhead of regexprep, because essentially nothing was changed in that line. If we can convert a Unicode string to a hex key without losing information, perhaps we can somehow convert it back to Unicode in loadjson. I just haven't found the right combination yet.

@okomarov

When you said translating byte-by-byte, did you mean this?

>> double('绝密')
ans =
       32477       23494
>> char(double('绝密'))
ans =
绝密

@fangq
Member

fangq commented Dec 14, 2015

the main goal of valid_field() is to convert any JSON "name" field (which can contain multi-byte Unicode characters) to a valid matlab variable name (a string with only [0-9a-zA-Z_]). Since you have identified the hot spot in the unicode handling line of this function, I have been wondering whether there is an alternative that achieves a fast conversion without losing information.

@okomarov

double() should not lose info. Matlab uses UTF-16 to encode char, so any conversion with double should be fine. I'll test the double conversion (and dec2hex) and see the speedup. If it is acceptable, I'll submit a PR.

@fangq
Member

fangq commented Dec 15, 2015

@okomarov I was hoping that conversions are only needed for multi-byte Unicode characters; ASCII characters should stay the same.
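
The two wishes discussed above (a conversion that loses no information, and ASCII left untouched) could be combined roughly as in the sketch below. It is not jsonlab's valid_field(); the `_0xHHHH_` escape format and all variable names are assumptions made for illustration, it relies on Matlab's UTF-16 char (so it is not expected to behave the same in Octave), and it ignores the rule that a Matlab name must start with a letter.

```matlab
% Hypothetical sketch, not jsonlab's valid_field(): encode a JSON name so
% that it contains only [0-9A-Za-z_], then decode it again.
str   = char([97 95 32477 23494]);   % 'a_' followed by the two ideograms
codes = double(str);                 % UTF-16 code units, nothing is lost

% Characters that are already legal in a field name stay untouched.
ok = (codes >= 48 & codes <= 57) | (codes >= 65 & codes <= 90) | ...
     (codes >= 97 & codes <= 122) | (codes == 95);

name = '';
for k = 1:numel(codes)
    if ok(k)
        name = [name str(k)];                          %#ok<AGROW>
    else
        name = [name sprintf('_0x%04X_', codes(k))];   %#ok<AGROW>
    end
end

% Decoding: turn every _0xHHHH_ escape back into its character.
[tok, pieces] = regexp(name, '_0x([0-9A-F]{4})_', 'tokens', 'split');
back = pieces{1};
for k = 1:numel(tok)
    back = [back char(hex2dec(tok{k}{1})) pieces{k+1}]; %#ok<AGROW>
end
```

One caveat of this particular escape format: a name that already contains text such as `_0x0041_` would be decoded incorrectly, so a real implementation would need to escape that case as well.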

@jerlich

jerlich commented Jun 25, 2016

@okomarov how did the test go? I really like the features of jsonlab but I am concerned about the speed. Would it be worth rewriting this in C?

@okomarov

@jerlich If you have R2016a, matlab.internal.webservices.toJSON() and matlab.internal.webservices.fromJSON() are already mex-ed.

@jerlich

jerlich commented Jun 26, 2016

@okomarov I have tried christianpanton/matlab-json and the speed-up is massive (10-50x). I got it compiled for amd64, but haven't managed to compile it for maci64. Are you using the built-in matlab service now? I could push my team to upgrade to 2016a. I guess that is a good feature.

@at15 at15 mentioned this issue Jun 27, 2016
@tik0
Author

tik0 commented Aug 22, 2016

@fangq was a bit lost ;).
I just forked your repo and added a simple wrapper script for parallel parsing, just by pre-formatting the input.
With this commit tik0@67aaaf9 I've added a new script called loadjsonpar.m, which first separates all objects and then parses each of them with your parser.
When I simply run your script on my files, the time grows superlinearly with the number of objects, while with mine it clearly grows linearly.
The evaluation can be done on your PC using example/evaluation.m

All my JSON files have a bunch of objects in a file like this:
{object1}{object2}{...}....
They are in fact RFC 4627 compliant, so the wrapper script is as well.
It would be nice to have such an extension in your jsonlab, because it would shorten the wait for some people ;).
If you need help, just ask!
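
For readers who want to picture the separation step, here is a minimal sketch of one way to cut a stream of concatenated top-level objects into one string per object. It is not the code of loadjsonpar.m; the variable names and the sample text are made up for illustration. The scan tracks brace depth and skips over string contents so that braces inside strings are not counted.

```matlab
% Hypothetical sketch, not the code of loadjsonpar.m.
txt = '{"a":1}{"b":{"c":"}"}} {"d":[1,2]}';

objs    = {};
depth   = 0;       % current nesting level of { }
inQuote = false;   % true while inside a JSON string
esc     = false;   % true right after a backslash inside a string
first   = 0;       % index of the '{' that opened the current object
for k = 1:numel(txt)
    c = txt(k);
    if inQuote
        if esc
            esc = false;              % escaped character, skip it
        elseif c == '\'
            esc = true;
        elseif c == '"'
            inQuote = false;
        end
    elseif c == '"'
        inQuote = true;
    elseif c == '{'
        if depth == 0
            first = k;                % a new top-level object starts here
        end
        depth = depth + 1;
    elseif c == '}'
        depth = depth - 1;
        if depth == 0
            objs{end+1} = txt(first:k); %#ok<AGROW>
        end
    end
end
```

Each element of `objs` is then a small, self-contained JSON text that can be handed to the regular parser one at a time, or distributed over workers, so no single parse call has to work on the whole file.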

Here is an evaluation on my old Core i7 4700:

ST: Single thread using loadjsonpar
MT: Multi thread with 2 parallel workers using loadjsonpar
CO: Common execution using loadjson

Parsing 10^2 objects:
ST: 0.9076 seconds
MT: 0.9941 seconds
CO: 0.8861 seconds

Parsing 10^3 objects:
ST: 8.5400 seconds
MT: 6.4673 seconds
CO: 9.6118 seconds

Parsing 10^4 objects:
ST:  91.8963 seconds
MT:  58.0130 seconds
CO: 270.1061 seconds

@fangq
Member

fangq commented Aug 22, 2016

@tik0, looks like an interesting idea

I am curious what you mean by "CO: Common execution using loadjson". Why is CO significantly slower than ST for 10^4 elements?

@tik0
Author

tik0 commented Aug 22, 2016

@fangq if you have a look at the simple script, the "CO" part is just the execution of the standard command "loadjson(jsonFile, parsingOptions)".
Actually, I am a bit confused as well, but I think it is because the variable "inStr", which holds the whole JSON file, is processed in many operations.
This might be inefficient for large files.
On the other hand, I don't think the JIT is the cause, given the recursive nature of the parsing (but who knows what the Matlab magic does there).

@fangq
Member

fangq commented Aug 22, 2016

do you mind attaching your 10^4 test data file so I can profile loadjson?

@tik0
Author

tik0 commented Aug 23, 2016

Just have a look at the commit tik0@67aaaf9. It is the file examples/10000.json.

@fangq fangq closed this as completed in 8a26d68 Jan 2, 2017
@fangq
Member

fangq commented Jan 2, 2017

thanks everyone for the useful comments. I am working on a new release of jsonlab and thought that it would be a great idea to accelerate the code and close this issue.

in the latest commit (https://github.com/fangq/jsonlab/commit/8a26d68776a9e65867ee4f5b93e030daeec64066), I made two changes, both were discussed previously -

  • disabling/bypassing unicode2native when no multi-byte character is detected
  • cutting the use of global variables, especially the input JSON string (inStr).
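
The first change can be pictured as a guard of the following kind. This is only a sketch of the idea, not the code in the commit; the variable names and sample names are assumptions, and the second sample is built from code units so the snippet does not depend on the file encoding.

```matlab
% Sketch only, not the code from the commit: skip the expensive
% conversion whenever a name is plain ASCII.
names  = {'temperature', char([32477 23494])};  % second entry: two ideograms
nbytes = zeros(size(names));
for k = 1:numel(names)
    s = names{k};
    if all(double(s) < 128)
        bytes = uint8(s);                    % fast path, nothing to convert
    else
        bytes = unicode2native(s, 'UTF-8');  % slow path, multi-byte input
    end
    nbytes(k) = numel(bytes);
end
disp(nbytes)
```

Since most field names in typical data files are plain ASCII, a cheap test like this keeps the slow call off the common path.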

these changes yielded a more than 2-fold speedup for the previously included test dataset
( http://kwafoo.coe.neu.edu/temp/419138.json.gz ). Here are the timing outputs when running the benchmark on a new desktop (i7-6770k+DDR4 mem)

%%% old loadjson %%%%
>> tic; dd=loadjson('419138.json');toc
Elapsed time is 27.633101 seconds.

%%% updated loadjson %%%%
>> tic; dd=loadjson('419138.json');toc
Elapsed time is 12.351393 seconds.

%%% matlab built-in JSON parser %%%%
>> tic;ss=matlab.internal.webservices.fromJSON(fileread('419138.json'));toc
Elapsed time is 15.474570 seconds.

the optimized loadjson turns out to be 20% faster than the hidden builtin fromJSON function for this benchmark, which I am quite happy about.

if you are interested, please checkout the latest version, and try it on your data and see if there is any improvement.

@okomarov

okomarov commented Jan 3, 2017 via email

@fangq
Member

fangq commented Jan 3, 2017

@okomarov, even if that's true, I can still see plenty of reasons to continue investing time in improving this toolbox. this toolbox not only works for matlab, but also for octave, and can be helpful for open-source users. it is already distributed by some distros

https://admin.fedoraproject.org/pkgdb/package/rpms/octave-jsonlab/

also, the UBJSON support is unique to jsonlab.

I did try jsonencode/jsondecode on my laptop running matlab 2016b. jsonencode is lightning fast; however, it currently does not support complex and sparse data. This can be a headache for some users. I also ran the benchmark json file with jsondecode; it is about 20% slower than the latest loadjson, similar to matlab.internal.webservices.fromJSON.
