Speedup parsing large JSON data files #9
@tik0 here is a sample JSON data file, shared by one of the users. The file is about 2 MB in size and contains about 150k small unstructured objects: http://kwafoo.coe.neu.edu/temp/419138.json.gz Please try your speedup approach and see how much it outperforms my latest code. If it works better, I would be happy to help you incorporate your changes into jsonlab, and I appreciate your contribution in advance.
@fangq Why so much complexity for creating valid MATLAB field names? For instance, when does the need for unicode2native() arise? Why not simply use the same format for Octave and MATLAB? I have simplified that part, and I basically shaved the whole 20 seconds off that subfunction, but maybe I'm missing something.
@okomarov thanks for looking into this
unicode2native does not exist in Octave. I called unicode2native in loadjson and native2unicode in savejson because I wanted to maintain a round-trip translation between a Unicode string and a hex string. I believe translating it byte-by-byte instead of character-by-character would lose the integrity of the string. However, this feature has not been well tested. The only example I have is this script
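A minimal sketch of the round-trip described above (variable names are illustrative, not jsonlab's actual code; unicode2native/native2unicode are MATLAB-only, which is why Octave needs a different path):

```matlab
% Encode a Unicode string as a hex string (safe for a field name),
% then decode it back; illustrative only, not jsonlab's implementation.
str    = 'naïve';
bytes  = unicode2native(str, 'UTF-8');     % uint8 byte sequence
hexstr = sprintf('%02X', bytes);           % e.g. '6E61C3AF7665'
pairs  = reshape(hexstr, 2, []).';         % one byte (two hex digits) per row
back   = native2unicode(uint8(hex2dec(pairs)).', 'UTF-8');
% isequal(str, back) should hold: no information is lost in the round trip
```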
As a side note, to get a proper conversion you need to specify the encoding.
This happens because my default character set does not cover Chinese ideograms.
If you use 'UTF-8', you get the expected result.
However, the Chinese ideograms are in this case 3 bytes long under UTF-8:
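For illustration (the code point below is my own example, not from the thread), the byte count indeed depends on the chosen encoding:

```matlab
% U+4E2D, a common Chinese ideogram, measured under two encodings
c = char(hex2dec('4E2D'));
numel(unicode2native(c, 'UTF-8'))      % 3 bytes: E4 B8 AD
numel(unicode2native(c, 'UTF-16LE'))   % 2 bytes (one UTF-16 code unit)
```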
@okomarov yes, I was aware of the dependency on DefaultCharacterSet; see Item #2 in the Known Issues and TODOs in the README: http://iso2mesh.sourceforge.net/cgi-bin/index.cgi?jsonlab/Doc/README#Known_Issues_and_TODOs The JSON spec requires that all strings be Unicode strings, and I am not sure I should force UTF-8, as I assume other Unicode encodings may also be valid. Frankly, I am surprised by the overhead of regexprep, because essentially nothing was changed in that line. If we can convert a Unicode string to a hex key without losing information, perhaps we can somehow convert it back to Unicode in loadjson. I just haven't found the combination yet.
When you said translating byte-by-byte, did you mean this?
The main goal of valid_field() is to convert any JSON "name" field (which can contain multi-byte Unicode characters) into a valid MATLAB variable name (a string containing only [0-9a-zA-Z_]). Since you have identified the hot spot in the unicode-handling line of this function, I was debating whether there is an alternative that achieves a fast conversion without losing information.
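A hedged sketch of that mapping (this is not jsonlab's actual valid_field; the escaping scheme and function name are hypothetical, and the dynamic regexprep replacement used here is a MATLAB feature not available in Octave):

```matlab
function f = valid_field_sketch(name)
% Map an arbitrary JSON "name" onto a string containing only
% [0-9a-zA-Z_]; illustrative sketch, not jsonlab's implementation.
    if isempty(regexp(name, '^[A-Za-z]', 'once'))
        name = ['x' name];     % identifiers must start with a letter
    end
    % hex-escape each character outside the legal set, keeping enough
    % information that a decoder could in principle map it back
    f = regexprep(name, '([^0-9A-Za-z_])', '_0x${dec2hex(double($1))}_');
end
```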
@okomarov I was hoping that conversions are only needed for multi-byte Unicode characters; for ASCII, I'd like them to stay the same.
@okomarov how did the test go? I really like the features of jsonlab, but I am concerned about the speed. Would it be worth rewriting this in C?
@jerlich If you have R2016a, you can try MATLAB's built-in (mexed) JSON parser.
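For reference, the undocumented builtin used for the benchmarks later in this thread, plus the documented functions that superseded it (the internal function's location and availability may vary across releases):

```matlab
% undocumented internal parser referenced in this thread
ss = matlab.internal.webservices.fromJSON(fileread('419138.json'));
% documented equivalents, available since R2016b
ss  = jsondecode(fileread('419138.json'));
txt = jsonencode(ss);
```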
@okomarov I have tried christianpanton/matlab-json and the speedup is massive (10-50x). I got it compiled on amd64, but haven't managed to get it compiled for maci64. Are you using the built-in MATLAB service now? I could push my team to upgrade to R2016a. I guess that is a good feature.
@fangq was a bit lost ;). All my JSON files have a bunch of objects in a file like this: Here is some evaluation with my old Core i7 4700:
@tik0, this looks like an interesting idea. I am curious: what did you mean by "CO: Common execution using loadjson"? Why is CO significantly slower than ST for 10^4 elements?
@fangq if you have a look at the simple script, the "CO" part is just the execution of the standard command loadjson(jsonFile, parsingOptions).
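In other words, the "CO" case is stock loadjson with key/value options (the option shown below is one of jsonlab's documented keys; the file name is the thread's test file and is only illustrative here):

```matlab
% plain invocation of jsonlab's parser with one parsing option
data = loadjson('419138.json', 'SimplifyCell', 1);
```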
Do you mind attaching your 10^4 test data file so I can profile loadjson?
Just have a look at the commit tik0@67aaaf9. It is the file
Thanks everyone for the useful comments. I am working on a new release of jsonlab and thought it would be a good idea to accelerate the code and close this issue. In the latest commit (https://github.com/fangq/jsonlab/commit/8a26d68776a9e65867ee4f5b93e030daeec64066), I made two changes, both discussed previously:
* disabling/bypassing unicode2native when no multi-byte character is detected
* cutting the use of global variables, especially the input JSON string (inStr)
These changes yielded an over 2-fold speedup on the previously included test dataset (http://kwafoo.coe.neu.edu/temp/419138.json.gz). Here are the timing outputs when running the benchmark on a new desktop (i7-6770k + DDR4 memory):
%%% old loadjson %%%%
> tic; dd=loadjson('419138.json'); toc
Elapsed time is 27.633101 seconds.
%%% updated loadjson %%%%
> tic; dd=loadjson('419138.json'); toc
Elapsed time is 12.351393 seconds.
%%% matlab built-in JSON parser %%%%
> tic; ss=matlab.internal.webservices.fromJSON(fileread('419138.json')); toc
Elapsed time is 15.474570 seconds.
The optimized loadjson turns out to be about 20% faster than the hidden builtin fromJSON for this benchmark, which I am quite happy about. If you are interested, please check out the latest version and try it on your data to see if there is any improvement.
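One of the changes mentioned here is bypassing unicode2native when the input contains no multi-byte character. A hedged sketch of that fast path (function name and escaping scheme are illustrative, not the actual commit):

```matlab
function name = escape_name_sketch(name)
% Fast path: a pure-ASCII name needs no unicode2native round-trip,
% so the expensive conversion is skipped entirely.
    if all(double(name) < 128)
        return;                          % nothing to convert
    end
    % slow path: hex-encode the UTF-8 bytes (placeholder scheme)
    bytes = unicode2native(name, 'UTF-8');
    name  = sprintf('0x%02X_', bytes);
end
```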
As a side note, I think that since R2016a, MATLAB has an official JSON encoder and decoder, which is mexed.
@okomarov, even if that's true, I can still see plenty of reasons to continue investing time in improving this toolbox. It works not only in MATLAB but also in Octave, and can be helpful for open-source users; it is already distributed by some distros (https://admin.fedoraproject.org/pkgdb/package/rpms/octave-jsonlab/). Also, the UBJSON support is unique to jsonlab. I did try
Dear fangq,
with just a few tweaks, we significantly sped up the parsing of our big JSON data sets (>> 300 MB).
Do you have some test data sets with which we can evaluate our improvements?
If it works fine with yours, we would like to contribute.
Greetings,
tik0