Speedup parsing large JSON data files #9
@tik0 here is a sample JSON data file, shared by one of the users. The file is about 2 MB in size and contains about 150k small unstructured objects: http://kwafoo.coe.neu.edu/temp/419138.json.gz Please try your speedup approach and see how much it outperforms my latest code. If it works better, I would be happy to help you incorporate your changes into jsonlab, and I appreciate your contribution in advance.
@fangq Why so much complexity for creating valid MATLAB field names? For instance, when does the need for unicode2native() arise? Why not simply use the same format for Octave and MATLAB? I have simplified that part, and I basically shaved the whole 20 seconds off that subfunction, but maybe I'm missing something.
@okomarov thanks for looking into this
unicode2native does not exist in Octave. I called unicode2native in loadjson and native2unicode in savejson because I wanted to maintain a round-trip translation between a Unicode string and a hex string. I believe translating it byte-by-byte instead of character-by-character would lose the integrity of the string. However, this feature has not been well tested. The only example I have is this script
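A minimal sketch of the round-trip described above (variable names are illustrative, not jsonlab's actual code; unicode2native/native2unicode are MATLAB-only, which is why Octave needs a different path):

```matlab
% Encode a Unicode string as a hex string (safe for a field name),
% then decode it back; illustrative only, not jsonlab's implementation.
str    = 'naïve';
bytes  = unicode2native(str, 'UTF-8');     % uint8 byte sequence
hexstr = sprintf('%02X', bytes);           % e.g. '6E61C3AF7665'
pairs  = reshape(hexstr, 2, []).';         % one byte (two hex digits) per row
back   = native2unicode(uint8(hex2dec(pairs)).', 'UTF-8');
% isequal(str, back) should hold: no information is lost in the round trip
```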
As a side note, to get a proper conversion you need to specify the encoding.
This happens because my default character set does not cover Chinese ideograms.
If you use 'UTF-8', you get the expected result.
However, the Chinese ideograms are in this case 3 bytes long under UTF-8:
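For illustration (the code point below is my own example, not from the thread), the byte count indeed depends on the chosen encoding:

```matlab
% U+4E2D, a common Chinese ideogram, measured under two encodings
c = char(hex2dec('4E2D'));
numel(unicode2native(c, 'UTF-8'))      % 3 bytes: E4 B8 AD
numel(unicode2native(c, 'UTF-16LE'))   % 2 bytes (one UTF-16 code unit)
```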
@okomarov yes, I was aware of the dependency on DefaultCharacterSet; see Item #2 in the Known Issues and TODOs in the README: http://iso2mesh.sourceforge.net/cgi-bin/index.cgi?jsonlab/Doc/README#Known_Issues_and_TODOs The JSON spec requires that all strings be Unicode strings, and I am not sure I should force UTF-8, as I assume other Unicode encodings may also be valid. Frankly, I am surprised by the overhead of regexprep, because essentially nothing was changed in that line. If we can convert a Unicode string to a hex key without losing information, perhaps we can somehow convert it back to Unicode in loadjson. I just haven't found the combination yet.
When you said translating byte-by-byte, did you mean this?
The main goal of valid_field() is to convert any JSON "name" field (which can contain multi-byte Unicode characters) into a valid MATLAB variable name (a string containing only [0-9a-zA-Z_]). Since you have identified the hot spot in the unicode-handling line of this function, I was debating whether there is an alternative that achieves a fast conversion without losing information.
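A hedged sketch of that mapping (this is not jsonlab's actual valid_field; the escaping scheme and function name are hypothetical, and the dynamic regexprep replacement used here is a MATLAB feature not available in Octave):

```matlab
function f = valid_field_sketch(name)
% Map an arbitrary JSON "name" onto a string containing only
% [0-9a-zA-Z_]; illustrative sketch, not jsonlab's implementation.
    if isempty(regexp(name, '^[A-Za-z]', 'once'))
        name = ['x' name];     % identifiers must start with a letter
    end
    % hex-escape each character outside the legal set, keeping enough
    % information that a decoder could in principle map it back
    f = regexprep(name, '([^0-9A-Za-z_])', '_0x${dec2hex(double($1))}_');
end
```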
@okomarov I was hoping that conversions are only needed for multi-byte Unicode characters; for ASCII, I'd like them to stay the same.
@okomarov how did the test go? I really like the features of jsonlab, but I am concerned about the speed. Would it be worth rewriting this in C?
@jerlich If you have R2016a, you can try MATLAB's built-in (mexed) JSON parser.
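For reference, the undocumented builtin used for the benchmarks later in this thread, plus the documented functions that superseded it (the internal function's location and availability may vary across releases):

```matlab
% undocumented internal parser referenced in this thread
ss = matlab.internal.webservices.fromJSON(fileread('419138.json'));
% documented equivalents, available since R2016b
ss  = jsondecode(fileread('419138.json'));
txt = jsonencode(ss);
```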
@okomarov I have tried christianpanton/matlab-json and the speedup is massive (10-50x). I got it compiled on amd64, but haven't managed to get it compiled for maci64. Are you using the built-in MATLAB service now? I could push my team to upgrade to R2016a. I guess that is a good feature.
@fangq was a bit lost ;). All my JSON files have a bunch of objects in a file like this: Here is some evaluation with my old Core i7 4700:
@tik0, this looks like an interesting idea. I am curious: what did you mean by "CO: Common execution using loadjson"? Why is CO significantly slower than ST for 10^4 elements?
@fangq if you have a look at the simple script, the "CO" part is just the execution of the standard command loadjson(jsonFile, parsingOptions).
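In other words, the "CO" case is stock loadjson with key/value options (the option shown below is one of jsonlab's documented keys; the file name is the thread's test file and is only illustrative here):

```matlab
% plain invocation of jsonlab's parser with one parsing option
data = loadjson('419138.json', 'SimplifyCell', 1);
```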
Do you mind attaching your 10^4 test data file so I can profile loadjson?
Just have a look at the commit tik0@67aaaf9. It is the file
Thanks everyone for the useful comments. I am working on a new release of jsonlab and thought it would be a good idea to accelerate the code and close this issue. In the latest commit (https://github.com/fangq/jsonlab/commit/8a26d68776a9e65867ee4f5b93e030daeec64066), I made two changes, both discussed previously:
* disabling/bypassing unicode2native when no multi-byte character is detected
* cutting the use of global variables, especially the input JSON string (inStr)
These changes yielded an over 2-fold speedup on the previously included test dataset (http://kwafoo.coe.neu.edu/temp/419138.json.gz). Here are the timing outputs when running the benchmark on a new desktop (i7-6770k + DDR4 memory):
%%% old loadjson %%%%
> tic; dd=loadjson('419138.json'); toc
Elapsed time is 27.633101 seconds.
%%% updated loadjson %%%%
> tic; dd=loadjson('419138.json'); toc
Elapsed time is 12.351393 seconds.
%%% matlab built-in JSON parser %%%%
> tic; ss=matlab.internal.webservices.fromJSON(fileread('419138.json')); toc
Elapsed time is 15.474570 seconds.
The optimized loadjson turns out to be about 20% faster than the hidden builtin fromJSON for this benchmark, which I am quite happy about. If you are interested, please check out the latest version and try it on your data to see if there is any improvement.
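One of the changes mentioned here is bypassing unicode2native when the input contains no multi-byte character. A hedged sketch of that fast path (function name and escaping scheme are illustrative, not the actual commit):

```matlab
function name = escape_name_sketch(name)
% Fast path: a pure-ASCII name needs no unicode2native round-trip,
% so the expensive conversion is skipped entirely.
    if all(double(name) < 128)
        return;                          % nothing to convert
    end
    % slow path: hex-encode the UTF-8 bytes (placeholder scheme)
    bytes = unicode2native(name, 'UTF-8');
    name  = sprintf('0x%02X_', bytes);
end
```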
As a side note, I think that since R2016a, MATLAB has an official JSON encoder and decoder, which is mexed.
@okomarov, even if that's true, I can still see plenty of reasons to continue investing time in improving this toolbox. It works not only in MATLAB but also in Octave, and can be helpful for open-source users; it is already distributed by some distros (https://admin.fedoraproject.org/pkgdb/package/rpms/octave-jsonlab/). Also, the UBJSON support is unique to jsonlab. I did try
Dear fangq,
with just a few tweaks, we significantly sped up the parsing of our big JSON data sets (>> 300 MB).
Do you have some test data sets with which we can evaluate our improvements?
If it works fine with yours, we would like to contribute.
Greetings,
tik0