-
Notifications
You must be signed in to change notification settings - Fork 135
Conversation
This new C extension backend implements three objects (basic_parse, parse and items) which work like an iterator (i.e., they implement the iter and iternext methods). They do the same work their counterparts in the other backends do; that is, basic_parse generates tuples of (event, value), parse adds a path to that tuple, and items generates objects out of those. The fact that they are written in C, plus a few other minor goodies, means that their performance is substantially better than the other backends. The code tries to take all error cases into account and return the correct values when needed. Reference counting seems correct when things work correctly; that is no memory corruption or memory leak occurs when parsing bigs amount of data. Reference counting might be incorrect in some error situations, but that's work future work. The C extension itself is called _yajl2. Its functionality is in turn exposed via the yajl2_c python module which is what users should finally import. Signed-off-by: Rodrigo Tobar <[email protected]>
The new backend requires a C compiler and the YAJL development headers and library to be present on the system. If the conditions are not right we don't include extension to the setup process. Signed-off-by: Rodrigo Tobar <[email protected]>
Signed-off-by: Rodrigo Tobar <[email protected]>
Tests run but were not reporting which backends were not being tested. The find_yajl_ctypes function also failed sometimes with OSError instead of YAJLImportError. Signed-off-by: Rodrigo Tobar <[email protected]>
Using YAJL2 means that we test more backends (i.e., python, yajl2, yajl2_cffi and yajl2_c) than we currently do with libyajl1 (i.e., python and yajl only). One can see the difference in the number of tests run (32 against 58). Signed-off-by: Rodrigo Tobar <[email protected]>
This is purely for aeasthetic reasons. There was also a bit a duplicated code in the macro definitions which has now been unified. Signed-off-by: Rodrigo Tobar <[email protected]>
Hi Ivan, Any chance to get this work reviewed and integrated? |
gen->path always contains at least 1 element, and when a map_start event is found a new None element is added to the list. Thus the logic to find whether this map key is the first element of the prefix returned by this generator was wrong and generated ".key" for a JSON like '{"key": 1}' I've added a unit test to make sure this problem doesn't arise again. Signed-off-by: Rodrigo Tobar <[email protected]>
Yes! Looks like great stuff. I'll look at it once I get around ijson again. Being ve-e-ery busy at the moment :-) |
This new way allow for environments to be passed down (CC, CFLAGS et al.) and also leaves less temporary files behind. Signed-off-by: Rodrigo Tobar <[email protected]>
ijson/backends/_yajl2.c
Outdated
* ICRAR - International Centre for Radio Astronomy Research | ||
* (c) UWA - The University of Western Australia, 2016 | ||
* Copyright by UWA (in the framework of the ICRAR) | ||
* All rights reserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's compatible with the BSD license of ijson. The copyright statement itself is fine of course, since you're the author, but "All rights reserved" specifically contradicts the explicit grant of the BSD license to do pretty much anything. Would UWA agree to license this work under BSD, do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmm, good point, I haven't given much thought to that particular phrase really, I simply added it from the template I've seen in other codes of ours without realizing it conflicted with the license. I will have to bring this up and come back with an answer.
ijson/backends/yajl2_c.py
Outdated
from ijson import common | ||
from ._yajl2 import basic_parse as _basic_parse # @UnresolvedImport | ||
from ._yajl2 import parse as _parse # @UnresolvedImport | ||
from ._yajl2 import items as _items # @UnresolvedImport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming nit: I'd prefer _yajl2
renamed to pyyajl2
and importing the module itself instead of the individual objects, and refer to them as pyyajl2.basic_parse
, pyyajl2.parse
and pyyajl2.items
from code. This would also remove the need to start their names with _
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing the C module itself instead of its individual parts sounds fine, I will change that.
Regarding the name, what is the motivation for having pyyajl2 instead of _yajl2? In the python library python wrappers for C-written seem to be usually named likewise (e.g. ssl imports _ssl, socket imports _socket, struct imports _struct). Following that I named the C module _yajl2.c, without the final "c" in the name as it is already present in the extension. I think also that the name pyyajl2 somehow would imply that the extension is written in pure python, which in turn is confusing again since there is already a python backend.
Let me know if you want to reconsider the naming; otherwise I will go ahead with the names you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyyajl2
was just a subjective preference… But I see your point on consistency with other libraries, so let's leave the naming as is. Thanks for pointing it out!
tests.py
Outdated
@@ -205,6 +211,7 @@ def test_api(self): | |||
{'backend': import_module('ijson.backends.%s' % name)}, | |||
) | |||
except ImportError: | |||
print("Skipping backend", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would add unnecessary noise to tests output, let's not add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, no worries, I'll take that out.
Hi Rodrigo! I have finally found some time to review this pull request. It is a great addition and I'm looking forward to committing it after we iron out a few things I mentioned above. Thanks a lot! |
Hi Ivan, Thanks to you for taking the time to review these changes. I'm currently on holiday, and won't be back to the office until the last week of January, so I might or not be able to do something before then. I responded to your observations above, which all seem very reasonable. I'm expecting small feedback only on one small point, the rest I'll implement as requested. Thanks again! |
Signed-off-by: Rodrigo Tobar <[email protected]>
Signed-off-by: Rodrigo Tobar <[email protected]>
Thanks for a quick reply! I'll try to be more responsive as well :-) Take your holiday time! |
Hi, I have a need to iteratively parse very large JSON objects and I've decided to try this branch out for more speed.
and a simple code that replicates it from io import StringIO
import ijson.backends.yajl2_c as ijson
test = StringIO('[{"events": 1}]')
items = ijson.items(test, 'item.events') Thanks for working on this, cheers! |
@KenjiTakahashi Python 3.6 is fairly new, it wasn't around at the time of submitting these changes, and thus it hadn't been tested until now. I'll definitely have a look at this new issue once I'm back at work in ~1.5 weeks; until then you can use Python 3.5 which should work fine. |
This was due to the missing final NULL in he keyword list. This should have always have been there, but until CPython 3.6 the error wasn't evident (and even with 3.6 it's not always reproducible -- for instance, the unit tests all nicely run). I'm also extending the python versions used by Travis to include 3.5 and 3.6 for better coverage. Signed-off-by: Rodrigo Tobar <[email protected]>
This phrase is a formality that no longer has legal effects, but caused confusion with the permissive grants of the 3-clause BSD license under which ijson is distributed. Signed-off-by: Rodrigo Tobar <[email protected]>
Hi @isagalaev, I'm finally back at work! After some reading I learned that the "All rights reserved" phrase has no legal effect at all nowadays, and thus I simply removed it from the two files I added in this pull request. The importing of the C module has been changed as agreed, and the additional printing from the tests has been removed as well. I've also included a fix for the bug reported by @KenjiTakahashi, and extended the python versions used by travis to include 3.5 and 3.6. |
@isagalaev any plans for getting this in? |
pinging again |
Sorry, I'm trying to find time to merge it and release a new version. I thought I'd be able to do it during PyCon past weekend but didn't get around… |
Just the usual ping... |
It's been some time again, hasn't it :-)? |
setup.py
Outdated
libraries = ['yajl']) | ||
setupArgs['ext_modules'] = [yajl_ext] | ||
|
||
setup(**setupArgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no new line :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed a commit for this, was it an actual problem though or just aesthetic?
I haven't heard this is an actual issue that blocks compilation/usage, but on the other hand it doesn't hurt. Signed-off-by: Rodrigo Tobar <[email protected]>
@isagalaev Yearly ping - is there a chance you can take a look at this again and hopefully merge it? |
Hey everyone. I lost track of things happening here a while ago and don't remember why I didn't merge it in the first place. It's also unlikely I'll find time to get back to it any time soon. As far as I understand, @rtobar fork is alive and well, so I recommend everyone just use it. (I'll close the ticket to avoid confusion, but feel free to continue conversation if needed.) |
@isagalaev Assuming @rtobar is interested, would you be willing to give him PyPi privileges to make new releases? |
Sure, I wouldn't mind on my side. My user in PyPI is rtobar too. |
@rtobar hey! Added you on PyPI as an owner. I can't seem to remove myself, so feel free to either leave me there for precaution or take it over completely. |
Thanks Ivan! I already exercised this and uploaded a new 2.4 version of ijson to PyPI, including binary wheels for the main x64 linux/macos. |
Yay! Glad not to be a bottleneck anymore! And thanks to @wichert for kicking me in the right direction. |
Maintenance over this package was passed to new developer, based on previous maintainer comment: isagalaev/ijson#58 (comment) Package-Manager: Portage-2.3.78, Repoman-2.3.17 Signed-off-by: Zamarin Arthur <[email protected]>
Maintenance over this package was passed to new developer, based on previous maintainer comment: isagalaev/ijson#58 (comment) Package-Manager: Portage-2.3.78, Repoman-2.3.17 Signed-off-by: Zamarin Arthur <[email protected]> Signed-off-by: Michał Górny <[email protected]>
This pull request is to add a new backend based on a C extension. The C extension exposes three generators:
basic_parse
,parse
anditems
, which work the same way their counterparts in the other backends and in thecommon
module work. Reference counting seems correct, creating neither memory corruptions nor memory leaks when parsing big amounts of data.The main advantage of this new backend is its speed. Local tests indicate a gain of ~10x for the
items
generator, and even more for theparse
andbasic_parse
generators, when compared with the cffi backend. This makes this backend comparable to thejson
module in terms of speed, while still giving users an iterative way of consuming a JSON stream, and therefore keeping memory consumption low.I've re-organized the commits from their original form to make the pull request more compact and manageable. Commits include not only the new backend, but also their integration to the package installation procedure (optional, only if the compiler/libraries are found) and to the tests suite. I've also modified the travis configuration to use YAJL2 for testing, effectively testing more backends now (4 instead of 2).