Apply recent C optimizations to Java extension #725
base: master
As part of this I'll be trying to align more of the Java code with the C equivalents, with comments to indicate how they sync up. This should make it easier to keep them in sync in the future.
There is a lot of surrounding state here, so just take the hit of a Set and an Iterator rather than a big visitor object.
This change duplicates some code from JRuby to allow rendering the fixnum value to a shared byte array rather than allocating new for each value. Since fixnum dumping is a leaf operation, only one is needed per session.
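A minimal sketch of the shared-buffer idea, not the code from this PR: digits are rendered from the end of a reusable scratch array and then written out in one call. The class name `FixnumWriter` and its `write` method are hypothetical, and a plain `ByteArrayOutputStream` stands in for the extension's real output stream.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration; one instance would be reused per dump session.
final class FixnumWriter {
    // Long.MIN_VALUE needs 19 digits plus a sign, so 20 bytes always suffice.
    private final byte[] scratch = new byte[20];

    void write(ByteArrayOutputStream out, long value) {
        int pos = scratch.length;
        boolean negative = value < 0;
        // Work with a non-positive number so Long.MIN_VALUE does not overflow on negation.
        long n = negative ? value : -value;
        do {
            // n % 10 is zero or negative here, so subtracting it yields the digit character.
            scratch[--pos] = (byte) ('0' - (n % 10));
            n /= 10;
        } while (n != 0);
        if (negative) {
            scratch[--pos] = '-';
        }
        // Single bulk write of the rendered digits; no per-value allocation.
        out.write(scratch, pos, scratch.length - pos);
    }
}
```

Because the scratch array is fully consumed before the method returns, sharing one instance is safe as long as it is not used from multiple threads at once.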
I jumped down the optimization well and continued past the recent string optimizations on to the other types of dumpable objects. Strings are now faster than in CRuby, as are a few other cases in the encoder.rb benchmark, but many cases are still slower, sometimes at less than half the performance. Tracking results here: https://gist.github.com/headius/3e56d80656543bf2343f4b26f00bc446
Anonymous classes show up as unnamed, numbered classes in profiles, which makes the profiles difficult to read.
Rather than allocating a buffer to hold N copies of arrayNL, just write it N times. We're buffering into a stream anyway. This makes array dumping zero-alloc other than buffer growth.
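A rough sketch of the write-it-N-times approach, assuming a plain `byte[]` separator and a `ByteArrayOutputStream` target; the real code works with JRuby's `ByteList`, and the names `RepeatWriter` and `writeRepeated` are made up for this example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only.
final class RepeatWriter {
    // Writes `unit` to `out` `count` times without building an intermediate buffer.
    static void writeRepeated(ByteArrayOutputStream out, byte[] unit, int count) {
        // Read the length once outside the loop rather than on every iteration.
        int length = unit.length;
        for (int i = 0; i < count; i++) {
            out.write(unit, 0, length);
        }
    }
}
```

The only allocation left on this path is whatever the stream itself does when its backing array has to grow.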
Since there's a fixed number of types we have special dumping logic for, this abstraction just introduces overhead we don't need. This patch starts moving away from indirecting all dumps through the Handler abstraction and directly generating from the type switch. This also aligns better with the main loop of the C code and should inline and optimize better.
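To illustrate the shape of a direct type switch, here is a simplified sketch that uses plain Java types instead of JRuby objects. `DirectDump` and `dump` are hypothetical names, string escaping is deliberately omitted, and the actual generator dispatches on Ruby types rather than `instanceof` checks.

```java
import java.util.List;

// Hypothetical, simplified dumper: one method with a fixed set of branches
// instead of looking up a handler object per type.
final class DirectDump {
    static void dump(StringBuilder out, Object value) {
        if (value == null) {
            out.append("null");
        } else if (value instanceof String) {
            // Simplified: a real generator must escape quotes and control characters.
            out.append('"').append((String) value).append('"');
        } else if (value instanceof Long || value instanceof Integer) {
            out.append(((Number) value).longValue());
        } else if (value instanceof Boolean) {
            out.append(((Boolean) value).booleanValue());
        } else if (value instanceof List) {
            out.append('[');
            boolean first = true;
            for (Object element : (List<?>) value) {
                if (!first) {
                    out.append(',');
                }
                first = false;
                dump(out, element); // recurse directly, no handler indirection
            }
            out.append(']');
        } else {
            // Fallback for anything without special dumping logic.
            out.append('"').append(value).append('"');
        }
    }
}
```

With every branch visible in one method, the JIT has a small, fixed set of call sites to inline, which is the effect the commit message is after.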
The byte[] output stream used here extended ByteArrayOutputStream from the JDK, which synchronizes all mutation operations (like writes). Since this is only going to be used once within a given call stack, it needs no synchronization. This change more than triples the performance of a benchmark of dumping an array of empty arrays and should increase performance of all dump forms.
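For reference, an unsynchronized byte-array stream can be quite small. The sketch below is an assumption about the general shape, not the class added in this PR; `UnsyncByteOutput` is a hypothetical name and the initial capacity of 64 is arbitrary.

```java
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical unsynchronized replacement for ByteArrayOutputStream.
// Not thread-safe by design: intended for use within a single call stack.
final class UnsyncByteOutput extends OutputStream {
    private byte[] buf = new byte[64];
    private int count;

    @Override
    public void write(int b) {
        ensureCapacity(count + 1);
        buf[count++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) {
        ensureCapacity(count + len);
        System.arraycopy(src, off, buf, count, len);
        count += len;
    }

    // Grows the backing array by doubling, or to the needed size if that is larger.
    private void ensureCapacity(int needed) {
        if (needed > buf.length) {
            buf = Arrays.copyOf(buf, Math.max(buf.length * 2, needed));
        }
    }

    int size() {
        return count;
    }

    byte[] toByteArray() {
        return Arrays.copyOf(buf, count);
    }
}
```

None of the methods are declared `synchronized`, so writes avoid monitor acquisition entirely; the trade-off is that the caller must guarantee single-threaded use.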
* Return incoming array if only one repeat is needed and array is exact size.
* Only retrieve ByteList fields once for repeat writes.
A nice discovery: the default ByteArrayOutputStream we extend for our ByteList version uses |
The math is much faster here than array access, due to bounds checking and pointer dereferencing.
Java will generate accessor methods for private fields, burning some inlining budget.
Just catching up with all of @byroot's excellent optimization work.