- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.5k
Optimize String write #1651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize String write #1651
Conversation
- Remove extra bounds checking. - Add ASCII fast-loop similar to String.getBytes().
| That's related to #1629 🙏 | 
…ncoding tests. - Adjust logic to handle non-zero ByteBuffer.arrayOffset, as some Netty Pooled ByteBuffer implementations return an offset != 0. - Add unit tests for UTF-8 encoding across buffer boundaries and for malformed surrogate pairs. - Fix issue with a leacked reference count on ByteBufs in the pipe() method (2 non-released reference counts were retained). JAVA-5816
|  | ||
| if (c < 0x80) { | ||
| if (remaining == 0) { | ||
| buf = getNextByteBuffer(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still suggest to give at shot at the PR I made which use a separate index to access single bytes in the internalNio Buffer within the Netty buffers, for two reasons:
- NIO ByteBuffer can benefit from additional optimizations from the JVM since it's a known type to it
- Netty buffer read/write both move forward the internal indexes and force Netty to verify accessibility of the buffer for each operation, which have some Java Memory Model effects (.e.g. any subsequent load has to happen for real, each time!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - accessing the NIO buffer directly sounds like a potential win. I’m aiming to keep this PR focused and incremental for easier review and integration. We could consider Netty-specific optimizations in a follow-up PR/ scope, once we have Netty benchmarks running in CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good a couple of comments and one suggestion to help improve the readability of writeCharacters
        
          
                driver-core/src/main/com/mongodb/internal/connection/ByteBufferBsonOutput.java
          
            Show resolved
            Hide resolved
        
              
          
                driver-core/src/main/com/mongodb/internal/connection/ByteBufferBsonOutput.java
          
            Show resolved
            Hide resolved
        
              
          
                driver-core/src/main/com/mongodb/internal/connection/ByteBufferBsonOutput.java
          
            Show resolved
            Hide resolved
        
              
          
                driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonOutputTest.java
          
            Show resolved
            Hide resolved
        
      JAVA-5816
JAVA-5816
JAVA-5816
JAVA-5816
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
| } | ||
|  | ||
| // We get here, when the buffer is not backed by an array, or when the string contains at least one non-ASCII characters. | ||
| return writeOnBuffers(str, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we have within this a fast PATH for ASCII too?
It will grant more chances to get inlined (since is a smaller method) and be more unrolled...
If we can have a JMH bench it would be fairly easy (I can do it) to peek into the assembly produced to verify it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d expect the fast path for buffers to be in the else branch of if (curBuffer.hasArray()).
However, once we detect UTF-8 characters there, we call a fallback writeOnBuffers(maybe we could rename it to writeUtf8OnBuffers).
Are you suggesting we add a fast path similar to writeOnArrayAscii, but using dynamic buffer allocation and falling back to writeOnBuffers/writeUtf8OnBuffers  when a UTF-8 character is encountered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting we add a fast path similar to writeOnArrayAscii, but using dynamic buffer allocation and falling back to writeOnBuffers/writeUtf8OnBuffers when a UTF-8 character is encountered?
Yep, since I see the ascii path there is already taking care to change the buffer to write against instead of performing a lookup per each byte to write.
A tighter loop increase the chance it to be loop unrolled, although...the fact we can change the buffer where to write during the loop, can affect this - both for the array case and this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it makes sense. Thanks for the suggestion. I implemented writeOnBuffersAsccii and ran local benchmarks.
The implementation of writeOnBuffersAsccii
private int writeOnBuffersAsccii(final String str,
                                     final boolean checkNullTermination,
                                     final int stringPointer,
                                     final int bufferLimit,
                                     final int bufferPos,
                                     final ByteBuf buffer) {
        int remaining;
        int sp = stringPointer;
        int curBufferPos = bufferPos;
        int curBufferLimit = bufferLimit;
        ByteBuf curBuffer = buffer;
        final int length = str.length();
        while (sp < length) {
            remaining = curBufferLimit - curBufferPos;
            char c = str.charAt(sp);
            if (checkNullTermination && c == 0) {
                throw new BsonSerializationException(
                        format("BSON cstring '%s' is not valid because it contains a null character " + "at index %d", str, sp));
            }
            if (c >= 0x80) {
                break;
            }
            if (remaining == 0) {
                curBuffer = getNextByteBuffer();
                curBufferPos = 0;
                curBufferLimit = curBuffer.limit();
            }
            curBuffer.put((byte) c);
            position++;
            sp++;
            curBufferPos++;
        }
        return sp;
    }
I printed the assembly to see the JIT’s behavior. It looks like the loop wasn’t unrolled by JIT - there’s only one charAt and put per iteration in the main loop. However, there seems to be an additional charAt in the buffer allocation path (after getNextByteBuffer), but it’s a separate code path and mostly an edge-case.
I’ve shared a GitHub Gist with the shortened assembly (keeping the key parts) and a pseudo-Java interpretation to show how the assembly might map back to the logic: Gist. Local perf showed modest gains likely limited by dynamic buffer allocation, as you noted. I’ll run more tests on a dedicated perf instance to confirm. If I missed anything in the assembly, please let me know!
I’m merging this PR for the current improvements, but I agree tighter loops or manual unrolling could be further explored, keeping in mind the maintainability trade-off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot see the assembly there , but the not decoded binary instead - did you miss the https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler so in your class path?
that will help me a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean something which uses compile command too as TechEmpower/FrameworkBenchmarks#9800 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re right - my earlier Gist showed raw hex. I recompiled on Oracle JDK 17.0.7 with hsdis for readable assembly. Thanks!
The main loop (0x0000000113a12c40–0x0000000113a12d3c) seems to have no unrolling:
Condition: 0x0000000113a12c40 (cmp w4, w15, line 453: sp < length).
charAt: One instance at 0x0000000113a12c60–0x0000000113a12c88 (line 455).
Checks: 
Null termination (0x0000000113a12c8c, line 456), ASCII (0x0000000113a12c94, line 460),
buffer space (0x0000000113a12c98, line 463).
put: One instance at 0x0000000113a12ca4–0x0000000113a12d20 (line 468).
Index Updates: 0x0000000113a12d24–0x0000000113a12d34 (lines 469–471).
Branch Back: 0x0000000113a12d3c: b.lt 0x0000000113a12c60, looping to 0x0000000113a12c40.
A second charAt seem to appear in the getNextByteBuffer path, not the main loop after compilation. I’ve created new Gist with the readable assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm It looks to me that the unrolling was having a factor of two, I will re read it again since I am more used to x86 asm :)
Anyway, I suggest to look at the PR I sent for this same optimization: having the check for remaining buffer space in the loop would bloat the loop body, reducing the chances that C2 will unroll it many times.
You should keep the loop as simple and branch free as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason why the Netty version loop body is too fat is because the JIT doesn't trust final fields and since you can get a new mongo ByteBuf at each iteration, the required amount of pointer chase (mongo buf, Netty swapped big, Netty buf, Unsafe..) is way to much; this prevent massively to have unrolling.
The reason why I have unwrapped till Netty or NIO, the buffers, is to be as per byte[], at the level in which the loop body can just use whatever is saved into a register (the reference of the address field in the Netty buffer) assuming it to be constant, and trust it.
# Conflicts: # driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonOutputTest.java
JAVA-5816
| 1 << 21, 1 << 22, 1 << 23, 1 << 24, 1 << 24), | ||
| byteBuffers.stream().map(ByteBuf::capacity).collect(toList())); | ||
| } finally { | ||
| byteBuffers.forEach(ByteBuf::release); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Netty was logging warnings due to a direct Netty ByteBuf not being released properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep leaving leak detection paranoid enabled for test is key to save leaks to happen
Summary
This PR optimizes string writing in BSON by minimizing redundant checks and internal method indirection in hot paths.
Key Changes:
put(). Logic follows the same fast path approach used inString.getBytes()for ASCII.str.charAt()instead ofCharacter.toCodePoint()to avoid unnecessary surrogate checks when not needed (e.g., for characters within the ASCII or 2-byte UTF-8 code unit range). Fall back only when multi-unit characters (e.g., 3-byte UTF-8) are detected.Performance analysis
To ensure accurate performance comparison and reduce noise, 11 versions (Comparison Versions) were aggregated and compared against a stable region of data around the Base Mainline Version. The percentage difference and z-score of the mean of the Comparison Versions were calculated relative to the Base Mainline Version’s stable region mean.
The following tables show improvements across two configurations:
ASCII Benchmark Suite (Regular JSON Workloads)
Perf analyzer results: Link
Augmented Benchmark Suite (UTF-8 Strings with 3-Byte Characters)
To evaluate performance on multi-byte UTF-8 input, the large_doc.json, small_doc.json, and tweet.json datasets were modified to use UTF-8 characters encoded with 3 bytes (code units). These changes were introduced on mainline via an Evergreen patch, and benchmark results were collected from:
Perf analyzer results: Link
JAVA-5816