Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
bytes_terminate(): speed up by avoiding unnecessary memory copying
This rewrite of the bytes_terminate() function fixes performance issues of the old version without any known performance regressions. `bytes.partition()` used before is quite fast on its own for what it's supposed to do. However, there were two measurable performance issues in our use of it, if we consider that the `data` argument is an immutable `bytes` object (which is true in practice, since our read_bytes() method returns `bytes` objects): 1. See https://docs.python.org/3/library/stdtypes.html#bytes.partition: > Split the sequence at the first occurrence of sep, and return a > 3-tuple containing the part before the separator, the separator > itself or its bytearray copy, **and the part after the separator**. The important part is that it creates and returns the part _after the separator_ even though we don't need it. Unfortunately, as of Python 3.11.5, slicing an immutable `bytes` object incurs a memory copy, i.e. it is not a zero-copy operation. In other words, the bytes_terminate() function unnecessarily copied all bytes after the terminator, only to not use it afterward. Of course, it's faster to copy only the part up to the terminator - the earlier the terminator occurs, the faster the operation. 2. The `include_term=True` was significantly slower than `include_term=False`. This is because `bytes.partition()` always returns the part before the separator, but if we have a non-empty terminator byte to include, we have no other option than to concatenate two `bytes` objects. This copies memory because `bytes` objects are immutable. See the benchmark at https://gist.github.com/generalmimon/dcaec4d005d0d1a87bd237d94199203a (results from Python 3.10.12 in WSL 2 on my computer are at the bottom) - we can clearly see both mentioned issues. For `include_term=False`, if the terminator is in the middle of a 128 KiB byte array, the new implementation is 1.8x faster due to the issue 1) above. If the terminator is near the end of these 128 KiB or is missing, there's no measurable difference between the implementations. For `include_term=True`, the new version is 2.0x faster than the old one if the terminator is near the end due to 2) - the new impl has the same performance regardless of `include_term`, whereas in the old impl `include_term=True` was twice as slow as `include_term=False`.
- Loading branch information