Skip to content

Commit

Permalink
bytes_terminate(): speed up by avoiding unnecessary memory copying
Browse files Browse the repository at this point in the history
This rewrite of the bytes_terminate() function fixes performance issues
of the old version without any known performance regressions.

`bytes.partition()` used before is quite fast on its own for what it's
supposed to do. However, there were two measurable performance issues in
our use of it, if we consider that the `data` argument is an immutable
`bytes` object (which is true in practice, since our read_bytes() method
returns `bytes` objects):

1. See https://docs.python.org/3/library/stdtypes.html#bytes.partition:

   > Split the sequence at the first occurrence of sep, and return a
   > 3-tuple containing the part before the separator, the separator
   > itself or its bytearray copy, **and the part after the separator**.

   The important part is that it creates and returns the part _after the
   separator_ even though we don't need it. Unfortunately, as of Python
   3.11.5, slicing an immutable `bytes` object incurs a memory copy,
   i.e. it is not a zero-copy operation. In other words, the
   bytes_terminate() function unnecessarily copied all bytes after the
   terminator, only to not use it afterward. Of course, it's faster to
   copy only the part up to the terminator - the earlier the terminator
   occurs, the faster the operation.

2. The `include_term=True` was significantly slower than
   `include_term=False`. This is because `bytes.partition()` always
   returns the part before the separator, but if we have a non-empty
   terminator byte to include, we have no other option than to
   concatenate two `bytes` objects. This copies memory because `bytes`
   objects are immutable.

See the benchmark at
https://gist.github.com/generalmimon/dcaec4d005d0d1a87bd237d94199203a
(results from Python 3.10.12 in WSL 2 on my computer are at the bottom) -
we can clearly see both mentioned issues. For `include_term=False`, if
the terminator is in the middle of a 128 KiB byte array, the new
implementation is 1.8x faster due to the issue 1) above. If the
terminator is near the end of these 128 KiB or is missing, there's no
measurable difference between the implementations. For
`include_term=True`, the new version is 2.0x faster than the old one if
the terminator is near the end due to 2) - the new impl has the same
performance regardless of `include_term`, whereas in the old impl
`include_term=True` was twice as slow as `include_term=False`.
  • Loading branch information
generalmimon committed Sep 28, 2023
1 parent d457617 commit 92a2d71
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions kaitaistruct.py
Original file line number Diff line number Diff line change
Expand Up @@ -445,10 +445,10 @@ def bytes_strip_right(data, pad_byte):

@staticmethod
def bytes_terminate(data, term, include_term):
new_data, term_byte, _ = data.partition(KaitaiStream.byte_from_int(term))
if include_term:
new_data += term_byte
return new_data
term_index = KaitaiStream.byte_array_index_of(data, term)
if term_index == -1:
return data[:]
return data[:term_index + (1 if include_term else 0)]

# endregion

Expand Down

0 comments on commit 92a2d71

Please sign in to comment.