Improve buffer-accepting hashes and more #84

Merged 16 commits on Sep 17, 2024
45 changes: 29 additions & 16 deletions CHANGELOG.md
@@ -10,30 +10,43 @@ This project has adhered to

### Added

- Add `digest` functions that accept a non-immutable buffer as input
and process it without internal copying
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Slightly improve the performance of the `hash_bytes` function.
- Add support for Python 3.13.
- Add `digest` functions that support the new buffer protocol
([PEP 688](https://peps.python.org/pep-0688/)) as input
([#75](https://github.com/hajimes/mmh3/pull/75)).
These functions are implemented with
[METH_FASTCALL](https://docs.python.org/3/c-api/structures.html#c.METH_FASTCALL),
offering improved performance over legacy functions.
- Slightly improve the performance of the `hash_bytes()` function.
- Add Read the Docs documentation
([#54](https://github.com/hajimes/mmh3/issues/54)).
- (planned: Document benchmark results
([#53](https://github.com/hajimes/mmh3/issues/53))).

### Changed

- **Backward-incompatible**: The `seed` argument is now strictly validated to
ensure it falls within the range [0, 0xFFFFFFFF]. A `ValueError` is raised
if the seed is out of range.
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument
([#83](https://github.com/hajimes/mmh3/pull/83)).
- The type of flag arguments has been changed from `bool` to `Any`.
- Change the format of CHANGELOG.md to conform to the
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) standard
([#63](https://github.com/hajimes/mmh3/issues/63)).
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument.
([#63](https://github.com/hajimes/mmh3/pull/63)).

### Deprecated

- Deprecate the `hash_from_buffer()` function.
Use `mmh3_32_sintdigest()` or `mmh3_32_uintdigest()` as alternatives.

### Fixed

- Fix a reference leak in the `hash_from_buffer()` function
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/issues/76),
[#77](https://github.com/hajimes/mmh3/issues/77)).
([#75](https://github.com/hajimes/mmh3/pull/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/pull/76),
[#77](https://github.com/hajimes/mmh3/pull/77)).

## [4.1.0] - 2024-01-09

@@ -47,7 +60,7 @@ This project has adhered to
([#50](https://github.com/hajimes/mmh3/issues/50)).
- Fix incorrect type hints ([#51](https://github.com/hajimes/mmh3/issues/51)).
- Fix invalid results on s390x when the arg `x64arch` of `hash64` or
`hash_bytes` is set to `False`
`hash_bytes()` is set to `False`
([#52](https://github.com/hajimes/mmh3/issues/52)).

## [4.0.1] - 2023-07-14
@@ -97,8 +110,8 @@ This project has adhered to
[wouter bolsterlee](https://github.com/wbolster) and
[Dušan Nikolić](https://github.com/n-dusan)!
- Add support for 32-bit architectures such as `i686` and `armv7l`. From now on,
`hash` and `hash_from_buffer` on these architectures will generate the same
hash values as those on other environments. Thanks
`hash()` and `hash_from_buffer()` on these architectures will generate the
same hash values as those on other environments. Thanks
[Danil Shein](https://github.com/dshein-alt)!
- In relation to the above, `manylinux2014_i686` wheels are now available.
- Support for hashing huge data (>16GB). Thanks
@@ -134,13 +147,13 @@ This project has adhered to

### Fixed

- Bugfix for `hash_bytes`. Thanks [doozr](https://github.com/doozr)!
- Bugfix for `hash_bytes()`. Thanks [doozr](https://github.com/doozr)!

## [2.5] - 2017-10-28

### Added

- Add `hash_from_buffer`. Thanks [Dimitri Vorona](https://github.com/alendit)!
- Add `hash_from_buffer()`. Thanks [Dimitri Vorona](https://github.com/alendit)!
- Add a keyword argument `signed`.

## [2.4] - 2017-05-27
@@ -175,7 +188,7 @@ Thanks!

### Added

- Add `hash128`, which returns a 128-bit signed integer.
- Add `hash128()`, which returns a 128-bit signed integer.

### Fixed

60 changes: 25 additions & 35 deletions README.md
@@ -128,37 +128,50 @@ b'\x82_n\xdd \xac\xb6j\xef\x99\xb1e\xc4\n\xc9\xfd'

## Changelog

See [Changelog](https://mmh3.readthedocs.io/en/latest/changelog_link.html) for the
See [Changelog](https://mmh3.readthedocs.io/en/latest/changelog.html) for the
complete changelog.

### [Unreleased]

#### Added

- Add `digest` functions that accept a non-immutable buffer as input
and process it without internal copying
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Slightly improve the performance of the `hash_bytes` function.
- Add support for Python 3.13.
- Add `digest` functions that support the new buffer protocol
([PEP 688](https://peps.python.org/pep-0688/)) as input
([#75](https://github.com/hajimes/mmh3/pull/75)).
These functions are implemented with
[METH_FASTCALL](https://docs.python.org/3/c-api/structures.html#c.METH_FASTCALL),
offering improved performance over legacy functions.
- Slightly improve the performance of the `hash_bytes()` function.
- Add Read the Docs documentation
([#54](https://github.com/hajimes/mmh3/issues/54)).
- (planned: Document benchmark results
([#53](https://github.com/hajimes/mmh3/issues/53))).

#### Changed

- **Backward-incompatible**: The `seed` argument is now strictly validated to
ensure it falls within the range [0, 0xFFFFFFFF]. A `ValueError` is raised
if the seed is out of range.
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument
([#83](https://github.com/hajimes/mmh3/pull/83)).
- The type of flag arguments has been changed from `bool` to `Any`.
- Change the format of CHANGELOG.md to conform to the
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) standard
([#63](https://github.com/hajimes/mmh3/issues/63)).
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument.
([#63](https://github.com/hajimes/mmh3/pull/63)).

#### Deprecated

- Deprecate the `hash_from_buffer()` function.
Use `mmh3_32_sintdigest()` or `mmh3_32_uintdigest()` as alternatives.

#### Fixed

- Fix a reference leak in the `hash_from_buffer()` function
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/issues/76),
[#77](https://github.com/hajimes/mmh3/issues/77)).
([#75](https://github.com/hajimes/mmh3/pull/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/pull/76),
[#77](https://github.com/hajimes/mmh3/pull/77)).

### [4.1.0] - 2024-01-09

@@ -172,7 +185,7 @@ complete changelog.
([#50](https://github.com/hajimes/mmh3/issues/50)).
- Fix incorrect type hints ([#51](https://github.com/hajimes/mmh3/issues/51)).
- Fix invalid results on s390x when the arg `x64arch` of `hash64` or
`hash_bytes` is set to `False`
`hash_bytes()` is set to `False`
([#52](https://github.com/hajimes/mmh3/issues/52)).

## License
@@ -201,29 +214,6 @@ For compatibility with
[murmur3 (Go)](https://pkg.go.dev/github.com/spaolacci/murmur3), see
<https://github.com/hajimes/mmh3/issues/46>.

### Unexpected results when given non 32-bit seeds

In version 2.4, the type of a seed was changed from a signed 32-bit integer to
an unsigned 32-bit integer. However, the resulting values for signed seeds
remain unchanged from previous versions, as long as they are 32-bit.

```pycon
>>> mmh3.hash("aaaa", -1756908916) # signed representation for 0x9747b28c
1519878282
>>> mmh3.hash("aaaa", 2538058380) # unsigned representation for 0x9747b28c
1519878282
```

Be careful so that these seeds do not exceed 32-bit. Unexpected results may
happen with invalid values.

```pycon
>>> mmh3.hash("foo", 2 ** 33)
-156908512
>>> mmh3.hash("foo", 2 ** 34)
-156908512
```

## Contributing Guidelines

See [Contributing](https://mmh3.readthedocs.io/en/latest/CONTRIBUTING.html).
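
With seeds now validated against the range [0, 0xFFFFFFFF], callers that still hold signed 32-bit seeds (the case covered by the README section removed in this hunk) may need to convert them before hashing. A minimal sketch, assuming a hypothetical helper `normalize_seed` that is not part of mmh3:

```python
SEED_MAX = 0xFFFFFFFF  # largest seed accepted, per the changelog entry


def normalize_seed(seed: int) -> int:
    """Return the unsigned 32-bit form of a seed (hypothetical helper)."""
    # Accept signed 32-bit values for convenience; reject anything wider.
    if not -0x80000000 <= seed <= SEED_MAX:
        raise ValueError(f"seed {seed} does not fit in 32 bits")
    # Masking maps a negative two's-complement value to its unsigned twin.
    return seed & SEED_MAX


# The two seeds from the removed README example describe the same bit pattern.
print(hex(normalize_seed(-1756908916)))
print(normalize_seed(-1756908916) == normalize_seed(2538058380))
```

Both seeds from the removed README example map to `0x9747b28c`, so the converted value can be passed wherever the unsigned form is expected.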
4 changes: 2 additions & 2 deletions benchmark/plot_graph.py
@@ -124,8 +124,8 @@ def ordered_intersection(list1: list[T], list2: list[T]) -> list[T]:
plt.savefig(os.path.join(args.output_dir, BANDWIDTH_SMALL_FILE_NAME))

df_latency_all = df_latency * 1000
df_latency_all.index = df_latency_all.index / (1024 * 1024)
df_latency_all.plot(xlabel="Input size (MiB)", ylabel="Latency (ms)")
df_latency_all.index = df_latency_all.index / 1024
df_latency_all.plot(xlabel="Input size (KiB)", ylabel="Latency (ms)")
plt.savefig(os.path.join(args.output_dir, LATENCY_FILE_NAME))

df_latency_small = df_latency * 1000 * 1000 * 1000
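
The hunk above rescales the latency plot's x-axis from MiB to KiB. A small sketch of the same byte-to-unit conversion in plain Python; the helper `to_unit` and the sample sizes are illustrative assumptions, and the real script applies the division to a pandas index instead:

```python
KIB = 1024
MIB = 1024 * 1024


def to_unit(sizes_in_bytes: list[int], unit: int) -> list[float]:
    """Scale raw byte counts to the given binary unit (illustrative helper)."""
    return [size / unit for size in sizes_in_bytes]


# Sample sizes up to 262144 bytes, the maximum used in the CONTRIBUTING example.
sizes = [1024, 65536, 262144]
print(to_unit(sizes, KIB))  # values in KiB
print(to_unit(sizes, MIB))  # values in MiB
```

For inputs of this size the KiB values are whole numbers (1, 64, 256), whereas the MiB values all fall at or below 0.25, which makes the KiB axis easier to read.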
15 changes: 6 additions & 9 deletions docs/CONTRIBUTING.md
@@ -129,13 +129,13 @@ The idea of the subproject directory loosely follows the

### Updating mmh3 core C code

Run `tox -e build-cfiles`. This will fetch Appleby's original SMHasher project
Run `tox -e build_cfiles`. This will fetch Appleby's original SMHasher project
as a git submodule and then generate PEP 7-compliant C code from the original
project.

To perform further edits, add transformation code to the `refresh.py` script,
instead of editing `murmurhash3.*` files manually.
Then, run `tox -e build-cfiles` again to update the `murmurhash3.*` files.
Then, run `tox -e build_cfiles` again to update the `murmurhash3.*` files.

### Local files

@@ -153,8 +153,7 @@ Then, run `tox -e build-cfiles` again to update the `murmurhash3.*` files.
To run benchmarks locally, try the following command:

```shell
pip install ".[benchmark]"
python benchmark/benchmark.py -o OUTPUT_FILE \
tox -e benchmark -- -o OUTPUT_FILE \
--test-hash HASH_NAME --test-buffer-size-max HASH_SIZE
```

@@ -165,9 +164,8 @@ in bytes.
For example,

```shell
pip install ".[benchmark]"
mkdir results
python benchmark/benchmark.py -o results/mmh3_128.json \
mkdir -p _results
tox -e benchmark -- -o _results/mmh3_128.json \
--test-hash mmh3_128 --test-buffer-size-max 262144
```

@@ -182,8 +180,7 @@ After obtaining the benchmark results, you can plot graphs by `plot_graph.py`.
The following is an example of how to run the script:

```shell
pip install ".[benchmark,plot]"
python benchmark/plot_graph.py --output-dir docs/_static RESULT_DIR/*.json
tox -e plot -- --output-dir docs/_static RESULT_DIR/*.json
```

where `RESULT_DIR` is the directory containing the benchmark results.
4 changes: 2 additions & 2 deletions docs/api.md
@@ -22,9 +22,8 @@ UTF-8 encoding before hashing.

The following functions are used to hash types that implement the buffer
protocol such as `bytes`, `bytearray`, `memoryview`, and `numpy` arrays.
String inputs are also supported and are automatically converted to `bytes`
using UTF-8 encoding before hashing.

```{seealso}
The buffer protocol,
[originally implemented as a part of Python/C API](https://docs.python.org/3/c-api/buffer.html),
was formally defined as a Python-level API in
@@ -37,6 +36,7 @@ type hint
which is itself an alias for
[typing_extensions.Buffer](https://typing-extensions.readthedocs.io/en/latest/#typing_extensions.Buffer),
the backported type hint for `collections.abc.Buffer`.
```
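
One way to check whether an object exposes the buffer protocol is to try wrapping it in a `memoryview`. The sketch below uses only the standard library; `supports_buffer` is a hypothetical helper, not an mmh3 function:

```python
from array import array


def supports_buffer(obj: object) -> bool:
    """Return True if obj can be wrapped in a memoryview (hypothetical helper)."""
    try:
        # memoryview() raises TypeError for objects without the buffer protocol.
        with memoryview(obj):
            return True
    except TypeError:
        return False


for candidate in (b"abc", bytearray(b"abc"), array("B", [1, 2, 3]), "abc"):
    print(type(candidate).__name__, supports_buffer(candidate))
```

`str` does not expose the buffer protocol, which is consistent with the type stubs in this PR annotating keys as `Union[Buffer, str]` rather than `Buffer` alone.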

```{eval-rst}
.. autofunction:: mmh3.hash_from_buffer
File renamed without changes.
5 changes: 2 additions & 3 deletions docs/index.rst
@@ -6,9 +6,9 @@ mmh3 is a Python extension for `MurmurHash (MurmurHash3) <https://en.wikipedia.o
:maxdepth: 2
:caption: User Guideline

Quickstart<readme_link>
Quickstart<quickstart>
api
Changelog<changelog_link>
Changelog<changelog>

.. toctree::
:maxdepth: 2
@@ -21,5 +21,4 @@ Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
File renamed without changes.
14 changes: 6 additions & 8 deletions src/mmh3/__init__.pyi
@@ -2,26 +2,24 @@
from __future__ import annotations

import sys
from typing import Union, final
from typing import Any, Union, final

if sys.version_info >= (3, 12):
    from collections.abc import Buffer
else:
    from _typeshed import ReadableBuffer as Buffer

def hash(key: Union[bytes, str], seed: int = 0, signed: bool = True) -> int: ...
def hash(key: Union[bytes, str], seed: int = 0, signed: Any = True) -> int: ...
def hash_from_buffer(
key: Union[Buffer, str], seed: int = 0, signed: bool = True
key: Union[Buffer, str], seed: int = 0, signed: Any = True
) -> int: ...
def hash64(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True, signed: bool = True
key: Union[bytes, str], seed: int = 0, x64arch: Any = True, signed: Any = True
) -> tuple[int, int]: ...
def hash128(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True, signed: bool = False
key: Union[bytes, str], seed: int = 0, x64arch: Any = True, signed: Any = False
) -> int: ...
def hash_bytes(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True
) -> bytes: ...
def hash_bytes(key: Union[bytes, str], seed: int = 0, x64arch: Any = True) -> bytes: ...
def mmh3_32_digest(key: Union[Buffer, str], seed: int = 0) -> bytes: ...
def mmh3_32_sintdigest(key: Union[Buffer, str], seed: int = 0) -> int: ...
def mmh3_32_uintdigest(key: Union[Buffer, str], seed: int = 0) -> int: ...
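
Typing the flag arguments as `Any` suggests they are evaluated by truthiness rather than requiring a strict `bool`. A minimal sketch of that pattern, assuming a hypothetical pure-Python helper `as_int32`; the extension's actual C implementation is not shown in this diff:

```python
from typing import Any


def as_int32(value: int, signed: Any = True) -> int:
    """Interpret the low 32 bits of value as signed or unsigned (hypothetical)."""
    value &= 0xFFFFFFFF
    # Any truthy object selects the signed interpretation.
    if signed and value >= 0x80000000:
        return value - 0x100000000
    return value


print(as_int32(0xFFFFFFFF, signed=1))     # truthy int behaves like True
print(as_int32(0xFFFFFFFF, signed=None))  # falsy None behaves like False
```

Under this assumption, callers passing `0`, `1`, or `None` as flags would type-check against the updated stubs without an explicit `bool()` conversion.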