Add partition key hash baseline tests for Cosmos DB Python SDK#45136
analogrelay wants to merge 2 commits into main from
Conversation
Validates MurmurHash3 V1/V2 partition key hashing against baseline XML data covering singletons (undefined, null, true, false), numbers (zero, negative zero, epsilon, NaN, infinity, int64 extremes), strings (empty through 2KB), and hierarchical partition key lists. Uses the production code path (PartitionKey._write_for_hashing / _write_for_hashing_v2) including string suffix bytes. Baselines can be regenerated with UPDATE_BASELINE=1.
I built this as part of developing Azure/azure-sdk-for-go#26007 and thought it would be really good if our various SDKs had one set of common tests for computing EPKs. The .NET SDK has baseline tests like this, and this is a slight iteration on those (which I plan to roll in to other SDKs too).
Pull request overview
Adds baseline-driven tests to validate Cosmos DB partition key hashing outputs against known-good XML baselines, with an opt-in mode to regenerate baselines.
Changes:
- Introduces `test_partition_key_hash_baseline.py` to compute V1/V2 hashes from PartitionKey's hashing serialization and assert against XML baselines.
- Adds baseline XML fixtures for singletons, numbers, strings, and list (multi-hash) cases.
- Supports baseline regeneration via `UPDATE_BASELINE=1`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/test_partition_key_hash_baseline.py | Implements hash computation helpers, XML baseline load/write, parametrized assertions, and baseline regeneration flag. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Singletons.xml | Baseline expected hashes for undefined/null/bool cases. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Numbers.xml | Baseline expected hashes for numeric edge cases. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Strings.xml | Baseline expected hashes for string lengths (including long strings). |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Lists.xml | Baseline expected hashes for multi-component (concatenated) hashing. |
```python
suite = filename.rsplit('.', 2)[0].split('.')[-1]  # e.g. "Singletons"
cases = _load_test_cases(filename)
for case in cases:
    test_id = f"{suite}/{case['description']}"
    items.append((test_id, case, is_list))
```
suite = filename.rsplit('.', 2)[0].split('.')[-1] does not produce the intended suite name (e.g. for PartitionKeyHashBaselineTest.Singletons.xml it yields PartitionKeyHashBaselineTest, not Singletons). This makes the parametrized test IDs misleading and likely duplicates across baseline files. Consider extracting the suite via filename.rsplit('.', 1)[0].split('.')[-1] or filename.split('.')[-2].
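The difference between the two splits can be checked directly. A quick illustrative snippet (not part of the PR):

```python
filename = "PartitionKeyHashBaselineTest.Singletons.xml"

# Current code: rsplit('.', 2) strips both ".Singletons" and ".xml",
# leaving only the shared prefix as the "suite" name.
buggy = filename.rsplit('.', 2)[0].split('.')[-1]

# Suggested fix: strip only the ".xml" extension first, then take the
# last dot-separated component.
fixed = filename.rsplit('.', 1)[0].split('.')[-1]

print(buggy)  # PartitionKeyHashBaselineTest
print(fixed)  # Singletons
```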
```python
if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
    return float('-0.0')
if isinstance(parsed, int):
    return float(parsed)
```
_parse_partition_key_value converts every parsed JSON integer to float. Since PartitionKey._write_for_hashing*_core has a dedicated int code path, this prevents the baseline tests from exercising that branch (and any future differences between int vs float handling). Consider returning the parsed int as-is (except for the special -0 case) and letting _write_for_hashing* perform its own int-to-float casting.
Suggested change:

```diff
-if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
-    return float('-0.0')
-if isinstance(parsed, int):
-    return float(parsed)
+# Preserve negative zero semantics for the bare JSON number "-0".
+if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
+    return float('-0.0')
+# For all other cases, return the parsed value as-is so that integers
+# remain ints and exercise the int-specific hashing code paths.
```
```python
def _compute_hash_v1(value):
    """Compute V1 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing`` (which appends a
    0x00 suffix after strings), hashes with MurmurHash3-32, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing(value, ms)
    hash32 = murmurhash3_32(bytearray(ms.getvalue()), 0)
    h = int(hash32)
    low_bytes = list(h.to_bytes(8, byteorder='little'))
    high_bytes = [0] * 8
    uint128_bytes = low_bytes + high_bytes
    uint128_bytes.reverse()
    return ''.join('{:02X}'.format(b) for b in uint128_bytes)


def _compute_hash_v2(value):
    """Compute V2 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing_v2`` (which appends
    a 0xFF suffix after strings), hashes with MurmurHash3-128, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing_v2(value, ms)
    hash128 = murmurhash3_128(bytearray(ms.getvalue()), _UInt128(0, 0))
    ba = hash128.to_byte_array()
    ba_reversed = list(reversed(ba))
    return ''.join('{:02X}'.format(b) for b in ba_reversed)


def _compute_multi_hash_v1(values):
    """Compute V1 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v1(v) for v in values)


def _compute_multi_hash_v2(values):
    """Compute V2 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v2(v) for v in values)
```
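The V1 hex formatting above can be traced with a hypothetical hash value (`0xDEADBEEF` is illustrative only, not a real baseline value): the 32-bit hash occupies the low 64 bits of a UInt128, which is then emitted big-endian as 32 hex chars.

```python
# Hypothetical 32-bit hash value, used only to illustrate the UInt128
# formatting performed by _compute_hash_v1.
h = 0xDEADBEEF
low_bytes = list(h.to_bytes(8, byteorder='little'))  # hash in the low 64 bits
high_bytes = [0] * 8                                 # high 64 bits are zero for V1
uint128_bytes = low_bytes + high_bytes
uint128_bytes.reverse()                              # emit big-endian
print(''.join('{:02X}'.format(b) for b in uint128_bytes))
# 000000000000000000000000DEADBEEF
```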
The V1/V2 helpers here compute a raw MurmurHash output from _write_for_hashing*, but they skip additional steps used by the SDK’s production effective-partition-key computation: V1 truncates strings to 100 chars and normalizes large ints via _UInt32 before hashing (PartitionKey._get_effective_partition_key_for_hash_partitioning), and V2 clears the top bits after reversing the 128-bit hash (hash_bytes[0] &= 0x3F) in both single- and multi-hash (PartitionKey._get_effective_partition_key_for_hash_partitioning_v2 / _get_effective_partition_key_for_multi_hash_partitioning_v2). If the intent is to validate the SDK’s actual routing/hash behavior end-to-end, these baselines won’t match/catch regressions for those steps. Consider either (a) baselining the outputs of the production _get_effective_partition_key_for_* methods, or (b) explicitly documenting that these tests validate the raw MurmurHash result (and not the masked/truncated EPK form).
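To illustrate the V2 masking step referred to above: the production EPK path clears the top two bits of the first byte of the reversed 128-bit hash (`hash_bytes[0] &= 0x3F`), so the raw MurmurHash hex from the test helpers can differ from the EPK in its leading byte. A minimal sketch with a hypothetical leading byte:

```python
# Hypothetical leading byte of a reversed V2 hash; the mask keeps only
# the low six bits, as the production EPK computation does.
raw_first_byte = 0xF3
masked = raw_first_byte & 0x3F
print('{:02X} -> {:02X}'.format(raw_first_byte, masked))  # F3 -> 33
```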
Summary
Adds baseline tests validating MurmurHash3 V1/V2 partition key hashing against known-good values.
Test coverage (58 tests)
Design
- Uses the production code path (`PartitionKey._write_for_hashing` / `_write_for_hashing_v2`), including string suffix bytes, so the tests validate actual SDK behavior end-to-end.
- Baselines can be regenerated with `UPDATE_BASELINE=1 pytest tests/test_partition_key_hash_baseline.py`.
- Baseline XML files follow the .NET SDK's `PartitionKeyHashBaselineTest`, with expected hash values regenerated from the Python SDK's production serialization path.