Add partition key hash baseline tests for Cosmos DB Python SDK#45136
analogrelay wants to merge 2 commits into main from
Conversation
Validates MurmurHash3 V1/V2 partition key hashing against baseline XML data covering singletons (undefined, null, true, false), numbers (zero, negative zero, epsilon, NaN, infinity, int64 extremes), strings (empty through 2KB), and hierarchical partition key lists. Uses the production code path (PartitionKey._write_for_hashing / _write_for_hashing_v2) including string suffix bytes. Baselines can be regenerated with UPDATE_BASELINE=1.
I built this as part of developing Azure/azure-sdk-for-go#26007 and thought it would be really good if our various SDKs had one set of common tests for computing EPKs. The .NET SDK has baseline tests like this, and this is a slight iteration on those (which I plan to roll in to other SDKs too).
Pull request overview
Adds baseline-driven tests to validate Cosmos DB partition key hashing outputs against known-good XML baselines, with an opt-in mode to regenerate baselines.
Changes:
- Introduces `test_partition_key_hash_baseline.py` to compute V1/V2 hashes from PartitionKey's hashing serialization and assert against XML baselines.
- Adds baseline XML fixtures for singletons, numbers, strings, and list (multi-hash) cases.
- Supports baseline regeneration via `UPDATE_BASELINE=1`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/test_partition_key_hash_baseline.py | Implements hash computation helpers, XML baseline load/write, parametrized assertions, and baseline regeneration flag. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Singletons.xml | Baseline expected hashes for undefined/null/bool cases. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Numbers.xml | Baseline expected hashes for numeric edge cases. |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Strings.xml | Baseline expected hashes for string lengths (including long strings). |
| sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Lists.xml | Baseline expected hashes for multi-component (concatenated) hashing. |
```python
suite = filename.rsplit('.', 2)[0].split('.')[-1]  # e.g. "Singletons"
cases = _load_test_cases(filename)
for case in cases:
    test_id = f"{suite}/{case['description']}"
    items.append((test_id, case, is_list))
```
suite = filename.rsplit('.', 2)[0].split('.')[-1] does not produce the intended suite name (e.g. for PartitionKeyHashBaselineTest.Singletons.xml it yields PartitionKeyHashBaselineTest, not Singletons). This makes the parametrized test IDs misleading and likely duplicates across baseline files. Consider extracting the suite via filename.rsplit('.', 1)[0].split('.')[-1] or filename.split('.')[-2].
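The difference between the two splits can be checked directly. A quick illustrative snippet (not part of the PR):

```python
filename = "PartitionKeyHashBaselineTest.Singletons.xml"

# Current code: rsplit('.', 2) strips both ".Singletons" and ".xml",
# leaving only the shared prefix as the "suite" name.
buggy = filename.rsplit('.', 2)[0].split('.')[-1]

# Suggested fix: strip only the ".xml" extension first, then take the
# last dot-separated component.
fixed = filename.rsplit('.', 1)[0].split('.')[-1]

print(buggy)  # PartitionKeyHashBaselineTest
print(fixed)  # Singletons
```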
```python
if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
    return float('-0.0')
if isinstance(parsed, int):
    return float(parsed)
```
_parse_partition_key_value converts every parsed JSON integer to float. Since PartitionKey._write_for_hashing*_core has a dedicated int code path, this prevents the baseline tests from exercising that branch (and any future differences between int vs float handling). Consider returning the parsed int as-is (except for the special -0 case) and letting _write_for_hashing* perform its own int-to-float casting.
Suggested change:

```diff
-if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
-    return float('-0.0')
-if isinstance(parsed, int):
-    return float(parsed)
+# Preserve negative zero semantics for the bare JSON number "-0".
+if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
+    return float('-0.0')
+# For all other cases, return the parsed value as-is so that integers
+# remain ints and exercise the int-specific hashing code paths.
```
```python
def _compute_hash_v1(value):
    """Compute V1 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing`` (which appends a
    0x00 suffix after strings), hashes with MurmurHash3-32, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing(value, ms)
    hash32 = murmurhash3_32(bytearray(ms.getvalue()), 0)
    h = int(hash32)
    low_bytes = list(h.to_bytes(8, byteorder='little'))
    high_bytes = [0] * 8
    uint128_bytes = low_bytes + high_bytes
    uint128_bytes.reverse()
    return ''.join('{:02X}'.format(b) for b in uint128_bytes)


def _compute_hash_v2(value):
    """Compute V2 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing_v2`` (which appends
    a 0xFF suffix after strings), hashes with MurmurHash3-128, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing_v2(value, ms)
    hash128 = murmurhash3_128(bytearray(ms.getvalue()), _UInt128(0, 0))
    ba = hash128.to_byte_array()
    ba_reversed = list(reversed(ba))
    return ''.join('{:02X}'.format(b) for b in ba_reversed)


def _compute_multi_hash_v1(values):
    """Compute V1 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v1(v) for v in values)


def _compute_multi_hash_v2(values):
    """Compute V2 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v2(v) for v in values)
```
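The V1 hex formatting above can be traced with a hypothetical hash value (`0xDEADBEEF` is illustrative only, not a real baseline value): the 32-bit hash occupies the low 64 bits of a UInt128, which is then emitted big-endian as 32 hex chars.

```python
# Hypothetical 32-bit hash value, used only to illustrate the UInt128
# formatting performed by _compute_hash_v1.
h = 0xDEADBEEF
low_bytes = list(h.to_bytes(8, byteorder='little'))  # hash in the low 64 bits
high_bytes = [0] * 8                                 # high 64 bits are zero for V1
uint128_bytes = low_bytes + high_bytes
uint128_bytes.reverse()                              # emit big-endian
print(''.join('{:02X}'.format(b) for b in uint128_bytes))
# 000000000000000000000000DEADBEEF
```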
The V1/V2 helpers here compute a raw MurmurHash output from _write_for_hashing*, but they skip additional steps used by the SDK’s production effective-partition-key computation: V1 truncates strings to 100 chars and normalizes large ints via _UInt32 before hashing (PartitionKey._get_effective_partition_key_for_hash_partitioning), and V2 clears the top bits after reversing the 128-bit hash (hash_bytes[0] &= 0x3F) in both single- and multi-hash (PartitionKey._get_effective_partition_key_for_hash_partitioning_v2 / _get_effective_partition_key_for_multi_hash_partitioning_v2). If the intent is to validate the SDK’s actual routing/hash behavior end-to-end, these baselines won’t match/catch regressions for those steps. Consider either (a) baselining the outputs of the production _get_effective_partition_key_for_* methods, or (b) explicitly documenting that these tests validate the raw MurmurHash result (and not the masked/truncated EPK form).
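To illustrate the V2 masking step referred to above: the production EPK path clears the top two bits of the first byte of the reversed 128-bit hash (`hash_bytes[0] &= 0x3F`), so the raw MurmurHash hex from the test helpers can differ from the EPK in its leading byte. A minimal sketch with a hypothetical leading byte:

```python
# Hypothetical leading byte of a reversed V2 hash; the mask keeps only
# the low six bits, as the production EPK computation does.
raw_first_byte = 0xF3
masked = raw_first_byte & 0x3F
print('{:02X} -> {:02X}'.format(raw_first_byte, masked))  # F3 -> 33
```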
Summary
Adds baseline tests validating MurmurHash3 V1/V2 partition key hashing against known-good values.
Test coverage (58 tests)
Design
- Uses the production code path (`PartitionKey._write_for_hashing` / `_write_for_hashing_v2`), including string suffix bytes, so the tests validate actual SDK behavior end-to-end.
- Baselines can be regenerated with `UPDATE_BASELINE=1 pytest tests/test_partition_key_hash_baseline.py`.
- Baseline XML files follow the .NET SDK's `PartitionKeyHashBaselineTest`, with expected hash values regenerated from the Python SDK's production serialization path.