
Add partition key hash baseline tests for Cosmos DB Python SDK #45136

Open
analogrelay wants to merge 2 commits into main from ashleyst/epk-tests
Conversation

@analogrelay
Member

Summary

Adds baseline tests validating MurmurHash3 V1/V2 partition key hashing against known-good values.

Test coverage (58 tests)

  • Singletons: undefined, null, true, false
  • Numbers: zero, negative zero, epsilon, NaN, infinity, int64 extremes, MaxSafeInteger
  • Strings: empty through 2KB (exercises V1 truncation at 100 chars)
  • Lists: 1/2/3-path hierarchical partition keys (multi-hash)

Design

  • Uses the production code path (PartitionKey._write_for_hashing / _write_for_hashing_v2) including string suffix bytes, so tests validate the actual SDK behavior end-to-end.
  • Baseline XML files can be regenerated via UPDATE_BASELINE=1 pytest tests/test_partition_key_hash_baseline.py.
  • Input test cases originated from the .NET SDK's PartitionKeyHashBaselineTest, with expected hash values regenerated from the Python SDK's production serialization path.
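To make the baseline-driven pattern concrete, here is a minimal sketch of how such a test might load a fixture and gate regeneration on `UPDATE_BASELINE`. The XML element and attribute names (`Result`, `Description`, `PartitionKey`, `PartitionKeyHash`, `PartitionKeyHashV2`) are illustrative assumptions, not necessarily the schema this PR uses:

```python
# Hypothetical sketch of the baseline-test loading pattern; element and
# attribute names are assumptions, not the SDK's actual fixture schema.
import os
import xml.etree.ElementTree as ET

# Opt-in regeneration flag, mirroring the UPDATE_BASELINE=1 workflow above.
UPDATE_BASELINE = os.environ.get("UPDATE_BASELINE") == "1"

BASELINE_XML = """
<Results>
  <Result Description="null value" PartitionKey="[null]"
          PartitionKeyHash="..." PartitionKeyHashV2="..."/>
</Results>
"""

def load_cases(xml_text):
    """Parse baseline cases into dicts of description/input/expected hashes."""
    root = ET.fromstring(xml_text)
    return [
        {
            "description": r.get("Description"),
            "partition_key": r.get("PartitionKey"),
            "expected_v1": r.get("PartitionKeyHash"),
            "expected_v2": r.get("PartitionKeyHashV2"),
        }
        for r in root.iter("Result")
    ]

cases = load_cases(BASELINE_XML)
```

In normal runs each case's computed hash is asserted against `expected_v1`/`expected_v2`; with the flag set, the file would instead be rewritten with freshly computed values.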

Validates MurmurHash3 V1/V2 partition key hashing against baseline XML
data covering singletons (undefined, null, true, false), numbers (zero,
negative zero, epsilon, NaN, infinity, int64 extremes), strings (empty
through 2KB), and hierarchical partition key lists.

Uses the production code path (PartitionKey._write_for_hashing /
_write_for_hashing_v2) including string suffix bytes. Baselines can be
regenerated with UPDATE_BASELINE=1.
@analogrelay
Member Author

I built this as part of developing Azure/azure-sdk-for-go#26007 and thought it would be really good if our various SDKs had one set of common tests for computing EPKs. The .NET SDK has baseline tests like this, and this is a slight iteration on those (which I plan to roll into other SDKs too).

@analogrelay analogrelay marked this pull request as ready for review March 10, 2026 22:25
@analogrelay analogrelay requested a review from a team as a code owner March 10, 2026 22:25
Copilot AI review requested due to automatic review settings March 10, 2026 22:25
Contributor

Copilot AI left a comment


Pull request overview

Adds baseline-driven tests to validate Cosmos DB partition key hashing outputs against known-good XML baselines, with an opt-in mode to regenerate baselines.

Changes:

  • Introduces test_partition_key_hash_baseline.py to compute V1/V2 hashes from PartitionKey’s hashing serialization and assert against XML baselines.
  • Adds baseline XML fixtures for singletons, numbers, strings, and list (multi-hash) cases.
  • Supports baseline regeneration via UPDATE_BASELINE=1.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • sdk/cosmos/azure-cosmos/tests/test_partition_key_hash_baseline.py: Implements hash computation helpers, XML baseline load/write, parametrized assertions, and the baseline regeneration flag.
  • sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Singletons.xml: Baseline expected hashes for undefined/null/bool cases.
  • sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Numbers.xml: Baseline expected hashes for numeric edge cases.
  • sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Strings.xml: Baseline expected hashes for string lengths (including long strings).
  • sdk/cosmos/azure-cosmos/tests/test_data/partition_key_hash_baseline/PartitionKeyHashBaselineTest.Lists.xml: Baseline expected hashes for multi-component (concatenated) hashing.

Comment on lines +227 to +231
suite = filename.rsplit('.', 2)[0].split('.')[-1]  # e.g. "Singletons"
cases = _load_test_cases(filename)
for case in cases:
    test_id = f"{suite}/{case['description']}"
    items.append((test_id, case, is_list))

Copilot AI Mar 10, 2026


suite = filename.rsplit('.', 2)[0].split('.')[-1] does not produce the intended suite name (e.g. for PartitionKeyHashBaselineTest.Singletons.xml it yields PartitionKeyHashBaselineTest, not Singletons). This makes the parametrized test IDs misleading and likely duplicates across baseline files. Consider extracting the suite via filename.rsplit('.', 1)[0].split('.')[-1] or filename.split('.')[-2].
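The behavior the comment describes is easy to check in isolation; this snippet contrasts the current expression with the two proposed fixes:

```python
# Demonstrates the suite-name extraction issue called out above.
filename = "PartitionKeyHashBaselineTest.Singletons.xml"

# Current expression: rsplit('.', 2) splits off BOTH ".Singletons" and ".xml",
# so index [0] is already the test-class prefix, not the suite name.
buggy = filename.rsplit('.', 2)[0].split('.')[-1]

# Proposed fixes: strip only the ".xml" extension first, or index directly.
fixed = filename.rsplit('.', 1)[0].split('.')[-1]
alt = filename.split('.')[-2]

print(buggy)  # PartitionKeyHashBaselineTest
print(fixed)  # Singletons
print(alt)    # Singletons
```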

Comment on lines +124 to +129
if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
    return float('-0.0')

if isinstance(parsed, int):
    return float(parsed)

Copilot AI Mar 10, 2026


_parse_partition_key_value converts every parsed JSON integer to float. Since PartitionKey._write_for_hashing*_core has a dedicated int code path, this prevents the baseline tests from exercising that branch (and any future differences between int vs float handling). Consider returning the parsed int as-is (except for the special -0 case) and letting _write_for_hashing* perform its own int-to-float casting.

Suggested change

    if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
        return float('-0.0')
    if isinstance(parsed, int):
        return float(parsed)

becomes:

    # Preserve negative zero semantics for the bare JSON number "-0".
    if isinstance(parsed, int) and parsed == 0 and value_str.strip().startswith('-'):
        return float('-0.0')
    # For all other cases, return the parsed value as-is so that integers
    # remain ints and exercise the int-specific hashing code paths.
Comment on lines +49 to +90
def _compute_hash_v1(value):
    """Compute V1 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing`` (which appends a
    0x00 suffix after strings), hashes with MurmurHash3-32, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing(value, ms)
    hash32 = murmurhash3_32(bytearray(ms.getvalue()), 0)
    h = int(hash32)
    low_bytes = list(h.to_bytes(8, byteorder='little'))
    high_bytes = [0] * 8
    uint128_bytes = low_bytes + high_bytes
    uint128_bytes.reverse()
    return ''.join('{:02X}'.format(b) for b in uint128_bytes)


def _compute_hash_v2(value):
    """Compute V2 partition key hash using the production serialisation path.

    Serialises *value* with ``PartitionKey._write_for_hashing_v2`` (which appends
    a 0xFF suffix after strings), hashes with MurmurHash3-128, and returns the
    result formatted as a 32-hex-char UInt128 string (big-endian, no dashes).
    """
    ms = BytesIO()
    PartitionKey._write_for_hashing_v2(value, ms)
    hash128 = murmurhash3_128(bytearray(ms.getvalue()), _UInt128(0, 0))
    ba = hash128.to_byte_array()
    ba_reversed = list(reversed(ba))
    return ''.join('{:02X}'.format(b) for b in ba_reversed)


def _compute_multi_hash_v1(values):
    """Compute V1 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v1(v) for v in values)


def _compute_multi_hash_v2(values):
    """Compute V2 multi-hash (concatenated per-element UInt128 hashes)."""
    return ''.join(_compute_hash_v2(v) for v in values)


Copilot AI Mar 10, 2026


The V1/V2 helpers here compute a raw MurmurHash output from _write_for_hashing*, but they skip additional steps used by the SDK’s production effective-partition-key computation: V1 truncates strings to 100 chars and normalizes large ints via _UInt32 before hashing (PartitionKey._get_effective_partition_key_for_hash_partitioning), and V2 clears the top bits after reversing the 128-bit hash (hash_bytes[0] &= 0x3F) in both single- and multi-hash (PartitionKey._get_effective_partition_key_for_hash_partitioning_v2 / _get_effective_partition_key_for_multi_hash_partitioning_v2). If the intent is to validate the SDK’s actual routing/hash behavior end-to-end, these baselines won’t match/catch regressions for those steps. Consider either (a) baselining the outputs of the production _get_effective_partition_key_for_* methods, or (b) explicitly documenting that these tests validate the raw MurmurHash result (and not the masked/truncated EPK form).
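The masking step the comment refers to can be sketched in isolation. `mask_v2_hash` below is a hypothetical helper, not SDK code; it only illustrates the `hash_bytes[0] &= 0x3F` operation applied to the byte-reversed 128-bit hash, which clears the top two bits of the most significant byte:

```python
def mask_v2_hash(hash_bytes):
    """Apply the V2 top-bit mask (hash_bytes[0] &= 0x3F) to reversed hash bytes."""
    masked = bytearray(hash_bytes)  # copy so the input is left untouched
    masked[0] &= 0x3F               # clear the two most significant bits
    return masked

raw = bytearray([0xFF] * 16)  # stand-in for a reversed 128-bit MurmurHash
masked = mask_v2_hash(raw)
print(f"{masked[0]:02X}")  # 3F
```

Because the helpers in this PR hash the raw MurmurHash output, a value whose masked EPK differs from its raw hash in that top byte would pass these baselines while still exercising a different routing key in production, which is the gap the comment is pointing at.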
