Faster optimized frozen dictionary creation (3/n) #87688

Merged: 6 commits, Jun 22, 2023

Member

@adamsitnik adamsitnik commented Jun 16, 2023

  • Avoid the need for an Action<int, int> storeDestIndexFromSrcIndex callback: once the hash codes are no longer needed, write the destination indexes into the same caller-provided buffer and let the caller update the destination (1-4% gain)
    • I know it's hacky and it's just 1-4%, so please let me know what you think.
  • For cases where the key is an integer and we know the input is already unique (because it comes from a dictionary or a hash set) there is no need to create another hash set.
    • +15% gain for scenarios where the key was an integer (time), 13-19% allocations drop
    • Also, in cases where simply all hash codes are unique, we can iterate over a span rather than a hash set. Up to +5% gain where string keys turned out to have unique hash codes
| Type | Method | Toolchain | Size | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | this PR | 512 | 38.95 us | 0.84 | 34.48 KB | 0.81 |
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | #87630 | 512 | 46.29 us | 1.00 | 42.84 KB | 1.00 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | this PR | 512 | 63.21 us | 0.94 | 74.77 KB | 1.00 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | #87630 | 512 | 67.59 us | 1.00 | 74.88 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenSetOptimized | this PR | 512 | 52.80 us | 0.85 | 56.01 KB | 0.87 |
| CtorFromCollection<Int32> | FrozenSetOptimized | #87630 | 512 | 62.19 us | 1.00 | 64.27 KB | 1.00 |
| CtorFromCollection<String> | FrozenSetOptimized | this PR | 512 | 79.05 us | 0.94 | 77.05 KB | 1.00 |
| CtorFromCollection<String> | FrozenSetOptimized | #87630 | 512 | 83.85 us | 1.00 | 77.14 KB | 1.00 |
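
The first bullet (reusing the hash code buffer instead of calling back through a delegate) can be sketched roughly as follows. This is only an illustration under assumptions: `BufferReuseSketch`, `AssignDestinationIndexes` and `Reorder` are hypothetical names, and the grouping by `hashCode % numBuckets` is a simplified layout, not the actual FrozenHashTable code.

```csharp
using System;

static class BufferReuseSketch
{
    // Hypothetical helper: groups entries by (hashCode % numBuckets). Once a hash code
    // has been used for the last time, hashCodes[srcIndex] is overwritten with the
    // destination index assigned to that source element.
    public static void AssignDestinationIndexes(Span<int> hashCodes, int numBuckets)
    {
        // Count how many entries fall into each bucket (stored one slot to the right).
        int[] bucketStarts = new int[numBuckets + 1];
        foreach (int code in hashCodes)
        {
            bucketStarts[(uint)code % (uint)numBuckets + 1]++;
        }

        // Prefix sum: bucketStarts[b] becomes the first destination index of bucket b.
        for (int b = 0; b < numBuckets; b++)
        {
            bucketStarts[b + 1] += bucketStarts[b];
        }

        // Last read of each hash code: replace it with its destination index.
        for (int srcIndex = 0; srcIndex < hashCodes.Length; srcIndex++)
        {
            uint bucket = (uint)hashCodes[srcIndex] % (uint)numBuckets;
            hashCodes[srcIndex] = bucketStarts[bucket]++;
        }
    }

    // The caller, not the helper, moves the items into place; no delegate is involved.
    public static string[] Reorder(string[] source, Span<int> hashCodes, int numBuckets)
    {
        AssignDestinationIndexes(hashCodes, numBuckets);
        string[] destination = new string[source.Length];
        for (int srcIndex = 0; srcIndex < source.Length; srcIndex++)
        {
            destination[hashCodes[srcIndex]] = source[srcIndex];
        }
        return destination;
    }
}
```

The trade-off is the one the description calls hacky: after the call the buffer no longer holds hash codes, so the caller has to know that its contents changed meaning.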

…` by writing the destination indexes to the provided buffer with hashcodes and moving the responsibility to the caller (1-4% gain)
…ly is the first prime number that would give less than 5% collision rate for unique hash codes.

When bestNumCollisions was set to codes.Count (the number of unique hash codes), it meant "start the search assuming that current best collision rate is 100%".
The first iteration would then check all values, as any result would be better than 100% collision rate. It would set the new best collision rate, which then would be used by next iterations.

Setting bestNumCollisions to `codes.Count / 20 + 1` (just one more collision than 5%) at the beginning means: find me the first bucket that meets the criteria.

If none is found, the last prime number is returned, which matches the previous behavior.

+23% improvement
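
A rough sketch of the seeding idea from this commit message, under stated assumptions: `BucketSearchSketch`, `PickNumBuckets` and the `candidatePrimes` parameter are hypothetical, and the real CalcNumBuckets chooses its candidates differently. Only the initial value of `bestNumCollisions` and the early exit are the point here.

```csharp
using System;

static class BucketSearchSketch
{
    private const int BitsPerInt32 = 32;

    // Counts collisions for one candidate bucket count and gives up once `limit` is reached.
    private static int CountCollisions(ReadOnlySpan<int> uniqueCodes, int numBuckets, int limit)
    {
        int[] seenBuckets = new int[(numBuckets + BitsPerInt32 - 1) / BitsPerInt32];
        int numCollisions = 0;
        foreach (int code in uniqueCodes)
        {
            uint bucketNum = (uint)code % (uint)numBuckets;
            int mask = 1 << (int)bucketNum; // for int, C# uses only the low 5 bits of the shift count
            if ((seenBuckets[bucketNum / BitsPerInt32] & mask) != 0)
            {
                if (++numCollisions >= limit)
                {
                    break; // this candidate cannot meet the criteria
                }
            }
            else
            {
                seenBuckets[bucketNum / BitsPerInt32] |= mask;
            }
        }
        return numCollisions;
    }

    // Returns the first candidate with fewer than (count / 20 + 1) collisions,
    // or the last candidate if none qualifies.
    public static int PickNumBuckets(ReadOnlySpan<int> uniqueCodes, ReadOnlySpan<int> candidatePrimes)
    {
        int bestNumCollisions = uniqueCodes.Length / 20 + 1; // one more collision than 5%
        foreach (int numBuckets in candidatePrimes)
        {
            if (CountCollisions(uniqueCodes, numBuckets, bestNumCollisions) < bestNumCollisions)
            {
                return numBuckets;
            }
        }
        return candidatePrimes[candidatePrimes.Length - 1];
    }
}
```

For ten unique codes the threshold is 1, so in this sketch the search stops at the first candidate that produces no collisions at all.
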
…y unique (because it comes from a dictionary or a hash set) there is no need to create another hash set

Also, in cases where simply all hash codes are unique, we can iterate over a span rather than a hash set

+9% gain for scenarios where the key was an integer (time), 10-20% allocations drop
up to +5% gain where string keys turned out to have unique hash codes
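
As a minimal sketch of the uniqueness shortcut (hypothetical names; the `knownUnique` flag stands in for "the keys are integers and the input came from a dictionary or a hash set"), the extra hash set is only built when uniqueness is not already guaranteed:

```csharp
using System;
using System.Collections.Generic;

static class UniqueCodesSketch
{
    // Returns the number of distinct hash codes. `allocatedSet` reports whether a
    // HashSet<int> had to be created to find out.
    public static int CountDistinct(ReadOnlySpan<int> hashCodes, bool knownUnique, out bool allocatedSet)
    {
        allocatedSet = false;
        if (knownUnique)
        {
            // Assumption: for int keys the hash code is the key itself, so unique
            // keys imply unique hash codes and the span can be used as-is.
            return hashCodes.Length;
        }

        allocatedSet = true;
        var uniqueCodes = new HashSet<int>();
        foreach (int code in hashCodes)
        {
            uniqueCodes.Add(code);
        }

        // If uniqueCodes.Count == hashCodes.Length, every code turned out to be unique
        // and later passes could iterate over the span instead of the set.
        return uniqueCodes.Count;
    }
}
```
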

ghost commented Jun 16, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details
  • Avoid the need for an Action<int, int> storeDestIndexFromSrcIndex callback: once the hash codes are no longer needed, write the destination indexes into the same caller-provided buffer and let the caller update the destination (1-4% gain)
    • I know it's hacky and it's just 1-4%, so please let me know what you think.
  • CalcNumBuckets searches for the best number of buckets, which currently is the first prime number that would give less than 5% collision rate for unique hash codes. When bestNumCollisions was set to codes.Count (the number of unique hash codes), it meant "start the search assuming that current best collision rate is 100%". The first iteration would then check all values, as any result would be better than 100% collision rate. It would set the new best collision rate, which then would be used by next iterations.
    • Setting bestNumCollisions to codes.Count / 20 + 1 (just one more collision than 5%) at the beginning means: find me the first bucket that meets the criteria and quickly break for buckets that don't.
    • If none is found, the last prime number is returned, which matches the previous behavior, assuming that the biggest prime number always produces the best result.
    • +23% improvement for most collections!
  • For cases where the key is an integer and we know the input is already unique (because it comes from a dictionary or a hash set) there is no need to create another hash set.
    • Also, in cases where simply all hash codes are unique, we can iterate over a span rather than a hash set.
    • +9% gain for scenarios where the key was an integer (time), 10-20% allocations drop
    • up to +5% gain where string keys turned out to have unique hash codes
| Type | Method | Job | Size | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | this PR | 512 | 24.05 us | 0.52 | 34.48 KB | 0.80 |
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | #87630 | 512 | 44.46 us | 0.96 | 42.84 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | #87510 | 512 | 46.12 us | 1.00 | 42.9 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenDictionaryOptimized | before #87510 | 512 | 46.70 us | 1.00 | 42.9 KB | 1.00 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | this PR | 512 | 46.33 us | 0.51 | 74.77 KB | 0.88 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | #87630 | 512 | 67.60 us | 0.74 | 74.88 KB | 0.88 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | #87510 | 512 | 75.48 us | 0.83 | 74.91 KB | 0.88 |
| CtorFromCollection<String> | FrozenDictionaryOptimized | before #87510 | 512 | 91.57 us | 1.00 | 85.21 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenSetOptimized | this PR | 512 | 37.02 us | 0.61 | 56.01 KB | 0.87 |
| CtorFromCollection<Int32> | FrozenSetOptimized | #87630 | 512 | 59.76 us | 0.98 | 64.27 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenSetOptimized | #87510 | 512 | 63.30 us | 1.04 | 64.35 KB | 1.00 |
| CtorFromCollection<Int32> | FrozenSetOptimized | before #87510 | 512 | 60.78 us | 1.00 | 64.35 KB | 1.00 |
| CtorFromCollection<String> | FrozenSetOptimized | this PR | 512 | 61.22 us | 0.60 | 77.05 KB | 0.88 |
| CtorFromCollection<String> | FrozenSetOptimized | #87630 | 512 | 83.31 us | 0.81 | 77.14 KB | 0.88 |
| CtorFromCollection<String> | FrozenSetOptimized | #87510 | 512 | 89.29 us | 0.86 | 77.2 KB | 0.88 |
| CtorFromCollection<String> | FrozenSetOptimized | before #87510 | 512 | 103.28 us | 1.00 | 87.51 KB | 1.00 |

My last idea is to use binary search in CalcNumBuckets to save time when searching for the best number of buckets. It should work if my assumption (the larger the prime number, the lower the collision ratio) is correct. I am going to write an app that generates tons of random inputs to verify this assumption. If somebody knows the answer already, please let me know ;)

Author: adamsitnik
Assignees: -
Labels: area-System.Collections, tenet-performance

Milestone: -

@adamsitnik adamsitnik marked this pull request as draft June 19, 2023 06:53
@adamsitnik
Member Author

I wrote a small utility for testing and found some differences from the old code. Marking as DRAFT; I will mark it as ready for review once I solve the problem.

… currently is the first prime number that would give less than 5% collision rate for unique hash codes."

as it's not finished yet

This reverts commit 4014ff8.

# Conflicts:
#	src/libraries/System.Collections.Immutable/src/System/Collections/Frozen/FrozenHashTable.cs
@adamsitnik adamsitnik marked this pull request as ready for review June 19, 2023 07:39
@adamsitnik
Member Author

I've decided to revert 4014ff8 for now and offer this PR with the two remaining improvements. I will send a separate PR with the improved CalcNumBuckets logic.

@adamsitnik
Member Author

FWIW I've tried one more approach: 9ca2177

To filter out duplicate codes, I tried to sort the hash codes and just skip the duplicates (previous value == current).

Allocations dropped by 6-21%, but CPU time regressed by 1-19%.
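
The sort-and-skip approach can be sketched as below. This is a hypothetical standalone version, not the code from 9ca2177: it sorts the buffer in place and keeps a value only when it differs from the last kept one, so no hash set is allocated.

```csharp
using System;

static class SortDedupSketch
{
    // Sorts the buffer in place and compacts it so that the first `count`
    // elements are the distinct hash codes. Returns `count`.
    public static int SortAndDeduplicate(Span<int> hashCodes)
    {
        if (hashCodes.IsEmpty)
        {
            return 0;
        }

        hashCodes.Sort(); // MemoryExtensions.Sort, available since .NET 5

        int count = 1;
        for (int i = 1; i < hashCodes.Length; i++)
        {
            // After sorting, duplicates are adjacent: keep a value only if it
            // differs from the last value that was kept.
            if (hashCodes[i] != hashCodes[count - 1])
            {
                hashCodes[count++] = hashCodes[i];
            }
        }
        return count;
    }
}
```

Sorting costs O(n log n) compared with roughly linear hash set insertion, which fits the reported pattern of fewer allocations but more CPU time.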

Comment on lines +257 to 274
foreach (int code in hashCodes)
{
    uint bucketNum = (uint)code % (uint)numBuckets;
    if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
    {
        numCollisions++;
        if (numCollisions >= bestNumCollisions)
        {
            // If we've already hit the previously known best number of collisions,
            // there's no point in continuing as worst case we'd just use that.
            break;
        }
    }
    else
    {
        seenBuckets[bucketNum / BitsPerInt32] |= 1 << (int)bucketNum;
    }
}
Member

After all of the work done to clone the inputs, allocate the dictionaries, analyze the keys, and so on, iterating over a span instead of a HashSet really provides a meaningful enough gain to duplicate this?

Member Author

> After all of the work done to clone the inputs, allocate the dictionaries, analyze the keys, and so on, iterating over a span instead of a HashSet really provides a meaningful enough gain to duplicate this?

Yes: "Up to +5% gain where string keys turned out to have unique hash codes"

Member

I saw that in the PR description, but I'm still skeptical.

Member Author

I've used a profiler and benchmarked it more than once. This loop is the hottest place in the entire process of creating frozen dictionaries/hash sets (because all other parts got optimized in other PRs and are now relatively cheap).

Member

Still skeptical :) But putting aside my skepticism, can you at least dedup this by putting it into an aggressively-inlined helper?

Member Author

@stephentoub do you mind if I do that in my next PR that is going to touch this area?

Member

Sure
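
One possible shape for the requested deduplication, sketched under assumptions: `IsBucketFirstVisit` and `CountCollisions` are hypothetical names, and the bit array is allocated per call here for brevity. The per-code check lives in one aggressively-inlined helper, so the span loop and the hash set loop differ only in what they enumerate. The helper also computes the bit mask once.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

static class CollisionHelperSketch
{
    private const int BitsPerInt32 = 32;

    // Returns true and marks the bucket if it was not seen before; false on a collision.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static bool IsBucketFirstVisit(int code, int numBuckets, int[] seenBuckets)
    {
        uint bucketNum = (uint)code % (uint)numBuckets;
        int bucketMask = 1 << (int)bucketNum; // computed once, used for both test and set
        ref int word = ref seenBuckets[bucketNum / BitsPerInt32];
        if ((word & bucketMask) != 0)
        {
            return false;
        }
        word |= bucketMask;
        return true;
    }

    // Fast path: all hash codes are unique, so the span is enumerated directly.
    public static int CountCollisions(ReadOnlySpan<int> codes, int numBuckets, int bestNumCollisions)
    {
        int[] seenBuckets = new int[(numBuckets + BitsPerInt32 - 1) / BitsPerInt32];
        int numCollisions = 0;
        foreach (int code in codes)
        {
            if (!IsBucketFirstVisit(code, numBuckets, seenBuckets) && ++numCollisions >= bestNumCollisions)
            {
                break; // already as bad as the best known candidate
            }
        }
        return numCollisions;
    }

    // General path: duplicates were filtered out through a HashSet<int>.
    public static int CountCollisions(HashSet<int> codes, int numBuckets, int bestNumCollisions)
    {
        int[] seenBuckets = new int[(numBuckets + BitsPerInt32 - 1) / BitsPerInt32];
        int numCollisions = 0;
        foreach (int code in codes)
        {
            if (!IsBucketFirstVisit(code, numBuckets, seenBuckets) && ++numCollisions >= bestNumCollisions)
            {
                break;
            }
        }
        return numCollisions;
    }
}
```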

{
    uint bucketNum = (uint)code % (uint)numBuckets;
    if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
Contributor

Would computing the bitmask value once (instead of in line 260 and 272) help anything?

uint bucketNum = (uint)code % (uint)numBuckets;
if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
Contributor

Would computing the bitmask value once (instead of in line 238 and 250) help anything? e.g.

int bucketMask = 1 << (int)bucketNum;
if ((seenBuckets[bucketNum / BitsPerInt32] & bucketMask) != 0)
{
    ...
}
else
{
    seenBuckets[bucketNum / BitsPerInt32] |= bucketMask;
}

@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2023