Refactor thrust::[stable_]partition[_copy] to use cub::DevicePartition #1435

Merged (4 commits) on Feb 28, 2024

@elstehle (Contributor) commented on Feb 26, 2024

Description

Closes #1397
Closes #1383

Changes:

  • Adds the option to take two distinct output iterators in AgentSelectIf: (1) one for the selected items and (2) one for the rejected items
    • This is required to implement thrust::[stable_]partition_copy interfaces
  • Adds dynamic 32/64-bit offset type-dispatch to thrust::[stable_]partition[_if]
  • Adds tests for a large number of items for thrust::stable_partition_if
  • Fixes an offset computation when writing rejected items for 64-bit offset types
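A rough host-side model of the two output modes may help when reading the first and last bullets. This is a sequential sketch, not the AgentSelectIf code, and partition_single_buffer and partition_two_outputs are hypothetical names. The single-buffer mode follows the documented cub::DevicePartition::If layout (selected items in input order at the front, rejected items in reverse order at the back), so the i-th rejected item lands at index num_items - 1 - i; the sketch does that arithmetic in a 64-bit type. The two-output mode writes rejected items in input order to their own range, which is what partition_copy needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model of the single-output layout documented for
// cub::DevicePartition::If: selected items are written to the front in input
// order, rejected items to the back in reverse input order.
// Returns the number of selected items.
template <typename T, typename Pred>
std::int64_t partition_single_buffer(const std::vector<T>& in, std::vector<T>& out, Pred pred)
{
  const std::int64_t num_items = static_cast<std::int64_t>(in.size());
  out.resize(in.size());
  std::int64_t num_selected = 0;
  std::int64_t num_rejected = 0;
  for (const T& x : in)
  {
    if (pred(x))
    {
      out[static_cast<std::size_t>(num_selected++)] = x;
    }
    else
    {
      // The i-th rejected item goes to num_items - 1 - i. Doing this arithmetic
      // in a 64-bit type keeps the index valid beyond 2^31 items.
      out[static_cast<std::size_t>(num_items - 1 - num_rejected++)] = x;
    }
  }
  assert(num_selected + num_rejected == num_items);
  return num_selected;
}

// Hypothetical model of the two-output mode needed for partition_copy:
// selected and rejected items each keep their input order in separate ranges.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>> partition_two_outputs(const std::vector<T>& in, Pred pred)
{
  std::vector<T> selected;
  std::vector<T> rejected;
  for (const T& x : in)
  {
    (pred(x) ? selected : rejected).push_back(x);
  }
  return {selected, rejected};
}
```

With in = {1, 2, 3, 4, 5, 6} and an is-even predicate, the single buffer becomes {2, 4, 6, 5, 3, 1} (rejected items reversed), whereas the two-output version yields {2, 4, 6} and {1, 3, 5}.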

Follow-On Tasks
The following tasks are left for follow-on PRs:

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle added the cub label (For all items related to CUB) on Feb 26, 2024
@elstehle requested review from a team as code owners on February 26, 2024, 19:00

@elstehle (Contributor, Author) commented on Feb 26, 2024

Performance changes look acceptable: there are very slight performance improvements for many of the 32-bit offset benchmarks and a few slight degradations of ~1% and 3% for 2^16 items in the flagged case. Larger problem sizes mostly fall within noise.

cub.bench.partition.if.base

[0] Tesla V100-SXM2-32GB

T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 8.766 us 6.55% 8.672 us 6.42% -0.094 us -1.07% PASS
I8 I32 2^20 1 14.548 us 3.03% 14.415 us 2.90% -0.133 us -0.91% PASS
I8 I32 2^24 1 100.123 us 1.01% 100.229 us 0.96% 0.106 us 0.11% PASS
I8 I32 2^28 1 1.478 ms 0.50% 1.481 ms 0.50% 3.936 us 0.27% PASS
I8 I32 2^16 0.544 8.633 us 5.89% 8.563 us 5.77% -0.070 us -0.81% PASS
I8 I32 2^20 0.544 14.575 us 3.07% 14.439 us 2.90% -0.136 us -0.93% PASS
I8 I32 2^24 0.544 101.978 us 1.12% 101.998 us 1.11% 0.021 us 0.02% PASS
I8 I32 2^28 0.544 1.509 ms 0.50% 1.513 ms 0.50% 3.591 us 0.24% PASS
I8 I32 2^16 0 8.614 us 5.87% 8.542 us 5.70% -0.072 us -0.84% PASS
I8 I32 2^20 0 14.654 us 3.29% 14.434 us 2.83% -0.219 us -1.50% PASS
I8 I32 2^24 0 95.358 us 0.79% 95.672 us 0.77% 0.314 us 0.33% PASS
I8 I32 2^28 0 1.382 ms 0.50% 1.387 ms 0.50% 4.706 us 0.34% PASS
I8 I64 2^16 1 8.740 us 5.90% 8.789 us 5.84% 0.049 us 0.56% PASS
I8 I64 2^20 1 15.567 us 3.11% 15.618 us 3.06% 0.052 us 0.33% PASS
I8 I64 2^24 1 113.090 us 0.67% 113.518 us 0.67% 0.428 us 0.38% PASS
I8 I64 2^28 1 1.687 ms 0.50% 1.693 ms 0.50% 6.097 us 0.36% PASS
I8 I64 2^16 0.544 8.752 us 6.22% 8.800 us 5.78% 0.048 us 0.55% PASS
I8 I64 2^20 0.544 15.662 us 5.99% 15.676 us 3.12% 0.014 us 0.09% PASS
I8 I64 2^24 0.544 114.709 us 1.07% 115.002 us 0.71% 0.293 us 0.26% PASS
I8 I64 2^28 0.544 1.715 ms 0.50% 1.721 ms 0.50% 6.216 us 0.36% PASS
I8 I64 2^16 0 8.814 us 11.64% 8.745 us 5.88% -0.069 us -0.79% PASS
I8 I64 2^20 0 15.746 us 4.07% 15.676 us 3.12% -0.069 us -0.44% PASS
I8 I64 2^24 0 109.658 us 1.15% 110.046 us 0.58% 0.387 us 0.35% PASS
I8 I64 2^28 0 1.612 ms 0.50% 1.619 ms 0.50% 7.059 us 0.44% PASS
I16 I32 2^16 1 8.844 us 11.97% 8.802 us 5.77% -0.042 us -0.47% PASS
I16 I32 2^20 1 15.711 us 6.58% 15.538 us 3.00% -0.173 us -1.10% PASS
I16 I32 2^24 1 122.303 us 1.32% 122.134 us 1.15% -0.169 us -0.14% PASS
I16 I32 2^28 1 1.820 ms 0.54% 1.818 ms 0.53% -1.646 us -0.09% PASS
I16 I32 2^16 0.544 8.968 us 11.54% 8.908 us 5.46% -0.060 us -0.67% PASS
I16 I32 2^20 0.544 16.030 us 6.46% 15.898 us 3.42% -0.132 us -0.82% PASS
I16 I32 2^24 0.544 125.554 us 1.55% 125.327 us 1.47% -0.227 us -0.18% PASS
I16 I32 2^28 0.544 1.873 ms 0.59% 1.869 ms 0.56% -3.555 us -0.19% PASS
I16 I32 2^16 0 8.933 us 10.77% 8.911 us 5.45% -0.022 us -0.25% PASS
I16 I32 2^20 0 15.891 us 6.34% 15.852 us 3.32% -0.040 us -0.25% PASS
I16 I32 2^24 0 113.816 us 1.40% 113.645 us 1.11% -0.171 us -0.15% PASS
I16 I32 2^28 0 1.671 ms 0.63% 1.670 ms 0.64% -1.616 us -0.10% PASS
I16 I64 2^16 1 9.048 us 11.20% 9.004 us 4.93% -0.044 us -0.49% PASS
I16 I64 2^20 1 16.913 us 6.03% 16.780 us 3.24% -0.133 us -0.79% PASS
I16 I64 2^24 1 131.581 us 1.11% 132.000 us 0.80% 0.418 us 0.32% PASS
I16 I64 2^28 1 1.965 ms 0.50% 1.971 ms 0.50% 5.459 us 0.28% PASS
I16 I64 2^16 0.544 9.156 us 11.40% 9.040 us 4.72% -0.115 us -1.26% PASS
I16 I64 2^20 0.544 17.265 us 5.90% 17.263 us 3.05% -0.003 us -0.02% PASS
I16 I64 2^24 0.544 134.803 us 1.29% 135.331 us 1.11% 0.528 us 0.39% PASS
I16 I64 2^28 0.544 2.024 ms 0.50% 2.031 ms 0.50% 7.294 us 0.36% PASS
I16 I64 2^16 0 9.996 us 11.10% 8.914 us 5.42% -1.082 us -10.82% FAIL
I16 I64 2^20 0 16.860 us 4.66% 16.873 us 3.27% 0.013 us 0.08% PASS
I16 I64 2^24 0 123.149 us 0.97% 123.642 us 0.74% 0.494 us 0.40% PASS
I16 I64 2^28 0 1.825 ms 0.50% 1.833 ms 0.50% 7.422 us 0.41% PASS
I32 I32 2^16 1 9.016 us 11.22% 8.919 us 5.38% -0.097 us -1.08% PASS
I32 I32 2^20 1 18.664 us 5.76% 18.532 us 2.98% -0.131 us -0.70% PASS
I32 I32 2^24 1 183.812 us 1.03% 183.890 us 0.83% 0.078 us 0.04% PASS
I32 I32 2^28 1 2.819 ms 0.60% 2.820 ms 0.59% 0.985 us 0.03% PASS
I32 I32 2^16 0.544 9.059 us 10.97% 9.030 us 4.86% -0.030 us -0.33% PASS
I32 I32 2^20 0.544 19.791 us 5.55% 19.788 us 3.13% -0.004 us -0.02% PASS
I32 I32 2^24 0.544 197.509 us 0.88% 197.573 us 0.76% 0.064 us 0.03% PASS
I32 I32 2^28 0.544 3.041 ms 0.68% 3.042 ms 0.68% 0.374 us 0.01% PASS
I32 I32 2^16 0 8.990 us 10.65% 9.012 us 4.97% 0.021 us 0.24% PASS
I32 I32 2^20 0 18.643 us 5.65% 18.737 us 3.24% 0.094 us 0.50% PASS
I32 I32 2^24 0 183.702 us 0.98% 183.593 us 0.82% -0.109 us -0.06% PASS
I32 I32 2^28 0 2.818 ms 0.60% 2.819 ms 0.59% 0.516 us 0.02% PASS
I32 I64 2^16 1 9.289 us 11.03% 9.315 us 4.24% 0.026 us 0.28% PASS
I32 I64 2^20 1 19.359 us 5.43% 19.320 us 2.89% -0.040 us -0.20% PASS
I32 I64 2^24 1 186.596 us 1.05% 185.781 us 0.85% -0.816 us -0.44% PASS
I32 I64 2^28 1 2.859 ms 0.58% 2.843 ms 0.58% -15.473 us -0.54% PASS
I32 I64 2^16 0.544 9.231 us 4.14% 9.125 us 4.45% -0.106 us -1.15% PASS
I32 I64 2^20 0.544 20.286 us 3.13% 20.169 us 3.66% -0.117 us -0.58% PASS
I32 I64 2^24 0.544 198.840 us 0.75% 198.626 us 0.78% -0.215 us -0.11% PASS
I32 I64 2^28 0.544 3.052 ms 0.65% 3.049 ms 0.66% -3.428 us -0.11% PASS
I32 I64 2^16 0 9.189 us 3.91% 9.110 us 7.38% -0.078 us -0.85% PASS
I32 I64 2^20 0 19.328 us 2.83% 19.281 us 2.93% -0.047 us -0.24% PASS
I32 I64 2^24 0 186.469 us 0.91% 185.629 us 0.83% -0.841 us -0.45% PASS
I32 I64 2^28 0 2.858 ms 0.61% 2.841 ms 0.60% -16.369 us -0.57% PASS
I64 I32 2^16 1 10.023 us 5.42% 9.971 us 4.90% -0.051 us -0.51% PASS
I64 I32 2^20 1 29.683 us 2.04% 29.330 us 2.50% -0.353 us -1.19% PASS
I64 I32 2^24 1 353.494 us 0.54% 348.736 us 0.50% -4.759 us -1.35% FAIL
I64 I32 2^28 1 5.533 ms 0.50% 5.454 ms 0.50% -78.174 us -1.41% FAIL
I64 I32 2^16 0.544 9.949 us 9.85% 10.343 us 4.56% 0.394 us 3.97% PASS
I64 I32 2^20 0.544 30.239 us 3.74% 30.146 us 2.40% -0.093 us -0.31% PASS
I64 I32 2^24 0.544 369.148 us 0.58% 365.842 us 0.50% -3.306 us -0.90% FAIL
I64 I32 2^28 0.544 5.792 ms 0.50% 5.735 ms 0.50% -57.295 us -0.99% FAIL
I64 I32 2^16 0 10.152 us 9.60% 10.100 us 4.75% -0.052 us -0.51% PASS
I64 I32 2^20 0 29.834 us 3.78% 29.369 us 2.45% -0.465 us -1.56% PASS
I64 I32 2^24 0 353.733 us 0.59% 348.666 us 0.50% -5.067 us -1.43% FAIL
I64 I32 2^28 0 5.537 ms 0.50% 5.454 ms 0.50% -83.238 us -1.50% FAIL
I64 I64 2^16 1 10.655 us 9.25% 10.634 us 4.79% -0.021 us -0.20% PASS
I64 I64 2^20 1 29.463 us 2.37% 30.132 us 2.02% 0.669 us 2.27% FAIL
I64 I64 2^24 1 349.440 us 0.50% 355.086 us 0.57% 5.645 us 1.62% FAIL
I64 I64 2^28 1 5.464 ms 0.50% 5.552 ms 0.50% 88.242 us 1.62% FAIL
I64 I64 2^16 0.544 10.327 us 4.18% 10.656 us 4.85% 0.329 us 3.19% PASS
I64 I64 2^20 0.544 30.220 us 2.31% 30.543 us 1.99% 0.323 us 1.07% PASS
I64 I64 2^24 0.544 366.696 us 0.50% 369.986 us 0.50% 3.289 us 0.90% FAIL
I64 I64 2^28 0.544 5.744 ms 0.50% 5.800 ms 0.50% 56.089 us 0.98% FAIL
I64 I64 2^16 0 10.466 us 4.57% 10.509 us 4.49% 0.044 us 0.42% PASS
I64 I64 2^20 0 29.630 us 2.50% 29.972 us 1.86% 0.341 us 1.15% PASS
I64 I64 2^24 0 349.471 us 0.50% 355.136 us 0.56% 5.665 us 1.62% FAIL
I64 I64 2^28 0 5.464 ms 0.50% 5.556 ms 0.50% 92.632 us 1.70% FAIL
I128 I32 2^16 1 12.390 us 3.50% 12.368 us 3.37% -0.022 us -0.18% PASS
I128 I32 2^20 1 51.502 us 1.35% 51.489 us 1.33% -0.013 us -0.03% PASS
I128 I32 2^24 1 706.376 us 0.45% 706.530 us 0.44% 0.153 us 0.02% PASS
I128 I32 2^28 1 11.192 ms 0.50% 11.192 ms 0.50% -0.016 us -0.00% PASS
I128 I32 2^16 0.544 12.422 us 3.70% 12.330 us 3.60% -0.092 us -0.74% PASS
I128 I32 2^20 0.544 51.545 us 1.34% 51.506 us 1.34% -0.039 us -0.08% PASS
I128 I32 2^24 0.544 706.362 us 0.44% 706.397 us 0.45% 0.035 us 0.00% PASS
I128 I32 2^28 0.544 11.191 ms 0.50% 11.192 ms 0.50% 0.919 us 0.01% PASS
I128 I32 2^16 0 12.421 us 3.77% 12.350 us 3.57% -0.071 us -0.57% PASS
I128 I32 2^20 0 51.491 us 1.35% 51.460 us 1.37% -0.031 us -0.06% PASS
I128 I32 2^24 0 706.441 us 0.44% 706.343 us 0.44% -0.098 us -0.01% PASS
I128 I32 2^28 0 11.191 ms 0.50% 11.192 ms 0.50% 0.257 us 0.00% PASS
I128 I64 2^16 1 11.951 us 4.45% 11.973 us 4.43% 0.022 us 0.19% PASS
I128 I64 2^20 1 51.562 us 1.28% 51.760 us 1.29% 0.198 us 0.38% PASS
I128 I64 2^24 1 706.360 us 0.43% 706.752 us 0.43% 0.392 us 0.06% PASS
I128 I64 2^28 1 11.189 ms 0.50% 11.192 ms 0.50% 2.485 us 0.02% PASS
I128 I64 2^16 0.544 11.972 us 4.43% 11.994 us 4.31% 0.023 us 0.19% PASS
I128 I64 2^20 0.544 51.590 us 1.29% 51.790 us 1.29% 0.201 us 0.39% PASS
I128 I64 2^24 0.544 706.650 us 0.42% 706.337 us 0.44% -0.312 us -0.04% PASS
I128 I64 2^28 0.544 11.189 ms 0.50% 11.192 ms 0.50% 3.546 us 0.03% PASS
I128 I64 2^16 0 11.857 us 4.54% 12.020 us 4.49% 0.163 us 1.38% PASS
I128 I64 2^20 0 51.624 us 1.30% 51.816 us 1.31% 0.191 us 0.37% PASS
I128 I64 2^24 0 706.475 us 0.41% 706.901 us 0.43% 0.427 us 0.06% PASS
I128 I64 2^28 0 11.189 ms 0.50% 11.192 ms 0.50% 3.289 us 0.03% PASS
F32 I32 2^16 1 8.981 us 5.42% 9.066 us 4.96% 0.085 us 0.95% PASS
F32 I32 2^20 1 18.595 us 3.04% 18.566 us 2.90% -0.029 us -0.16% PASS
F32 I32 2^24 1 183.631 us 0.80% 183.404 us 0.81% -0.227 us -0.12% PASS
F32 I32 2^28 1 2.943 ms 0.67% 2.943 ms 0.66% -0.340 us -0.01% PASS
F32 I32 2^16 0.544 9.054 us 4.69% 9.174 us 5.29% 0.120 us 1.32% PASS
F32 I32 2^20 0.544 19.759 us 3.72% 19.692 us 3.74% -0.066 us -0.34% PASS
F32 I32 2^24 0.544 195.814 us 0.75% 195.764 us 0.76% -0.050 us -0.03% PASS
F32 I32 2^28 0.544 3.013 ms 0.69% 3.013 ms 0.69% -0.332 us -0.01% PASS
F32 I32 2^16 0 9.037 us 4.94% 9.255 us 4.33% 0.218 us 2.41% PASS
F32 I32 2^20 0 18.617 us 3.23% 18.581 us 3.46% -0.037 us -0.20% PASS
F32 I32 2^24 0 183.270 us 0.83% 183.149 us 0.82% -0.121 us -0.07% PASS
F32 I32 2^28 0 2.815 ms 0.59% 2.816 ms 0.59% 0.737 us 0.03% PASS
F32 I64 2^16 1 9.195 us 4.12% 9.165 us 4.17% -0.030 us -0.32% PASS
F32 I64 2^20 1 19.508 us 2.80% 19.385 us 2.80% -0.123 us -0.63% PASS
F32 I64 2^24 1 187.190 us 0.91% 186.289 us 0.84% -0.901 us -0.48% PASS
F32 I64 2^28 1 2.964 ms 0.64% 2.958 ms 0.64% -5.249 us -0.18% PASS
F32 I64 2^16 0.544 9.264 us 4.10% 9.326 us 4.84% 0.062 us 0.67% PASS
F32 I64 2^20 0.544 20.396 us 3.52% 20.352 us 3.49% -0.044 us -0.21% PASS
F32 I64 2^24 0.544 197.149 us 0.78% 197.049 us 0.76% -0.099 us -0.05% PASS
F32 I64 2^28 0.544 3.021 ms 0.67% 3.019 ms 0.67% -1.613 us -0.05% PASS
F32 I64 2^16 0 9.153 us 4.13% 9.525 us 5.04% 0.372 us 4.06% PASS
F32 I64 2^20 0 19.458 us 2.82% 19.587 us 2.75% 0.129 us 0.66% PASS
F32 I64 2^24 0 186.895 us 0.93% 186.160 us 0.90% -0.735 us -0.39% PASS
F32 I64 2^28 0 2.863 ms 0.59% 2.850 ms 0.59% -13.348 us -0.47% PASS
F64 I32 2^16 1 10.048 us 4.78% 10.041 us 4.71% -0.007 us -0.07% PASS
F64 I32 2^20 1 29.267 us 2.48% 29.202 us 2.43% -0.065 us -0.22% PASS
F64 I32 2^24 1 348.445 us 0.50% 348.317 us 0.50% -0.129 us -0.04% PASS
F64 I32 2^28 1 5.453 ms 0.50% 5.451 ms 0.50% -2.004 us -0.04% PASS
F64 I32 2^16 0.544 9.913 us 5.24% 10.069 us 4.91% 0.156 us 1.57% PASS
F64 I32 2^20 0.544 30.047 us 2.41% 30.098 us 2.42% 0.051 us 0.17% PASS
F64 I32 2^24 0.544 364.642 us 0.50% 364.421 us 0.50% -0.221 us -0.06% PASS
F64 I32 2^28 0.544 5.715 ms 0.50% 5.711 ms 0.50% -4.296 us -0.08% PASS
F64 I32 2^16 0 9.905 us 5.28% 10.087 us 4.82% 0.182 us 1.84% PASS
F64 I32 2^20 0 29.259 us 2.50% 29.450 us 2.56% 0.191 us 0.65% PASS
F64 I32 2^24 0 348.491 us 0.50% 348.379 us 0.50% -0.112 us -0.03% PASS
F64 I32 2^28 0 5.453 ms 0.50% 5.451 ms 0.50% -1.994 us -0.04% PASS
F64 I64 2^16 1 10.489 us 4.55% 10.422 us 4.36% -0.067 us -0.64% PASS
F64 I64 2^20 1 29.541 us 2.36% 29.942 us 1.87% 0.401 us 1.36% PASS
F64 I64 2^24 1 349.718 us 0.50% 354.769 us 0.57% 5.051 us 1.44% FAIL
F64 I64 2^28 1 5.465 ms 0.50% 5.549 ms 0.50% 83.759 us 1.53% FAIL
F64 I64 2^16 0.544 11.401 us 3.72% 10.418 us 4.27% -0.983 us -8.62% FAIL
F64 I64 2^20 0.544 30.331 us 2.35% 30.671 us 1.91% 0.340 us 1.12% PASS
F64 I64 2^24 0.544 365.452 us 0.50% 368.403 us 0.50% 2.951 us 0.81% FAIL
F64 I64 2^28 0.544 5.720 ms 0.50% 5.771 ms 0.50% 51.532 us 0.90% FAIL
F64 I64 2^16 0 10.394 us 4.15% 10.425 us 4.37% 0.031 us 0.30% PASS
F64 I64 2^20 0 29.544 us 2.47% 30.109 us 1.97% 0.565 us 1.91% PASS
F64 I64 2^24 0 349.559 us 0.50% 354.973 us 0.55% 5.413 us 1.55% FAIL
F64 I64 2^28 0 5.464 ms 0.50% 5.553 ms 0.50% 89.015 us 1.63% FAIL

cub.bench.partition.flagged.base

[0] Tesla V100-SXM2-32GB

T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 8.824 us 6.82% 8.849 us 6.31% 0.025 us 0.29% PASS
I8 I32 2^20 1 15.416 us 2.88% 15.515 us 3.00% 0.099 us 0.64% PASS
I8 I32 2^24 1 106.269 us 1.04% 106.272 us 1.06% 0.003 us 0.00% PASS
I8 I32 2^28 1 1.582 ms 0.50% 1.583 ms 0.50% 0.960 us 0.06% PASS
I8 I32 2^16 0.544 8.710 us 5.91% 8.741 us 5.87% 0.031 us 0.36% PASS
I8 I32 2^20 0.544 15.329 us 3.04% 15.418 us 3.09% 0.089 us 0.58% PASS
I8 I32 2^24 0.544 113.370 us 1.33% 113.347 us 1.34% -0.023 us -0.02% PASS
I8 I32 2^28 0.544 1.677 ms 0.57% 1.676 ms 0.57% -0.630 us -0.04% PASS
I8 I32 2^16 0 8.728 us 5.89% 8.751 us 5.86% 0.023 us 0.27% PASS
I8 I32 2^20 0 15.474 us 2.96% 15.530 us 3.00% 0.056 us 0.36% PASS
I8 I32 2^24 0 108.160 us 0.99% 108.141 us 0.99% -0.019 us -0.02% PASS
I8 I32 2^28 0 1.563 ms 0.23% 1.562 ms 0.22% -0.848 us -0.05% PASS
I8 I64 2^16 1 8.914 us 5.38% 8.792 us 5.78% -0.122 us -1.37% PASS
I8 I64 2^20 1 16.880 us 3.34% 16.737 us 3.38% -0.143 us -0.85% PASS
I8 I64 2^24 1 123.868 us 0.68% 123.726 us 0.70% -0.142 us -0.11% PASS
I8 I64 2^28 1 1.884 ms 0.50% 1.885 ms 0.50% 0.809 us 0.04% PASS
I8 I64 2^16 0.544 8.927 us 5.31% 8.820 us 5.71% -0.107 us -1.20% PASS
I8 I64 2^20 0.544 16.940 us 3.17% 16.839 us 3.25% -0.101 us -0.60% PASS
I8 I64 2^24 0.544 129.237 us 0.87% 129.045 us 0.83% -0.192 us -0.15% PASS
I8 I64 2^28 0.544 1.928 ms 0.50% 1.930 ms 0.50% 2.029 us 0.11% PASS
I8 I64 2^16 0 8.948 us 5.17% 8.859 us 5.58% -0.089 us -1.00% PASS
I8 I64 2^20 0 17.095 us 3.10% 16.982 us 3.22% -0.113 us -0.66% PASS
I8 I64 2^24 0 125.438 us 0.61% 125.370 us 0.60% -0.068 us -0.05% PASS
I8 I64 2^28 0 1.844 ms 0.14% 1.844 ms 0.15% 0.685 us 0.04% PASS
I16 I32 2^16 1 8.821 us 5.72% 8.791 us 5.78% -0.029 us -0.33% PASS
I16 I32 2^20 1 16.358 us 2.73% 16.281 us 2.79% -0.077 us -0.47% PASS
I16 I32 2^24 1 133.408 us 0.96% 133.580 us 0.95% 0.172 us 0.13% PASS
I16 I32 2^28 1 2.010 ms 0.50% 2.009 ms 0.50% -0.805 us -0.04% PASS
I16 I32 2^16 0.544 8.785 us 5.84% 8.832 us 5.71% 0.047 us 0.53% PASS
I16 I32 2^20 0.544 16.558 us 2.98% 16.890 us 3.20% 0.332 us 2.00% PASS
I16 I32 2^24 0.544 141.890 us 1.33% 142.052 us 1.33% 0.162 us 0.11% PASS
I16 I32 2^28 0.544 2.159 ms 0.65% 2.160 ms 0.63% 0.877 us 0.04% PASS
I16 I32 2^16 0 8.788 us 5.82% 9.130 us 4.58% 0.342 us 3.89% PASS
I16 I32 2^20 0 16.482 us 2.83% 16.568 us 2.88% 0.087 us 0.53% PASS
I16 I32 2^24 0 133.112 us 1.00% 133.396 us 1.02% 0.284 us 0.21% PASS
I16 I32 2^28 0 1.998 ms 0.23% 1.997 ms 0.22% -0.526 us -0.03% PASS
I16 I64 2^16 1 8.928 us 5.31% 9.106 us 4.41% 0.178 us 2.00% PASS
I16 I64 2^20 1 17.697 us 3.02% 17.789 us 3.05% 0.092 us 0.52% PASS
I16 I64 2^24 1 145.932 us 0.73% 146.337 us 0.71% 0.405 us 0.28% PASS
I16 I64 2^28 1 2.205 ms 0.50% 2.207 ms 0.50% 1.738 us 0.08% PASS
I16 I64 2^16 0.544 8.953 us 5.21% 9.038 us 4.73% 0.085 us 0.95% PASS
I16 I64 2^20 0.544 17.808 us 3.13% 18.228 us 3.17% 0.420 us 2.36% PASS
I16 I64 2^24 0.544 155.460 us 1.04% 156.034 us 1.01% 0.574 us 0.37% PASS
I16 I64 2^28 0.544 2.375 ms 0.50% 2.383 ms 0.50% 8.311 us 0.35% PASS
I16 I64 2^16 0 9.006 us 4.96% 9.195 us 4.39% 0.189 us 2.10% PASS
I16 I64 2^20 0 17.854 us 3.01% 17.866 us 3.04% 0.012 us 0.07% PASS
I16 I64 2^24 0 146.991 us 0.79% 147.265 us 0.78% 0.274 us 0.19% PASS
I16 I64 2^28 0 2.217 ms 0.18% 2.220 ms 0.17% 2.782 us 0.13% PASS
I32 I32 2^16 1 9.420 us 4.93% 9.354 us 4.60% -0.066 us -0.70% PASS
I32 I32 2^20 1 20.314 us 3.39% 20.228 us 3.15% -0.086 us -0.42% PASS
I32 I32 2^24 1 202.951 us 0.65% 202.940 us 0.63% -0.011 us -0.01% PASS
I32 I32 2^28 1 3.131 ms 0.54% 3.132 ms 0.53% 1.669 us 0.05% PASS
I32 I32 2^16 0.544 9.403 us 5.95% 9.121 us 4.83% -0.281 us -2.99% PASS
I32 I32 2^20 0.544 21.708 us 4.21% 21.753 us 3.98% 0.045 us 0.21% PASS
I32 I32 2^24 0.544 216.843 us 0.91% 216.784 us 0.88% -0.059 us -0.03% PASS
I32 I32 2^28 0.544 3.355 ms 0.94% 3.354 ms 0.94% -0.293 us -0.01% PASS
I32 I32 2^16 0 9.071 us 5.08% 9.318 us 4.68% 0.247 us 2.72% PASS
I32 I32 2^20 0 20.275 us 3.72% 20.348 us 3.66% 0.073 us 0.36% PASS
I32 I32 2^24 0 202.833 us 0.50% 202.810 us 0.48% -0.022 us -0.01% PASS
I32 I32 2^28 0 3.125 ms 0.10% 3.128 ms 0.10% 2.074 us 0.07% PASS
I32 I64 2^16 1 9.397 us 4.56% 9.321 us 4.69% -0.076 us -0.81% PASS
I32 I64 2^20 1 21.134 us 3.39% 21.114 us 3.32% -0.021 us -0.10% PASS
I32 I64 2^24 1 207.717 us 0.64% 207.495 us 0.65% -0.222 us -0.11% PASS
I32 I64 2^28 1 3.202 ms 0.51% 3.201 ms 0.51% -1.285 us -0.04% PASS
I32 I64 2^16 0.544 9.630 us 5.66% 9.325 us 4.71% -0.304 us -3.16% PASS
I32 I64 2^20 0.544 22.456 us 4.23% 22.547 us 4.16% 0.091 us 0.41% PASS
I32 I64 2^24 0.544 219.819 us 0.89% 219.824 us 0.87% 0.005 us 0.00% PASS
I32 I64 2^28 0.544 3.394 ms 0.91% 3.394 ms 0.91% 0.082 us 0.00% PASS
I32 I64 2^16 0 9.290 us 4.48% 9.523 us 5.12% 0.233 us 2.51% PASS
I32 I64 2^20 0 21.133 us 3.35% 21.275 us 3.32% 0.142 us 0.67% PASS
I32 I64 2^24 0 208.314 us 0.51% 208.150 us 0.49% -0.165 us -0.08% PASS
I32 I64 2^28 0 3.205 ms 0.11% 3.204 ms 0.10% -1.396 us -0.04% PASS
I64 I32 2^16 1 9.922 us 5.11% 9.953 us 4.97% 0.031 us 0.31% PASS
I64 I32 2^20 1 31.246 us 2.23% 30.638 us 2.45% -0.608 us -1.95% PASS
I64 I32 2^24 1 375.560 us 0.50% 368.851 us 0.49% -6.708 us -1.79% FAIL
I64 I32 2^28 1 5.870 ms 0.50% 5.786 ms 0.50% -84.071 us -1.43% FAIL
I64 I32 2^16 0.544 9.958 us 5.20% 10.039 us 4.85% 0.080 us 0.81% PASS
I64 I32 2^20 0.544 31.340 us 2.09% 30.722 us 2.10% -0.619 us -1.97% PASS
I64 I32 2^24 0.544 389.667 us 0.68% 385.178 us 0.69% -4.489 us -1.15% FAIL
I64 I32 2^28 0.544 6.119 ms 0.82% 6.056 ms 0.86% -63.203 us -1.03% FAIL
I64 I32 2^16 0 9.721 us 5.41% 9.880 us 5.50% 0.159 us 1.64% PASS
I64 I32 2^20 0 31.150 us 2.26% 30.808 us 2.53% -0.342 us -1.10% PASS
I64 I32 2^24 0 375.745 us 0.37% 368.885 us 0.30% -6.860 us -1.83% FAIL
I64 I32 2^28 0 5.868 ms 0.09% 5.783 ms 0.06% -84.728 us -1.44% FAIL
I64 I64 2^16 1 10.183 us 4.66% 10.216 us 4.34% 0.033 us 0.33% PASS
I64 I64 2^20 1 31.330 us 2.18% 31.344 us 2.13% 0.014 us 0.04% PASS
I64 I64 2^24 1 377.879 us 0.50% 378.028 us 0.50% 0.149 us 0.04% PASS
I64 I64 2^28 1 5.896 ms 0.50% 5.899 ms 0.50% 3.020 us 0.05% PASS
I64 I64 2^16 0.544 10.172 us 4.65% 10.259 us 4.23% 0.087 us 0.86% PASS
I64 I64 2^20 0.544 31.468 us 1.97% 31.499 us 1.93% 0.031 us 0.10% PASS
I64 I64 2^24 0.544 391.326 us 0.67% 391.685 us 0.67% 0.359 us 0.09% PASS
I64 I64 2^28 0.544 6.141 ms 0.81% 6.145 ms 0.82% 3.776 us 0.06% PASS
I64 I64 2^16 0 10.126 us 4.25% 10.347 us 4.92% 0.221 us 2.18% PASS
I64 I64 2^20 0 31.413 us 2.20% 31.599 us 2.20% 0.186 us 0.59% PASS
I64 I64 2^24 0 377.955 us 0.36% 378.149 us 0.37% 0.194 us 0.05% PASS
I64 I64 2^28 0 5.897 ms 0.08% 5.899 ms 0.07% 2.674 us 0.05% PASS
I128 I32 2^16 1 12.385 us 3.82% 12.512 us 4.17% 0.127 us 1.03% PASS
I128 I32 2^20 1 52.413 us 1.32% 52.398 us 1.30% -0.015 us -0.03% PASS
I128 I32 2^24 1 716.725 us 0.35% 716.843 us 0.35% 0.117 us 0.02% PASS
I128 I32 2^28 1 11.362 ms 0.50% 11.362 ms 0.50% 0.051 us 0.00% PASS
I128 I32 2^16 0.544 12.567 us 4.32% 12.457 us 4.20% -0.110 us -0.88% PASS
I128 I32 2^20 0.544 53.357 us 1.42% 53.345 us 1.45% -0.012 us -0.02% PASS
I128 I32 2^24 0.544 730.162 us 0.55% 730.152 us 0.57% -0.010 us -0.00% PASS
I128 I32 2^28 0.544 11.586 ms 0.93% 11.586 ms 0.93% 0.196 us 0.00% PASS
I128 I32 2^16 0 12.385 us 4.04% 12.269 us 3.53% -0.116 us -0.94% PASS
I128 I32 2^20 0 52.455 us 1.33% 52.437 us 1.32% -0.019 us -0.04% PASS
I128 I32 2^24 0 718.102 us 0.22% 718.152 us 0.25% 0.050 us 0.01% PASS
I128 I32 2^28 0 11.355 ms 0.05% 11.357 ms 0.06% 2.503 us 0.02% PASS
I128 I64 2^16 1 11.919 us 4.64% 11.974 us 4.71% 0.055 us 0.46% PASS
I128 I64 2^20 1 53.687 us 1.35% 53.694 us 1.31% 0.007 us 0.01% PASS
I128 I64 2^24 1 737.283 us 0.41% 737.649 us 0.41% 0.366 us 0.05% PASS
I128 I64 2^28 1 11.685 ms 0.50% 11.690 ms 0.50% 4.562 us 0.04% PASS
I128 I64 2^16 0.544 12.092 us 4.56% 12.029 us 4.68% -0.063 us -0.52% PASS
I128 I64 2^20 0.544 54.363 us 1.27% 54.426 us 1.30% 0.063 us 0.12% PASS
I128 I64 2^24 0.544 751.068 us 0.59% 751.166 us 0.60% 0.098 us 0.01% PASS
I128 I64 2^28 0.544 11.906 ms 0.86% 11.909 ms 0.87% 2.977 us 0.03% PASS
I128 I64 2^16 0 11.992 us 4.48% 11.721 us 4.54% -0.271 us -2.26% PASS
I128 I64 2^20 0 53.728 us 1.32% 53.630 us 1.31% -0.097 us -0.18% PASS
I128 I64 2^24 0 738.295 us 0.33% 738.407 us 0.33% 0.112 us 0.02% PASS
I128 I64 2^28 0 11.682 ms 0.09% 11.687 ms 0.10% 4.501 us 0.04% PASS
F32 I32 2^16 1 9.174 us 4.94% 9.440 us 5.11% 0.266 us 2.90% PASS
F32 I32 2^20 1 20.316 us 3.26% 20.160 us 3.15% -0.155 us -0.77% PASS
F32 I32 2^24 1 202.932 us 0.63% 202.954 us 0.63% 0.022 us 0.01% PASS
F32 I32 2^28 1 3.131 ms 0.54% 3.132 ms 0.53% 1.704 us 0.05% PASS
F32 I32 2^16 0.544 9.298 us 5.58% 9.188 us 4.84% -0.109 us -1.18% PASS
F32 I32 2^20 0.544 21.812 us 4.08% 21.903 us 4.59% 0.091 us 0.42% PASS
F32 I32 2^24 0.544 216.930 us 0.88% 216.828 us 1.02% -0.102 us -0.05% PASS
F32 I32 2^28 0.544 3.355 ms 0.94% 3.354 ms 0.94% -0.356 us -0.01% PASS
F32 I32 2^16 0 9.069 us 5.53% 9.194 us 5.31% 0.126 us 1.39% PASS
F32 I32 2^20 0 20.226 us 3.81% 20.276 us 3.56% 0.050 us 0.25% PASS
F32 I32 2^24 0 202.628 us 0.52% 202.627 us 0.50% -0.001 us -0.00% PASS
F32 I32 2^28 0 3.126 ms 0.10% 3.127 ms 0.10% 1.402 us 0.04% PASS
F32 I64 2^16 1 9.357 us 5.21% 9.664 us 5.43% 0.307 us 3.28% PASS
F32 I64 2^20 1 21.103 us 3.25% 21.199 us 4.82% 0.095 us 0.45% PASS
F32 I64 2^24 1 207.699 us 0.64% 207.670 us 0.76% -0.029 us -0.01% PASS
F32 I64 2^28 1 3.202 ms 0.51% 3.201 ms 0.51% -1.021 us -0.03% PASS
F32 I64 2^16 0.544 9.488 us 5.36% 9.481 us 11.12% -0.007 us -0.07% PASS
F32 I64 2^20 0.544 22.421 us 4.23% 22.620 us 6.13% 0.199 us 0.89% PASS
F32 I64 2^24 0.544 219.860 us 0.87% 220.253 us 1.01% 0.392 us 0.18% PASS
F32 I64 2^28 0.544 3.394 ms 0.91% 3.394 ms 0.92% -0.126 us -0.00% PASS
F32 I64 2^16 0 9.341 us 4.68% 9.350 us 10.56% 0.009 us 0.09% PASS
F32 I64 2^20 0 21.178 us 3.38% 21.199 us 5.44% 0.021 us 0.10% PASS
F32 I64 2^24 0 208.285 us 0.52% 208.201 us 0.71% -0.084 us -0.04% PASS
F32 I64 2^28 0 3.205 ms 0.11% 3.204 ms 0.11% -1.004 us -0.03% PASS
F64 I32 2^16 1 9.941 us 4.99% 10.074 us 10.46% 0.132 us 1.33% PASS
F64 I32 2^20 1 31.249 us 2.21% 30.666 us 3.71% -0.584 us -1.87% PASS
F64 I32 2^24 1 375.794 us 0.52% 368.998 us 0.50% -6.796 us -1.81% FAIL
F64 I32 2^28 1 5.870 ms 0.50% 5.786 ms 0.50% -83.980 us -1.43% FAIL
F64 I32 2^16 0.544 10.114 us 5.13% 9.922 us 8.93% -0.192 us -1.90% PASS
F64 I32 2^20 0.544 31.452 us 2.06% 30.726 us 2.11% -0.726 us -2.31% FAIL
F64 I32 2^24 0.544 389.758 us 0.67% 385.246 us 0.68% -4.512 us -1.16% FAIL
F64 I32 2^28 0.544 6.119 ms 0.82% 6.056 ms 0.86% -63.138 us -1.03% FAIL
F64 I32 2^16 0 9.782 us 5.38% 9.810 us 5.34% 0.028 us 0.28% PASS
F64 I32 2^20 0 31.241 us 2.25% 30.662 us 2.49% -0.578 us -1.85% PASS
F64 I32 2^24 0 375.846 us 0.38% 369.000 us 0.28% -6.845 us -1.82% FAIL
F64 I32 2^28 0 5.869 ms 0.08% 5.783 ms 0.06% -85.853 us -1.46% FAIL
F64 I64 2^16 1 10.217 us 4.17% 10.398 us 4.37% 0.181 us 1.78% PASS
F64 I64 2^20 1 31.436 us 2.16% 31.509 us 2.10% 0.073 us 0.23% PASS
F64 I64 2^24 1 377.791 us 0.50% 377.947 us 0.50% 0.155 us 0.04% PASS
F64 I64 2^28 1 5.897 ms 0.50% 5.900 ms 0.50% 2.831 us 0.05% PASS
F64 I64 2^16 0.544 10.263 us 4.22% 10.326 us 4.01% 0.063 us 0.61% PASS
F64 I64 2^20 0.544 31.524 us 1.98% 31.578 us 1.96% 0.054 us 0.17% PASS
F64 I64 2^24 0.544 391.299 us 0.66% 391.603 us 0.66% 0.304 us 0.08% PASS
F64 I64 2^28 0.544 6.141 ms 0.81% 6.145 ms 0.81% 3.585 us 0.06% PASS
F64 I64 2^16 0 10.077 us 4.47% 10.285 us 3.96% 0.208 us 2.06% PASS
F64 I64 2^20 0 31.410 us 2.21% 31.552 us 2.16% 0.142 us 0.45% PASS
F64 I64 2^24 0 377.946 us 0.39% 378.163 us 0.37% 0.217 us 0.06% PASS
F64 I64 2^28 0 5.896 ms 0.08% 5.899 ms 0.08% 3.000 us 0.05% PASS
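For reading the two tables: Diff is Cmp Time minus Ref Time, and %Diff is that difference relative to Ref Time (e.g. -0.094 us / 8.766 us = -1.07% in the first row). The Status column is consistent with a rule where a row FAILs when |%Diff| exceeds the smaller of Ref Noise and Cmp Noise; for example, the +2.27% I64/I64/2^20/entropy-1 row of the if benchmark fails against noise values of 2.37% and 2.02%. FAIL therefore marks a change larger than the measured noise in either direction, not necessarily a regression: most I64 and F64 rows with I32 offsets "fail" by getting faster. The helpers below encode this reading of the tables; they are a hypothetical illustration, not the comparison tool's code.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// %Diff column: change of Cmp Time relative to Ref Time, in percent.
inline double percent_diff(double ref_time, double cmp_time)
{
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// Inferred Status rule: PASS when |%Diff| is within the smaller of the two
// noise values (all arguments in percent), FAIL otherwise.
inline std::string status(double pct_diff, double ref_noise_pct, double cmp_noise_pct)
{
  const double min_noise = std::fmin(ref_noise_pct, cmp_noise_pct);
  return std::fabs(pct_diff) <= min_noise ? "PASS" : "FAIL";
}
```

Under this reading, the I16/I64/2^16/entropy-0 row (-10.82%) fails because it exceeds the 5.42% Cmp Noise, even though the Ref Noise is 11.10%.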

@gevtushenko (Collaborator) left a comment:

This is an elegant solution that enables new functionality with few changes to CUB internals. Great work!

{
// Element type of the input iterator
using value_t = typename iterator_traits<InputIt>::value_type;
std::size_t num_items = static_cast<std::size_t>(thrust::distance(first, last));

Review comment (Collaborator):

question: I'm not sure about this cast. The temporary array checks for num_items > 0. If a user accidentally provides last < first, the negative distance wraps around when cast to std::size_t, and we'll try to allocate close to the maximum size_t value. Do you think keeping num_items signed until we reach the dispatch would be any better?
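The wrap-around described here is easy to reproduce on the host. The sketch below uses std::distance as a stand-in for thrust::distance, and unsigned_count and negative_range_wraps are hypothetical helpers that mirror the reviewed line: a difference of -2 becomes SIZE_MAX - 1 after the cast, so a num_items > 0 check on the unsigned value passes and the huge count reaches the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <limits>
#include <vector>

// Mirrors the reviewed line: a signed iterator difference cast to std::size_t.
// For an invalid range (last < first) the difference is negative, and the
// conversion wraps modulo 2^N to a value close to SIZE_MAX.
template <typename It>
std::size_t unsigned_count(It first, It last)
{
  return static_cast<std::size_t>(std::distance(first, last));
}

// Hypothetical demo: last < first on a small host vector.
inline bool negative_range_wraps()
{
  std::vector<int> v{1, 2, 3, 4};
  auto first = v.begin() + 3;
  auto last = v.begin() + 1;  // last < first, so the distance is -2
  assert(std::distance(first, last) == -2);
  // A "num_items > 0" guard on the unsigned value would not catch this.
  return unsigned_count(first, last) == std::numeric_limits<std::size_t>::max() - 1;
}
```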

Reply from @elstehle (Contributor, Author):

Thanks, good catch on the negative ranges!

I've decided to check for negative or empty ranges and return early. I was also concerned about other inconvenient side effects of a negative num_items, such as allocating the temporary array with a negative size in the in-place version, and returning a rejected-output iterator that is moved backward by the negative count (i.e., rejected_result + num_items - num_selected). So instead we now have, for the partition_copy variants:

  if(thrust::distance(first, last) <= 0){
    return thrust::make_pair(selected_result, rejected_result);
  }

and, for the in-place variants:

  if(thrust::distance(first, last) <= 0){
    return first;
  }
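The guard also fits the 32/64-bit offset dispatch listed in the description: keep the count signed, return early for empty or negative ranges, and only then choose an offset type. Below is a host-only sketch under those assumptions. partition_copy_sketch, dispatch_with_offset and offset_bits_for are hypothetical, std::partition_copy stands in for the CUB-based implementation, and the 32-bit path is assumed to be taken whenever the count fits in std::int32_t.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical selection rule: 32-bit offsets when the count fits, else 64-bit.
inline int offset_bits_for(std::int64_t num_items)
{
  return num_items <= std::numeric_limits<std::int32_t>::max() ? 32 : 64;
}

// Stand-in for the offset-typed backend. A real implementation would
// instantiate the device algorithm with OffsetT; this one runs on the host.
template <typename OffsetT, typename InputIt, typename OutIt1, typename OutIt2, typename Pred>
std::pair<OutIt1, OutIt2>
dispatch_with_offset(InputIt first, OffsetT num_items, OutIt1 selected, OutIt2 rejected, Pred pred)
{
  assert(num_items > 0);  // the front end has already filtered empty/negative ranges
  return std::partition_copy(first, first + num_items, selected, rejected, pred);
}

// Hypothetical partition_copy front end: the count stays signed until dispatch.
template <typename InputIt, typename OutIt1, typename OutIt2, typename Pred>
std::pair<OutIt1, OutIt2>
partition_copy_sketch(InputIt first, InputIt last, OutIt1 selected, OutIt2 rejected, Pred pred)
{
  const auto n = std::distance(first, last);  // signed difference_type
  if (n <= 0)
  {
    // Empty or negative range: return the outputs untouched, allocate nothing.
    return std::make_pair(selected, rejected);
  }
  const std::int64_t num_items = static_cast<std::int64_t>(n);
  if (offset_bits_for(num_items) == 32)
  {
    return dispatch_with_offset(first, static_cast<std::int32_t>(num_items), selected, rejected, pred);
  }
  return dispatch_with_offset(first, num_items, selected, rejected, pred);
}
```

For the in-place variants, the same early check would return first instead of a pair of output iterators.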

@elstehle merged commit 1d78f0d into NVIDIA:main on Feb 28, 2024
540 checks passed
Labels: cub (For all items related to CUB)
Projects: Archived in project
3 participants