Improve perf of Enumerable.Sum/Average/Max/Min for arrays and lists #64624

Merged (2 commits) on Feb 4, 2022

Conversation

stephentoub
Member

It's common to use these terminal functions for quick stats on arrays and lists of values. Just the overhead of enumerating via the enumerable interfaces (which involves multiple interface dispatches per iteration) is significant, and it's much faster to enumerate the contents of the array or list directly. In some cases, we can further use vectorization to speed up the processing.

This change:

  • Adds a helper that does a fast check to see if it can extract a span from an enumerable that's actually an array or a list (a sketch of the idea follows this list). It could be augmented to detect other interesting types, but T[] and List<T> are the most relevant from the data I've seen, and the type checks are cheap enough that we get most of the benefit for a small amount of cost.
  • Uses that helper in the int/long/float/double/decimal overloads of Sum/Average/Min/Max to add a span-based path.
  • Vectorizes Sum for float and double
  • Vectorizes Average for int, float, and double (the latter two via use of Sum)
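
For illustration, here is a minimal sketch of what such a helper might look like. This is an assumed shape, not the exact code added by this PR; the helper name and the use of CollectionsMarshal are illustrative.

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    internal static class SpanHelper
    {
        // Cheap type checks for the two most common concrete types behind IEnumerable<T>.
        public static bool TryGetSpan<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
        {
            if (source is T[] array)
            {
                span = array;
                return true;
            }

            if (source is List<T> list)
            {
                // Exposes the List<T>'s backing array; the span must not be used
                // across mutations of the list.
                span = CollectionsMarshal.AsSpan(list);
                return true;
            }

            span = default;
            return false;
        }
    }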

@tannergooding, I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that? Also, it seemed like it should be possible to vectorize some of the methods doing checked arithmetic, but I wasn't sure how to do so correctly and skipped those. It also seemed like it should be possible to vectorize min/max for floats/doubles, but they have special handling of NaN that I couldn't figure out how to replicate with Vector.

@eiriktsarpalis, please let me know in general if you're ok with the extra code here for this special-casing. We don't have to do it, but it seems like an inexpensive win for what appears to be common, e.g. building up a List<int> and calling Average/Sum/Min/Max on it. There are certainly other patterns common with these, e.g. calling Min on the result of a Select or on other collection types. We could subsequently choose to also special-case IList<T> in order to save an interface dispatch per invocation, though I'm hopeful we'll get most of that for free with dynamic PGO. We could also later special-case the internal partitioning interfaces; the difficulty with those is that the check for whether a type implements them is more expensive, and there are diminishing returns because more work is already being done to compute each element (whereas with arrays/lists, the amount of work per element is tiny).

| Method | Toolchain | Length | Mode | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| MinFloat | \main\corerun.exe | 2 | Array | 20.379 ns | 0.4283 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 2 | Array | 4.493 ns | 0.1502 ns | 0.23 |
| MinDecimal | \main\corerun.exe | 2 | Array | 22.809 ns | 0.2779 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 2 | Array | 8.633 ns | 0.1527 ns | 0.38 |
| SumInt32 | \main\corerun.exe | 2 | Array | 18.643 ns | 0.3083 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 2 | Array | 1.924 ns | 0.0544 ns | 0.10 |
| SumInt64 | \main\corerun.exe | 2 | Array | 18.344 ns | 0.1885 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 2 | Array | 1.972 ns | 0.0360 ns | 0.11 |
| SumFloat | \main\corerun.exe | 2 | Array | 19.220 ns | 0.3734 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 2 | Array | 4.102 ns | 0.0312 ns | 0.21 |
| SumDecimal | \main\corerun.exe | 2 | Array | 27.653 ns | 0.5665 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 2 | Array | 13.084 ns | 0.1069 ns | 0.47 |
| AverageInt32 | \main\corerun.exe | 2 | Array | 18.634 ns | 0.3122 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 2 | Array | 5.071 ns | 0.1270 ns | 0.27 |
| AverageInt64 | \main\corerun.exe | 2 | Array | 18.582 ns | 0.3613 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 2 | Array | 3.473 ns | 0.0440 ns | 0.19 |
| AverageFloat | \main\corerun.exe | 2 | Array | 19.145 ns | 0.2944 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 2 | Array | 4.278 ns | 0.0346 ns | 0.22 |
| AverageDecimal | \main\corerun.exe | 2 | Array | 55.346 ns | 0.7298 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 2 | Array | 49.087 ns | 0.1350 ns | 0.89 |
| MinFloat | \main\corerun.exe | 2 | Enumerable | 28.121 ns | 0.4956 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 2 | Enumerable | 27.229 ns | 0.4659 ns | 0.97 |
| MinDecimal | \main\corerun.exe | 2 | Enumerable | 33.184 ns | 0.3598 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 2 | Enumerable | 34.396 ns | 0.6804 ns | 1.04 |
| SumInt32 | \main\corerun.exe | 2 | Enumerable | 21.329 ns | 0.2865 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 2 | Enumerable | 21.540 ns | 0.2581 ns | 1.01 |
| SumInt64 | \main\corerun.exe | 2 | Enumerable | 24.865 ns | 0.2726 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 2 | Enumerable | 25.556 ns | 0.4184 ns | 1.03 |
| SumFloat | \main\corerun.exe | 2 | Enumerable | 25.950 ns | 0.2292 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 2 | Enumerable | 26.443 ns | 0.3911 ns | 1.02 |
| SumDecimal | \main\corerun.exe | 2 | Enumerable | 37.731 ns | 0.6087 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 2 | Enumerable | 38.357 ns | 0.7728 ns | 1.02 |
| AverageInt32 | \main\corerun.exe | 2 | Enumerable | 21.104 ns | 0.2414 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 2 | Enumerable | 22.065 ns | 0.4544 ns | 1.05 |
| AverageInt64 | \main\corerun.exe | 2 | Enumerable | 24.994 ns | 0.5023 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 2 | Enumerable | 26.308 ns | 0.5447 ns | 1.05 |
| AverageFloat | \main\corerun.exe | 2 | Enumerable | 27.288 ns | 0.5206 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 2 | Enumerable | 26.597 ns | 0.4992 ns | 0.97 |
| AverageDecimal | \main\corerun.exe | 2 | Enumerable | 67.316 ns | 0.4518 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 2 | Enumerable | 80.256 ns | 0.3487 ns | 1.19 |
| MinFloat | \main\corerun.exe | 32 | Array | 162.367 ns | 1.4179 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 32 | Array | 40.095 ns | 0.5829 ns | 0.25 |
| MinDecimal | \main\corerun.exe | 32 | Array | 268.355 ns | 3.1338 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 32 | Array | 172.761 ns | 1.5649 ns | 0.64 |
| SumInt32 | \main\corerun.exe | 32 | Array | 138.323 ns | 1.1075 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 32 | Array | 13.147 ns | 0.1089 ns | 0.10 |
| SumInt64 | \main\corerun.exe | 32 | Array | 144.375 ns | 1.6040 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 32 | Array | 12.756 ns | 0.2747 ns | 0.09 |
| SumFloat | \main\corerun.exe | 32 | Array | 142.810 ns | 0.6627 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 32 | Array | 8.980 ns | 0.0286 ns | 0.06 |
| SumDecimal | \main\corerun.exe | 32 | Array | 268.380 ns | 5.2945 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 32 | Array | 158.555 ns | 2.0546 ns | 0.59 |
| AverageInt32 | \main\corerun.exe | 32 | Array | 154.934 ns | 2.9998 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 32 | Array | 10.516 ns | 0.1869 ns | 0.07 |
| AverageInt64 | \main\corerun.exe | 32 | Array | 148.826 ns | 2.9388 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 32 | Array | 14.362 ns | 0.3140 ns | 0.10 |
| AverageFloat | \main\corerun.exe | 32 | Array | 146.798 ns | 2.5321 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 32 | Array | 9.836 ns | 0.1240 ns | 0.07 |
| AverageDecimal | \main\corerun.exe | 32 | Array | 353.931 ns | 3.2577 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 32 | Array | 173.172 ns | 3.3340 ns | 0.49 |
| MinFloat | \main\corerun.exe | 32 | Enumerable | 219.598 ns | 3.2260 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 32 | Enumerable | 201.268 ns | 2.5158 ns | 0.92 |
| MinDecimal | \main\corerun.exe | 32 | Enumerable | 401.314 ns | 6.9631 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 32 | Enumerable | 396.958 ns | 4.2611 ns | 0.99 |
| SumInt32 | \main\corerun.exe | 32 | Enumerable | 130.008 ns | 2.4007 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 32 | Enumerable | 133.748 ns | 2.4726 ns | 1.03 |
| SumInt64 | \main\corerun.exe | 32 | Enumerable | 176.430 ns | 3.4085 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 32 | Enumerable | 175.259 ns | 3.0756 ns | 0.99 |
| SumFloat | \main\corerun.exe | 32 | Enumerable | 195.938 ns | 2.9538 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 32 | Enumerable | 190.320 ns | 3.0678 ns | 0.97 |
| SumDecimal | \main\corerun.exe | 32 | Enumerable | 382.905 ns | 3.4056 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 32 | Enumerable | 384.330 ns | 4.6222 ns | 1.00 |
| AverageInt32 | \main\corerun.exe | 32 | Enumerable | 132.928 ns | 2.5226 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 32 | Enumerable | 143.767 ns | 2.4111 ns | 1.08 |
| AverageInt64 | \main\corerun.exe | 32 | Enumerable | 181.207 ns | 1.6801 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 32 | Enumerable | 177.174 ns | 2.1069 ns | 0.98 |
| AverageFloat | \main\corerun.exe | 32 | Enumerable | 203.800 ns | 2.0559 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 32 | Enumerable | 201.102 ns | 3.2647 ns | 0.99 |
| AverageDecimal | \main\corerun.exe | 32 | Enumerable | 497.159 ns | 1.4778 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 32 | Enumerable | 492.052 ns | 2.3135 ns | 0.99 |
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(2, 32)]
    public int Length { get; set; }

    [Params("Array", "Enumerable")]
    public string Mode { get; set; }

    private IEnumerable<int> _ints;
    private IEnumerable<long> _longs;
    private IEnumerable<float> _floats;
    private IEnumerable<decimal> _decimals;

    [Benchmark] public float MinFloat() => _floats.Min();
    [Benchmark] public decimal MinDecimal() => _decimals.Min();

    [Benchmark] public int SumInt32() => _ints.Sum();
    [Benchmark] public long SumInt64() => _longs.Sum();
    [Benchmark] public float SumFloat() => _floats.Sum();
    [Benchmark] public decimal SumDecimal() => _decimals.Sum();

    [Benchmark] public double AverageInt32() => _ints.Average();
    [Benchmark] public double AverageInt64() => _longs.Average();
    [Benchmark] public float AverageFloat() => _floats.Average();
    [Benchmark] public decimal AverageDecimal() => _decimals.Average();

    [GlobalSetup]
    public void Setup()
    {
        _ints = Enumerable.Range(1, Length);
        if (Mode == "Array") _ints = _ints.ToArray();

        _longs = Enumerable.Range(1, Length).Select(i => (long)i);
        if (Mode == "Array") _longs = _longs.ToArray();

        _floats = Enumerable.Range(1, Length).Select(i => (float)i);
        if (Mode == "Array") _floats = _floats.ToArray();

        _decimals = Enumerable.Range(1, Length).Select(i => (decimal)i);
        if (Mode == "Array") _decimals = _decimals.ToArray();
    }
}

@ghost

ghost commented Feb 1, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

> @eiriktsarpalis, please let me know in general if you're ok with the extra code here for this special-casing.

Should be ok, but per your own comment in #64470 (comment) it would be nice if the code were moved out of Linq eventually.

@tannergooding
Member

tannergooding commented Feb 1, 2022

> I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that? Also, it seemed like it should be possible to vectorize some of the methods doing checked arithmetic, but I wasn't sure how to do so correctly and skipped those. It also seemed like it should be possible to vectorize min/max for floats/doubles, but they have special-handling of NaN I couldn't figure out how to replicate with Vector.

@stephentoub right. There are a number of cases where we cannot trivially vectorize float/double (particularly operations that compute a new value like Sum) because it can drastically change the output and will lead to determinism issues among other bugs (put another way, I think we should care a lot and not do this for float/double).

The simplest scenario to explain is that with integers the delta between one representable value and the next is always 1 (ignoring overflow for a minute). Whereas with floating-point this starts at epsilon (the smallest representable value greater than zero) and then doubles every power of 2 after that. For float this means that from 2^22 to 2^23 the delta between values is 0.5, from 2^23 to 2^24 it is 1, from 2^24 to 2^25 it is 2, from 2^25 to 2^26 it is 4, and so on (in both directions: going down through the negative powers until you hit epsilon, and going up until you hit float.MaxValue).

What this means is that if you take a case like new float[] { 16777216.0f, 0.5f, 0.5f, 0.5f } and sum it left to right, you get back 16777216.0f, because 2^24 + 0.5 always returns 2^24 since 0.5 is less than half the delta between values (2). However, if you reorder this to new float[] { 0.5f, 0.5f, 0.5f, 16777216.0f }, you instead get back 16777218.0f, because 0.5 + 0.5 + 0.5 accumulates to 1.5, which (due to rounding) is enough to round up to 16777218.
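
A tiny standalone illustration of that reordering (hedged: this uses a plain float accumulator to isolate the ordering effect; Enumerable.Sum for float accumulates in double, so it won't reproduce exactly these numbers):

    static float SumLeftToRight(float[] values)
    {
        float sum = 0f; // plain float accumulation, purely to demonstrate ordering
        foreach (float v in values) sum += v;
        return sum;
    }

    // SumLeftToRight(new[] { 16777216.0f, 0.5f, 0.5f, 0.5f }) == 16777216f (each 0.5 is lost)
    // SumLeftToRight(new[] { 0.5f, 0.5f, 0.5f, 16777216.0f }) == 16777218f (the 0.5s reach 1.5 first)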

Vectorization is impacted here because element n ends up being accumulated with element n + Count (and n + 2*Count, and so on) in the same lane, with the per-lane partial sums only added "across" at the end, which changes the order of the additions. There have been open asks for several years to provide a "fast math" switch that would allow such optimizations, but it's non-trivial to achieve. The easiest thing would be to expose a new overload that lets users explicitly opt into such differences.
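
To make the lane accumulation concrete, here is a rough sketch of the pattern being described, assuming System.Numerics.Vector and the Vector.Sum helper (an illustrative shape, not this PR's code):

    using System;
    using System.Numerics;

    static float VectorizedSum(ReadOnlySpan<float> span)
    {
        float result = 0f;
        int i = 0;

        if (Vector.IsHardwareAccelerated && span.Length >= Vector<float>.Count)
        {
            Vector<float> sums = Vector<float>.Zero;
            do
            {
                // Each lane accumulates every Count-th element, so the partial sums interleave.
                sums += new Vector<float>(span.Slice(i));
                i += Vector<float>.Count;
            }
            while (i <= span.Length - Vector<float>.Count);

            // The "add across" at the end combines lanes in a different order
            // than a sequential left-to-right sum would.
            result = Vector.Sum(sums);
        }

        for (; i < span.Length; i++) // scalar tail
        {
            result += span[i];
        }

        return result;
    }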


Things that do not compute a "new value", like Min/Max (which only do a comparison and return one of the inputs), can be vectorized, but the implementation needs to take into account things like NaN handling, as you indicated.

For overflow checking, it can be simple when you know the inputs are all positive or all negative. It's not as simple when they can be an arbitrary mix of positive and negative values.

@stephentoub
Member Author

> I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that?

> I think we should care a lot and not do this for float/double

Ok, deleting.

Vector<long> sums = default;
do
{
    Vector.Widen(new Vector<int>(span.Slice(i)), out Vector<long> low, out Vector<long> high);
Member

Are the bounds checks actually getting elided here for new Vector<int>(span.Slice(i))? This seems like it's going to be doing a bunch of extra work each iteration.

stephentoub (Member Author) commented Feb 4, 2022

I don't believe so. But I also didn't want to introduce unsafe code. Is there a way to write it that's "safe" (i.e. if I made a mistake it would result in an exception rather than potential corruption / security problems) and that avoids the bounds checks? My assumption is that even if I eliminate the checks in Slice by using a loop pattern the JIT recognizes, Vector itself will still be doing a length check on the length of the span supplied, and that won't be removed (or will it)?

Member

Some other places in the runtime use a helper built around Unsafe.ReadUnaligned and then the xplat helpers now have Vector128.LoadUnsafe(ref T, nuint index).

There's never really going to be a "safe" way to do this unless the JIT gets special support, however. You functionally have some T and want to read 2-32 of that T (depending on what the actual type is). So the best you'll generally get is doing the right checks up front and potentially adding some asserts. Otherwise, you pay the cost of slicing, copying the span, and doing the relevant bounds checks each iteration.
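
For reference, a hedged sketch of that kind of unchecked load (an assumed shape; the runtime's internal helpers differ in detail, and the caller must validate the offset up front):

    using System;
    using System.Numerics;
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;

    static Vector<int> LoadVector(ReadOnlySpan<int> span, int offset)
    {
        // Caller guarantees offset + Vector<int>.Count <= span.Length; this trades
        // the per-iteration Slice/bounds checks for an unchecked unaligned read.
        ref int start = ref Unsafe.Add(ref MemoryMarshal.GetReference(span), offset);
        return Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref start));
    }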

Member Author

I'm fine with that in this code. If these helpers are ever moved into MemoryExtensions, we can go to town on optimizing the heck out of them, both by avoiding bounds checks and by adding whatever additional paths are necessary to get the best perf. For LINQ, I think this is good enough, and there's benefit to not proliferating unsafe code here.

ghost locked as resolved and limited conversation to collaborators on Mar 7, 2022