Improve perf of Enumerable.Sum/Average/Max/Min for arrays and lists #64624

Merged (2 commits) on Feb 4, 2022

Conversation

stephentoub
Member

It's common to use these terminal functions for quick stats on arrays and lists of values. Just the overhead of enumerating via the enumerable interfaces (which involves multiple interface dispatches per iteration) is significant, and it's much faster to enumerate the contents of the array or list directly. In some cases, we can further use vectorization to speed up the processing.

This change:

  • Adds a helper that does a fast check to see if it can extract a span from an enumerable that's actually an array or a list (a sketch of the idea follows this list). It could be augmented to detect other interesting types, but T[] and List<T> are the most relevant from the data I've seen, and the type checks are cheap enough that we get most of the benefit for a small amount of cost.
  • Uses that helper in the int/long/float/double/decimal overloads of Sum/Average/Min/Max to add a span-based path.
  • Vectorizes Sum for float and double
  • Vectorizes Average for int, float, and double (the latter two via use of Sum)
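
For illustration, here is a minimal sketch of what such a helper might look like. This is an assumed shape, not the exact code added by this PR; the helper name and the use of CollectionsMarshal are illustrative.

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    internal static class SpanHelper
    {
        // Cheap type checks for the two most common concrete types behind IEnumerable<T>.
        public static bool TryGetSpan<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
        {
            if (source is T[] array)
            {
                span = array;
                return true;
            }

            if (source is List<T> list)
            {
                // Exposes the List<T>'s backing array; the span must not be used
                // across mutations of the list.
                span = CollectionsMarshal.AsSpan(list);
                return true;
            }

            span = default;
            return false;
        }
    }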

@tannergooding, I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that? Also, it seemed like it should be possible to vectorize some of the methods doing checked arithmetic, but I wasn't sure how to do so correctly and skipped those. It also seemed like it should be possible to vectorize min/max for floats/doubles, but they have special handling of NaN that I couldn't figure out how to replicate with Vector.

@eiriktsarpalis, please let me know in general if you're ok with the extra code here for this special-casing. We don't have to do it, but it seems like an inexpensive win for what appears to be common, e.g. building up a List<int> and calling Average/Sum/Min/Max on it. There are certainly other patterns common with these, e.g. calling Min on the result of a Select or on other collection types. We could subsequently choose to also special-case IList<T> in order to save an interface dispatch per invocation, though I'm hopeful we'll get most of that for free with dynamic PGO. We could also later special-case the internal partitioning interfaces; the difficulty with those is that the check for whether a type implements them is more expensive, and there are diminishing returns because more work is already being done to compute each element (whereas with arrays/lists, the amount of work per element is tiny).

| Method | Toolchain | Length | Mode | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| MinFloat | \main\corerun.exe | 2 | Array | 20.379 ns | 0.4283 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 2 | Array | 4.493 ns | 0.1502 ns | 0.23 |
| MinDecimal | \main\corerun.exe | 2 | Array | 22.809 ns | 0.2779 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 2 | Array | 8.633 ns | 0.1527 ns | 0.38 |
| SumInt32 | \main\corerun.exe | 2 | Array | 18.643 ns | 0.3083 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 2 | Array | 1.924 ns | 0.0544 ns | 0.10 |
| SumInt64 | \main\corerun.exe | 2 | Array | 18.344 ns | 0.1885 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 2 | Array | 1.972 ns | 0.0360 ns | 0.11 |
| SumFloat | \main\corerun.exe | 2 | Array | 19.220 ns | 0.3734 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 2 | Array | 4.102 ns | 0.0312 ns | 0.21 |
| SumDecimal | \main\corerun.exe | 2 | Array | 27.653 ns | 0.5665 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 2 | Array | 13.084 ns | 0.1069 ns | 0.47 |
| AverageInt32 | \main\corerun.exe | 2 | Array | 18.634 ns | 0.3122 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 2 | Array | 5.071 ns | 0.1270 ns | 0.27 |
| AverageInt64 | \main\corerun.exe | 2 | Array | 18.582 ns | 0.3613 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 2 | Array | 3.473 ns | 0.0440 ns | 0.19 |
| AverageFloat | \main\corerun.exe | 2 | Array | 19.145 ns | 0.2944 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 2 | Array | 4.278 ns | 0.0346 ns | 0.22 |
| AverageDecimal | \main\corerun.exe | 2 | Array | 55.346 ns | 0.7298 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 2 | Array | 49.087 ns | 0.1350 ns | 0.89 |
| MinFloat | \main\corerun.exe | 2 | Enumerable | 28.121 ns | 0.4956 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 2 | Enumerable | 27.229 ns | 0.4659 ns | 0.97 |
| MinDecimal | \main\corerun.exe | 2 | Enumerable | 33.184 ns | 0.3598 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 2 | Enumerable | 34.396 ns | 0.6804 ns | 1.04 |
| SumInt32 | \main\corerun.exe | 2 | Enumerable | 21.329 ns | 0.2865 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 2 | Enumerable | 21.540 ns | 0.2581 ns | 1.01 |
| SumInt64 | \main\corerun.exe | 2 | Enumerable | 24.865 ns | 0.2726 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 2 | Enumerable | 25.556 ns | 0.4184 ns | 1.03 |
| SumFloat | \main\corerun.exe | 2 | Enumerable | 25.950 ns | 0.2292 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 2 | Enumerable | 26.443 ns | 0.3911 ns | 1.02 |
| SumDecimal | \main\corerun.exe | 2 | Enumerable | 37.731 ns | 0.6087 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 2 | Enumerable | 38.357 ns | 0.7728 ns | 1.02 |
| AverageInt32 | \main\corerun.exe | 2 | Enumerable | 21.104 ns | 0.2414 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 2 | Enumerable | 22.065 ns | 0.4544 ns | 1.05 |
| AverageInt64 | \main\corerun.exe | 2 | Enumerable | 24.994 ns | 0.5023 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 2 | Enumerable | 26.308 ns | 0.5447 ns | 1.05 |
| AverageFloat | \main\corerun.exe | 2 | Enumerable | 27.288 ns | 0.5206 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 2 | Enumerable | 26.597 ns | 0.4992 ns | 0.97 |
| AverageDecimal | \main\corerun.exe | 2 | Enumerable | 67.316 ns | 0.4518 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 2 | Enumerable | 80.256 ns | 0.3487 ns | 1.19 |
| MinFloat | \main\corerun.exe | 32 | Array | 162.367 ns | 1.4179 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 32 | Array | 40.095 ns | 0.5829 ns | 0.25 |
| MinDecimal | \main\corerun.exe | 32 | Array | 268.355 ns | 3.1338 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 32 | Array | 172.761 ns | 1.5649 ns | 0.64 |
| SumInt32 | \main\corerun.exe | 32 | Array | 138.323 ns | 1.1075 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 32 | Array | 13.147 ns | 0.1089 ns | 0.10 |
| SumInt64 | \main\corerun.exe | 32 | Array | 144.375 ns | 1.6040 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 32 | Array | 12.756 ns | 0.2747 ns | 0.09 |
| SumFloat | \main\corerun.exe | 32 | Array | 142.810 ns | 0.6627 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 32 | Array | 8.980 ns | 0.0286 ns | 0.06 |
| SumDecimal | \main\corerun.exe | 32 | Array | 268.380 ns | 5.2945 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 32 | Array | 158.555 ns | 2.0546 ns | 0.59 |
| AverageInt32 | \main\corerun.exe | 32 | Array | 154.934 ns | 2.9998 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 32 | Array | 10.516 ns | 0.1869 ns | 0.07 |
| AverageInt64 | \main\corerun.exe | 32 | Array | 148.826 ns | 2.9388 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 32 | Array | 14.362 ns | 0.3140 ns | 0.10 |
| AverageFloat | \main\corerun.exe | 32 | Array | 146.798 ns | 2.5321 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 32 | Array | 9.836 ns | 0.1240 ns | 0.07 |
| AverageDecimal | \main\corerun.exe | 32 | Array | 353.931 ns | 3.2577 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 32 | Array | 173.172 ns | 3.3340 ns | 0.49 |
| MinFloat | \main\corerun.exe | 32 | Enumerable | 219.598 ns | 3.2260 ns | 1.00 |
| MinFloat | \pr\corerun.exe | 32 | Enumerable | 201.268 ns | 2.5158 ns | 0.92 |
| MinDecimal | \main\corerun.exe | 32 | Enumerable | 401.314 ns | 6.9631 ns | 1.00 |
| MinDecimal | \pr\corerun.exe | 32 | Enumerable | 396.958 ns | 4.2611 ns | 0.99 |
| SumInt32 | \main\corerun.exe | 32 | Enumerable | 130.008 ns | 2.4007 ns | 1.00 |
| SumInt32 | \pr\corerun.exe | 32 | Enumerable | 133.748 ns | 2.4726 ns | 1.03 |
| SumInt64 | \main\corerun.exe | 32 | Enumerable | 176.430 ns | 3.4085 ns | 1.00 |
| SumInt64 | \pr\corerun.exe | 32 | Enumerable | 175.259 ns | 3.0756 ns | 0.99 |
| SumFloat | \main\corerun.exe | 32 | Enumerable | 195.938 ns | 2.9538 ns | 1.00 |
| SumFloat | \pr\corerun.exe | 32 | Enumerable | 190.320 ns | 3.0678 ns | 0.97 |
| SumDecimal | \main\corerun.exe | 32 | Enumerable | 382.905 ns | 3.4056 ns | 1.00 |
| SumDecimal | \pr\corerun.exe | 32 | Enumerable | 384.330 ns | 4.6222 ns | 1.00 |
| AverageInt32 | \main\corerun.exe | 32 | Enumerable | 132.928 ns | 2.5226 ns | 1.00 |
| AverageInt32 | \pr\corerun.exe | 32 | Enumerable | 143.767 ns | 2.4111 ns | 1.08 |
| AverageInt64 | \main\corerun.exe | 32 | Enumerable | 181.207 ns | 1.6801 ns | 1.00 |
| AverageInt64 | \pr\corerun.exe | 32 | Enumerable | 177.174 ns | 2.1069 ns | 0.98 |
| AverageFloat | \main\corerun.exe | 32 | Enumerable | 203.800 ns | 2.0559 ns | 1.00 |
| AverageFloat | \pr\corerun.exe | 32 | Enumerable | 201.102 ns | 3.2647 ns | 0.99 |
| AverageDecimal | \main\corerun.exe | 32 | Enumerable | 497.159 ns | 1.4778 ns | 1.00 |
| AverageDecimal | \pr\corerun.exe | 32 | Enumerable | 492.052 ns | 2.3135 ns | 0.99 |
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(2, 32)]
    public int Length { get; set; }

    [Params("Array", "Enumerable")]
    public string Mode { get; set; }

    private IEnumerable<int> _ints;
    private IEnumerable<long> _longs;
    private IEnumerable<float> _floats;
    private IEnumerable<decimal> _decimals;

    [Benchmark] public float MinFloat() => _floats.Min();
    [Benchmark] public decimal MinDecimal() => _decimals.Min();

    [Benchmark] public int SumInt32() => _ints.Sum();
    [Benchmark] public long SumInt64() => _longs.Sum();
    [Benchmark] public float SumFloat() => _floats.Sum();
    [Benchmark] public decimal SumDecimal() => _decimals.Sum();

    [Benchmark] public double AverageInt32() => _ints.Average();
    [Benchmark] public double AverageInt64() => _longs.Average();
    [Benchmark] public float AverageFloat() => _floats.Average();
    [Benchmark] public decimal AverageDecimal() => _decimals.Average();

    [GlobalSetup]
    public void Setup()
    {
        _ints = Enumerable.Range(1, Length);
        if (Mode == "Array") _ints = _ints.ToArray();

        _longs = Enumerable.Range(1, Length).Select(i => (long)i);
        if (Mode == "Array") _longs = _longs.ToArray();

        _floats = Enumerable.Range(1, Length).Select(i => (float)i);
        if (Mode == "Array") _floats = _floats.ToArray();

        _decimals = Enumerable.Range(1, Length).Select(i => (decimal)i);
        if (Mode == "Array") _decimals = _decimals.ToArray();
    }
}

@ghost

ghost commented Feb 1, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

> @eiriktsarpalis, please let me know in general if you're ok with the extra code here for this special-casing.

Should be ok, but per your own comment in #64470 (comment) it would be nice if the code were moved out of Linq eventually.

@tannergooding
Member

tannergooding commented Feb 1, 2022

> I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that? Also, it seemed like it should be possible to vectorize some of the methods doing checked arithmetic, but I wasn't sure how to do so correctly and skipped those. It also seemed like it should be possible to vectorize min/max for floats/doubles, but they have special-handling of NaN I couldn't figure out how to replicate with Vector.

@stephentoub right. There are a number of cases where we cannot trivially vectorize float/double (particularly operations that compute a new value like Sum) because it can drastically change the output and will lead to determinism issues among other bugs (put another way, I think we should care a lot and not do this for float/double).

The simplest scenario to explain is that with integers the delta between one representable value and the next is always 1 (ignoring overflow for a minute). Whereas with floating-point this starts at epsilon (the smallest representable value greater than zero) and then doubles every power of 2 after that. For float this means that from 2^22 to 2^23 the delta between values is 0.5, from 2^23 to 2^24 it is 1, from 2^24 to 2^25 it is 2, from 2^25 to 2^26 it is 4, and so on (in both directions: going down through the negative powers until you hit epsilon, and going up until you hit float.MaxValue).

What this means is that if you take a case like new float[] { 16777216.0f, 0.5f, 0.5f, 0.5f } and sum it left to right, you get back 16777216.0f, because 2^24 + 0.5 always returns 2^24 since 0.5 is less than half the delta between values (2). However, if you reorder this to new float[] { 0.5f, 0.5f, 0.5f, 16777216.0f }, you instead get back 16777218.0f, because 0.5 + 0.5 + 0.5 accumulates to 1.5, which (due to rounding) is enough to round up to 16777218.
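
A tiny standalone illustration of that reordering (hedged: this uses a plain float accumulator to isolate the ordering effect; Enumerable.Sum for float accumulates in double, so it won't reproduce exactly these numbers):

    static float SumLeftToRight(float[] values)
    {
        float sum = 0f; // plain float accumulation, purely to demonstrate ordering
        foreach (float v in values) sum += v;
        return sum;
    }

    // SumLeftToRight(new[] { 16777216.0f, 0.5f, 0.5f, 0.5f }) == 16777216f (each 0.5 is lost)
    // SumLeftToRight(new[] { 0.5f, 0.5f, 0.5f, 16777216.0f }) == 16777218f (the 0.5s reach 1.5 first)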

Vectorization is impacted here because element n ends up being accumulated with element n + Count (and n + 2*Count, and so on) in the same lane, with the per-lane partial sums only added "across" at the end, which changes the order of the additions. There have been open asks for several years to provide a "fast math" switch that would allow such optimizations, but it's non-trivial to achieve. The easiest thing would be to expose a new overload that lets users explicitly opt into such differences.
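
To make the lane accumulation concrete, here is a rough sketch of the pattern being described, assuming System.Numerics.Vector and the Vector.Sum helper (an illustrative shape, not this PR's code):

    using System;
    using System.Numerics;

    static float VectorizedSum(ReadOnlySpan<float> span)
    {
        float result = 0f;
        int i = 0;

        if (Vector.IsHardwareAccelerated && span.Length >= Vector<float>.Count)
        {
            Vector<float> sums = Vector<float>.Zero;
            do
            {
                // Each lane accumulates every Count-th element, so the partial sums interleave.
                sums += new Vector<float>(span.Slice(i));
                i += Vector<float>.Count;
            }
            while (i <= span.Length - Vector<float>.Count);

            // The "add across" at the end combines lanes in a different order
            // than a sequential left-to-right sum would.
            result = Vector.Sum(sums);
        }

        for (; i < span.Length; i++) // scalar tail
        {
            result += span[i];
        }

        return result;
    }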


Things that do not compute a "new value", like Min/Max (which only do a comparison and return one of the inputs), can be vectorized, but the implementation needs to take into account things like NaN handling, as you indicated.

For overflow checking, it can be simple when you know the inputs are all positive or all negative. It's not as simple when they can be an arbitrary mix of positive and negative values.

@stephentoub
Member Author

> I assume the use of vectorization for floats/doubles could change the answer in some cases due to lack of associativity, yes? Thoughts on how much we should care about that?

> I think we should care a lot and not do this for float/double

Ok, deleting.

Vector<long> sums = default;
do
{
    Vector.Widen(new Vector<int>(span.Slice(i)), out Vector<long> low, out Vector<long> high);
Member

Are the bounds checks actually getting elided here for new Vector<int>(span.Slice(i))? This seems like it's going to be doing a bunch of extra work each iteration.

stephentoub (Member Author) commented Feb 4, 2022

I don't believe so. But I also didn't want to introduce unsafe code. Is there a way to write it that's "safe" (i.e. if I made a mistake it would result in an exception rather than potential corruption / security problems) and that avoids the bounds checks? My assumption is that even if I eliminate the checks in Slice by using a loop pattern the JIT recognizes, Vector itself will still be doing a length check on the length of the span supplied, and that won't be removed (or will it)?

Member

Some other places in the runtime use a helper built around Unsafe.ReadUnaligned and then the xplat helpers now have Vector128.LoadUnsafe(ref T, nuint index).

There's never really going to be a "safe" way to do this unless the JIT gets special support, however. You functionally have some T and want to read 2-32 of that T (depending on what the actual type is). So the best you'll generally get is doing the right checks up front and potentially adding some asserts. Otherwise, you pay the cost of slicing, copying the span, and doing the relevant bounds checks each iteration.
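
For reference, a hedged sketch of that kind of unchecked load (an assumed shape; the runtime's internal helpers differ in detail, and the caller must validate the offset up front):

    using System;
    using System.Numerics;
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;

    static Vector<int> LoadVector(ReadOnlySpan<int> span, int offset)
    {
        // Caller guarantees offset + Vector<int>.Count <= span.Length; this trades
        // the per-iteration Slice/bounds checks for an unchecked unaligned read.
        ref int start = ref Unsafe.Add(ref MemoryMarshal.GetReference(span), offset);
        return Unsafe.ReadUnaligned<Vector<int>>(ref Unsafe.As<int, byte>(ref start));
    }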

Member Author

I'm fine with that in this code. If these helpers are ever moved into MemoryExtensions, we can go to town on optimizing the heck out of them, both by avoiding bounds checks and by adding whatever additional paths are necessary to get the best perf. For LINQ, I think this is good enough, and there's benefit to not proliferating unsafe code here.

ghost locked as resolved and limited conversation to collaborators on Mar 7, 2022