Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] 230 kB size regression in Hello World from Enum changes #79437

Closed
MichalStrehovsky opened this issue Dec 9, 2022 · 20 comments
Closed

Comments

@MichalStrehovsky
Copy link
Member

MichalStrehovsky commented Dec 9, 2022

230 kB is about 8% of the hello world.

Caused by the enum changes in #78580.

Map files before/after:

NAotHello.map.txt
NAotHello.map.txt

There's no clear culprit, it's a death by thousand papercuts. There are some odd things in the diff, like why do we need so many new delegates, arraysorthelpers, System.Half, System.Int128, etc.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 9, 2022
@ghost
Copy link

ghost commented Dec 9, 2022

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

130 kB is about 8% of the hello world.

Caused by the enum changes in #78580.

Map files before/after:

NAotHello.map.txt
NAotHello.map.txt

There's no clear culprit, it's a death by thousand papercuts. There are some odd things in the diff, like why do we need so many new delegates, arraysorthelpers, System.Half, System.Int128, etc.

Author: MichalStrehovsky
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@MichalStrehovsky MichalStrehovsky changed the title [NativeAOT] 130 kB size regression in Hello World from Enum changes [NativeAOT] 230 kB size regression in Hello World from Enum changes Dec 9, 2022
@stephentoub
Copy link
Member

stephentoub commented Dec 9, 2022

@MichalStrehovsky, would it be relatively straightforward for you to determine how much of this increase is due to:

  1. The changes in Enum itself to pivot to using EnumInfo<TUnderlyingType> rather than storing everything as a long
  2. The special-casing of typeof(T).IsEnum in the interpolated string handlers, e.g.
    if (typeof(T).IsEnum)
    {
    int charsWritten;
    while (!Enum.TryFormatUnconstrained(value, _chars.Slice(_pos), out charsWritten))
    {
    Grow();
    }
    _pos += charsWritten;
    return;
    }
  3. Making Enum implement ISpanFormattable:
    public abstract partial class Enum : ValueType, IComparable, ISpanFormattable, IConvertible

?

If it turns out the size increase in both NativeAOT and mono AOT are due to (2) and/or (3), I can temporarily back those out until we come up with a solution.

There are some odd things in the diff, like why do we need so many new delegates, arraysorthelpers, System.Half, System.Int128, etc.

Yeah, I don't know how the enum changes would have impacted that, though they obviously did.

@jkotas
Copy link
Member

jkotas commented Dec 9, 2022

arraysorthelpers

Each Array.Sort instantiation costs about 10kB of AOT binary footprint. The change introduced 13 Array.Sort instantiations, so that explains more than half of the regression.

System.Half, System.Int128,

The numeric interface methods have less than ideal behavior with trimming. If a numeric interface method is used on one type, it gets preserved on all types that implement. It explains the new methods on System.Half, System.Int128.

@stephentoub
Copy link
Member

Each Array.Sort instantiation costs about 10kB of AOT binary footprint. The change introduced 13 Array.Sort instantiations, so that explains more than half of the regression.

I see. Well native aot doesn't support 5 of those (right?). We could trivially ifdef out the types not expressible in C#. That'd eliminate 40% of that. We could pay the one-time-per-enum costs to use the non-generic sort to address the rest, at least for aot?

@stephentoub
Copy link
Member

The numeric interface methods have less than ideal behavior with trimming. If a numeric interface method is used on one type, it gets preserved on all types that implement. It explains the new methods on System.Half, System.Int128.

I can change the implementations to not use the generic interfaces and just special-case everything to each of the underlying types, but that's very far from ideal, and these instantiation will all come right back the moment someone else in the app uses them. What would you recommend?

@agocke
Copy link
Member

agocke commented Dec 9, 2022

I think there were some significant improvements in trimming added to ILLink recently. @jtschuster would know more about whether they would work in these cases.

If we can improve trimming such that these interfaces are removable when unused I think that's probably the right direction.

@marek-safar
Copy link
Contributor

@agocke the linker is not used for code trimming for nativeaot

@MichalStrehovsky
Copy link
Member Author

@MichalStrehovsky, would it be relatively straightforward for you to determine how much of this increase is due to:

It's not immediately obvious from the logs in the top post and actually trying it would involve manually backing out pieces of the change and recompiling. Not keen on doing that.

System.Half, System.Int128,

The numeric interface methods have less than ideal behavior with trimming. If a numeric interface method is used on one type, it gets preserved on all types that implement. It explains the new methods on System.Half, System.Int128.

I started drilling into this part - the numeric interfaces add about 29 kB of the total cost. I think it's doable to get rid of that in the compiler. The problem is patterns like this:

else if (typeof(TOther) == typeof(Half))
{
Half actualResult = value;
result = (TOther)(object)actualResult;
return true;
}

This brings a boxed Half into the system. RyuJIT can already eliminate the entire branch (because we compile instantiated code), but we run the compilation in two phases (the second one is RyuJIT) and the first phase ends up unnecessarily rooting a boxed Half. There are two ways to approach this: either remove unreachable branches in the first phase (we can do it without the whole program view we're building because this scans instantiated code too), or make it so that we don't pass the rooted Half to the second phase. I already have a working prototype for the former, but looking into the latter as well because it might be more generally applicable.

That will get rid of the 29 kB but there's still plenty left.

@stephentoub
Copy link
Member

@agocke
Copy link
Member

agocke commented Dec 12, 2022

@agocke the linker is not used for code trimming for nativeaot

Right, we would implement the same thing @jtschuster did for Native AOT, or more advanced if necessary.

However, that typeof pattern is tricky. Branch elimination was not what I had in mind.

@MichalStrehovsky
Copy link
Member Author

In case you change your mind:

Thanks, I'll take a look at that tomorrow.

In the meantime, #79594 gets rid of the costs associated with Half/Int128/UInt128 (29 kB) and a little bit extra on top.

@MichalStrehovsky
Copy link
Member Author

With:

We're down to a ~50 kB regression (2%).

Here's the MAP files (diffable with windiff):

I think we could get rid of a good chunk if we could do something about the sorting. We now sort the enums with the object-based Array.Sort, but that means it brings code to compare DateTime, decimal, strings and a bunch of other stuff. Commenting out the Array.Sort saves another 24 kB. If we could save that, we could close this issue.

@stephentoub
Copy link
Member

stephentoub commented Dec 14, 2022

I think we could get rid of a good chunk if we could do something about the sorting

What if we just used a simple but O(n^2) sort like insertion sort, banking on enum's being relatively small in size in general? If we went back to it being generic, such that there were still eight copies, but it was small, I wonder if that would suffice. We could try calling

private static void InsertionSort(Span<TKey> keys, Span<TValue> values)
?

@jkotas
Copy link
Member

jkotas commented Dec 14, 2022

We now sort the enums with the object-based Array.Sort, but that means it brings code to compare DateTime, decimal, strings and a bunch of other stuff.

We used ulong-based sort before. How did ulong-based sort avoided bringing in all this stuff?

@MichalStrehovsky
Copy link
Member Author

MichalStrehovsky commented Dec 14, 2022

We now sort the enums with the object-based Array.Sort, but that means it brings code to compare DateTime, decimal, strings and a bunch of other stuff.

We used ulong-based sort before. How did ulong-based sort avoided bringing in all this stuff?

Here's the relevant chunk from the WhyDgml output:

        (Secondary) VirtualMethodUse [S.P.CoreLib]System.IComparable.CompareTo(object)
          (Interface method use) __InterfaceDispatchCell_S_P_CoreLib_System_IComparable__CompareTo
            (callvirt) S_P_CoreLib_System_Collections_Comparer__Compare
              (Instance method on a constructed type) (Tentative instance method: S_P_CoreLib_System_Collections_Comparer__Compare, ??_7S_P_CoreLib_System_Collections_Comparer@@6B@ constructed)
                (Primary) Tentative instance method: S_P_CoreLib_System_Collections_Comparer__Compare
                  (callvirt) S_P_CoreLib_System_Collections_Generic_ObjectComparer_1<System___Canon>__Compare
                    (Virtual method) (??_7S_P_CoreLib_System_Collections_Generic_ObjectComparer_1<System___Canon>@@6B@, VirtualMethodUse [S.P.CoreLib]System.Collections.Generic.Comparer`1<System.__Canon>.Compare(__Canon,__Canon))
                      (Primary) ??_7S_P_CoreLib_System_Collections_Generic_ObjectComparer_1<System___Canon>@@6B@
                        (Template MethodTable) NativeLayoutTemplateTypeLayoutVertexNode_S_P_CoreLib_System_Collections_Generic_ObjectComparer_1<T_System___Canon>
                          (Generic comparer) S_P_CoreLib_System_Collections_Generic_Comparer_1<System___Canon>__Create
                            (call) S_P_CoreLib_System_Collections_Generic_Comparer_1<System___Canon>__get_Default
                              (call) S_P_CoreLib_System_Collections_Generic_ArraySortHelper_2<System___Canon__System___Canon>__Sort

The default comparers are special cased by the compiler. If we're in unshared code, we know exactly which comparer to use. If we're in shared generic code, ObjectComparer it is and that one uses IComparable.CompareTo, bringing comparison functionality for everything that was ever allocated or boxed in the program.

@kg
Copy link
Member

kg commented Dec 20, 2022

I think we could get rid of a good chunk if we could do something about the sorting

What if we just used a simple but O(n^2) sort like insertion sort, banking on enum's being relatively small in size in general? If we went back to it being generic, such that there were still eight copies, but it was small, I wonder if that would suffice. We could try calling

What's "relatively small" in this context? For example, just searching some of the C# projects I've recently touched on my machine, I have an enum of wasm opcodes that's got about 250 values in it (and would have more if I added further extensions). I've got a D3DFORMAT enum with upwards of 50 entries by my count. I also see a bunch of mid-to-large-sized enums in libraries like SDL2, mojoshader, sharpfont, etc. How badly would this stuff degrade for a large enum? Is there a path for end users to work around the bad sort? I don't have a good intuition for how bad insertion sort would be for an enum with 250 values in it, but I suspect regular production software has enums big enough to get into the Trouble Zone for O(N^2) unless O is very fast or this code path is only hit for obscure scenarios.

@stephentoub
Copy link
Member

stephentoub commented Dec 20, 2022

I have an enum of wasm opcodes that's got about 250 values in it ... How badly would this stuff degrade for a large enum?

It would depend on how close to sorted the values were. Here, for 250 values, you can see cases where the data is already perfectly sorted (Mode 0), where the first and last elements are swapped (Mode 1), and where the elements are entirely reversed (Mode 2). If the data is already sorted, insertion sort is typically really fast. If it's mostly sorted, it's reasonably on par. If it's not at all sorted, it's an order of magnitude slower.

Method Mode Mean
IntroSort 0 2.233 us
InsertionSort 0 1.174 us
IntroSort 1 2.281 us
InsertionSort 1 2.724 us
IntroSort 2 4.639 us
InsertionSort 2 98.910 us
Benchmark
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Linq;
using System.Numerics;

[MemoryDiagnoser]
public partial class Program
{
    private string[] _names = Enumerable.Range(1, 250).Select(i => $"Value{i}").ToArray();
    private int[] _values = Enumerable.Range(1, 250).ToArray();

    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(0, 1, 2)]
    public int Mode { get; set; }

    private void HandleMode()
    {
        switch (Mode)
        {
            case 0:
                break;
            case 1:
                (_values[0], _values[^1]) = (_values[^1], _values[0]);
                break;
            case 2:
                Array.Reverse(_values);
                break;
        }
    }

    [Benchmark]
    public void IntroSort()
    {
        HandleMode();
        Array.Sort(_values, _names);
    }

    [Benchmark]
    public void InsertionSort()
    {
        HandleMode();
        InsertionSort(_values, _names);
    }

    private static void InsertionSort<TKey, TValue>(TKey[] keys, TValue[] values) where TKey : IComparisonOperators<TKey, TKey, bool>
    {
        for (int i = 0; i < keys.Length - 1; i++)
        {
            TKey t = keys[i + 1];
            TValue tValue = values[i + 1];

            int j = i;
            while (j >= 0 && (t == null || t < keys[j]))
            {
                keys[j + 1] = keys[j];
                values[j + 1] = values[j];
                j--;
            }

            keys[j + 1] = t!;
            values[j + 1] = tValue;
        }
    }
}

But it sounds like my question/thought is mostly moot anyway as it seems Michal has a solution that fixes the problem with a custom comparer.

@stephentoub
Copy link
Member

We now sort the enums with the object-based Array.Sort, but that means it brings code to compare DateTime, decimal, strings and a bunch of other stuff. Commenting out the Array.Sort saves another 24 kB. If we could save that, we could close this issue.

@MichalStrehovsky, is there still more to do here, or does #79845 address this?

@MichalStrehovsky
Copy link
Member Author

We now sort the enums with the object-based Array.Sort, but that means it brings code to compare DateTime, decimal, strings and a bunch of other stuff. Commenting out the Array.Sort saves another 24 kB. If we could save that, we could close this issue.

@MichalStrehovsky, is there still more to do here, or does #79845 address this?

#79594 mentioned in the calculations above recovers 30 kB of the regression but didn't merge yet.

@MichalStrehovsky
Copy link
Member Author

All of the pull requests have been merged.

@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants