Skip to content

Conversation

@Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Sep 2, 2020

fixes #41572

This removes an embedded resource binary file along with the support code needed to load it safely and atomically. It is replaced with an enormous readonly span which is inlined into the data segment by the compiler. Doing this allows the struct to be safe, not need a ctor and for the static Instance method to be default(T), the entire thing could probably be static

The changes are split into two commits. The first has the code changes, review this. The second only contains the span which was added to the bottom of the file so the contents can be seen casually without forcing a scroll through almost 4000 lines of code. The span code was generated from the binary in the project, review it only if you like hex.

/cc @RUSshy

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Sep 2, 2020

Tagging subscribers to this area: @buyaa-n, @krwq
See info in area-owners.md if you want to be subscribed.

@tarekgh tarekgh requested a review from buyaa-n September 2, 2020 20:59
@tarekgh
Copy link
Member

tarekgh commented Sep 2, 2020

        /* 000 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x11, 0x11, 0x00, 0x00, 0x11, 0x00, 0x00,

nit: could you please have the numbers inside the comment in hex instead as the table is going with 16 elements in row? and if you can add a column header like

// 0 1 2 3 ... A B C D E F

just to make it easy to lookup any character


Refers to: src/libraries/System.Private.Xml/src/System/Xml/XmlCharType.cs:233 in bfd86e5. [](commit_id = bfd86e5, deletion_comment = False)

@tarekgh
Copy link
Member

tarekgh commented Sep 2, 2020

@Wraith2 thanks for making this change. LGTM. I assume you have converted the bin file directly to the span without changing any value. I looked at some random rows and I am not seeing any change.

@krwq @buyaa-n maybe the next step here is to review https://www.w3.org/TR/xml/ and ensure this data we carry is conforming to the latest Xml specs.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 2, 2020

I assume you have converted the bin file directly to the span without changing any value. I looked at some random rows and I am not seeing any change

Yes. I wrote a small program to read the file 16 bytes at a time and output the bytes in code format. No space for human error.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 2, 2020

reformatted, example:

        private static ReadOnlySpan<byte> s_charProperties => new byte[]
        {
            //            0,    1,    2,    3,    4,    5,    6,    7,    8,    9,    A,    B,    C,    D,    E,    F
            /* 0000 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x11, 0x11, 0x00, 0x00, 0x11, 0x00, 0x00,
            /* 0010 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,

Trying to anything with the file that involves a format change will max the compiler for a while so be careful.

@tarekgh
Copy link
Member

tarekgh commented Sep 2, 2020

Trying to anything with the file that involves a format change will max the compiler for a while so be careful.

Maybe later we can try to convert this table into a double lookup table (if not affecting the perf much) which may make the tables much smaller. This will need some time to do perf measurement and generate the optimized tables.
currently having a 64k table is really big and I admire the compiler it can handle it.

@danmoseley
Copy link
Member

maybe the next step here is to review https://www.w3.org/TR/xml/ and ensure this data we carry is conforming to the latest Xml specs.

@Wraith2 is that something you're interested in doing? Or maybe @tarekgh did not mean that should hold up this PR.

@tarekgh
Copy link
Member

tarekgh commented Sep 3, 2020

Or maybe @tarekgh did not mean that should hold up this PR.

This should not be a blocker for this PR. @Wraith2 is not changing anything from what we already have. my comment was about to review the data we have and ensure it is conforming to the latest Xml spec just to ensure we are up to date with the spec.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 3, 2020

@Wraith2 is that something you're interested in doing?

Not really no... Reviewing 64K of byte values individually while the compiler struggles to parse the data sounds like a punishment for both my computer and I. 😁

@@ -283,46 +228,4106 @@ private static bool InRange(int value, int start, int end)
return (uint)(value - start) <= (uint)(end - start);
}

#if XMLCHARTYPE_GEN_RESOURCE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you leave the generating code in DEBUG and perhaps add a Debug.Assert comparing this array with what is generated? (just make sure it runs only once) In case we ever need to change this for some reason this won't be some magic array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps might be better to just always generate this array, not sure if we prefer initialization time vs size. what do you think @tarekgh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generation code doesn't work. I started trying to use it but it's leftover from when the author of the current format was working with an older version of the data so it isn't useful. I ended up dumping the contents of the binary file rather than trying to work from that snippet. If it had been useful I would have kept it but it isn't useful currently.

I asked in discord if people thought this was a good use for a source generator and the answer was that it's static data so it doesn't need generating each time. In either case static vs generated the time is going to be spent parsing it and that's unavoidable without going back to the resource load.

@tarekgh
Copy link
Member

tarekgh commented Sep 3, 2020

Not really no... Reviewing 64K of byte values individually while the compiler struggles to parse the data sounds like a punishment for both my computer and I. 😁

@Wraith2 you can review everyday just 1k and we'll be done in couple of months :-) just kidding. thanks again for getting this change.

CC @jkotas @davidwrighton if they have any concern having 64k readonly span data like that.

@jkotas
Copy link
Member

jkotas commented Sep 3, 2020

This is modifying perf critical XML parsing code. Do we have any XML parsing microbenchmark to check the impact?

@tarekgh
Copy link
Member

tarekgh commented Sep 3, 2020

aside from the perf, I like the change because it allow checking any character properties by just looking at the array value. but in general, I agree, having perf numbers is better here.

@davidwrighton
Copy link
Member

I looked at this, and I have no concerns about having a large ReadOnlySpan<byte> used here. However, I do think we should measure the performance of this, as this might be slower than the previous implementation. I doubt it, but I've been unpleasantly surprised before.

There are two possibilities for reduced performance that I can see here.

  1. Accessing the span may be injecting a bounds check. In this case bounds check isn't actually needed, as the data is sufficiently large, but we might be missing that optimization in this case.
  2. Accessing the span as embedded data requires embedding a pointer into the code stream of the jit. Reading that pointer from the code stream may be slower than reading a field from the struct. I don't expect that this is true, but on some platforms (notably arm64) it requires 2 instructions to embed a pointer where the field reading logic is a single instruction, and on intel architecture platforms, the field reading instruction is slightly bigger. OTOH, reading a field is an extra memory load, which is generally considered to be more expensive to run than a typical instruction.

Now, given our system, I don't expect this to be slower, I actually expect the lack of an extra memory read to make this all slightly faster than the previous behavior.

@jkotas
Copy link
Member

jkotas commented Sep 6, 2020

Here is a small targeted test:

using System;
using System.IO;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var ms = new MemoryStream();

        using (var w = XmlWriter.Create(ms))
        {
            w.WriteStartDocument();
            w.WriteStartElement("OutterElement");
            for (int i = 0; i < 10000; i++)
            {
                w.WriteStartElement("InnerElement");
                w.WriteEndElement();
            }
            w.WriteEndElement();
            w.WriteEndDocument();
        }

        for (; ; )
        {
            int start = Environment.TickCount;
            for (int i = 0; i < 1000; i++)
            {
                ms.Position = 0;
                using (var r = XmlReader.Create(ms))
                {
                    while (r.Read())
                    {
                    }
                }
            }
            int end = Environment.TickCount;
            Console.WriteLine(end - start);
        }
    }
}

It runs about ~900ms per iteration before this change and ~1200ms per iteration after this change.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 6, 2020

I was going to work on a test this morning but thanks for doing that.

I don't know how to determine what the speed change is caused by to see if it can be mitigated. With those results I don't believe it would be an acceptable regression so either I close or look at flowing the instance/static changes mentioned above out to see if the ability to call and possible inline static methods recovers the perf. I'd like to be guided by knowledge on that one though so if you can give me some info on what tools and methods you'd use to diagnose the speed difference that would be good.

@jkotas
Copy link
Member

jkotas commented Sep 6, 2020

I don't know how to determine what the speed change is caused by

You can use profiler such as perfview to get the idea what methods to look at. This type of change will require looking at the disassembly to identify the bottlenecks. If it is not your expertise, there are number of people with this expertise hanging around and they may be able to help you if you ask.

I took a cursory look when I wrote the test and the first problem I have noticed was a lack of inlining for XmlCharType.IsNCNameSingleChar call from XmlTextReaderImpl.ParseElement, so part of the fix is likely going to be to add AggressiveInlining on the XmlCharType accessors. It will fix some of the regression, but I suspect that it won't be enough.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 6, 2020

I've got a local build of runtime that includes the change but I can't get BDN to use it at the moment, it seems to be ignoring the --coreRun argument entirely.

I used sharplab to have a look at the il and asm that is generated for the IsWhitespace call and it was surprising and given the length I'm not surprised it won't inline. If I can get to a position where I can replicate your bench I can try making it a ref struct and passing the span is as a ctor arg the way it was done previously which should cut down iteration cost.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 7, 2020

I managed to replicate the results. The ref struct idea won't work because multiple uses of it are made as class fields and changing those to be fetched each time will further increase overhead. The only alternative I can see is to somehow keep a reference to the ROS which will require unsafe shenanigans.

@tannergooding
Copy link
Member

Just commenting here what I replied with in the Discord .NET Evolution #runtime channel...

It looks like the JIT currently fails to elide the bounds checks for SomeSpan[ushort] or SomeSpan[char] when the length of the span is known to be >= ushort.MaxValue (as is the case here).
Utilizing Unsafe.AddByteOffset(ref MemoryMarshal.GetReference(ROSpan), nuint) will elide the bounds check and gives codegen similar to the following:

public static unsafe bool IsWhiteSpace2(char c) {
    return (Unsafe.AddByteOffset(ref MemoryMarshal.GetReference(s_charProperties), (nuint)c) & 0x01) == 0;
}
    L0000: movzx eax, cx
    L0003: mov rdx, 0x2196faa09e8
    L000d: test byte ptr [rax+rdx], 1
    L0011: sete al
    L0014: movzx eax, al
    L0017: ret

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 7, 2020

Thanks @tannergooding, I couldn't find the AddByteOffset method overload with nuint so I went with the original suggestion of Unsafe.Add(ref MemoryMarshal.GetReference(CharProperties), ch). I also added readonly to the various instance methods and marked them with aggressive inline. The results of that are:

Method Mean Error StdDev
Master 1.406 ms 0.0052 ms 0.0048 ms
Branch 1.192 ms 0.0038 ms 0.0034 ms

@krwq and @tarekgh can you re-review please?

extract unsafe code to single function and force inline for perf
@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

Method Mean Error StdDev
Master 1.406 ms 0.0052 ms 0.0048 ms
Branch 1.192 ms 0.0038 ms 0.0034 ms

Are these numbers for the test above? They look too good to be true to me. I am not able to replicate these results on my machine.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Sep 8, 2020

Refactoring makes only a small difference.

Benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.IO;
using System.Xml;

namespace BenchmarkXmlChartype
{
    class Program
    {
        static void Main(string[] args)
        {
            BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
        }
	}

    public class Benchmarks
    {
        [Benchmark]
        public int Simple()
        {
            var ms = new MemoryStream();

            using (var w = XmlWriter.Create(ms))
            {
                w.WriteStartDocument();
                w.WriteStartElement("OuterElement");
                for (int i = 0; i < 10000; i++)
                {
                    w.WriteStartElement("InnerElement");
                    w.WriteEndElement();
                }
                w.WriteEndElement();
                w.WriteEndDocument();
            }

            int readCount = 0;

            ms.Position = 0;
            using (var r = XmlReader.Create(ms))
            {
                while (r.Read())
                {
                    readCount += 1;
                }
            }
            return readCount;
        }
    }
}

Results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.450 (2004/?/20H1)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake), 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.8.20362.3
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.36102, CoreFX 5.0.20.36102), X64 RyuJIT
  Job-YRJNPI : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Toolchain=CoreRun  
Method Mean Error StdDev
Simple 1.198 ms 0.0085 ms 0.0079 ms

@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

@tarekgh @krwq @buyaa-n Could you please validate the performance impact of this change?

I see about 5% regression on my machine. It would be useful to get numbers for different environments.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

It looks good to me otherwise - as long as we convince ourselves that we are happy with the performance.

@tarekgh
Copy link
Member

tarekgh commented Sep 8, 2020

@jkotas I'll give it a try on my machine later today. I am going to use the same test used by @Wraith2 he posted in #41756 (comment)

@buyaa-n
Copy link
Contributor

buyaa-n commented Sep 8, 2020

I have tried with the existing benchmark we have. We don't have benchmarks exactly for reader/writer, XmlDocument.LoadXml could show the reader perf some level. But I don't understand why XmlNode.GetAttributes having so high regression. I might add some benchmarks for reader/writer and check the result.

Slower diff/base Base Median (ns) Diff Median (ns) Modality
XmlDocumentTests.XmlNodeTests.Perf_XmlNode.GetAttributes 1.66 0.64 1.07
XmlDocumentTests.XmlDocumentTests.Perf_XmlDocument.LoadXml 1.07 2997.90 3203.79
XmlDocumentTests.XmlNodeListTests.Perf_XmlNodeList.GetCount 1.07 10.22 10.92

@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

I am going to use the same test used by @Wraith2 he posted in #41756

Ah, I have not looked carefully at this. You may want to look at the benchmark profile to see whether it is dominated by code paths affected by this change. The test that I have used measured just the Read loop and I have verified that the Read loop is dominated by paths affected by this change.

@tarekgh
Copy link
Member

tarekgh commented Sep 8, 2020

We may write the perf test to split the read and the write and get actual numbers for both operations. Of course we'll need to validate the affected code path is used in both cases or not.

@tarekgh
Copy link
Member

tarekgh commented Sep 9, 2020

I wrote a test to test the read and write operation separately and I am seeing improvement in the reading cases and around 2% regression in the writing.

Base

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
TestWriter 99,978.386 ns 2,052.2271 ns 5,855.1195 ns 1.7090 - - 7344 B
TestReader 105,172.380 ns 3,134.4095 ns 9,043.4880 ns 3.5400 - - 15232 B

Change

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
TestWriter 101,930.709 ns 2,010.0744 ns 4,618.4802 ns 1.7090 - - 7328 B
TestReader 102,153.794 ns 2,040.5844 ns 4,928.2406 ns 3.5400 - - 15224 B

I confirmed both reading and writing scenario hitting the code path we changed. Here is examples of the call stacks from the ETW logs:

// Writing stack

 system.private.xml.il!XmlWellFormedWriter.CheckNCName
+ system.private.xml.il!XmlWellFormedWriter.WriteStartElement
 + baseline!Baseline.Program.TestWriter()
  + 669d4deb-b062-4341-8b68-dbdb19ad7888!Runnable_0.WorkloadActionUnroll
   + benchmarkdotnet!Engine.RunIteration
   + benchmarkdotnet!Engine.GetExtraStats


// Reading stack
system.private.xml.il!System.Xml.XmlTextReaderImpl.ParseElement()
+ system.private.xml.il!XmlTextReaderImpl.ParseElementContent
\|+ baseline!Baseline.Program.TestReader()
\|+ system.private.xml.il!System.Xml.XmlTextReaderImpl.Read()
+ system.private.xml.il!XmlTextReaderImpl.ParseDocumentContent
+ baseline!Baseline.Program.TestReader()
+ system.private.xml.il!System.Xml.XmlTextReaderImpl.Read()

both CheckNCName and ParseElement calls XmlCharType we are changing here.

Here is test code I used:

        [Benchmark] public int TestWriter()
        {
            s_msWriter.Position = 0;
            using (var w = XmlWriter.Create(s_msWriter))
            {
                w.WriteStartDocument();
                w.WriteStartElement("OutterElement");
                for (int i = 0; i < 1000; i++)
                {
                    w.WriteStartElement("InnerElement");
                    w.WriteEndElement();
                }
                w.WriteEndElement();
                w.WriteEndDocument();
            }
            return 10;
        }

        [Benchmark] public int TestReader()
        {
            s_msReader.Position = 0;
            using (var r = XmlReader.Create(s_msReader))
            {
                while (r.Read())
                {
                }
            }
            return 20;
        }

s_msWriter and s_msReader are pre-initialized streams. The reader memory is also initialized with the Xml contents (similar to what we wrote in the writer case).

@jkotas I believe the changes is reasonable to accept from perf point of view too. what you think?

@jkotas
Copy link
Member

jkotas commented Sep 9, 2020

@jkotas I believe the changes is reasonable to accept from perf point of view too. what you think?

Agree.

@tarekgh tarekgh merged commit f1f7681 into dotnet:master Sep 9, 2020
@tarekgh
Copy link
Member

tarekgh commented Sep 9, 2020

@Wraith2 thanks a lot for doing this change and accommodating the feedback too.

@Wraith2 Wraith2 deleted the update-XmlCharType branch September 9, 2020 08:13
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants