From b2ed86cef8ed9daa74a4c5180e258020c95eec9d Mon Sep 17 00:00:00 2001 From: Jan Jones Date: Wed, 5 Feb 2025 16:28:58 +0100 Subject: [PATCH 1/4] Update data section string literals spec --- docs/features/string-literals-data-section.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/docs/features/string-literals-data-section.md b/docs/features/string-literals-data-section.md index 6d61a39a0cf14..46c6352da3c00 100644 --- a/docs/features/string-literals-data-section.md +++ b/docs/features/string-literals-data-section.md @@ -136,7 +136,8 @@ albeit with a disclaimer during the experimental phase of the feature. Throughput of `ldstr` vs `ldsfld` is very similar (both result in one or two move instructions). In the `ldsfld` emit strategy, the `string` instances won't ever be collected by the GC once the generated class is initialized. -`ldstr` has similar behavior, but there are some optimizations in the runtime around `ldstr`, +`ldstr` has similar behavior (GC does not collect the string literals either until the assembly is unloaded), +but there are some optimizations in the runtime around `ldstr`, e.g., they are loaded into a different frozen heap so machine codegen can be more efficient (no need to worry about pointer moves). Generating new types by the compiler means more type loads and hence runtime impact, @@ -168,7 +169,7 @@ but that seems to require similar amount of implemented abstract properties/meth as the implementations of `Cci` interfaces require. But implementing `Cci` directly allows us to reuse the same implementation for VB if needed in the future. -## Future work +## Future work and alternatives ### Edit and Continue @@ -209,7 +210,7 @@ We would generate a single `__StaticArrayInitTypeSize=*` structure for the entir add a single `.data` field to `` that points to the blob. At runtime, we would do an offset to where the required data reside in the blob and decode the required length from UTF-8 to UTF-16. 
-## Alternatives +However, this would be unfriendly to IL trimming. ### Configuration/emit granularity @@ -221,7 +222,8 @@ The idea is that strings from one class are likely used "together" so there is n ### GC -To avoid rooting the `string` references forever, we could turn the fields into `WeakReference`s. +To avoid rooting the `string` references forever, we could turn the fields into `WeakReference`s +(note that this would be quite expensive for both direct overhead and indirectly for the GC due to longer GC pause times). Or we could avoid the caching altogether (each eligible `ldstr` would be replaced with a direct call to `Encoding.UTF8.GetString`). This could be configurable as well. @@ -247,6 +249,12 @@ static class However, that would likely result in worse machine code due to more branches and function calls. +### String interning + +The compiler should report a diagnostic when the feature is enabled together with +`[assembly: System.Runtime.CompilerServices.CompilationRelaxations(0)]`, i.e., string interning enabled, +because that is incompatible with the feature. 
+ [u8-literals]: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-11.0/utf8-string-literals [constant-array-init]: https://github.com/dotnet/roslyn/pull/24621 From 18ba63898cb6bdce36bb13dbe4058163f9186cff Mon Sep 17 00:00:00 2001 From: Jan Jones Date: Thu, 6 Feb 2025 09:18:53 +0100 Subject: [PATCH 2/4] Add benchmark results --- docs/features/string-literals-data-section.md | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/docs/features/string-literals-data-section.md b/docs/features/string-literals-data-section.md index 46c6352da3c00..f11415c136866 100644 --- a/docs/features/string-literals-data-section.md +++ b/docs/features/string-literals-data-section.md @@ -142,10 +142,36 @@ e.g., they are loaded into a different frozen heap so machine codegen can be mor Generating new types by the compiler means more type loads and hence runtime impact, e.g., startup performance and the overhead of keeping track of these types. +On the other hand, the PE size might be smaller due to UTF-8 vs UTF-16 encoding, +which can result in memory savings since the binary is also loaded to memory by the runtime. +See [below](#runtime-overhead-benchmark) for a more detailed analysis. The generated types are returned from reflection like `Assembly.GetTypes()` which might impact the performance of Dependency Injection and similar systems. +### Runtime overhead benchmark + +| [cost per string literal](https://github.com/jkotas/stringliteralperf) | feature on | feature off | +| --- | --- | --- | +| bytes | 1037 | 550 | +| microseconds | 20.3 | 3.1 | + +The benchmark results above [show](https://github.com/dotnet/roslyn/pull/76139#discussion_r1944144978) +that the runtime overhead of this feature per 100 char string literal +is ~500 bytes of working set memory (~2x of regular string literal) +and ~17 microseconds of startup time (~7x of regular string literal). + +The startup time overhead does not depend on the length of the string literal. 
+It is the cost of the type loads and JITing the static constructor. + +The working set has two components: private working set (r/w pages) and non-private working set (r/o pages backed by the binary). +The private working set overhead (~500 bytes) does not depend on the length of the string literal. +Again, it is the cost of the type loads and the static constructor code. +Non-private working set is reduced by this feature since the binary is smaller. +Once the string literal is about 600 characters, +the private working set overhead and non-private working set improvement will break even. +For string literals longer than 600 characters, this feature is total working set improvement. + ## Implementation `CodeGenerator` obtains [configuration of the feature flag](#configuration) from `Compilation` passed to its constructor. From 01e893e84cdb7a23ff221256805e2220e080987b Mon Sep 17 00:00:00 2001 From: Jan Jones Date: Thu, 6 Feb 2025 16:10:23 +0100 Subject: [PATCH 3/4] Clarify 600 bytes --- docs/features/string-literals-data-section.md | 23 ++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/docs/features/string-literals-data-section.md b/docs/features/string-literals-data-section.md index f11415c136866..37d7910a6bc7f 100644 --- a/docs/features/string-literals-data-section.md +++ b/docs/features/string-literals-data-section.md @@ -165,13 +165,34 @@ The startup time overhead does not depend on the length of the string literal. It is the cost of the type loads and JITing the static constructor. The working set has two components: private working set (r/w pages) and non-private working set (r/o pages backed by the binary). -The private working set overhead (~500 bytes) does not depend on the length of the string literal. +The private working set overhead (~600 bytes) does not depend on the length of the string literal. Again, it is the cost of the type loads and the static constructor code. 
Non-private working set is reduced by this feature since the binary is smaller. Once the string literal is about 600 characters, the private working set overhead and non-private working set improvement will break even. For string literals longer than 600 characters, this feature is total working set improvement. +
+Why 600 bytes? + +When the feature is off, ~550 bytes cost of 100 char string literal is composed from: +- The string in the binary (~200 bytes). +- The string allocated on the GC heap (~200 bytes). +- Fixed overheads: metadata encoding, runtime hashtable of all allocated string literals, code that referenced the string in the benchmark (~150 bytes). + +When the feature is on, ~1050 bytes. cost of 100 char string literal is composed from: +- The string in the binary (~100 bytes). +- The string allocated on the GC heap (~200 bytes). +- Fixed overheads: metadata encoding, the extra types, code that referenced the string in the benchmark (~750 bytes). + +750 - 150 = 600. Vast majority of it are the extra types. + +A bit of the extra fixed overheads when the feature is probably in the non-private working set. +It is difficult to measure it since there is no managed API to get private vs. non-private working set. +It does not impact the estimate of the break-even point for the total working set. + +
+ ## Implementation `CodeGenerator` obtains [configuration of the feature flag](#configuration) from `Compilation` passed to its constructor. From ea109f4674304adfcaec2acbae090591243f5f14 Mon Sep 17 00:00:00 2001 From: Jan Jones Date: Thu, 6 Feb 2025 17:19:45 +0100 Subject: [PATCH 4/4] Fix typos Co-authored-by: Jan Kotas --- docs/features/string-literals-data-section.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/features/string-literals-data-section.md b/docs/features/string-literals-data-section.md index 37d7910a6bc7f..133416a2cfa36 100644 --- a/docs/features/string-literals-data-section.md +++ b/docs/features/string-literals-data-section.md @@ -180,14 +180,14 @@ When the feature is off, ~550 bytes cost of 100 char string literal is composed - The string allocated on the GC heap (~200 bytes). - Fixed overheads: metadata encoding, runtime hashtable of all allocated string literals, code that referenced the string in the benchmark (~150 bytes). -When the feature is on, ~1050 bytes. cost of 100 char string literal is composed from: +When the feature is on, ~1050 bytes cost of 100 char string literal is composed from: - The string in the binary (~100 bytes). - The string allocated on the GC heap (~200 bytes). - Fixed overheads: metadata encoding, the extra types, code that referenced the string in the benchmark (~750 bytes). 750 - 150 = 600. Vast majority of it are the extra types. -A bit of the extra fixed overheads when the feature is probably in the non-private working set. +A bit of the extra fixed overheads with the feature on is probably in the non-private working set. It is difficult to measure it since there is no managed API to get private vs. non-private working set. It does not impact the estimate of the break-even point for the total working set.