-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Add spec for the data section string literals feature #76139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9058e0f
5b4c80d
ec93409
5aeecdb
991b785
eaefe41
c7d9c9d
354274a
14693ac
bcd4c03
51f9b02
eb7b374
6f8d5bc
8b011c5
1af62e9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,253 @@ | ||
| # String literals in data section | ||
|
|
||
| This opt-in **experimental** Roslyn feature allows changing how string literals in C# programs are emitted into PE files (`.dll`/`.exe`). | ||
| By default, string literals are emitted to the UserString heap which is limited to [2^24 bytes](https://github.com/dotnet/roslyn/issues/9852). | ||
| When the limit is reached, the following compiler error is reported by Roslyn: | ||
|
|
||
| ``` | ||
| error CS8103: Combined length of user strings used by the program exceeds allowed limit. Try to decrease use of string literals. | ||
| ``` | ||
|
|
||
| By turning on [the feature flag](#configuration), string literals ([where possible](#eligible-string-literals)) are instead emitted as UTF-8 data into a different section of the PE file | ||
| which does not have the same limit. The emit format is similar to [explicit u8 string literals][u8-literals]. | ||
|
|
||
| The feature is currently implemented only for C#, not VB. | ||
|
|
||
| > [!WARNING] | ||
| > This feature is currently experimental and can be changed or removed at any time. | ||
|
|
||
| ## Configuration | ||
|
|
||
| The feature flag can take a non-negative integer threshold. | ||
| Only string literals whose length (number of characters, not bytes) is greater than the threshold are emitted using the utf8 encoding strategy. | ||
| If the flag is set, but no value or an unrecognized value is specified, the threshold defaults to 100. | ||
| Specifying 0 means all non-empty string literals are considered for the feature. | ||
|
|
||
| The feature is turned off if | ||
| - the feature flag is not specified, or | ||
| - the string `off` is provided as the feature flag value, or | ||
| - `null` is provided as the feature flag value (usually only possible using Roslyn APIs like `ParseOptions.WithFeatures`). | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| The feature flag can be specified on the command line like `/features:experimental-data-section-string-literals` or `/features:experimental-data-section-string-literals=20`, | ||
| or in a project file in a `<PropertyGroup>` like `<Features>$(Features);experimental-data-section-string-literals</Features>` or `<Features>$(Features);experimental-data-section-string-literals=20</Features>`. | ||
|
|
||
| > [!NOTE] | ||
| > This configuration is useful for experimenting (to see the impact of the threshold on runtime performance and IL size) | ||
| > and also allowing users to avoid a potential overhead of many type definitions generated if their program has many short string literals | ||
| > which do not need this new emit strategy because they can fit into the size-limited UserString heap. | ||
|
|
||
| ## Eligible string literals | ||
|
|
||
| A string literal is emitted with the utf8 encoding if and only if all of these are true: | ||
| - it can be encoded as UTF-8, i.e., the following call does not throw: | ||
| ```cs | ||
| new System.Text.Encoding.UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true).GetBytes(literal) | ||
| ``` | ||
| - its `Length` is greater than the threshold specified by [the feature flag](#configuration). | ||
|
|
||
| ## Emit | ||
|
|
||
| The standard string literal emit strategy emits `ldstr "string"`. | ||
| That puts the string literal on the UserString heap of the PE file. | ||
|
|
||
| The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead. | ||
|
|
||
| For every unique string literal, a unique internal static class is generated which: | ||
| - has name composed of `<S>` followed by a hex-encoded XXH128 hash of the string | ||
| (collisions [should not happen][xxh128] with XXH128 and so they aren't currently detected or reported, the behavior in that case is undefined), | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this a security bug that the collision is undetected? The collisions can happen with some very small but non-zero probability. If a program is unlucky to hit the collision, the string literals will be messed up and the behavior of the program will be impacted in interesting ways. This can be used by adversaries to contribute an innocently looking string to an open-source project where the string is carefully crafted to cause a collision with some other critical string. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are a few places in the compiler where we use hashes and collisions could lead to program behavior changes. These are part of our security reviews in the past and this will be included in our next one. Thus far nothing has risen to the level of forcing extra checking. There is an increase in non-cryptographic hashing support though and I will make sure that is included in the next security review. @RikkiGibson There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, XXH128 is non-cryptographic hash, so one should assume that it is easy to find collisions. If we are going to keep the current implementation for this feature, I would be curious to know how we are justifying that it is ok. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Understand the concern here. But you can create similar collisions today at much lower level places in the compiler by creating collisions with file contents. The use of non-crypto hashes was approved at that layer too (was in fact suggested by security team). The general reasoning is that once you can control code being submitted to the project you have lots of avenues for subverting expectations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In the open-source world, it is important that the project maintainers are able to review the changes with confidence. For example, let's say that this feature gets enabled for some part of OpenTelemetry project and somebody submits a PR that includes a code like this: How can the maintainer of the OpenTelemetry project tell that this change is actually planting a backdoor into an unrelated code due to a hash collision? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. xxHash128 is also used to identify source documents in the IDE currently, as well as to identify interceptable calls in source code. We consider it collision-safe for these purposes. See also the Examples at the bottom of this doc: https://web.archive.org/web/20240223124645/https://fossies.org/linux/xxHash/tests/collisions/README.md @CyrusNajmabadi in case you have further thoughts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I would not be worried about the IDE case from the security angle. If somebody crafts a file name with a hash collision and it makes the IDE to misbehave or crash, it is not a big deal. (You may be scratching your head for a while if you are tasked with investigating the crash report.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I think it should be straightforward to add collision detection, it just didn't seem straightforward to test that, but I can look into that more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually I think hash collisions might not be a problem - #77061 (comment) |
||
| - is nested in the `<PrivateImplementationDetails>` type to avoid polluting the global namespace | ||
| and to avoid having to enforce name uniqueness across modules, | ||
| - has one internal static readonly `string` field which is initialized in a static constructor of the class, | ||
| - is marked `beforefieldinit` so the static constructor can be called eagerly if deemed better by the runtime for some reason. | ||
|
|
||
| There is also an internal static readonly `.data` field generated into `<PrivateImplementationDetails>` containing the actual bytes, | ||
| similar to [u8 string literals][u8-literals] and [constant array initializers][constant-array-init]. | ||
| These other scenarios might also reuse the data field, e.g., the following statements could all reuse the same data field: | ||
|
|
||
| ```cs | ||
| ReadOnlySpan<byte> a = new byte[6] { 72, 101, 108, 108, 111, 46 }; | ||
| ReadOnlySpan<byte> b = stackalloc byte[6] { 72, 101, 108, 108, 111, 46 }; | ||
| ReadOnlySpan<byte> c = "Hello."u8; | ||
| string d = "Hello."; // assuming this string literal is eligible for the `ldsfld` emit strategy | ||
| ``` | ||
|
|
||
| The initialization calls `<PrivateImplementationDetails>.BytesToString` helper which in turn calls `Encoding.UTF8.GetBytes`. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| This is an optimization so each of the generated static constructors is slightly smaller in IL size. | ||
| These size savings can add up since one class is generated per one eligible string literal. | ||
| A compile-time error is reported if the `Encoding.UTF8.GetBytes` API is not available and needs to be used | ||
| (the user can then either ensure the API is available or turn off the feature flag). | ||
|
|
||
| The following example demonstrates the code generated for string literal `"Hello."`. | ||
|
|
||
| ```cs | ||
| using System.Runtime.CompilerServices; | ||
| using System.Runtime.InteropServices; | ||
| using System.Text; | ||
|
|
||
| [CompilerGenerated] | ||
| internal static class <PrivateImplementationDetails> | ||
| { | ||
| internal static readonly __StaticArrayInitTypeSize=6 2BBCE396D68A9BBD2517F6B7064666C772694CD92010887BA379E10E7CDDA960 = /* IL: data(48 65 6C 6C 6F 2E) */; | ||
|
|
||
| [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 6)] | ||
| internal struct __StaticArrayInitTypeSize=6 | ||
| { | ||
| } | ||
|
|
||
| internal static class <S>F6D59F18520336E6E57244B792249482 | ||
| { | ||
| internal static readonly string s; | ||
|
|
||
| unsafe static <S>F6D59F18520336E6E57244B792249482() | ||
| { | ||
| s = <PrivateImplementationDetails>.BytesToString((byte*)Unsafe.AsPointer(ref <PrivateImplementationDetails>.2BBCE396D68A9BBD2517F6B7064666C772694CD92010887BA379E10E7CDDA960), 6); | ||
| } | ||
| } | ||
|
|
||
| private unsafe static string BytesToString(byte* bytes, int length) | ||
| { | ||
| return Encoding.UTF8.GetString(bytes, length); | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ### ILVerify and PEVerify | ||
|
|
||
| The actual generated code does not call `(byte*)Unsafe.AsPointer`. | ||
| It simply emits `ldsflda` of the data field. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| The same approach is taken by other scenarios like [u8 string literals][u8-literals] and [constant array initializers][constant-array-init]. | ||
| That works but does not pass IL/PE verification. | ||
|
|
||
| ### Ref assemblies | ||
|
|
||
| The synthesized types are not part of ref assemblies. That makes them smaller | ||
| and incremental compilation is faster because it does not need to recompile dependent projects when only string literal contents change. | ||
|
|
||
| This is automatically implemented, because during metadata-only compilation, method bodies are not emitted, | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| and the synthesized types used by this feature are only emitted lazily as part of generating method body instructions. | ||
|
|
||
| ## Diagnostics | ||
|
|
||
| The error CS8103 (Combined length of user strings used by the program exceeds allowed limit) is updated to suggest the feature flag, | ||
| albeit with a disclaimer during the experimental phase of the feature. | ||
|
|
||
| ## Runtime | ||
|
|
||
| Throughput of `ldstr` vs `ldsfld` is very similar (both result in one or two move instructions). | ||
|
|
||
| In the `ldsfld` emit strategy, the `string` instances won't ever be collected by the GC once the generated class is initialized. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| `ldstr` has similar behavior, but there are some optimizations in the runtime around `ldstr`, | ||
| e.g., they are loaded into a different frozen heap so machine codegen can be more efficient (no need to worry about pointer moves). | ||
|
|
||
| Generating new types by the compiler means more type loads and hence runtime impact, | ||
| e.g., startup performance and the overhead of keeping track of these types. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| The generated types are returned from reflection like `Assembly.GetTypes()` | ||
| which might impact the performance of Dependency Injection and similar systems. | ||
|
|
||
| ## Implementation | ||
|
|
||
| `CodeGenerator` obtains [configuration of the feature flag](#configuration) from `Compilation` passed to its constructor. | ||
| `CodeGenerator.EmitConstantExpression` switches to the utf8 encoding strategy for [eligible string literals](#eligible-string-literals). | ||
| That uses the following API responsible for synthesizing the `<S>` class and returning reference to its `string` field: | ||
|
|
||
| ```cs | ||
| IFieldReference ITokenDeferral.GetFieldForDataString(ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics) | ||
| ``` | ||
|
|
||
| Similar to `ITokenDeferral.GetFieldForData`, this calls into `PrivateImplementationDetails` | ||
| which is responsible for synthesizing the classes and helpers in a thread-safe and deterministic manner. | ||
|
|
||
| There are implementations of various `Cci` interfaces which represent the metadata, | ||
| rooted in a `DataStringHolder` which is the `INestedTypeDefinition` for the `<S>` class. | ||
|
|
||
| These are implemented in `Microsoft.CodeAnalysis`. | ||
| The alternative (used by inline arrays, for example) | ||
| would be to implement the `Symbol`s in the language-specific projects (and then use the standard `Cci` adapters), | ||
| but that seems to require similar amount of implemented abstract properties/methods of the `Symbol` | ||
| as the implementations of `Cci` interfaces require. | ||
| But implementing `Cci` directly allows us to reuse the same implementation for VB if needed in the future. | ||
|
|
||
| ## Future work | ||
|
|
||
| ### Edit and Continue | ||
|
|
||
| Hot reload currently does not support `.data` field replacement. | ||
| That can be implemented in the future. | ||
|
|
||
| ### AOT | ||
|
|
||
| Ahead-of-time compilation tools would need to be updated to recognize this new pattern of string literal use in place of `ldstr`. | ||
|
|
||
| ### Automatic threshold | ||
|
|
||
| The threshold could be determined automatically with some objective, for example, | ||
| use the utf8 encoding emit strategy for the lowest number of string literals necessary to avoid overflowing the UserString heap. | ||
|
|
||
| The set of string literals is not known up front in the compiler, it is discovered lazily (and in parallel) by the emit layer. | ||
| However, we could continue emitting `ldstr` instructions and fix them up in a separate phase after we have seen all the string literals | ||
| (and hence can determine the automatic threshold). | ||
| The `ldstr` and `ldsfld` instructions have the same size, so fixup would be a straightforward replace. | ||
| This fixup phase already exists in the compiler in `MetadataWriter.WriteInstructions` | ||
| (it's also how the string literals are emitted into the UserString heap). | ||
| It is called from `SerializeMethodBodies` which precedes `PopulateSystemTables` call, | ||
| hence synthesizing the utf8 string classes in the former should be possible and they would be emitted in the latter. | ||
|
|
||
| ### Statistics | ||
|
|
||
| The compiler could emit an info diagnostic with useful statistics for customers to determine what threshold to set. | ||
| For example, a histogram of string literal lengths encountered while emitting the program. | ||
| Tools like ILSpy can already be used to obtain some related data, e.g., count, length, and contents of string literals in the UserString heap. | ||
|
|
||
| ### Single blob | ||
|
|
||
| We could merge all UTF-8 byte sequences for affected strings into a single blob - a virtual User-String heap. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| The sequences can be deduped as we dedupe entries in User-String heap. | ||
| We can even attempt to do some vary basic compression, | ||
| for example, in the name heap that we generate (contains type and member names, etc.), names with shared content can overlap. | ||
| We would generate a single `__StaticArrayInitTypeSize=*` structure for the entire virtual heap blob and | ||
| add a single `.data` field to `<PrivateImplementationDetails>` that points to the blob. | ||
| At runtime, we would do an offset to where the required data reside in the blob and decode the required length from UTF-8 to UTF-16. | ||
|
|
||
| ## Alternatives | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| ### Configuration/emit granularity | ||
|
|
||
| Instead of one global feature flag, the emit strategy could be controlled using compiler-recognized attributes (applicable to assemblies or classes). | ||
| Furthermore, we could emit more than one string per one class. That could be configurable as well. | ||
|
|
||
| One interesting strategy would be grouping strings which occur in one user-defined type into one compiler-generated `<S>` type. | ||
| The idea is that strings from one class are likely used "together" so there is no performance impact from initializing them all at once and we save on metadata. | ||
|
|
||
| ### GC | ||
|
|
||
| To avoid rooting the `string` references forever, we could turn the fields into `WeakReference<string>`s. | ||
jjonescz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| Or we could avoid the caching altogether (each eligible `ldstr` would be replaced with a direct call to `Encoding.UTF8.GetString`). | ||
| This could be configurable as well. | ||
|
|
||
| ### Runtime support | ||
|
|
||
| We could emit the strings to data fields similarly but we would not synthesize the `string` fields and static constructors. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should make this happen if you plan to advertise this feature for broader use, and not just as a workaround for the few folks who run into the limit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The plan is to use the current state of implementation, compiler only trick, to get feedback on the scenarios. We did a number of benchmarks on the work and didn't see meaningful differences that justified preemptively doing runtime work. Decided that we would ship it out as experimental, get real world feedback from teams like Bing, and then let that drive what we wanted the final state of the feature to be before we remove experimental lable |
||
| Instead, at the place of `ldstr`, we would directly call a runtime-provided helper, passing a pointer to the data field. | ||
| That runtime helper would be an intrinsic recognized by the JIT (with a fallback calling `Encoding.UTF8.GetString`). | ||
| The runtime would then be responsible for loading and caching the strings, perhaps similarly to how `ldstr` works (using a frozen GC heap). | ||
|
|
||
| ### Custom lazy initialization | ||
|
|
||
| All the data and string fields could be part of one class, e.g., `<PrivateImplementationDetails>`, | ||
| and the compiler could emit custom lazy initialization instead of relying on static constructors, something like: | ||
|
|
||
| ```cs | ||
| static class <PrivateImplementationDetails> | ||
| { | ||
| private static string _str1; | ||
| internal static string Str1 => _str1 ??= Load(str1DataSection); | ||
| } | ||
| ``` | ||
|
|
||
| However, that would likely result in worse machine code due to more branches and function calls. | ||
|
|
||
| <!-- links --> | ||
| [u8-literals]: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-11.0/utf8-string-literals | ||
| [constant-array-init]: https://github.com/dotnet/roslyn/pull/24621 | ||
| [xxh128]: https://github.com/Cyan4973/xxHash/blob/86f6400a2a14ea7123ada691c89faf1d2a6a126f/tests/collisions/README.md | ||
Uh oh!
There was an error while loading. Please reload this page.