Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boost performance of localized, non-static or incremental string formatting. #50330

Closed
Tracked by #79053
geeknoid opened this issue Mar 28, 2021 · 39 comments · Fixed by #80753
Closed
Tracked by #79053

Boost performance of localized, non-static or incremental string formatting. #50330

geeknoid opened this issue Mar 28, 2021 · 39 comments · Fixed by #80753
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Milestone

Comments

@geeknoid
Copy link
Member

geeknoid commented Mar 28, 2021

EDITED by @stephentoub 1/13/2023:
Latest API proposal at #50330 (comment)


Background and Motivation

The strides in .NET around string formatting over the last few years have been significant, and .NET 6 with C# 10 promise further substantial improvements to string interpolation performance. However, string interpolation has a serious constraint: it only works when the format strings are known at compilation time.

For any scenario where the text to format is not known at compilation time, such as for localized content being pulled from a resource file, then classic String.Format and StringBuilder.AppendFormat remain the workhorses.

This proposal is designed to deliver substantial performance improvements to classic composite string formatting by hoisting the expensive process of parsing the composite format string to initialization time, enabling the hot path of performing string formatting to execute up to 5x faster than using classic functions.

The model is logically similar to how you can compile a regex a priori and use it repeatedly to seek matches. Here, you parse and process the format string once, and then use it to repeatedly format strings.

Please see here for a fully functional implementation, with complete test coverage and a banchmark.

Proposed API

The primary API surface proposed here is fairly minimal. The CompositeFormat type has a primary constructor, then various overloads for Format and TryFormat.

/// <summary>
/// Provides highly efficient string formatting functionality.
/// </summary>
/// <remarks>
/// This type lets you optimize string formatting operations common with the <see cref="string.Format(string,object?)"  />
/// method. This is useful for any situation where you need to repeatedly format the same string with
/// different arguments.
///
/// This type works faster than <c>string.Format</c> because it parses the composite format string only once when
/// the instance is created, rather than doing it for every formatting operation.
///
/// You first create an instance of this type, passing the composite format string that you intend to use.
/// Once the instance is created, you call the <see cref="Format{T}(IFormatProvider?,T)"/> method with arguments to use in the
/// format operation.
/// </remarks>
public readonly struct CompositeFormat
{
    /// <summary>
    /// Initializes a new instance of the <see cref="CompositeFormat"/> struct.
    /// </summary>
    /// <param name="format">A classic .NET format string as used with <see cref="string.Format(string,object?)"  />.</param>
    /// <remarks>
    /// Parses a composite format string into an efficient form for later use.
    /// </remarks>
    public CompositeFormat(ReadOnlySpan<char> format);

    /// <summary>
    /// Initializes a new instance of the <see cref="CompositeFormat"/> struct.
    /// </summary>
    /// <param name="format">A template-based .NET format string as used with <c>LoggerMessage.Define</c>.</param>
    /// <param name="templates">Holds the named templates discovered in the format string.</param>
    /// <remarks>
    /// Parses a composite format string into an efficient form for later use.
    /// </remarks>
    public CompositeFormat(ReadOnlySpan<char> format, out IList<string> templates)

    /// <summary>
    /// Formats a string with a single argument.
    /// </summary>
    /// <typeparam name="T">Type of the single argument.</typeparam>
    /// <param name="arg">An argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T>(T arg);

    /// <summary>
    /// Formats a string with a single argument.
    /// </summary>
    /// <typeparam name="T">Type of the single argument.</typeparam>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg">An argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T>(IFormatProvider? provider, T arg);

    /// <summary>
    /// Formats a string with two arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1>(T0 arg0, T1 arg1);

    /// <summary>
    /// Formats a string with two arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1>(IFormatProvider? provider, T0 arg0, T1 arg1);

    /// <summary>
    /// Formats a string with three arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1, T2>(T0 arg0, T1 arg1, T2 arg2);

    /// <summary>
    /// Formats a string with three arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1, T2>(IFormatProvider? provider, T0 arg0, T1 arg1, T2 arg2);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <param name="args">Additional arguments to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1, T2>(T0 arg0, T1 arg1, T2 arg2, params object?[]? args);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <param name="args">Additional arguments to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format<T0, T1, T2>(IFormatProvider? provider, T0 arg0, T1 arg1, T2 arg2, params object?[]? args);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <param name="args">Arguments to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format(params object?[]? args);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="args">Arguments to use in the formatting operation.</param>
    /// <returns>The formatted string.</returns>
    public string Format(IFormatProvider? provider, params object?[]? args);

    /// <summary>
    /// Formats a string with one argument.
    /// </summary>
    /// <typeparam name="T">Type of the single argument.</typeparam>
    /// <param name="destination">Where to write the resulting string.</param>
    /// <param name="charsWritten">The number of characters actually written to the destination span.</param>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg">An argument to use in the formatting operation.</param>
    /// <returns>True if there was enough room in teh destination span for the resulting string.</returns>
    public bool TryFormat<T>(Span<char> destination, out int charsWritten, IFormatProvider? provider, T arg);

    /// <summary>
    /// Formats a string with two arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <param name="destination">Where to write the resulting string.</param>
    /// <param name="charsWritten">The number of characters actually written to the destination span.</param>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <returns>True if there was enough room in teh destination span for the resulting string.</returns>
    public bool TryFormat<T0, T1>(Span<char> destination, out int charsWritten, IFormatProvider? provider, T0 arg0, T1 arg1);

    /// <summary>
    /// Formats a string with three arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="destination">Where to write the resulting string.</param>
    /// <param name="charsWritten">The number of characters actually written to the destination span.</param>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <returns>True if there was enough room in teh destination span for the resulting string.</returns>
    public bool TryFormat<T0, T1, T2>(Span<char> destination, out int charsWritten, IFormatProvider? provider, T0 arg0, T1 arg1, T2 arg2);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <typeparam name="T0">Type of the first argument.</typeparam>
    /// <typeparam name="T1">Type of the second argument.</typeparam>
    /// <typeparam name="T2">Type of the third argument.</typeparam>
    /// <param name="destination">Where to write the resulting string.</param>
    /// <param name="charsWritten">The number of characters actually written to the destination span.</param>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="arg0">First argument to use in the formatting operation.</param>
    /// <param name="arg1">Second argument to use in the formatting operation.</param>
    /// <param name="arg2">Third argument to use in the formatting operation.</param>
    /// <param name="args">Additional arguments to use in the formatting operation.</param>
    /// <returns>True if there was enough room in teh destination span for the resulting string.</returns>
    public bool TryFormat<T0, T1, T2>(Span<char> destination, out int charsWritten, IFormatProvider? provider, T0 arg0, T1 arg1, T2 arg2, params object?[]? args);

    /// <summary>
    /// Formats a string with arguments.
    /// </summary>
    /// <param name="destination">Where to write the resulting string.</param>
    /// <param name="charsWritten">The number of characters actually written to the destination span.</param>
    /// <param name="provider">An optional format provider that provides formatting functionality for individual arguments.</param>
    /// <param name="args">Arguments to use in the formatting operation.</param>
    /// <returns>True if there was enough room in teh destination span for the resulting string.</returns>
    public bool TryFormat(Span<char> destination, out int charsWritten, IFormatProvider? provider, params object?[]? args);

    /// <summary>
    /// Gets the number of arguments required in order to produce a string with this instance.
    /// </summary>
    public int NumArgumentsNeeded { get; }
}

In addition to the feature set shown above, it's also desirable to add extension methods to the StringBuilder which leverage the same model for enhanced formatting performance in that context. See here for the prototype API for these.

Usage Examples

// preprocess the format string statically at startup
private static readonly _cf = new CompositeFormat(MyResources.MyHelloFormatString);

public void Foo(string name)
{
    // format the string
    var str = _cf.Format(name);    // logically equivalent to String.Format(MyResources.MyHelloFormatString, name);
    Console.WriteLine(str);

    str = new StringBuilder().AppendFormat(_cf, name).ToString();
    Console.WriteLine(str);
}

Alternative Designs

Rather than implementing the Format/TryFormat methods as instance methods on the CompositeFormat type, they could be integrated directly into the String type as first class methods. This would provide a clean model. Similarly, the extensions methods for StringBuilder could be baked in directly to the StringBuilder type.

Using such an approach doesn't yield additional performance, it's really just a matter of how the API is suraced.

Benchmarks

Here are benchmarks for the current prototype implementation:

|                  Method |      Mean |     Error |    StdDev |     Gen 0 |     Gen 1 |    Gen 2 |  Allocated |
|------------------------ |----------:|----------:|----------:|----------:|----------:|---------:|-----------:|
|     ClassicStringFormat | 45.036 ms | 26.859 ms | 1.4723 ms | 3083.3333 | 1333.3333 | 416.6667 | 16800613 B |
|           Interpolation | 43.340 ms |  3.324 ms | 0.1822 ms | 3000.0000 | 1187.5000 | 312.5000 | 16799929 B |
|           StringBuilder | 51.144 ms |  3.827 ms | 0.2098 ms | 2900.0000 | 1100.0000 | 200.0000 | 16799922 B |
|         CompositeFormat | 26.178 ms | 16.042 ms | 0.8793 ms | 2062.5000 |  781.2500 | 156.2500 | 11999929 B |
| CompositeFormatWithSpan | 10.382 ms |  1.172 ms | 0.0642 ms |         - |         - |        - |        2 B |
|             StringMaker |  9.573 ms |  1.488 ms | 0.0816 ms | 5546.8750 |         - |        - | 23199601 B |
|     StringMakerWithSpan |  8.550 ms |  6.963 ms | 0.3817 ms | 2671.8750 |         - |        - | 11199680 B |

Risks

None that I'm aware.

Implementation Notes

The prototype implementation I linked to above requires .NET Standard 2.1 as a baseline.

The implementation as presented depends on an internal StringMaker type, which is a highly efficient replacement for StringBuilder packaged as a ref readonly struct. It would be easy to reimplement this code on top of equivalent data types being created to support the C# compiler's enhanced interpolation features in C# 10.

@geeknoid geeknoid added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Mar 28, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 28, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@davidfowl
Copy link
Member

It occurs to me that this is very similar to

. Might be able to reuse it if it supported named holes...

@geeknoid
Copy link
Member Author

@davidfowl I think you could create a new constructor for StringFormatter that understands the named hole syntax and produces the same internal representation that drives the efficient formatting.

@geeknoid geeknoid changed the title Boost performance of localized string formatting Boost performance of localized and non-static string formatting Mar 28, 2021
@geeknoid geeknoid changed the title Boost performance of localized and non-static string formatting Boost performance of localized, non-static or incremental string formatting. Apr 11, 2021
@geeknoid
Copy link
Member Author

@davidfowl The prototype implementation has been updated to include support for the template-based model (a.k.a. named holes) model. It's just a different constructor for the CompositeFormat type, and the formatting operations then work the same as before. The new constructor returns an ordered set of template names. And unlike the semantics of LoggerMessage.Define, this version correctly handles having the same template repeated in the format string.

I believe integrating this functionality into the guts of Microsoft.Extensions.Logging.Abstractions would substantially improve the performance of the logging delegates returned by LoggerMessage.Define, thus reducing the overhead of application logging.

@davidfowl
Copy link
Member

I concur, there's also overlap with #14484.

@stephentoub the one thing not addressed by the improved string interpolation proposal is strings loaded at runtime https://github.com/dotnet/aspnetcore/blob/1870441efdc00a6c253c2a5c54c20a313fb56ee2/src/Servers/Kestrel/Core/src/Middleware/HttpsConnectionMiddleware.cs#L509-L552

Can we include this issue and #14484 into that epic? It seems like some of the assets from the string interpolation epic could be used as part of this.

@stephentoub
Copy link
Member

I'd be ok seeing something along those lines incorporated as an implementation detail into LoggerMessage.Define. I'm not excited right now at the idea of shipping such a CompositeFormat type publicly (in part for the same reasons we've been hesitant to expose generic overloads for string.Format itself, namely code size from generic instantiations... that ship has already sailed for LoggerMessage); even if we did, I expect you'd want any such optimization to be supported down level from .NET 6, and I don't think we'd ship such a type in its own OOB. If you did want to only target .NET 6, you could use InterpolatedStringBuilder as a building block if desired to actually build up the resulting string from the constituent components. (As an aside, #14484 isn't really an epic... it was a proposal for a specific set of APIs, which we largely declined, but the issue has stayed open as a discussion ground for alternatives.)

@davidfowl
Copy link
Member

I'd be ok seeing something along those lines incorporated as an implementation detail into LoggerMessage.Define. I'm not excited right now at the idea of shipping such a CompositeFormat type publicly (in part for the same reasons we've been hesitant to expose generic overloads for string.Format itself, namely code size from generic instantiations... that ship has already sailed for LoggerMessage); even if we did, I expect you'd want any such optimization to be supported down level from .NET 6, and I don't think we'd ship such a type in its own OOB.

That sounds reasonable but it feels like we're saying we don't need to improve the performance of runtime generated strings because of generic instantiation bloat. Re-reading #14484 makes me feel like we don't see any value in avoiding boxing in string.Format overloads because we're allocating a string anyways. Is that right?

If you did want to only target .NET 6, you could use InterpolatedStringBuilder as a building block if desired to actually build up the resulting string from the constituent components. (As an aside, #14484 isn't really an epic... it was a proposal for a specific set of APIs, which we largely declined, but the issue has stayed open as a discussion ground for alternatives.)

I do want to start cross compiling the Extensions assemblies to target ns2.0, ns2.1, latest .NET build.

(As an aside, #14484 isn't really an epic... it was a proposal for a specific set of APIs, which we largely declined, but the issue has stayed open as a discussion ground for alternatives.)

I meant the "improve interpolated strings" epic.

@geeknoid how do you feel about burying this logic into the LoggerMessage code? We can potentially remove a bunch of it in .NET 6 by using built ins.

@stephentoub
Copy link
Member

stephentoub commented Apr 12, 2021

we don't see any value in avoiding boxing in string.Format overloads because we're allocating a string anyways. Is that right?

With string.Format, there are other costs involved, with few benefits to such a change other than avoiding the boxing, and so you're trading a potentially huge growth in generic instantiations (every different combination of T1, T2, T3 where at least one is a value type getting its own copy) for avoiding just one of the costs involved, and a cost that can be dwarfed by the typically larger string allocation happening. It's one of the reasons I've pushed for the interpolated strings improvements the way we're doing it:

  • You avoid the boxing
  • You avoid the params array
  • You get generic specialization without combinatoric growth, and applying only to the code for formatting that type
  • You get parsing done at compile time rather than at startup or steady-state
  • You can support types that can't be boxed and can't be generic, e.g. ReadOnlySpan<char>
  • It applies to scenarios other than just creating strings, including zero-alloc formatting into a span, formatting into a StringBuilder, etc.

It does have the limitation that it can't be applied dynamically to format strings at runtime, and I would like to address that at some point, but I'd like to do so:

  • After we get more experience with how folks use interpolated strings and builders with C# 10
  • When we have a better sense of if/when/how ref structs will be usable with generics so it can be factored in
  • After reflecting on whether there's an approach that would work for localization as well but still at compile time

In the meantime, the two main places I see devs loading the same resource string to be processed many times in an app are:

  • logging that requires localization
  • exception messages that require localization

I'm not particularly concerned with reducing string formatting costs for the latter. I would like to for the former, and we can take a big swing at that now by incorporating such a preprocessing approach into LoggerMessage.Define under the covers. It already limits what types can be used (things that can be generics), it already has the generics explosion potential and so isn't introducing a new issue, its primary purpose is preprocessing, it's what ASP.NET uses, it focuses on a known problem space, etc., plus it won't encourage use of this for things like exception messages where the tradeoffs are likely to be wrong for the end developer who may not realize it. We can get experience from that and C# 10 interpolated strings, and see if/when/how more is desired in the future. In the meantime, I'd encourage Martin to release a nuget package with his CompositeFormat, so that anyone can use it, vote with their feet, and drive further learnings for future potential incorporation.

(That said, @jkotas had provided feedback on string.Format and boxing in that issue. I don't want to put words in his mouth, so he can comment here as well if he disagrees with my stance.)

@geeknoid
Copy link
Member Author

Searching through our code base, it's clear that the vast majority of uses of String.Format is around error handling. In fact, for our (service code) these strings aren't localized or customized and hence could be converted to string interpolation (someone should write a code fixer to automate converting code over).

However, there are also several cases where the format string comes from either a resource file or config. For example, we format some URLs and file paths that way. And these things are done in our request paths, not in error paths. I suspect that many user-facing code is in the same boat. I suspect that framework code is not representative of application-level code in this respect, application-level code will tend to format localized content as part of main business logic.

@stephentoub The use of generics in the Format API could be removed/reduced which would trigger a teeny more boxing, but would still retain most of the value that CompositeFormat provides. It factors out parsing of the format string so that the work is not done on every use. My gut says that no matter what syntax the compiler may or may not provide, in order to handle dynamic string formatting, you will end up with something like CompositeFormat to separate parsing the input string from the format operation. Anyway, I would be happy to simplify the API to reduce generic explosion in order to get the overall benefits of this model.

@davidfowl As I mentioned to you this weekend, it took a few minutes to create a prototype of LoggerMessage.Define which used this new string formatting code in its implementation. It was pretty straightforward and did improve performance. If we can't get CompositeFormat in as a first class concept in the framework, I'd be totally fine to exploit it as an implementation detail of the logging system.

@geeknoid
Copy link
Member Author

It's also worth considering StringBuilder.AppendFormat. With the extensions I've integrated into my code, AppendFormat can execute without any allocations in the happy path. Formatting occurs in an on-stack span which is then copied directly into the StringBuilder. Using interpolation there would produce a transient string (unless there's some new voodoo magic coming to address this in C# 10).

So barring new language features, my extensions to AppendFormat are likely the fastest way to format text into a string builder.

@stephentoub
Copy link
Member

StringBuilder.AppendFormat...Using interpolation there would produce a transient string

It won't.

unless there's some new voodoo magic coming to address this in C# 10

There is.

I'd be totally fine to exploit it as an implementation detail of the logging system.

Great. Let's start with that. Thanks!

@geeknoid
Copy link
Member Author

@stephentoub Will the new string interpolation functionality support formatting straight into a span? If so, how will that look syntax-wise?

@stephentoub
Copy link
Member

Will the new string interpolation functionality support formatting straight into a span? If so, how will that look syntax-wise?

Yes. We still need to decide on the method name (I currently have it implemented as "TryWrite"), but the syntax will look like:

bool success = span.TryWrite($"{a} = {b}", out int charsWritten);

or

bool success = span.TryWrite(provider, $"{a} = {b}", out int charsWritten);

@geeknoid
Copy link
Member Author

@stephentoub @davidfowl There is something I don't understand about generics explosion vs. string interpolation.

The cool new string interpolation model involves taking every interpolated string and expanding that inline into a series of function calls to create a span, copy literal chunks into it, format variables into it, and produce an output string. In other words, there is a custom span of IL and native code emitted for each individual interpolated string, regardless of the type of arguments it contains.

If I have a Format<T0, T1, T2> generic function, it can lead to the underlying IL being converted to native code many times, depending on the specific types of the generic arguments. All reference types are folded together, and multiple uses of the function which share the same generic types in the same order will end up using the same native code.

It appears to me that if the implementation of the Format<T0, T1, T2> function is small and quickly transitions to non-generic code (which my code does), or generic code with a lower arity, it could lead to considerably smaller IL and smaller native code too in the end as compared to equivalent C# 10- style string interpolation.

My point being that rejecting a format function that has 3 generics on the grounds that it may lead to generics explosion seems incongruous with the introduction of new-style string interpolation which will likely have a total process-level cost in IL and native code higher than it would an equivalent number of generic format calls would cost.

In other words, if the platform can bare string interpolation, it can bare format calls with generics.

Am I missing something?

@stephentoub
Copy link
Member

(I have other things I need to focus on today. I'll respond later today or tomorrow...)

@stephentoub
Copy link
Member

stephentoub commented Apr 15, 2021

To be clear, when I was commenting on generic specialization explosion, it was in the context of being asked about string.Format overloads that take generics, e.g. string.Format<T1, T2, T3>(string format, T1 arg1, T2 arg2, T3 arg3). It’s difficult to separate well there the parsing code from the formatting code without adding additional overheads, which then means the several-hundred line AppendFormatHelper routine that’s ~2.5K of asm would be duplicated for each unique combination that involves at least one value type. I called out that some of this concern carries over to CompositeFormat.

There are multiple concerns related to this, though. One concern is the raw asm size you end up with, which can be a problem for AOT. That asm also impacts JIT time for non-AOT or hybrid, and also working set (also, yes, if all of the generics in a given use are reference types than the asm will be shared, but there’s still overhead per combination in order to support that sharing). Working set in and of itself is a concern, of course, and that’s not just specific to IL/asm: any helper data structures we create and root in the assistance of a particular call site is effectively working set tied to that one call site. In this regard, I looked through LoggerMessage.Define usage in ASP.NET, and SR resource usage in Corelib. For the former, almost every log was consumed in one and only one use site, and for SR the vast majority were similarly used in only one place (a decent number were used in two, maybe three, and then there was a tail of a few that were used in many places). Point being that if you then need to declare a field and instantiate an object (which stores a lot more state under the covers) per call site, you’re effectively increasing the working set associated with that call site to include all of that stuff. Generic instantiations are just a piece of that.

Let’s look at string.Format as it’s defined today. It’s not generic, so there’s only one stamp of the asm, and so the working set overhead for a particular call site is the IL/asm for that call site. Let’s use this as an example:

public string StringFormat(int i, double d, Guid g)
{
    return string.Format("{0} {1} {2}", i, d, g);
}

The IL for that looks like:

IL_0000: ldstr "{0} {1} {2}"
IL_0005: ldarg.1
IL_0006: box [System.Private.CoreLib]System.Int32
IL_000b: ldarg.2
IL_000c: box [System.Private.CoreLib]System.Double
IL_0011: ldarg.3
IL_0012: box [System.Private.CoreLib]System.Guid
IL_0017: call string [System.Private.CoreLib]System.String::Format(string, object, object, object)
IL_001c: ret

and on .NET 5 the asm looks like:

    L0000: push rdi
    L0001: push rsi
    L0002: push rbx
    L0003: sub rsp, 0x40
    L0007: vzeroupper
    L000a: vxorps xmm4, xmm4, xmm4
    L000e: vmovdqa [rsp+0x20], xmm4
    L0014: vmovdqa [rsp+0x30], xmm4
    L001a: mov edi, edx
    L001c: mov rsi, r9
    L001f: vmovsd [rsp+0x70], xmm2
    L0025: mov rcx, 0x7ffc3e12b258
    L002f: call 0x00007ffc9dcda370
    L0034: mov rbx, rax
    L0037: mov [rbx+8], edi
    L003a: mov rcx, 0x7ffc3e12e688
    L0044: call 0x00007ffc9dcda370
    L0049: mov rdi, rax
    L004c: vmovsd xmm2, [rsp+0x70]
    L0052: vmovsd [rdi+8], xmm2
    L0057: mov rcx, 0x7ffc3e207848
    L0061: call 0x00007ffc9dcda370
    L0066: vmovdqu xmm0, [rsi]
    L006a: vmovdqu [rax+8], xmm0
    L006f: mov r8, 0x2a730bb1348
    L0079: mov r8, [r8]
    L007c: mov rdx, 0x2a630c19ed0
    L0086: mov rdx, [rdx]
    L0089: lea rcx, [rsp+0x20]
    L008e: mov [rcx], rbx
    L0091: mov [rcx+8], rdi
    L0095: mov [rcx+0x10], rax
    L0099: mov [rcx+0x18], r8
    L009d: lea r8, [rsp+0x20]
    L00a2: xor ecx, ecx
    L00a4: call System.String.FormatHelper(System.IFormatProvider, System.String, System.ParamsArray)
    L00a9: nop
    L00aa: add rsp, 0x40
    L00ae: pop rbx
    L00af: pop rsi
    L00b0: pop rdi
    L00b1: ret

so ~28 bytes of IL and ~177 bytes of asm. But that of course has all the deficiencies we’ve talked about:

  • Boxing of all arguments
  • You hit a perf cliff after 3 arguments (params array)
  • No devirtualization / inlining for value types
  • All parsing done each time it’s executed
  • No support for ref structs like ReadOnlySpan

You asked about C# 10 interpolation. The compiler generated code for the same example as above would look like:

var b = InterpolatedStringBuilder.Create(2, 3);
b.AppendFormatted(i);
b.AppendLiteral(" ");
b.AppendFormatted(d);
b.AppendLiteral(" ");
b.AppendFormatted(g);
return b.ToStringAndClear();

which results in the IL:

IL_0000: ldc.i4.2
IL_0001: ldc.i4.3
IL_0002: call valuetype System.Runtime.CompilerServices.InterpolatedStringBuilder System.Runtime.CompilerServices.InterpolatedStringBuilder::Create(int32, int32)
IL_0007: stloc.0
IL_0008: ldloca.s 0
IL_000a: ldarg.1
IL_000b: call instance void System.Runtime.CompilerServices.InterpolatedStringBuilder::AppendFormatted<int32>(!!0)
IL_0010: ldloca.s 0
IL_0012: ldstr " "
IL_0017: call instance void System.Runtime.CompilerServices.InterpolatedStringBuilder::AppendLiteral(string)
IL_001c: ldloca.s 0
IL_001e: ldarg.2
IL_001f: call instance void System.Runtime.CompilerServices.InterpolatedStringBuilder::AppendFormatted<float64>(!!0)
IL_0024: ldloca.s 0
IL_0026: ldstr " "
IL_002b: call instance void System.Runtime.CompilerServices.InterpolatedStringBuilder::AppendLiteral(string)
IL_0030: ldloca.s 0
IL_0032: ldarg.3
IL_0033: call instance void System.Runtime.CompilerServices.InterpolatedStringBuilder::AppendFormatted<valuetype [System.Private.CoreLib]System.Guid>(!!0)
IL_0038: ldloca.s 0
IL_003a: call instance string System.Runtime.CompilerServices.InterpolatedStringBuilder::ToStringAndClear()
IL_003f: ret

and the asm:

    L0000: push rdi
    L0001: push rsi
    L0002: sub rsp, 0x58
    L0006: vzeroupper
    L0009: xor eax, eax
    L000b: mov [rsp+0x28], rax
    L0010: vxorps xmm4, xmm4, xmm4
    L0014: vmovdqa [rsp+0x30], xmm4
    L001a: vmovdqa [rsp+0x40], xmm4
    L0020: mov [rsp+0x50], rax
    L0025: mov edi, edx
    L0027: mov rsi, r9
    L002a: vmovsd [rsp+0x80], xmm2
    L0033: lea rcx, [rsp+0x28]
    L0038: mov edx, 0x23
    L003d: call System.Runtime.CompilerServices.InterpolatedStringBuilder..ctor(Int32)
    L0042: lea rcx, [rsp+0x28]
    L0047: mov edx, edi
    L0049: call System.Runtime.CompilerServices.InterpolatedStringBuilder.AppendFormatted[[System.Int32, System.Private.CoreLib]](Int32)
    L004e: mov rdx, 0x2a730bc52f0
    L0058: mov rdx, [rdx]
    L005b: lea rcx, [rsp+0x28]
    L0060: call System.Runtime.CompilerServices.InterpolatedStringBuilder.AppendLiteral(System.String)
    L0065: lea rcx, [rsp+0x28]
    L006a: vmovsd xmm1, [rsp+0x80]
    L0073: call System.Runtime.CompilerServices.InterpolatedStringBuilder.AppendFormatted[[System.Double, System.Private.CoreLib]](Double)
    L0078: mov rdx, 0x2a730bc52f0
    L0082: mov rdx, [rdx]
    L0085: lea rcx, [rsp+0x28]
    L008a: call System.Runtime.CompilerServices.InterpolatedStringBuilder.AppendLiteral(System.String)
    L008f: lea rcx, [rsp+0x28]
    L0094: mov rdx, rsi
    L0097: call System.Runtime.CompilerServices.InterpolatedStringBuilder.AppendFormatted[[System.Guid, System.Private.CoreLib]](System.Guid)
    L009c: lea rcx, [rsp+0x28]
    L00a1: call System.Runtime.CompilerServices.InterpolatedStringBuilder.ToStringAndClear()
    L00a6: nop
    L00a7: add rsp, 0x58
    L00ab: pop rsi
    L00ac: pop rdi
    L00ad: ret

So ~63 bytes of IL and ~173 bytes of asm. IL went up by ~40 bytes of IL, but asm was effectively the same, and that addresses all the deficiencies we’ve talked about:

  • Nothing is boxed
  • There is no perf cliff, regardless of how many holes there are in the input string.
  • Everything can be devirtualized / inlined to the JIT’s heart’s content
  • Parsing is not only done once, it’s done at compile time.
  • It supports any types we want it to, including ReadOnlySpan.
    and while portions of this are generic, they’re generic only on a given T, so you only pay for int once, you only pay for double once, etc., not per every permutation of all of the type arguments. And it ends up being much faster than string.Format, for many of the above reasons (and we’re not done optimizing it yet). And as a pattern targetable by the compiler, it supports a myriad of additional scenarios.

Now let’s look at CompositeFormat. I grabbed the latest from your repo this morning. Its version of this same example would look like:

private static readonly CompositeFormat s_cf = new CompositeFormat("{0} {1} {2}");return _cf.Format(i, d, g);

From an IL perspective, as long as you don’t go beyond the three-argument limit, as one would expect from the signature, the IL is the smallest at 15 bytes:

IL_0000: ldarg.0
IL_0001: ldflda valuetype Text.CompositeFormat C::_cf
IL_0006: ldarg.1
IL_0007: ldarg.2
IL_0008: ldarg.3
IL_0009: call instance string Text.CompositeFormat::Format<int32, float64, valuetype [System.Private.CoreLib]System.Guid>(!!0, !!1, !!2)
IL_000e: ret

but the asm ends up being a different story. The call site itself here ends up being

    L0000: push r14
    L0002: push rdi
    L0003: push rsi
    L0004: push rbp
    L0005: push rbx
    L0006: sub rsp, 0x80
    L000d: vzeroupper
    L0010: xor eax, eax
    L0012: mov [rsp+0x40], rax
    L0017: mov esi, edx
    L0019: vmovsd [rsp+0xc0], xmm2
    L0022: cmp [rcx], ecx
    L0024: lea rdi, [rcx+8]
    L0028: vmovdqu xmm1, [r9]
    L002d: vmovdqu [rsp+0x70], xmm1
    L0033: vmovupd xmm1, [rsp+0x70]
    L0039: vmovupd [rsp+0x60], xmm1
    L003f: mov rcx, rdi
    L0042: mov edx, 3
    L0047: xor r8d, r8d
    L004a: call Text.CompositeFormat.CheckNumArgs(Int32, System.Object[])
    L004f: xor ebx, ebx
    L0051: xor ebp, ebp
    L0053: mov ecx, esi
    L0055: call Text.CompositeFormat.EstimateArgSize[[System.Int32, System.Private.CoreLib]](Int32)
    L005a: mov r14d, eax
    L005d: vmovsd xmm0, [rsp+0xc0]
    L0066: call Text.CompositeFormat.EstimateArgSize[[System.Double, System.Private.CoreLib]](Double)
    L006b: add eax, r14d
    L006e: mov rcx, rdi
    L0071: mov r8d, esi
    L0074: vmovsd xmm0, [rsp+0xc0]
    L007d: vmovupd xmm3, [rsp+0x60]
    L0083: vmovupd [rsp+0x50], xmm3
    L0089: lea rdx, [rsp+0x40]
    L008e: mov [rdx], rbx
    L0091: mov [rdx+8], ebp
    L0094: add eax, 8
    L0097: mov [rsp+0x30], eax
    L009b: vmovaps xmm3, xmm0
    L009f: lea rdx, [rsp+0x50]
    L00a4: mov [rsp+0x20], rdx
    L00a9: lea rdx, [rsp+0x40]
    L00ae: mov [rsp+0x28], rdx
    L00b3: xor edx, edx
    L00b5: call Text.CompositeFormat.Fmt[[System.Int32, System.Private.CoreLib],[System.Double, System.Private.CoreLib],[System.Guid, System.Private.CoreLib]](System.IFormatProvider, Int32, Double, System.Guid, System.ReadOnlySpan`1<System.Object>, Int32)
    L00ba: nop
    L00bb: add rsp, 0x80
    L00c2: pop rbx
    L00c3: pop rbp
    L00c4: pop rsi
    L00c5: pop rdi
    L00c6: pop r14
    L00c8: ret

for ~200 bytes of asm. But in addition to that, there are a few routines (Fmt and CoreFmt) that are generic on all of the parameter types, and thus are stamped for every permutation that’s used. That doesn’t necessarily equate to a stamp for each call site, but it can devolve into that:

Text.CompositeFormat.Fmt[[System.Int32, System.Private.CoreLib],[System.Double, System.Private.CoreLib],[System.Guid, System.Private.CoreLib]](System.IFormatProvider, Int32, Double, System.Guid, System.ReadOnlySpan`1<System.Object>, Int32)
    L0000: push rbp
    L0001: push r15
    L0003: push r14
    L0005: push rdi
    L0006: push rsi
    L0007: push rbx
    L0008: sub rsp, 0x88
    L000f: vzeroupper
    L0012: lea rbp, [rsp+0x40]
    L0017: vxorps xmm4, xmm4, xmm4
    L001b: vmovdqa [rbp], xmm4
    L0020: vmovdqa [rbp+0x10], xmm4
    L0025: vmovdqa [rbp+0x20], xmm4
    L002a: vmovdqa [rbp+0x30], xmm4
    L002f: xor eax, eax
    L0031: mov [rbp+0x40], rax
    L0035: mov rax, 0x1b910314300a
    L003f: mov [rbp], rax
    L0043: mov rsi, rcx
    L0046: mov rdi, rdx
    L0049: mov ebx, r8d
    L004c: vmovsd [rbp+0x98], xmm3
    L0054: mov r14, [rbp+0xa0]
    L005b: mov r15, [rbp+0xa8]
    L0062: mov edx, [rbp+0xb0]
    L0068: mov rcx, [rsi+8]
    L006c: add edx, [rcx+8]
    L006f: cmp edx, 0x80
    L0075: jl short L0082
    L0077: lea rcx, [rbp+8]
    L007b: call Text.StringMaker..ctor(Int32)
    L0080: jmp short L00be
    L0082: add rsp, 0x40
    L0086: mov edx, 0x10
    L008b: push 0
    L008d: push 0
    L008f: dec rdx
    L0092: jne short L008b
    L0094: sub rsp, 0x40
    L0098: lea rdx, [rsp+0x40]
    L009d: xor ecx, ecx
    L009f: mov [rbp+8], rcx
    L00a3: lea rcx, [rbp+0x18]
    L00a7: mov [rcx], rdx
    L00aa: mov dword ptr [rcx+8], 0x80
    L00b1: mov byte ptr [rbp+0x14], 0
    L00b5: xor edx, edx
    L00b7: mov [rbp+0x10], edx
    L00ba: mov byte ptr [rbp+0x15], 0
    L00be: vmovdqu xmm0, [rbp+8]
    L00c3: vmovdqu [rbp+0x28], xmm0
    L00c8: vmovdqu xmm0, [rbp+0x18]
    L00cd: vmovdqu [rbp+0x38], xmm0
    L00d2: vmovsd xmm3, [rbp+0x98]
    L00da: vmovsd [rsp+0x20], xmm3
    L00e0: mov [rsp+0x28], r14
    L00e5: mov [rsp+0x30], r15
    L00ea: lea rdx, [rbp+0x28]
    L00ee: mov rcx, rsi
    L00f1: mov r8, rdi
    L00f4: mov r9d, ebx
    L00f7: call Text.CompositeFormat.CoreFmt[[System.Int32, System.Private.CoreLib],[System.Double, System.Private.CoreLib],[System.Guid, System.Private.CoreLib]](Text.StringMaker ByRef, System.IFormatProvider, Int32, Double, System.Guid, System.ReadOnlySpan`1<System.Object>)
    L00fc: lea rcx, [rbp+0x28]
    L0100: call Text.StringMaker.ExtractString()
    L0105: mov rcx, 0x1b910314300a
    L010f: cmp [rbp], rcx
    L0113: je short L011a
    L0115: call 0x00007ffc9de2d430
    L011a: nop
    L011b: lea rsp, [rbp+0x48]
    L011f: pop rbx
    L0120: pop rsi
    L0121: pop rdi
    L0122: pop r14
    L0124: pop r15
    L0126: pop rbp
    L0127: ret

Text.CompositeFormat.CoreFmt[[System.Int32, System.Private.CoreLib],[System.Double, System.Private.CoreLib],[System.Guid, System.Private.CoreLib]](Text.StringMaker ByRef, System.IFormatProvider, Int32, Double, System.Guid, System.ReadOnlySpan`1<System.Object>)
    L0000: push r15
    L0002: push r14
    L0004: push r13
    L0006: push r12
    L0008: push rdi
    L0009: push rsi
    L000a: push rbp
    L000b: push rbx
    L000c: sub rsp, 0x78
    L0010: vzeroupper
    L0013: xor eax, eax
    L0015: mov [rsp+0x58], rax
    L001a: mov [rsp+0x60], rax
    L001f: mov [rsp+0xd8], r9d
    L0027: mov rbp, rcx
    L002a: mov rsi, rdx
    L002d: mov rdi, r8
    L0030: mov rbx, [rsp+0xf0]
    L0038: xor r12d, r12d
    L003b: mov r13, [rbp]
    L003f: xor eax, eax
    L0041: mov r8d, [r13+8]
    L0045: mov [rsp+0x44], r8d
    L004a: test r8d, r8d
    L004d: jle L022a
    L0053: mov [rsp+0x74], eax
    L0057: movsxd rcx, eax
    L005a: shl rcx, 4
    L005e: lea rcx, [r13+rcx+0x10]
    L0063: mov r9, [rcx]
    L0066: mov [rsp+0x38], r9
    L006b: movsx r10, word ptr [rcx+8]
    L0070: movsx r11, word ptr [rcx+0xa]
    L0075: mov [rsp+0x6c], r11w
    L007b: movsx rdx, word ptr [rcx+0xc]
    L0080: mov [rsp+0x68], dx
    L0085: test r10d, r10d
    L0088: jle short L00f5
    L008a: mov rcx, [rbp+8]
    L008e: test rcx, rcx
    L0091: jne short L00ab
    L0093: mov ecx, r12d
    L0096: or ecx, r10d
    L0099: jne L023b
    L009f: xor ecx, ecx
    L00a1: xor r14d, r14d
    L00a4: mov [rsp+0x70], r10d
    L00a9: jmp short L00d4
    L00ab: mov r14d, r12d
    L00ae: mov r15d, r10d
    L00b1: add r14, r15
    L00b4: mov r15d, [rcx+8]
    L00b8: cmp r14, r15
    L00bb: ja L023b
    L00c1: add rcx, 0xc
    L00c5: movsxd r14, r12d
    L00c8: lea rcx, [rcx+r14*2]
    L00cc: mov [rsp+0x70], r10d
    L00d1: mov r14d, r10d
    L00d4: lea r15, [rsp+0x58]
    L00d9: mov [r15], rcx
    L00dc: mov [r15+8], r14d
    L00e0: mov rcx, rsi
    L00e3: lea rdx, [rsp+0x58]
    L00e8: call Text.StringMaker.Append(System.ReadOnlySpan`1<Char>)
    L00ed: mov r14d, [rsp+0x70]
    L00f2: add r12d, r14d
    L00f5: mov r11d, [rsp+0x6c]
    L00fa: movsx rcx, r11w
    L00fe: test ecx, ecx
    L0100: jl L0211
    L0106: cmp ecx, 2
    L0109: ja L01d4
    L010f: mov r9, [rsp+0x38]
    L0114: mov r14d, [rsp+0xd8]
    L011c: mov ecx, ecx
    L011e: lea r10, [Text.CompositeFormat.CoreFmt[[System.Int32, System.Private.CoreLib],[System.Double, System.Private.CoreLib],[System.Guid, System.Private.CoreLib]](Text.StringMaker ByRef, System.IFormatProvider, Int32, Double, System.Guid, System.ReadOnlySpan`1<System.Object>)]
    L0125: mov r10d, [r10+rcx*4]
    L0129: lea r11, [L0038]
    L0130: add r10, r11
    L0133: jmp r10
    L0136: mov edx, [rsp+0x68]
    L013a: movsx rcx, dx
    L013e: mov [rsp+0x20], ecx
    L0142: mov rcx, rsi
    L0145: mov edx, r14d
    L0148: mov r8, r9
    L014b: mov r9, rdi
    L014e: call Text.CompositeFormat.AppendArg[[System.Int32, System.Private.CoreLib]](Text.StringMaker ByRef, Int32, System.String, System.IFormatProvider, Int32)
    L0153: mov [rsp+0xd8], r14d
    L015b: jmp L0211
    L0160: mov edx, [rsp+0x68]
    L0164: movsx rcx, dx
    L0168: mov [rsp+0x20], ecx
    L016c: mov rcx, rsi
    L016f: vmovsd xmm1, [rsp+0xe0]
    L0178: mov r8, r9
    L017b: mov r9, rdi
    L017e: call Text.CompositeFormat.AppendArg[[System.Double, System.Private.CoreLib]](Text.StringMaker ByRef, Double, System.String, System.IFormatProvider, Int32)
    L0183: mov [rsp+0xd8], r14d
    L018b: jmp L0211
    L0190: mov r15, [rsp+0xe8]
    L0198: vmovdqu xmm0, [r15]
    L019d: vmovdqu [rsp+0x48], xmm0
    L01a3: mov edx, [rsp+0x68]
    L01a7: movsx rcx, dx
    L01ab: mov [rsp+0x20], ecx
    L01af: mov rcx, rsi
    L01b2: lea rdx, [rsp+0x48]
    L01b7: mov r8, r9
    L01ba: mov r9, rdi
    L01bd: call Text.CompositeFormat.AppendArg[[System.Guid, System.Private.CoreLib]](Text.StringMaker ByRef, System.Guid, System.String, System.IFormatProvider, Int32)
    L01c2: mov [rsp+0xd8], r14d
    L01ca: mov [rsp+0xe8], r15
    L01d2: jmp short L0211
    L01d4: add ecx, 0xfffffffd
    L01d7: cmp ecx, [rbx+8]
    L01da: jae short L0246
    L01dc: mov r10, [rbx]
    L01df: movsxd rcx, ecx
    L01e2: mov rcx, [r10+rcx*8]
    L01e6: mov [rsp+0x20], rdi
    L01eb: mov edx, [rsp+0x68]
    L01ef: movsx rdx, dx
    L01f3: mov [rsp+0x28], edx
    L01f7: mov rdx, rsi
    L01fa: mov r8, rcx
    L01fd: mov r9, [rsp+0x38]
    L0202: mov rcx, 0x7ffc46c23880
    L020c: call Text.CompositeFormat.AppendArg[[System.__Canon, System.Private.CoreLib]](Text.StringMaker ByRef, System.__Canon, System.String, System.IFormatProvider, Int32)
    L0211: mov eax, [rsp+0x74]
    L0215: inc eax
    L0217: mov r8d, [rsp+0x44]
    L021c: cmp r8d, eax
    L021f: mov [rsp+0x44], r8d
    L0224: jg L0053
    L022a: add rsp, 0x78
    L022e: pop rbx
    L022f: pop rbp
    L0230: pop rsi
    L0231: pop rdi
    L0232: pop r12
    L0234: pop r13
    L0236: pop r14
    L0238: pop r15
    L023a: ret
    L023b: mov ecx, 0x21
    L0240: call System.ThrowHelper.ThrowArgumentOutOfRangeException(System.ExceptionArgument)
    L0245: int3
    L0246: call 0x00007ffc9de2bc70
    L024b: int3

for ~882 bytes of asm per permutation used. In addition to the IL/asm, there’s the managed object overhead associated with the given CompositeFormat, which would generally be per call site, and that’s another 3 objects / 138 bytes in this case. The above is also only when you use up to the three-argument generic version: after that, it falls back to a params array, where you no longer have the generic issue, but the only thing you gain over string.Format is the parsing at first use rather than at every use.

So, my concern with pushing such a thing right now is that:

  • Even culled down it still potentially contributes a non-trivial amount of code / size per call site.
  • You fall off a similar cliff after three arguments, or whatever number of generic arguments we decide to support.
  • It doesn’t support things that can’t be generic/boxed, like ReadOnlySpan.
  • You still pay for the parsing at runtime, just on first use instead.

combined with us shipping something shiny new and folks potentially using it in cases they shouldn’t, e.g. paying those costs for exception resource strings.

My position is not “we will never do this”, as I agree there’s merit (as I’ve stated), I’d just like to take some time with it. For example, in a not-terribly-far-off future C#/.NET we might introduce a ref struct generic constraint, which might enable ReadOnlySpan to be used, and I’d want to make sure the design played in well with that. I'd also want think about if/how it might be done without needing to manage separate type/field, e.g. could a string.Format overload be provided that was told "it's ok to cache this" in addition to being appropriately generic, and use a concurrent dictionary to map the string as a key to the parsed data. It’s also why I think we should start by incorporating it into LoggerMessage.Define: it already has all the cited issues, and isn’t a new API that represents yet another way to do something that’s already possible, rather this ends up just being an implementation detail beneath an API that is already one of the primary reasons you'd want this. And in particular, LoggerMessage.Define<T1, T2, T3> is already contributing something like ~10K of asm and managed objects for each define, so someone using it has already signed up for those overheads.

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021
@tannergooding tannergooding modified the milestones: 7.0.0, Future Jul 12, 2021
@stephentoub stephentoub modified the milestones: Future, 8.0.0 Jun 22, 2022
@stephentoub stephentoub added the api-ready-for-review API is ready for review, it is NOT ready for implementation label Jan 13, 2023
@iSazonov
Copy link
Contributor

If today format in resource is string will there auto conversion from string to CompositeFormat?

@stephentoub
Copy link
Member

If today format in resource is string will there auto conversion from string to CompositeFormat?

No, and we wouldn't want that to happen.

@iSazonov
Copy link
Contributor

In the case how will we have to modify existing code with strong typed resources? Could you please show examples?

@stephentoub
Copy link
Member

In the case how will we have to modify existing code with strong typed resources? Could you please show examples?

You can keep doing exactly what you're doing today if you're fine with the perf. If your goal is to reduce the overheads associated with the formatting, then you want to create and cache the CompositeFormat from the resource, e.g. instead of:

string s = string.Format(SR.WhateverResourceString, arg1, arg2);

you'd do:

private static readonly CompositeFormat s_whateverResourceString = CompositeFormat.Parse(SR.WhateverResourceString);
...
string s = string.Format(s_whateverResourceString, arg1, arg2);

@iSazonov
Copy link
Contributor

Thanks! My question was just about why the string.Format can't do parsing internally or why it's undesirable. Could you please explain?

@stephentoub
Copy link
Member

stephentoub commented Jan 13, 2023

can't do parsing internally or why it's undesirable. Could you please explain?

It does do parsing internally. Which takes time. Why parse over and over again?

If the code isn't on a hot path, it doesn't matter.

If the code is on a hot path, better to do all the preprocessing once and then reuse the fruits of your labor over and over.

@iSazonov
Copy link
Contributor

In other words, would you like to avoid an internal cache in string.Format?

@stephentoub
Copy link
Member

stephentoub commented Jan 13, 2023

In other words, would you like to avoid an internal cache in string.Format?

Any such cache would both cache too much (we don't need to be caching the parsing of strings destined for exceptions) and too little (it would need a cap to avoid growing unbounded), it would require synchronization which is either or both a scalability bottleneck and a performance hit, it lacks the ability for a developer to control if / how that caching happens, etc. And on top of that it would require the costs of the cache lookup.

@geeknoid
Copy link
Member Author

It seems like it should be possible to augment the resx format and associated tool such that you could mark a string resource as being a format string, and then the generated Resources.Designer.cs file would expose that string as a CompositeFormat instance instead of a raw string. Just a small convenience...

@rainersigwald
Copy link
Member

It seems like it should be possible to augment the resx format and associated tool such that you could mark a string resource as being a format string, and then the generated Resources.Designer.cs file would expose that string as a CompositeFormat instance instead of a raw string. Just a small convenience...

For the codepath that is all MSBuild, we could certainly do this--we'd have to consult with the resource-designer folks in VS to make sure the additional information would be accessible if you used the UI to add or edit a string.

There's an older experience where VS itself generates the C# (and generally it gets checked in). That is more commonly used today, and it uses .NET Framework resx handling code that wouldn't get the new feature. But that's not a blocker IMO.

@davidfowl
Copy link
Member

#50330 (comment)

+💯💯

These internal caches are never great, for all the reasons mentioned 😄

@iSazonov
Copy link
Contributor

The motivation states that this is an API for localized strings, too:

For any scenario where the text to format is not known at compilation time, such as for localized content being pulled from a resource file, then classic String.Format and StringBuilder.AppendFormat remain the workhorses.

This explains why the question about caching arises. Resource manager returns a string depending on the current UI language. This means we need to remember the result of parsing that string and reset the cache if the language changes. Doing this manually for each string is too tedious. This caching should be done either in the Resource manager or in an additional intermediate layer, if not in the CompositeFormat. This intermediate layer can be generated for strong typed resources for example.

Here is an example from PowerShell repository. Current code for strong typed resource string:

    internal static string Reason_NeverRun {
        get {
            return ResourceManager.GetString("Reason_NeverRun", resourceCulture);
        }

now would be:

    private static CompositeFormat s_Reason_NeverRun;

    internal static string Reason_NeverRun {
        get {
            if (languageChanged)
            {
                s_Reason_NeverRun = CompositeFormat.Parse(ResourceManager.GetString("Reason_NeverRun", resourceCulture));
            }
            return s_Reason_NeverRun;
        }

If we reread the same

For any scenario where the text to format is not known at compilation time, such as for localized content being pulled from a resource file, then classic String.Format and StringBuilder.AppendFormat remain the workhorses.

it says that the resource string is unknown at compile time. But I would speculate that for most applications this is not the case. I can't even think of an application that uses resource strings that are not known in advance to the developer and use an unknown number of arguments. If we dismiss this as an extreme case, we can claim that all resource strings are known at compile time and have a strictly defined number of parameters. (I'm thinking about user messages, UI elements, and how strong typed resources work.) This means that resource strings can be parsed at compile time and even converted to code. At a minimum, the developer has to create a resource file for the default language, which could act as a schema or definition of interfaces for resource strings of other languages. The source string for the default language is defacto a unique identifier (if it changes, the corresponding resource strings for other languages must be updated too).
This suggests that we could save the developer from creating a resource file for the default language altogether - the compiler can collect interpolated (resource) strings from source code and generate reference XML resource file for the default language, then msbuild can generate XML files for other requested languages based on it, it is possible to automatically translate strings to other languages (without this, an incremental mode is needed, which will allow tracking changes (by compiler warnings) and the need to manually translate new strings or modified strings) and then the compiler it can convert them into code that will be executed when switching to a suitable language in the same callsite where the original interpolated string is located.
This will require possibly compiler, msbuild support and possibly new public types (not CompositeFormat but maybe new attribute, new resource manager to switch delegates). I can't estimate how difficult it is to implement, but I don't think it's more complicated than an Interpolated string handler. And it would definitely simplify the localization of applications, tracking changes and automating these processes.

@KalleOlaviNiemitalo
Copy link

if (languageChanged)

In a web application, resourceCulture can be null all the time, but CultureInfo.CurrentUICulture can come from the HTTP Accept-Language header field and then used for the resource lookup. It would be wasteful if Resources.Designer.cs discarded the cached CompositeFormat instances whenever the language of the request differs from the previous one. Rather, I think the CompositeFormat instances should be constructed by ResourceReader as part of deserialization and then cached by ResourceSet like other types.

@iSazonov
Copy link
Contributor

if (languageChanged)

In a web application, resourceCulture can be null all the time, but CultureInfo.CurrentUICulture can come from the HTTP Accept-Language header field and then used for the resource lookup. It would be wasteful if Resources.Designer.cs discarded the cached CompositeFormat instances whenever the language of the request differs from the previous one. Rather, I think the CompositeFormat instances should be constructed by ResourceReader as part of deserialization and then cached by ResourceSet like other types.

This confirms that caching is needed in real applications. And it is even more efficient to compile resource strings directly into code - then the web application will simply call the delegate corresponding to the value of the Accept-Language header field.

@stephentoub
Copy link
Member

stephentoub commented Jan 16, 2023

This confirms that caching is needed in real applications

No one is saying otherwise; everyone here has stated that caching is required for this to make sense. What I objected to was caching as part of string.Format, StringBuilder.AppendFormat, etc. The consumer is responsible for caching these CompositeFormat instances in whatever manner makes sense. That could be the direct consumer, that could be tooling-generated code, etc. This provides a primitive that puts the caller in control of the cacheable item and the APIs that seamlessly consume it.

@KalleOlaviNiemitalo
Copy link

@iSazonov, in your design, would it still be possible to keep the (code generated from) localized resources in satellite assemblies and deploy them from optional language packs? Or would you rather place the strings of all languages in the main assembly?

Common Lisp has prior art for generating code from a format string at compile time (CLHS: Section 22.2.1.3 Compiling Format Strings). The type system is different there, though.

@iSazonov
Copy link
Contributor

No one is saying otherwise; everyone here has stated that caching is required for this to make sense. What I objected to was caching as part of string.Format, StringBuilder.AppendFormat, etc.

@stephentoub I'm not trying to disprove you. I'm trying to figure out how to apply this to a specific PowerShell project. And as I pointed out in a previous post, we could modify our ResGen tool so as to get the benefits of CompositeFormat. No doubt it would work.
But I would appreciate your opinion and that of other experts on my assumption above that the vast majority of resource strings are the same constants as interpolated strings and can be converted into code at compile time excluding allocations and slow startup. If this is correct and possible I would be glad if compiler and msbuild do most of the localization work for me.

in your design, would it still be possible to keep the (code generated from) localized resources in satellite assemblies and deploy them from optional language packs? Or would you rather place the strings of all languages in the main assembly?

@KalleOlaviNiemitalo I don't see any obstacles. If we now have a ResourceManager that allows us to load and use satellite assemblies, the new manager could do the same thing only by loading delegates instead of resource strings. Especially since there is now support for unloadable assemblies.
The main thing is that the original interpolated string completely defines the interface, i.e. the number of parameters (and literals). If the compiler writes this information into an outgoing XML file (like the current resource XML one), it can be used autonomously (in another msbuild project) to create satellite assemblies for other languages.

@KalleOlaviNiemitalo
Copy link

KalleOlaviNiemitalo commented Jan 16, 2023

@iSazonov perhaps the main assembly could have an abstract class with an abstract method for each parameterized localizable string, and then the satellite assembly would define a class that extends that and implements the methods. That way it wouldn't need all the delegate instances, and the parameters could have names so that arguments in C# calls could be specified by name. Unclear:

  • what kind of code the programmers and the localizers would write in this system
  • how the runtime would locate the implementing class in the satellite assembly
  • how the satellite would be protected from trimming
  • whether this could support partial translations and fallback to neutral cultures

@iSazonov
Copy link
Contributor

@KalleOlaviNiemitalo These are implementation details. I believe C# already has everything we need to implement this script and is already in use elsewhere. The compiler and msbuild support just adds to the convenience. Plugins themselves are not new to C#.

@bartonjs
Copy link
Member

bartonjs commented Jan 17, 2023

Video

Looks good as proposed

  • We discussed whether CompositeFormat could just be FormattableString, and the answer was no.
  • We discussed namespaces for FormattableString/StandardFormat/CompositeFormat and didn't feel like there was a better place for the new type than System.Text.
  • We discussed the TArg0...TArgN being after the out parameter for TryWrite, and decided it is correct.
namespace System
{
    public class String
    {
        public static string Format<TArg0>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0);
        public static string Format<TArg0, TArg1>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0, TArg1 arg1);
        public static string Format<TArg0, TArg1, TArg2>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0, TArg1 arg1, TArg2 arg2);
        public static string Format(IFormatProvider? provider, CompositeFormat format, params object?[] args);
        public static string Format(IFormatProvider? provider, CompositeFormat format, ReadOnlySpan<object?> args);
    }

    public static class MemoryExtensions
    {
        public static bool TryWrite<TArg0>(this Span<char> destination, IFormatProvider? provider, CompositeFormat format, out int charsWritten, TArg0 arg);
        public static bool TryWrite<TArg0, TArg1>(this Span<char> destination, IFormatProvider? provider, CompositeFormat format, out int charsWritten, TArg0 arg, TArg1 arg1);
        public static bool TryWrite<TArg0, TArg1, TArg2>(this Span<char> destination, IFormatProvider? provider, CompositeFormat format, out int charsWritten, TArg0 arg, TArg1 arg1, TArg2 arg2);
        public static bool TryWrite(this Span<char> destination, IFormatProvider? provider, CompositeFormat format, out int charsWritten, params object?[] args);
        public static bool TryWrite(this Span<char> destination, IFormatProvider? provider, CompositeFormat format, out int charsWritten, ReadOnlySpan<object?> args);
    }
}

namespace System.Text
{
    public class StringBuilder
    {
        public static StringBuilder AppendFormat<TArg0>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0);
        public static StringBuilder AppendFormat<TArg0, TArg1>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0, TArg1 arg1);
        public static StringBuilder AppendFormat<TArg0, TArg1, TArg2>(IFormatProvider? provider, CompositeFormat format, TArg0 arg0, TArg1 arg1, TArg2 arg2);
        public static StringBuilder AppendFormat(IFormatProvider? provider, CompositeFormat format, params object?[] args);
        public static StringBuilder AppendFormat(IFormatProvider? provider, CompositeFormat format, ReadOnlySpan<object?> args);
    }
}

namespace System.Text
{
    public sealed class CompositeFormat
    {
        public static CompositeFormat Parse([StringSyntax(StringSyntaxAttribute.CompositeFormat)] string format);
        public static bool TryParse([StringSyntax(StringSyntaxAttribute.CompositeFormat)] string format, [NotNullWhen(true)] out CompositeFormat? compositeFormat);

        public string Format { get; }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 17, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 17, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 20, 2023
@stephentoub
Copy link
Member

@bartonjs, @terrajobst, in the API review for this, you were asking about what ambiguities made me remove new overloads that didn't take an IFormatProvider. Here's an example:
SharpLab

using System;

public class C
{
    public static string M(string format, object[] args) => MyString.Format(null, format, args);
}

public class MyString 
{
    public static string Format(IFormatProvider? provider, string format, params object[] args) => null;
    public static string Format<T1, T2>(CompositeFormat format, T1 arg1, T2 arg2) => null;
}

public class CompositeFormat {}

That fails to compile due to the call being ambiguous: it could interpret the null as either an IFormatProvider or a CompositeFormat, the format string as either a string or a generic T1, and the object[] as either the args or a generic T2, and there's no tie-breaker in overload resolution in the language to disambiguate those. Similar issue exists for StringBuilder.AppendFormat.

This seemed like enough of a deal breaker for me that I kept those overloads out, but during the meeting we did say we'd want to add them if the overloads turned out not to be problematic. Do you agree we shouldn't add them?

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants