Use generated generic Linker.DefineFunction() and Function.FromCallback() overloads for efficiently invoking callbacks #163

Merged

@kpreisser kpreisser commented Oct 11, 2022

Hi! Currently, callbacks defined with Linker.DefineFunction() and Function.FromCallback() are called using reflection, which has some overhead:

// NOTE: reflection is extremely slow for invoking methods. in the future, perhaps this could be replaced with
// source generators, system.linq.expressions, or generate IL with DynamicMethods or something
var result = callback.Method.Invoke(callback.Target, BindingFlags.DoNotWrapExceptions, null, invokeArgs, null);

With this PR, T4 text templates are used to automatically generate generic Linker.DefineFunction() and Function.FromCallback() overloads for the different combinations of parameter count, result count, and whether a Caller argument is present.

Based on the idea from @martindevans in #160 (comment), the template generates code that uses ValueBox.Converter<T> for each generic parameter and result type and then invokes the callback directly. This avoids reflection and a number of heap allocations (e.g. allocating the arguments array, and boxing the arguments and return values), thereby improving performance and contributing to #113.
This way, no dynamic code generation will be needed (see previous PR #160).
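To illustrate the difference between the two call paths, here is a minimal, self-contained sketch. It does not use the Wasmtime types: RawValue, IRawConverter<T>, Converters<T> and CallbackSketch are hypothetical stand-ins (the converter plays the role that ValueBox.Converter<T> has in the PR), and the real generated code will differ.

```csharp
using System;
using System.Reflection;

// Usage: both wrappers end up invoking the same callback.
int sum = 0;
Action<int> callback = x => sum += x;

var viaReflection = CallbackSketch.Wrap((Delegate)callback); // reflection path
var viaGeneric = CallbackSketch.Wrap(callback);              // generic path

viaReflection(new RawValue { I32 = 40 });
viaGeneric(new RawValue { I32 = 2 });
Console.WriteLine(sum);

// Hypothetical stand-in for a WebAssembly value (not the library's type).
public struct RawValue
{
    public int I32;
}

// Hypothetical converter interface, assumed for this sketch only.
public interface IRawConverter<T>
{
    T Unbox(RawValue value);
}

public sealed class Int32Converter : IRawConverter<int>
{
    public int Unbox(RawValue value) => value.I32;
}

// One cached converter per closed generic type; only int is supported here.
public static class Converters<T>
{
    public static readonly IRawConverter<T> Instance = Create();

    private static IRawConverter<T> Create() =>
        typeof(T) == typeof(int)
            ? (IRawConverter<T>)(object)new Int32Converter()
            : throw new NotSupportedException(typeof(T).Name);
}

public static class CallbackSketch
{
    // Reflection path: allocates an argument array and boxes the int on every call.
    public static Action<RawValue> Wrap(Delegate callback) =>
        raw =>
        {
            object[] args = { raw.I32 };
            callback.Method.Invoke(callback.Target, BindingFlags.DoNotWrapExceptions, null, args, null);
        };

    // Generic path: the converter is looked up once, then the callback is
    // invoked directly without boxing or an argument array.
    public static Action<RawValue> Wrap<T>(Action<T> callback)
    {
        var converter = Converters<T>.Instance;
        return raw => callback(converter.Unbox(raw));
    }
}
```

The generic overload is picked whenever the static type of the argument is Action<T>; casting to Delegate forces the reflection path.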

The C# compiler seems to resolve the correct overload well:
[screenshot]

Currently, overloads are generated for up to 12 parameters and 4 result values, which yields 2 * 13 * 5 = 130 overloads (with or without a Caller argument, 0 to 12 parameters, 0 to 4 results). Note that more than 200 overloads will cause the C# compiler to fail to resolve them correctly (tested in VS 17.4.0 Preview 2.1).
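The overload count can be checked by enumerating the combinations the way a template loop might. This is only a sketch: OverloadSketch and the type parameter names (T1, TResult1, ...) are made up for illustration and are not taken from the actual .tt files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var delegateTypes = OverloadSketch.DelegateTypes(maxParameters: 12, maxResults: 4).ToList();
Console.WriteLine(delegateTypes.Count);
Console.WriteLine(delegateTypes[delegateTypes.Count - 1]);

public static class OverloadSketch
{
    // Yields one delegate type name per (Caller?, parameter count, result count) combination.
    public static IEnumerable<string> DelegateTypes(int maxParameters, int maxResults)
    {
        foreach (bool hasCaller in new[] { false, true })
        {
            for (int p = 0; p <= maxParameters; p++)
            {
                for (int r = 0; r <= maxResults; r++)
                {
                    var typeArgs = new List<string>();
                    if (hasCaller) typeArgs.Add("Caller");
                    for (int i = 1; i <= p; i++) typeArgs.Add($"T{i}");

                    // A single result is returned directly, several as a ValueTuple.
                    var results = Enumerable.Range(1, r).Select(i => $"TResult{i}").ToList();
                    if (r == 1) typeArgs.Add(results[0]);
                    else if (r > 1) typeArgs.Add($"ValueTuple<{string.Join(", ", results)}>");

                    string name = r == 0 ? "Action" : "Func";
                    yield return typeArgs.Count == 0 ? name : $"{name}<{string.Join(", ", typeArgs)}>";
                }
            }
        }
    }
}
```

With 12 parameters and 4 results this enumerates 2 * 13 * 5 = 130 entries, which stays below the roughly 200 overloads at which resolution was observed to break.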

For delegate types not covered by the overloads (e.g. an Action or Func with more parameter/result types, or other delegate types), or if the delegate type isn't known at compile-time, reflection will still be used by providing an overload that takes a Delegate (that one is still defined in Linker.cs and Function.cs).
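As a rough illustration of how the compiler chooses between a generated generic overload and the Delegate fallback, consider the following sketch. ResolutionSketch, Define and CustomCallback are hypothetical names; the methods only report which overload was selected.

```csharp
using System;

Action<int> known = _ => { };
Delegate untyped = known;            // static type is only Delegate
CustomCallback custom = _ => { };    // not an Action<...>/Func<...>

Console.WriteLine(ResolutionSketch.Define(known));
Console.WriteLine(ResolutionSketch.Define(untyped));
Console.WriteLine(ResolutionSketch.Define(custom));

// A delegate type that the generic overloads do not cover.
public delegate void CustomCallback(int x);

public static class ResolutionSketch
{
    // Stands in for one of the generated generic overloads.
    public static string Define<T>(Action<T> callback) => "generic";

    // Stands in for the reflection-based fallback.
    public static string Define(Delegate callback) => "reflection";
}
```

Only the first call binds to the generic overload; for the other two, type inference for Action<T> fails, so the Delegate overload is the only applicable candidate.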

Edit: To generate the code files (Linker.DefineFunction.cs, Function.FromCallback.cs), we add dotnet-t4 as a local tool, which is automatically invoked when building the project. The generated files are still part of the repo, so that e.g. SourceLink will still work correctly for these files.

The performance improvements are in the same range as those of the previous PR (#160), which used DynamicMethod to dynamically generate code to invoke the callback.

The largest performance boost occurs when defining a function with a single parameter. The following code was tested with .NET 7.0.0-rc.1 on Windows 10 Version 21H2 x64, using an Action<int>:

using System;
using System.Diagnostics;
using Wasmtime;

using var config = new Config();
config.WithOptimizationLevel(OptimizationLevel.Speed);

using var engine = new Engine(config);
using var module = Module.FromText(
    engine,
    "hello",
    @"
(module 
    (func $hello (import """" ""hello"") (param i32))
    (func (export ""run"")
        (local $0 i32)
        loop $for-loop|0
            local.get $0
            i32.const 2000000
            i32.lt_s
            if
                local.get $0
                call $hello

                local.get $0
                i32.const 1
                i32.add
                local.set $0
                br $for-loop|0
            end
        end
    )
)
");

using var linker = new Linker(engine);
using var store = new Store(engine);

int calls = 0;
linker.DefineFunction(
    "",
    "hello",
    (int x) =>
    {
        calls++;
    }
);

var instance = linker.Instantiate(store, module);
var run = instance.GetAction("run")!;

var sw = new Stopwatch();
for (int i = 0; i < 5; i++)
{
    sw.Restart();
    run();
    sw.Stop();

    Console.WriteLine("Elapsed: " + sw.Elapsed);
}

Before the change, the times are listed as follows (when compiling for Release):

Elapsed: 00:00:00.3099829
Elapsed: 00:00:00.4299516
Elapsed: 00:00:00.4231544
Elapsed: 00:00:00.4376052
Elapsed: 00:00:00.4250763

After the change:

Elapsed: 00:00:00.1333811
Elapsed: 00:00:00.1264500
Elapsed: 00:00:00.1125693
Elapsed: 00:00:00.1105346
Elapsed: 00:00:00.1083284

However, when using more than one argument, the time with reflection unexpectedly decreases, so the performance gain is much smaller. For example, using an Action<int, float, long>:

using System;
using System.Diagnostics;
using Wasmtime;

using var config = new Config();
config.WithOptimizationLevel(OptimizationLevel.Speed);

using var engine = new Engine(config);
using var module = Module.FromText(
    engine,
    "hello",
    @"
(module 
    (func $hello (import """" ""hello"") (param i32 f32 i64))
    (func (export ""run"")
        (local $0 i32)
        loop $for-loop|0
            local.get $0
            i32.const 2000000
            i32.lt_s
            if
                local.get $0
                f32.const 123.456
                i64.const 1234567890
                call $hello

                local.get $0
                i32.const 1
                i32.add
                local.set $0
                br $for-loop|0
            end
        end
    )
)
");

using var linker = new Linker(engine);
using var store = new Store(engine);

int calls = 0;
linker.DefineFunction(
    "",
    "hello",
    (int x, float y, long z) =>
    {
        calls++;
    }
);

var instance = linker.Instantiate(store, module);
var run = instance.GetAction("run")!;

var sw = new Stopwatch();
for (int i = 0; i < 5; i++)
{
    sw.Restart();
    run();
    sw.Stop();

    Console.WriteLine("Elapsed: " + sw.Elapsed);
}

Before the change:

Elapsed: 00:00:00.3110580
Elapsed: 00:00:00.2320351
Elapsed: 00:00:00.2336036
Elapsed: 00:00:00.2323470
Elapsed: 00:00:00.2347170

After the change:

Elapsed: 00:00:00.2289068
Elapsed: 00:00:00.2005959
Elapsed: 00:00:00.2001471
Elapsed: 00:00:00.1991165
Elapsed: 00:00:00.1986648

Testing with a Func<int, float, long, ValueTuple<int, int, long>>:

using System;
using System.Diagnostics;
using Wasmtime;

using var config = new Config();
config.WithOptimizationLevel(OptimizationLevel.Speed);

using var engine = new Engine(config);
using var module = Module.FromText(
    engine,
    "hello",
    @"
(module 
    (func $hello (import """" ""hello"") (param i32 f32 i64) (result i32 i32 i64))
    (func (export ""run"")
        (local $0 i32)
        loop $for-loop|0
            local.get $0
            i32.const 2000000
            i32.lt_s
            if
                local.get $0
                f32.const 123.456
                i64.const 1234567890
                call $hello
                drop
                drop
                drop

                local.get $0
                i32.const 1
                i32.add
                local.set $0
                br $for-loop|0
            end
        end
    )
)
");

using var linker = new Linker(engine);
using var store = new Store(engine);

int calls = 0;
linker.DefineFunction(
    "",
    "hello",
    (int x, float y, long z) =>
    {
        calls++;
        return (1, 2, 3L);
    }
);

var instance = linker.Instantiate(store, module);
var run = instance.GetAction("run")!;

var sw = new Stopwatch();
for (int i = 0; i < 5; i++)
{
    sw.Restart();
    run();
    sw.Stop();

    Console.WriteLine("Elapsed: " + sw.Elapsed);
}

Before:

Elapsed: 00:00:00.6005550
Elapsed: 00:00:00.5286508
Elapsed: 00:00:00.5350325
Elapsed: 00:00:00.5246926
Elapsed: 00:00:00.5371668

After:

Elapsed: 00:00:00.4286678
Elapsed: 00:00:00.3949199
Elapsed: 00:00:00.3934236
Elapsed: 00:00:00.3999471
Elapsed: 00:00:00.3924870

Comparison to other approaches:

  • Compared to generating dynamic code (see Use code generation with an DynamicMethod (System.Reflection.Emit) #160), this will also work on .NET runtimes that don't support dynamic code (or would interpret it), for example when using Native AOT. Additionally, this avoids the small runtime cost of code generation when defining the function. However, this approach only works if the delegate is known at compile-time to be a Func<...>/Action<...> whose number of type parameters is within the supported maximum; otherwise, reflection will still be used to call it. (Note that this Func/Action type restriction is already the case today.)
  • Compared to source generators, this approach is independent from the language used to compile to IL (e.g. C#, F#, VB.NET etc.).

What do you think?

Thanks!

…Function() methods that can efficiently call the specified callback without reflection.

kpreisser commented Oct 16, 2022

It seems we can improve performance even more by using unchecked function variants (bytecodealliance/wasmtime#3350). With commit kpreisser@09560a0 (separate branch generic-linker-define-function-unchecked that is based on this PR), I added a ValueRaw struct (that maps to wasmtime_val_raw_t) and an IValueRawConverter<T> interface similar to the existing ones, which are then used by the unchecked callback functions.
Since the .NET side knows the exact parameter and result types used by the function, this should be safe. (However, I'm not very familiar with Wasmtime's resource management, so I'm not sure if I have done the externref and funcref management right in the converters.)
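For readers unfamiliar with the raw value representation, the following sketch shows roughly what such a union-like struct and a converter could look like in C#. It is an assumption-laden illustration, not the code from the commit: ValueRawSketch, IRawConverterSketch<T> and Int32RawConverter are hypothetical names, only the numeric members are shown (v128, funcref and externref are omitted), and the 16-byte size is chosen to leave room for a v128 value.

```csharp
using System;
using System.Runtime.InteropServices;

var raw = default(ValueRawSketch);
Int32RawConverter.Instance.Box(123, ref raw);
Console.WriteLine(Int32RawConverter.Instance.Unbox(in raw));
Console.WriteLine(Marshal.SizeOf<ValueRawSketch>());

// All fields overlap at offset 0, like the members of a C union.
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct ValueRawSketch
{
    [FieldOffset(0)] public int I32;
    [FieldOffset(0)] public long I64;
    [FieldOffset(0)] public float F32;
    [FieldOffset(0)] public double F64;
}

// Hypothetical converter shape; the actual IValueRawConverter<T> is not shown in this thread.
public interface IRawConverterSketch<T>
{
    T Unbox(in ValueRawSketch raw);
    void Box(T value, ref ValueRawSketch raw);
}

public sealed class Int32RawConverter : IRawConverterSketch<int>
{
    public static readonly Int32RawConverter Instance = new Int32RawConverter();

    // No type tag is checked: the caller must already know the slot holds an i32.
    public int Unbox(in ValueRawSketch raw) => raw.I32;

    public void Box(int value, ref ValueRawSketch raw) => raw.I32 = value;
}
```

Because the raw value carries no type tag, correctness depends entirely on the converter matching the function's declared signature, which is why knowing the exact types on the .NET side matters.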

When trying the above benchmarks again (under .NET 7.0.0-rc.2, Windows 10 Version 21H2 x64), I get the following results:

Benchmark 1 (Action<int>):

Without this PR:

Elapsed: 00:00:00.2786417
Elapsed: 00:00:00.4104838
Elapsed: 00:00:00.4092130
Elapsed: 00:00:00.4186214
Elapsed: 00:00:00.4116475

With this PR:

Elapsed: 00:00:00.1474995
Elapsed: 00:00:00.1077812
Elapsed: 00:00:00.1064291
Elapsed: 00:00:00.1127987
Elapsed: 00:00:00.1071312

With this PR + commit kpreisser@09560a0:

Elapsed: 00:00:00.0673361
Elapsed: 00:00:00.0612401
Elapsed: 00:00:00.0603837
Elapsed: 00:00:00.0458639
Elapsed: 00:00:00.0466120

Benchmark 2 (Action<int, float, long>):

Without this PR:

Elapsed: 00:00:00.3057047
Elapsed: 00:00:00.2382728
Elapsed: 00:00:00.2427044
Elapsed: 00:00:00.2426993
Elapsed: 00:00:00.2464523

With this PR:

Elapsed: 00:00:00.2293002
Elapsed: 00:00:00.2010061
Elapsed: 00:00:00.2026756
Elapsed: 00:00:00.2042638
Elapsed: 00:00:00.2037475

With this PR + commit kpreisser@09560a0:

Elapsed: 00:00:00.0803906
Elapsed: 00:00:00.0766781
Elapsed: 00:00:00.0751499
Elapsed: 00:00:00.0688984
Elapsed: 00:00:00.0541781

Benchmark 3 (Func<int, float, long, ValueTuple<int, int, long>>):

Without this PR:

Elapsed: 00:00:00.5830719
Elapsed: 00:00:00.5139658
Elapsed: 00:00:00.5282328
Elapsed: 00:00:00.5148367
Elapsed: 00:00:00.5155885

With this PR:

Elapsed: 00:00:00.4300748
Elapsed: 00:00:00.3947863
Elapsed: 00:00:00.3907995
Elapsed: 00:00:00.3930814
Elapsed: 00:00:00.3916519

With this PR + commit kpreisser@09560a0:

Elapsed: 00:00:00.1369848
Elapsed: 00:00:00.1291589
Elapsed: 00:00:00.0874956
Elapsed: 00:00:00.0789046
Elapsed: 00:00:00.0782615

With the last benchmark, this seems to be roughly a 6x improvement compared to the current state (without this PR).
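The ratio can be recomputed from the Benchmark 3 timings listed above (in seconds). This small sketch uses a hypothetical Stats helper and compares both the median and the fastest run, since the first iterations include warm-up:

```csharp
using System;
using System.Linq;

// Benchmark 3 timings copied from the lists above.
double[] withoutPr = { 0.5830719, 0.5139658, 0.5282328, 0.5148367, 0.5155885 };
double[] uncheckedVariant = { 0.1369848, 0.1291589, 0.0874956, 0.0789046, 0.0782615 };

double medianSpeedup = Stats.Median(withoutPr) / Stats.Median(uncheckedVariant);
double bestRunSpeedup = withoutPr.Min() / uncheckedVariant.Min();

Console.WriteLine($"Median speedup:   {medianSpeedup:F1}x");
Console.WriteLine($"Best-run speedup: {bestRunSpeedup:F1}x");

public static class Stats
{
    // Median of a small sample; averages the two middle values for even counts.
    public static double Median(double[] values)
    {
        var sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

Both figures land between 5x and 7x, consistent with the "roughly 6x" estimate.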

What do you think?

Thanks!

@kpreisser kpreisser marked this pull request as ready for review October 21, 2022 13:53

kpreisser commented Oct 21, 2022

I think the PR is now ready for review. The use of unchecked function variants could be done in a separate follow-up PR.

A downside of the overloads shows up if you have a Func<...> that accepts at most 12 values but returns more than 4 values, for example Func<ValueTuple<int, int, int, int, int>>.
In that case, the overload Function.FromCallback<int, ValueTuple<int, int, int, int, int>> would be resolved (because the generated overloads only include ValueTuple with up to 4 type parameters), which will throw an exception due to the use of the tuple.

Instead, you would need to explicitly cast the delegate to Delegate to make it work (see commit dbd8945 for an example), as in that case the reflection variant is used.

Edit: I implemented a solution by falling back to using reflection (instead of throwing an exception) in such a case.
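A simplified sketch of such a fallback is shown below. It is not the code from this PR: FallbackSketch, Describe and MaxResults are hypothetical, and the methods only report which path would be taken. The idea is that the single-result generic overload detects a tuple result type with more elements than the generated overloads support and hands the delegate to the reflection-based overload instead of throwing.

```csharp
using System;
using System.Runtime.CompilerServices;

Func<(int, int, int, int)> four = () => (1, 2, 3, 4);
Func<(int, int, int, int, int)> five = () => (1, 2, 3, 4, 5);

Console.WriteLine(FallbackSketch.Describe(four));
Console.WriteLine(FallbackSketch.Describe(five));

public static class FallbackSketch
{
    // Assumed limit, matching the 4 result values mentioned in the description.
    private const int MaxResults = 4;

    // Stands in for a generated generic overload.
    public static string Describe<TResult>(Func<TResult> callback)
    {
        // Too many tuple elements: delegate to the reflection-based overload.
        if (IsUnsupportedTuple(typeof(TResult)))
            return Describe((Delegate)callback);

        return "generic";
    }

    // Stands in for the reflection-based overload.
    public static string Describe(Delegate callback) => "reflection";

    private static bool IsUnsupportedTuple(Type type) =>
        typeof(ITuple).IsAssignableFrom(type)
        && type.IsGenericType
        && type.GetGenericArguments().Length > MaxResults;
}
```

With this check in place, callers no longer need the explicit cast to Delegate for result tuples that exceed the supported size.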

Thank you!

…tion when the parameter/result type combination cannot be represented with the current generic parameters.
@peterhuene
Member

@kpreisser this looks like really excellent work! I'll try to get this reviewed early next week.

@peterhuene
Member

@kpreisser I apologize for the delay in reviewing this PR. I have time to dive into it fully tomorrow.

Member

@peterhuene peterhuene left a comment

This looks extremely promising and I really appreciate the templating use here as those overloads were getting unwieldy already.

I just have a few comments to discuss and once we've resolved that, I'll do a final review for approval as I think this is a very valuable thing to have.

Review threads:
src/Wasmtime.csproj (resolved)
src/Function.FromCallback.tt (outdated, resolved)
src/Linker.DefineFunction.tt (outdated, resolved)
src/Linker.DefineFunction.tt (outdated, resolved)
@peterhuene
Member

I'll review the recent changes and I'll likely approve, but let's hold off on merging this until CI goes green again with the other PR.

peterhuene previously approved these changes Nov 8, 2022
@kpreisser kpreisser changed the title Use generated generic Linker.DefineFunction() overloads for efficiently invoking callbacks Use generated generic Linker.DefineFunction() and Function.FromCallback() overloads for efficiently invoking callbacks Nov 9, 2022
@peterhuene peterhuene merged commit cc1179f into bytecodealliance:main Nov 11, 2022
@peterhuene
Member

@kpreisser thanks for all the hard work on this!

@kpreisser kpreisser deleted the generic-linker-define-function branch November 11, 2022 20:33
kpreisser added a commit to kpreisser/wasmtime-dotnet that referenced this pull request Nov 12, 2022
This is a follow-up to PR bytecodealliance#163.

This also improves the NRT annotations for callback delegates, and fixes a regression that prevented defining callbacks that take or return an interface.
peterhuene pushed a commit that referenced this pull request Nov 28, 2022
* Use unchecked functions with raw values for better performance.

This is a follow-up to PR #163.

This also improves the NRT annotations for callback delegates, and fixes a regression that prevented defining callbacks that take or return an interface.

* Follow-Up: Pass the ValueRaw as reference when boxing a value, for better performance.

* Follow-Up: Always set the first 64 bits of the ValueRaw, to match the behavior of the Rust implementation (and this doesn't seem to affect performance).

* Only create the Caller instance if it is actually needed.
This allows callbacks that don't use a Caller, Function or object parameter to be allocation-free.

* Empty commit to retrigger CI.

* Refactor to always pass a StoreContext to the IValueRawConverter<T>, so that allocating a Caller is now only necessary when unboxing Functions.

* Simplify handling of V128.
@kpreisser kpreisser mentioned this pull request Jan 17, 2023