Qute Improvements #42909

franz1981 · 2024-08-30T13:37:08Z

what benchmarks I could use to verify these, among yours?

mkouba · 2024-08-30T14:55:29Z

independent-projects/qute/core/src/main/java/io/quarkus/qute/LoopSectionHelper.java

+         * This method exists for the purpose of saving the cost of creating a substring of the key and
+         * computing its hashCode.
+         */
+        private CompletionStage<Object> getCompletedStage(String key, int keyStartIndex, int count) {


Hm, I see what you mean but I wonder if it would make more sense to precompute the metadata keys, store it in an array (in a field of LoopSectionHelper) and replace this stuff with a simple if-then-else. I'll try to prepare something so that we can compare the results...

If we can have something which comparing is as easy as comparing among enums - would be even better

The thing is that the alias (which is part of the prefix) can be different for a {#for} definition...

mkouba · 2024-08-30T15:05:12Z

what benchmarks I could use to verify these, among yours?

Loop15 and Loop50 from https://github.com/mkouba/qute-benchmarks BUT you'll have to rebase because up until now we tested the version without the prefix (because it was added later, quarkus 2.5) 🤷.

franz1981 · 2024-08-30T15:46:01Z

I hope @mkouba to have the chance to try it tomorrow :( in the template benchmark it was pretty relevant (the 3 opts I mean) - clearly, because it was memory bound as we said in our call

franz1981 · 2024-08-30T20:30:25Z

Hi @mkouba TLDR:

the 2 tests Loop15 and Loop50 doesn't stress that much the parts which I made my changes as https://github.com/casid/template-benchmark does - and given that templace-benchmark in the form reviewed by you wasn't that bad, I would rather add it to the qute benchmarks too
I've found another interesting optimization (and simplification, for once!) in 2de99fd: JDK has simplified the immutable list concrete types exactly with the purpose of reduce the arity of the virtual calls involving List and guarantees better performance, saving megamorphic invokeInterface for List::get, List::size`, etc etc

This latter thing could be proved by running async-profiler 3,0 on JMH via the agent and providing rawCommand=features=vtable as an argument; it will print, on top of each itable_stub frame, the concrete types called.

The last change along, for the Loop15 benchmark has delivered this improvement
before:

Benchmark                                        Mode  Cnt       Score       Error   Units
Loop15.render                                   thrpt    9  100823.778 ±  9422.943   ops/s

after:

Benchmark                                        Mode  Cnt       Score       Error   Units
Loop15.render                                   thrpt    9  115051.414 ±  4634.761   ops/s

which seems worthy, especially because it removes code to be maintained as well...
I'll run the full matrix of tests as soon as I could.

There are still few things which I don't understand (of the latest flamegraph collected on this PR), still for Loop15:

I can see some reflective access, see

why? is it expected?

due to the previous resolver, the arity of the call Resolver::resolve become megamorphic (for inferfaces, just 3 types seen are sufficient) i.e.

the previously reported ReflectionValueResolver
io.quarkus.qute.benchmark.data.Item_ValueResolver
io.quarkus.qute.ValueResolvers$10

This fact is what mess up with the benchmarks result, because getting a costly megamorphic call while accessing a single field, can be as costy as decoding a BigDecimal especially because if an unpredicted branch - and branch mispredict on modern CPUs could be super costy.

I'll see what I can do for this second point, but having the previous one fixed would be better IMO

franz1981 · 2024-09-01T13:50:52Z

Regardless, here the improved tests:

3.13.2 patched:

Benchmark                      Mode  Cnt       Score      Error  Units
HelloSimple.render            thrpt    9   77929.880 ± 1920.733  ops/s
IncludeSimple.render          thrpt    9   66147.296 ±  532.993  ops/s
JavaBeanValueResolver.render  thrpt    9   89911.121 ± 1363.160  ops/s
LetSimple.render              thrpt    9   51097.779 ±  144.413  ops/s
Loop15.render                 thrpt    9  114840.904 ±  847.497  ops/s
Loop50.render                 thrpt    9   35786.937 ±  322.756  ops/s
NameResolver.render           thrpt    9   56090.358 ±  754.946  ops/s

3.13.2 unpatched:

Benchmark                      Mode  Cnt      Score      Error  Units
HelloSimple.render            thrpt    9  74255.051 ± 2166.508  ops/s
IncludeSimple.render          thrpt    9  60716.557 ± 2650.541  ops/s
JavaBeanValueResolver.render  thrpt    9  85669.300 ± 1089.360  ops/s
LetSimple.render              thrpt    9  46460.260 ±  942.743  ops/s
Loop15.render                 thrpt    9  96688.930 ± 2129.914  ops/s
Loop50.render                 thrpt    9  31300.958 ±  853.332  ops/s
NameResolver.render           thrpt    9  50921.428 ± 1032.362  ops/s

Not a huge margin i.e. 5->17% but still better

And the previous comment at #42909 (comment) is still relevant and it seems to be related to the Date type.

franz1981 · 2024-09-01T19:17:44Z

@mkouba
The other thing I need to understand is why the engine keeps on asking to all the configured ValueResolvers to find one which can resolve it; it seems a good case where caching it would improve things. I expect that the decision about the ValueResolver is not expected to change, per type.

mkouba · 2024-09-02T08:01:26Z

the 2 tests Loop15 and Loop50 doesn't stress that much the parts which I made my changes

Hm, this is a bit suspicious because those benchmarks are very similar. Both contain a simple loop, a few expressions accessing POJO properties, and 2 metadata properties. Loop15 is just a little bit more complex.

I've found another interesting optimization (and simplification, for once!) in 2de99fd: JDK has simplified the immutable list concrete types exactly with the purpose of reduce the arity of the virtual calls involving List and guarantees better performance, saving megamorphic invokeInterface for List::get, List::size`, etc etc

That's a great improvement and really worthy IMO!

There are still few things which I don't understand (of the latest flamegraph collected on this PR), still for Loop15:
I can see some reflective access, see
why? is it expected?

And the previous comment at #42909 (comment) is still relevant and it seems to be related to the Date type.

Ah, yes, it's the {item.created.time} where the time property is not covered by a generated value resolver. We could add a specific value resolver or even generate one but in practice (1) you'll have plenty of value resolvers in the engine and (2) it may happen quite often that reflection fallback is used.

The other thing I need to understand is why the engine keeps on asking to all the configured ValueResolvers to find one which can resolve it; it seems a good case where caching it would improve things. I expect that the decision about the ValueResolver is not expected to change, per type.

We do cache the value resolver per each part of an output expression. See https://github.com/quarkusio/quarkus/blob/main/independent-projects/qute/core/src/main/java/io/quarkus/qute/EvaluatorImpl.java#L142-L154. But it's not set in stone in the sense that there can be corner cases where the cached resolver does not apply and we have to iterate over all resolvers again...

franz1981 · 2024-09-02T08:34:39Z

We could add a specific value resolver or even generate one but in practice (1) you'll have plenty of value resolvers in the engine and (2) it may happen quite often that reflection fallback is used.

I have some suggestion on how to do this "right" (according to how JIT work) if you wish - because:

resolver(s) are interfaces and invokeinterface performance really sucks for cpu bound operations like this, because past 2 observed concrete implementors, it become megamorphic and NOT get inlined, which bring both branch misses and worse performance overall as result
there are known types in the JDK which I expect to be used a lot - and Date/Calendar or other time related ones are some of them

The suggestions here are:

if we make qute to generated bytecode for such types we will likely have less to maintain, but...
as said, having too many concrete types is not good (past 2)
probably make sense to have separate call-sites for custom/reflection based/default ones

In order to achieve it, we need:

common types should be handled without using many types (we can do it in few ways - I'll prepare a branch with the proposal...)
custom types should extend a known type using a separate call site (which become invokespecial and not invokeinterface - former better optimized)
reflective ones should accept their fate and just be in their own separate call site - slower, using invokeinterface

I will soon prepare a branch with this proposal and show the benefit, wdyt?

mkouba · 2024-09-04T09:46:05Z

This PR is superseded by #43003, #43006 and #42982.

I think that we can close it.

franz1981 added 3 commits August 30, 2024 15:36

Improve Qute TemplateNode::resolve

7b4504a

Save constructing a loop key substring on Qute Loop section

97fd670

Save allocating an intermediate list of Qute ResultNode

5ac15a9

quarkus-bot bot added the area/qute The template engine label Aug 30, 2024

mkouba reviewed Aug 30, 2024

View reviewed changes

Use JDK immutable lists to reduce arity of List invokeInterface

2de99fd

This was referenced Sep 3, 2024

Use JDK immutable lists to reduce arity of List invokeInterface #42982

Merged

Qute: optimize iteration metadata access in loops #43003

Merged

franz1981 closed this Sep 4, 2024

quarkus-bot bot added the triage/invalid This doesn't seem right label Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qute Improvements #42909

Qute Improvements #42909

franz1981 commented Aug 30, 2024 •

edited

Loading

mkouba Aug 30, 2024

franz1981 Aug 30, 2024

mkouba Aug 30, 2024

mkouba commented Aug 30, 2024

franz1981 commented Aug 30, 2024

franz1981 commented Aug 30, 2024

franz1981 commented Sep 1, 2024 •

edited

Loading

franz1981 commented Sep 1, 2024 •

edited

Loading

mkouba commented Sep 2, 2024

franz1981 commented Sep 2, 2024 •

edited

Loading

mkouba commented Sep 4, 2024

Qute Improvements #42909

Qute Improvements #42909

Conversation

franz1981 commented Aug 30, 2024 • edited Loading

mkouba Aug 30, 2024

Choose a reason for hiding this comment

franz1981 Aug 30, 2024

Choose a reason for hiding this comment

mkouba Aug 30, 2024

Choose a reason for hiding this comment

mkouba commented Aug 30, 2024

franz1981 commented Aug 30, 2024

franz1981 commented Aug 30, 2024

franz1981 commented Sep 1, 2024 • edited Loading

franz1981 commented Sep 1, 2024 • edited Loading

mkouba commented Sep 2, 2024

franz1981 commented Sep 2, 2024 • edited Loading

mkouba commented Sep 4, 2024

franz1981 commented Aug 30, 2024 •

edited

Loading

franz1981 commented Sep 1, 2024 •

edited

Loading

franz1981 commented Sep 1, 2024 •

edited

Loading

franz1981 commented Sep 2, 2024 •

edited

Loading