Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qute Improvements #42909

Closed
wants to merge 4 commits into from
Closed

Qute Improvements #42909

wants to merge 4 commits into from

Conversation

franz1981
Copy link
Contributor

@franz1981 franz1981 commented Aug 30, 2024

@mkouba

what benchmarks I could use to verify these, among yours?

@quarkus-bot quarkus-bot bot added the area/qute The template engine label Aug 30, 2024
* This method exists for the purpose of saving the cost of creating a substring of the key and
* computing its hashCode.
*/
private CompletionStage<Object> getCompletedStage(String key, int keyStartIndex, int count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I see what you mean but I wonder if it would make more sense to precompute the metadata keys, store it in an array (in a field of LoopSectionHelper) and replace this stuff with a simple if-then-else. I'll try to prepare something so that we can compare the results...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can have something which comparing is as easy as comparing among enums - would be even better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the alias (which is part of the prefix) can be different for a {#for} definition...

@mkouba
Copy link
Contributor

mkouba commented Aug 30, 2024

what benchmarks I could use to verify these, among yours?

Loop15 and Loop50 from https://github.com/mkouba/qute-benchmarks BUT you'll have to rebase because up until now we tested the version without the prefix (because it was added later, quarkus 2.5) 🤷.

@franz1981
Copy link
Contributor Author

I hope @mkouba to have the chance to try it tomorrow :( in the template benchmark it was pretty relevant (the 3 opts I mean) - clearly, because it was memory bound as we said in our call

@franz1981
Copy link
Contributor Author

Hi @mkouba TLDR:

  • the 2 tests Loop15 and Loop50 doesn't stress that much the parts which I made my changes as https://github.com/casid/template-benchmark does - and given that templace-benchmark in the form reviewed by you wasn't that bad, I would rather add it to the qute benchmarks too
  • I've found another interesting optimization (and simplification, for once!) in 2de99fd: JDK has simplified the immutable list concrete types exactly with the purpose of reduce the arity of the virtual calls involving List and guarantees better performance, saving megamorphic invokeInterface for List::get, List::size`, etc etc

This latter thing could be proved by running async-profiler 3,0 on JMH via the agent and providing rawCommand=features=vtable as an argument; it will print, on top of each itable_stub frame, the concrete types called.

The last change along, for the Loop15 benchmark has delivered this improvement
before:

Benchmark                                        Mode  Cnt       Score       Error   Units
Loop15.render                                   thrpt    9  100823.778 ±  9422.943   ops/s

after:

Benchmark                                        Mode  Cnt       Score       Error   Units
Loop15.render                                   thrpt    9  115051.414 ±  4634.761   ops/s

which seems worthy, especially because it removes code to be maintained as well...
I'll run the full matrix of tests as soon as I could.

There are still few things which I don't understand (of the latest flamegraph collected on this PR), still for Loop15:

  1. I can see some reflective access, see
    image

why? is it expected?

  1. due to the previous resolver, the arity of the call Resolver::resolve become megamorphic (for inferfaces, just 3 types seen are sufficient) i.e.
  • the previously reported ReflectionValueResolver
  • io.quarkus.qute.benchmark.data.Item_ValueResolver
    image
  • io.quarkus.qute.ValueResolvers$10
    image

This fact is what mess up with the benchmarks result, because getting a costly megamorphic call while accessing a single field, can be as costy as decoding a BigDecimal especially because if an unpredicted branch - and branch mispredict on modern CPUs could be super costy.

I'll see what I can do for this second point, but having the previous one fixed would be better IMO

@franz1981
Copy link
Contributor Author

franz1981 commented Sep 1, 2024

Regardless, here the improved tests:

3.13.2 patched:

Benchmark                      Mode  Cnt       Score      Error  Units
HelloSimple.render            thrpt    9   77929.880 ± 1920.733  ops/s
IncludeSimple.render          thrpt    9   66147.296 ±  532.993  ops/s
JavaBeanValueResolver.render  thrpt    9   89911.121 ± 1363.160  ops/s
LetSimple.render              thrpt    9   51097.779 ±  144.413  ops/s
Loop15.render                 thrpt    9  114840.904 ±  847.497  ops/s
Loop50.render                 thrpt    9   35786.937 ±  322.756  ops/s
NameResolver.render           thrpt    9   56090.358 ±  754.946  ops/s

3.13.2 unpatched:

Benchmark                      Mode  Cnt      Score      Error  Units
HelloSimple.render            thrpt    9  74255.051 ± 2166.508  ops/s
IncludeSimple.render          thrpt    9  60716.557 ± 2650.541  ops/s
JavaBeanValueResolver.render  thrpt    9  85669.300 ± 1089.360  ops/s
LetSimple.render              thrpt    9  46460.260 ±  942.743  ops/s
Loop15.render                 thrpt    9  96688.930 ± 2129.914  ops/s
Loop50.render                 thrpt    9  31300.958 ±  853.332  ops/s
NameResolver.render           thrpt    9  50921.428 ± 1032.362  ops/s

Not a huge margin i.e. 5->17% but still better

And the previous comment at #42909 (comment) is still relevant and it seems to be related to the Date type.

@franz1981
Copy link
Contributor Author

franz1981 commented Sep 1, 2024

@mkouba
The other thing I need to understand is why the engine keeps on asking to all the configured ValueResolvers to find one which can resolve it; it seems a good case where caching it would improve things. I expect that the decision about the ValueResolver is not expected to change, per type.

@mkouba
Copy link
Contributor

mkouba commented Sep 2, 2024

  • the 2 tests Loop15 and Loop50 doesn't stress that much the parts which I made my changes

Hm, this is a bit suspicious because those benchmarks are very similar. Both contain a simple loop, a few expressions accessing POJO properties, and 2 metadata properties. Loop15 is just a little bit more complex.

I've found another interesting optimization (and simplification, for once!) in 2de99fd: JDK has simplified the immutable list concrete types exactly with the purpose of reduce the arity of the virtual calls involving List and guarantees better performance, saving megamorphic invokeInterface for List::get, List::size`, etc etc

That's a great improvement and really worthy IMO!

There are still few things which I don't understand (of the latest flamegraph collected on this PR), still for Loop15:
I can see some reflective access, see
why? is it expected?

And the previous comment at #42909 (comment) is still relevant and it seems to be related to the Date type.

Ah, yes, it's the {item.created.time} where the time property is not covered by a generated value resolver. We could add a specific value resolver or even generate one but in practice (1) you'll have plenty of value resolvers in the engine and (2) it may happen quite often that reflection fallback is used.

The other thing I need to understand is why the engine keeps on asking to all the configured ValueResolvers to find one which can resolve it; it seems a good case where caching it would improve things. I expect that the decision about the ValueResolver is not expected to change, per type.

We do cache the value resolver per each part of an output expression. See https://github.com/quarkusio/quarkus/blob/main/independent-projects/qute/core/src/main/java/io/quarkus/qute/EvaluatorImpl.java#L142-L154. But it's not set in stone in the sense that there can be corner cases where the cached resolver does not apply and we have to iterate over all resolvers again...

@franz1981
Copy link
Contributor Author

franz1981 commented Sep 2, 2024

We could add a specific value resolver or even generate one but in practice (1) you'll have plenty of value resolvers in the engine and (2) it may happen quite often that reflection fallback is used.

I have some suggestion on how to do this "right" (according to how JIT work) if you wish - because:

  1. resolver(s) are interfaces and invokeinterface performance really sucks for cpu bound operations like this, because past 2 observed concrete implementors, it become megamorphic and NOT get inlined, which bring both branch misses and worse performance overall as result
  2. there are known types in the JDK which I expect to be used a lot - and Date/Calendar or other time related ones are some of them

The suggestions here are:

  1. if we make qute to generated bytecode for such types we will likely have less to maintain, but...
  2. as said, having too many concrete types is not good (past 2)
  3. probably make sense to have separate call-sites for custom/reflection based/default ones

In order to achieve it, we need:

  1. common types should be handled without using many types (we can do it in few ways - I'll prepare a branch with the proposal...)
  2. custom types should extend a known type using a separate call site (which become invokespecial and not invokeinterface - former better optimized)
  3. reflective ones should accept their fate and just be in their own separate call site - slower, using invokeinterface

I will soon prepare a branch with this proposal and show the benefit, wdyt?

@mkouba
Copy link
Contributor

mkouba commented Sep 4, 2024

This PR is superseded by #43003, #43006 and #42982.

I think that we can close it.

@franz1981 franz1981 closed this Sep 4, 2024
@quarkus-bot quarkus-bot bot added the triage/invalid This doesn't seem right label Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/qute The template engine triage/invalid This doesn't seem right
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants