Reduce calls to Crystal::Type#remove_indirection in module dispatch#14992
Merged
straight-shoota merged 3 commits intocrystal-lang:masterfrom Sep 21, 2024
Merged
Conversation
straight-shoota
approved these changes
Sep 10, 2024
Member
straight-shoota
left a comment
There was a problem hiding this comment.
What a tiny little change and auch a big impact!
The explanation of whats going on is great. This is an explemplary pull requests. Thank you @HertzDevil!
Sija
reviewed
Sep 10, 2024
straight-shoota
approved these changes
Sep 10, 2024
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
tl;dr Module types are expanded into unions during Crystal's codegen phase when the receiver of a call has a module type. This PR makes this expansion occur once for the entire call, instead of once for each including type.
It is a well-known issue that dynamic dispatch on a call can lead to very slow build times when the receiver is a module type with many includers. Here is a minimal example that demonstrates this:
{% begin %} N = {{ env("N").to_i }} {% end %} module Foo end {% for i in 1..N %} struct Bar{{ i }} include Foo def foo end end {% end %} Bar1.new.as(Foo).fooWe benchmark the compiler using
hyperfine --warmup 5 --prepare 'crystal clear_cache' -P N 100 2000 -D 100 'N={N} crystal build --prelude=empty test.cr':Results
N=100 crystal build --prelude=empty test.crN=200 crystal build --prelude=empty test.crN=300 crystal build --prelude=empty test.crN=400 crystal build --prelude=empty test.crN=500 crystal build --prelude=empty test.crN=600 crystal build --prelude=empty test.crN=700 crystal build --prelude=empty test.crN=800 crystal build --prelude=empty test.crN=900 crystal build --prelude=empty test.crN=1000 crystal build --prelude=empty test.crN=1100 crystal build --prelude=empty test.crN=1200 crystal build --prelude=empty test.crN=1300 crystal build --prelude=empty test.crN=1400 crystal build --prelude=empty test.crN=1500 crystal build --prelude=empty test.crN=1600 crystal build --prelude=empty test.crN=1700 crystal build --prelude=empty test.crN=1800 crystal build --prelude=empty test.crN=1900 crystal build --prelude=empty test.crN=2000 crystal build --prelude=empty test.crIf we plot the mean times we could observe quadratic growth (R2 = 0.997), even though nothing in the code suggests there would be quadratic behavior. Passing
--statsreveals that most of the time is spent on Crystal's codegen phase, rather than LLVM's, so this has nothing to do with large structs (e.g. #12801).There are at least two sources contributing equally to this accidentally quadratic behavior. One is the following fragment of
Crystal::CodeGenVisitor#codegen_dispatch:crystal/src/compiler/crystal/codegen/call.cr
Lines 367 to 373 in ce76bf5
There are
Ntarget defs becauseFoohasNincluders. For each target def, the compiler calls#match_type_id:crystal/src/compiler/crystal/codegen/match.cr
Lines 4 to 6 in ce76bf5
Therefore,
Crystal::Type#remove_indirectionis also calledNtimes for this entire dispatch. Normally this method is idempotent, but if the receiver represents a module type, then the method converts that module into a union of its including types, i.e. the typeBar1 | Bar2 | ... | Bar{{ N }}:crystal/src/compiler/crystal/types.cr
Lines 1172 to 1178 in ce76bf5
As each call produces an array with all
Nincluding types, the whole dispatch creates a total ofN * Nsuch elements, hence the quadratic behavior.Back to
Crystal::CodeGenVisitor#codegen_dispatch, knowing thatowner = owner.remove_indirectionis idempotent and thatowneris used only by that onematch_type_id, we simply call#remove_indirectionbefore the loop. The new benchmark results with a local release build of the compiler are:Results
N=100 bin/crystal build --prelude=empty test.crN=200 bin/crystal build --prelude=empty test.crN=300 bin/crystal build --prelude=empty test.crN=400 bin/crystal build --prelude=empty test.crN=500 bin/crystal build --prelude=empty test.crN=600 bin/crystal build --prelude=empty test.crN=700 bin/crystal build --prelude=empty test.crN=800 bin/crystal build --prelude=empty test.crN=900 bin/crystal build --prelude=empty test.crN=1000 bin/crystal build --prelude=empty test.crN=1100 bin/crystal build --prelude=empty test.crN=1200 bin/crystal build --prelude=empty test.crN=1300 bin/crystal build --prelude=empty test.crN=1400 bin/crystal build --prelude=empty test.crN=1500 bin/crystal build --prelude=empty test.crN=1600 bin/crystal build --prelude=empty test.crN=1700 bin/crystal build --prelude=empty test.crN=1800 bin/crystal build --prelude=empty test.crN=1900 bin/crystal build --prelude=empty test.crN=2000 bin/crystal build --prelude=empty test.crAs expected, this roughly halves the compilation times.
The other half comes from downcasting the receiver to each of the
Nincluding types:crystal/src/compiler/crystal/codegen/call.cr
Line 405 in ce76bf5
crystal/src/compiler/crystal/codegen/codegen.cr
Lines 1353 to 1359 in ce76bf5
This
varis a temporaryCrystal::LLVMVarthat is created once outside the loop:crystal/src/compiler/crystal/codegen/call.cr
Lines 342 to 344 in ce76bf5
crystal/src/compiler/crystal/codegen/call.cr
Lines 362 to 363 in ce76bf5
This
%self's type is used only at thedowncastcall, which likewise uses#remove_indirection. We can therefore precompute all including types right when theLLVMVaris constructed, and this time, all the quadratic slowdown is practically gone:Results
N=100 bin/crystal build --prelude=empty usr/small.crN=200 bin/crystal build --prelude=empty usr/small.crN=300 bin/crystal build --prelude=empty usr/small.crN=400 bin/crystal build --prelude=empty usr/small.crN=500 bin/crystal build --prelude=empty usr/small.crN=600 bin/crystal build --prelude=empty usr/small.crN=700 bin/crystal build --prelude=empty usr/small.crN=800 bin/crystal build --prelude=empty usr/small.crN=900 bin/crystal build --prelude=empty usr/small.crN=1000 bin/crystal build --prelude=empty usr/small.crN=1100 bin/crystal build --prelude=empty usr/small.crN=1200 bin/crystal build --prelude=empty usr/small.crN=1300 bin/crystal build --prelude=empty usr/small.crN=1400 bin/crystal build --prelude=empty usr/small.crN=1500 bin/crystal build --prelude=empty usr/small.crN=1600 bin/crystal build --prelude=empty usr/small.crN=1700 bin/crystal build --prelude=empty usr/small.crN=1800 bin/crystal build --prelude=empty usr/small.crN=1900 bin/crystal build --prelude=empty usr/small.crN=2000 bin/crystal build --prelude=empty usr/small.crThere is still quadratic growth, but the compilation time remains well below 1 second until
Nis around 9500. (This growth is dominated byCrystal::TypeDeclarationProcessorrather than codegen.)