
WIP: add blog post on new latency-reduction tools #1093

Closed

Conversation

@timholy (Member) commented Nov 28, 2020

This focuses on the new Core.Compiler.Timings inference-timing tools, and
the utilities in SnoopCompile for analyzing the results (@snoopi_deep and friends). These tools were
introduced by Nathan Daly, who is a co-author of the post. CC @NHDaly

This is a WIP in part because it depends on quite a few outstanding PRs:

Nevertheless it seemed time to post this so that @NHDaly, among others, can collaborate on the writing and so that the DataFrames developers can get a sense for the overall context.

@github-actions commented:

Once the build has completed, you can preview your PR at this URL: https://julialang.netlify.app/previews/PR1093/


- two arguments (`first` and `incols`) could potentially be `NamedTuple`s, and since `(x=1,)` and `(y=1,)` are different `NamedTuple` types, these arguments alone have potentially-huge possibility for specialization. (If these are specialized for the particular column names in a DataFrame, then the scope for specialization is essentially limitless.) Indeed, a check `methodinstances(DataFrames._combine_with_first)` reveals that many of these specializations are for different `NamedTuple`s.

- the `f::Base.Callable` argument is either a function or a type, again a potentially-limitless source of specialization. However, checking the output of `methodinstances`, you'll see that this argument is not specialized. Presumably this is due to the major callers of `_combine_with_first` using a `@nospecialize` on their corresponding argument. In this case, over-specialization does not seem to be a concern, but generally speaking function or type arguments are prime candidates for risk of over-specialization.
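To make the `NamedTuple` point above concrete, here is a toy illustration (a hypothetical example, not taken from the post): the field names are part of a `NamedTuple`'s type, so every distinct set of column names yields a distinct type, and thus a distinct specialization.

```julia
# NamedTuples with different field names are distinct types, so a method
# taking a NamedTuple argument specializes separately for each set of names.
nt1 = (x = 1,)
nt2 = (y = 1,)

typeof(nt1) === typeof(nt2)  # false: the field name is part of the type
```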
Review comment (Member):

Isn't the absence of specialization just due to the fact that these methods don't call f, but pass it to another function?


\toc

[The Julia programming language][Julia] delivers remarkable runtime performance and flexibility. Julia's flexibility depends on the ability to of methods to handle arguments of many different types. This flexibility would be in competition with runtime performance, were it not for the "trick" of *method specialization*. Julia compiles a separate "instance" of a method for each distinct combination of argument types; this specialization allows code to be optimized to take advantage of specific features of the inputs, eliminating most of the *runtime* cost that would otherwise be the result of Julia's flexibility.
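As a toy sketch of what method specialization means (a hypothetical example, not from the post): a single method definition yields a separate compiled "instance" for each combination of concrete argument types.

```julia
# One generic method definition...
add(a, b) = a + b

# ...but each distinct call signature below compiles its own
# specialized instance of the method:
add(1, 2)        # add(::Int64, ::Int64)
add(1.0, 2.0)    # add(::Float64, ::Float64)
add(1, 2.0)      # add(::Int64, ::Float64)
```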
Review comment (Member):
"the ability to of methods" -> "the ability of methods" I think? :)



Unfortunately, method specialization has its own cost: compiler latency. Since compilation is expensive, there is a measurable delay that occurs on first invokation of a method for a specific combination of argument types. There are cases where one can do some of this work once, in advance, using utilities like [`precompile`] or building a custom system with [PackageCompiler]. In other cases, the number of distinct argument types that a method might be passed seems effectively infinite, and in such cases precompilation seems unlikely to be a comprehensive solution.
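For readers unfamiliar with `precompile`, a minimal sketch (with a hypothetical function name) of paying the compilation cost up front rather than at first call:

```julia
double(x) = 2x

# Ask Julia to type-infer (and cache) the specialization for Int arguments
# now, rather than on first invocation; returns true on success.
precompile(double, (Int,))

double(21)  # this first call no longer pays inference cost for this signature
```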
Review comment (Member):
"first invokation of" -> "first invocation of" I think? :)


Review comment (Member):
"using utilities like [precompile] or building a custom system with [PackageCompiler]" sounds slightly off to me. Perhaps "or building" -> "or by building" and "custom system" -> "custom system image" or so? :)

In this post, we'll walk through the process of analyzing and optimizing the [DataFrames] package. We chose DataFrames for several reasons:

- DataFrames is widely used
- the DataFrames API seems fairly stable, and they are approaching their 1.0 release
Review comment (Member):
Perhaps "the" -> "The" for consistency with capitalization later in the list (or decapitalize the "In" below, alternatively)? :)


Review comment (Member):
Hm, perhaps add terminating periods to the items on this list for consistency with the last item and the presence of punctuation in some of the bodies of the items? :)

- DataFrames is developed by a sophisticated and conscientious team, and the package has already been [aggressively optimized for latency](https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/112?u=tim.holy) using tools that were, until now, state-of-the-art; this sets a high bar for any new tools (don't worry, we're going to crest that bar ;-) )
- In a previous [blog post][invalidations], one of the authors indirectly "called out" DataFrames (and more accurately its dependency [CategoricalArrays]) for having a lot of difficult-to-fix invalidations. To their credit, the developers made changes that dropped the number of invalidations by about 10×. This post is partly an attempt to return the favor. That said, we hope they don't mind being guinea pigs for these new tools.

This post is based on DataFrames 0.22.1, and version 0.9 of the underlying CategoricalArrays. If you follow the steps of this blog post with different versions, you're likely to get different results from those shown here, partly because many of the issues we identified have been fixed in more recent releases. It should also be emphasize that these analysis tools are only supported on Julia 1.6 and above; at the time of this post, Julia 1.6 not yet to "alpha" release phase but can be obtained from [nightly] snapshots or built from [source].
Review comment (Member):
"It should also be emphasize that" -> "It should also be emphasized that" I think? :)

Review comment (Member):
"Julia 1.6 not yet to" -> "Julia 1.6 is not yet to" I think? :)


## Identifying the most costly-to-infer methods

Our first goal is to identify methods that cost the most in inference.
Review comment (Member):
Perhaps "cost the most in inference" -> "cost the most to infer"? :)


`@snoopi_deep` is a new tool in [SnoopCompile] which leverages new functionality in Julia. Like the older `@snoopi`, it measures what is being inferred and how much time it takes. However, `@snoopi` measures aggregate time for each "entrance" into inference, and it includes the time spent inferring all the methods that get inferrably dispatched from the entrance point. In contrast, `@snoopi_deep` extracts this data for each method instance, regardless of whether it is an "entrance point" or called by something else.
Review comment from @Sacha0 (Member), Dec 14, 2020:
Perhaps "extracts this data for each method instance" -> "extracts the time spent inferring each method instance exclusive of time spent inferring other (e.g. callee) method instances" or similar? :)

```
│ │ │ ⋮
```

Each branch of a node indents further to the right, and represents callees of the node. The `ROOT` object is special: it measures the approximate time spent on the entire operation, excepting inference, and consequently combines native code generation and runtime. Each other entry reports the time needed to infer just that method instance, not including the time spent inferring its callees.
Review comment (Member):
Perhaps "Each other entry" -> "Every other entry"? :)


Review comment (Member):
Depending on which style guide you prefer, "potentially-limitless" -> "potentially limitless", or not :).



Some strategies, like adding `@nospecialize`s, might be effective in reducing compile-time cost. But without knowing a lot more about this package, it is difficult to know whether this might have undesirable effects on runtime performance. So here we pursue a different strategy: let's focus on the fact that inference has to be performed for each unique combination of input types. Since we have two highly-diverse argument types, the effect is essentially *multiplicative*. But we also note that `incols` is just "passed through"; while we might want to preserve this type information, specializing on `incols` does not affect any portion of the body of this method other than the final calls to `_combine_tables_with_first!` or `_combine_rows_with_first!`. Consequently, we may be wasting a lot of time specializing code that doesn't actually change depending on the type of `incols`.
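As background, `@nospecialize` is the standard way to ask the compiler not to specialize on a particular argument; a minimal sketch with hypothetical names, not DataFrames code:

```julia
# With @nospecialize, one compiled instance of apply_n can serve any
# callable f; the call f(x) then goes through dynamic dispatch instead
# of being specialized and inlined for each distinct f.
function apply_n(@nospecialize(f), x, n)
    for _ in 1:n
        x = f(x)
    end
    return x
end

apply_n(sqrt, 16.0, 2)  # sqrt(sqrt(16.0)) == 2.0
```

The trade-off is exactly the one discussed above: less compile-time cost per distinct `f`, at the price of possibly slower calls inside the loop.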
Review comment (Member):
Likewise here, depending on which style guide you prefer, "highly-diverse" -> "highly diverse", or not :).

@thofma (Contributor) left a comment:
I tried it out and noticed the name changed from accumulate_by_method to accumulate_by_source.

```julia
julia> using DataFrames; tinf = @snoopi_deep include("grouping.jl");

julia> tm = accumulate_by_method(flatten_times(tinf))
```

Suggested change (Contributor):

```diff
- julia> tm = accumulate_by_method(flatten_times(tinf))
+ julia> tm = accumulate_by_source(flatten_times(tinf))
```

and after we had

```julia
julia> tm = accumulate_by_method(flatten_times(tinf))
```

Suggested change (Contributor):

```diff
- julia> tm = accumulate_by_method(flatten_times(tinf))
+ julia> tm = accumulate_by_source(flatten_times(tinf))
```

This is a truncated version of the output; if you look at more of the entries carefully, you'll notice a number of near-duplicates: `do_call` appears numerous times, with different argument types. While `do_call` has eight methods, there are many more entries in `flatten_times(tinf)` than these eight, and this is explained by multiple specializations of single methods. It's of particular interest to aggregate all the instances of a particular method, since this represents the cost of the method itself:

```julia
julia> tm = accumulate_by_method(flatten_times(tinf))
```

Suggested change (Contributor):

```diff
- julia> tm = accumulate_by_method(flatten_times(tinf))
+ julia> tm = accumulate_by_source(flatten_times(tinf))
```

The aggregate cost is a sum of the cost of all individual `MethodInstance`s.
(`do_call` has even more instances, at 1260, but some of these instances must be must less time-consuming than the worst offender we noted above.)
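The idea behind this aggregation can be sketched in plain Julia (a hypothetical stand-in, not SnoopCompile's actual implementation): sum the per-instance inference times, grouped by the method they specialize.

```julia
# Hypothetical (method name, inference seconds) pairs for several
# MethodInstances; two of them specialize the same method.
times = [("do_call", 0.5), ("do_call", 0.25), ("combine", 0.125)]

# Aggregate: total inference cost per method.
agg = Dict{String,Float64}()
for (meth, t) in times
    agg[meth] = get(agg, meth, 0.0) + t
end

agg["do_call"]  # 0.75: the combined cost of both do_call instances
```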
Review comment (Member):
"must be must less" -> "must be much less" I think? :)

Let's apply this to DataFrames. After collecting the data with `@snoopi_deep include("runtests.jl")`, we can see inference failures with

```julia
julia> ibs = SnoopCompile.inference_breaks(tinf)
```
Review comment:
I can't find inference_breaks in SnoopCompile (latest master), did the name change?

@timholy (Member, Author) commented Jan 2, 2021:
Yeah, it's changed a lot. Your best source now is timholy/SnoopCompile.jl#192, though I'm going to push a couple more changes before merging.

I am almost certainly going to replace this wholesale, starting off the foundation in #1111, so for safety I'll close this.

@timholy timholy closed this Jan 2, 2021
@DilumAluthge DilumAluthge deleted the teh/latency_deep branch February 8, 2021 05:15