Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make String more GC friendly #40840

Open
bkamins opened this issue May 17, 2021 · 4 comments
Open

Make String more GC friendly #40840

bkamins opened this issue May 17, 2021 · 4 comments
Labels
GC Garbage collector performance Must go faster strings "Strings!"

Comments

@bkamins
Copy link
Member

bkamins commented May 17, 2021

I am opening this issue as a follow-up to a discussion on Slack not to loose track of it.

Rationale: in data-science workflows it is very common to have very large tables that hold columns that consist of many unique strings (e.g. product ID that is non-numeric character sequence).

In such cases the current design of combination of GC and String type cause us to create a lot of small strings. The effect is described e.g. here h2oai/db-benchmark#210 (in these benchmarks the high-count string column is not taking part in any computations - it just sits there using-up memory and causing GC strain). The issue is especially apparent in multi-threading contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused by triggered GC collection cycles).

I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).

As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (as there are for sure many considerations that have to be made before making a decision), but once the decision on what to do is made I can help implementing the changes (unless of course core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense for Julia core devs - in such case please comment, but I do not want to limit myself at this stage of thinking about the issue):

  • improve the "generational" aspect of GC (related: The GC often doesn't act generational #40644)
  • have a special handling of String type in GC (related to the above, but we might e.g. decide to always treat String as very old; possibly this could be enabled/disabled by some run-time option)
  • have a run-time option to turn on/off String interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation time
  • have a special representation of short strings that would be non-allocating (if you have very many strings most likely they are short)

In the mean time @quinnj is working on improving the handling of this issue on CSV.jl side (to avoid allocation of strings at all), but I think it is kind of a second-best and we should have a good solution in Julia Base.

@bkamins
Copy link
Member Author

bkamins commented May 17, 2021

As a reference. What I am aiming to achieve is to have things like https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/43081/21 working with just Julia Base as much as possible.

@nalimilan nalimilan added GC Garbage collector performance Must go faster strings "Strings!" labels May 17, 2021
@JeffBezanson
Copy link
Sponsor Member

have a special representation of short strings that would be non-allocating

If I understand the main issues, this would not fix it, since in a Vector{String} the GC would still have to look at every element to see which ones are references. It would save memory but not marking time. I can imagine fancy things like keeping a flag in the Vector to track whether every element is short, but that is probably too special-case and we'd be better off doing general card-marking.

@bkamins
Copy link
Member Author

bkamins commented May 17, 2021

track whether every element is short

This is what I have imagined. However, I understand your point of not over-specializing code. Also saving memory will also help GC as it will occur less frequently.


Let me give an example of the pattern in which I want to ensure GC does not get super slow:

  1. we have a large Vector{String}; for simplicity assume all its entries are unique; the vector has 10^9 elements; call this vector x
  2. using this vector we create a large temporary object, e.g. d = Dict(x .=> axes(x, 1))
  3. we do some operations using d but eventually we drop it
  4. x is still alive, but d needs to be garbage collected as it is large and GC needs to reclaim the memory occupied by it

(this is not an artificial example - something like this happens e.g. in when doing joins to create left to right table mapping which is needed only temporarily, but can be large)

In such a scenario I would prefer to avoid high cost of GC due to the fact that we have 10^9 strings that potentially have to be marked (or at least to have to pay this cost very infrequently, and most of the time not have to pay it).
Also - what I am advocating - is that if ensuring this would mean that String type gets a special treatment by GC then that the core Julia team could consider allowing for specially handling this case. Thank you!

@bkamins
Copy link
Member Author

bkamins commented Jun 3, 2021

A related question is: if we have a collection of Symbol - would it be possible to avoid full GC scan of its entries (the same when eltype of the collection is e.g Union{Symbol, Missing})? The issue is that in such cases it is clear by design that marking is not required however we have (fresh Julia session):

julia> x = Symbol.(1:10^7);

julia> GC.gc(true);

julia> @time GC.gc(true);
  0.069542 seconds (99.98% gc time)

julia> @time GC.gc(true);
  0.071290 seconds (99.99% gc time)

which to my understanding shows that every element of x is investigated by GC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GC Garbage collector performance Must go faster strings "Strings!"
Projects
None yet
Development

No branches or pull requests

3 participants