
thread local cache may cause memory leaks #362

Closed · saarw opened this issue May 12, 2017 · 32 comments · Fixed by #749

@saarw

saarw commented May 12, 2017

The thread local cache causes a memory leak if you use the regex library in environments where you don't have direct control over threads that can be created and destroyed at the platform's whims, such as when executing tasks inside Microsoft IIS. Setting the dfa_size_limit to 0 does not eliminate the cache. Eliminating the cache should also reduce the memory footprint when you have lots of regexes in lots of threads.

Is it possible to make the thread_local cache optional?

@BurntSushi
Member

Is it possible to write a small reproducible example of this problem?

The short answer is no: the thread local cache cannot be disabled. Setting dfa_size_limit to a low value should reduce memory growth quite a bit because it will effectively stop the DFA from running. The DFA is the biggest user of the thread local cache, but not the only one. All of the regex engines require some type of mutable state while matching.

If the only way to fix this is to disable the thread local cache, then the only way to make that happen is to introduce a lower level API that forces the caller to manage that state.

@BurntSushi changed the title from "Make thread local cache optional for use in platforms without control over threads" to "thread local cache may cause memory leaks" on May 12, 2017
@BurntSushi
Member

@saarw Can you say more about what's happening? Is your program spawning an unbounded number of threads and never terminating them?

@saarw
Author

saarw commented May 12, 2017

The accepted answer to the Stack Overflow question below describes how the thread pool in this environment can be trimmed and expanded. Rust thread locals are not destroyed just because the thread exits (from the thread_local crate docs: "Per-thread objects are not destroyed when a thread exits. Instead, objects are only destroyed when the ThreadLocal containing them is destroyed."), so trimming and expanding a thread pool seems to create an unbounded, growing number of thread locals, even though the number of actual threads that are alive at any time is bounded.
http://stackoverflow.com/questions/12304691/why-are-iis-threads-so-precious-as-compared-to-regular-clr-threads

@BurntSushi
Member

even though the number of actual threads that are alive at any time is bounded.

Ah I see, then there might be a work-around. Presumably, in your current code (it would be nice to have a reproducible example) you are using either an Arc<Regex> or a Regex through lazy_static!. In both cases, only a single Regex value will ever exist in your program, and you're right, that will hang on to its thread-local pool for as long as that value exists. However, if you explicitly clone your Regex for each thread that it's used in, then each Regex value gets its own thread-local pool. So once a thread exits, its corresponding Regex should drop and so too should its thread-local pool.

Let me know if this doesn't make sense and I'll try to explain more with examples.

@saarw
Author

saarw commented May 12, 2017

I have a Rust lib, built as a DLL and running in IIS, with basically the following functions:

- init_regexes -> AtomicPtr<MyRegexes>
- match_regexes(ptr: AtomicPtr<MyRegexes>, string data to match)
- free_regexes(ptr: AtomicPtr<MyRegexes>) 

A C# program creates my collection of regexes using the init_regexes method and then runs tasks that call match_regexes on the thread pool described above. The tasks run in sequence, so the atomic pointer is just there as a memory fence to prevent partial creation before use, but the C# program has no control over which thread runs each task.

I can certainly create a thread local regex collection for each new thread, but I don't have control over when a thread exits so I wouldn't know when to drop the copy of the regexes (or drop the clones as in your suggestion).
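
For concreteness, a minimal sketch of what that exported surface might look like (the function names follow the list above; the placeholder pattern, the contents of MyRegexes, and the use of plain raw pointers instead of AtomicPtr are illustrative assumptions, not the actual DLL code):

use std::ffi::CStr;
use std::os::raw::c_char;

use regex::Regex;

pub struct MyRegexes {
    regexes: Vec<Regex>,
}

// Compile the regex collection once and hand an owning pointer to the host.
#[no_mangle]
pub extern "C" fn init_regexes() -> *mut MyRegexes {
    let set = MyRegexes {
        // Placeholder pattern; the real collection is built elsewhere.
        regexes: vec![Regex::new(r"[0-9]+").unwrap()],
    };
    Box::into_raw(Box::new(set))
}

// Run every regex against the input; called from the host's thread pool.
#[no_mangle]
pub extern "C" fn match_regexes(ptr: *const MyRegexes, data: *const c_char) -> bool {
    let set = unsafe { &*ptr };
    let text = unsafe { CStr::from_ptr(data) }.to_string_lossy();
    set.regexes.iter().any(|re| re.is_match(&text))
}

// Give ownership back to Rust so the collection (and its caches) is freed.
#[no_mangle]
pub extern "C" fn free_regexes(ptr: *mut MyRegexes) {
    if !ptr.is_null() {
        unsafe { drop(Box::from_raw(ptr)) };
    }
}

In this layout the same regex collection is shared across whatever threads the host happens to schedule, so every distinct thread that ever calls match_regexes adds an entry to the collection's thread-local cache.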

@BurntSushi
Member

Yeah, I'm not sure what to say. Regex matching requires some kind of mutable scratch space, so there must be some synchronization if you're sharing a single regex across multiple threads. The alternative is to not share a single regex across multiple threads. It sounds like your environment is pretty constrained, but that's the trade-off available to you today.

but I don't have control over when a thread exits so I wouldn't know when to drop the copy of the regexes (or drop the clones as in your suggestion).

If that's true, then I don't see how you could ever avoid a leak. A leak seems inevitable if you can't free the things you allocate. That means that even if there were a lower level API that provided a way for the caller to control allocation of the scratch space, then that also wouldn't be useful to you because you wouldn't know when to free the scratch space.

What kind of solution are you expecting here?

@saarw
Author

saarw commented May 12, 2017

I would know when to free the scratch space: after each match operation (or at least before returning from the function call into my DLL). So either a caller-controlled cache, or a config option that makes the library simply drop all cached data at the end of each match operation, would help me.

@BurntSushi
Member

@saarw That config option doesn't make sense. If all cached data is dropped after every match, then the performance will be so bad that you might as well use whatever C# regex library is available to you. The cached data is there for really good reasons. :-)

In any case, you can do that today. Clone the regex before every match, then drop the regex after you're done with it. You can clone the regex from your global collection. When you do that, the thread local state in your collection will never be used.

@saarw
Author

saarw commented May 12, 2017

Yes, cloning the regexes before every match is the only workaround I've come up with so far, but I agree that it may be too slow. We are really not that performance sensitive and were hoping to use Rust to be able to deploy on a wide variety of platforms, but I guess we may need a special non-native implementation for C#/IIS because we don't control the threading on that platform.

I know you have a lot of experience of regex performance, do you perhaps know if there is any engine that does not persist state between match operations and that runs faster than clone-per-match rust-regex instances, even though it may be a lot slower than cached rust-regex? Maybe PCRE? RE2?

@BurntSushi
Member

clone-per-match rust-regex

The regex struct internally looks like this:

struct Regex {
    // The compiled program, shared cheaply between clones via Arc.
    compiled: Arc<Internal>,
    // Per-thread mutable scratch space (lazy DFA states, stacks, etc.).
    cached: CachedThreadLocal,
}

This representation supports two cheap ways to use regexes in multithreaded environments. The first is the normal way: you put a Regex in an Arc<Regex> and the cached data becomes a pool that is reused across multiple threads. This is not optimal since it increases per-match synchronization, but it is convenient because Arc<T>s are a common way of sharing state across threads in Rust.

However, the second way is to just clone the Regex struct itself. You'll note that the compiled internal state is already in an Arc, so cloning the Regex won't actually clone all of the compiled state machine stuff. However, it will create new scratch space for running the regex engines that is owned completely by this new cloned regex. The costs of doing this are roughly identical to the former, but match performance is better because there is less synchronization required to use the scratch space.

So for example, if you do:

let re = my_global_regexes.Whatever.clone();

re.is_match(...);
for m in re.find_iter(...) { ... }
re.captures(...);

drop(re);

then the exact same scratch space will be used for all calls to the Regex API, and that scratch space will get freed once you call drop(re). And importantly for you, the thread-local state in my_global_regexes isn't impacted at all.

Now, this will actually work just fine and should give you good performance. But this roughly assumes that you can amortize that my_global_regexes.Whatever.clone() call, because that's doing the "expensive" work of allocating that scratch space. If you're only doing a single match in each thread, then obviously amortization doesn't apply, but you're pretty much hosed in that case regardless of what you do, because things like "overhead of creating a thread" probably start to matter.
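
For reference, a runnable sketch of that clone-per-matches pattern (the lazy_static global, the date pattern, and the counting function are hypothetical stand-ins for my_global_regexes and its call sites; it assumes the regex and lazy_static crates):

use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    // Compiled once and shared; its thread-local pool is never touched below
    // because every search goes through a clone.
    static ref DATE: Regex = Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
}

// One cheap clone per batch of matches: the compiled program is shared via
// the internal Arc, and only fresh scratch space is allocated for the clone.
fn count_dates(lines: &[&str]) -> usize {
    let re: Regex = DATE.clone();
    let mut hits = 0;
    for line in lines {
        if re.is_match(line) {
            hits += 1;
        }
    }
    hits
    // `re` and its scratch space are dropped here.
}

fn main() {
    let lines = ["2017-05-12: issue opened", "no date on this line"];
    println!("{} lines with dates", count_dates(&lines));
}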

I know you have a lot of experience of regex performance, do you perhaps know if there is any engine that does not persist state between match operations and that runs faster than clone-per-match rust-regex instances, even though it may be a lot slower than cached rust-regex? Maybe PCRE? RE2?

One important correction: I'm not advocating clone-per-match, but rather, clone-per-matches.

All regex engines I know of require some kind of scratch space. It's inherent in how regex engines work. Rust's regex engine is based on RE2, so it works similarly, although RE2 actually tries to share the same scratch space across multiple threads for a single regex. This fixes your memory issue, but will make searching using the same regex from multiple threads slower.

I'm less familiar with PCRE, but it's a backtracking engine, which means it must use either exponential time or space in some cases. And it must use some kind of state to keep track of the match, capturing groups, etc.

@saarw
Author

saarw commented May 12, 2017

Yes, it definitely seems like RE2's concurrency model would work better for my current use case, where we only have single, but different, threads accessing the regexes at any one time, so there isn't any actual contention on the locks. For platforms where we do control the threading and can use the thread-local cache, I'm also guessing RE2 may have a smaller memory footprint with its single cache compared to the solution with thread-local caches (considering a typical web server has about 200 threads).

However, I will eventually need to handle concurrent accesses as well and don't want RE2 to become a point of contention, so I will definitely try the clone-per-matches approach with the rust-regex crate. The rust-regex functionality and focus on safe behavior are really nice! Thanks a lot for all the info!

@BurntSushi
Member

No problem! I'll keep noodling on this problem by the way. At some point, there may be more explicit control over allocations, but it won't be happening any time soon! :)

@saarw
Author

saarw commented May 13, 2017

I put together a benchmark app that uses the Rust lib in a threadpool-friendly way, with a new clone for every match, versus using Java's regex engine in the same way and recreating its mutable part (the Matcher object) for every match:
https://github.com/saarw/regexperf

It seems the Rust lib is about 7-8x slower than Java's implementation. I wonder if the thread-local cloning operation is the culprit, or if the Rust lib is just more reliant on its mutable state than other regex implementations. I'm guessing it would be easier for me to hack a thread-unsafe cache into a fork of the Rust lib than to try to integrate RE2 or other C libs into my build process, if the threadpool-friendly performance turns out to be unacceptable...

@BurntSushi
Member

@saarw If you are literally doing the clone for every match, then yes, this regex engine will get very slow. The reason is that the DFA itself is lazily generated during matching, and the DFA is itself stored in this scratch space. By clearing the scratch space, you're forcing the DFA to be regenerated, which is quite slow. The key is doing the clone for every group of matches.

@saarw
Author

saarw commented May 13, 2017

OK, just out of interest, I added the ability to specify batch sizes for the Rust lib, so that it only clones the regex every X matches. It seems the lib is only about 4x slower than Java with a batch size of 10, and stabilizes somewhere around 3x slower regardless of batch size (so the overhead of cloning for each match is just 2-3x versus batching).

@saarw
Author

saarw commented May 15, 2017

I tested a thread-unsafe cache version in a fork. I first tried starting with an empty cache whenever the regex was cloned, which had little effect on clone-per-match timing. I also tested cloning the existing cache in the clone operation, and tried initializing the original regex's cache with some warm-up data before starting the actual matches, but performance got worse. The experiments are in the last commits on this branch of the fork: https://github.com/saarw/regex/tree/threadunsafe-perf

Unfortunately, since we are using the library in a web app that processes requests in a thread pool, we can only group together data from a single request to run through each regex instance, so we don't reach the batch volumes necessary to make this efficient enough for us, but it doesn't seem to matter that the scratch space is allocated in a thread local.

@BurntSushi
Member

@saarw Very interesting stuff. I'm going to close this because I think we've reached a point at which we understand the fundamental trade-offs: this implementation requires mutable scratch space for matching, and allocating a fresh space for every match causes the implementation to regress quite badly.

I'm going to close this for now, and although I do see a future where some lower level access to the matching machines/state is available, it will probably be quite some time before that happens.

@saarw
Author

saarw commented May 21, 2017

OK! Yes, I'm focusing on trying to find some other library with different performance characteristics in relation to scratch space. That thread local use probably justifies a warning somewhere in the docs though; it's not really expected from a regex lib, and a lot of people seem to be trying to push Rust into the web backend space where thread pools aren't uncommon... (Can the thread local cause memory leaks in regexes that get evaluated in parallel with Rayon? Does it destroy threads when it no longer needs them?)

Edit: The Rayon docs warn that "In the future, however, the default behavior may change to dynamically add or remove threads as needed.", so maybe this won't cause memory leaks from parallel execution today, but it could if Rayon gets extended.

@BurntSushi
Member

it's not really expected from a regex lib

I don't know of any production-grade regex library that doesn't utilize some kind of scratch space... Whether it's implicit on the stack or more explicit on the heap, you really need some kind of space proportional to the regex in order to execute a search. This problem is probably exacerbated in a regex engine based on finite state machines, since a backtracking engine can probably get away with using a lot less space. (But then you lose your linear-time guarantee.)

backend space where thread pools aren't uncommon

Note that thread pools on their own aren't a problem. Your particular environment is quite a bit more constraining:

  1. You have to deal with threads that are continually created and destroyed.
  2. You have no ability to do something upon thread creation or destruction.
  3. Amortization is impossible because the work each thread does is very small.

I think that if any one of those points weren't true, then there would have been a way to fix this in a satisfying way:

  1. If a thread pool keeps a fixed number of threads alive for the duration of the service, then there is no leak.
  2. If you can hook into thread creation/destruction, then you should be able to clone a Regex for each thread. When that thread dies, that Regex should get dropped, including its scratch space (see the sketch after this list).
  3. If each thread did more work, then you could clone a global Regex just before using it and drop it when you're done.
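
As a minimal illustration of point 2, using plain std threads (in the IIS scenario the equivalent hook would have to come from the host, which is exactly what's missing there; the pattern and the worker count are arbitrary):

use std::sync::Arc;
use std::thread;

use regex::Regex;

fn main() {
    // One compiled regex, shared by reference counting.
    let shared = Arc::new(Regex::new(r"[a-z]+").unwrap());

    let handles: Vec<_> = (0..4)
        .map(|_| {
            // Clone the Regex itself (not just the Arc): the compiled program
            // is still shared, but this clone owns its own scratch space,
            // which is freed when the clone is dropped at the end of the
            // thread's closure.
            let re: Regex = (*shared).clone();
            thread::spawn(move || re.find_iter("abc 123 def").count())
        })
        .collect();

    for h in handles {
        println!("matches: {}", h.join().unwrap());
    }
}

Each clone's scratch space goes away with its thread, so the memory in use is bounded by the number of live worker threads.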

That thread local use probably justifies a warning somewhere in the docs though

Yeah, it's talked about a bit in the performance guide: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#using-a-regex-from-multiple-threads. But I agree, it would be wise to add more details and the lessons learned from this ticket. So I'll morph this into a documentation bug. :-) Thanks for all your patience and for helping to illuminate a weak point in the implementation. It will undoubtedly be on my mind when I revisit how all this stuff works.

@BurntSushi added the doc label on May 21, 2017
@BurntSushi
Member

BurntSushi commented May 21, 2017

Adding another note here. I took a look at how Go manages this with their regexp library, and they basically use the same strategy as this crate. The key difference is that Go maintains a pool of scratch spaces, where any thread can use any part of the pool. A new scratch space is only added to the pool when matches happen simultaneously. So their library also suffers from a similar problem, although it might be harder to provoke. In particular, Go's pool is limited to the maximum number of simultaneous searches, whereas this library is limited to the number of searches performed on unique threads.
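
For concreteness, a rough sketch of that pooling idea (an illustration only, not the actual implementation in Go's regexp or in this crate; the contents of Scratch are placeholders):

use std::sync::Mutex;

// Placeholder for the per-search mutable state (lazy DFA states, NFA stacks,
// capture slots, and so on).
struct Scratch {
    buf: Vec<u8>,
}

// Scratch space is checked out for the duration of one search and returned
// afterwards, so the pool is bounded by the number of simultaneous searches
// rather than by the number of distinct threads that ever ran a search.
struct ScratchPool {
    free: Mutex<Vec<Scratch>>,
}

impl ScratchPool {
    fn new() -> ScratchPool {
        ScratchPool { free: Mutex::new(Vec::new()) }
    }

    // Reuse an idle scratch space if one exists, otherwise allocate a new one.
    fn checkout(&self) -> Scratch {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Scratch { buf: Vec::new() })
    }

    // Return the scratch space so a later search (on any thread) can reuse it.
    fn checkin(&self, s: Scratch) {
        self.free.lock().unwrap().push(s);
    }
}

fn main() {
    let pool = ScratchPool::new();
    let scratch = pool.checkout();
    // ... a search would use `scratch` here ...
    pool.checkin(scratch);
    assert_eq!(pool.free.lock().unwrap().len(), 1);
}

A search would call checkout() before matching and checkin() afterwards, so threads that exit simply stop borrowing from the pool and nothing accumulates per thread.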

@saarw
Author

saarw commented May 21, 2017

Sure, the use of a scratch space is common; it's only the use of a thread local that I think should be documented (I only started suspecting it after seeing the thread_local dependency on regex's crates.io page). After all, it doesn't help to have the ability to act when a thread is created or goes away unless you also know that you need to do so.

I guess the need for such documentation especially applies to languages like Rust, since even if a GCed language did use a thread local, its GC would clean up the memory when the thread dies and prevent the memory leak. This has really given me something to think about when designing concurrent components in Rust in the future (and I will be keeping an eye out for thread locals on the dependency pages of other crates). Thanks again for helping me sort this out!

@Amanieu
Member

Amanieu commented May 25, 2017

After discussing this with @BurntSushi on IRC for a bit, one potential solution would be to change the way thread_local allocates thread IDs to prefer reusing IDs of threads that have exited wherever possible. This would limit the number of entries in the ThreadLocal to the maximum number of threads live at any one point, and avoid unbounded growth.

@saarw
Author

saarw commented May 29, 2017

That sounds like a great improvement! Anything that helps clean up or limit thread local states of exited threads sounds very useful for thread locals in general. The change would make Rust thread locals a lot safer to use from other languages.

It is probably still worth documenting the thread-local use in the regex crate, though. A non-reactive web server thread pool can often use ~200 threads, and a memory footprint of a 2 MB regex cache x 200 threads per regex is probably worth warning about...

@Amanieu
Member

Amanieu commented Jun 27, 2017

This is now fixed in thread_local 0.3.4.

@BurntSushi
Member

@Amanieu ah that's awesome! Thanks so much!

@saarw
Author

saarw commented Jun 27, 2017

Very nice! I guess that fix closes this issue since the memory shouldn't be leaking with that behavior.

@BurntSushi
Member

@saarw Nice! Were you able to confirm that?

@saarw
Author

saarw commented Jun 27, 2017

Sorry, I can't easily reproduce it since I'm not personally working on the .NET integration where it happened, and we swapped in Oniguruma to solve the problem there.
