Skip to content

Conversation

@AaronOpfer
Copy link
Contributor

@AaronOpfer AaronOpfer commented Jul 26, 2016

This changeset fixes #1769 to my satisfaction (and I hope everyone else's).

My changeset causes the gen.coroutine (and gen.engine, I think) to keep a WeakKeyDictionary mapping result futures to Runner objects. This WeakKeyDictionary is closured to the coroutine generator function itself a global.

The resulting behavior is that, if there exists hard references to the result future object and to the coroutine function, the Runner object will not die.

Everything below is no longer relevant but I'm keeping it for historical reasons

In addition, I added a __del__ method to the Runner object (In Python > 3.4 only) which causes it to "cancel" (load with an exception, since tornado's futures don't have cancel semantics) the result future if the Runner is collected before it is completed. This prevents applications from hanging forever waiting on futures that will never complete.

Things I'm still concerned about:

  • I am a little bit concerned about my last change with __del__ since it could cause "working" code to start spamming "future exception never cleaned up" notices if they somehow made a habit out of getting their Runner's collected. It is difficult to imagine a situation where Runners being garbage collected before the future has a result being a desired behavior, though.
  • I'm also not entirely sure how this interacts with other code like WaitIterator where multiple futures' results are tied together (would their runners get GCed and give weird exceptions in the log instead of finishing?). But the unit tests pass so what could go wrong right?
  • I copy and pasted _GC_CYCLE_FINALIZERS from tornado.concurrent. Maybe the symbol should have been imported instead, but the fact it is "private" made me a bit nervous about that.

@AaronOpfer
Copy link
Contributor Author

Looks like my __del__ method has some problems, I'll fix them.

@AaronOpfer
Copy link
Contributor Author

It looks like implementing the __del__ method broke various unit tests which expected no consequences of letting half-finished coroutines get garbage collected. These test cases pass but log errors about uncaught future exceptions, failing the overall test. I fixed one instance of this in queues_test.py locally but there seems to be many more instances where this can occur.

I can continue to clean up instances like these but first I'd like a Tornado developer to tell me if I'm on the right path with this or if I should just leave the __del__ change out and quit while I'm ahead.

@bdarnell
Copy link
Member

bdarnell commented Aug 1, 2016

It is difficult to imagine a situation where Runners being garbage collected before the future has a result being a desired behavior, though.

It doesn't make sense for the Runner to be collected before the Future, but they could be collected at the same time. I think that's what's happening in the tests. I think this is legitimate and we want to allow it. Besides, after the first change to use the WeakKeyDictionary, what other scenarios would this destructor be guarding against? I'd just get rid of it.

I think I'd make the WeakKeyDictionary a global. I'm not seeing any benefit to giving each coroutine function its own dict (which could possibly go out of scope if the decorated function does.

I'm concerned about the performance impact of a WeakKeyDictionary - will it increase GC overhead? I'd like to see some benchmarks showing no significant impact before merging this.

@AaronOpfer
Copy link
Contributor Author

I agree with your points. I'll remove the __del__ method and make the WeakKeyDictionary a global.

I agree that we should test the performance. Does Tornado have any existing performance benchmarking suites we can use for testing performance after my changes?

@bdarnell
Copy link
Member

bdarnell commented Aug 3, 2016

We don't have great benchmarks. What we do have is in demos/benchmark/benchmark.py (and a few other scripts, but benchmark.py is the main one). It should be enough to show the impact of this change. Run it with --keepalive since that removes some of the networking overhead.

demos/benchmark/benchmark.py --keepalive --quiet --num_runs=5|grep "Requests per second"

@AaronOpfer
Copy link
Contributor Author

I ran the performance benchmarks and I saw no differences between my implementation and the previous implementation. The tests were highly variable, I ran each multiple times and they would vary between 1100 req/sec and 1600 req/sec.

@AaronOpfer
Copy link
Contributor Author

I believe this changeset addresses all of your concerns @bdarnell , let me know if you'd like to see any other adjustments.



# Ties lifetime of futures to their runners. Github Issue #1769
_futures_to_runners = weakref.WeakKeyDictionary()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a blank line after this definition, and expand the comment. Something like:
Unlike regular stack frames, idle coroutines are not anchored to a garbage-collection root. Object references point in the opposite direction from what you'd expect, with child frames holding references to the parent frames that they will awaken when completed, but there is not necessarily anything keeping the child frames alive (In most cases the innermost child frame will be registered as a callback on the IOLoop, but other patterns are possible especially when weak references are used). This map provides hard references in the other direction, so that as long as the Future returned by a coroutine is referenced (typically by the parent's Runner), this coroutine's Runner will be kept alive.

@bdarnell
Copy link
Member

Thanks, looks good. But in addition to the comment above, please squash the two commits into one. It's good that you wrote the test first to make sure it's sensitive enough to detect the problem, but I'd rather not have commits that are known to fail their tests in the history, since they make bisecting difficult.

@AaronOpfer
Copy link
Contributor Author

Okay, sounds good to me. I've squashed the commits and added some comments detailing the problem we've solved to gen.py.

@bdarnell
Copy link
Member

bdarnell commented Sep 5, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Runner objects can get GCed unexpectedly

2 participants