Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

close SummaryWriters #409

Merged
merged 1 commit into from
Mar 30, 2020
Merged

close SummaryWriters #409

merged 1 commit into from
Mar 30, 2020

Conversation

pkol
Copy link
Contributor

@pkol pkol commented Mar 30, 2020

SummaryWrites should be closed after use. Otherwise they have some
threads running and prevents application from terminate. Relaying on
garbage collector for this is unstable - sometimes it kills them faster,
sometimes slower.

This fixes #298

Without this commit, following command does not end (process does not exit):
python rl_trainer.py --config_file=rl/configs/light_awr_cartpole.gin --output_dir=/tmp/awr1

SummaryWrites should be closed after use. Otherwise they have some
threads running and prevents application from terminate. Relaying on
garbage collector for this is unstable - sometimes it kills them faster,
sometimes slower.

This fixes google#298
@afrozenator afrozenator added the ready to pull Added when the PR is ready to be merged. label Mar 30, 2020
@afrozenator
Copy link
Contributor

Hey @pkol this is really helpful! Will merge it in shortly, may I ask how you debugged this issue? i.e. come to the conclusion that the SummaryWriters were the issue

@pkol
Copy link
Contributor Author

pkol commented Mar 30, 2020

Hi @afrozenator .
Thanks for accepting my contributions :-)

I was debugging it for a while... Finally I printed traceback of all threads which remains after RL training (using faulthandler.dump_traceback() ) and there were only threads running code from event_file_writer.py (so I suspected SummaryWriters). Then noticed there is also SummaryWrite.__del__() calling close() from EventFileWriter. So I checked if calling 1gc.collect()1 after training will help. And it did.
I think it was something like above but it's really hard to recall this process exactly.

BTW there is still some mystery. In TF 1.5 in _EventLoggerThread.__init__() there is self.daemon=True so this thread should not block app from exiting. However it's set after calling Thread.__init()__ which I think (but I'm not 100% sure) is a bug.

@afrozenator
Copy link
Contributor

Thanks @pkol for teaching me something new today! Merging in shortly.

@trax-robot trax-robot merged commit cebad02 into google:master Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes ready to pull Added when the PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

program using supervised.train does not end
4 participants