Fix EventFileWriter deadlock on exception in background thread #6168

Merged 1 commit on Feb 3, 2023

crassirostris

Motivation for features / changes

To address #6167

Technical description of changes

This PR fixes a possible deadlock when writing events through EventFileWriter. It adds logic to _AsyncWriterThread to catch an exception raised in the background thread so that it can be propagated to the calling thread, and adds logic to _AsyncWriter to propagate the exception raised in _AsyncWriterThread to its caller.
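The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: SketchWriterThread, SketchAsyncWriter, and the sink callable are hypothetical names, and the real _AsyncWriterThread and _AsyncWriter may differ in detail. The sketch assumes the background thread records the exception and then empties the queue, so that a foreground thread blocked on put() or join() wakes up and can re-raise it.

```python
import queue
import threading


class SketchWriterThread(threading.Thread):
    """Hypothetical background thread that passes queued items to a sink."""

    def __init__(self, item_queue, sink):
        super().__init__(daemon=True)
        self._queue = item_queue
        self._sink = sink
        self.exception = None  # set if the sink raises

    def run(self):
        try:
            while True:
                item = self._queue.get()
                self._sink(item)
                self._queue.task_done()
        except Exception as ex:
            # Record the failure first, so that any foreground thread woken
            # up by the calls below is guaranteed to see it.
            self.exception = ex
            # Account for the item whose write failed.
            self._queue.task_done()
            # Pop everything still queued so a foreground thread blocked in
            # put() or join() can proceed instead of waiting forever.
            try:
                while True:
                    self._queue.get(block=False)
                    self._queue.task_done()
            except queue.Empty:
                pass


class SketchAsyncWriter:
    """Hypothetical foreground wrapper that re-raises worker failures."""

    def __init__(self, sink, max_queue_size=20):
        self._queue = queue.Queue(max_queue_size)
        self._lock = threading.Lock()
        self._worker = SketchWriterThread(self._queue, sink)
        self._worker.start()

    def _check_worker(self):
        if self._worker.exception is not None:
            raise self._worker.exception

    def write(self, item):
        with self._lock:
            self._check_worker()
            self._queue.put(item)  # may block while the queue is full
            self._check_worker()   # surface a failure that happened meanwhile

    def flush(self):
        with self._lock:
            self._check_worker()
            self._queue.join()     # returns once every item is accounted for
            self._check_worker()
```

With a sink that always raises, a sequence of write() calls followed by flush() raises the sink's exception in the calling thread rather than blocking. Without the except branch, the same sequence could block indefinitely on a full queue or in join().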

Detailed steps to verify changes work correctly (as executed by you)

Added a new unit test that fails on master.
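The PR's actual test is not reproduced here. As a hedged illustration of how a test can detect a deadlock without hanging the test run itself, the sketch below runs the suspect call in a daemon thread and waits for it with a timeout. The helper name completes_within and the 0.2 second timeout are assumptions made for this example.

```python
import queue
import threading


def completes_within(fn, timeout_secs):
    """Run fn in a daemon thread; return True if it finished in time."""
    done = threading.Event()

    def target():
        fn()
        done.set()

    threading.Thread(target=target, daemon=True).start()
    return done.wait(timeout_secs)


# A bounded queue with no consumer: a second put() blocks forever, which is
# the kind of hang a foreground writer hits when its background thread dies.
full_queue = queue.Queue(maxsize=1)
full_queue.put(b"event")
blocked = not completes_within(lambda: full_queue.put(b"event"), 0.2)
```

A test built this way fails with an assertion when the call hangs, instead of stalling until the CI job is killed.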

Alternate designs / implementations considered

  • Instead of popping an item from the queue on exception, the wait/flush methods could re-check the background thread's status periodically.
  • Instead of raising an exception in the foreground thread, the writer could ignore the exception altogether and simply start dropping events.
  • The writer could drop data that cannot be added to the queue within a certain period of time.
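The first and third alternatives both come down to bounding how long the foreground thread may block. A rough sketch follows, under the assumption that the bound is applied to put(); queue.Queue.join() accepts no timeout, so a flush would need a different mechanism. The helper names put_with_status_check and put_or_drop, and their default intervals, are hypothetical and not part of TensorBoard.

```python
import queue
import threading


def put_with_status_check(item_queue, item, worker, poll_secs=0.05):
    """Block on put(), but re-check the worker thread every poll_secs."""
    while True:
        if not worker.is_alive():
            raise RuntimeError("writer thread has stopped")
        try:
            item_queue.put(item, timeout=poll_secs)
            return
        except queue.Full:
            continue  # queue still full; check the worker again


def put_or_drop(item_queue, item, max_wait_secs=1.0):
    """Try to enqueue; return False (item dropped) if the queue stays full."""
    try:
        item_queue.put(item, timeout=max_wait_secs)
        return True
    except queue.Full:
        return False
```

Both variants trade the simplicity of a plain blocking put() for a polling interval or a silent loss of data, which may be why the PR raises in the foreground thread instead.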

@crassirostris

@groszewn the build failure seems to be unrelated, and I fixed the linter warning. Could you please re-run the workflow?

@crassirostris

@groszewn sorry, now fixed the linter warning for real:

$ black --check --diff .
Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install black[jupyter]``
All done! ✨ 🍰 ✨
366 files would be left unchanged.

Could you please restart CI once again?


@groszewn groszewn left a comment

Hi @crassirostris, really appreciate the well-documented issue and PR! Left a few comments.

tensorboard/summary/writer/event_file_writer_test.py (outdated)
tensorboard/summary/writer/event_file_writer.py (outdated)
tensorboard/summary/writer/event_file_writer.py (outdated)

@crassirostris crassirostris left a comment

Thanks a lot for the thorough review, @groszewn! Fixed behavior for flush, could you please take another look?

tensorboard/summary/writer/event_file_writer.py (outdated)
tensorboard/summary/writer/event_file_writer.py (outdated)
tensorboard/summary/writer/event_file_writer_test.py (outdated)

@groszewn groszewn left a comment

Appreciate the fix!

@groszewn groszewn merged commit b1ac492 into tensorflow:master Feb 3, 2023
yatbear pushed a commit to yatbear/tensorboard that referenced this pull request Mar 27, 2023
Signed-off-by: Mik Vyatskov <[email protected]>
dna2github pushed a commit to dna2fork/tensorboard that referenced this pull request May 1, 2023