
Probable race condition #7617

Open
2 tasks done
almln opened this issue Mar 16, 2020 · 6 comments
Labels: P3 (Issue moderate in impact or severity), question (Just a question :))

almln commented Mar 16, 2020

What is the problem?

While trying to use Ray for an image-processing pipeline, we ran into what is probably a race condition. Debug output from Ray says:

2020-03-16 11:51:23,719	DEBUG async_plasma.py:38 -- Completing plasma futures for object id ObjectID(276fa5c994b5ce43ffffffff010000c801000000)
2020-03-16 11:51:23,719 [DEBUG] ray.services: Completing plasma futures for object id ObjectID(276fa5c994b5ce43ffffffff010000c801000000) [Dummy-1] 
2020-03-16 11:51:23,719	DEBUG async_plasma.py:38 -- Completing plasma futures for object id ObjectID(276fa5c994b5ce43ffffffff010000c801000000)
2020-03-16 11:51:23,719 [DEBUG] ray.services: Completing plasma futures for object id ObjectID(276fa5c994b5ce43ffffffff010000c801000000) [MainThread] 
ERROR:asyncio:Exception in callback get_async.<locals>.done_callback(<Future finis... result=None)>) at /home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/async_compat.py:62
handle: <Handle get_async.<locals>.done_callback(<Future finis... result=None)>) at /home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/async_compat.py:62>
Traceback (most recent call last):
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/async_compat.py", line 83, in done_callback
    retry_plasma_future = as_future(result.plasma_fallback_id)
  File "/home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/experimental/async_api.py", line 59, in as_future
    return handler.as_future(object_id)
  File "/home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/experimental/async_plasma.py", line 76, in as_future
    self.check_immediately(object_id)
  File "/home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/experimental/async_plasma.py", line 59, in check_immediately
    self._complete_future(object_id)
  File "/home/melnikov-a/venvs/research3/lib/python3.6/site-packages/ray/experimental/async_plasma.py", line 41, in _complete_future
    futures = self._waiting_dict.pop(ray_object_id)
KeyError: ObjectID(276fa5c994b5ce43ffffffff010000c801000000)

It seems like thread Dummy-1 (who is this?) does something bad with object 276f, and then we unsuccessfully try to use this object ID as a key.

It may be related to the TODO in _complete_future().

How should we go about fixing or working around this issue?
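For context, here is a minimal self-contained sketch of the pattern we suspect (illustrative names only, not Ray's actual code): if two callers race to complete the same object ID, the second dict.pop() raises exactly this KeyError, while popping with a default (or guarding with a lock) would turn the duplicate completion into a no-op.

import threading

waiting_dict = {"276f": ["some-future"]}   # stand-in for _waiting_dict
lock = threading.Lock()

def complete_future_unsafe(object_id):
    # A second caller for the same object_id raises KeyError here.
    return waiting_dict.pop(object_id)

def complete_future_guarded(object_id):
    # Tolerate a concurrent completion: pop with a default under a lock.
    with lock:
        futures = waiting_dict.pop(object_id, None)
    return futures or []

complete_future_guarded("276f")
complete_future_guarded("276f")   # duplicate completion is now a no-op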

Environment:
Ubuntu 18.04
Python 3.6 (same behavior on 3.7)
Ray 0.8.2

Reproduction

A heavily truncated version of my code for reproduction:

Reproduction example
import os
import logging
import asyncio

import numpy as np

import ray


@ray.remote
def get_result():
    # Each task returns a ~32 MB float64 array (2000 * 2000 * 8 bytes).
    return np.random.rand(2000, 2000)


class SomeUselessProcessing:
    def __init__(self):
        # Bounded queue keeps at most 5 tasks in flight.
        self._queue = asyncio.Queue(maxsize=5)
        self._counter = 0

    def _submit_ray_task(self):
        return get_result.remote()

    async def _feed_queue(self):
        while True:
            await self._queue.put(self._submit_ray_task())

    async def _get_from_queue(self):
        while True:
            loaded_frame_hdlr = await self._queue.get()
            _ = await loaded_frame_hdlr

            print('Got %d frames' % self._counter)
            self._counter += 1

    def get_tasks(self):
        return [self._feed_queue(), self._get_from_queue()]


# Ray initialization
os.environ.pop("HTTP_PROXY", None)
os.environ.pop("HTTPS_PROXY", None)
os.environ.pop("http_proxy", None)
os.environ.pop("https_proxy", None)
ray.init()


def main():
    FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s [%(threadName)s] "
    logging.basicConfig()
    logger = logging.getLogger('ray')
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter(FORMAT)
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)

    processor = SomeUselessProcessing()

    ioloop = asyncio.get_event_loop()
    tasks = processor.get_tasks()
    wait_tasks = asyncio.wait([ioloop.create_task(t) for t in tasks])
    ioloop.run_until_complete(wait_tasks)
    ioloop.close()


if __name__ == "__main__":
    main()
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
almln added the bug (Something that is supposed to be working; but isn't) label on Mar 16, 2020
rkooo567 (Contributor) commented:

@simon-mo

edoakes (Contributor) commented Mar 18, 2020

@ijrsvt

simon-mo (Contributor) commented:

Can you try this on the latest master? https://ray.readthedocs.io/en/latest/installation.html#latest-snapshots-nightlies

This bug might be fixed by 46953c5

almln (Author) commented Mar 18, 2020

I tried it on master: the KeyError is gone, but now there is an issue with the local object store being full:

Got 128 frames
2020-03-18 21:19:51,841 DEBUG async_plasma.py:26 -- Completing plasma futures for object id ObjectID(e928ffbbbde9a4deffffffff010000c801000000)
2020-03-18 21:19:51,841 [DEBUG] ray.services: Completing plasma futures for object id ObjectID(e928ffbbbde9a4deffffffff010000c801000000) [MainThread] 
Got 129 frames
2020-03-18 21:19:51,858 DEBUG async_plasma.py:26 -- Completing plasma futures for object id ObjectID(aa109a285657aa96ffffffff010000c801000000)
2020-03-18 21:19:51,858 [DEBUG] ray.services: Completing plasma futures for object id ObjectID(aa109a285657aa96ffffffff010000c801000000) [MainThread] 
Got 130 frames
2020-03-18 21:19:51,966 INFO (unknown file):0 -- gc.collect() freed 484 refs in 0.057550188000050184 seconds
(pid=14065) E0318 21:19:51.896755 14065 plasma_store_provider.cc:108] Failed to put object d3640eedabc51b70ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=14065) Waiting 1000ms for space to free up...

I saw this message in the logs several times; the store is never cleaned up.

I don't know whether this is connected to my code example or whether something is broken on master. My code looks pretty straightforward.

simon-mo (Contributor) commented:

@almln I was able to run it with ray.init(lru_evict=True) and got up to 2k frames.

cc @stephanie-wang @edoakes: it looks like some object refs are not being cleaned up or GC'ed in this use case.

cc @ijrsvt I found a new bug; can you reproduce it with lru_evict=True?

Got 2776 frames
2020-03-18 12:08:13,237	DEBUG async_plasma.py:26 -- Completing plasma futures for object id ObjectID(ecc0c4a1ebdac702ffffffff010000c801000000)
2020-03-18 12:08:13,237 [DEBUG] ray.services: Completing plasma futures for object id ObjectID(ecc0c4a1ebdac702ffffffff010000c801000000) [MainThread]
Got 2777 frames
(pid=80736) E0318 12:08:13.280831 483843520 plasma_store_provider.cc:101] Failed to put object 69c9aef03bcac832ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80736) Waiting 1000ms for space to free up...
(pid=80738) E0318 12:08:13.267246 71347648 plasma_store_provider.cc:101] Failed to put object eeff420bf79bbbe3ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80738) Waiting 1000ms for space to free up...
(pid=80742) E0318 12:08:13.279240 463310272 plasma_store_provider.cc:101] Failed to put object 0362616892f7ca31ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80742) Waiting 1000ms for space to free up...
(pid=80730) E0318 12:08:13.266654 164875712 plasma_store_provider.cc:101] Failed to put object d13f2d6e496b9f11ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80730) Waiting 1000ms for space to free up...
(pid=80741) E0318 12:08:13.265328 366800320 plasma_store_provider.cc:101] Failed to put object 5f113d52bc0deba9ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80741) Waiting 1000ms for space to free up...
(pid=80732) E0318 12:08:13.260447 227659200 plasma_store_provider.cc:101] Failed to put object 166ad35a93719117ffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80732) Waiting 1000ms for space to free up...
(pid=80737) E0318 12:08:13.268399 258112960 plasma_store_provider.cc:101] Failed to put object 52f5df47abbcb4adffffffff010000c801000000 in object store because it is full. Object size is 32000232 bytes.
(pid=80737) Waiting 1000ms for space to free up...
2020-03-18 12:08:13,409	INFO (unknown file):0 -- gc.collect() freed 196 refs in 0.055547380999996676 seconds
2020-03-18 12:08:13,409 [INFO] ray._raylet: gc.collect() freed 196 refs in 0.055547380999996676 seconds [Dummy-50]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: 'PlasmaEventHandler' object has no attribute 'process_notifications'
Exception ignored in: 'ray._raylet.async_plasma_callback'
AttributeError: 'PlasmaEventHandler' object has no attribute 'process_notifications'

simon-mo (Contributor) commented:

Never mind the 'PlasmaEventHandler' object has no attribute 'process_notifications' error; it turns out I hadn't built the raylet binary. It seems to run just fine now with lru_evict=True.
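For anyone who hits the same object-store-full behavior, here is a minimal sketch of the workaround discussed above; lru_evict is assumed to be available on the 0.8.x-era / nightly builds used in this thread, and the rest of the reproduction script stays unchanged.

import ray

# Let the object store evict least-recently-used objects instead of
# erroring out when it fills up (flag from the Ray builds discussed here).
ray.init(lru_evict=True)

# ...then run the reproduction script's main() as before.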

ericl added the P3 (Issue moderate in impact or severity) and question (Just a question :)) labels and removed the bug (Something that is supposed to be working; but isn't) label on Mar 19, 2020