What happened + What you expected to happen
Recently, we found several unexpected phenomena when using ray.get(), and we hope to get an explanation.
1. When we use ray.get() to fetch the same data twice from another node, the peak memory usage doubles both times. We used Prometheus to record the memory curve over the whole process, as shown below.
(Figure: memory change; the two curves record the total memory usage of the two nodes, respectively.)
My question is why points C and D in the figure grow to twice the size of the data and then drop back down to the size of the data. In my opinion, the first ray.get() has already fetched the data from the remote node into plasma, so the second call should reuse that memory rather than fetching the data again.
2. We use ray.get() inside an actor method to fetch data from another node. After this method returns, the data that was copied from the remote node into the local plasma store is not released, even though we never use it again. We also found that even when plasma triggers spilling, the copied data is still kept and not released. When and how is this copied data released?
import time

import ray
import torch
import torch.nn as nn

# Number of objects each worker puts; the original script does not
# define this value, so 100 here is an assumed placeholder.
global_nums = 100

@ray.remote(resources={"Resource0": 8})
class Worker1:
    def __init__(self):
        self.value = 0

    def put_data(self):
        # 1024 * 4096 float32 values * 4 bytes = 16 MB per tensor.
        param = nn.Parameter(torch.rand(1024, 4096), requires_grad=False)
        ref_ray = []
        time_start_put = time.time()
        for _ in range(global_nums):
            ref = ray.put(param)
            ref_ray.append(ref)
        ray_put_time = time.time() - time_start_put  # kept from the original (unused)
        return ref_ray

    def get_data(self, ref_ray):
        time_start_get = time.time()
        for i in range(len(ref_ray)):
            ray.get(ref_ray[i])
        ray_get_time = time.time() - time_start_get
        return ray_get_time

@ray.remote(resources={"Resource1": 8})
class Worker2:
    def __init__(self):
        self.value = 0

    def put_data(self):
        # 1024 * 4096 float32 values * 4 bytes = 16 MB per tensor.
        param = nn.Parameter(torch.rand(1024, 4096), requires_grad=False)
        ref_ray = []
        time_start_put = time.time()
        for _ in range(global_nums):
            ref = ray.put(param)
            ref_ray.append(ref)
        ray_put_time = time.time() - time_start_put  # kept from the original (unused)
        return ref_ray

    def get_data(self, ref_ray):
        time_start_get = time.time()
        for i in range(len(ref_ray)):
            ray.get(ref_ray[i])
        ray_get_time = time.time() - time_start_get
        return ray_get_time

def ray_multinode_put_get_test():
    ray.init(address="ray://******:10001")
    total_start_time = time.time()
    worker1 = Worker1.remote()
    worker2 = Worker2.remote()
    ref_ray1 = worker1.put_data.remote()
    ref_ray2 = worker2.put_data.remote()
    list1 = ray.get(ref_ray1)
    list2 = ray.get(ref_ray2)
    print("len:", len(list2))
    num = 20
    for i in range(num):
        # Worker1 fetches a slice of Worker2's objects across nodes.
        chunk = list2[int(len(list2) / num * i):int(len(list2) / num * (i + 1))]
        get_time1 = worker1.get_data.remote(chunk)
        ray.get(get_time1)
        print("end to sleep")
        time.sleep(30)
        print("start again")
    total_end_time = time.time() - total_start_time
    print("total costs: {}".format(total_end_time))
    ray.shutdown()

if __name__ == "__main__":
    ray_multinode_put_get_test()
Versions / Dependencies
Ray version 1.13.0
Reproduction script
See the test code in the description above.
Issue Severity
High: It blocks me from completing my task.
zyheeeeee added the bug and triage labels on Nov 23, 2022.
sven1977 added the P2 label and removed the triage label on Nov 23, 2022.
I also ran into this problem. Suppose I keep an ObjectRef in an actor A and send it to another actor B (where A and B are on different nodes), so that B can fetch the object from plasma into process memory with ray.get(). What does plasma do after actor B releases the ObjectRef? Will plasma release the object? If not, how can I release the copy in actor B's node's plasma store while retaining the object in actor A's node's plasma store? A minimal sketch of this scenario follows.
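For concreteness, here is a minimal sketch of the scenario described above. The actor names, the custom resources used to pin A and B to different nodes, and the 16 MB payload are all illustrative assumptions, not from the original report:

import ray

ray.init()

@ray.remote(resources={"Resource0": 1})  # pin to node 0 (hypothetical resource)
class A:
    def __init__(self):
        # The primary copy lives in A's node's plasma store.
        self.ref = ray.put(b"x" * (16 * 1024 * 1024))

    def get_ref(self):
        # Wrap the ObjectRef in a list so it is passed by reference
        # instead of being resolved to its value.
        return [self.ref]

@ray.remote(resources={"Resource1": 1})  # pin to node 1 (hypothetical resource)
class B:
    def consume(self, refs):
        val = ray.get(refs[0])  # fetches a copy into B's node's plasma
        del val                 # B is done with the value here

a = A.remote()
b = B.remote()
refs = ray.get(a.get_ref.remote())
ray.get(b.consume.remote(refs))
# The question above: when is the copy in B's node's plasma released,
# while A's node keeps the primary copy?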
rkooo567 added the triage and core labels and removed the P2 label on Nov 25, 2022.
Great question. I think part of the problem here is that torch tensors are not zero-copy deserializable, so every ray.get temporarily doubles the memory usage for that object: one copy stays in shared memory (plasma) and the other is the deserialized copy in the worker's heap. This issue is tracked here. You could try your script again with a numpy array (which is zero-copy deserializable) and see whether you hit the same issue.
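As a quick illustration of the zero-copy path (a minimal sketch, not from the original report; the array shape is arbitrary):

import numpy as np
import ray

ray.init()

ref = ray.put(np.random.rand(1024, 4096))  # float64, ~32 MB in plasma

out = ray.get(ref)
# Zero-copy deserialization: `out` is backed directly by the plasma
# buffer rather than copied into the worker heap, so it is read-only.
print(out.flags.writeable)  # False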
Here are the conditions for when object copies can get released from plasma:
- No worker is actively using it (via ray.get or as a task argument). For zero-copy deserializable objects, the worker also has to release any refs to the value returned by ray.get. For non-zero-copy objects, the object can get released as soon as the worker finishes the ray.get.
- For the original copy of the object only (i.e. the driver's node in your first script), the copy can be released once all ObjectRefs have been deleted or once it has been spilled.
Note that these are just the conditions for release; the actual release happens when:
- There is memory pressure on the node, or
- All ObjectRefs have gone out of scope.
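A minimal sketch of releasing the worker-side and driver-side references so the copy becomes eligible for release (illustrative, assuming a single-node run):

import numpy as np
import ray

ray.init()

ref = ray.put(np.zeros(1024 * 1024))  # ~8 MB primary copy in plasma
val = ray.get(ref)

del val  # drop the worker's reference to the zero-copy value
del ref  # drop the last ObjectRef; the object goes out of scope
# With no references left, the object store is free to evict the
# primary copy, either eagerly or under memory pressure.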
hora-anyscale added the P2 label and removed the triage label on Dec 5, 2022.