to_device() -- any way to force back to host "portably?" #626
Comments
I would like to propose that we refrain from using "cpu" as an alias for the host (in SYCL terminology), i.e. where the application is started (usually a CPU). At the same time, SYCL allows for, and oneAPI DPC++ implements, exposing a multi-core CPU as a SYCL device, allowing the user to offload data-parallel computation to the CPU as well as to GPUs or other accelerators known to the SYCL runtime. Bringing data to the host is left out of scope by the standard. If memory serves, most array library implementations provide a standalone function for this. Would it be agreeable to propose such a free function? |
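For reference, a few of those standalone, library-specific host-transfer helpers as they exist today (illustrative and not exhaustive; running this requires the respective libraries and devices to be available):

```python
import cupy as cp
import torch
import dpctl.tensor as dpt

x_np = cp.asnumpy(cp.arange(4))                       # CuPy: explicit device -> host copy
y_np = torch.arange(4, device="cuda").cpu().numpy()   # PyTorch: move to CPU, then view as NumPy
z_np = dpt.asnumpy(dpt.arange(4))                     # dpctl (SYCL devices): copy to host NumPy
```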
Kokkos also has parallel backends for CPU. I think I'm just asking for any portable way to do it across the libs eventually. |
As discussed in the cross-linked PR, another complication is that i.e., |
I think that a utility for this can indeed be added outside of the standard itself. Choosing cupy->numpy seems reasonable for a pragmatic testing utility, but it can't be part of the standard, I'd say. |
This issue came up during the most recent dataframe workgroup meeting (2023-05-23). cc @kkraus14 |
By the way, it occurred to me that we should probably remove 'cpu' from |
Comment from @fcharras, copied from #710 (comment): Adding to the remarks above, should we use a separate issue that covers inter-namespace conversion specifically, rather than tolist? I want to emphasize a use case I have when trying to adapt code for Array API compliance. There is some code that can't compromise on numerical accuracy and absolutely requires at least float64 precision, but I could use an integrated GPU that supports at most float32. For this use case, an intermediate object that enables inter-device and inter-namespace conversion would surely be practical.

Wouldn't it be practical to add to the Array API a conversion to numpy, e.g. a dedicated to-numpy function? |
There was some discussion about this (I think on the SciPy repo?) and the conclusion was that the standard is supposed to be completely agnostic - requiring a library to know about and interact with NumPy places NumPy on a pedestal, which doesn't make sense for an agnostic standard. Then it would just be "why not a dedicated conversion to every other library too?" |
This way, it does not appear that "cpu" is a portable device object across different array API compatible libraries. See data-apis/array-api#626.
|
What we need is a standard way to create an Array instance of lib1 (e.g. numpy) from an Array instance of lib2 (e.g. cupy), possibly with cross-device transfers, even if lib1 only works on host and lib2 only works on device. Special-casing numpy would be fine for us (scikit-learn), but I understand that others would like a more library-neutral solution to this problem. |
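A small illustration of the gap (assuming CuPy and a CUDA device are available): zero-copy DLPack import only works when the consumer supports the producer's device, so crossing the host/device boundary today still needs a library-specific call.

```python
import numpy as np
import cupy as cp

x_cp = cp.arange(4, dtype=cp.float32)    # lives on the GPU
try:
    x_np = np.from_dlpack(x_cp)          # NumPy cannot consume CUDA memory zero-copy
except (BufferError, RuntimeError, TypeError):
    x_np = cp.asnumpy(x_cp)              # CuPy-specific device -> host copy
```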
Let me reiterate my proposal of a free function that every library provides. This way, not a single implementation is singled out. The only drawback is that every library must implement one. It could be syntactic sugar around existing library-specific functionality. |
@ogrisel strides / dtype are all accessible via the buffer protocol / memoryview. I am confused why we are discussing this, though, when DLPack is basically a superset of the buffer protocol. They do the same thing; we just didn't allow DLPack to make a copy before (see a recent user question at dmlc/dlpack#132). My suggestion is to allow copies to be requested through DLPack.
Specifically, this allows the following usage:

```python
import numpy as np
import cupy as cp

a_cp = cp.empty((10, 20), dtype=cp.float32)
b_np = np.from_dlpack(a_cp, copy=True)
```

and similarly

```python
a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)
```

The reason we need this can be seen from a sketch of the consumer-side from_dlpack:

```python
def from_dlpack(x, *, copy=False):
    dev_type, dev_id = x.__dlpack_device__()
    if not copy:
        # the existing code path
        if dev_type not in what_I_expected:
            raise BufferError(...)
        else:
            # zero-copy conversion
            capsule = x.__dlpack__(stream=my_stream(dev_id))
            # wrap capsule and return a consumer array
            ...
    else:
        # this proposal
        if dev_type not in what_I_expected:
            capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)  # e.g. dl_device=kDLCPU
            # wrap capsule and return a consumer array
            ...
        else:
            # I think we should just raise here, as no copy is needed and this would be a perf hit
            raise BufferError(...)
```

This hint can then be taken by the producer to target the right device and make a copy. |
I like the proposal very much; my first thought was that it does already suggest that it can make a copy, and then I became confused about the usage scope that is really intended. Minor suggestion: |
But what if the source namespace (the library for the input array) cannot handle the target device? What would be the public Array API way to go through the host as an intermediate step in cases where a direct device-to-device transfer is not possible? |
I like @leofang's proposal. The key missing bit though is a cross-library way to identify devices. Maybe a way out is for the device object of each library to gain the same protocol, so that a signature like this works:

```python
from_dlpack(x, /, *, device=None)
# or with a copy keyword (seems less urgent, the default now is copy=None)
from_dlpack(x, /, *, copy=None, device=None)
```

and this would allow users to write things like

```python
x2 = from_dlpack(y_lib2, device=x1.device)
```

There's still no way to write `device='cpu'` then, but that isn't a hard necessity probably and would be harder to add (at least, it would take more API surface). |
@rgommers Note that it's But, you're right that since For NumPy (or any CPU library), If
is not a very satisfying semantics IMHO. It's better if we can be sure whether the returned array is a fresh copy or a view, and without using a |
No, nothing will crash. As I already noted, the producer should raise a BufferError in that case.

I think with Ralf's addendum (adding a device keyword) this is addressed. But in case it's not, let's consider a hypothetical, extreme case, where the producer holds data on an NVIDIA GPU, and the consumer only supports Intel GPUs (for the record, I am not aware of such cases 🙂). What you're asking for is a double copy (NV -> CPU -> Intel). I feel this should be fully controlled by the user. I would be very reluctant to act smart here, especially when all accelerator libraries try very hard to avoid unnecessary copies. |
Ah yes, I think this is converging - should work like this.
Not sure about this one - I think it should export the data on the current device instead (i.e., no change from today - leave the raising to the consumer). Think about a situation where CuPy (CUDA-only) sees a CPU array and decides to ask the producer for data on a CUDA device. And +1 to not allowing double transfers, that'd be taking it a bit too far. |
Yes, this afternoon I was staring at the snippet in my earlier comment

```python
a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)
```

and realized this loophole too. Good catch, @rgommers! I blame early mornings...🙂

API-wise, I think nothing that we discussed so far needs to change. Semantics-wise, we just need a minor correction to help with this case: instead of a hint aimed at the producer alone, it should be "to hint the producer and consumer that we need to cross the library boundary." Whoever is capable of handling it should take care of it, but the producer usually is the one doing most of the work.

Happy to see we're converging! To summarize the behavior: the device argument hints to the producer and consumer that we may need to cross the library boundary, and the copy keyword controls whether a copy may be made. While it's handy to have the copy keyword, the producer still does most of the work. This is partly why I feel @fcharras's suggestion (#626 (comment)) of allowing more than two possible values for copy is relevant here. If everyone is happy, I can create a PR later this week or next. |
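A rough sketch of the consumer-side logic this converging proposal implies, purely as a reading aid (the dl_device keyword is the proposed addition to __dlpack__, and _wrap_capsule is a hypothetical consumer helper):

```python
def from_dlpack(x, /, *, device=None, copy=None):
    # Where does the data currently live, and where do we want it?
    src_dev = x.__dlpack_device__()
    want_dev = device if device is not None else src_dev

    if copy is False and want_dev != src_dev:
        # Crossing the device (or library) boundary needs a copy,
        # which the caller has explicitly forbidden.
        raise BufferError("zero-copy export to the requested device is not possible")

    # dl_device is only a hint; the producer may still export on its current
    # device, and copy=None means "copy only if needed".
    capsule = x.__dlpack__(stream=None, dl_device=want_dev)
    return _wrap_capsule(capsule)  # hypothetical: wrap the capsule into a consumer array
```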
I would be happy with both of those. I slightly prefer the version with the copy keyword:

```python
try:
    array_out = from_dlpack(array_in, device=device, copy=False)
except BufferError:
    array_out = from_dlpack(array_in, device=device, copy=True)
```
|
There are two ways. Either a try/except to have "if needed" copy logic, or allowing a copy-if-needed value directly. To me, the only "device" request anyone should (maybe actually must, since I doubt it is a big burden) support is plain CPU. Beyond that, I don't think there is a point (i.e. up to the implementor: if they want to support moving between GPU and odd hardware, that is their choice). Another question is whether there is a need/gain for such a request to support being a set, so that I can say that any of several devices is acceptable. Unfortunately, all of this is bloating the spec. It now has multiple request APIs and queries in addition to the original exchange; something that it originally did not have and did not want to have. |
Indeed, at the moment I do not see a use case for direct, heterogeneous device-to-device transfers without going through the host either. So, if specializing the particular case of "transfer from device to host" can make it simpler to implement and adopt, then personally I am fine with this. |
This does not look quite right to me. If it's a hint and the producer cannot satisfy the hint, it should simply export the data on its current device (just like now) - no need to raise an exception or have any extra logic here. And step (c) here is already the status quo.
That seems like a bit too much. |
It's actually also helpful for safety when using |
@fcharras See also #721. Maybe let's move that part of the discussion over there. |
Sorry, @rgommers, this is a limitation of not writing actual code to deliver the concept... What I meant by "tries to perform copy" for step b is: the Consumer asks the Producer to again return a capsule, but this time without passing the dl_device hint. In terms of code:

```python
try:
    capsule = x.__dlpack__(stream=stream, dl_device=dl_device)
except BufferError:
    # this should work as it does today
    # also, since Producer can't handle dl_device, we may not have a valid stream here?
    capsule = x.__dlpack__(stream=stream)
```
@seberg yes and no.
So I wouldn't say the spec is bloated, it's a gap being filled 😄
@rgommers. As I said, it was just an exercise. But, I disagree that
Agreed. Even if zero-copy is not feasible and Producer has to perform a copy, we want it to do it asynchronously if possible, and to do so we need a stream/queue. |
I still think that there should/must(?) be a flag to indicate if the return is a copy or not. I also think it would make sense to have a |
Why? It semantically does not matter, it is only a performance issue. Remember there's no concept of a view. |
The @leofang and @rgommers proposal to add a dl_device hint covers the case where oneAPI can deal with the capsule as originally given by the producer. For CuPy to import an oneAPI capsule, DPC++ calls must be made to unbox SYCL objects into native objects, and hence CuPy would need to ask the oneAPI library to hand the data over via the host instead. We should give the user control over whether it is OK to perform data transfer via the host. |
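To make that concrete, here is a rough sketch of what a producer-side __dlpack__ could do with such a hint (purely illustrative; ProducerArray, _make_capsule and _copy_to_host are placeholder names, not any library's actual API):

```python
KDL_CPU = 1  # DLPack device-type enum value for host (CPU) memory

class ProducerArray:
    """Toy stand-in for a device array."""

    def __dlpack_device__(self):
        return (self._dev_type, self._dev_id)

    def __dlpack__(self, *, stream=None, dl_device=None):
        if dl_device is None or dl_device == self.__dlpack_device__():
            return self._make_capsule()  # zero-copy export, as today
        if dl_device[0] == KDL_CPU:
            # The hint asks for host memory: copy device -> host (asynchronously
            # on `stream` if possible), then export the host buffer.
            return self._copy_to_host(stream=stream)._make_capsule()
        # A hint for some other device the producer cannot target: either export
        # on the current device (treating dl_device as a pure hint) or raise
        # BufferError and let the consumer decide - both options were discussed above.
        return self._make_capsule()
```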
That's a good argument I think.
Ah wait, I think I misread your comment about a "flag". If you meant "+1 to adding a copy keyword", then that sounds fine to me. |
No, I didn't just mean +1 for the copy kwarg. If we add a copy, the result should also carry a flag indicating whether a copy was actually made. It is just such a trivial addition, and in NumPy we are annoyed about missing such a flag elsewhere. |
Some discussion today about whether it is useful to add a device ID in the hint, rather than only a device type. Wasn't entirely clear whether there was a strong enough use case - something that would need a bit more thought perhaps.
Okay - seems like that would be a DLPack C level change first, right next to the readonly flag. |
Since Leo asked if we can push forward with this, I created: dmlc/dlpack#134. I would strongly prefer pushing both hand-in-hand, because otherwise things tend to get lost and bite us later. But it is a trivial change in DLPack now that we have flags. |
This way, it does not appear that "cpu" is a portable device object across different array API compatible libraries. See data-apis/array-api#626. Original NumPy Commit: 3b20ad9c5ead16282c530cf48737aa3768a77f91
Yes, I agreed with the assessment. Created #741 and dmlc/dlpack#136 to address this issue. |
This came up in data-apis/array-api-compat#40.
My particular use case is something like this:

```python
x.to_device("cpu")
```

to allow downstream library test suites to guarantee that an array is returned to host (or a no-op if already there) for usage in e.g. `assert_allclose(x, expected)`, where `expected` is a NumPy array (always on host) but `x` could be on some other device depending on the array lib/backend.

I think specification to that level of detail is left to the libs at the moment. CuPy doesn't even have `to_device()` at the moment, let alone `"cpu"`, instead requiring usage of e.g. `arr.get()` to return to host. But even if they did, I'm not so sure the current array API spec would mandate that they need to allow `"cpu"` as a valid argument.

Maybe the decision is that we can't really require that level of detail, in which case I can at least point to this issue when justifying shims to get around it when testing. If it were possible to mandate a common call signature for returning to host someday, hopefully the benefit of that would be fairly obvious re: testing multiple backends.
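For the record, the kind of shim such a test suite might use today (illustrative only; the helper name and the fallback order are assumptions, not part of any standard):

```python
import numpy as np

def to_host(x):
    """Best-effort conversion of an array from any backend to a host NumPy array."""
    if isinstance(x, np.ndarray):
        return x
    if hasattr(x, "get"):                 # CuPy-style device -> host copy
        return np.asarray(x.get())
    if hasattr(x, "to_device"):           # array API style, if "cpu" is accepted
        return np.asarray(x.to_device("cpu"))
    return np.asarray(x)                  # last resort: rely on __array__ / buffer protocol

# usage in a test:
# np.testing.assert_allclose(to_host(x), expected)
```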