to_device() -- any way to force back to host "portably?" #626

Closed
tylerjereddy opened this issue Apr 28, 2023 · 35 comments · Fixed by #741
Labels: Question, topic: Device Handling

Comments

@tylerjereddy

This came up in data-apis/array-api-compat#40.

My particular use case is something like this:

x.to_device("cpu") to allow downstream library test suites to guarantee that an array is returned to host (or no-op if already there) for usage in, e.g., assert_allclose(x, expected), where expected is a NumPy array (always on host) but x could be on some other device depending on the array lib/backend.

I think specification to that level of detail is left to the libs at the moment. CuPy doesn't even have to_device() at the moment, let alone cpu, instead requiring usage of, e.g., arr.get() to return to host; but even if it did, I'm not so sure the current array API spec would mandate allowing cpu as a valid argument.

Maybe the decision is that we can't really require that level of detail, in which case I can at least point to this issue when justifying shims to get around it when testing. If it were possible to mandate a common call signature for returning to host someday, hopefully the benefit of that would be fairly obvious re: testing multiple backends.
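
A minimal sketch of such a testing shim, assuming only that CuPy-style arrays expose .get() and PyTorch-style tensors expose .cpu() (the helper name to_host and the backend checks are illustrative, not part of any standard):

import numpy as np

def to_host(x):
    """Best-effort move of an array back to host as a NumPy array (test helper)."""
    if isinstance(x, np.ndarray):
        return x                  # already on host
    if hasattr(x, "get"):         # CuPy-style: explicit device -> host copy
        return x.get()
    if hasattr(x, "cpu"):         # PyTorch-style: move to CPU, then convert
        return x.cpu().numpy()
    return np.asarray(x)          # fallback; may still fail for device arrays

A test would then call, e.g., assert_allclose(to_host(x), expected).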

@oleksandr-pavlyk
Contributor

I would like to propose that we refrain from using "cpu" as an alias for the host (in SYCL terminology), i.e. where the application is started (usually a CPU). At the same time, SYCL allows for, and oneAPI DPC++ implements, exposing a multi-core CPU as a SYCL device, allowing users to offload data-parallel computation to the CPU as well as to GPUs or other accelerators known to the SYCL runtime.

Bringing data to the host is left out of scope by the standard. If memory serves, most array library implementations provide a standalone asnumpy function, but the standard wants to consider NumPy on the same footing as other Python array libraries.

Would it be agreeable to propose a free function asbuffer(array) which would convert an array object to a Python object that implements the Python buffer protocol?

@tylerjereddy
Author

Kokkos also has parallel backends for CPU. I think I'm just asking for any portable way to do it across the libs eventually.

@tylerjereddy
Author

As discussed in the cross-linked PR, another complication is that, e.g., CuPy's arr.get() actually gives you a NumPy array rather than staying in the same array namespace, like torch does when switching from GPU to CPU. Both conversions ultimately give me something that I can use directly with NumPy assert_allclose(), but that distinction may of course be relevant for design.

@rgommers
Member

rgommers commented May 17, 2023

The two .. note::'s under https://data-apis.org/array-api/latest/design_topics/device_support.html#intended-usage have quite a bit to say about this exact question.

I think that a utility can indeed be added to array_api_compat in order to facilitate testing, but it should be clear there that it's separate from the things in the namespaces of implementing array libraries, and the utility will have to be library-specific.

Choosing cupy->numpy seems reasonable for a pragmatic testing utility, but can't be part of the standard I'd say.

@rgommers added the Question and topic: Device Handling labels on May 17, 2023
@kgryte
Contributor

kgryte commented May 31, 2023

This issue came up during the most recent dataframe workgroup meeting (2023-05-23). cc @kkraus14

@asmeurer
Member

By the way, it occurred to me that we should probably remove 'cpu' from numpy.array_api, since it might give people the impression that it is portable.

@rgommers
Member

Comment from @fcharras, copied from #710 (comment):

Adding to

@betatim

This issue made me wonder about converting from one namespace to another. Say from PyTorch to NumPy. This works:

x = array_api_compat.torch.asarray([1,2,3])
array_api_compat.numpy.asarray(x) # -> array([1, 2, 3])

The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no asarray for normal Python, so this is more of a thought experiment.

and

@oleksandr-pavlyk

We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From there the content could be converted to NumPy, or passed to xp.asarray in another library.

remarks,

should we use a separate issue that covers inter-namespace conversion specifically, rather than tolist?

I want to emphasize this with a use case I have when trying to adapt code for Array API compliance. There is some code that can't compromise on numerical accuracy and absolutely requires at least float64 precision, but I could use an integrated GPU that supports at most float32 (e.g. using the mps or xpu backends with PyTorch) for everything else. For this I would have to transfer data from device to CPU, run the float64 compute, and transfer back to device. But .to_device("cpu") is not part of the standard and some array libraries might not support it (like CuPy arrays), so I can't rely on it. .from_dlpack does not support inter-device conversion, so it's not appropriate either.

For this use case, an intermediate object that enables inter-device and inter-namespace conversion would surely be practical.

tolist has been mentioned, but conversion to and from numpy is also commonly supported:

  • torch has Tensor.numpy and torch.from_numpy
  • cupy has cupy.asnumpy, and cupy.asarray works with numpy arrays
  • jax has jax.numpy.array, and np.asarray works with jax arrays
  • dask's from_array supports numpy inputs, and np.asarray works with dask arrays
  • tensorflow has Tensor.numpy and tf.convert_to_tensor()
  • mxnet has NDArray.asnumpy, and array supports numpy inputs
  • dpctl.tensor has asnumpy, and dpctl.tensor.asarray supports numpy inputs

Wouldn't it be practical to add to the Array API a conversion to numpy, e.g. to_numpy or asnumpy? (from_numpy doesn't seem as necessary, since asarray or from_dlpack commonly already works with numpy inputs.)
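
To illustrate how scattered these conversions are today, a rough helper dispatching on the array's namespace might look as follows (the branches mirror the list above; the name as_numpy and the module checks are illustrative assumptions, not an existing API):

import numpy as np

def as_numpy(x):
    """Illustrative dispatch over the library-specific host conversions listed above."""
    mod = type(x).__module__.split(".")[0]
    if mod == "numpy":
        return x
    if mod == "cupy":
        import cupy
        return cupy.asnumpy(x)    # explicit device -> host copy
    if mod == "torch":
        return x.cpu().numpy()    # move to CPU first, then reinterpret as NumPy
    if mod == "tensorflow":
        return x.numpy()          # EagerTensor -> ndarray
    return np.asarray(x)          # jax, dask, ... accept this path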

@lucascolley
Contributor

lucascolley commented Dec 11, 2023

wouldn't it be practical to add to the Array API a conversion to numpy, e.g to_numpy or asnumpy ?

There was some discussion about this (I think on the SciPy repo?) and the conclusion was that the standard is supposed to be completely agnostic - requiring a library to know about and interact with NumPy places NumPy on a pedestal which doesn't make sense for an agnostic standard. Then it would just be "why not to_torch etc.", no?

asmeurer added a commit to asmeurer/numpy that referenced this issue Dec 12, 2023
This way, it does not appear that "cpu" is a portable device object across
different array API compatible libraries. See
data-apis/array-api#626.
@fcharras

fcharras commented Dec 12, 2023

Then it would just be "why not to_torch etc.", no?

to_numpy makes much more sense than to_torch, though (commonly installed in a lot of environments, easy to install from default channels, CPU-only, a staple of the ecosystem, reasonably lightweight...), so I don't think this question would come up, in the same way that (I think) the other libraries (cupy, jax, mxnet, etc.) do not have a to_torch-like feature although they do have a to_numpy-like feature. So IMO it makes sense to consider numpy an already commonly adopted tool for portable data containers on CPU, and to overlook for a moment that it is also an array library.

@ogrisel

ogrisel commented Dec 12, 2023

What we need is a standard way to create an Array instance of lib1 (e.g. numpy) from an Array instance of lib2 (e.g. cupy), possibly with cross-device transfers, even if lib1 only works on host and lib2 only works on device.

Special casing numpy would be fine for us (scikit-learn) but I understand that others would like a more library neutral solution to this problem.

@oleksandr-pavlyk
Contributor

Let me reiterate my proposal of to_memoryview, which would copy the content of an array into an object that implements the Python buffer protocol and create a memoryview from it. A NumPy array could be created from the memoryview with zero copy.

This way no single implementation is singled out. The only drawback is that every library must implement one. It could be syntactic sugar around the to_numpy-like function present in every library, as pointed out by @fcharras.
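
For context, the NumPy side of this is already zero-copy once a buffer-protocol object exists; the piece the proposal adds is the producer-side to_memoryview (a hypothetical name) that creates such an object. A small consumer-side sketch, with a plain bytearray standing in for the exported buffer:

import numpy as np

# Stand-in for `mv = x.to_memoryview()` from the proposal; any buffer-protocol
# object would behave the same way on the consumer side.
buf = bytearray(np.arange(6, dtype=np.float32).tobytes())
mv = memoryview(buf)

# Zero copy: the NumPy array is a view of the memory owned by `buf`.
a = np.frombuffer(mv, dtype=np.float32).reshape(2, 3)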

@ogrisel

ogrisel commented Dec 12, 2023

to_memoryview is a first step, but ideally we would also want at least strides and dtype information, and even a standard public API to convert from one namespace to another.

@leofang
Contributor

leofang commented Dec 12, 2023

@ogrisel strides / dtype are all accessible via buffer protocol / memoryview.

I am confused why we are discussing this, though, when DLPack is basically a superset of the buffer protocol. They do the same thing, we just didn't allow DLPack to make a copy before (see a recent user question at dmlc/dlpack#132).

Since xp.from_dlpack() is considered an array constructor, I feel this seems to be the easiest solution:

  1. Add a copy keyword (default to False, the status quo) to from_dlpack(), and
  2. Add a dl_device keyword (default to None, the status quo) to __dlpack__()

Specifically, this allows the following usage:

import numpy as np
import cupy as cp

a_cp = cp.empty((10, 20), dtype=cp.float32)
b_np = np.from_dlpack(a_cp, copy=True)

and similarly

a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)

The reason we need copy/dl_device is to hint the producer that we need to cross the library boundary and so it needs to help us perform necessary actions (i.e., copy) before the consumer can take over. This is needed because a consumer like NumPy is ignorant about GPUs or accelerator devices, and only an accelerator library like dpctl or CuPy can help facilitate this conversion.

def from_dlpack(x, *, copy=False):
    dev_type, dev_id = x.__dlpack_device__()
    if not copy:
        # the existing code path
        if dev_type not in what_I_expected:
            raise BufferError(...)
        else:
            # zero-copy conversion
            capsule = x.__dlpack__(stream=my_stream(dev_id))
            # wrap capsule and return a consumer array
            ...
    else:
        # this proposal
        if dev_type not in what_I_expected:
            capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)  # e.g. dl_device=kDLCPU
            # wrap capsule and return a consumer array
            ...
        else:
            # I think we should just raise here, as no copy is needed and this is a perf hit
            raise BufferError(...)

This hint can then be taken by the producer to target the right device and make a copy. If the consumer's dl_device is not recognized (e.g. we pass kDLMetal to CuPy), the producer raises an error as usual.

This proposal also clarifies whether from_dlpack() can make a copy.

(If we don't like the names copy/dl_device, we can discuss. The "hint" concept is what's essential here.)

I slightly disfavor adding to_numpy()/asnumpy() to the standard. I feel it's a setback to the very nice story that we presented at SciPy, that "NumPy is just another array library; it enjoys no special status."

@fcharras

I like the proposal very much. My first thought about from_dlpack, before trying to use it, was actually that it already filled this need (I thought that the from_dlpack spec hints at that with this note:

The returned array may be either a copy or a view. See Data interchange mechanisms for details.

that does already suggest that it can make a copy. Then I became confused about the usage scope that is really intended).

Minor suggestion: copy=True seems to force a copy regardless of whether it is necessary. For xp.asarray, copy can take the values True (force a copy), False (never copy, raise an error if impossible), or None (don't copy, except if necessary). Could the same logic apply here?

@ogrisel

ogrisel commented Dec 13, 2023

capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)

But what if the source namespace (the library for x) does not understand how to handle my_dev_type? Will it crash?

What would be the public Array API to go through the host as an intermediate step in cases where direct device to device transfer is not possible?

The to_memoryview solution might be a good way to address the original case in the description of this issue, which is about defining a standard way to compare results from two array API libraries in tests, by moving everything to the host and using numpy.testing.assert_allclose or similar to do the comparison.

@rgommers
Member

I like @leofang's proposal. The key missing bit though is a cross-library way to identify devices. Leo proposes dl_device, meaning a simple integer/enum (see https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L74). This does not yet connect to the .device attribute and existing device objects in libraries. And it's hard to understand for users. I think to get this right, we have to be able to glue this together.

See also the device syntax for torch.device and jax.Device, which are already misaligned. And I'm guessing the SYCL version of that is different again?

Maybe a way out is for the device object of each library to gain the same __dlpack_device__ attribute that we came up with for arrays? That way the signature could be

from_dlpack(x, /, *, device=None)
# or with a copy keyword (seems less urgent, the default now is copy=None)
from_dlpack(x, /, *, copy=None, device=None)

and this would allow users to write things like

x2 = from_dlpack(y_lib2, device=x1.device)

There's still no way to write `device='cpu'` then, but that probably isn't a hard necessity and would be harder to add (at least, it would take more API surface).

@leofang
Contributor

leofang commented Dec 13, 2023

@rgommers Note that it's __dlpack__() accepting dl_device, not from_dlpack(). So, this is intended to be managed by the libraries, not by users (it's not user visible, by design).

But, you're right that since from_dlpack() is an array constructor, it's best to accept a device keyword. Then it's the consumer library's job to translate the user-provided device to a dl_device value that it supports.

For NumPy (or any CPU library), device can be None if CPU is the default, or something like "cpu" that the library already supports (ex: PyTorch).

If device is added, then it can take over the role of copy in my proposal, and whether to add copy or not is orthogonal to the present mission. Personally, though, I am in favor of adding it. The quote pointed out by @fcharras above

The returned array may be either a copy or a view. See Data interchange mechanisms for details.

is not a very satisfying semantics IMHO. It's better if we can be sure whether the returned array is a fresh copy or a view, and without using a copy keyword it's not possible to disambiguate. We can fork this issue to discuss further.

@leofang
Contributor

leofang commented Dec 13, 2023

capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)

But what if the source namespace (the library for x) does not understand how to handle my_dev_type? Will it crash?

No, nothing will crash. As I already noted, the producer should raise BufferError if it doesn't know how to handle the consumer's dl_device.

A BufferError can already be raised before this discussion by the consumer (when checking the input array's __dlpack_device__ and not understanding the returned dl_device from the producer). Now we make it symmetric.

What would be the public Array API to go through the host as an intermediate step in cases where direct device to device transfer is not possible?

I think with Ralf's addendum (adding device to from_dlpack()) this is not needed anymore? A user can/should specify a proper device in this case.

But in case it's not addressed, let's consider a hypothetical, extreme case, where the producer holds data on an NVIDIA GPU, and the consumer only supports Intel GPU (for the record I am not aware of such cases 🙂). What you're asking is a double copy (NV -> CPU -> Intel). I feel this should be fully controlled by the user. I would be very reluctant to act smart here, especially when all accelerator libraries try very hard to avoid unnecessary copies.
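
A sketch of what that explicit, user-controlled double copy could look like once the hints discussed above exist (target_xp is a placeholder for the second library's namespace, and the from_dlpack behavior on a device array assumes the proposed hint):

import numpy as np

def transfer_via_host(x, target_xp):
    """Explicit device -> host -> other-device transfer, driven entirely by the user."""
    # Step 1: obtain a host copy. Today this only works if x is already on host;
    # with the proposed hint, the producer could export to kDLCPU instead of raising.
    x_host = np.from_dlpack(x)
    # Step 2: the target library copies host memory onto its own device.
    return target_xp.asarray(x_host)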

@rgommers
Member

Note that it's __dlpack__() accepting dl_device, not from_dlpack(). So, this is intended to be managed by the libraries, not by users (it's not user visible, by design).

Ah yes, I think this is converging - should work like this.

As I already noted, the producer should raise BufferError if it doesn't know how to handle the consumer's dl_device.

Not sure about this one - I think it should export the data on the current device instead (i.e., no change from today - leave the raising to the consumer). Think about a situation where CuPy (CUDA-only) sees a CPU array and decided to ask the producer for dl_device=CUDA. If the producer is NumPy, it cannot do that. But if NumPy exports CPU memory, CuPy is still able to consume that (it can do the device transfer on its side). There shouldn't be a need to raise, if dl_device is considered as a hint like you said.

And +1 to not allowing double transfers, that'd be taking it a bit too far.

@leofang
Contributor

leofang commented Dec 14, 2023

Not sure about this one - I think it should export the data on the current device instead (i.e., no change from today - leave the raising to the consumer). Think about a situation where CuPy (CUDA-only) sees a CPU array and decided to ask the producer for dl_device=CUDA. If the producer is NumPy, it cannot do that. But if NumPy exports CPU memory, CuPy is still able to consume that (it can do the device transfer on its side). There shouldn't be a need to raise, if dl_device is considered as a hint like you said.

Yes, this afternoon I was staring at the snippet in my earlier comment

a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)

and realized this loophole too. Good catch, @rgommers! I blame early mornings...🙂

API-wise, I think nothing that we discussed so far needs to change. Semantics-wise, we just need a minor correction to help with this case: Instead of

to hint the producer that we need to cross the library boundary

it should be "to hint the producer and consumer that we need to cross the library boundary."

Whoever is capable of handling it should take care of it, but since the producer usually is the one doing most of the work (it takes the stream and now dl_device arguments from the consumer and returns a capsule), it should give it a shot first. If the producer fails, then the consumer takes over and retries. We raise BufferError only if both fail.

Happy to see we're converging! Let us summarize the behavior of from_dlpack() based on the above discussion. (No pseudo code this time, as the logic becomes a bit convoluted. It's easier if I just use bullet points to illustrate.)

For def from_dlpack(x, *, device=None):

  1. device is None: Consumer uses its default/current device. (status quo)
  2. device is not None: Use device to determine Consumer device type/ID. (new)
  3. Once Consumer device type/ID are determined:
    1. Producer device type == Consumer device type, Producer device ID == Consumer device ID:
      a. zero-copy conversion (status quo)
    2. Producer device type == Consumer device type, Producer device ID != Consumer device ID:
      a. copy or raise (status quo)
    3. Producer device type != Consumer device type:
      a. Producer tries to perform copy, if not working then raise BufferError, then (new)
      b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
      c. raise BufferError to user (new)

Note that for Step 2, while it's handy to have a __dlpack_device__() method defined for the device instance, it is not required, because it's supposed to be a Consumer device (or something that it recognizes), and Consumer knows how to retrieve the device type/ID from it.
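
Translating the device-only bullet points above into rough consumer-side pseudocode (every underscore-prefixed helper is a placeholder, in the same spirit as the earlier snippet):

def from_dlpack(x, *, device=None):
    dev = _default_device() if device is None else _resolve_device(device)  # steps 1-2
    prod_type, prod_id = x.__dlpack_device__()

    if prod_type == dev.type and prod_id == dev.id:
        # 3.1: same device type and ID -> zero-copy import (status quo)
        return _wrap(x.__dlpack__(stream=_stream_for(dev)))
    if prod_type == dev.type:
        # 3.2: same type, different ID -> copy across devices or raise (status quo)
        return _copy_to(dev, _wrap(x.__dlpack__(stream=_stream_for(dev))))
    # 3.3: different device types -> ask the producer for help first (new)
    try:
        capsule = x.__dlpack__(stream=_stream_for(dev), dl_device=(dev.type, dev.id))
    except BufferError:
        # producer could not help; fall back to a plain export and copy on the
        # consumer side, surfacing BufferError to the user only if that also fails
        capsule = x.__dlpack__()
        return _copy_to(dev, _wrap(capsule))
    return _wrap(capsule)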

As a thought exercise, let's also see how adding copy affects the logic. For def from_dlpack(x, *, device=None, copy=False):

  1. device is None: Consumer uses its default/current device. (status quo)
  2. device is not None: Use device to determine Consumer device type/ID. (new)
  3. Once Consumer device type/ID are determined:
    1. copy is False, Producer device type == Consumer device type, Producer device ID == Consumer device ID:
      a. zero-copy conversion (status quo)
    2. copy is False, Producer device type == Consumer device type, Producer device ID != Consumer device ID:
      a. always raise (~new?)
    3. copy is False, Producer device type != Consumer device type:
      a. always raise (status quo)
    4. copy is True, Producer device type == Consumer device type, Producer device ID == Consumer device ID:
      a. issue a performance warning, then make a copy (even though zero-copy conversion is possible) (new)
    5. copy is True, Producer device type == Consumer device type, Producer device ID != Consumer device ID:
      a. make a copy (new)
    6. copy is True, Producer device type != Consumer device type:
      a. Producer tries to perform copy, if not working then raise BufferError, then (new)
      b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
      c. raise BufferError to user (new)

This is partly why I feel @fcharras's suggestion #626 (comment) for allowing more than two possible copy values is a bit challenging to maintain: The logic gets quite complicated.

If everyone is happy, I can create a PR later this week or next.

@fcharras

I would be happy with both of those. I slightly prefer the version with copy. Regarding copy=None, wouldn't it now simply amount to wrapping the call in a try/except block at the Python level, with this logic:

try:
    array_out = from_dlpack(array_in, device=device, copy=False)
except BufferError:
    array_out = from_dlpack(array_in, device=device, copy=True)

@seberg
Contributor

seberg commented Dec 14, 2023

There are two ways. Either a try/except to have "if needed" copy logic, or allow copy=None/if_needed with requests, but add a flag for the producer to indicate that the returned array is a copy (which still can mean that with copy=True it must be a copy and be indicated as such).

To me, the only "device" request anyone should (maybe actually must, since I doubt it is a big burden) support is plain CPU. Beyond that, I don't think there is a point (i.e. up to the implementor, if they want to support moving between GPU and odd-hardware, that is their choice).

Another question is whether there is a need/gain for such a request to support being a set, so that I can say that any of CPU, cuda_managed, ... is OK (all CPU available).
The alternative would be to be able to query which exports are supported.

Unfortunately, all of this is bloating the spec. It now has multiple request APIs and queries in addition to the original exchange; something it originally did not have and didn't want to have.

@ogrisel

ogrisel commented Dec 14, 2023

Indeed, at the moment I do not see a use case for direct, heterogeneous device to device transfers without going through the host either. So, if specializing the particular case of "transfer from device to host" can make it simpler to implement and adopt then personally I am fine with this.

@rgommers
Member

3. Producer device type != Consumer device type:
a. Producer tries to perform copy, if not working then raise BufferError, then (new)
b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
c. raise BufferError to user (new)

This does not look quite right to me. If it's a hint and the producer cannot satisfy it, it should simply export the data on its current device (just like now) - no need to raise an exception or have any extra logic here. And step (c) here is already the status quo.

As a thought exercise, let's also see how adding copy affects the logic. For def from_dlpack(x, *, device=None, copy=False):

That seems like a bit too much. copy should be reserved for its current meaning I think, namely a memory copy on the same device. Data transfer between devices is much more expensive typically, and so far we have treated that as an orthogonal concept.

@rgommers
Member

To me, the only "device" request anyone should (maybe actually must, since I doubt it is a big burden) support is plain CPU. Beyond that, I don't think there is a point (i.e. up to the implementor, if they want to support moving between GPU and odd-hardware, that is their choice).

It's actually also helpful for safety when using stream=, because right now the producer must assume that the consumer did everything correctly and that the stream number is correct for the device that the data is already on. Receiving both stream number and device type is more robust.

@leofang
Contributor

leofang commented Dec 14, 2023

Regarding copy=None, now wouldn't it simply amount to wrapping the call with this logic in a try/except block at the python level

@fcharras See also #721. Maybe let's move the copy discussion there?

  1. Producer device type != Consumer device type:
    a. Producer tries to perform copy, if not working then raise BufferError, then (new)
    b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
    c. raise BufferError to user (new)

This does not look quite right to me. If it's a hint and producer cannot satisfy the hint, it should simply export the data on its current device (just like now) - no need to raise an exception or have any extra logic here. And step (c) here is already the status quo.

Sorry, @rgommers, this is a limitation of not writing actual code to deliver the concept...

What I meant by "tries to perform copy" for step b is: Consumer asks Producer to again return a capsule, but this time without passing its dl_device to __dlpack__(). If you think about it, this is the only way to achieve what you said ("simply export the data just like now"), right? Without getting a capsule from Producer, Consumer can't do anything. Catching an exception from Producer is merely a way to signal Consumer to take action.

In terms of code:

try:
    capsule = x.__dlpack__(stream=stream, dl_device=dl_device) 
except BufferError:
    # this should work as it does today
    # also, since Producer can't handle dl_device, we may not have a valid stream here?
    capsule = x.__dlpack__(stream=stream)  

Unfortunately, all of this is bloating the spec. It now has multiple requests API and queries additionally to the original exchange; something that it original had not and didn't want to have.

@seberg yes and no.

  • Yes, originally using DLPack was mainly aiming for zero-copy exchange.
  • Yes, this proposal "bloats" the spec.
  • No, we also wanted a cross-device data exchange protocol, and we OK'd that, if a copy is unavoidable, a library can choose to make one; we are just formalizing/strengthening that here.

So I wouldn't say the spec is bloated, it's a gap being filled 😄

That seems like a bit too much. copy should be reserved for its current meaning I think, namely a memory copy on the same device. Data transfer between devices is much more expensive typically, and so far we have treated that as an orthogonal concept.

@rgommers. As I said, it was just an exercise. But, I disagree that copy is only for the same device. Recall the position that from_dlpack() is an array creation function, so in terms of semantics for device and copy it should stay the same as those of asarray(). Not honoring that is more confusing IMHO.

It's actually also helpful for safety when using stream=, because right now the producer must assume that the consumer did everything correctly and that the stream number is correct for the device that the data is already on. Receiving both stream number and device type is more robust.

Agreed. Even if zero-copy is not feasible and Producer has to perform a copy, we want it to do it asynchronously if possible, and to do so we need a stream/queue.

@seberg
Contributor

seberg commented Dec 14, 2023

I still think that there should/must(?) be a flag to indicate if the return is a copy or not. I also think it would make sense to have a nocopy "request" API since asarray at least has the ambition to have that. (nocopy, since apparently copies were never strictly forbidden.)

@rgommers
Member

I still think that there should/must(?) be a flag to indicate if the return is a copy or not.

Why? It semantically does not matter, it is only a performance issue. Remember there's no concept of a view.

@oleksandr-pavlyk
Contributor

@leofang's and @rgommers' proposal to add a device keyword to from_dlpack also opens up the possibility of interoperability between frameworks, e.g. a oneAPI implementation may import data with kDLCUDA, provided the appropriate plugin is available.

This would be the case where oneAPI can deal with the capsule as originally given by the producer. For CuPy to import a oneAPI capsule, DPC++ calls must be made to unbox SYCL objects into native objects, and hence CuPy would need to ask the oneAPI library to give a kDLCUDA DLPack capsule after recognizing that it cannot deal with the kDLOneAPI one directly. Provided that succeeds, it would still be a zero-copy transfer.

We should give the user control over when it is OK to perform a data transfer via host (is that the copy keyword?). It is better to be explicit here. We also need to realize that copying via host may not be readily available across implementations for some time.

@seberg
Contributor

seberg commented Dec 14, 2023

Why? It semantically does not matter, it is only a performance issue.

  1. Because we have copy=True/False/None in asarray() and without that you cannot mirror it for from_dlpack(). At least not without defensive copies. And to me it seems like a very reasonable request to add such a kwarg.
  2. Because for DLPack to be a success, array-api users shouldn't be the only DLPack users. So even if pure array-api use may never care, other users (which may not be classical array-libraries at all) may care more. So I think a simple flag is worth the trouble.

@rgommers
Member

So even if pure array-api use may never care, other users (which may not be classical array-libraries at all) may care more.

That's a good argument I think.

Because we have copy=True/False/None in asarray() and without that you cannot mirror it in for from_dlpack(). At least not without defensive copies. And to me it seems like a very reasonable request to add such a kwarg.

Ah wait, I think I misread your comment about a "flag". If you meant "+1 to adding a copy keyword", then sure. What I understood was a request to let the producer pass back a new flag to say if a copy was actually made if the consumer asked for None (as-needed).

@seberg
Contributor

seberg commented Dec 14, 2023

No, I didn't just mean +1 for copy kwarg. If we add a nocopy "request" (which may be encoded as a copy kwarg on __dlpack__) then I additionally want a new flag (additional to readonly, etc.) that encodes copy or not (or maybe it should be view or not?).

It is just such a trivial addition, and in numpy we are annoyed about missing it: with asarray(copy=True/False/None) we have to be very defensive, both in erroring and in doing additional copies.
E.g. also think of out= kwargs, which must know whether it is a view! Thus, I want to make sure this flag isn't forgotten when the rest of this is added.

@rgommers
Member

rgommers commented Dec 14, 2023

Some discussion today about whether it is useful to add a device ID in the hint, rather than only a device type. Wasn't entirely clear whether there was a strong enough use case - something that would need a bit more thought perhaps.

then I additionally want a new flag (additional to readonly, etc.) that encodes copy or not (or maybe it should be view or not?).

Okay - seems like that would be a DLPack C level change first, right next to the readonly flag.

@seberg
Contributor

seberg commented Jan 18, 2024

Since Leo asked if we can push forward with this, I created: dmlc/dlpack#134. I would strongly prefer pushing both hand-in-hand, because otherwise things tend to get lost and bite us later. But it is a trivial change in DLPack now that we have flags.

asmeurer added a commit to data-apis/array-api-strict that referenced this issue Jan 22, 2024
This way, it does not appear that "cpu" is a portable device object across
different array API compatible libraries. See
data-apis/array-api#626.

Original NumPy Commit: 3b20ad9c5ead16282c530cf48737aa3768a77f91
@leofang
Contributor

leofang commented Feb 8, 2024

Since Leo asked if we can push forward with this, I created: dmlc/dlpack#134. I would strongly prefer pushing both hand-in-hand, because otherwise things tend to get lost and bite us later. But it is a trivial change in DLPack now that we have flags.

Yes, I agreed with the assessment. Created #741 and dmlc/dlpack#136 to address this issue.
