-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python Array API Standard #109
Comments
Do we have someone representing us on the array side? If not, I'd be happy to volunteer. Though would need to get plugged into the meetings somehow |
cc @rgommers for visibility |
Thanks @jakirkham! So far it's been mainly @TomAugspurger and @shoyer who contributed Dask expertise. Would be great to have you join. Also happy to have a call soon to catch you up on what's been happening. I'll try and collect a few of the Dask-related points (there weren't many at all in the past few months) and will get back to you. Also, let me point out the NumPy Enhancement Proposal for adoption of the array API standard: numpy/numpy#18456 |
Sounds good. Happy to do some reading as well (guessing there's lots to catch up on). That would be great. Thank you Ralf! 😄 Yep saw that. Will take a look at that as well 🙂 |
The one issue with an unresolved Dask-related angle is data-apis/array-api#97 (related to data/value-dependent output shapes and how to represent "unknown shape" - Dask uses data-apis/array-api#84 on including boolean indexing that's hard for Dask to support may also be good to review. Discussed concluded with color-coding that indexing mode as "not all libraries may support this, it's hard in combination with delayed evaluation and JIT compilation" (not implemented yet, that's why the issue is still open). Other than that, I'd suggest looking at https://data-apis.github.io/array-api/latest/ and https://numpy.org/neps/nep-0047-array-api-standard.html for things that are suboptimal or missing from Dask's perspective. |
Commented on the Dask does support a fair bit of boolean indexing (though we may find some edge cases that don't work). However as the mask itself may have unknown values, this falls in the category of things where the array's content affects the array shape and so the shape of the array is unknown. I'm not sure I'd write this off as not supported, but it's also possible I don't understand what supported would mean in the context of the spec or the cases you have in mind. So more context here would be welcome Ok will read up on this before we talk more 🙂 Edit: It would be good to discuss |
Also related to Data API framework. Generally we reach for and write functions to check if they are generally an Array, DataFrame, Series, etc. ( dask/dask#7327 ) Not sure if there are general tools scoped out in this framework to make those checks easier, but that would be very helpful for Dask |
Interesting. AFAIK there is not yet a way to distinguish a Data API array from a Data API dataframe. But since the latter is still in a draft status (more "drafty" than the former, as it's not yet made public), perhaps we can discuss about it? |
There is no
Arrays which support the standard, or from which you can at least retrieve the standard namespace with a compliant implementation, have a |
OK. With the ongoing discussion with the dataframe API, a similar check can be done for dataframes: |
Already enforced by them being methods I think - attributes can easily have side effects, methods can't. Of course Python is dynamic in nature, so you could come up with some hack that has an unwanted side effect, but it's highly unlikely to occur in practice. |
If folks wanted to start using the compliance suite Generally, please feel free to reach out via issues for any questions/bugs. We want to make the suite easier to use for implementors, so any feedback would be appreciated. |
Thanks @honno for providing the wrapper and the test suite. There are likely to be some compatibility issues with using
whereas in the Array API it is
This means that an expression like It might be sensible therefore to use a different module, like Running the test suite I get:
So: a few failures, as might be expected, but also lots of tests passing, which is very encouraging. A few caveats: I ran with only a maximum of two hypothesis examples just to do a quick check at this stage. Increasing it would almost certainly find more failing cases. I also used the following
The next steps might be to investigate the failures, and to implement some of the functions that are missing in Dask (some definitely easier than others). |
Thanks for pushing on this @tomwhite! 😄 Nice to see some initial success here. Agree we would want a different Yeah this is a good point that really hasn't come up in our discussions thus far @rgommers @kgryte (likely because it is very Dask specific), but creation functions in Dask need a Looking at the failures. Guessing Seems we fallback to auto chunking in a lot of cases. Idk how desirable that is. It is simpler to implement, but may have performance costs depending on what chunks are needed and what are used. |
Great to see Dask is decently compliant already! Thanks for the wrapper @tomwhite.
Ah assuming you mean the failing
Yep this is known, hopefully you can keep relying on
Ah this is interesting, it might be a good idea then for the suite to test shapes last so it can also get to test the dtype and the out values are as expected. |
unique() also falls into that category of functions that have value-based shapes that we should allow skipping entirely with dask. |
Thanks for all the comments. I've filed Dask issues for some of these errors: dask/dask#8667, dask/dask#8668, dask/dask#8669, dask/dask#8670.
Filed dask/dask#8669. I tried a local fix, and it fixes all the
This is dask/dask#8670.
+1 I also noticed that all of the special case tests fail because they rely on boolean indexing, which - while supported in Dask - is lazy and causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling |
Yep we've been phasing out masking throughout the suite, and it's just special cases left! (Special cases will get generally overhauled in the next two weeks, we just left it alone because the spec's RST conversion was blocking until now.) |
Is it ok to always have to call |
Not really. Calling |
Yes, it is always fine to extend the array API standard signatures with extra keywords that are library-specific. At that point of course you lose portability, but that's fine. You have a default that is reasonable, and then a Dask-specific way to tweak The array API standard signatures making heavy use of keyword-only arguments helps here. If you keep |
With these fixes the number of unexpected failures is now down to 11: https://github.com/tomwhite/dask/runs/5142135752?check_suite_focus=true. Most of them seem to be indexing issues. |
@tomwhite Ah it looks like the Ah and |
Have reviewed/merged 3 of those. The 4th one still has a pending review comment. |
Thanks @honno and @jakirkham! |
@tomwhite I pushed some improvements to the test suite per your work here, thanks! I stopped generating 0d arrays in >>> from dask import array_api as dxp
>>> x =d xp.asarray(0)
>>> out = dxp.argmin(x)
>>> int(out)
ValueError: At least one iterator operand must be non-NULL
...
>>> from numpy import array_api as nxp
>>> nxp.argmin(nxp.asarray(0))
Array(0, dtype=int64) # i.e. the first index Can Dask not support 0d inputs for RE Per @asmeurer's suggestion I've added a |
I'm not sure how dask works, but I would assume that so long as you avoid operations that produce data-dependent shapes ( |
I've opened dask/dask#8750 to provide the Dask implementation. Comments very welcome. |
Hi all,
A group of folks from various projects / companies have put together a draft of an Array API standard, and are looking for feedback.
We have lots of experience working with Arrays, both as consumers of arrays implementing the NumPy API (cupy, sparse) and implementing the API ourselves. Our feedback on the proposed standard would be welcome in the GitHub repo linked above.
We can use this issue to discuss if and how
dask.array
should adopt this standard. The document gives some guidance at https://data-apis.github.io/array-api/latest/purpose_and_scope.html#how-to-adopt-this-api, but this is fairly uncertain territory.The text was updated successfully, but these errors were encountered: