Numpy arrays ctors and flags rework #338
Hi Ivan,
|
@wjakob So, regarding the stricter casts: currently, if something fails with casting, it just basically sets the error and bails out. However, consider:

```cpp
/// Numpy function which only accepts specific data types
m.def("selective_func", [](py::array_t<int, py::array::c_style>) { return "Int branch taken."; });
m.def("selective_func", [](py::array_t<float, py::array::c_style>) { return "Float branch taken."; });
m.def("selective_func", [](py::array_t<std::complex<float>, py::array::c_style>) { return "Complex float branch taken."; });
```

It's not actually very numpy.h-specific; more generally, if we fail in object creation/casting early, we have exception data but we can't do overloads; if we fail later, overloads work, but exception info is lost. I don't know, maybe I'm missing something :) |
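The dtype-based dispatch those overloads provide can be sketched in plain Python (a rough analogue assuming only NumPy; `selective_func` here is a hypothetical stand-in for the bound overload set, not pybind11 machinery):

```python
import numpy as np

# Rough Python analogue of the overload dispatch above: pick a branch
# by exact dtype, and raise TypeError when nothing matches -- mirroring
# pybind11's "Incompatible function arguments" behaviour.
def selective_func(arr):
    branches = {
        np.dtype(np.int32): "Int branch taken.",
        np.dtype(np.float32): "Float branch taken.",
        np.dtype(np.complex64): "Complex float branch taken.",
    }
    try:
        return branches[np.asarray(arr).dtype]
    except KeyError:
        raise TypeError("incompatible function arguments") from None

print(selective_func(np.zeros(3, dtype=np.int32)))  # Int branch taken.
```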
Following up on the last comment, to give a specific example from the test suite (it's actually the only one that failed, and quite legitimately so; the rest of the test suite passed just fine after the rework):

```
> assert np.all(symmetric_lower(asymm) == symm_lower)
E TypeError: Incompatible function arguments. The following argument types are supported:
E     1. (arg0: numpy.ndarray[int32[m, n]]) -> numpy.ndarray[int32[m, n]]
E Invoked with: [[ 1  2  3  4]
E  [ 5  6  7  8]
E  [ 9 10 11 12]
E  [13 14 15 16]]
```

(`__str__()` doesn't help much here) -- versus:

```
> assert np.all(symmetric_lower(asymm) == symm_lower)
E RuntimeError: Failed to convert NumPy array (Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe')
```

(the actual error that NumPy emits) |
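The NumPy-side error quoted above is easy to reproduce directly (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# int64 -> int32 narrowing violates the 'safe' casting rule, so NumPy
# raises a TypeError with the message pybind11 wraps above.
a = np.arange(4, dtype=np.int64)
try:
    a.astype(np.int32, casting='safe')
    err = None
except TypeError as e:
    err = str(e)

print(err)
```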
@aldanor: I think a compromise should be possible. You can throw [...] -- this type caster could work just like the one for [...]. |
Ah yes, that's a great idea, thanks! I think it should work. I'll get on with it and ping you when there's something ready.

Actually, there's one more thing. Currently, we call [...]. Now, when we're converting an unknown object into an array, there are two options -- we don't have a dtype, just the flags ([...]). The way [...]:

```python
>>> import numpy as np
>>> import pandas as pd
>>> casting = 'no', 'equiv', 'safe', 'same_kind', 'unsafe'
>>> types = 'int32', 'int64', 'double'
>>> pd.DataFrame([(f, t, c, np.can_cast(f, t, c))
...               for f in types for t in types for c in casting if f != t],
...              columns=['from', 'to', 'casting', 'can_cast']).set_index(['from', 'to', 'casting'])
                        can_cast
from   to     casting
int32  int64  no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
       double no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
int64  int32  no           False
              equiv        False
              safe         False
              same_kind     True
              unsafe        True
       double no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
double int32  no           False
              equiv        False
              safe         False
              same_kind    False
              unsafe        True
       int64  no           False
              equiv        False
              safe         False
              same_kind    False
              unsafe        True
```

It would be nice to be able to relax casting rules down to "same_kind" or tighten them to "equiv" for ctors / type casters, but I'm not quite sure how it should be expressed. I thought of something like this:
Also, it's not clear where this casting policy should go. It's not really a flag, more like an enum, and there are already flags in the type signature (which I'm not too fond of either). Also, I think it matters mostly for function signatures (both the input / output types), because in other parts of the code you're always free to do [...]. |
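For quick reference, the key rows of the `can_cast` table above can be spot-checked directly (a sketch, assuming NumPy):

```python
import numpy as np

# Expected results lifted from the can_cast table in the comment above.
expected = {
    ('int32', 'int64', 'safe'): True,         # widening int is safe
    ('int64', 'int32', 'safe'): False,        # narrowing int is not
    ('int64', 'int32', 'same_kind'): True,    # but stays within the 'i' kind
    ('double', 'int64', 'same_kind'): False,  # float -> int changes kind
}
actual = {args: bool(np.can_cast(*args)) for args in expected}
assert actual == expected
```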
It seems to me like 4.a) in your list is safer. The reason for adding these enums/flags as template parameters of the type is that it makes it convenient to annotate the conversion intent in binding declarations like:
I'm open to other suggestions but generally think that this is too nice to give up. |
I'm thinking casting rules would probably have to go in a separate type parameter, since it makes sense to have an [...]. Since this would be a third type parameter though, to specify casting you would also need to specify the flags :/ |
@wjakob I've sketched a rough initial prototype (without the casting at the moment) which compiles and the existing tests pass, so whenever you have time... :) aldanor#2. I've opened it on my repo again since it's largely unfinished, may get completely mangled from the ground up, and I'll have to do large rebases anyway later on. It's just to be able to get some early feedback and have somewhere to post the prototype code. |
@wjakob Ok, I'm in the process of rewriting the initial flags branch (given all the recent big changes like the ctors rework; plus many features from the original PR have already been implemented, like NumPy C API access). One question that emerged is this: how do we encode the [...]? One problem is that
Here are a few points:
(One other idea I had was that if a certain thing (like a writeability requirement for arrays, or alignedness) is only required for converting incoming arguments, it could go into the "extra" arguments pack that we use for return value policies etc.; but that would probably be a mess, so I'm not suggesting that.) I also just realized (correct me if I'm wrong) that [...] |
Hi @aldanor, apologies for the delay. Regarding
Potentially one way to resolve this discussion is to simply not use ExtraFlags for case 1. AFAIK ExtraFlags is currently always zero in this case (it's the default argument of [...]). IMHO, the only "real" application of [...] |
Ok, here's a question then -- do you think that this (current master) behaviour is correct / obvious? I would personally consider it extremely confusing, error-prone and unintuitive.

```cpp
#include <pybind11/numpy.h>

PYBIND11_PLUGIN(test_array_t) {
    namespace py = pybind11;
    using farray = py::array_t<int, py::array::f_style>;
    return py::module("test_array_t")
        .def("new_farray_1", []() -> farray { return farray({10, 20}).cast<py::object>(); })
        .def("new_farray_2", []() { return farray({10, 20}); })
        .ptr();
}
```

In Python:

```python
>>> from test_array_t import *
>>> new_farray_1().flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True    # ok
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> new_farray_2().flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False   # oops
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
```
|
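For comparison, NumPy itself never silently changes memory order when no conversion is required; a minimal check (assuming NumPy):

```python
import numpy as np

# An F-ordered array passed through asarray() with no new requirements
# comes back as the very same object -- order and flags intact.
a = np.zeros((10, 20), order='F')
assert a.flags['F_CONTIGUOUS'] and not a.flags['C_CONTIGUOUS']

b = np.asarray(a)
assert b is a
assert b.flags['F_CONTIGUOUS']
```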
As far as I see, the options are:
What do you think? |
Hi @aldanor, I'm curious whether there's any news here? If you think the flags changes will still take a while, do you think there's a way to push out a new release now with some smallish breaking changes that would allow your patch to go into a minor revision? (such as removing forcecast) Wenzel |
@wjakob Hi! Sorry for the delay, end-of-year madness :) It turned out I'll have the entire next week off (after which I'll be gone/inaccessible for half a month), so I'll try to attack the darned flags thing again in a few days. If things are still in flux by next Friday, then yeah, I concur: removing forcecast and fixing up a few tests is the minimum we should do. |
Re: your comment above, I agree; we can add a bunch of static asserts (e.g., if you specify c-style / f-style, you can't provide strides anymore). Btw: should the same logic apply to contiguity flags vs. being able to access the [...]? |
I think it's important to be able to access the .data() pointer in any case. The way to access it is dictated by the strides, and violating that contract may obviously lead to unexpected results. So I don't think there is anything that needs to change about that function. |
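The "access is dictated by the strides" contract can be illustrated with NumPy itself (a sketch, assuming NumPy):

```python
import numpy as np

# A transpose is a view: same buffer, same data pointer, different strides.
a = np.arange(12, dtype=np.int32).reshape(3, 4)   # C-order, strides (16, 4)
t = a.T                                           # view with strides (4, 16)
assert a.strides == (16, 4)
assert t.strides == (4, 16)

# Both expose the same data pointer; the strides alone decide how to
# walk the buffer, so t[i, j] is the element stored as a[j, i].
assert t.__array_interface__['data'][0] == a.__array_interface__['data'][0]
assert t[2, 1] == a[1, 2]
```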
Hi @aldanor, just a status update on the releases: I'd like to push out a v2.0.0 pre-release next week before Christmas (followed by a feature freeze for stabilization). I'll remove the forcecast feature now -- if you can still land your patch soon, we can include it. Otherwise, I'd say let's postpone it to within the 2.x cycle. |
@wjakob Hi! Yea, all good, grats on the 2.0 release (and happy NY!), good timing! I've a bit of spare time on my hands now, so I'm continuing the work on this issue, albeit after a considerable delay. I wanted to discuss a separate problem that is related to this (as it's directly related to conversion) but which we haven't covered explicitly. Maybe it makes sense to consider it part of this rework. Check out this example:

```cpp
#include <cstdint>
#include <pybind11/numpy.h>

namespace py = pybind11;

struct Foo {
    const int32_t *ptr;
    Foo(const int32_t* ptr) : ptr(ptr) {}
    int32_t value() const { return *ptr; }
};

PYBIND11_PLUGIN(arr_conv) {
    py::module m("arr_conv");
    py::class_<Foo>(m, "Foo")
        .def("__init__", [](Foo& self, py::array_t<int32_t, 0>& arr)
            { new(&self) Foo(arr.data()); }, py::keep_alive<0, 1>())
        .def_property_readonly("value", &Foo::value);
    return m.ptr();
}
```

and this test case:

```python
import arr_conv

l = [42] * 10000
f = arr_conv.Foo(l)
print(f.value)
l = [43] * 10000 + l
print(f.value)
```

What do you expect it to print? :) (and also, what do you expect a person not familiar with pybind11 internals would expect it to print?...) |
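The heart of the problem can be shown with NumPy alone: converting a list always allocates fresh memory owned by the temporary array, so a raw pointer kept past that temporary's lifetime dangles (a sketch, assuming NumPy):

```python
import numpy as np

l = [42] * 10000
a = np.asarray(l)          # list -> array: a fresh buffer is allocated
assert a.flags['OWNDATA']  # the array owns (and will free) that buffer

# In the C++ example, only arr.data() survives -- once the temporary
# converted array dies, that pointer points into freed memory.
b = np.asarray(a)          # already an array: no conversion, no copy
assert b is a
```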
@wjakob So, on my machine this example prints something like "42 0" most of the time (due to the memory being reallocated); it could also print "42 42" sometimes. The problem here is that [...].

Unless I'm missing something, I think this is a design flaw on our side: by doing conversion eagerly and unconditionally, we lose the ability to request that the input argument be an "lvalue" array, i.e. an array that already exists and doesn't get converted because it already satisfies all requirements (dtype, extra flags for array_t). This is also very important when you deal with huge arrays nearing the RAM limit and you have to be sure no implicit copies are made.

Another important reason is the following distinction: we don't expose internal pointers for any other Python objects (e.g. list, dict, anything else) via C++, so all interactions go through Python; however, this is not true for numpy arrays, since you can request a pointer via [...].

It may have been appropriate to choose not to do conversions in ctors of [...].

If you agree with my point, maybe we can open a separate issue for discussion to see what people think. |
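NumPy's own conversion semantics make the "lvalue array" distinction concrete: `asarray` is a no-op when all requirements are already met, and a silent copy otherwise (a sketch, assuming NumPy):

```python
import numpy as np

a = np.arange(10, dtype=np.int32)

same = np.asarray(a, dtype=np.int32)
assert same is a               # requirements already met: no copy

conv = np.asarray(a, dtype=np.int64)
assert conv is not a           # dtype mismatch: a silent copy was made
conv[0] = 99
assert a[0] == 0               # the original is untouched by the copy
```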
I agree with most of the issues raised here, as well as some ideas discussed in other issues/PRs. When dealing with large, memory-intensive datasets, I need to be able to read through code and know that copies are not made (or, if they are, that I explicitly allowed it). There are several related conversations happening about the same general theme of array type handling. This issue itself is really the precursor discussion to what will likely be a collection of issues. Could I suggest that we make an umbrella issue (like [1]) or a GitHub milestone with a birds-eye view of the different ideas? I'd like to stay up to date but maybe don't need the granularity of following every single issue. Also, if I can be of help, please let me know.

[1] https://github.com/facebook/react/ issues/7925 (intentionally mangled URL so it doesn't reference over there) |
Yeah... how do you set flags for a pybind11::array? |
@wjakob This is a discussion / brainstorming issue for flags-related stuff in the numpy API. Here's an unordered collection of my thoughts about it resulting from digging around the numpy/pybind11 source; please feel free to comment:

- We set `NPY_ARRAY_FORCECAST` by default, which is very bad. NumPy will then happily convert anything to anything even if it's complete bollocks (this triggers unsafe casting mode), which doesn't play well for either input arguments or return values and quite often yields surprising results.
- Remove the `forcecast` option completely, as it doesn't make much sense and is contradictory. You can only sensibly use it for strongly typed `array_t<T, array::forcecast>`, which on the one hand implies that you actually do want `T`, but on the other will almost completely disregard the array's dtype because of forcecast. If you want this type of behaviour, you can always accept just an `array` and then do `.astype()` (see below) -- which would be a lot more precise because you can specify casting rules. I can't think of a single legitimate example where you would use forcecast either for input arguments or return values -- if you can think of any, I'm all ears :)
- NumPy has `NPY_ARRAY_DEFAULT`, which is comprised of `NPY_ARRAY_C_CONTIGUOUS`, `NPY_ARRAY_WRITEABLE` and `NPY_ARRAY_ALIGNED` (this in particular is a very sensible default). There's also `NPY_ARRAY_OUT_ARRAY`, which is the same as `NPY_ARRAY_DEFAULT`, and `NPY_ARRAY_IN_ARRAY`, which is the same thing but without the "writeable" bit. If you think about it, most of the time input arguments should not require writeability unless the purpose is to mutate them (dropping the writeable flag from requirements would avoid having numpy make an unneeded copy in some cases). It would be nice to be able to easily specify that.
- `array_t` calls `PyArray_FromAny`, which is a universal conversion function "from anything". While it's nice on its own and it would be beneficial to expose it separately (e.g. a hypothetical `::from_object()` static method), I believe it shouldn't be called in the ctor. Instead, the ctor should check that the object is already an array (`PyArray_Check`) and then call the array conversion routine (`PyArray_FromArray`), which also benefits from checking the casting rules (only two are available here: safe / force, but that should be sufficient for ctor purposes).
- Add an `::astype(dtype, casting = safe) -> array` method on `array`, and also an `::astype<T>(casting = safe) -> array_t<T>` method (the flags should be preserved from the caller). Here we can accept all 5 casting types (e.g. `array::casting::same_kind`).
- Flags can't currently be specified for `array`, whereas it may sometimes be beneficial (at least controlling the writeability). Obviously, the `forcecast` flag doesn't apply here, but numpy handles redundant flags the same way -- some routines ignore some flags. This would mean that the ctor of `array` would be almost the same as that of `array_t`, calling `PyArray_CheckAny` and then `PyArray_FromArray`.
- Do we need the `ensurecopy` flag? When would it be used? (in light of the coming changes that would allow providing an owner for the array so data is not copied)
- Add aliases like `array::in(...)` or `array_t<int>::out(...)` or `array_t<int, array::forcecast>::out` that would set the proper set of flags (with `array_t()` being essentially the same as `array_t::out`). Most of the time just these two sets of flags would be used, I believe (input arguments and output values, respectively).
- All of this applies to both `array` and `array_t`, plus maybe the aliases and a few new methods as described above.
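The proposed `::astype(dtype, casting)` maps directly onto NumPy's own `ndarray.astype`, whose `casting` parameter already behaves as described (a sketch, assuming NumPy):

```python
import numpy as np

a = np.arange(4, dtype=np.int64)

# 'same_kind' permits int64 -> int32: both are of integer kind.
b = a.astype(np.int32, casting='same_kind')
assert b.dtype == np.int32

# 'safe' rejects the very same narrowing conversion.
try:
    a.astype(np.int32, casting='safe')
    rejected = False
except TypeError:
    rejected = True
assert rejected
```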