Numpy arrays ctors and flags rework #338


Open

aldanor opened this issue Aug 15, 2016 · 21 comments

@aldanor
Member

aldanor commented Aug 15, 2016

@wjakob This is a discussion / brainstorming issue for flags-related stuff in the NumPy API. Here's an unordered collection of thoughts that resulted from digging around the numpy/pybind11 sources; please feel free to comment:

  • We currently have NPY_ARRAY_FORCECAST set by default, which is very bad. NumPy will then happily convert anything to anything, even if it's complete bollocks (this triggers the unsafe casting mode), which doesn't play well with either input arguments or return values and quite often yields surprising results.
  • I would actually advocate for removing the forcecast option completely, as it doesn't make much sense and is contradictory. You can only sensibly use it for a strongly typed array_t<T, array::forcecast>, which on the one hand implies that you actually do want T, but on the other almost completely disregards the array's dtype because of forcecast. If you want this type of behaviour, you can always accept a plain array and then call .astype() (see below), which would be a lot more precise because you can specify the casting rules. I can't think of a single legitimate example where you would use forcecast for either input arguments or return values -- if you can think of any, I'm all ears :)
  • The default flags combination in NumPy is NPY_ARRAY_DEFAULT which is comprised of:
    • NPY_ARRAY_C_CONTIGUOUS
    • NPY_ARRAY_WRITEABLE
    • NPY_ARRAY_ALIGNED (this in particular is a very sensible default)
  • NumPy also defines NPY_ARRAY_OUT_ARRAY, which is the same as NPY_ARRAY_DEFAULT, and NPY_ARRAY_IN_ARRAY, which is the same thing but without the "writeable" bit. If you think about it, most of the time input arguments should not require writeability unless the purpose is to mutate them (dropping the writeable flag from the requirements would avoid having NumPy make an unneeded copy in some cases). It would be nice to be able to easily specify that.
  • The constructor of array_t calls PyArray_FromAny, which is a universal conversion function "from anything". While it's nice on its own and it would be beneficial to expose it separately (e.g. as a hypothetical ::from_object() static method), I believe it shouldn't be called in the ctor. Instead, the ctor should check that the object is already an array (PyArray_Check) and then call the array conversion routine (PyArray_FromArray), which also benefits from checking the casting rules (only two are available here: safe / force, but that should be sufficient for ctor purposes).
  • It would also be nice to have an ::astype(dtype, casting = safe) -> array method on array, and an ::astype<T>(casting = safe) -> array_t<T> method (the flags should be preserved from the caller). Here we could accept all 5 casting types (e.g. array::casting::same_kind); a sketch follows this list.
  • It's currently impossible to specify flags for array, whereas it may sometimes be beneficial (at least controlling the writeability). Obviously, the forcecast flag doesn't apply here, but NumPy handles redundant flags gracefully -- some routines simply ignore flags that don't apply. This would mean that the ctor of array would be almost the same as that of array_t, calling PyArray_Check and then PyArray_FromArray.
  • Would it make sense to add an ensurecopy flag? When would it be used? (In light of the coming changes that would allow providing an owner for the array so that data is not copied.)
  • API-wise, it would be nice to be able to easily specify input/output arrays without fussing with type parameters. I'm not at all sure about this yet, but maybe some type aliases like array::in(...) or array_t<int>::out(...) or array_t<int, array::forcecast>::out that would set the proper set of flags (with array_t() being essentially the same as array_t::out). Most of the time just these two sets of flags would be used, I believe (for input arguments and output values, respectively).
  • Code-wise, this shouldn't be a huge change; all of it is likely to be contained within the ctors of array and array_t, plus maybe the aliases and a few new methods described above.
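A minimal sketch of the proposed astype as a free function, simply deferring to NumPy's Python-level ndarray.astype() with an explicit casting rule (the helper name astype_safe is invented for illustration; the real method would live on array itself):

#include <pybind11/numpy.h>
namespace py = pybind11;

// Hypothetical stand-in for the proposed ::astype(dtype, casting = safe);
// ndarray.astype() raises TypeError when the cast violates the given rule.
py::object astype_safe(const py::array &a, const py::dtype &dt) {
    return a.attr("astype")(dt, py::arg("casting") = "safe");
}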
@wjakob
Member

wjakob commented Aug 17, 2016

Hi Ivan,

  1. I agree that the forcecast option is problematic as a default. I've needed it in my own projects in the past when the conversion code was a bit too conservative and wouldn't accept some input (I don't have the full context, but it might have been something as innocent as converting a list of integers to complex floats). I would be happier keeping it as a non-default option for array_t only.
  2. All agreed about NPY_ARRAY_DEFAULT, NPY_ARRAY_IN_ARRAY, and NPY_ARRAY_OUT_ARRAY. These could map to enum values like default, in_array and out_array. It might be simpler to specify those as a template parameter to array_t rather than adding static methods.
  3. Using PyArray_FromArray from the constructor when possible sounds reasonable, but it's important that non-arrays are handled as well (e.g. a standard python list). I would assume that PyArray_FromAny defers to PyArray_FromArray when given an array input, so the advantage is not 100% clear to me here.
  4. +1 for the astype proposal, that sounds very useful.
  5. I think having an ensurecopy flag that can be disabled would be extremely useful (in conjunction with specifying an owner). This way, it would be possible to have dynamic C++ data structures that can return a NumPy view of themselves. On the user's side, this entails adding a binding method which creates a dtype on the fly and then returns a NumPy array with owner==this and ensurecopy not specified -- see the sketch below.
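A sketch of that pattern using today's pybind11 API, where passing a base/owner handle to the array_t constructor already yields a non-copying view whose lifetime is tied to the owner (the class name VecHolder is invented for illustration):

#include <vector>
#include <pybind11/numpy.h>
namespace py = pybind11;

struct VecHolder { std::vector<double> data{1.0, 2.0, 3.0}; };

void bind_holder(py::module &m) {
    py::class_<VecHolder>(m, "VecHolder")
        .def(py::init<>())
        .def("view", [](py::object self) {
            auto &v = self.cast<VecHolder &>().data;
            // Passing `self` as the base keeps the holder alive and
            // prevents the buffer from being copied.
            return py::array_t<double>({v.size()}, {sizeof(double)},
                                       v.data(), self);
        });
}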

@aldanor
Member Author

aldanor commented Aug 27, 2016

@wjakob So, regarding the stricter casts: if array_t can hypothetically throw when trying to convert/cast to it, how is that best implemented?

Currently, if something fails while casting, it basically just sets m_ptr = nullptr and you get an invalid object. But then you lose the ability to retrieve the actual error from NumPy, which could be very helpful (e.g. RuntimeError: Failed to convert NumPy array (Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe')); you will only get "something went wrong" (eventually...).

However, if array_t(...) could fail, how should the example below work?

    /// Numpy function which only accepts specific data types
    m.def("selective_func", [](py::array_t<int, py::array::c_style>) { return "Int branch taken."; });
    m.def("selective_func", [](py::array_t<float, py::array::c_style>) { return "Float branch taken."; });
    m.def("selective_func", [](py::array_t<std::complex<float>, py::array::c_style>) { return "Complex float branch taken."; });

It's not actually very numpy.h-specific; more generally: if we fail during object creation/casting early, we have the exception data but we can't do overloads; if we fail later, overloads work, but the exception info is lost. I don't know, maybe I'm missing something :)

@aldanor
Member Author

aldanor commented Aug 28, 2016

Following up on the last comment, here's a specific example from the test suite (it's actually the only test that failed after the rework, and quite legitimately so; the rest of the suite passed just fine):

>       assert np.all(symmetric_lower(asymm) == symm_lower)
E       TypeError: Incompatible function arguments. The following argument types are supported:
E           1. (arg0: numpy.ndarray[int32[m, n]]) -> numpy.ndarray[int32[m, n]]
E           Invoked with: [[ 1  2  3  4]
E        [ 5  6  7  8]
E        [ 9 10 11 12]
E        [13 14 15 16]]
# __str__() doesn't help much

versus

>       assert np.all(symmetric_lower(asymm) == symm_lower)
E       RuntimeError: Failed to convert NumPy array (Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe')
# the actual error that NumPy emits

@wjakob
Member

wjakob commented Aug 29, 2016

@aldanor: I think a compromise should be possible. You can throw pybind11::error_already_set in the array_t constructor to preserve NumPy's error message. After that, you could add a custom type caster with the type signature

template <typename T, int ExtraFlags> struct type_caster<array_t<T, ExtraFlags>> {
    // ...
};

This type caster could work just like the one for pybind11::handle / pybind11::object except that it catches error_already_set and turns it into a normal conversion failure. Does that seem reasonable?
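A rough sketch of such a caster, assuming a converting ensure()-style helper that (hypothetically) throws error_already_set on failure, as proposed above -- illustrative only, not what pybind11 actually ships:

namespace pybind11 { namespace detail {

template <typename T, int ExtraFlags>
struct type_caster<array_t<T, ExtraFlags>> {
    PYBIND11_TYPE_CASTER(array_t<T, ExtraFlags>, _("numpy.ndarray"));

    bool load(handle src, bool /* convert */) {
        try {
            // Hypothetical: ensure() throws error_already_set when NumPy
            // rejects the conversion (e.g. under the 'safe' casting rule).
            value = array_t<T, ExtraFlags>::ensure(src);
        } catch (const error_already_set &) {
            PyErr_Clear();  // swallow NumPy's message and report a plain
            return false;   // conversion failure so other overloads run
        }
        return static_cast<bool>(value);
    }

    static handle cast(const array_t<T, ExtraFlags> &src,
                       return_value_policy, handle) {
        return src.inc_ref();
    }
};

}} // namespace pybind11::detail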

@aldanor
Member Author

aldanor commented Aug 29, 2016

Ah yes that's a great idea, thanks! I think it should work. I'll get on with it and ping you when there's something ready.

Actually, there's one more thing.

Currently, we call PyArray_FromAny in the ctor (in ensure_array), and it sort of makes sense for the cases when you're lazy and want to pass something array-like, or just a scalar, and let NumPy handle it for you. This is more of a nice-to-have kind of thing, but still.

Now, when we're converting an unknown object into an array, there are two options -- we don't have a dtype, just the flags (array), or we do (array_t<>). Then there are two more options -- the input is already an array, or it's not.

The way PyArray_FromAny handles casting: if the forcecast flag is not set (which it normally wouldn't be), the casting is set to "safe". However, this would prohibit e.g. accepting int64 arrays when int32 is required. Here's a short illustration of the casting rules:

>>> import numpy as np
>>> import pandas as pd
>>> casting = 'no', 'equiv', 'safe', 'same_kind', 'unsafe'
>>> types = 'int32', 'int64', 'double'
>>> pd.DataFrame([(f, t, c, np.can_cast(f, t, c)) 
...     for f in types for t in types for c in casting if f != t],
...     columns=['from', 'to', 'casting', 'can_cast']).set_index(['from', 'to', 'casting'])
                        can_cast
from   to     casting           
int32  int64  no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
       double no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
int64  int32  no           False
              equiv        False
              safe         False
              same_kind     True
              unsafe        True
       double no           False
              equiv        False
              safe          True
              same_kind     True
              unsafe        True
double int32  no           False
              equiv        False
              safe         False
              same_kind    False
              unsafe        True
       int64  no           False
              equiv        False
              safe         False
              same_kind    False
              unsafe        True

It would be nice to be able to relax the casting rules down to "same_kind" or tighten them to "equiv" for ctors / type casters, but I'm not quite sure how that should be expressed. PyArray_FromAny doesn't directly accept a casting policy, so it would have to be done separately / manually. There's PyArray_CanCastTo, which accepts two dtypes and performs the check, and PyArray_CastTo, which does the actual casting.

I thought of something like this:

  1. casting an object to array: don't have target dtype, do nothing
  2. constructing an array or array_t from a pointer: the dtypes will always match, so do nothing
  3. casting an object to array_t, input is already an array: check if can cast, do the cast
  4. casting an object to array_t, input is not an array: either
    4.a) convert it to an array without specifying target dtype (so that minimal dtype is inferred), then 3)
    4.b) convert it to an array with target dtype (in which case it's almost like a forcecast), then do nothing

Also, it's not clear where this casting policy should go. It's not really a flag, more like an enum, and there are already flags in the type signature (which I'm not too fond of either). Also, I think it matters mostly for function signatures (both the input and output types), because in other parts of the code you're always free to do .astype(whatever, casting="...").

@wjakob
Member

wjakob commented Aug 30, 2016

It seems to me like 4.a) in your list is safer.

The reason for adding these enums/flags as template parameters of the type is that it makes it convenient to annotate the conversion intent in binding declarations like:

myClass.def("f", [](py::array_t<uint32_t /*, conversion flags */>) { /* ... */ });

I'm open to other suggestions but generally think that this is too nice to give up.

@aldanor
Member Author

aldanor commented Sep 1, 2016

I'm thinking the casting rules would probably have to go in their own separate type parameter, since it makes sense to have an enum class casting -- it's a legitimate "one-of" kind of enum which would be weird to mix with the array flags.

Since this would be a third type parameter, though, to specify the casting you would also need to spell out the flags :/ (see the illustration below)
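A standalone illustration of that positional-parameter problem (generic C++, not actual or proposed pybind11 code):

enum class casting { no, equiv, safe, same_kind, unsafe };

// Defaulted template parameters are positional: to reach `Casting`,
// the caller must restate `Flags` even when the default is wanted.
template <typename T, int Flags = 0, casting Casting = casting::safe>
struct array_like { /* ... */ };

array_like<int> a;                         // all defaults: fine
array_like<int, 0, casting::same_kind> b;  // Flags must be spelled out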

@aldanor
Member Author

aldanor commented Sep 1, 2016

@wjakob I've sketched a rough initial prototype (without the casting for the moment) which compiles, and the existing tests pass -- so take a look whenever you have time :) aldanor#2.

I've opened it on my repo again since it's largely unfinished, may get completely reworked from the ground up, and I'll have to do large rebases later on anyway. It's just a way to get some early feedback and have somewhere to post the prototype code.

@aldanor
Member Author

aldanor commented Nov 29, 2016

@wjakob OK, I'm in the process of rewriting the initial flags branch (given all the recent big changes like the ctors rework; plus many features from the original PR, like NumPy C API access, have already been implemented). One question that has emerged: how do we encode readonly / writeable? I've added a buffer_info.readonly attribute, so that now needs support on the array / array_t side.

One problem is that ExtraFlags are used for two completely different purposes:

  1. construction from a pointer - all ctors essentially get forwarded to array::array()
  2. conversion of an object - array_t<...>::ensure()

Here are a few points:

  1. When constructing/converting arrays, many flags act as requirements only; this affects whether a copy will be made (only if the array is required to be writeable but the passed array is not). If you don't request writeability, it doesn't mean the array will not be writeable -- it may or may not be, who knows.
  2. Having template-level parameters for run-time properties is generally a bit strange. You can always do array_proxy(m_ptr)->flags |= NPY_ARRAY_WRITEABLE (or do it from Python). This means that if array::writeable were a flag you could pass to ExtraFlags, then array_t<T, array::writeable> would not in general be guaranteed to be writeable, since it's a run-time property.
  3. Sometimes you would want to enforce the readonly property on a newly created or an existing array -- i.e., set the flag manually on the array. Logically, this would be passed to the ctor and not as a template argument (although the current ctors already have a ton of arguments, including ones with default values...).
  4. However, sometimes you would want to either (a) ensure the array you receive as a function argument is writeable so you can mutate it, or (b) relax the writeability requirement (NB: != readonly) so that NumPy doesn't accidentally make a copy where it shouldn't. Obviously, this has to reside outside of the ctors (a template argument? or what?).

(One other idea I had: if a certain requirement (like writeability or alignedness for arrays) only matters when converting incoming arguments, it could be folded into the "extra" arguments pack that we use for return value policies etc.; but that would probably be a mess, so I'm not suggesting it.)

I also just realized (correct me if I'm wrong) that ExtraFlags are not used at all for array creation, i.e. the instance created by array_t<T, flags>::array_t() does not depend on flags at all, since everything gets forwarded blindly to array::array(), discarding all flags. Is that intended?

@wjakob
Member

wjakob commented Dec 3, 2016

Hi @aldanor,

apologies for the delay. Regarding

One problem is that ExtraFlags are used for two completely different purposes:

  1. construction from a pointer - all ctors essentially get forwarded to array::array()

  2. conversion of an object - array_t<...>::ensure()

Potentially, one way to resolve this discussion is to simply not use ExtraFlags for case 1. AFAIK, ExtraFlags is currently always zero in this case (it's the default argument of raw_array). As you said, some of the flags can even be controlled at runtime, so it's weird to bake them into the type.

IMHO the only "real" application of ExtraFlags has been the situation where we want to bind a function taking a py::array, but the C++ code makes a lot of assumptions about how the data is laid out, what type it has, etc. So then we can e.g. bind a function taking an array_t<float, array::f_style> instead, to eliminate an annoying amount of boilerplate code. Furthermore, this makes it possible to overload a function with respect to arrays of different types. This approach could even be extended with additional flags that ensure only 1D or 2D arrays can be supplied, etc.
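For instance (a small sketch; the function name and body are placeholders):

#include <pybind11/numpy.h>
namespace py = pybind11;

void bind_example(py::module &m) {
    // Without ExtraFlags, the lambda body would have to validate dtype and
    // memory layout by hand; with them, pybind11 converts or rejects the
    // argument before the body ever runs.
    m.def("column_major_only", [](py::array_t<float, py::array::f_style> a) {
        // `a` is guaranteed to be float32 and Fortran-contiguous here, so
        // a.data() can be handed straight to a column-major routine.
        return a.size();
    });
}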

@aldanor
Member Author

aldanor commented Dec 5, 2016

Potentially one way to resolve this discussion is to simply not use ExtraFlags for case 1.

OK, here's a question then: do you think that this (current master) behaviour is correct / obvious? I would personally consider it extremely confusing, error-prone and unintuitive.

#include <pybind11/numpy.h>

PYBIND11_PLUGIN(test_array_t) {
    namespace py = pybind11;
    using farray = py::array_t<int, py::array::f_style>;
    return py::module("test_array_t")
    .def("new_farray_1", []() -> farray { return farray({10, 20}).cast<py::object>(); })
    .def("new_farray_2", []() { return farray({10, 20}); })
    .ptr();
}

In new_farray_2(), you would think you have just created an f-style array, right? Nope :)

>>> from test_array_t import *

>>> new_farray_1().flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True  # ok
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

>>> new_farray_2().flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False  # oops
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

@wjakob
Member

wjakob commented Dec 5, 2016

As far as I can see, the options are:

  1. We could add a static_assert to ensure that the non-converting constructors can only be called when ExtraFlags == 0.

  2. Certain flags such as f_style could also be permitted for the non-converting constructors. In that case, it wouldn't make sense to e.g. specify f_style and use a constructor which takes strides as an argument, so there would need to be some extra static_asserts.
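Option 1 could look roughly like this, sketched as a user-side factory rather than as a patch to the actual ctors (the helper name make_array is invented):

#include <utility>
#include <vector>
#include <pybind11/numpy.h>
namespace py = pybind11;

template <typename T, int ExtraFlags = 0>
py::array_t<T, ExtraFlags> make_array(std::vector<py::ssize_t> shape) {
    // Flags are silently discarded when constructing from scratch, so
    // reject them at compile time instead of surprising the caller.
    static_assert(ExtraFlags == 0,
                  "ExtraFlags only constrain conversions; they have no "
                  "effect on arrays constructed from a shape or pointer");
    return py::array_t<T, ExtraFlags>(std::move(shape));
}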

What do you think?

@wjakob
Member

wjakob commented Dec 12, 2016

Hi @aldanor,

I'm curious whether there's any news here? If you think the flags changes will still take a while, is there a way to push out a new release now with some smallish breaking changes, so that your patch could go into a minor revision? (Such as removing forcecast.)

Wenzel

@aldanor
Member Author

aldanor commented Dec 14, 2016

@wjakob Hi! Sorry for the delay, end-of-year madness :)

It turns out I'll have the entire next week off (after which I'll be gone/inaccessible for half a month), so I'll try to attack the darned flags thing again in a few days. If things are still in flux by next Friday, then yeah, I concur -- removing forcecast and fixing up a few tests is the minimum we should do.

@aldanor
Member Author

aldanor commented Dec 14, 2016

Re: your comment above -- I agree, we can add a bunch of static_asserts (e.g., if you specify c_style/f_style, you can't provide strides anymore).

Btw, should the same logic apply to the contiguity flags vs. being able to access .data()? E.g. see this issue: #538 (comment) -- does it make sense to expose the data pointer if the memory storage is not known to be contiguous? Can you think of an example?

@wjakob
Member

wjakob commented Dec 14, 2016

I think it's important to be able to access the .data() pointer in any case. The way to access it is dictated by the strides, and violating that contract may obviously lead to unexpected results. So I don't think there is anything that needs to change about that function.
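In other words, raw access always has to go through the strides. A small illustration using pybind11's array accessors (the helper sum_2d is invented and assumes the caller passes a 2-D float64 array):

#include <pybind11/numpy.h>
namespace py = pybind11;

// Sum a 2-D float64 array of arbitrary layout (C-order, Fortran-order,
// or a strided view) by honoring the byte strides of each dimension.
double sum_2d(const py::array &a) {
    const auto *base = static_cast<const char *>(a.data());
    const auto s0 = a.strides(0), s1 = a.strides(1);
    double total = 0.0;
    for (py::ssize_t i = 0; i < a.shape(0); ++i)
        for (py::ssize_t j = 0; j < a.shape(1); ++j)
            total += *reinterpret_cast<const double *>(base + i * s0 + j * s1);
    return total;
}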

@wjakob
Member

wjakob commented Dec 18, 2016

Hi @aldanor,

just a status update on the releases: I'd like to push out a v2.0.0 pre-release next week before Christmas (followed by a feature freeze for stabilization). I'll remove the forcecast feature now -- if you can still land your patch soon, we can include it. Otherwise I'd say let's postpone it to within the 2.x cycle.
Thanks,
Wenzel

@aldanor
Member Author

aldanor commented Jan 1, 2017

@wjakob Hi! Yeah, all good -- congrats on the 2.0 release (and happy New Year!), good timing!

I have a bit of spare time on my hands now, so I'm continuing the work on this issue, albeit after a considerable delay.

I wanted to discuss a separate problem (it directly concerns conversion) which we haven't covered explicitly. Maybe it makes sense to consider it part of this rework.

Check out this example:

#include <cstdint>
#include <pybind11/numpy.h>
namespace py = pybind11;

struct Foo {
    const int32_t *ptr;
    Foo(const int32_t* ptr) : ptr(ptr) {}
    int32_t value() const { return *ptr; }
};

PYBIND11_PLUGIN(arr_conv) {
    py::module m("arr_conv");
    py::class_<Foo>(m, "Foo")
        .def("__init__", [](Foo& self, py::array_t<int32_t, 0>& arr)
             { new(&self) Foo(arr.data()); }, py::keep_alive<0, 1>())
        .def_property_readonly("value", &Foo::value);
    return m.ptr();
}

and this test case:

import arr_conv
l = [42] * 10000
f = arr_conv.Foo(l)
print(f.value)
l = [43] * 10000 + l
print(f.value)

What do you expect it to print? :) (and also, what do you expect a person not familiar with pybind11 internals would expect it to print?...)

@aldanor
Member Author

aldanor commented Jan 3, 2017

@wjakob So, on my machine this example prints something like "42 0" most of the time (due to the memory being reallocated); it could also print "42 42" sometimes.

The problem here is that Foo::__init__ stores the data pointer of the array, which turns out to be temporary because of the conversion that happens. Also, if I understand correctly, py::keep_alive is misleading here, since it increfs the original input argument (i.e., the list) and not the converted one (i.e., the array).

Unless I'm missing something, I think this is a design flaw on our side: by doing the conversion eagerly and unconditionally, we lose the ability to require the input argument to be an "lvalue" array, i.e. an array that already exists and doesn't get converted because it already satisfies all the requirements (dtype, extra flags for array_t). This is also very important when you deal with huge arrays nearing the RAM limit and have to be sure no implicit copies are made.

Another important reason is the following distinction: we don't expose the internal pointers of any other Python objects (e.g. list, dict, anything else) to C++, so all interactions go through Python; however, this is not true for NumPy arrays, since you can request a pointer via .data(). Being able to get the internal pointer but not being able to enforce lvalue-ness doesn't sound good.

It may have been more appropriate not to do conversions in the ctors of array and array_t at all, and to have separate types like array_like and array_like_t to use as input arguments, which would do the conversions in their ctors (this also correlates with how things are named in NumPy). However, that would be too intrusive a change (although I, for one, would support it). So maybe we can figure out an alternative solution -- e.g., if an argument is an lvalue-ref (array& or array_t<>&), no conversions are made and the requirements are only checked; an exception would be thrown if the argument is not already an array that satisfies them. A user-side version of this check is sketched below.
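A sketch of that check as it can be written today on the user side (the function name consume is invented; py::isinstance<py::array_t<T>> tests both that the object is an ndarray and that its dtype matches):

#include <cstdint>
#include <pybind11/numpy.h>
namespace py = pybind11;

void bind_consume(py::module &m) {
    m.def("consume", [](py::object obj) {
        // Refuse anything that is not *already* an int32 ndarray, so no
        // temporary converted array (and thus no dangling data pointer)
        // can ever be involved.
        if (!py::isinstance<py::array_t<std::int32_t>>(obj))
            throw py::type_error("expected an existing int32 numpy array");
        auto arr = py::reinterpret_borrow<py::array_t<std::int32_t>>(obj);
        return *arr.data();  // safe: obj itself is the array
    });
}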

If you agree with my point, maybe we can open a separate issue for discussion to see what people think.

@llchan

llchan commented Feb 10, 2017

I agree with most of the issues raised here, as well as some of the ideas discussed in other issues/PRs. When dealing with large, memory-intensive datasets, I need to be able to read through code and know that copies are not made (or that, if they are, I explicitly allowed them).

There are several related conversations happening around the same general theme of array type handling. This issue itself is really the precursor discussion to what will likely become a collection of issues. Could I suggest that we make an umbrella issue (like [1]) or a GitHub milestone with a bird's-eye view of the different ideas? I'd like to stay up to date but maybe don't need the granularity of following every single issue.

Also if I can be of help please let me know.

[1] https://github.com/facebook/react/ issues/7925 (intentionally mangled URL so it doesn't reference over there)

@jiapei100

Yeah... How do you setflags for a pybind11::array?
