Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

Closed
wants to merge 1 commit into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Dec 12, 2018

No description provided.

@pitrou pitrou added the WIP PR is work in progress label Dec 12, 2018
@pitrou pitrou force-pushed the ARROW-554-conform-dicts branch 2 times, most recently from 28e55ce to cf6f8e2 Compare December 12, 2018 19:35
@pitrou pitrou force-pushed the ARROW-554-conform-dicts branch 3 times, most recently from f961288 to a72a759 Compare December 18, 2018 20:05
@pitrou pitrou changed the title [WIP] ARROW-554: [C++] Add functions to conform dictionaries ARROW-554: [C++] Add functions to conform dictionaries Dec 18, 2018
@pitrou pitrou removed the WIP PR is work in progress label Dec 18, 2018
@pitrou
Copy link
Member Author

pitrou commented Dec 18, 2018

This PR is ready for review now.

@codecov-io
Copy link

Codecov Report

Merging #3165 into master will increase coverage by 1.19%.
The diff coverage is 89.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3165      +/-   ##
==========================================
+ Coverage   86.44%   87.63%   +1.19%     
==========================================
  Files         508      446      -62     
  Lines       70077    66052    -4025     
==========================================
- Hits        60576    57885    -2691     
+ Misses       9398     8167    -1231     
+ Partials      103        0     -103
Impacted Files Coverage Δ
cpp/src/arrow/visitor_inline.h 96.89% <ø> (ø) ⬆️
cpp/src/arrow/util/hashing.h 99.25% <ø> (ø) ⬆️
cpp/src/arrow/type.h 87.01% <ø> (ø) ⬆️
python/pyarrow/tests/test_types.py 96.89% <100%> (ø) ⬆️
cpp/src/arrow/type-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/util/int-util-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/type.cc 92.5% <100%> (+0.08%) ⬆️
cpp/src/arrow/util/int-util.cc 99.6% <100%> (+0.02%) ⬆️
cpp/src/arrow/array.h 98.44% <50%> (-0.77%) ⬇️
cpp/src/arrow/array/builder_dict.cc 65.15% <74.62%> (+11.3%) ⬆️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 944b9e3...7d2579b. Read the comment docs.

Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -740,4 +743,63 @@ TEST(TestDictionary, FromArray) {
ASSERT_RAISES(Invalid, DictionaryArray::FromArrays(dict_type, indices4, &arr4));
}

TEST(TestDictionary, TransposeBasic) {
std::shared_ptr<Array> arr, out, expected;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you type them DictionaryArray at the declaration to avoid all casts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, other APIs take Array pointers, not DictionaryArray.

std::vector<int32_t> transpose_map = {1111, 2222, 3333, 4444, 5555, 6666, 7777};
std::vector<int64_t> dest(src.size());

TransposeInts(src.data(), dest.data(), 6, transpose_map.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src.size() instead of 6, just in case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may emit a cast warning, though :-/

template <typename InputInt, typename OutputInt>
void TransposeInts(const InputInt* src, OutputInt* dest, int64_t length,
const int32_t* transpose_map) {
while (length >= 4) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang does this unrolling naturally, gcc does not :( https://godbolt.org/z/ZHobe1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither does MSVC. The compiler generally doesn't have enough context to know whether unrolling is a good idea (I'm surprised than clang does it at -O2, btw). I'd rather unroll explicitly where we have that context.

while (length >= 4) {
dest[0] = static_cast<OutputInt>(transpose_map[src[0]]);
dest[1] = static_cast<OutputInt>(transpose_map[src[1]]);
dest[2] = static_cast<OutputInt>(transpose_map[src[2]]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't you concerned by truncation due to incompatible size, maybe a static_assert somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would we static_assert? Indeed it's up to the caller to ensure the input is ok (it is by construction in case of dictionary unification).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(in_type) <= sizeof(out_type)
is_signed(in_type) == is_signed(out_type)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but the converse is supported as well.

/// one per input type. Each integer vector represents the transposition
/// of input type indices into unified type indices.
// XXX Should we return something special (an empty transpose map?) when
// the transposition is the identity function?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an instance where the first transposition isn't the identity function? If not then maybe we could just return types.size()-1 transpositions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this implementation, but the XXX is about the more general case (all transpositions could actually be the identity function).

@wesm
Copy link
Member

wesm commented Dec 19, 2018

Giving this a look before merging, bear with me

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for kicking off this work. This can be a nuanced topic -- there's a number of follow up items that I've noted in comments. I'm going to go ahead and merge this patch and then we can create follow up JIRAs for discussion and work. A number of these items of follow up work will be forced by the development of other things, so no specific urgency

in_data.null_count);
internal::TransposeInts(in_data.GetValues<in_c_type>(1),
out_data->GetMutableValues<out_c_type>(1), in_data.length,
transpose_map.data());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is a specific case of the "take" function, a la ndarray.take or "from" verb in J. Here we have

out = in TAKE transpose_map

https://issues.apache.org/jira/browse/ARROW-772

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though this one has transparent up/downcasting as well (which we probably don't want in the general case of a "take" function).

// implement dictionary-to-dictionary casting.
auto in_type_id = dict_type_->index_type()->id();
auto out_type_id = out_dict_type.index_type()->id();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth the (minor) expense of checking for no-op case where transpose_map is range(dictionary_size)? This covers the case where the unified dictionary type is an extended version of the current type, e.g. as the result of a delta dictionary

Perhaps let's create a follow up JIRA so as to not hold up this patch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already an XXX for that in type.h. I agree it could be beneficial in many cases (such as when you are simply augmenting an existing dictionary).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) const;
// XXX Do we also want an unsafe in-place Transpose?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do want that but how we handle it may require some care. This also came up in the context of Arrow Flight where the dictionaries are unknown at the start of an RPC session

https://issues.apache.org/jira/browse/ARROW-3144

I would suggest that we address both cases through a single design, i.e. a MutableDictionaryType subtype or something similar.

One complexity that this introduces (which we need to be prepared for anyway) is the some code will need to check whether type->dictionary() is null or not (if it were null, it would mean that some data has been used inappropriately before the dictionary has been resolved), but I think that's OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment in ARROW-3144 directing to this discussion

dict_types.push_back(checked_cast<const DictionaryType*>(type));
}

// XXX Should we check the ordered flag?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky. We should consider it in a follow up JIRA.

You could conceivably have a sequence of dictionaries like

a < b < c
a < b < c < d
a < b < c < d < e < f

This could result from a sequence of ordered delta dictionaries. You could argue in this case that the output type should also be ordered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the "ordered" flag doesn't seem to make sense. A dictionary is simply a linear lookup table, so it's always ordered in a sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor optimization: if dictionaries have an ordering, that could be used to merge them without building a memo table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz I'm not sure I understand your suggestion. "ordered" here does not mean "sorted", just that the dictionary's order is semantically meaningful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A   0     1 2 3 4 5       6

B     7 8 1   3 4   10 11 6

U   0 7 8 1 2 3 4 5 10 11 6

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but how does the algorithm look like and why would it more efficient than a series of hash lookups?

Copy link
Member

@bkietz bkietz Dec 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you're right; talking about optimization is not justified when there is not a version which preserves order using a hash table. I'll write an example of each and benchmark them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say it's not worth spending time right now on handling multiple "ordered" dictionaries (where the only case where "ordered" would be preserved is when all the dictionaries are prefixes of a common dictionary)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created https://issues.apache.org/jira/browse/ARROW-4096. This is low priority


Status DictionaryType::Unify(MemoryPool* pool, const std::vector<const DataType*>& types,
std::shared_ptr<DataType>* out_type,
std::vector<std::vector<int32_t>>* out_transpose_maps) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also consider the case where we want to conform a dictionary to an other, without unification. For example:

[a, b, c, a, b, c], dictionary [a, b, c]

conform to dictionary [a, c, d]

yields

[a, null, c, a, null, c]

This isn't unification so much as conformance. This was what was meant by my comment "Conform the dictionary indices to reference a common dictionary" in ARROW-554. The use case would be implementing casts between dictionary types -- this would include a safe and unsafe mode (safe would error if a value is made null)

Since this use case is not addressed in this patch (unless I missed it), let us create a follow up JIRA

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see. I have misinterpreted the original JIRA then. In any case, I was thinking of the CSV use case, where we'll be unifying dictionaries at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto t2 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
auto t3 = dictionary(int16(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
auto t4 = dictionary(int8(), ArrayFromJSON(int16(), "[3, 4, 5, 6]"));
auto t5 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 7, 6]"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite glad we have this function now =) We might consider adding even more testing sugar so that we can write jsarray::int64("[3, 4, 5, 6]") or some such

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"[3, 1, 4, 2]"_array(int64())

ty1 = pa.dictionary(pa.float32(), pa.array([1.0, 2.0]), ordered=True)
assert ty1.index_type == pa.float32()
ty1 = pa.dictionary(pa.int8(), pa.array([1.0, 2.0]), ordered=True)
assert ty1.index_type == pa.int8()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wesm wesm changed the title ARROW-554: [C++] Add functions to conform dictionaries ARROW-554: [C++] Add functions to unify dictionary types and arrays Dec 19, 2018
@wesm wesm closed this in f66fa80 Dec 19, 2018
@pitrou pitrou deleted the ARROW-554-conform-dicts branch December 19, 2018 22:11
@wesm
Copy link
Member

wesm commented Dec 21, 2018

OK I think I have all the follow up items accounted for in JIRA

cav71 pushed a commit to cav71/arrow that referenced this pull request Dec 22, 2018
Author: Antoine Pitrou <[email protected]>

Closes apache#3165 from pitrou/ARROW-554-conform-dicts and squashes the following commits:

7d2579b <Antoine Pitrou> ARROW-554:  Add functions to conform dictionaries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants