ARROW-6582: [R] Arrow to R fails with embedded nuls in strings #8365
r/tests/testthat/test-Array.R
[1] "person" "woman" "ma" "camera" "tv"
😆
🙈
|
If what we want is that the nul is kept, based on @bkietz's comment from #8536:
cpp11::unwind_protect([&] {
  if (array->null_count()) {
    // need to watch for nulls
    arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                              array->offset(), n);
    for (int i = 0; i < n; i++, null_reader.Next()) {
      if (null_reader.IsSet()) {
        SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
      } else {
        SET_STRING_ELT(data, start + i, NA_STRING);
      }
    }
  } else {
    for (int i = 0; i < n; i++) {
      SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
    }
  }
});
with:
private:
  SEXP unsafe_r_string(const std::string& s) const {
    return Rf_mkCharLenCE(s.c_str(), s.size(), CE_UTF8);
  }
this builds on knowing that i.e. it assumes utf-8 but since it does not use the known size, it searches for cc @jimhester, is this on purpose that this constructor uses |
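The point about not using the known size can be illustrated without the R API. The following is a minimal sketch in plain C++, not code from this PR: the helper names are hypothetical, and `std::strlen` stands in for any call that receives only a `const char*` and has to scan for the terminating nul.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length seen by an API that takes only a `const char*` and scans for the
// first nul byte (std::strlen is a stand-in for that behaviour).
std::size_t length_seen_by_pointer_only_api(const std::string& s) {
  return std::strlen(s.c_str());
}

// Length seen by an API that is handed the byte count explicitly.
std::size_t length_seen_by_sized_api(const std::string& s) {
  return s.size();
}

// Hypothetical helper: builds the 7-byte value "camer\0a" used in this thread.
std::string make_string_with_embedded_nul() {
  return std::string("camer\0a", 7);
}
```

For the 7-byte value, the pointer-only view stops at 5 bytes while the sized view reports all 7, which is the silent truncation discussed here.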
|
@nealrichardson is the intent that we do get to the |
|
Oh hmm, it is not on purpose, I think it was just copy pasted from the That being said having an embedded Which erroring maybe is the intent here, so perhaps switching this to |
|
Arguably failing is better than silently truncating, but that puts us back at the original user report. I see our options as:
|
|
I think failing asap is better, either with the current code, or with an
StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
auto unsafe_r_string = [](const std::string& s) {
  return Rf_mkCharCE(s.c_str(), CE_UTF8);
};
cpp11::unwind_protect([&] {
  if (array->null_count()) {
    // need to watch for nulls
    arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                              array->offset(), n);
    for (int i = 0; i < n; i++, null_reader.Next()) {
      if (null_reader.IsSet()) {
        SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
      } else {
        SET_STRING_ELT(data, start + i, NA_STRING);
      }
    }
  } else {
    for (int i = 0; i < n; i++) {
      SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
    }
  }
});
return Status::OK(); |
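As a rough illustration of the "fail as soon as possible" idea, here is a standalone sketch that mirrors the shape of the loop above without depending on R or Arrow. Everything in it is an assumption made for illustration: `checked_string` and `convert_strings` are hypothetical names, a `std::vector<bool>` stands in for the validity bitmap, and `std::nullopt` stands in for `NA_STRING`.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reject a value containing an embedded nul instead of
// letting a later C-string call truncate it silently.
const std::string& checked_string(const std::string& s, std::size_t index) {
  if (s.find('\0') != std::string::npos) {
    throw std::runtime_error("embedded nul in string at index " +
                             std::to_string(index));
  }
  return s;
}

// Convert values to an output vector, emitting std::nullopt (standing in for
// NA_STRING) where the validity flag is false. Assumes both inputs have the
// same length.
std::vector<std::optional<std::string>> convert_strings(
    const std::vector<std::string>& values, const std::vector<bool>& is_valid) {
  std::vector<std::optional<std::string>> out;
  out.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); i++) {
    if (is_valid[i]) {
      out.emplace_back(checked_string(values[i], i));
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

In this sketch the check runs only on valid slots, so a null entry never triggers the error, and the first offending value aborts the whole conversion.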
|
@romainfrancois that looks good to me. I'd recommend using |
|
What you describe (including using GetView) is essentially what we now have on master: https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L290-L321 The difference is that we moved back to If |
|
It does look like
cpp11::cpp_eval('Rf_mkCharLenCE("camer\\0a", 7, CE_UTF8)')
#> Error in f(): embedded nul in string: 'camer\0a'
Created on 2020-11-13 by the reprex package (v0.3.0.9001) |
8180ee2 to 09446f3
|
It would be good to get this resolved for 3.0. I pushed a naive fix: if |
4142e80 to 9c20721
|
@nealrichardson 1) I'll push an implementation of this 2) unfortunately, unwind_exceptions can't really be caught. They are used by cpp11 to get c++ stack unwinding correct but if one is currently in flight then the R runtime has already been informed that |
nealrichardson
left a comment
Thanks for doing this better :)
|
@jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois Just wanted to say thanks for working on this. I reported it a long time ago and have just been periodically watching the developments slowly progress. I'm excited to see that there will be a resolution! Cheers! |