Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Oct 6, 2020

No description provided.

@github-actions
Copy link

github-actions bot commented Oct 6, 2020

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1] "person" "woman" "ma" "camera" "tv"

😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈

@romainfrancois
Copy link
Contributor

romainfrancois commented Oct 28, 2020

if what we want is that the nul is kept, based on @bkietz comment from #8536 Converter_String::Ingest_some_nulls() could end with:

    cpp11::unwind_protect([&] {
      if (array->null_count()) {
        // need to watch for nulls
        arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                  array->offset(), n);
        for (int i = 0; i < n; i++, null_reader.Next()) {
          if (null_reader.IsSet()) {
            SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
          } else {
            SET_STRING_ELT(data, start + i, NA_STRING);
          }
        }

      } else {
        for (int i = 0; i < n; i++) {
          SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
        }
      }
    });

with:

private:
  SEXP unsafe_r_string(const std::string& s) const {
    return Rf_mkCharLenCE(s.c_str(), s.size(), CE_UTF8);
  }

this builds on knowing that GetString() gives an utf8 string, and that we know its size, as opposed to what cpp11::r_string(std::string) does here: https://github.com/r-lib/cpp11/blob/master/inst/include/cpp11/r_string.hpp#L18 and uses unwind protection around the loop instead of the individual conversions.

i.e. it assumes utf-8 but since it does not use the known size, it searches for nul:

r_string(const std::string& data) : data_(safe[Rf_mkCharCE](data.c_str(), CE_UTF8)) {}

cc @jimhester, is this on purpose that this constructor uses Rf_mkCharCE instead of Rf_mkCharLenCE ?

@romainfrancois
Copy link
Contributor

@nealrichardson is the intent that we do get to the embedded nul in string: 'ma\0n' error ?

@jimhester
Copy link

jimhester commented Oct 28, 2020

Oh hmm, it is not on purpose, I think it was just copy pasted from the const char* implementation above. We should use mkCharLenCE for std::string() since we know the length.

That being said having an embedded NULL in any CHARSXP is not really going to work in R. Lots of places in the R codebase assumes strings are null terminated. I think the only way you can handle string data with embedded nulls in R is with raw vectors.

Which erroring maybe is the intent here, so perhaps switching this to mkCharLenCE will do what we want.

@nealrichardson
Copy link
Member Author

Arguably failing is better than silently truncating, but that puts us back at the original user report. I see our options as:

  1. Do nothing in the code; document a workflow of casting the utf8 column to binary, then stripping out the 00s yourself in R
  2. In arrow_to_vector, if a nul is hit, downcast to binary and proceed with conversion (raise a warning instead of erroring)
  3. Add a skipNul-like option (cf. https://stat.ethz.ch/R-manual/R-devel/library/base/html/readLines.html and others) and strip nulls if they're hit, raising a warning but keeping the result as a character vector.

@romainfrancois
Copy link
Contributor

I think failing asap is better, either with the current code, or with an unwind_protect() outside the loop.

StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
    auto unsafe_r_string = [](const std::string& s) {
      return Rf_mkCharCE(s.c_str(), CE_UTF8);
    };
    cpp11::unwind_protect([&] {
      if (array->null_count()) {
        // need to watch for nulls
        arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                                  array->offset(), n);
        for (int i = 0; i < n; i++, null_reader.Next()) {
          if (null_reader.IsSet()) {
            SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
          } else {
            SET_STRING_ELT(data, start + i, NA_STRING);
          }
        }

      } else {
        for (int i = 0; i < n; i++) {
          SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
        }
      }
    });

    return Status::OK();

@bkietz
Copy link
Member

bkietz commented Nov 12, 2020

@romainfrancois that looks good to me. I'd recommend using GetView instead of GetString to avoid allocating an unnecessary temporary for non-short strings

@nealrichardson
Copy link
Member Author

What you describe (including using GetView) is essentially what we now have on master: https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L290-L321

The difference is that we moved back to Rf_mkCharLenCE, which is much faster because length is known and errors (correctly) when there are embedded nuls, instead of Rf_mkCharCE, which IIUC was the reason for the performance regression in #8356 and which was responsible for silently truncating at an embedded nul.

If Rf_mkCharLenCE is what is raising the "embedded nul in string" error, then we have the option of catching that and falling back to a slower skipNul path to strip the nulls. If the error comes from somewhere else later, we may need to consider other options.

@romainfrancois
Copy link
Contributor

romainfrancois commented Nov 13, 2020

It does look like Rf_mkCharLenCE() is generating the error:

cpp11::cpp_eval('Rf_mkCharLenCE("camer\\0a", 7, CE_UTF8)')
#> Error in f(): embedded nul in string: 'camer\0a'

Created on 2020-11-13 by the reprex package (v0.3.0.9001)

@nealrichardson
Copy link
Member Author

It would be good to get this resolved for 3.0. I pushed a naive fix: if arrow.skip_nul = TRUE (default FALSE, per base::readLines and base::scan), we go through a slow path and strip out nuls. A better solution (1) would check arrow.skip_nul outside of the loop (I could do this now but figure there's a smarter C/C++ way than I would have come up with); (2) would probably try the fast conversion path anyway and catch the exception and retry with the slow path, so even if arrow.skip_nul == true, we only do the slow path if there is a nul (I tried try/catch but didn't get it working right); and (3) it should raise a warning when it does strip a nul, as I believe readLines and scan do.

@nealrichardson nealrichardson marked this pull request as ready for review December 29, 2020 22:36
@bkietz
Copy link
Member

bkietz commented Jan 4, 2021

@nealrichardson 1) I'll push an implementation of this 2) unfortunately, unwind_exceptions can't really be caught. They are used by cpp11 to get c++ stack unwinding correct but if one is currently in flight then the R runtime has already been informed that stop() was called. If the exception is caught and not rethrown, the R stack will continue to unwind (leaking c++ resources). 3) wilco

Copy link
Member Author

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this better :)

@nealrichardson nealrichardson deleted the r-nul branch January 4, 2021 20:48
@nealrichardson nealrichardson restored the r-nul branch January 4, 2021 20:48
@johncassil
Copy link

@jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois

Just wanted to say thanks for working on this. I reported it a long time ago and have just been periodically watching the developments slowly progress. I'm excited to see that there will be a resolution!

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants