Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure path is in native encoding #477

Merged
merged 8 commits into from
May 11, 2018
Merged

Make sure path is in native encoding #477

merged 8 commits into from
May 11, 2018

Conversation

jennybc
Copy link
Member

@jennybc jennybc commented May 5, 2018

Fixes #476 CP1251 char set in file name

Fixes #370 read_excel on Windows fails to open a file from a UTF-8 encoded path containing special characters

This is a new problem seen with R 3.5, we believe as a result of measures taken in response to Bug 17120 :

In a non-UTF-8 locale (basically Windows?), path.expand() returns a UTF-8 encoded path, even for non UTF-8 input. readxl pre-processes xls paths via normalizePath(), which calls path.expand(). And then libxls uses fopen(), which expects the path in the native encoding.

We have two choices:

  1. Stop using normalizePath() on xls files. It's not clear to me why we do this. This code goes back to Day 3 of readxl development, in 2015 fc076c4. The history doesn't indicate if there is a reason to use normalizePath(). I explored "stop calling normalizePath()" in 945178d, which passes tests. If we choose this, I will obviously remove more even code vs. comment it out.
  2. Keep using normalizePath() on xls files, but process immediately with enc2native(). Explored in 8a8c061, which also passes tests.

Either way, I think we should consider incorporating the new test. However I decided skip_on_cran() might be wise.

@jennybc jennybc requested review from hadley and jimhester May 5, 2018 02:50
@yutannihilation
Copy link
Member

FYI: using Rf_translateChar() may be another choice. (e.g. https://github.com/tidyverse/readr/pull/838/files)

IMHO, converting Rcpp::String directly to std::string might not do the right thing, as it doesn't care about the encoding, whereas base R provides useful functions that considers the encoding (Rf_translateChar(), Rf_translateChar0() and Rf_translateCharUTF8()). But, I don't know how to do this with Rcpp properly... (I've asked on RStudio Community, but no one has answered so far)

@jimhester
Copy link
Contributor

There is no Rcpp::String created in this case, the results from the R function calls are just SEXP objects and coercing to std::string should just do the equivalent of std::string(CHAR()).

@yutannihilation
Copy link
Member

There is no Rcpp::String created in this case

Thanks for pointing out, I was wrong at this point. I meant, it might be also good if we can use the encoding attribute in C++ codes. But, as it seems difficult, just ignore me. Sorry for the noise...

@jennybc
Copy link
Member Author

jennybc commented May 9, 2018

@hadley Do you have any memory of why xls paths are sent through normalizePath()? But xlsx are not.

@jimhester Which approach do you think I should take?

I now introduce a 3rd option:

We should normalize all paths and do enc2native() on the R side, for xls and xlsx.

Why?
I now see there was a pre-existing issue about xlsx and encoding of the path: #370
Also a new xls one that's probably this: #478

@jimhester
Copy link
Contributor

jimhester commented May 9, 2018

I think doing enc2native(normalizePath()) on the R side for both xls and xlsx seems the best option.

@hadley
Copy link
Member

hadley commented May 9, 2018

I have no recollection of why there is a difference.

@jennybc jennybc changed the title Make sure xls path is in native encoding Make sure path is in native encoding May 11, 2018
[skip ci]
@jennybc
Copy link
Member Author

jennybc commented May 11, 2018

I went with a different option:

  • Add a test with a filename containing non-ASCII characters. Apply it to both xls and xlsx.
  • Stop normalizing xls paths.
  • Apply enc2native() to both xls and xlsx paths on the R side.

@jennybc jennybc merged commit b10a1a8 into master May 11, 2018
@jennybc jennybc deleted the path-encoding branch May 17, 2018 16:12
jennybc added a commit that referenced this pull request Aug 20, 2018
Fixes #498

Prior to b10a1a8,all xls paths were normalized, but we didn't know exactly why. In b10a1a8, I stopped normalizing xls paths, while solving an unrelated path encoding problem presented by R 3.5.

In discussion around #477, @jimhester said:

"I think doing enc2native(normalizePath()) on the R side for both xls and xlsx seems the best option."

I am now taking this wise advice.
jennybc added a commit that referenced this pull request Aug 21, 2018
Fixes #498

Prior to b10a1a8,all xls paths were normalized, but we didn't know exactly why. In b10a1a8, I stopped normalizing xls paths, while solving an unrelated path encoding problem presented by R 3.5.

In discussion around #477, @jimhester said:

"I think doing enc2native(normalizePath()) on the R side for both xls and xlsx seems the best option."

I am now taking this wise advice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants