CP1251 char set in file name #476

Closed
yurasmol opened this issue May 3, 2018 · 6 comments
Labels: bug (an unexpected problem or unintended behavior), xls 👵

Comments


yurasmol commented May 3, 2018

This happens on Windows 7 with R 3.5.0 (R 3.4.4 works without errors), using the file in the attached ZIP (I cannot upload .xls files in this form):
ISSUE.zip
library(readxl)
fName <- "ТЕСТ.xls"
RES <- read_excel(fName, sheet=1, col_names=TRUE, col_types=NULL, na=" ", skip=0)
The error message is:
Error in read_fun(path = path, sheet_i = sheet, limits = limits, ...)
Failed to open ТЕСТ.xls

Strangely, the same file with an .xlsx extension is read without errors:
fName <- "ТЕСТ.xlsx"

But I always prefer Excel 2003 .XLS format.

Session Info:
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.1251

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

jennybc (Member) commented May 3, 2018

I believe the root cause of this is a change in the encoding handling of base::normalizePath(). I think this is basically the same problem over in readr: tidyverse/readr#834, tidyverse/readr#837

We are contemplating what to do about this and wondering if the change in base R was intended.

jennybc (Member) commented May 3, 2018

Here is how base::normalizePath() is made available when ingesting xls:

inline std::string normalizePath(std::string path) {
  Rcpp::Environment baseEnv = Rcpp::Environment::base_env();
  Rcpp::Function normalizePath = baseEnv["normalizePath"];
  return Rcpp::as<std::string>(normalizePath(path, "/", true));
}

And here's where it is used (note the UTF-8 you see here pertains to the processing of strings inside the file):

readxl/src/XlsWorkBook.h

Lines 28 to 34 in 66df4b9

  XlsWorkBook(const std::string& path) {
    path_ = normalizePath(path);
    xls::xlsWorkBook* pWB_ = xls::xls_open(path_.c_str(), "UTF-8");
    if (pWB_ == NULL) {
      Rcpp::stop("Failed to open %s", path);
    }

@jimhester Would it make sense to get the path prep out of compiled code altogether and deal with it on the R side?
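As a rough illustration of what "path prep on the R side" might look like, here is a minimal sketch. The helper name `prep_xls_path()` is hypothetical and not part of readxl; it only combines base R's `normalizePath()` and `enc2native()`. Whether re-encoding to the native encoding is enough to fix the failure on R 3.5.0 for Windows is an assumption that has not been verified here.

```r
# Hypothetical helper, NOT part of readxl: normalize the path in R and
# convert the result to the native encoding before it reaches compiled code.
prep_xls_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  # Same arguments as the C++ wrapper: forward slashes, file must exist.
  normalized <- normalizePath(path, winslash = "/", mustWork = TRUE)
  # Assumption: the C library opens files using the native encoding,
  # so a UTF-8-marked path is translated back to it here.
  enc2native(normalized)
}
```

Note that `enc2native()` cannot represent characters outside the native code page, so a sketch like this would still fail for file names that are not expressible in, say, CP1251.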

jennybc added the bug and xls 👵 labels May 3, 2018
yurasmol (Author) commented May 3, 2018 via email

yurasmol (Author) commented May 4, 2018 via email

jennybc (Member) commented May 4, 2018

> For example, cairo_pdf() crashes with an 'Out of memory' error in the same situation.

I'm not sure if that has the same origin or not. Quite a different symptom.

> Why does it not work for *.XLS, but works fine for *.XLSX file extensions?
> Where is the difference?

Internally, we call base::normalizePath() on xls file paths but not on xlsx file paths. I don't have R 3.5 on my Windows VM yet, but I'm going to try to figure out why we do this (it happened before I took over as maintainer). Maybe we don't actually need this call, in which case I can remove it and that will fix the problem. Otherwise, we will fix it directly.

yutannihilation (Member) commented

> We are contemplating what to do about this and wondering if the change in base R was intended.

As I described in tidyverse/readr#834 (comment), I bet this is intended, as the change to path.expand() is stated in the release notes; the path has to be encoded as UTF-8 to allow characters that cannot be represented in the native locale.
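One way to check this explanation on a given R version is to look at how the path string is marked before and after normalization. The sketch below is only illustrative: `path_encoding_report()` is a hypothetical helper, and the result depends on the platform, locale and R version, so no expected output is shown.

```r
# Hypothetical helper (not part of base R or readxl): report the declared
# encoding of a path before and after base::normalizePath().
path_encoding_report <- function(path) {
  normalized <- normalizePath(path, winslash = "/", mustWork = FALSE)
  list(
    input = Encoding(path),
    normalized = Encoding(normalized)
  )
}

# "\u0422\u0415\u0421\u0422" spells the file name from the report (ТЕСТ).
path_encoding_report("\u0422\u0415\u0421\u0422.xls")
```

If the normalized path comes back marked as "UTF-8" in a CP1251 locale, compiled code that passes its bytes straight to a file-opening call would be handing over a name in the wrong encoding.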

And, I think it's readr's and readxl's responsibility to take care of the encoding of the string in C++ functions. See tidyverse/readr#838.

(Sorry, I don't have enough time to explain the details right now... I hope the information above is useful for you!)

lock bot locked and limited conversation to collaborators Oct 10, 2019