Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework filepath (re-)encoding #438

Merged
merged 17 commits into from
May 17, 2022
Merged

Conversation

jennybc
Copy link
Member

@jennybc jennybc commented May 8, 2022

This PR basically reverts #434, which was proving to be an untenable basis for truly fixing things on linux, in a non UTF-8 locale. I don't really think that situation is actually important for real usage; I'm using it more as a way to make sure that the filepath handling works by definition and not just by coincidence.

If you can't beat 'em, join 'em.

This PR embraces UTF-8 encoding of file paths, then re-encodes the path to the native encoding just prior to any call to fopen() or mio::make_mmap_source() (on a non-Windows OS). (The relevant path gymnastics for Windows were already here, mercifully.)

Review of basic facts:

  • vroom needs to do file I/O, both itself and via mio. In general, there are places where we need to know that we've got a filepath in the native encoding.
  • cpp11 is determined to translate every string it touches into UTF-8. This can be (sort of) visible / explicit, such as when you call as_cpp(), but can also be quite invisible / implicit. It's extremely difficult to predict, prevent, or reverse this. I've decided to remove any uncertainty about how a path is currently encoded by making it UTF-8 (almost) everywhere.

This PR also addresses file writing and reading fixed width files.

Still having to skip some tests due to r-lib/archive#75.

Sidebar: In my explorations of other solutions, I think I've seen an example where cpp11 is marking a string as UTF-8, but the bytes are actually latin1. However, that's a matter to nail down and report upstream in cpp11. But it's part of the motivation for approach taken in this PR.

@jennybc jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 78c6983 to 8e821c0 Compare May 8, 2022 22:19
@jennybc jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 68602ec to 4317c8a Compare May 8, 2022 22:54
@jennybc jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 3ded2a1 to bdecf47 Compare May 8, 2022 23:41
@jennybc jennybc changed the title Yet another filepath encoding experiment Rework filepath (re-)encoding May 8, 2022
@jennybc jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 7c90b19 to 1023b1c Compare May 9, 2022 00:32
@jennybc jennybc marked this pull request as ready for review May 9, 2022 01:21
@jennybc jennybc requested a review from DavisVaughan May 9, 2022 01:22
Copy link
Member

@DavisVaughan DavisVaughan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the high level feel of this is right to me, and better supports the "UTF-8 everywhere" idea that you always live in a UTF-8 world except right before you pass off to a 3rd party library or something like fopen().

I have a small feeling we may discover more places where we may need to add enc2utf8() on the R side to support this, but I feel more confident in what we are doing on the C++ side for sure

R/path.R Show resolved Hide resolved
R/path.R Outdated Show resolved Hide resolved
@jennybc
Copy link
Member Author

jennybc commented May 11, 2022

Full analysis of file path handling in vroom. The encodings below should reflect reality with this PR (although I'm about to add some more enc2utf8()!).

Filepaths are handed from R to C++ and back again many, many times and pass through both base R's file path functions (like basename() and normalizePath()) and through cpp11's facilities. Overall, without explicit management, it's extremely difficult to know how a path is encoded at various points in execution.

This analysis also resulted in a small empirical study of basename() and normalizePath(): gaborcsardi/rencfaq#6.

The complexity seen below is why this PR implements UTF-8 (almost) everywhere.

The vroom() R function

encoding of   |         |
the filepath, | where   |
whatever it's | are     |
called        | we?     | code
--------------+---------+-------------------------------------------------------
                R        vroom(file, ...) {
who knows?                 ...
UTF-8                      file <- standardise_path(file)
                           ...
UTF-8           R -> C++   vroom_(file, ...)
                           ...
                         }

encoding of   |         |
the filepath, | where   |
whatever it's | are     |
called        | we?     | code
--------------+---------+-------------------------------------------------------
who knows?      R        standardise_path(path) {
                           ...
                           # ultimately returns
UTF-8                      as.list(enc2utf8(path))
                         }

Digging in to the C++ vroom_() function

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++            SEXP vroom_(const cpp11::list& inputs, ...)
                                 auto idx = std::make_shared<vroom::index_collection>(inputs, ...)
                                   index_collection(const cpp11::list& in, ...)
                                     make_delimited_index(in[i], ...)
                                     make_delimited_index(const cpp11::sexp& in, ...)
                                       ...
                                       auto standardise_one_path = cpp11::package("vroom")["standardise_one_path"];
UTF-8           C++ > R > C++          auto x = standardise_one_path(in);
                                       auto filename = cpp11::as_cpp<std::string>(x);
                                       return std::make_shared<vroom::delimited_index>(filename.c_str(), ...)

encoding of   |            |
the filepath, | where      |
whatever it's | are        |
called        | we?        | code
--------------+------------+----------------------------------------------------
UTF-8           R            standardise_one_path(path) {
                               ...
who knows?                     p <- split_path_ext(basename(path))
                               ...
who knows?                     path <- normalizePath(path, ...)
                               ...
UTF-8                          path <- enc2utf8(path)
                R > C++ > R    if (!has_trailing_newline(path)) {
                                 file(path)
                               } else {
                                 path
                               }
                             }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             bool has_trailing_newline(const cpp11::strings& filename) {
                                  unicode_fopen(CHAR(filename[0]), "rb");
                                  ...
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             FILE* unicode_fopen(const char* path, const char* mode) {
                                  #ifdef _WIN32
                                    // imagine a whole lot of Windows wide character gymnastics
UTF-16                              MultiByteToWideChar(CP_UTF8, 0, path, -1, buf, len);
                                    out = _wfopen(buf, mode_w);
                                  #else
"native"                            const char* native_path = Rf_translateChar(cpp11::r_string(path));
                                    out = fopen(native_path, mode);
                                  #endif
                                  return out;
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
                                file(description, ...) {
                                    # File paths
                                    # In most cases these are translated to the native encoding.

                                    # The exceptions are file and pipe on Windows, where a description which is
                                    # marked as being in UTF-8 is passed to Windows as a ‘wide’ character
                                    # string. This allows files with names not in the native encoding to be
                                    # opened on file systems which use Unicode file names (such as NTFS but
                                    # not FAT32).
                                }

Digging in to the C++ constructor for vroom::delimited_index

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             delimited_index::delimited_index(const char* filename, ...):
                                  filename_(filename), ... {
                                    ...
                                    std::unique_ptr<multi_progress> pb = nullptr;

                                    if (progress_) {
                                      #ifndef VROOM_STANDALONE
                C++ > R > C++           auto format = get_pb_format("file", filename);
                                        auto width = get_pb_width(format);
                                        pb = std::unique_ptr<multi_progress>(
                                          new multi_progress(format, file_size, width));
                                        pb->tick(start);
                                      #endif
                                    }
                                    ...
                                  }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8                           get_pb_format(const std::string& which, const std::string& filename = "") {
                                  auto fun_name = std::string("pb_") + which + "_format";
                C++ > R > C++     auto fun = cpp11::package("vroom")[fun_name.c_str()];
                                  return cpp11::as_cpp<std::string>(fun(filename));
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           R               pb_file_format(filename) {
                                  ...
who knows?                        basename(filename)
                                  ...
                                }

jennybc added 11 commits May 11, 2022 15:10
This guards against the scenario where the tempdir's path has non-ascii characters in it.

Presumably that could arise on, say, Windows if the user name has non-ascii characters:

> tempdir()
[1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\Rtmpg30qBQ"
Everytime we use a base R path-handling function, explicitly re-encode the result as UTF-8.
Skip this test in non-UTF-8 locales for now
@jennybc jennybc merged commit 85143f7 into main May 17, 2022
@jennybc jennybc deleted the yet-another-filepath-encoding-experiment branch May 17, 2022 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants