Rework filepath (re-)encoding #438

jennybc · 2022-05-08T22:08:05Z

This PR basically reverts #434, which was proving to be an untenable basis for truly fixing things on linux, in a non UTF-8 locale. I don't really think that situation is actually important for real usage; I'm using it more as a way to make sure that the filepath handling works by definition and not just by coincidence.

If you can't beat 'em, join 'em.

This PR embraces UTF-8 encoding of file paths, then re-encodes the path to the native encoding just prior to any call to fopen() or mio::make_mmap_source() (on a non-Windows OS). (The relevant path gymnastics for Windows were already here, mercifully.)

Review of basic facts:

vroom needs to do file I/O, both itself and via mio. In general, there are places where we need to know that we've got a filepath in the native encoding.
cpp11 is determined to translate every string it touches into UTF-8. This can be (sort of) visible / explicit, such as when you call as_cpp(), but can also be quite invisible / implicit. It's extremely difficult to predict, prevent, or reverse this. I've decided to remove any uncertainty about how a path is currently encoded by making it UTF-8 (almost) everywhere.

This PR also addresses file writing and reading fixed width files.

Still having to skip some tests due to r-lib/archive#75.

Sidebar: In my explorations of other solutions, I think I've seen an example where cpp11 is marking a string as UTF-8, but the bytes are actually latin1. However, that's a matter to nail down and report upstream in cpp11. But it's part of the motivation for approach taken in this PR.

DavisVaughan

I think the high level feel of this is right to me, and better supports the "UTF-8 everywhere" idea that you always live in a UTF-8 world except right before you pass off to a 3rd party library or something like fopen().

I have a small feeling we may discover more places where we may need to add enc2utf8() on the R side to support this, but I feel more confident in what we are doing on the C++ side for sure

R/path.R

jennybc · 2022-05-11T21:38:26Z

Full analysis of file path handling in vroom. The encodings below should reflect reality with this PR (although I'm about to add some more enc2utf8()!).

Filepaths are handed from R to C++ and back again many, many times and pass through both base R's file path functions (like basename() and normalizePath()) and through cpp11's facilities. Overall, without explicit management, it's extremely difficult to know how a path is encoded at various points in execution.

This analysis also resulted in a small empirical study of basename() and normalizePath(): gaborcsardi/rencfaq#6.

The complexity seen below is why this PR implements UTF-8 (almost) everywhere.

The `vroom()` R function

encoding of   |         |
the filepath, | where   |
whatever it's | are     |
called        | we?     | code
--------------+---------+-------------------------------------------------------
                R        vroom(file, ...) {
who knows?                 ...
UTF-8                      file <- standardise_path(file)
                           ...
UTF-8           R -> C++   vroom_(file, ...)
                           ...
                         }

encoding of   |         |
the filepath, | where   |
whatever it's | are     |
called        | we?     | code
--------------+---------+-------------------------------------------------------
who knows?      R        standardise_path(path) {
                           ...
                           # ultimately returns
UTF-8                      as.list(enc2utf8(path))
                         }

Digging in to the C++ `vroom_()` function

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++            SEXP vroom_(const cpp11::list& inputs, ...)
                                 auto idx = std::make_shared<vroom::index_collection>(inputs, ...)
                                   index_collection(const cpp11::list& in, ...)
                                     make_delimited_index(in[i], ...)
                                     make_delimited_index(const cpp11::sexp& in, ...)
                                       ...
                                       auto standardise_one_path = cpp11::package("vroom")["standardise_one_path"];
UTF-8           C++ > R > C++          auto x = standardise_one_path(in);
                                       auto filename = cpp11::as_cpp<std::string>(x);
                                       return std::make_shared<vroom::delimited_index>(filename.c_str(), ...)

encoding of   |            |
the filepath, | where      |
whatever it's | are        |
called        | we?        | code
--------------+------------+----------------------------------------------------
UTF-8           R            standardise_one_path(path) {
                               ...
who knows?                     p <- split_path_ext(basename(path))
                               ...
who knows?                     path <- normalizePath(path, ...)
                               ...
UTF-8                          path <- enc2utf8(path)
                R > C++ > R    if (!has_trailing_newline(path)) {
                                 file(path)
                               } else {
                                 path
                               }
                             }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             bool has_trailing_newline(const cpp11::strings& filename) {
                                  unicode_fopen(CHAR(filename[0]), "rb");
                                  ...
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             FILE* unicode_fopen(const char* path, const char* mode) {
                                  #ifdef _WIN32
                                    // imagine a whole lot of Windows wide character gymnastics
UTF-16                              MultiByteToWideChar(CP_UTF8, 0, path, -1, buf, len);
                                    out = _wfopen(buf, mode_w);
                                  #else
"native"                            const char* native_path = Rf_translateChar(cpp11::r_string(path));
                                    out = fopen(native_path, mode);
                                  #endif
                                  return out;
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
                                file(description, ...) {
                                    # File paths
                                    # In most cases these are translated to the native encoding.

                                    # The exceptions are file and pipe on Windows, where a description which is
                                    # marked as being in UTF-8 is passed to Windows as a ‘wide’ character
                                    # string. This allows files with names not in the native encoding to be
                                    # opened on file systems which use Unicode file names (such as NTFS but
                                    # not FAT32).
                                }

Digging in to the C++ constructor for `vroom::delimited_index`

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           C++             delimited_index::delimited_index(const char* filename, ...):
                                  filename_(filename), ... {
                                    ...
                                    std::unique_ptr<multi_progress> pb = nullptr;

                                    if (progress_) {
                                      #ifndef VROOM_STANDALONE
                C++ > R > C++           auto format = get_pb_format("file", filename);
                                        auto width = get_pb_width(format);
                                        pb = std::unique_ptr<multi_progress>(
                                          new multi_progress(format, file_size, width));
                                        pb->tick(start);
                                      #endif
                                    }
                                    ...
                                  }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8                           get_pb_format(const std::string& which, const std::string& filename = "") {
                                  auto fun_name = std::string("pb_") + which + "_format";
                C++ > R > C++     auto fun = cpp11::package("vroom")[fun_name.c_str()];
                                  return cpp11::as_cpp<std::string>(fun(filename));
                                }

encoding of   |               |
the filepath, | where         |
whatever it's | are           |
called        | we?           | code
--------------+---------------+-------------------------------------------------
UTF-8           R               pb_file_format(filename) {
                                  ...
who knows?                        basename(filename)
                                  ...
                                }

This guards against the scenario where the tempdir's path has non-ascii characters in it. Presumably that could arise on, say, Windows if the user name has non-ascii characters: > tempdir() [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\Rtmpg30qBQ"

Everytime we use a base R path-handling function, explicitly re-encode the result as UTF-8.

Skip this test in non-UTF-8 locales for now

…rywhere

jennybc added 2 commits May 8, 2022 15:18

Revert filepath reencoding

76b93e6

Unskip this test

8e821c0

jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 78c6983 to 8e821c0 Compare May 8, 2022 22:19

Re-encode to native just prior to fopen() or mio::make_mmap_source()

4317c8a

jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 68602ec to 4317c8a Compare May 8, 2022 22:54

jennybc added 2 commits May 8, 2022 16:11

Expose the bytes

000790d

Unconditionally encode paths as UTF-8

bdecf47

jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 3ded2a1 to bdecf47 Compare May 8, 2022 23:41

jennybc changed the title ~~Yet another filepath encoding experiment~~ Rework filepath (re-)encoding May 8, 2022

Clean up comments, remove print debugging

1023b1c

jennybc force-pushed the yet-another-filepath-encoding-experiment branch from 7c90b19 to 1023b1c Compare May 9, 2022 00:32

jennybc mentioned this pull request May 9, 2022

Read file paths with non-ascii characters in a non UTF-8 locale #437

Closed

jennybc marked this pull request as ready for review May 9, 2022 01:21

jennybc requested a review from DavisVaughan May 9, 2022 01:22

DavisVaughan approved these changes May 9, 2022

View reviewed changes

R/path.R Show resolved Hide resolved

R/path.R Outdated Show resolved Hide resolved

jennybc added 11 commits May 11, 2022 15:10

Add more enc2utf8()

5792653

Everytime we use a base R path-handling function, explicitly re-encode the result as UTF-8.

Inline the reference object

114c671

Test writing to a non-ascii path

8f180e0

Test read/write of .zip with non-ascii characters in path

004f42c

Rewrite this test

9777c2e

I now think the problem IS in archive, for Windows and unix

a9436d4

Skip this test in non-UTF-8 locales for now

Add a test re: reading fwf from non-ascii filepath

4be132c

Tweaks NEWS

e4a6321

Make versions of basename() and normalizePath() that uphold UTF-8 eve…

432b918

…rywhere

Merge branch 'main' into yet-another-filepath-encoding-experiment

833e193

jennybc merged commit 85143f7 into main May 17, 2022

jennybc deleted the yet-another-filepath-encoding-experiment branch May 17, 2022 14:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework filepath (re-)encoding #438

Rework filepath (re-)encoding #438

jennybc commented May 8, 2022 •

edited

Loading

DavisVaughan left a comment

jennybc commented May 11, 2022

Rework filepath (re-)encoding #438

Rework filepath (re-)encoding #438

Conversation

jennybc commented May 8, 2022 • edited Loading

DavisVaughan left a comment

Choose a reason for hiding this comment

jennybc commented May 11, 2022

The vroom() R function

Digging in to the C++ vroom_() function

Digging in to the C++ constructor for vroom::delimited_index

jennybc commented May 8, 2022 •

edited

Loading

The `vroom()` R function

Digging in to the C++ `vroom_()` function

Digging in to the C++ constructor for `vroom::delimited_index`