A small empirical study of basename() and normalizePath() #6

Open
jennybc opened this issue May 11, 2022 · 0 comments

I've been sorting out some filepath encoding issues in vroom and eventually grew desperate enough to make this table.

Anyone interested in this repo might also find it useful. The OS × R version × locale combinations are the ones I can easily lay my hands on or have had to set up for the vroom work:

            | R       |                           | encoding |                | encoding
OS          | version | locale                    | of input | function       | of output
------------+---------+---------------------------+----------+----------------+----------
macOS         4.1.2     en_CA.UTF-8                 UTF-8      basename()       "unknown" (but UTF-8 bytes)
macOS         4.1.2     en_CA.UTF-8                 UTF-8      normalizePath()  UTF-8

windows       4.2.0     English_United States.utf8  UTF-8      basename()       UTF-8
windows       4.2.0     English_United States.utf8  UTF-8      normalizePath()  UTF-8

ubuntu 18.04  4.2.0     C.UTF-8                     UTF-8      basename()       "unknown" (but UTF-8 bytes)
ubuntu 18.04  4.2.0     C.UTF-8                     UTF-8      normalizePath()  UTF-8

windows       4.1.2     English_United States.1252  UTF-8      basename()       UTF-8
windows       4.1.2     English_United States.1252  UTF-8      normalizePath()  UTF-8

ubuntu 18.04  4.2.0     en_US (this is ISO-8859-1)  UTF-8      basename()       "unknown" (but latin1 bytes)
ubuntu 18.04  4.2.0     en_US (this is ISO-8859-1)  UTF-8      normalizePath()  latin1

Things that jump out:

  • basename() appears to re-encode to native on unix, but then marks the string as having "unknown" encoding.
  • basename() retains UTF-8 bytes and encoding mark on Windows, even if UTF-8 is not the native encoding.
  • normalizePath() re-encodes to native on unix and also marks the encoding correctly.
  • normalizePath() retains UTF-8 bytes and encoding mark on Windows, even if UTF-8 is not the native encoding.
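
One possible way to work around the first observation, sketched here rather than taken from the vroom work: if basename() hands back native-encoded bytes with an "unknown" mark, then base R's enc2utf8() should bring the result back to marked UTF-8, because it treats unmarked strings as being in the native encoding. The name basename_utf8() is hypothetical, and the approach assumes the filename is representable in the native encoding in the first place.

```r
# Hypothetical helper (not part of base R or vroom): take the basename and
# re-mark / re-encode the result as UTF-8.
# Assumption: any string marked "unknown" holds bytes in the native encoding,
# which is what the table above suggests for basename() on unix.
basename_utf8 <- function(path) {
  enc2utf8(basename(path))
}
```

In a UTF-8 or latin1 session, Encoding(basename_utf8("b\u00e9.csv")) can be compared with Encoding(basename("b\u00e9.csv")) to see whether the mark is restored. On Windows the call should be a no-op, since basename() already keeps the UTF-8 mark there.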

Here's the code snippet I ran in various places:

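# Record the environment: R version, OS family, locale, native encoding
R.version.string
.Platform$OS.type
Sys.getlocale()
l10n_info()

# A non-ASCII filename; the \u escape makes the input UTF-8 in any locale
filepath <- "b\u00e9.csv"
Encoding(filepath)
charToRaw(filepath)

# Declared encoding and actual bytes after basename()
Encoding(basename(filepath))
charToRaw(basename(filepath))

# Same for normalizePath(); mustWork = FALSE because the file need not exist
Encoding(normalizePath(filepath, mustWork = FALSE))
charToRaw(normalizePath(filepath, mustWork = FALSE))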
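
For repeating this across machines, the checks above could be gathered into one small function that returns a row per result. This is only a sketch: encoding_report() and its column names are made up for illustration, and it uses nothing beyond the base R calls already in the snippet.

```r
# Hypothetical helper: report the declared encoding and the raw bytes of a
# path, of its basename(), and of its normalizePath() result.
encoding_report <- function(path) {
  results <- list(
    "input"           = path,
    "basename()"      = basename(path),
    "normalizePath()" = normalizePath(path, mustWork = FALSE)
  )
  data.frame(
    what     = names(results),
    # Encoding() gives the mark ("unknown", "UTF-8", "latin1", "bytes")
    encoding = vapply(results, Encoding, character(1)),
    # charToRaw() gives the actual bytes, shown here as space-separated hex
    bytes    = vapply(
      results,
      function(x) paste(charToRaw(x), collapse = " "),
      character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}
```

Calling encoding_report("b\u00e9.csv") alongside Sys.getlocale() then gives the pieces needed for one block of the table.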