Encoding store Paths on Windows and Unix #3197

Open
Ericson2314 opened this issue Nov 2, 2019 · 5 comments

@Ericson2314
Member

As #2634 points out, we can share derivations and builds between Windows and Unix machines. That means we cannot just be like Rust, new Python, etc., handle both types of path correctly, and be done with it. We also need to figure out how to put a Windows path in a Unix store and a Unix path in a Windows store.

#2634 handles problems with the path root (/... vs DOS-style C:\... vs UNC \\..\...), so this issue can be just about the encoding. As @conferno points out:

I must explain the constraints that make it difficult to use many imperative languages originating from the Unix world as stdenv.shell.

  1. A path is a UTF-16 string, and this is difficult to avoid, because

    1. we hit the 256-byte length limit (I won't go into details now; there are some ways to work around it)

    2. files with names that cannot be represented as UTF-8 could be on disk, after fetchzip or in the check phase. stdenv.shell must at least be able to cp -r and rm -rf directories that have such files inside.

Thus, internal representation of paths has to be UTF-16, so neither bash+coreutils nor stock perl nor stock lua would work out of the box :(

To start tackling this issue, I would recommend WTF-8 (https://simonsapin.github.io/wtf-8/). Rust uses it too. It can encode any Windows path such that valid Unicode keeps its meaning in both directions, and every path round-trips. It cannot, however, represent Unix paths that are neither UTF-8 nor WTF-8 on Windows. We cannot fix that: Windows uses a fixed-width encoding, so there is no room left to represent anything else. Beyond representing foreign paths, this is a good canonical form to ensure that "normal" Windows paths have the same hash.
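To make the round-trip property concrete, here is a minimal sketch in Python. It assumes that Python's "surrogatepass" error handler is a fair stand-in for a real WTF-8 codec when the input comes from 16-bit code units; the function names are hypothetical and nothing here is Nix code.

```python
from typing import List, Sequence


def utf16_units_to_wtf8(units: Sequence[int]) -> bytes:
    """Encode 16-bit code units (possibly with lone surrogates) as WTF-8."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # "surrogatepass" lets lone surrogates through as code points, while
    # valid surrogate pairs are still combined into one code point.
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode well-formed WTF-8 back into 16-bit code units."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# "a", a valid surrogate pair (U+1F4A9), a lone high surrogate, "b"
name = [0x0061, 0xD83D, 0xDCA9, 0xD800, 0x0062]
encoded = utf16_units_to_wtf8(name)
```

Valid Unicode comes out as ordinary UTF-8, the lone surrogate becomes a three-byte sequence that strict UTF-8 decoders reject, and decoding gives back the original code units.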

We can also normalize path separators, since Windows accepts both.

I suggest we just ban Unix paths that are not well-formed WTF-8. Do any exist already, say in cache.nixos.org?
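A check for such a ban could look roughly like the following Python sketch. The name `is_well_formed_wtf8` is hypothetical, and the logic leans on two assumptions: that Python's strict UTF-8 decoder with "surrogatepass" accepts exactly generalized UTF-8, and that the only extra WTF-8 rule is "no high surrogate directly followed by a low surrogate".

```python
def is_well_formed_wtf8(name: bytes) -> bool:
    """Return True if a Unix file name (raw bytes) is well-formed WTF-8."""
    try:
        # Accepts UTF-8 plus three-byte encodings of surrogate code points.
        text = name.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    # A surrogate pair must be encoded as one four-byte sequence, so a high
    # surrogate immediately followed by a low surrogate is ill-formed.
    for first, second in zip(text, text[1:]):
        if 0xD800 <= ord(first) <= 0xDBFF and 0xDC00 <= ord(second) <= 0xDFFF:
            return False
    return True
```

For example, a Latin-1 encoded name such as `b"caf\xe9"` would be rejected, while a plain ASCII or UTF-8 name passes.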

@Ericson2314
Member Author

Ericson2314 commented Nov 2, 2019

We can also normalize path separators, since Windows accepts both

In UNC-paths only \
And Windows-native programs like dir, del understand only \ in command-line arguments

Is / an allowed character in filenames though? If not, we can just losslessly convert / to \ on the fly.
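As a rough sketch of that on-the-fly conversion in Python: it assumes `/` never appears inside a Windows file name component (Microsoft's naming rules list it as reserved) and that only relative paths are involved, since roots are #2634's problem. Both function names are made up for illustration.

```python
def to_canonical_separators(windows_rel_path: str) -> str:
    """Rewrite a Windows relative path to use '/' between components."""
    return windows_rel_path.replace("\\", "/")


def to_windows_separators(canonical_rel_path: str) -> str:
    """Rewrite a canonical relative path to use '\\' between components."""
    return canonical_rel_path.replace("/", "\\")
```

A path that mixes both separators comes back with `\` only, so the string may change even though the path it names does not. The opposite caveat, that `\` is a legal file name character on Unix, is not handled here.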

The problem is that Lua or bash+coreutils (or whatever is used instead of them) must do _deletePath staying in UTF-16 during recursion, and must preserve the UTF-16 names between FileFile and RemoveDirectory

Per https://www.lua.org/manual/5.3/manual.html#3.1, strings in Lua can contain arbitrary bytes (even including nulls), so WTF-8, or even the original UTF-16 with unpaired surrogates, will work fine, at the cost of confusing literals.

Of course there could be any lossless representation of UTF-16, but it is not part of the interface.

But the canonical form that is hashed is part of the interface. I would hope that "clean" ASCII / Unicode relative paths (assuming something like #2634, where we don't store /nix/store in derivations) have the same hash on both platforms. This will make transferring data and builds without self-references (also important for the intensional store) between Windows and Unix much, much easier.
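To illustrate what "same hash" could mean, here is a small Python sketch under loud assumptions: the canonical form is taken to be WTF-8 bytes with `/` separators, SHA-256 stands in for whatever Nix actually hashes, and all three function names are hypothetical. Unicode normalization (NFC vs NFD) is deliberately left out.

```python
import hashlib


def canonical_from_windows(rel_path_utf16le: bytes) -> bytes:
    """Canonicalize a Windows relative path given as UTF-16LE bytes."""
    text = rel_path_utf16le.decode("utf-16-le", "surrogatepass")
    return text.replace("\\", "/").encode("utf-8", "surrogatepass")


def canonical_from_unix(rel_path: bytes) -> bytes:
    """Unix names are already bytes; assumed here to be well-formed WTF-8."""
    return rel_path


def path_digest(canonical: bytes) -> str:
    """Hash the canonical bytes (SHA-256 is only a placeholder)."""
    return hashlib.sha256(canonical).hexdigest()


windows_side = canonical_from_windows("share\\doc\\r\u00e9sum\u00e9.txt".encode("utf-16-le"))
unix_side = canonical_from_unix("share/doc/r\u00e9sum\u00e9.txt".encode("utf-8"))
```

With this canonical form, a "clean" relative path produces identical bytes, and therefore an identical digest, whichever platform it started on.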

But if you meant that all store paths are valid Unicode (hashed from UTF-8), so UTF-16 is only a concern for the bash + coreutils replacement and not Nix itself, then yes, it is an unexposed implementation detail. :)

@stale

stale bot commented Feb 16, 2021

I marked this as stale due to inactivity.

@stale stale bot added the stale label Feb 16, 2021
@stale

stale bot commented Apr 29, 2022

I closed this issue due to inactivity.

@stale stale bot closed this as completed Apr 29, 2022
@Ericson2314
Member Author

Still interested.

@Ericson2314 Ericson2314 reopened this Apr 29, 2022
@stale stale bot removed the stale label Apr 29, 2022
@stale stale bot added the stale label Oct 30, 2022
@stale stale bot removed the stale label Jan 14, 2024
@Ericson2314
Member Author

#9205 should help with this
