Store file names in POSIX format #40

jfontan · 2019-01-15T17:45:38Z

The current implementation stores file names in the index exactly as the program using the library sends them. As a result, the stored format differs between Windows and Linux, and siva files created on one OS cannot be used on the other because the path separator is not the same. For example, Index.Glob did not work correctly because it could not find directory separators.

This change makes old siva files created on Windows incompatible, but files created on a POSIX OS remain compatible.

Now the index header names are written in POSIX format and sanitized with ToSafePath. Also Find and Glob convert input paths with it.

This change makes interoperation between Windows and Linux possible.
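A minimal sketch of this kind of normalization follows. It is not the library's ToSafePath: `toPosixName` is a hypothetical helper, and the exact sanitizing rules (here, collapsing `.` and `..` and dropping a leading slash) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// toPosixName is a hypothetical helper, not the library's ToSafePath.
// It converts Windows separators to "/" and cleans the result so that
// the stored name is relative and cannot climb above the archive root.
// Windows volume names such as "C:" are not handled here.
func toPosixName(name string) string {
	// Use "/" as the only separator, whatever OS produced the name.
	name = strings.ReplaceAll(name, `\`, "/")
	// Cleaning a rooted path resolves "." and ".." without escaping the root.
	name = path.Clean("/" + name)
	// Store the name as a relative path.
	return strings.TrimPrefix(name, "/")
}

func main() {
	fmt.Println(toPosixName(`user\files\test.txt`)) // user/files/test.txt
	fmt.Println(toPosixName(`..\..\etc\passwd`))    // etc/passwd
}
```

Using the `path` package rather than `path/filepath` keeps the behavior identical on every OS, since `path` always treats `/` as the separator.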

Signed-off-by: Javi Fontan <[email protected]>

ajnavarro · 2019-01-16T09:35:43Z

LGTM!

Should we change the siva spec to reflect these changes? https://github.com/src-d/go-siva/blob/master/SPEC.md#index

erizocosmico · 2019-01-16T14:29:56Z

SPEC.md

@@ -52,7 +52,7 @@ at all.
 Each index entry has the following fields:

 * Byte length of the entry name (uint32).
-* Entry name (UTF-8 string).
+* Entry name (UTF-8 string in UNIX format).


UNIX or POSIX?

jfontan · 2019-01-16

I used UNIX because POSIX also covers remote paths (https://en.wikipedia.org/wiki/Path_(computing)#POSIX_pathname_definition), and UNIX is already mentioned for the file mode. I don't have a preference for either.

creachadair · 2019-01-16

Another concern is that normalization affects path encoding, and in particular path comparison. That's not something this PR needs to address, but it's something to keep in mind for any storage format.

jfontan · 2019-01-16

Both Find and Glob, which search the file index, apply the same normalization to their input, so maybe this is enough:

Add user\files\test.txt

It is normalized and saved as user/files/test.txt

Call Glob("user\\files\\*.txt"); the pattern is converted to user/files/*.txt, so the file is found

@creachadair is this what you mean?

creachadair · 2019-01-16

That's part of it. I also mean normalization at the Unicode level: e.g., é can be represented as either LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT (decomposed) or as LATIN SMALL LETTER E WITH ACUTE (composed). The two are equivalent but not byte-for-byte equal in UTF-8 (\x65\xcc\x81 vs. \xc3\xa9).

Worse, some filesystems (notably Apple's HFS family) require a particular normalization and will canonicalize paths when they are opened. So if you are storing and querying paths, you need to either decode both the needle and the haystack and compare rune-by-rune, or remember which encoding you used originally so you can canonicalize the needle at query time. (The former is easy to describe but hard to do efficiently.)

[edit: fix typo]

jfontan · 2019-01-23T08:58:46Z

@smola, @mcuadros friendly ping

jfontan added 2 commits January 15, 2019 18:05

Make ToSafePath work the same in Unix and Windows

83321a8

Signed-off-by: Javi Fontan <[email protected]>

Store index names in posix format

adc5826

Now the index header names are written in POSIX format and sanitized with ToSafePath. Also Find and Glob convert input paths with it.

This change makes interoperation between Windows and Linux possible.

Signed-off-by: Javi Fontan <[email protected]>

jfontan changed the title ~~[WIP] Store file names in POSIX format~~ Store file names in POSIX format Jan 15, 2019

smola requested review from smola and mcuadros January 15, 2019 18:29

Change spec to specify index entry name format

4670656

Signed-off-by: Javi Fontan <[email protected]>

erizocosmico reviewed Jan 16, 2019

erizocosmico approved these changes Jan 16, 2019

ajnavarro approved these changes Jan 16, 2019

jfontan mentioned this pull request Jan 23, 2019

Does not work on Windows src-d/go-billy-siva#29

Closed

mcuadros merged commit 36b4509 into src-d:master Jan 28, 2019

jfontan mentioned this pull request Feb 11, 2019

Windows compatibility src-d/go-billy-siva#35

Merged
