Store file names in POSIX format #40

Merged
merged 3 commits into from
Jan 28, 2019

Conversation

jfontan
Contributor

@jfontan jfontan commented Jan 15, 2019

The current implementation stores file names in the index exactly as the calling program passes them, so the names stored on Windows and on Linux differ. Because the path separator is not the same, siva files created on one OS cannot be used reliably on another. For example, Index.Glob did not work correctly because it could not find the directory separators.

This change makes existing siva files created on Windows incompatible; files created on a POSIX OS remain compatible.

Index header names are now written in POSIX format and sanitized with ToSafePath. Find and Glob also convert their input paths with it.

This change makes interoperation between Windows and Linux possible.
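
As a rough illustration of the idea (not the library's actual ToSafePath, whose behavior is not shown here), a name can be brought to POSIX form by replacing backslashes and cleaning the result. The helper name `toPosixName` is hypothetical, and the sketch assumes that a backslash is always meant as a separator.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// toPosixName is a hypothetical helper for illustration only; it is NOT the
// library's ToSafePath. It converts a name to a relative, slash-separated path.
func toPosixName(name string) string {
	// filepath.ToSlash only rewrites the host OS separator, so on Linux it
	// would leave backslashes untouched. Replace them explicitly instead.
	slashed := strings.ReplaceAll(name, `\`, "/")

	// Rooting the path before cleaning makes path.Clean drop any leading
	// ".." elements, so the stored name cannot escape the archive root.
	cleaned := path.Clean("/" + slashed)

	// Store the name as a relative path.
	return strings.TrimPrefix(cleaned, "/")
}

func main() {
	fmt.Println(toPosixName(`user\files\test.txt`)) // user/files/test.txt
	fmt.Println(toPosixName("user/files/test.txt")) // user/files/test.txt
}
```

One trade-off of this approach: on POSIX systems a backslash is a legal character in a file name, so such names would be rewritten as well.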


Signed-off-by: Javi Fontan <[email protected]>
@jfontan jfontan changed the title [WIP] Store file names in POSIX format Store file names in POSIX format Jan 15, 2019
@smola smola requested review from smola and mcuadros January 15, 2019 18:29
@ajnavarro
Contributor

LGTM!

Should we update the siva spec to reflect these changes? https://github.com/src-d/go-siva/blob/master/SPEC.md#index

@@ -52,7 +52,7 @@ at all.
Each index entry has the following fields:

* Byte length of the entry name (uint32).
-* Entry name (UTF-8 string).
+* Entry name (UTF-8 string in UNIX format).

UNIX or POSIX?

Contributor Author

I used UNIX because the POSIX pathname definition also covers remote paths (https://en.wikipedia.org/wiki/Path_(computing)#POSIX_pathname_definition), and the spec already mentions UNIX for the file mode. I don't have a preference for either.

Contributor

Another concern is that normalization affects path encoding, and in particular path comparison. That's not something this PR needs to address, but it's something to keep in mind for any storage format.

Contributor Author

Find and Glob, the two calls that search the file index, apply the same normalization to their input, so maybe this is enough:

  • Add user\files\test.txt
  • It's normalized and saved as user/files/test.txt
  • Call Glob("user\\files\\*.txt"), which is converted to user/files/*.txt, so the file is found

@creachadair is this what you mean?
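
The three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `normalizeSeparators` and `globNames` are hypothetical stand-ins, not the library's Index.Glob, and the index is modeled as a plain slice of already-normalized names.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// normalizeSeparators is a hypothetical stand-in for the library's
// normalization step: it only turns backslashes into forward slashes.
func normalizeSeparators(p string) string {
	return strings.ReplaceAll(p, `\`, "/")
}

// globNames is a hypothetical sketch of a glob over stored names. It applies
// the same normalization to the pattern that was applied to the names.
func globNames(names []string, pattern string) ([]string, error) {
	pattern = normalizeSeparators(pattern)

	var matches []string
	for _, name := range names {
		// path.Match always uses "/" as the separator, on every OS.
		ok, err := path.Match(pattern, name)
		if err != nil {
			return nil, err
		}
		if ok {
			matches = append(matches, name)
		}
	}
	return matches, nil
}

func main() {
	// Names are normalized once, when they are added to the index.
	index := []string{
		normalizeSeparators(`user\files\test.txt`),
		normalizeSeparators(`user\other\notes.txt`),
	}

	found, err := globNames(index, `user\files\*.txt`)
	if err != nil {
		panic(err)
	}
	fmt.Println(found) // [user/files/test.txt]
}
```

A side effect worth noting: path.Match treats a backslash as an escape character, so converting every backslash in the pattern means escapes can no longer be expressed.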

Contributor

@creachadair creachadair Jan 16, 2019

> @creachadair is this what you mean?

That's part of it. I also mean normalization at the Unicode level: e.g., é can be represented as either LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT (decomposed) or as LATIN SMALL LETTER E WITH ACUTE (composed). The two are equivalent but not byte-for-byte equal in UTF-8 (\x65\xcc\x81 for the decomposed form vs. \xc3\xa9 for the composed form).

Worse, some filesystems (notably Apple's HFS family) require a particular normalization and will canonicalize paths when they are opened. So if you are storing and querying paths, you need to either decode both the needle and the haystack and compare rune-by-rune, or remember which encoding you used originally so you can canonicalize the needle at query time. (The former is easy to describe but hard to do efficiently.)

[edit: fix typo]
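
To make the byte-level difference concrete, here is a small standard-library sketch. It only demonstrates the mismatch and does not normalize anything; the helper `describe` is a hypothetical name. For actual normalization, the golang.org/x/text/unicode/norm package provides NFC and NFD forms, which this sketch does not use.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that reports the UTF-8 byte length and
// the number of runes (code points) in s.
func describe(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	composed := "caf\u00e9"    // LATIN SMALL LETTER E WITH ACUTE
	decomposed := "cafe\u0301" // "e" followed by COMBINING ACUTE ACCENT

	// Both strings render the same, but their encodings differ.
	fmt.Printf("% x\n", composed)   // 63 61 66 c3 a9
	fmt.Printf("% x\n", decomposed) // 63 61 66 65 cc 81

	// A plain string comparison is byte-wise, so an index lookup keyed on one
	// form would miss a name stored in the other form.
	fmt.Println(composed == decomposed) // false

	fmt.Println(describe(composed))   // 5 4
	fmt.Println(describe(decomposed)) // 6 5
}
```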

@jfontan
Copy link
Contributor Author

jfontan commented Jan 23, 2019

@smola, @mcuadros friendly ping

@mcuadros mcuadros merged commit 36b4509 into src-d:master Jan 28, 2019