Store file names in POSIX format #40
Signed-off-by: Javi Fontan <[email protected]>
Now the index header names are written in POSIX format and sanitized with ToSafePath. Also, Find and Glob convert input paths with it. This change makes interoperation between Windows and Linux possible. Signed-off-by: Javi Fontan <[email protected]>
LGTM! Shall we change the siva spec to reflect these changes? https://github.com/src-d/go-siva/blob/master/SPEC.md#index
@@ -52,7 +52,7 @@ at all.
 Each index entry has the following fields:

 * Byte length of the entry name (uint32).
-* Entry name (UTF-8 string).
+* Entry name (UTF-8 string in UNIX format).
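To make the two fields in this hunk concrete, here is a minimal sketch of how a length-prefixed entry name could be serialized. The `encodeName` helper is hypothetical and not part of go-siva, the big-endian byte order is an assumption of this sketch, and the remaining index entry fields are left out.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"strings"
)

// encodeName is a hypothetical helper. It writes the byte length of the
// name as a uint32, followed by the name itself with forward slashes.
// Big-endian byte order is an assumption of this sketch.
func encodeName(name string) []byte {
	// Store the name with UNIX separators regardless of the host OS.
	name = strings.ReplaceAll(name, `\`, "/")

	var length [4]byte
	binary.BigEndian.PutUint32(length[:], uint32(len(name)))

	var buf bytes.Buffer
	buf.Write(length[:])
	buf.WriteString(name)
	return buf.Bytes()
}

func main() {
	// Prints the 4 length bytes followed by the bytes of "dir/a.txt".
	fmt.Printf("% x\n", encodeName(`dir\a.txt`))
}
```

Note that the length counts bytes, not characters, so a name with non-ASCII characters has a length larger than its visible character count.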
UNIX or POSIX?
I've used UNIX because POSIX also mentions remote paths (https://en.wikipedia.org/wiki/Path_(computing)#POSIX_pathname_definition) and UNIX is already mentioned for the file mode. I don't have a preference for either of them.
Another concern is that normalization affects path encoding, and in particular path comparison. That's not something this PR needs to address, but it's something to keep in mind for any storage format.
Both `Find` and `Glob`, which search the file index, apply the same normalization to their input, so maybe this is enough:
- Add `user\files\test.txt`.
- It's normalized and saved as `user/files/test.txt`.
- Call `Glob("user\\files\\*.txt")`; the pattern is converted to `user/files/*.txt`, so the file is found.
@creachadair is this what you mean?
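The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `toPosix` is a hypothetical stand-in for `ToSafePath` (it just rewrites backslashes and cleans the path), and `globNames` is a hypothetical helper built on the standard `path.Match`, not the library's `Glob`.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// toPosix is a hypothetical stand-in for ToSafePath: it rewrites
// backslashes to forward slashes and cleans the result.
func toPosix(p string) string {
	return path.Clean(strings.ReplaceAll(p, `\`, "/"))
}

// globNames returns the stored names that match pattern, after the
// pattern has been normalized the same way as the stored names.
func globNames(names []string, pattern string) ([]string, error) {
	pattern = toPosix(pattern)
	var out []string
	for _, n := range names {
		ok, err := path.Match(pattern, n)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	// Steps 1 and 2: the name is normalized before it is stored.
	index := []string{toPosix(`user\files\test.txt`)}

	// Step 3: the query uses Windows separators and still matches.
	matches, err := globNames(index, `user\files\*.txt`)
	fmt.Println(matches, err)
}
```

One consequence of this approach is that a backslash in a pattern can no longer be used to escape a glob metacharacter, because it is rewritten to a separator before matching.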
> @creachadair is this what you mean?

That's part of it. I also mean normalization at the Unicode level: e.g., é can be represented as either LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT (decomposed) or as LATIN SMALL LETTER E WITH ACUTE (composed). The two are equivalent but not byte-for-byte equal in UTF-8 (\x65\xcc\x81 vs. \xc3\xa9, respectively).
Worse, some filesystems (notably Apple's HFS family) require a particular normalization and will canonicalize paths when they are opened. So if you are storing and querying paths, you need to either decode both the needle and the haystack and compare rune-by-rune, or remember which encoding you used originally so you can canonicalize the needle at query time. (The former is easy to describe but hard to do efficiently.)
[edit: fix typo]
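A small sketch of the mismatch described above, using only the standard library. The `sameBytes` helper is hypothetical and stands in for whatever plain string comparison an index lookup would perform. Canonicalizing both sides before comparing would need a normalization step, for example `norm.NFC.String` from `golang.org/x/text/unicode/norm`, which is not used here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sameBytes reports whether two names are byte-for-byte equal, which is
// what a plain string comparison of stored names checks.
func sameBytes(a, b string) bool {
	return a == b
}

func main() {
	composed := "\u00e9"    // LATIN SMALL LETTER E WITH ACUTE
	decomposed := "e\u0301" // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

	fmt.Printf("% x\n", composed)   // c3 a9
	fmt.Printf("% x\n", decomposed) // 65 cc 81

	// One rune versus two runes, so even a rune-by-rune comparison
	// fails unless both strings are normalized first.
	fmt.Println(utf8.RuneCountInString(composed), utf8.RuneCountInString(decomposed))

	// The names render identically but do not compare equal.
	fmt.Println(sameBytes(composed, decomposed)) // false
}
```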
The current implementation stores file names in the index exactly as the calling program supplies them, so the stored format differs between Windows and Linux. Because the path separator is not the same, siva files created on one OS cannot be used reliably on the other. For example, `Index.Glob` did not work correctly because it could not find the directory separators. This change makes old siva files created on Windows incompatible; files created on a POSIX OS remain compatible.

Now the index header names are written in POSIX format and sanitized with `ToSafePath`. `Find` and `Glob` also convert their input paths with it. This change makes interoperation between Windows and Linux possible.