New path handling module that uses Nim's type system, using Path instead of string #54

Closed
timotheecour opened this issue Jul 18, 2018 · 3 comments · Fixed by nim-lang/Nim#20582

Comments

@timotheecour
Member

@Araq in https://github.com/nim-lang/Nim/issues/8268#issuecomment-405817765 suggested:

You know... I think we need type Path = distinct string; type Filename = distinct string; type FileExt = distinct string and a new path handling module that uses Nim's type system.
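A minimal sketch of what the quoted suggestion could look like, assuming nothing beyond the three type names it mentions. The procs `readConfig` and the `$` overload are hypothetical and exist only for illustration; this is not an existing module.

```nim
# Hypothetical sketch: only the three type names come from the suggestion above.
type
  Path = distinct string
  Filename = distinct string
  FileExt = distinct string

# A distinct type has no operations until they are provided explicitly.
proc `$`(p: Path): string = string(p)

proc readConfig(location: Path): string =
  ## Stand-in for an API that takes a path rather than file contents.
  "would read " & $location

let location = Path("/etc/hosts")
let contents = "127.0.0.1 localhost"

doAssert readConfig(location) == "would read /etc/hosts"
# Passing a plain string where a Path is expected does not type-check.
doAssert not compiles(readConfig(contents))
```

The last line is the point of the exercise: file contents and file locations can no longer be swapped by accident.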

Let me expand on this a bit with other ideas, to open a discussion.

example usage

import ospaths2

let path = "/tmp//foo.txt".Path # cross platform, works on posix, windows, etc (no allocation)
when defined(posix):
  doAssert path.internal == "/tmp//foo.txt" # just returns internal representation (no allocation)
  doAssert $path == "/tmp/foo.txt" # note: `//` got normalized to `/`
when defined(windows):
  doAssert path.internal == "/tmp/foo.txt"
  doAssert $path == r"C:\tmp\foo.txt"
  doAssert path == r"C:\TMP\\foo.txt" # platform specific path compare (note the case insensitivity and `\\`)

# revamp functions that accept paths to be type-safe, self-documenting, and avoid confusion between string params (eg file contents) and path params
copyFile($"/tmp/foo.txt", $"/tmp/foo2.txt") # ok
copyFile("/tmp/foo.txt", "/tmp/foo2.txt") # ok via implicit conversion

related work

benefits

  • more type safety: eg, we can have == overloaded to be platform specific path comparison
  • self-documenting APIs, better encapsulation
  • avoid confusion between string params (eg file contents) and path params
  • cross platform paths, eg this could work on windows: "/tmp/foo.txt".Path
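To make the first benefit concrete, here is a rough sketch of a platform-specific `==` on a distinct `Path`. The `canonical` helper and the `isWin` constant are hypothetical names, and the rules (collapse repeated separators, ignore ASCII case on Windows) are simplifying assumptions; drive prefixes and 8.3 names are not handled.

```nim
import std/strutils

type Path = distinct string

const isWin = defined(windows)  # hypothetical helper constant

proc canonical(p: Path): string =
  ## Assumed comparison form: separators collapsed to a single '/',
  ## ASCII case folded on Windows only.
  var prevSep = false
  for c in string(p):
    let isSep = c == '/' or (isWin and c == '\\')
    if isSep:
      if not prevSep:
        result.add '/'
    else:
      result.add(if isWin: toLowerAscii(c) else: c)
    prevSep = isSep

proc `==`(a, b: Path): bool =
  ## Path comparison that ignores differences the platform ignores.
  canonical(a) == canonical(b)

doAssert Path("/tmp//foo.txt") == Path("/tmp/foo.txt")
doAssert Path("/tmp/foo.txt") != Path("/tmp/bar.txt")
```

Note that this version allocates two temporary strings per comparison, which touches the "which operations allocate?" question below; a real implementation could compare in place.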

cons

see arguments from Walter Bright here: https://forum.dlang.org/post/[email protected]

  • cannot hope to duplicate the rich interface available for strings
  • APIs that deal with filenames take strings and return strings, not Path objects. Your code gets littered with path and filename components that are sometimes Paths and sometimes strings and sometimes both
  • People like writing paths as "/etc/hosts", not Path("/etc/hosts"). People will not stand for a Path constructor that winds up allocating memory so it can rewrite the string in a canonical path representation.
  • There really isn't any such thing as a portable path representation. It's more than just \ vs /. There are the drive prefixes in Windows that have no analog in Linux. Sometimes case matters in Linux, where it would be ignored under Windows. There are 8.3 issues sometimes. The only thing you can do is come up with a subset of what works across systems, and then of course you have to go back to using strings when you need to access D:\foo\abc.c

question

  • how easy would it be to migrate code? is automated tooling possible?
  • estimate code breakage; can breakage be 100% avoided? (as always, problem is third party libs)
  • efficiency: will that lead to more or less efficient code? eg possibly less conversion in most apps assuming most string handling happens early on

design decisions

  • type Path = distinct string or type Path = object ... ? the latter allows more efficient operations potentially
  • should $path be path.internal or path.canonical ?
  • should Path implicitly convert to string? cf in D we can use alias this to have that; what about Nim?
    if it doesn't implicitly convert to string, a ton of code will break. => SUGGESTION: should implicitly convert
  • which operations allocate?
var myPath = "/etc/hosts".Path # does this allocate? I think it should not (eg in case it's unused in some code path)
let a = $myPath # this should allocate (or: this always allocates)
let ok = myPath == "/ETC//hosts" # does this allocate?
  • should we allow this on windows: let path = "/tmp/foo.txt".Path
  • should we allow this on windows: let path = r"C:\tmp\foo.txt".Path
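For the implicit-conversion question, Nim's closest analog to D's `alias this` is probably a `converter`. A small sketch, assuming a one-way conversion from `Path` to `string`; `toString`, `legacyLen` and `needsPath` are hypothetical names used only here.

```nim
type Path = distinct string

# One-way implicit conversion: Path -> string, but not string -> Path.
converter toString(p: Path): string = string(p)

proc legacyLen(s: string): int =
  ## Stands in for an existing string-based API.
  s.len

proc needsPath(p: Path): int =
  ## Stands in for a new, Path-only API.
  string(p).len

let hosts = Path("/etc/hosts")

doAssert legacyLen(hosts) == 10                   # old code keeps compiling
doAssert not compiles(needsPath("/etc/hosts"))    # new API still rejects raw strings
```

With the converter in scope, existing string-based code does not break, at the cost of losing some of the type safety the distinct type was meant to provide.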

links

Maybe we should have a Path "distinct string" type that stores validated canonical paths.

@timotheecour timotheecour changed the title [WIP] [RFC] new path handling module that uses Nim's type system [WIP] [RFC] new path handling module that uses Nim's type system, using Path instead of string Jul 18, 2018
@Araq
Member

Araq commented Jul 19, 2018

Here is just a suggestion: Start with the distinct string idea, don't use a converter to implicitly convert it back to string, write a simplistic ospaths2 library and port the compiler and koch.nim to use that instead. This way we can get realworld insights. Btw Walter Bright's points are excellent but I think 0-overhead construction via Nim's path"/foo/bar" syntax would be acceptable.
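The `path"/foo/bar"` syntax mentioned here relies on Nim's generalized raw string literals, where `ident"text"` is shorthand for `ident(r"text")`. A sketch, assuming a hypothetical constructor proc named `path`; whether construction is truly zero-overhead depends on the implementation and is not demonstrated here.

```nim
type Path = distinct string

proc path(s: string): Path {.inline.} =
  ## Hypothetical constructor: only re-labels the string, no normalization.
  Path(s)

let unixStyle = path"/foo/bar"
let winStyle = path"C:\tmp\foo.txt"   # raw literal: backslashes are not escapes

doAssert string(unixStyle) == "/foo/bar"
doAssert string(winStyle) == r"C:\tmp\foo.txt"
```

Since no converter back to `string` is defined, the explicit `string(...)` conversion is required, in line with the suggestion above.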

@awr1

awr1 commented Jul 25, 2018

vanilla strings of folders, files, etc. spliced against the / proc should return a Path type IMO
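One way to read this suggestion, sketched with a hypothetical `joinParts` helper and `/` overloads defined locally (std/os is deliberately not imported, since it already defines `/` on strings). Only `/` is treated as a separator, which is an assumption for brevity.

```nim
type Path = distinct string

proc joinParts(head, tail: string): string =
  ## Joins two fragments with exactly one '/' between them.
  if head.len == 0: return tail
  if tail.len == 0: return head
  result = head
  if head[^1] != '/':
    result.add '/'
  var start = 0
  while start < tail.len and tail[start] == '/':
    inc start
  result.add tail.substr(start)

# Plain strings spliced with `/` yield a Path, as suggested above.
proc `/`(head, tail: string): Path = Path(joinParts(head, tail))
proc `/`(head: Path, tail: string): Path = Path(joinParts(string(head), tail))

let p = "tmp" / "sub" / "foo.txt"
doAssert string(p) == "tmp/sub/foo.txt"
doAssert string(Path("/tmp/") / "/foo.txt") == "/tmp/foo.txt"
```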

@timotheecour timotheecour reopened this Sep 9, 2018
@narimiran narimiran transferred this issue from nim-lang/Nim Jan 2, 2019
@ringabout ringabout mentioned this issue Jan 21, 2022
33 tasks
@ringabout ringabout changed the title [WIP] [RFC] new path handling module that uses Nim's type system, using Path instead of string New path handling module that uses Nim's type system, using Path instead of string Jan 24, 2022
@barcharcraz

The "real" reason to use a path type instead of "just strings" is that it makes it somewhat easier to round-trip uint16_t paths (like on windows).

The design options here are:

  1. paths are represented as strings, transcoded to utf-8: simple, can use fast bulk transcode, but DOES NOT round trip. You can give the path API a path that exists, and it transforms it to some internal representation and gives you back a path that doesn't exist

  2. wtf-8, this is UTF-8 but unpaired surrogates are encoded as normal, so you can represent all windows paths with no data loss

  3. utf-8b (pep383 format). Similar to utf-8 but additionally can represent any other non-unicode narrow encoding bytes. I think you can use both (so 383 for narrow<->narrow and wtf-8 for wide<->narrow), but you need to keep in mind http://unicode.org/L2/L2009/09236-pep383-problems.html. Basically, when converting from the storage format (that's supposed to round trip) to any other format, if you encounter invalid utf-8 you must ensure that the result would produce invalid UTF-8 if you were to subsequently transform it back to the storage encoding.

In any event, supporting this for the filesystem isn't needed for nim, if you have a narrow string we should just not touch it.

  4. the entire program (basically) is parameterized on the path type. On windows "path" is wide_string and on unix "path" is narrow_string. This makes it pretty easy to write non-portable programs but is the simplest and arguably fastest (ideally this includes the type of the parameters to main, but happily nim hides that from you in a way that makes this insanely easy to do correctly, no need for tmain horribleness).
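To illustrate why the wtf-8 option round-trips, here is a small sketch that converts UTF-16 code units to WTF-8 bytes and back. The proc names `toWtf8` and `fromWtf8` are made up for this example, and the decoder assumes well-formed input (no validation, no bounds checks on truncated sequences).

```nim
proc toWtf8(units: openArray[uint16]): string =
  ## Encodes UTF-16 code units; unpaired surrogates are kept as 3-byte sequences.
  var i = 0
  while i < units.len:
    var cp = int(units[i])
    inc i
    if cp >= 0xD800 and cp <= 0xDBFF and i < units.len:
      let lo = int(units[i])
      if lo >= 0xDC00 and lo <= 0xDFFF:   # valid pair: combine into one code point
        cp = 0x10000 + ((cp - 0xD800) shl 10) + (lo - 0xDC00)
        inc i
    if cp < 0x80:
      result.add char(cp)
    elif cp < 0x800:
      result.add char(0xC0 or (cp shr 6))
      result.add char(0x80 or (cp and 0x3F))
    elif cp < 0x10000:
      result.add char(0xE0 or (cp shr 12))
      result.add char(0x80 or ((cp shr 6) and 0x3F))
      result.add char(0x80 or (cp and 0x3F))
    else:
      result.add char(0xF0 or (cp shr 18))
      result.add char(0x80 or ((cp shr 12) and 0x3F))
      result.add char(0x80 or ((cp shr 6) and 0x3F))
      result.add char(0x80 or (cp and 0x3F))

proc fromWtf8(s: string): seq[uint16] =
  ## Decodes WTF-8 back to UTF-16 code units (assumes well-formed input).
  var i = 0
  while i < s.len:
    let b0 = ord(s[i])
    var cp = b0
    var n = 1
    if b0 >= 0xF0:
      cp = b0 and 0x07
      n = 4
    elif b0 >= 0xE0:
      cp = b0 and 0x0F
      n = 3
    elif b0 >= 0xC0:
      cp = b0 and 0x1F
      n = 2
    for k in 1 ..< n:
      cp = (cp shl 6) or (ord(s[i + k]) and 0x3F)
    inc i, n
    if cp >= 0x10000:
      let v = cp - 0x10000
      result.add uint16(0xD800 + (v shr 10))
      result.add uint16(0xDC00 + (v and 0x3FF))
    else:
      result.add uint16(cp)

# "f", an unpaired high surrogate, "o": a legal Windows file name, not valid UTF-16.
let name = [0x0066'u16, 0xD800'u16, 0x006F'u16]
doAssert fromWtf8(toWtf8(name)) == @name
```

A strict UTF-8 transcoder would have to reject or replace the `0xD800` unit, which is exactly the data loss described in option 1.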

I think (2) and (4) are the best options. As for whether it's literal string types or some path object, I don't think it really matters, except that I think the obvious data representation is either an array of uint8 or an array of uint16 (little endian, always, though it doesn't really matter because only windows does wide paths and windows has never supported any big endian architecture, and probably never will; windows does not support big-endian mode on arm, or arm64, or even itanium). I think representing things as some array of path components is a crummy idea, just store offsets to where they are.
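A rough sketch of the "store offsets" idea, assuming a narrow (byte) string and `/` as the only separator. The `PathView` type and its procs are hypothetical names; the key property is that the raw bytes are kept untouched and components are derived on demand.

```nim
type
  PathView = object
    raw: string        # the bytes exactly as received, never rewritten
    starts: seq[int]   # index in `raw` where each component begins

proc initPathView(raw: string): PathView =
  ## Records component offsets without copying or normalizing the path.
  result.raw = raw
  var atStart = true
  for i, c in raw:
    if c == '/':
      atStart = true
    elif atStart:
      result.starts.add i
      atStart = false

proc component(v: PathView, n: int): string =
  ## Extracts the n-th component on demand.
  var stop = v.starts[n]
  while stop < v.raw.len and v.raw[stop] != '/':
    inc stop
  v.raw[v.starts[n] ..< stop]

let v = initPathView("/tmp//foo.txt")
doAssert v.raw == "/tmp//foo.txt"       # original bytes survive unchanged
doAssert v.starts == @[1, 6]
doAssert component(v, 1) == "foo.txt"
```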

Basically, I think it matters much more that we can guide users down a path (heh) that never modifies their paths than that we use any particular fancy object-based API. Especially important: if you read a directory and store the result somewhere, then later (without calling any APIs to modify the path) open that file to do operations on it, that open call should both complete successfully (assuming nobody deleted that file out from under you) and open the same file the OS returned. You would think this would be easy but it's not, and on windows a huge quantity of software gets this wrong. Try making a file with the name "U+D800" and running some common tools on it :). If you use such an API for security sensitive things it can even cause vulnerabilities, although, to be honest, it's really Windows' fault for not having an "openat" system call, despite NT supporting that construct.
