Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't clobber an existing file when building the site #1374

Merged
merged 2 commits into from
Sep 17, 2024
Merged

Conversation

Fil
Copy link
Contributor

@Fil Fil commented May 21, 2024

closes #1367

Analysis of the issue:

A file is created with the name given in its reference, even if this doesn't match the actual file name (for example because we use a case-insensitive OS, or symlinks).

For example, if we have a reference in page.md to <img src=IMAGE.jpg> and a symlink from image.jpg to an actual file on disk called horse.jpg, we will create _file/IMAGE.xxxx.jpg, and reference this correctly in page.html; the build will not carry any information about the original "true" filename.

However, if several references to the same file use inconsistent names that end up being the same file on disk (e.g. _file/IMAGE.jpg and _file/image.jpg on a case-insensitive system), then we might have an issue when serving the site — if the server itself is case sensitive it will only be able to serve one of the references, and will 404 on the other(s).

Approach taken in this PR:

Thankfully, the build process is already careful to write each file once. We now enforce this: before writing a file to a destination in the output root, we check that it is not clobbering an existing file. If that is the case, then it means that the destination has not been deduplicated properly, and on a case insensitive OS this is most probably due to inconsistent casing (or maybe inconsistent encoding of unicode characters) across the site.

Since this can might problems when hosting, as described above, we now break the build. We report a "File name conflict over (found path)" with no assumption made about the nature of the error. Hopefully, if it's a user error, the cause will be visible by comparing the paths shown in the log.

logs

I have tested this with IMAGE.jpg and image.jpg, and also ÉTÉ.jpg vs été.jpg. Since it's enforced in the effects’ copyFile and writeFile methods, it's not limited to “files”—for instance, it also breaks if you are using inconsistent naming in the style front matter option: STYLE.css vs. style.css.

Notes:

  • we can only detect this problem when running build—preview doesn't know about all the files, and even if it did, I think it can't know about what the filesystem considers "the same path" without writing to it.
  • the “inverse problem” (building on a case sensitive system, then serving from a case insensitive system) is not considered here; but [in theory] if two files references correspond to the same resource (as guaranteed by the content-hash), it doesn't matter what copy we use. I haven't tested this.
  • the use of realpath to get the "true" filename is now limited to the output directory that we just created—this ensures that there are no issues with symlinks. (Also it's only used to report the error to the user, not to detect the situation.)

It's very difficult to create unit tests that will run on a case sensitive ubuntu; should I create a test to make sure that we fail on darwin (local testing) and win32 (CI testing)?

previous branches:

supersedes #1369
supersedes #1373

@Fil Fil requested a review from mbostock May 21, 2024 10:53
Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the simplicity of this check. I wonder if we simplify further and drop the usage of realpath entirely? I’m not entirely sure the usage of realpath correct here (in particular the assumption that it’s safe to slice — that one realpath is a prefix of the other), but I’m feeling lazy and don’t want to investigate much further — what would be the harm in just reporting the path as-is instead of using realpath?

@Fil
Copy link
Contributor Author

Fil commented May 23, 2024

[is it] safe to slice — that one realpath is a prefix of the other

I think so, because there should be no symlinks (nor hard links) in the output (dist/) folder, which we just created and populate only with the most basic directory and file commands.

what would be the harm in just reporting the path as-is instead of using realpath?

My intent here is to show the "true" case of the path, so that the user can compare it with the line just above that describes the path we tried to write to. It's a crucial hint that will (hopefully) help them understand that the issue is case (in)consistency.

@Fil
Copy link
Contributor Author

Fil commented Jun 22, 2024

An alternative might be to include the file path into the hash?

@Fil
Copy link
Contributor Author

Fil commented Aug 26, 2024

I'm not sure this fixes the issue I've seen in #1061 (comment)

EDIT: but this should be fixed by #1651, I think

@Fil Fil merged commit 2019dd3 into main Sep 17, 2024
4 checks passed
@Fil Fil deleted the fil/no-clobber branch September 17, 2024 14:18
@Fil Fil mentioned this pull request Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

case sensitive file names
2 participants