Skip to content

add --index-files flag, and turn off index file checking by default#1777

Merged
thomas-zahner merged 26 commits intolycheeverse:masterfrom
rina-forks:katrinafyi-patch-1
Aug 14, 2025
Merged

add --index-files flag, and turn off index file checking by default#1777
thomas-zahner merged 26 commits intolycheeverse:masterfrom
rina-forks:katrinafyi-patch-1

Conversation

@katrinafyi
Copy link
Member

@katrinafyi katrinafyi commented Jul 27, 2025

this pr introduces the --index-files argument. this argument lets the user specify a list of index files names to require for local directory links. if specified, at least one index file must exist in order for a directory link to be valid.

this pr also makes the default behaviour more consistent. after this pr, index files are never used by default and directories are valid as long as they exist. this is a small difference from the current behaviour which implicitly searches index.{ext} (implemented in #1752) - this causes the +1 error count in test_fragments.

this pr also makes some clarifying points about the interactions of --index-files, --fallback-extensions, and fragment checking.

  • for a given link, only one of index files or fallback extensions will take effect. this is determined by whether the original path exists or not. if the path exists and is a directory, index files are applied. if the path does not exist, fallback extensions are applied.
  • when a link resolves to a directory (e.g., in the default behaviour), the directory is treated as a file with no fragments for the purpose of fragment checking,

some more details about the changes:

  • add checker::file::tests module with some specific unit tests for the local file checker
  • adds InvalidIndexFile to ErrorKind to display a more helpful "cannot find index file in directory", rather than "cannot find file"
  • adds fixture_uri method to test_utils to construct a URI relative to the fixtures dir. importantly, this supports building a URI with a fragment.
  • restructured the file.rs code to separate methods for applying index files and fallback extensions. these are changed to return a Result so they can report specific errors, rather than just option.
  • the code logic handling fragment checking directories is pushed into the check_fragment method. previously, it was an ad-hoc if statement higher up the call stack.
  • the index files/fallback extension behaviour is determined by a match statement in check_path. hopefully, this makes it more obvious what cases will lead to particular code paths.

this pr can be tested with:

cargo run -- fixtures/fragments --include-fragments

note that fixtures/fragments/sub_dir#a-link-inside-index-html-inside-sub-dir now fails with Cannot find fragment, because index.html is no longer searched by default.

the full diff of cargo run -- fixtures/fragments --include-fragments between previous and this pr is below. also note that the empty_dir cases with a fragment have the more specific "cannot find fragment" message, because the directory does exist.

--- a   2025-07-27 22:35:20.601240030 +1000
+++ b   2025-07-27 22:35:50.691518775 +1000
@@ -1,10 +1,11 @@
 Issues found in 2 inputs. Find details below.
 
 [fixtures/fragments/file1.md]:
-[ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-3 | Cannot find file
-[ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-4 | Cannot find file
+[ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-3 | Cannot find fragment
+[ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-4 | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/file1.md#missing-fragment | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/file2.md#missing-fragment | Cannot find fragment
+[ERROR] file:///home/x/progs/lychee/fixtures/fragments/sub_dir#a-link-inside-index-html-inside-sub-dir | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/sub_dir#non-existing-fragment-1 | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/sub_dir#non-existing-fragment-2 | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/sub_dir_non_existing_1 | Cannot find file
@@ -15,4 +16,4 @@
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/file.html#in-THE-begiNNing | Cannot find fragment
 [ERROR] file:///home/x/progs/lychee/fixtures/fragments/file.html#in-the-end | Cannot find fragment
 
-🔍 42 Total (in 0s) ✅ 31 OK 🚫 11 Errors
+🔍 42 Total (in 0s) ✅ 30 OK 🚫 12 Errors

then, try

cargo run -- fixtures/fragments --index-files index.html --include-fragments    

and observe the output contains

[fixtures/fragments/file1.md]:
   [ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir | Cannot find index file within directory
   [ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-3 | Cannot find index file within directory
   [ERROR] file:///home/x/progs/lychee/fixtures/fragments/empty_dir#non-existing-fragment-4 | Cannot find index file within directory

of course, you should also try to run the new test suite. i hope that the test code is readable and is an understandable description of the behaviour.

cargo nextest run -p lychee-lib --lib checker::file::tests

i would appreciate any comments! especially proofreading and feedback on all the documentation and messages.

this pr properly implements #542. also related to #1765

katrinafyi and others added 15 commits July 24, 2025 11:58
i'm thinking about this more than anyone ever has. index_files interacts
with fallback_extensions in unclear ways. i feel like we should just
collapse them both and have like

suffixes = ["", ".html", "/index.html"]

in conjunction with a "--accept-directories" flag that defaults to true.

this suffixes list would be used to implement the fallback_extensions.
Conflicts:
lychee-lib/src/checker/file.rs

resolved by taking our version
now fails:

fixtures/fragments/sub_dir#a-link-inside-index-html-inside-sub-dir
the test was broken because we no longer check index.html by default.
i change the test URLs to have `/index`. this still tests that the
fallback extension `.html` is being applied correctly.
README.md Outdated
Comment on lines 469 to 471
The special name `.` can be used to resolve directory links to the directory
itself. When `.` is specified, a directory link will be accepted as long as the
directory exists. This is the default behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty clever

Copy link
Member Author

@katrinafyi katrinafyi Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's too clever. For instance, I have to apply this default value at lots of different levels - CLI parsing and ClientBuilder both set it.

Is this okay? Or maybe the API should take Options?

Edit: In fact, I just found a bug to do with setting the default value 😖

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting idea. But I think it's overly complicated. Also if I didn't misunderstand there does not seem to be any use-case for this ever. If a user ever provides . as a value (e.g. --index-files index.html,index.md,.) it should be exactly equivalent to not providing the whole flag at all.

So the only reason for .'s existence is for internal representation which is much cleaner represented with an empty list and probably a new condition withing apply_index_files. (or Option if really necessary)

Copy link
Member Author

@katrinafyi katrinafyi Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--index-files index.html,index.md,. is different to omitting the flag because it will use an index file if it exists before falling back to the directory. this could be useful if somebody has a setup like this.

file names after a ., however, will have no effect

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field has been changed to an option :) See f6b678e for more commentary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification and for the new commit. I'm starting to think that we should now either:

Rename the flag from --index-files to something like --remap-directory. (maybe you have a another idea) This captures the current behaviour better IMO, as this is not necessarily about index files. (though this might be the most common use-case) Especially since we can provide arbitrary values such as ., .. and some/subdirectory/random.file. Or only allow values of simple file names that don't affect the directory in which the remapped files lie. But we pretty much already agreed that the latter case is not a priority. So my preference is renaming the flag. The flag description and documentation of course can still explain the common "index" use-case.

Copy link
Member Author

@katrinafyi katrinafyi Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I can't imagine a use for this flag which doesn't include index files in some way. . is the only directory name which will be accepted, and this is to support the case for optional index files. If you want, I can add this use case to the help and test cases to make it more visible. If you pass .. or some other path to a directory, those entries will never be considered a valid index file. This is recorded by one of the corner test cases.

And we've established that path traversal into nested directories is accidental rather than intentional, so I wouldn't want to generalise the option name to include this unsupported use. In the other comment thread, I talked about eventually restricting this with validation.

So, my recommendation is to keep the current name. It's clear to users who will be looking for this feature, and I don't think anyone will be surprised by its name or functionality.

But again, if you want to overrule me, let me know your desired name and I will make it so.

Copy link
Member Author

@katrinafyi katrinafyi Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other problem is remap is already a defined term in the Lycheeverse. remap-directories has the connotation that this is a directory-specific version of remap, but its functionality is not nearly that powerful. It can't rewrite directories in the middle of paths, for instance. Nor can it change its behaviour based on the URL being checked.

I'd rather have a name which is very slightly too narrow than one which is too broad. (And I don't want to think of another name rn :))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay let's keep the flag name then. Ah true, in the current state a parameter like .. doesn't make sense because directories are not accepted, except for .. But I actually think it's not necessary then to handle . differently than all other directories

@katrinafyi
Copy link
Member Author

Test fails at https://example.net and seems unrelated/flaky. It does seem to be broken when visiting in Firefox, but it works in Chrome and curl. Weird.

README.md Outdated
Comment on lines 469 to 471
The special name `.` can be used to resolve directory links to the directory
itself. When `.` is specified, a directory link will be accepted as long as the
directory exists. This is the default behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting idea. But I think it's overly complicated. Also if I didn't misunderstand there does not seem to be any use-case for this ever. If a user ever provides . as a value (e.g. --index-files index.html,index.md,.) it should be exactly equivalent to not providing the whole flag at all.

So the only reason for .'s existence is for internal representation which is much cleaner represented with an empty list and probably a new condition withing apply_index_files. (or Option if really necessary)

async fn check_path(&self, path: &Path, uri: &Uri) -> Status {
let file_path = self.resolve_file_path(path);
let has_fragment = uri.url.fragment().is_some_and(|x| !x.is_empty());
let path = match path.metadata() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This match is a great solution, better than before 👍
I just think that the comments inside the match are not necessary, they just repeat the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think the comments hurt. also, i think some cases like the Err(e) and the last Ok(_) deserve some explanation. these cases are less obvious because their condition implicitly includes the negation of the previous if conditions.

but if you direct me to remove them, i will do so. just let me know which you want to remove.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can keep them if you prefer. It's just that if I were to refactor the code in the future for some reason I personally would not keep such comments, as I think they don't add any value. There's no new insight about "why" we do this and "what" it does is obvious from the code. To me this is like a form of code duplication so I would remove all comments inside this match.


#[tokio::test]
async fn test_empty_index_list_corner() {
// empty index_files list will reject all directory links
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the case we could redefine. "no specified index files == skip check" If you find this weird then the alternative would be to wrap it with Option and make None skip the check

}
}
None
let path = dir_path.join(filename);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to replace join with with_file_name. Should resolve the comment in test_index_file_traversal_corner

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docs for with_file_name point to set_file_name which says

The argument is not sanitized, so can include separators. This behavior may be changed to a panic in the future.

so, at this point in time, it will probably not affect the behaviour. but i'll make the change anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried to make the change and it is a bit troublesome. using .with_file_name strips the last path component, effectively making it relative to the parent directory, rather than a file within the directory as we would like.

to avoid this, you can use

dir_path.join("nothing").with_file_name(filename)

but this is really just identical to .join.

so i don't think it's worth making the change here. if you want, a separate PR could add validation to the CLI arg to ensure no slashes are present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Yes we should leave it in that case.

this changes all the `index_files` fields to be `Option<Vec<String>>`
instead of only a vec. this lets API users simply pass a None to
get the default behaviour.

documentation at the higher level interfaces (e.g., the CLI help)
is changed to remove reference to `.` as the default behaviour.
now, the default behaviour is defined in terms of its observable
behaviour without reference to this implementation detail.

that said, the index file resolution still has this special case
that `.` is allowed in the `index_files` list. this allows the
(possibly esoteric) case where a user passes `--index-files index.html,.`
to use an index file if it exists but still accept the link regardless.

the None case for index_files is still implemented by mapping
to an index file list of `["."]`, but this is now only an
implementation detail of `apply_index_files`.

finally, using an Option is nicer than special-casing the empty
`Vec<String>` case, because the empty vec case /should/ reject
all directory links. this follows logically because the fewer
items in the index file list, the fewer directory links are valid.
@dkarlovi
Copy link

dkarlovi commented Aug 1, 2025

@katrinafyi how is this addressing #1765, should that be addressed here? Assuming you want to explicitly say index files are not supported and want to make linking to a dir an error?

@thomas-zahner noted:

But we could introduce a new feature flag to make lychee reject any directories that do not contain an index.html file.

Since the default is "link to a directory is valid", this feature is saying "link to a directory with <index file> is valid", we seem to need a feature flag to say "link to a directory is not valid" which can also be read as "index files not supported", maybe --no-index-files or similar?

@katrinafyi
Copy link
Member Author

@dkarlovi That issue has had a few back and forths.

  • It was originally called "Links to an empty directory should fail when running offline" and in the text, I wrote about requiring a index.html. That would be implemented by --index-files index.html. This was my intention when filing the issue.
  • The title was changed to "Links to directories should fail when running offline" without qualification on the directories. With this PR, that could be done by using --index-files ''. This is a bit obscure but it will work. Idk if this would be a common enough case to warrant mentioning in the help.

@dkarlovi
Copy link

dkarlovi commented Aug 1, 2025

With this PR, that could be done by using --index-files ''. This is a bit obscure but it will work. Idk if this would be a common enough case to warrant mentioning in the help.

Ah, OK. 👍 I'd definitely make at least a mention about it in the docs because it's definitely not obvious.

The use case we have for this is: when you're working on Gitlab, you can directly browse your static site in the MR artifacts, but it doesn't support index files, linking to a dir is a mistake which should get detected. When you deploy the site to Gitlab pages, it does support index pages so linking to an index file directly is a mistake which should be detected.

@katrinafyi
Copy link
Member Author

Gitab lets you browse an artifact in a MR? That's really cool! For your Gitlab pages, this PR won't be able to flag links that unnecessarily point to index.html - but maybe you have another approach for that already.

I'll make the changes to the docs :)

@dkarlovi
Copy link

dkarlovi commented Aug 1, 2025

Gitab lets you browse an artifact in a MR? That's really cool!

Yes! It's a REALLY cool feature because you can browse the site directly from the MR without deploying it anywhere which automatically uses the access control for the project.

this PR won't be able to flag links that unnecessarily point to index.html - but maybe you have another approach for that already.

Not really, do you have any ideas?

Copy link
Member

@thomas-zahner thomas-zahner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katrinafyi I think the PR is pretty much ready to be merged. Can you resolve the conflicts?

}

#[tokio::test]
async fn test_index_list_of_directories_corner() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: i've added a test case to guard against a regression of --index-files '' dcb1a1d

Thanks. The tests are great and keep the PR manageable 👍

endorsing path traversal is done. i decided we don't care about base-url/root-dir and it will always be literal files on disk.

Sounds sensible to me.

This alone would make the --index-files name wildly inaccurate.

True, I initially wanted to keep the option more general as discussed previously with renaming the flag. But if we stick to this use-case - as we probably should - and keep the flag name I'm okay with this.

Co-authored-by: Thomas Zahner <thomas.zahner@protonmail.ch>
@katrinafyi
Copy link
Member Author

katrinafyi commented Aug 7, 2025

Has rust got a new version that's learned more compiler warnings?

Do you want to fix those in a separate PR or should I just add it here? I'm not usually a rust user, so I wouldn't be super sure of the changes.

Edit: or the third option is to rollback the CI version for the time being.

this assertion is meant to check that there are two failing empty dir links. however, by checking stderr, it would've passed even if those links succeeded. this is because `--verbose` prints all links (both valid and invalid) to stderr.

we change it to search in stdout which contains only the failing links.
@mre
Copy link
Member

mre commented Aug 7, 2025

Good point. Opened #1785 for that.

@thomas-zahner
Copy link
Member

@katrinafyi Yes, Clippy updates often add new lints leading to new warnings which we reject in CI. It might make sense to switch to a fixed Rust+Clippy version for CI/CD in the future. The pipeline should now pass if you rebase on top of master

@mre mre requested a review from thomas-zahner August 8, 2025 10:26
@mre
Copy link
Member

mre commented Aug 8, 2025

From my side, this looks good! Thanks!
@thomas-zahner, feel free to merge if you're fine with it.

@thomas-zahner thomas-zahner merged commit cfec468 into lycheeverse:master Aug 14, 2025
6 checks passed
@thomas-zahner
Copy link
Member

@katrinafyi Thanks a lot for the PR and for bearing with my discussions 👍

@mre mre mentioned this pull request Aug 14, 2025
@mre
Copy link
Member

mre commented Aug 14, 2025

Yes, thanks @katrinafyi, you do a lot of great work recently. We all highly appreciate it.

@dkarlovi
Copy link

@katrinafyi thank you from my side too, as you can probably tell I'm pretty eager to have this feature available! 👏

@katrinafyi
Copy link
Member Author

Thank you for your kind words :) I feel like it's unwarranted, because I'm really just fixing things for my own use case. I'm glad it's appreciated, though. I hope I haven't been too overbearing in your project.

@ofek
Copy link

ofek commented Sep 29, 2025

I can't get this to work when defined via the config file. Is that not supported?

@katrinafyi
Copy link
Member Author

@ofek Sorry, it should've been supported, but there was a bug that silently ignored certain config.toml fields and this was one of the affected fields.

The bug has been fixed but it's not released yet (#1843). It is in the nightly version, so you can try that if you need.

@ofek
Copy link

ofek commented Sep 29, 2025

Great, thanks! Just to confirm, this is indeed the proper option to choose when generators like MkDocs are configured to use directory URLs right? For example I'm doing:

lychee site --index-files index.html

@katrinafyi
Copy link
Member Author

Yes, that use looks right. I don't know this specific generator, but other options which might be useful are --root-dir (for root-relative links) and --fallback-extensions (for resolving filename to filename.html).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants