add --index-files flag, and turn off index file checking by default#1777
add --index-files flag, and turn off index file checking by default#1777thomas-zahner merged 26 commits intolycheeverse:masterfrom
--index-files flag, and turn off index file checking by default#1777Conversation
i'm thinking about this more than anyone ever has. index_files interacts with fallback_extensions in unclear ways. i feel like we should just collapse them both and have like suffixes = ["", ".html", "/index.html"] in conjunction with a "--accept-directories" flag that defaults to true. this suffixes list would be used to implement the fallback_extensions.
Conflicts: lychee-lib/src/checker/file.rs resolved by taking our version
now fails: fixtures/fragments/sub_dir#a-link-inside-index-html-inside-sub-dir
the test was broken because we no longer check index.html by default. i change the test URLs to have `/index`. this still tests that the fallback extension `.html` is being applied correctly.
README.md
Outdated
| The special name `.` can be used to resolve directory links to the directory | ||
| itself. When `.` is specified, a directory link will be accepted as long as the | ||
| directory exists. This is the default behavior. |
There was a problem hiding this comment.
I wonder if it's too clever. For instance, I have to apply this default value at lots of different levels - CLI parsing and ClientBuilder both set it.
Is this okay? Or maybe the API should take Options?
Edit: In fact, I just found a bug to do with setting the default value 😖
There was a problem hiding this comment.
It's an interesting idea. But I think it's overly complicated. Also if I didn't misunderstand there does not seem to be any use-case for this ever. If a user ever provides . as a value (e.g. --index-files index.html,index.md,.) it should be exactly equivalent to not providing the whole flag at all.
So the only reason for .'s existence is for internal representation which is much cleaner represented with an empty list and probably a new condition withing apply_index_files. (or Option if really necessary)
There was a problem hiding this comment.
--index-files index.html,index.md,. is different to omitting the flag because it will use an index file if it exists before falling back to the directory. this could be useful if somebody has a setup like this.
file names after a ., however, will have no effect
There was a problem hiding this comment.
The field has been changed to an option :) See f6b678e for more commentary.
There was a problem hiding this comment.
Thanks for the clarification and for the new commit. I'm starting to think that we should now either:
Rename the flag from --index-files to something like --remap-directory. (maybe you have a another idea) This captures the current behaviour better IMO, as this is not necessarily about index files. (though this might be the most common use-case) Especially since we can provide arbitrary values such as ., .. and some/subdirectory/random.file. Or only allow values of simple file names that don't affect the directory in which the remapped files lie. But we pretty much already agreed that the latter case is not a priority. So my preference is renaming the flag. The flag description and documentation of course can still explain the common "index" use-case.
There was a problem hiding this comment.
I mean, I can't imagine a use for this flag which doesn't include index files in some way. . is the only directory name which will be accepted, and this is to support the case for optional index files. If you want, I can add this use case to the help and test cases to make it more visible. If you pass .. or some other path to a directory, those entries will never be considered a valid index file. This is recorded by one of the corner test cases.
And we've established that path traversal into nested directories is accidental rather than intentional, so I wouldn't want to generalise the option name to include this unsupported use. In the other comment thread, I talked about eventually restricting this with validation.
So, my recommendation is to keep the current name. It's clear to users who will be looking for this feature, and I don't think anyone will be surprised by its name or functionality.
But again, if you want to overrule me, let me know your desired name and I will make it so.
There was a problem hiding this comment.
The other problem is remap is already a defined term in the Lycheeverse. remap-directories has the connotation that this is a directory-specific version of remap, but its functionality is not nearly that powerful. It can't rewrite directories in the middle of paths, for instance. Nor can it change its behaviour based on the URL being checked.
I'd rather have a name which is very slightly too narrow than one which is too broad. (And I don't want to think of another name rn :))
There was a problem hiding this comment.
Okay let's keep the flag name then. Ah true, in the current state a parameter like .. doesn't make sense because directories are not accepted, except for .. But I actually think it's not necessary then to handle . differently than all other directories
Conflicts: lychee-bin/tests/cli.rs resolved by taking main copy
because we no longer check index.html by default
|
Test fails at https://example.net and seems unrelated/flaky. It does seem to be broken when visiting in Firefox, but it works in Chrome and curl. Weird. |
README.md
Outdated
| The special name `.` can be used to resolve directory links to the directory | ||
| itself. When `.` is specified, a directory link will be accepted as long as the | ||
| directory exists. This is the default behavior. |
There was a problem hiding this comment.
It's an interesting idea. But I think it's overly complicated. Also if I didn't misunderstand there does not seem to be any use-case for this ever. If a user ever provides . as a value (e.g. --index-files index.html,index.md,.) it should be exactly equivalent to not providing the whole flag at all.
So the only reason for .'s existence is for internal representation which is much cleaner represented with an empty list and probably a new condition withing apply_index_files. (or Option if really necessary)
| async fn check_path(&self, path: &Path, uri: &Uri) -> Status { | ||
| let file_path = self.resolve_file_path(path); | ||
| let has_fragment = uri.url.fragment().is_some_and(|x| !x.is_empty()); | ||
| let path = match path.metadata() { |
There was a problem hiding this comment.
This match is a great solution, better than before 👍
I just think that the comments inside the match are not necessary, they just repeat the code.
There was a problem hiding this comment.
i don't think the comments hurt. also, i think some cases like the Err(e) and the last Ok(_) deserve some explanation. these cases are less obvious because their condition implicitly includes the negation of the previous if conditions.
but if you direct me to remove them, i will do so. just let me know which you want to remove.
There was a problem hiding this comment.
You can keep them if you prefer. It's just that if I were to refactor the code in the future for some reason I personally would not keep such comments, as I think they don't add any value. There's no new insight about "why" we do this and "what" it does is obvious from the code. To me this is like a form of code duplication so I would remove all comments inside this match.
|
|
||
| #[tokio::test] | ||
| async fn test_empty_index_list_corner() { | ||
| // empty index_files list will reject all directory links |
There was a problem hiding this comment.
So this is the case we could redefine. "no specified index files == skip check" If you find this weird then the alternative would be to wrap it with Option and make None skip the check
| } | ||
| } | ||
| None | ||
| let path = dir_path.join(filename); |
There was a problem hiding this comment.
Should be able to replace join with with_file_name. Should resolve the comment in test_index_file_traversal_corner
There was a problem hiding this comment.
the docs for with_file_name point to set_file_name which says
The argument is not sanitized, so can include separators. This behavior may be changed to a panic in the future.
so, at this point in time, it will probably not affect the behaviour. but i'll make the change anyway.
There was a problem hiding this comment.
i tried to make the change and it is a bit troublesome. using .with_file_name strips the last path component, effectively making it relative to the parent directory, rather than a file within the directory as we would like.
to avoid this, you can use
dir_path.join("nothing").with_file_name(filename)
but this is really just identical to .join.
so i don't think it's worth making the change here. if you want, a separate PR could add validation to the CLI arg to ensure no slashes are present.
There was a problem hiding this comment.
Ah I see. Yes we should leave it in that case.
this changes all the `index_files` fields to be `Option<Vec<String>>` instead of only a vec. this lets API users simply pass a None to get the default behaviour. documentation at the higher level interfaces (e.g., the CLI help) is changed to remove reference to `.` as the default behaviour. now, the default behaviour is defined in terms of its observable behaviour without reference to this implementation detail. that said, the index file resolution still has this special case that `.` is allowed in the `index_files` list. this allows the (possibly esoteric) case where a user passes `--index-files index.html,.` to use an index file if it exists but still accept the link regardless. the None case for index_files is still implemented by mapping to an index file list of `["."]`, but this is now only an implementation detail of `apply_index_files`. finally, using an Option is nicer than special-casing the empty `Vec<String>` case, because the empty vec case /should/ reject all directory links. this follows logically because the fewer items in the index file list, the fewer directory links are valid.
|
@katrinafyi how is this addressing #1765, should that be addressed here? Assuming you want to explicitly say index files are not supported and want to make linking to a dir an error? @thomas-zahner noted:
Since the default is "link to a directory is valid", this feature is saying "link to a directory with |
|
@dkarlovi That issue has had a few back and forths.
|
Ah, OK. 👍 I'd definitely make at least a mention about it in the docs because it's definitely not obvious. The use case we have for this is: when you're working on Gitlab, you can directly browse your static site in the MR artifacts, but it doesn't support index files, linking to a dir is a mistake which should get detected. When you deploy the site to Gitlab pages, it does support index pages so linking to an index file directly is a mistake which should be detected. |
|
Gitab lets you browse an artifact in a MR? That's really cool! For your Gitlab pages, this PR won't be able to flag links that unnecessarily point to index.html - but maybe you have another approach for that already. I'll make the changes to the docs :) |
Yes! It's a REALLY cool feature because you can browse the site directly from the MR without deploying it anywhere which automatically uses the access control for the project.
Not really, do you have any ideas? |
dcb1a1d to
ed499f7
Compare
thomas-zahner
left a comment
There was a problem hiding this comment.
@katrinafyi I think the PR is pretty much ready to be merged. Can you resolve the conflicts?
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_index_list_of_directories_corner() { |
There was a problem hiding this comment.
update: i've added a test case to guard against a regression of --index-files '' dcb1a1d
Thanks. The tests are great and keep the PR manageable 👍
endorsing path traversal is done. i decided we don't care about base-url/root-dir and it will always be literal files on disk.
Sounds sensible to me.
This alone would make the
--index-filesname wildly inaccurate.
True, I initially wanted to keep the option more general as discussed previously with renaming the flag. But if we stick to this use-case - as we probably should - and keep the flag name I'm okay with this.
Co-authored-by: Thomas Zahner <thomas.zahner@protonmail.ch>
|
Has rust got a new version that's learned more compiler warnings? Do you want to fix those in a separate PR or should I just add it here? I'm not usually a rust user, so I wouldn't be super sure of the changes. Edit: or the third option is to rollback the CI version for the time being. |
this assertion is meant to check that there are two failing empty dir links. however, by checking stderr, it would've passed even if those links succeeded. this is because `--verbose` prints all links (both valid and invalid) to stderr. we change it to search in stdout which contains only the failing links.
|
Good point. Opened #1785 for that. |
|
@katrinafyi Yes, Clippy updates often add new lints leading to new warnings which we reject in CI. It might make sense to switch to a fixed Rust+Clippy version for CI/CD in the future. The pipeline should now pass if you rebase on top of master |
|
From my side, this looks good! Thanks! |
|
@katrinafyi Thanks a lot for the PR and for bearing with my discussions 👍 |
|
Yes, thanks @katrinafyi, you do a lot of great work recently. We all highly appreciate it. |
|
@katrinafyi thank you from my side too, as you can probably tell I'm pretty eager to have this feature available! 👏 |
|
Thank you for your kind words :) I feel like it's unwarranted, because I'm really just fixing things for my own use case. I'm glad it's appreciated, though. I hope I haven't been too overbearing in your project. |
|
I can't get this to work when defined via the config file. Is that not supported? |
|
Great, thanks! Just to confirm, this is indeed the proper option to choose when generators like MkDocs are configured to use directory URLs right? For example I'm doing: |
|
Yes, that use looks right. I don't know this specific generator, but other options which might be useful are --root-dir (for root-relative links) and --fallback-extensions (for resolving filename to filename.html). |
this pr introduces the
--index-filesargument. this argument lets the user specify a list of index files names to require for local directory links. if specified, at least one index file must exist in order for a directory link to be valid.this pr also makes the default behaviour more consistent. after this pr, index files are never used by default and directories are valid as long as they exist. this is a small difference from the current behaviour which implicitly searches
index.{ext}(implemented in #1752) - this causes the +1 error count intest_fragments.this pr also makes some clarifying points about the interactions of
--index-files,--fallback-extensions, and fragment checking.some more details about the changes:
checker::file::testsmodule with some specific unit tests for the local file checkerInvalidIndexFileto ErrorKind to display a more helpful "cannot find index file in directory", rather than "cannot find file"fixture_urimethod to test_utils to construct a URI relative to the fixtures dir. importantly, this supports building a URI with a fragment.check_fragmentmethod. previously, it was an ad-hocifstatement higher up the call stack.check_path. hopefully, this makes it more obvious what cases will lead to particular code paths.this pr can be tested with:
note that
fixtures/fragments/sub_dir#a-link-inside-index-html-inside-sub-dirnow fails with Cannot find fragment, because index.html is no longer searched by default.the full diff of
cargo run -- fixtures/fragments --include-fragmentsbetween previous and this pr is below. also note that theempty_dircases with a fragment have the more specific "cannot find fragment" message, because the directory does exist.then, try
and observe the output contains
of course, you should also try to run the new test suite. i hope that the test code is readable and is an understandable description of the behaviour.
i would appreciate any comments! especially proofreading and feedback on all the documentation and messages.
this pr properly implements #542. also related to #1765