Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have an option to make Cargo attempt to clean up after itself. #6229

Open
icefoxen opened this issue Oct 27, 2018 · 14 comments
Open

Have an option to make Cargo attempt to clean up after itself. #6229

icefoxen opened this issue Oct 27, 2018 · 14 comments
Labels
A-caching Area: caching of dependencies, repositories, and build artifacts C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-clean S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.

Comments

@icefoxen
Copy link

Cargo leaves build artifacts around forever, without making any attempt to ever clean them up. This is a feature but also a somewhat inconvenient one since your ./target folder grows forever, and this can make life harder for large projects and CI systems and such. You either delete everything and rebuild everything all the time, wasting a lot of time, or you deal with potentially very large folders. This also makes life harder for the idea of cargo eventually being able to share build artifacts between projects, as described in this reddit thread.

I propose to add a simple, conservative command line flag/config option for Cargo to attempt to clean up files that have not been used in any builds for a while. Something like --clean-unused 30d to remove any files that haven't been used in any builds that have run in the last 30 days. Hopefully this will provide a simple and useful tool for solving this sort of problem in many cases, while not resulting in unnecessary rebuilding of things. It also should pave the way for experimenting with more sophisticated decision-making about which build artifacts to keep and which to delete in the future.

Related: #5026

@dwijnand dwijnand added Command-clean C-cleanup Category: cleanup within the codebase C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` labels Oct 27, 2018
@Eh2406
Copy link
Contributor

Eh2406 commented Oct 27, 2018

As I said on that thread I would love to see this experimented with out of tree. There are a lot of good heuristics that may work well. When we see which are widely useful we can talk about bringing them in.

If out of tree development is not possible do to limitations of cargo, (like dose cargo leave a timestamp of the most recent time it used an artefact,) then let's see about getting it to be possible.

I'd love a tool based on last used date. I'd love a tool that compared the version of the compiler used to the installed ones by rustup. I'd love to point it at a parent folder and have it search for all the places I hid a rust project.

@icefoxen
Copy link
Author

Actually I'm now wondering if this could be done as a cargo plugin? I confess I know nothing about how cargo's internals are structured.

@holmgr
Copy link

holmgr commented Oct 29, 2018

I would think a cargo plugin would work, if the needed information can be gained from cargo.
I tried fiddling around a bit and found that:

cargo check --message-format=json (release flag for target/release)
Seems to output which build artifacts are used currently, and the fresh flag specifies whether any compiling was needed.
I could not find any timestamp from cargo when it last used an artifact, but perhaps the tool itself could maintain a list of used artifacts and timestamps which it updates each time it is run. Or just clean all unused artifacts, if maintaining this file is too problematic. Maybe it can just be stored in the target directory?

Might hack on something for this during this week if there is any interest in it :)

@Eh2406
Copy link
Contributor

Eh2406 commented Oct 29, 2018

@alexcrichton Is there a good way to determine the last time a file in /target/.../deps/ was used, if not how hard would it be to add a timestamp file/field?

@holmgr I would definitely be interested in seeing an initial hack!

@alexcrichton
Copy link
Member

I don't think there's a great way to figure it out unfortunately other than running a full build and just seeing what was used (and considering everything else as unused). Supporting this in a first-class fashion I think will be a nontrivial undertaking!

@Eh2406
Copy link
Contributor

Eh2406 commented Oct 29, 2018

@alexcrichton I agree that "first-class" support will be hard, hence my encouragement of "out of tree" "derty hack" experimentation.

Maybe I am missing something important, but at some point cargo has to make the list of things it needs to be on disk (or to build if they are not on disk). How hard would it be for cargo to take that list, for each item on the write a "x.timestamp" file? If all the cargos (that we care about) did that, then an out of tree tool could come along and del files associated with timestamp's that are older than 30 days. For initial testing purposes we could decide that the tool is only compatible with locally bilt cargos from someone's fork, if we are not willing to merge do to the feature freeze.

@alexcrichton
Copy link
Member

Oh small things here and there probably aren't that hard I think? We could certainly try to patch the situation in the meantime!

@holmgr
Copy link

holmgr commented Nov 7, 2018

So I have been doing a bit of hacking on this issue to see what is possible, or not (https://github.com/holmgr/cargo-sweep).
Currently I do a full build using cargo build --message-format=json to get the artifacts used, I extract the hashes and then remove all files in target which contains this (timestamps not yet considered). This does not however seem to perform a "full" clean, since there are files in /target/.../build/ and /target/.../deps/ which are not mentioned in the build output. This made me wonder, is it only /target/.../deps/ that we are interested in cleaning up?

Since this type of solution is likely not the way we want to do this (i.e letting the tool maintain the timestamp files) I have been doing some digging around to see what else is possible.
One simple approach is to use fs::metadata to get access times for the files we are interested in, avoiding the need to generate separate timestamp files. This however would be sensitive to moving the target directory etc i guess. But this would not introduce any false positives so maybe it is the best solution.

Finally, the most complex solution is to let cargo output the timestamp files as we discussed above. I have been searching through the source a bit, but not yet found a good place to place that code (partially because I am unsure of exactly which files we are interested in cleaning).

@holmgr
Copy link

holmgr commented Nov 9, 2018

Another update :)

So I went with the second approach outlined in my comment above which seems to be the most straightforward solution, and it seems to be working pretty well.
I also added a recursive flag so it can clean all Cargo projects below a given path like @Eh2406 suggested.

It is now published on crates.io and source is here. Hopefully this solves at least the basic needs for cleaning.
Any feedback on the code, or possible extensions is very appreciated but it should probably be done as an issue on cargo-sweep rather than here.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 29, 2018

@pietroalbini suggested a command that would clean up files needed to build a dependency but not needed to use that dependency. cargo clean-everything-except-the-minimum-needed-not-to-rebuild-deps @joshtriplett suggested that we move those files to a .cache folder or something similar to mark that they can be deleted by any gc that wants. I was wondering if cargo can just delete them when it is done with them.

The big question is what are "they"? What are the minimum files required to use a dependency?

@Eh2406
Copy link
Contributor

Eh2406 commented Dec 6, 2018

cc some other discussion going on at #5885 (comment)

@dwijnand dwijnand removed the C-cleanup Category: cleanup within the codebase label Dec 14, 2018
bors added a commit that referenced this issue Jan 16, 2019
touch some files when we use them

This is a small change to improve the ability for a third party subcommand to clean up a target folder. I consider this part of the push to experiment with out of tree GC, as discussed in #6229.

how it works?
--------

This updates the modification time of a file in each fingerprint folder and the modification time of the intermediate outputs every time cargo checks that they are up to date. This allows a third party subcommand to look at the modification time of the timestamp file to determine the last time a cargo invocation required that file. This is far more reliable then the current practices of looking at the `accessed` time. `accessed` time is not available or disabled on many operating systems, and is routinely set by arbitrary other programs.

is this enough to be useful?
--------

The current implementation of cargo sweep on master will automatically use this data with no change to the code. With this PR, it will work even on systems that do not update `accessed` time.

This also allows a crude script to clean some of the largest subfolders based on each files modification time.

is this worth adding, or should we just build `clean --outdated` into cargo?
------
I would love to see a `clean --outdated` in cargo! However, I think there is a lot of design work before we can make something good enough to deserve the cargo teams stamp of approval. Especially as an in tree version will have to work with many use cases some of witch are yet to be designed (like distributed builds). Even just including `cargo-sweep`s existing functionality opens a full bike shop about what arguments to take, and in what form (`cargo-sweep` takes a days argument, but maybe we should have a minutes or a ISO standard time or ...). This PR, or equivalent, allows out of tree experimentation with all different interfaces, and is basically required for any `LRU` based system. (For example [Crater](rust-lang/crater#346) wants a GC that cleans files in an `LRU` manner to maintain a target folder below a target size. This is not a use case that is widely enough needed to be worth adding to cargo but one supported by this PR.)

what are the downsides?
----

1. There are legitimate performance concerns about writing so many small files during a NOP build.
2. There are legitimate concerns about unnecessary wrights on read-only filesystems.
3. If we add this, and it starts seeing widespread use, we may be de facto stabilizing the folder structure we use. (This is probably true of any system that allows out of tree experimentation.)
4. This may not be an efficient way to store the data. (It does have the advantage of not needing different cargos to manipulate the same file. But if you have a better idea please make a suggestion.)
@nuke-web3
Copy link

I would live a feature as described above, is there any update since 2018 on ways to achieve this?

@kautkata
Copy link

Since this type of solution is likely not the way we want to do this (i.e letting the tool maintain the timestamp files) I have been doing some digging around to see what else is possible. One simple approach is to use fs::metadata to get access times for the files we are interested in, avoiding the need to generate separate timestamp files. This however would be sensitive to moving the target directory etc i guess. But this would not introduce any false positives so maybe it is the best solution.

File access time upkeeping is a heavy process as during each files access separate IO call is executed just to update Atime. So what happens on a lot of build hosts (at least in decent organizations) all file systems are mounted with noatime option

@epage
Copy link
Contributor

epage commented Sep 20, 2023

A garbage collector is being worked on. The tracking issue is #12633. It is first starting with global resources and then we'll be looking at the target directory.

@epage epage added the A-caching Area: caching of dependencies, repositories, and build artifacts label Sep 20, 2023
@epage epage added the S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. label Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-caching Area: caching of dependencies, repositories, and build artifacts C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-clean S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.
Projects
None yet
Development

No branches or pull requests

8 participants