Locate files with no extension from file type statistics #130
Comments
To put this more into perspective so we get an idea of a real world scenario: This is the root filesystem of my Xubuntu 18.04 LTS that I use at home (OS stuff only, home directory and data files are on separate filesystems): As usual, files in the No Extension category make up a whopping 33.96% of all files: 28338 of 74356 total. This is by no means an exception; this is pretty much the normal case for Linux systems. The log tells us a bit more about the cruft files:
Making this a bit less unwieldy with some creative
That's 457 suffixes.
Freak accident? Are root filesystems so special in having a lot of files that cannot easily be categorized? Okay, let's give it another try: My Okay, that's considerably fewer files in the No Extension category: Only 0.19%, but still 11728 out of 83368 total. So what about cruft here? Let's check the log:
Yikes; still 426 suffixes, and many of them really weird. Duh.
Just looking at the treemap also gives some hints: All the grey areas are either directories with a multitude of tiny files (so small that they are not rendered in the treemap because performance would severely suffer) or files in the Other category a.k.a. "we have no clue what all that stuff is".
Yes, locating them is disabled for several reasons:
That was pretty much the trade-off to get that file type feature in the first place: Keep it simple and apply the concept only to those files that can reasonably be identified and put all the rest into the "great unknown" category simplistically named "No Extension" in the File Type window. The ugly truth is it's not exactly all just "No Extension", it also includes "weird extensions". This tidbit of information is kind of held back from normal users because it would just confuse most of them with little gain for anyone.
The difference might (!) be because of the cruft files. They have extensions, albeit typically very weird ones. Many of them might simply be unknown to QDirStat's preconfigured MIME categories.
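To illustrate how "weird extensions" could account for such a difference, here is a minimal Python sketch that counts files with no dot at all separately from files whose suffix is simply not recognized. The name count_unidentified and the tiny KNOWN_SUFFIXES set are placeholders invented for this example; they are not QDirStat's actual MIME category configuration.

```python
import os

# Placeholder set for illustration only; NOT QDirStat's real category list.
KNOWN_SUFFIXES = {".txt", ".jpg", ".png"}


def count_unidentified(root):
    """Return (files without any dot, files with a dot but an unknown suffix)."""
    no_dot = 0
    unknown_suffix = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "." not in name:
                no_dot += 1
            elif os.path.splitext(name)[1].lower() not in KNOWN_SUFFIXES:
                # Covers names like "lib.so.6" and dotfiles like ".bashrc".
                unknown_suffix += 1
    return no_dot, unknown_suffix
```

A search that only looks for names without a dot would report the first number, while a tool that lumps both groups together would report their sum.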
That window contains only the directories where they are. Is this what you want? Or would you rather like to have a complete list of all the files in that category?
A very pedestrian approach to that would be grepping the QDirStat log for that line listing the extensions that were considered cruft, splitting it up like I did in my previous comments (a very simple
You just did something really naughty: You raised my interest. 😃 Yes, indeed: What the hell are all those cruft files? How can we get more information about them? But rather than just starting some creative scripting, QDirStat might have something built in that could be pressed into service.
You might or might not have read about the Packages view in QDirStat where it shows what software package each file belongs to. There is also its counterpart Unpackaged Files that shows the opposite: "What files do not belong to an installed software package?" Sounds unrelated? Bear with me. That view is built all around one feature that was really expensive and hard to implement: Ignoring files in the directory tree. In this case, it ignores files that are known to belong to an installed software package. All directories that only contain ignored files are shown dimmed (light grey) in the tree view, and the files are not displayed in the treemap, leaving only the unpackaged files, i.e. those that are not part of an installed package.
If we want to know more about cruft files or files with no filename extension, how about doing something similar for them? Rebuild the entire directory tree and ignore files that belong to a well-known MIME category? That would leave the entire remaining treemap uncolored because color there means we know what a file is. But it would make it so much easier to spot cruft files that are worth taking care of, i.e. large ones.
One major problem with that is that a large number of files without suffix are executables: Binaries or scripts in a plethora of different scripting languages. To be of any use, that stuff would need to be filtered out as well. This reopens that old discussion of using the This is an expensive operation; in the case of a root filesystem, that would mean reading those 33.94% of all files, 28338 in the above case. Yes, it's just a partial read; just reading the first few blocks. Still, it's 28k Hm. Not good.
Let me think about this for a while. In the meantime, please elaborate some more on your use case. Is it more than just curiosity and exploring the unknown? This is where the ignored view would come in, and I am very much inclined to pursue that further. Or is it a very specific use case? If so, please describe it.
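The "partial read" idea can be sketched in Python as follows: read only the first few bytes of each dot-less file and compare them against two well-known signatures, the ELF magic number and the "#!" shebang. This is only an illustration of the cost/benefit being discussed, not how QDirStat or any existing tool does it; sniff_kind, sniff_tree and HEADER_BYTES are hypothetical names.

```python
import os

HEADER_BYTES = 4  # hypothetical: just enough for the two signatures below


def sniff_kind(path):
    """Guess a coarse file kind from the first few bytes only."""
    try:
        with open(path, "rb") as f:
            head = f.read(HEADER_BYTES)
    except OSError:
        return "unreadable"
    if head.startswith(b"\x7fELF"):
        return "elf-binary"
    if head.startswith(b"#!"):
        return "script"
    return "unknown"


def sniff_tree(root):
    """Count kinds for all regular files under root whose name has no dot."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "." in name:
                continue
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files (FIFOs, sockets, devices).
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            kind = sniff_kind(full)
            counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Even with such a short read, every candidate file still costs one open and one read, which is the expense described above.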
Thanks for your thoughts on this.
That may prove to be very useful, at least for identifying them.
Interesting, that might well explain the discrepancy.
I see what you mean, but it would still be useful to be able to find the directories with the largest number of cruft files, or probably more usefully, the largest amount of space taken up / wasted by them. Perhaps it could show the top 50 or 100 directories, similar to the way it does the top 20 for the file extensions. (Not wanting to get side-tracked, but I would find it useful to be able to configure that number for the uncategorised file types as well.)
Directories (with the associated statistics - number of files and total size of those files) would be a good starting point. I wouldn't say no to being able to drill down to get/export/copy-paste the entire list of individual files as well, but if I had to choose files or directories I'd go with directories, not least because it's consistent with the UI. Another reason is that on my system I think the list of files would be too large to do anything useful with.
Yes that's exactly what I'm thinking I might try, though as previously discussed the result might not be that useful. I'm making the following numbers up, but a list of, say, 400 directories is easier to work with than a list of 80,000 files!
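As a rough illustration of that scripting approach, the following Python sketch aggregates uncategorised files per directory and returns the directories that hold the most bytes of them. The KNOWN_SUFFIXES set, the function names and the default limit of 50 are assumptions made up for this example; none of them reflect QDirStat's real categories or behaviour.

```python
import os
from collections import defaultdict

# Placeholder set for illustration only; NOT QDirStat's real category list.
KNOWN_SUFFIXES = {".txt", ".jpg", ".png", ".pdf"}


def is_uncategorised(name):
    """True for names with no suffix or a suffix outside KNOWN_SUFFIXES."""
    _stem, suffix = os.path.splitext(name)
    return suffix.lower() not in KNOWN_SUFFIXES


def top_directories(root, limit=50):
    """Return [(directory, file count, total bytes)], largest total first."""
    totals = defaultdict(lambda: [0, 0])  # directory -> [count, bytes]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not is_uncategorised(name):
                continue
            try:
                # lstat: a symlink counts with its own size, not its target's.
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue
            entry = totals[dirpath]
            entry[0] += 1
            entry[1] += size
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(d, count, size) for d, (count, size) in ranked[:limit]]
```

The returned tuples could be written out with Python's csv module to get a list that is easy to sort or filter in a spreadsheet.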
That is definitely a part of the question I'm trying to answer (see more detail below).
This would definitely be useful - in fact, more useful than the "locate by type" option. If you did implement this, you could make the feature more discoverable, and hence useful to a wider range of people, by having a message box pop up when the user tries the "locate" option on the "No Extensions" line in the "Locate files by type" dialog we were originally discussing. The message box could direct the user towards this feature, i.e. suggest that they re-scan the directory tree ignoring files of known MIME categories.
Maybe a different device (i.e. other than colours) could be employed to give an indication of file sizes, or perhaps various shades of grey or patterns... or maybe colours could be used here but given a different meaning, though I'm not sure if that's good UX practice. Hmm, these are just random thoughts; I'm not certain how useful that would be as I write this.
That isn't necessarily a problem. After all these files are still taking up space on the disk and they are also files that cannot easily be categorised by file extension. If for some reason
Clearly it would take too long to do this for every cruft file, but perhaps we could do this just for the largest individual files (e.g. the largest 100 cruft files; again, perhaps it would be good to be able to configure this - or to run it on demand, e.g. "categorise the next uncategorised 50 files" using the
I have a 1.5 TB hard drive and my home directory is occupying almost 2/3 of that space. Within that is 40 GB of 'No Extension' files! That's quite a lot of space that's potentially unaccounted for. I'm trying to work out:
If it turns out to be mostly static stuff like photos and videos, I might buy a second hard drive, copy them onto it and leave it in a remote location.
I realise that's quite a tall order, which is why I set out writing scripts to run
Thanks again for your thoughts and comments on this - it's all very much appreciated.
Before this goes off-track a bit too far: Those "cruft" files are not to be confused with "junk" files. If disk space is running low (or you are just being extra tidy), it makes perfect sense to get rid of "junk" files such as editor backup files (
All those weird filename extensions in the Details lists above are simply Linux developers being a bit too creative with naming their files. On MS-DOS and thus on all types of MS Windows, filename extensions have clear implications, so it's not advisable to misappropriate the concept for general separator characters in filenames. On Linux / Unix, however, it's little more than a convention (still, it tends to confuse a lot of tools if dots are used as just another character in filenames).
So, the basic problem here is not so much about those "cruft" files that use things that look like filename extensions, but are often just a regular part of the filename. Those files are really not different from files without any extension: We simply don't know what they are.
A completely unrelated question is what purpose they serve and if it's safe or advisable to delete them; just because a large directory tree has tons of
The relevant information here is context: What directory tree are those files in? Is it something that I created manually? Is it something that belongs to some piece of software that I use? Or is it a game that I downloaded, and my system's package manager doesn't know anything about it? Or an application that I built and installed manually? (
This all comes down to having context knowledge and making conscious decisions as a human. A tool like QDirStat can only deliver technical information; the user has to decide what to do with that information. The design decision here for QDirStat is what amount of additional information can be made accessible to the user, and how useful that information might be; also, how to obtain and visualize it while maintaining good usability.
Thanks for sharing further thoughts on this. Just to clarify, a lot of what I said in my previous message was to explain the background behind the feature request, e.g.
These were meant as rhetorical questions rather than questions for you or QDirStat to answer directly. To sum up, I think that the feature you previously suggested about showing a treemap which includes only files that do not belong to a well-known MIME category would be immensely useful in providing some of the contextual details necessary to begin thinking about some of these questions. I think that suggestion of rebuilding the whole treemap for just the uncategorised files is even better than the "locate files by category" feature that I originally enquired about. And yes, by uncategorised files, I mean to include both files with no extension as well as those that have an extension which does not have a category.
Okay, so let me think about refining that thought further: Ignoring known MIME categories to get a tree and a treemap that shows only those files that we don't know anything about. Side note: In the QDirStat classes, ignored files are put into a special place called attic. For a first shot (that might be quite easy to implement - let's see), the leftover files will probably also contain executables and tons of scripts. But maybe there will be an option to also ignore files that are both executable and in one of the well-known system directories like
Thanks, that all sounds very useful.
Moving to ideas document.
Firstly, thank you so much for #45 and #48. I've been looking for these features for years (obviously last time I tried QDirStat was an older version or before those features got added).
I'd been writing a collection of rather unwieldy shell scripts to try to concoct a series of find commands that would provide the same information, so having it built in to QDirStat is a massive help.
I have a few questions / suggestions about this feature that I'd appreciate your thoughts on, please:
The reason I ask is that I've been trying to find these files to get an idea of what they are. I tried the following find command, but there seems to be some discrepancy in the number of files reported:
This yields 82,658 files. However, according to QDirStat there are actually 83,487 files in my home directory.
Related to this, is it possible to make a way to quickly export/copy & paste all the information shown in the "Locate files by type" window for a given type (I guess some kind of text or csv format would be most convenient for including all the columns) - that way it could be saved to a text file or spreadsheet for further manipulation?
Finally, I note your comments about 'cruft' files. Is there a convenient way to do a similar analysis of these - perhaps give them all a separate category and again a way to see number of cruft files and total size wasted in cruft files per directory, which might be useful in trying to clean up some of them?
Even if the answers to the above are no, no and no - thank you anyway as I really appreciate this feature that you've added.