[Suggestion] Add an extension table with stats #45
Did you see the treemap color configuration? This is the graphical equivalent of what you suggest. I know that WinDirStat has that sidebar that lists the most used file extensions. The problem with that on Linux (and other Unix-like systems) is that there is no real equivalent: In most cases you cannot simply tell a file's type by its filename extension. That works only for a very limited variety of files, such as images (.jpg, .png, ...), videos (.mp4, .avi, .mkv, ...) and to some extent text or office files. Those are basically the ones the QDirStat treemap highlights: Everything in color belongs to one of those categories. And now look how many grey tiles you see in the treemap. That's "the rest", "miscellaneous", a.k.a. "I have no clue what that stuff might be". In particular, executables don't sport anything like "*.exe" as on Windows, and there are several dozen if not hundreds of files that belong to some software package or other, but that cannot be identified as such without reading at least part of them: That's what the |
This is a screenshot of QDirStat displaying the root filesystem on my Kubuntu machine. Notice how much grey stuff you see. The orange tiles are libraries of some kind, but that's already kind of cheating since it assumes all "lib*" and all ".so." files are libraries (it might also be something completely different). |
Verdict: Things just don't work that way on Linux. While some of the file types can be identified, all in all they represent just a small part of the overall picture (literally even). While that file type side box might be of limited interest on Windows (even there I have doubts), it loses all usefulness on a Linux-like system. When I started the QDirStat rewrite from the KDE 3 KDirStat code, I carefully considered the pros and cons of that WinDirStat feature and decided against it. |
Thanks a lot for the explanation and the write-up, I really appreciate it. I understand if you think that this feature wouldn't have a huge user base; I don't know how other people usually deal with their files. I get that Linux works mostly without any extensions, so it would be more or less useless for system files, but for user files it could be very helpful. For example (my use case at the time): enumerating all JPG/PNG files and looking at the file size. Personally I wouldn't mind only being able to see user files, as managing OS files is a job for the package management system. |
Just now I had to deal with a Samba share consisting of 31k files - it was very useful to see which files with which extensions were where (as I was converting lots of images/videos/audio). QDirStat was helpful, but for an overview by extension and for highlighting all files with a certain extension I had to resort to WinDirStat. Just wanted to say that the use case is there, even if not as much as on Windows ^^ |
I am curious: What's the information good for that you have 12 GB worth of JPG files vs. 7 GB worth of PNG files in a directory tree? Because that's about all that this view would give you. In the current QDirStat, by default those very similar file types are grouped together in the same MIME categories (Settings -> Configure -> MIME Categories) so they show up in the same color in the treemap which gives you a much better impression of their relative proportion in the tree, if they are grouped together or scattered around, and if it's a lot of little things or a few big blobs. And grouping them into categories rather than have another color for each individual filename extension is the only thing that makes sense - there are just not enough colors anybody can reasonably tell apart from each other. Just look at the number of filename extensions for common file types like images, videos, all kinds of office documents. But if you really need it to be more specific, you are free to configure it to your liking; just add more MIME categories, move the filename extensions around as you like and assign a color. And unlike in WinDirStat's file type list, the colors are stable (and configurable per MIME category); WinDirStat has a sequence of colors that always assigns the type with most disk usage (whichever that may be) to color #1, the next most to color #2 etc., so you'll never know by looking at the treemap what each of those colors means; so in WinDirStat, the primary use of that extra panel is to be a legend for the treemap. And it's necessary there because the colors mean something different each time. So, the funny thing is that while searching the web for example screenshots, I did not find a single WinDirStat screenshot that did not have the size column of that file type panel cut off - most often it's cut off completely, in a few cases it's only cut off partly. That's how important that information is to people. 
So, seriously, what's the use case for that information? I am not going to clutter the display of something as complex as QDirStat with even more stuff that does not add a value to most users. Also, be aware that QDirStat is not meant as a substitute for the |
I am aware, I was using find/convert/ffmpeg to do the job and QDirStat to locate folders with the files (since I had to do this guided, as I can only convert some files and not others because of reasons). Thanks for the tips though. I had never heard of mogrify; a while back it could have saved me a lot of time >.<
Example - Internet connection here is not the greatest, so uploading/downloading 25GB worth of JPG files when syncing to a remote backup server takes a while. After converting some 5GB of JPG files to WebP, I could easily see that I just saved about 2GB, now that I have 20GB worth of JPG files and 3GB of WebP files. I could see that I get a compression rate of nearly 50% and weigh the benefits of converting everything to WebP vs keeping some/most files in JPG for compatibility with a site that does not support WebP, to avoid the hassle of converting back to JPG when those files need to be used again. It really helps with arranging files when you can instantly see "Hey, 20GB of your files are JPG, 2GB are BMP files which could easily be 10MB total after converting and you have 2GB of PDF files, you monster". EDIT: I wouldn't really need a complete clone of how WinDirStat does it (with the selecting magic and such)
|
I did some experimenting. It's not very pretty yet, but it shows the principle: Notice that right now it's just text mode because I didn't have the slightest clue if it would warrant the additional work to make it pretty with a table and everything. So far, this code lives in a Git branch. If you know how to do that, you can check it out, build it and see it live on your machine: https://github.com/shundhammer/qdirstat/tree/huha-extension-stats After the usual
then build QDirStat as always:
You get to that new window with menu View -> File Type Statistics.... |
Complete output from my /work partition:
|
The list is still sorted by filename extensions. Once there is a proper list widget, of course the user will be able to switch the sort order between extensions, number of items, and total size. There will also be a percent column. I will probably add tabs to switch between the categories (seen here on top) and the extensions -- just to avoid confusion because otherwise items will be counted twice, and the sum of all percentages will be way above 100%. |
I still don't know how useful this really is, but usefulness is decided by the users. Maybe people will really get creative what to do with this. One problem was that on a Linux filesystem there is a lot of "cruft" that accumulates in statistics of this kind: Unlike on Windows, a dot does not only serve to separate a file's base name from its extension; it is also used often enough as a general purpose character in filenames. It's really hard to tell automatically what is a real filename extension and what comes from the crazed imagination of some Linux developer who thinks dots in filenames are just great. For example, look at your
There are dots all over the place. Of course, a human can easily tell that in this case, In my new code, I used a lot of heuristics, and they might result in false positives or negatives. For example, QDirStat already has a class called MimeCategorizer to figure out the treemap colors. It already knows a lot of filename extensions (and regexp rules). So I found another use for that class here when trying to figure out bona fide extensions (as opposed to random crap where people misused the concept of filename extensions to do any random stuff). If the MimeCategorizer knows a suffix (a filename extension), I believe it. Not (only) because I wrote it, but because those lists are carefully hand-crafted. And then there are the real heuristics if the MimeCategorizer doesn't know a suffix; beware, there be dragons and all. ;-) If a suffix consists solely of numbers, it's very likely that it's not anything useful, so those files are discarded (disregarded in the statistics). If a suffix has 3 letters, it's very likely a valid filename extension. Those are kept. If a suffix is very long, and there are very few files of that type, it's probably also cruft, and those files are disregarded. Etc. etc. etc.; see To give an impression just how much such cruft there is, here is some log output while removing it:
|
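The heuristics described above can be sketched roughly as follows. This is only an illustration and not the code from the branch: the `KNOWN_SUFFIXES` set is a hypothetical stand-in for what the MimeCategorizer knows, and the thresholds `max_len` and `min_files` are invented, since the comment does not give actual numbers.

```python
# Hypothetical stand-in for the suffixes the MimeCategorizer already knows.
KNOWN_SUFFIXES = {"jpg", "png", "mp4", "avi", "mkv", "pdf"}


def is_cruft(suffix, file_count, known=KNOWN_SUFFIXES, max_len=8, min_files=3):
    """Return True if a suffix should be disregarded in the statistics.

    max_len and min_files are assumed thresholds for "very long" and
    "very few files"; the original comment does not state real values.
    """
    s = suffix.lower()
    if s in known:
        return False          # a known suffix is always believed
    if s.isdigit():
        return True           # purely numeric suffix: very likely cruft
    if len(s) == 3 and s.isalpha():
        return False          # three letters: very likely a real extension
    if len(s) > max_len and file_count < min_files:
        return True           # very long and rare: probably cruft
    return False              # assumption: keep anything else
```

For example, `is_cruft("0", 100)` is discarded as numeric, while `is_cruft("xyz", 1)` is kept because of the three-letter rule.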
Please notice that so far this is only experimental; that's why the code lives in that Git branch. There are no promises yet that this will make it into Git master (the code main line) anytime soon. Apart from the missing pretty user interface (and there will be no permanent text mode stuff like this experimental code in QDirStat), there are still issues like how to handle tree refresh and updating this information. So far, this is only very static and thus only a snapshot of the current situation; it gets outdated whenever the user starts a cleanup action, refreshes the tree or a branch of it. |
In the final version, the list will probably be restricted to show only the top 50 or so; the 50 filename extensions with most size/percentage and the 50 with most files (so it will be more than 50, 100 in the worst case). Or only the top 20. I don't know yet. Showing everything where most filename extensions have only a very small number of files or a very small size percentage doesn't seem to make that much sense. The number will be configurable, of course. |
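A minimal sketch of that "top N by size plus top N by file count" selection, assuming the statistics are available as a plain mapping from suffix to `(file_count, total_size)`. The function name and the data layout are made up for illustration; they are not taken from QDirStat.

```python
def top_entries(stats, n=50):
    """Select suffixes that are in the top n by total size or by file count.

    stats maps suffix -> (file_count, total_size). The result has at most
    2 * n entries (no overlap) and at least min(n, len(stats)).
    """
    by_size = sorted(stats, key=lambda s: stats[s][1], reverse=True)[:n]
    by_count = sorted(stats, key=lambda s: stats[s][0], reverse=True)[:n]
    # dict.fromkeys() removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(by_size + by_count))


stats = {
    "jpg": (100, 5000),
    "iso": (1, 9000),
    "txt": (500, 10),
    "log": (2, 20),
}
selected = top_entries(stats, n=2)
```

With `n=2`, "iso" and "jpg" qualify by size and "txt" and "jpg" by count, so the combined list has three entries rather than four.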
Thank you, this is awesome! I'll try to check the branch out soon (on vacation atm), but it looks like the exact thing I was missing! Seriously, thank you for even willing to add this feature in a separate branch! |
Not so fast, my friend, we are not done yet... 😄 I turned this very crude text display into something a lot more Qt-ish: A multi-column tree with the MIME categories and corresponding suffixes (the filename extensions) that belong to that category below it. And since there are a lot of suffixes that don't belong to any of the predefined categories, there is now a category "Other" for them. But I restricted that to the "Top 20" (i.e. the 20 suffixes with the most total sizes). That number will be (but is not yet) configurable. Screenshots: My /work partition: My root partition: My Windows C: drive: My Windows D: drive that hosts mostly games: By now you can probably tell what my favourite game is... 😄 |
Now that I invested so much work into that, it won't go away anymore; it will make it to Git master for sure. 😄 I am still not 100% sure how useful it really is. I did get some unexpected insights into my disk usage, though, and that might be a sign that it's not quite as useless as I initially thought. 😄 But you can see from these screenshots that your mileage will definitely vary:
Also, there is not yet any kind of communication between this new window and the internal database, the DirTree: Update the stats window when the tree changes, not overdoing the update (wait a few seconds until things have settled down and only then recalculate the statistics etc.). And of course since I now have that nifty window telling me that there are still junk files around, I want it to find them for me. Right now this is an absolutely passive window (albeit it's non-modal, i.e. you can work in the main window while it is open). I have some ideas in my head how to do that; maybe applying a filter to the main window based on suffixes I click in the stats window or whatever. But first it needs to get a bit more complete, and it needs to stabilize. Frankly, I had not expected in the least that QDirStat development would take this turn. But since there are positive results, it's a pleasant surprise. 😃 And you know what? Now this thing is a lot better than its WinDirStat counterpart because in WinDirStat that panel mostly serves as a legend to the treemap colors that are ever-changing in WinDirStat. 💪 |
The best points about this are being able to categorise files that haven't been properly identified before, and thus being able to present some meaningful stats about the whole file system.
|
This is exactly the one thing that cannot be done. I wrote that in my very first comment here. That would mean reading the first few blocks of every file that cannot be properly identified. For a Linux root filesystem that would mean reading the first few blocks of roughly 80.000 (!) files - everything below the "other" category. Just look at the screenshot of my root filesystem above. How long do you think this would take? This is only feasible for a handful of files - when you invoke |
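To illustrate why content-based identification costs one read per file, here is a minimal sketch of sniffing a file's leading bytes. The signature table contains only a few well-known magic numbers and the labels are illustrative; this is not how QDirStat works, since it classifies by filename only.

```python
# A few well-known file signatures ("magic numbers"); far from complete.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"\x7fELF": "ELF binary",
    b"%PDF-": "PDF document",
}


def sniff(path, nbytes=16):
    """Guess a file's type from its first bytes; None if unrecognized.

    Every call opens and reads the file, which is exactly the I/O that
    becomes expensive when repeated for tens of thousands of files.
    """
    with open(path, "rb") as f:
        head = f.read(nbytes)
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return None
```

Unlike a suffix lookup, this still recognizes a PNG file that has been renamed to `.txt`, but only at the price of touching the file's contents.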
Of course it makes sure that one file only adds to one category. What makes you think it doesn't? Where did you see any percentages that add up to more than 100%? As a matter of fact, right now it has the exact opposite problem: A large portion of the disk space remains unaccounted for in those statistics. I'll have to check where all that goes missing. Some of it can be explained by directories that also use some disk space, but right now this is way out of proportion. Some more of it may be because of the other file types below "other" that don't belong to the "Top 20" (typically around 100 suffixes), some may be because of stuff that has been classified as "cruft". Probably it can all be explained, but I'd rather make sure and not just guess. |
Ta-da... 😄 This has been available since 2016-06-29: The config file is now split into four independent ones. |
Sorry my mistake about the >100%, I had read the following quote and misremembered it
|
I saw the QDirStat-mime.conf file but that's not quite what I had in mind, though it does go some way towards it. I accept your claim that it is too arduous a task to read all the files (for now), but it may be OK to have a background process that does it for the files that aren't identified by other means - or perhaps if the user requests it - say they are looking at a sub-directory with a limited number of files, a mere 1,000 or so, or they are prepared to let it chug away while they have dinner. Obviously a better solution is needed - extensions are a hack and can't be trusted, and reading the contents is primitive, prone to error and also a hack. MIME types are not a hack and the system is extensible, but it seems to already be out of control - too much and at the same time too little; it might just need some standardisation. I have seen somewhere (not sure which distro or DE) a file explorer that allowed me to sort files by "detailed filetype" (or MIME type), which is the closest I have seen to sorting by extension in Windows Explorer (probably the only Windows feature I actually miss). |
@flurbius I think you mean Nautilus (though other file explorers may have the capability too) http://i.imgur.com/FaCWl52.png |
No, Nautilus does not do that; it also just looks at filename extensions. I just renamed a .png file to .txt. |
I found all the missing items: I used the wrong way of iterating over the tree which disregarded the DotEntries. Now that I use the iterator class I had created for just this purpose many years ago, the sums add up to just below 100% (a little space is still used for directory nodes etc.). |
I've tried the branch out and it looks perfect; the only thing I'd like (besides highlighting files by extension, but that's just a nice-to-have) is adding an Again, thanks a lot for the feature! |
Merged to Git master. |
Next things to come: Use the file type stats window to locate the corresponding files in the main window. I wrote that this is not supposed to be a replacement for the Right now the idea is to search the tree again for all files with that suffix and list the results in another (again non-modal) window; probably just one entry per directory that contains any of them. When you click on such a result, it would open that branch in the tree of the main window and select (highlight) all files with that suffix. That will work well with a handful of files; I don't know yet how to do this efficiently with things like my photo collection: It's 29,000+ JPG files scattered over 850+ directories. That's a bit much to navigate in. Would I want only my /work/photos directory to appear there? That's a bit limiting. I'll have to think about that. |
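The "one entry per directory that contains any of them" idea can be sketched outside of QDirStat with a plain directory walk. This is only an illustration using the Python standard library; the function name is made up, and QDirStat itself would search its in-memory tree rather than the disk.

```python
import os
from collections import defaultdict


def dirs_with_suffix(root, suffix):
    """Map each directory below root to the files in it ending with suffix.

    Directories without a matching file do not appear in the result, so
    the keys form the "one entry per directory" result list.
    """
    wanted = suffix.lower()
    hits = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(wanted):
                hits[dirpath].append(name)
    return dict(hits)
```

For a photo collection like the one described, the result would have one key per directory (850+) rather than one per file (29,000+), which is smaller but still a long list to navigate.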
See #48 for that new idea and any follow-up discussions. |
I toyed with that idea, but so far I couldn't come up with a good concept to have this non-disruptive: Right now all the percentages add up to (about) 100%; this is what users expect. With an All category in the same tree, however, everything would show up twice, so the overall sum would be about 200%. So it would have to be either tabs or a check box (or combo box or radio box) to switch between the existing By Category view and that new All view. This would considerably add to the complexity of that window; that's the major gripe the Apple fans always have with Linux things. They love the beauty of the simplicity that is so prevalent in Apple products, and they do have a point there. So, again, what is the use case for that? A user would surely have a general idea what general category a file type belongs to, right? So given the small number of those categories, any suffix is much easier and quicker to locate right now than having to sift through some 500 suffixes in an All view, right? And comparing the cumulative sizes of, say, all *.jpg against all *.mp4 on a disk is literally comparing apples (note the lower-case 'a' here 😃 ) to oranges. So I don't see the point (but then, we've been there before). |
Personally I prefer to see everything at once without opening many tree menus. I will adapt fine to the way it is currently done, as the groups themselves are sorted by size; it's just not what I'm used to. EDIT: Upon further inspection I see you made the categories editable, so I'll just create my own monster category with everything if I need it. |
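The arithmetic behind the 200% remark, as a tiny sketch with made-up numbers: as long as each file counts towards exactly one category, the percentages sum to 100; an extra All row that counts every file a second time doubles that sum.

```python
def percentages(sizes, total):
    """Convert absolute sizes into percentages of total."""
    return {name: 100.0 * size / total for name, size in sizes.items()}


# Invented example sizes (in GB); each file belongs to exactly one category.
by_category = {"images": 60, "videos": 30, "other": 10}
total = sum(by_category.values())

category_sum = sum(percentages(by_category, total).values())

# Adding an "All" row to the same tree counts every file a second time.
with_all = dict(by_category, All=total)
combined_sum = sum(percentages(with_all, total).values())
```

Here `category_sum` comes out at 100 and `combined_sum` at 200, which is why a separate tab or view switch would be needed to keep each view adding up to 100%.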
WinDirStat has this extension table that lets you see files by their extension.
I'd love to see this feature in QDirStat, as I think it'd be helpful to many people. Personally right now I'd like to see how much data I have in image files, but I cannot use QDirStat to find out.