Compressed docset #138
This is something I tried to add to Dash, so for what it's worth I'll share my progress. I'm writing this from memory, so sorry for any mistakes. What seems to be needed is an indexed archive format which lets you extract individual files really fast. Archive formats I've tried:
If anyone has any experience with archive formats, help would be appreciated 👍
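To illustrate the "indexed archive with fast single-file extraction" requirement, here is a minimal sketch (Python, hypothetical names and paths, not Dash's or Zeal's actual code): build a path-to-offset index for an uncompressed tar once, then serve any page with a single seek and read. An indexed compressed archive (as tarix, mentioned later in this thread, does) additionally needs offsets into the compression stream.

```python
# Minimal sketch: index an *uncompressed* tar once, then extract single members
# by seeking directly to their data. Paths below are illustrative.
import tarfile

def build_index(tar_path):
    """Map member path -> (data offset, size) by scanning the tar headers once."""
    index = {}
    with tarfile.open(tar_path, "r:") as tf:
        for member in tf.getmembers():
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    """Read one member with a single seek + read, without unpacking the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)

index = build_index("Python_3.docset.tar")  # built once, can be cached on disk
page = read_member("Python_3.docset.tar", index,
                   "Python_3.docset/Contents/Resources/Documents/index.html")
```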
Hi, thanks for your explanation on this. Personally I don't think the docset size itself really matters, given the size of current hard disks. The problem I'm facing is the number of small files: the storage used can be much larger than the data itself, because each small file takes at least a filesystem block (4 KB?), even when the file is only 1 KB. Also, moving the docsets takes a very long time. I tried compressing several of the docsets (Yii, J2SE) with zip. While zip does not compress as small as 7z with PPMd, it still achieves a pretty good compression ratio.
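To put a number on the per-file block overhead mentioned above, a short script along these lines (a sketch; the docset path and the 4 KiB block size are assumptions) compares the logical size of a docset tree with the space it occupies once every file is rounded up to whole filesystem blocks:

```python
# Estimate filesystem block "slack" for a docset tree: logical bytes vs. bytes
# actually allocated when each file is rounded up to a 4 KiB block.
import os

BLOCK = 4096  # assumed filesystem block size

def docset_overhead(root):
    logical = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            logical += size
            allocated += ((size + BLOCK - 1) // BLOCK) * BLOCK
    return logical, allocated

logical, allocated = docset_overhead(os.path.expanduser("~/.local/share/Zeal/Zeal/docsets"))
print(f"logical: {logical / 1e9:.2f} GB, allocated: {allocated / 1e9:.2f} GB "
      f"(overhead: {(allocated - logical) / 1e9:.2f} GB)")
```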
Unfortunately, I have no interest in pursuing this other than for file size issues. The size of the current HDDs is not an issue, but the size of SSDs is. Also there are hosting and bandwidth issues, so size matters there as well, but I think one way around that would be to compress using a format (zip) and then recompress that using tgz or other. Zip does work for some docsets, but fails with others. I can't remember which. Sorry.
I understand your reasoning, but speaking about file size, surely for the user, having the docsets in zip format will still take much less space than the uncompressed files. What do you think about distributing the docsets in 7z format (less distribution bandwidth, faster download time; why 7z, because AFAIK only 7z supports the PPMd algorithm, and this is the fastest and smallest compression for text files), and converting them into zip after they are downloaded by the user? Hopefully this can be done without using a temporary file (longer lifetime for the SSD).
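A rough sketch of that download pipeline, assuming the third-party py7zr package (file names are illustrative): the 7z archive is unpacked into memory and repacked as a zip, so no temporary files are written to disk. For very large docsets a streaming, member-by-member approach would be kinder to RAM.

```python
# Sketch: repack a downloaded .7z docset as a .zip without temporary files on disk.
# Assumes the third-party `py7zr` package; names are illustrative.
import zipfile

import py7zr

def sevenzip_to_zip(src_7z, dst_zip):
    with py7zr.SevenZipFile(src_7z, mode="r") as szf, \
         zipfile.ZipFile(dst_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # readall() yields {member name: in-memory file object} for every file member
        for name, data in szf.readall().items():
            zf.writestr(name, data.read())

sevenzip_to_zip("Yii.docset.7z", "Yii.docset.zip")
```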
We use a zip format for storing text documents in Mono's documentation tool. We use our own indexing code. If you're interested, I can point you to the specific code in the mono project.
Adding some thoughts... I am planning to add QCH (Qt Assistant format) and CHM support to Zeal at some point in the future. Both formats provide everything in a single file. CHM files are compressed with the LZX algorithm. As the next step I'd like to evaluate a Zeal-specific format (most likely extended from QCH) which would provide some level of compression for the data. I am not sure if that would work out with the planned full-text search.
Compressing a single row in an SQLite database would be less effective, since the dictionary will be limited to that single text, wouldn't it? I think it's more practical to use zip as the archive format and embed the ToC and index as JSON files inside the zip. Full-text search can be created when the documentation is added for the first time. Converters can be created to convert from CHM/QCH to the zip format. This would also keep the binary smaller, since you would not have to embed the decoding libraries in Zeal. Users who want to create their own documentation can simply zip the HTML files and add them to Zeal.
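A small sketch of what that layout could look like on the reading side (Python; the member names are hypothetical, this is not a defined Zeal format): the ToC and index are plain JSON members next to the HTML pages, and the zip central directory already provides random access to any single member without unpacking the archive.

```python
# Sketch: read the ToC, the index, and a single page from a docset stored as a
# zip with embedded toc.json / index.json. Member names are hypothetical.
import json
import zipfile

with zipfile.ZipFile("Yii.docset.zip") as zf:
    toc = json.loads(zf.read("toc.json"))                    # table of contents
    index = json.loads(zf.read("index.json"))                # symbol -> page path
    page = zf.read(index["CActiveRecord"]).decode("utf-8")   # one page, one read
```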
@Kapeli Could you try testing lrzip?
A lot has changed since I last posted in this issue. I forgot it even exists. Sorry! Anyways, Dash for iOS supports archived docsets right now. Dash for OS X will get support for archived docsets in a future update too. Archived docsets are only supported for my "official" docsets (i.e. the ones at https://kapeli.com/docset_links) and for user-contributed docsets. This is enough, as these are the docsets that can be quite large; others are not really an issue. I still use tgz for the archived docset format; the only difference is that I compress the docsets using tarix, which has proven to be very reliable. Performance-wise, it takes about 5-10 times longer to read a file from the archive than it takes to read it directly from disk. Directly from disk on my Mac it takes up to 0.001s for the larger doc pages, while from an archived docset it takes up to 0.01s. Despite that, there's no noticeable impact, as when a page is loaded the actual read of the files takes very little time compared to the loading of the WebView and the DOM and whatever (the WebView takes up about 90% of the load time).
@Kapeli @trollixx So Dash doesn't need to fully decompress the tgz file? (I'm not sure if that is what you meant.) But Zeal still needs to fully decompress the tgz file, so I think you can see what I'm getting at. :)
Dash does not need to decompress the tgz file anymore, no.
Sounds interesting. I'll look into handling of tarix indices to eliminate docset unpacking. I hadn't heard about tarix before.
Kind reminder: the index file could be extracted in advance, because access to the index file is I/O intensive.
In the meantime, Mac users can use HFS compression, and Linux users can put their docset folder on a filesystem with transparent compression, like Btrfs or ZFS.
About bundling, in numbers:
- a VHD container (NTFS, compression enabled) with the docsets has a size of 19 GB
- the 700 thousand files inside the VHD have a total size of ~9 GB

It seems 10 GB was spent on storing file tables, attributes, etc. I think bundling (compressed or not) is a must.
On Friday March 31 2017 06:35:40 evgeny g likov wrote:
> about bundling, in numbers:
> - a VHD container (NTFS, compression enabled) with the docsets has a size of 19 GB
> - the 700 thousand files inside the VHD have a total size of ~9 GB

That many files will almost unavoidably lead to disk space overhead ("waste"), because chances are slim that the majority will be an exact multiple of the disk block size (4096 for most modern disks). Not to mention the free-space fragmentation they can cause.
I think what you meant was the filesystem block. A disk block (sector) is only used for addressing, while a single file cannot occupy less than a filesystem block.
How about using dar to store and compress the docsets?
If the goal is not to preserve the docset bundle "as is", couldn't you use a lightweight key/value database engine like LMDB? File names (or paths) would be the keys, and then you can use whatever compression gives the desired cost/benefit trade-off to store the values (i.e. the file content). I've used this approach (with LZ4 compression) to replace a file-based data cache in my personal KDevelop fork, and it works quite nicely (with an API that mimics the file I/O API). This gives me 2 files on disk instead of thousands, which is evidently a lot more efficient. FWIW, my docset collection is over 3 GB before HFS compression, and just over 1 GB after. I have enough disk space not to compress, but that doesn't mean I spit on saving 2 GB. "There are no small economies", as they say in France, and following that guideline is probably why I still have lots of free disk space.
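A minimal sketch of that key/value approach, assuming the third-party lmdb and lz4 Python packages (names and paths are illustrative; this is not the KDevelop code mentioned above): paths are keys, LZ4-compressed file contents are values, and LMDB keeps everything in just two files on disk.

```python
# Sketch: store docset files in LMDB with LZ4-compressed values.
# Assumes the third-party `lmdb` and `lz4` packages; paths are illustrative.
import lmdb
import lz4.frame

env = lmdb.open("docset.lmdb", map_size=2 * 1024**3)  # 2 GiB maximum map size

def put_file(path, data):
    with env.begin(write=True) as txn:
        txn.put(path.encode("utf-8"), lz4.frame.compress(data))

def get_file(path):
    with env.begin() as txn:
        blob = txn.get(path.encode("utf-8"))
        return None if blob is None else lz4.frame.decompress(blob)

put_file("Documents/index.html", b"<html>...</html>")
print(get_file("Documents/index.html"))
```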
SQLite with LZ4 or zstd for blob compression is what I have in mind. There are also some larger goals that I hope to achieve by moving to the new docset format, such as embedded metadata, ToC support, etc.
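For concreteness, a hedged sketch of that idea in Python, assuming the third-party zstandard package; the one-table schema is purely illustrative, not the planned Zeal format:

```python
# Sketch: one row per file in SQLite, blobs compressed with Zstandard.
# Assumes the third-party `zstandard` package; the schema is illustrative.
import sqlite3
import zstandard

db = sqlite3.connect("docset.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")

cctx = zstandard.ZstdCompressor(level=19)
dctx = zstandard.ZstdDecompressor()

def put_file(path, data):
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, cctx.compress(data)))

def get_file(path):
    row = db.execute("SELECT data FROM files WHERE path = ?", (path,)).fetchone()
    return None if row is None else dctx.decompress(row[0])

put_file("Documents/index.html", b"<html>...</html>")
db.commit()
print(get_file("Documents/index.html"))
```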
Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.
On 02 Dec 2018, at 10:34, Charles ***@***.***> wrote:
> Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.

I think that argument is largely moot when you combine files into a single compressed file (which doesn't mean there can't be a benefit to using a dictionary; lz4 allows this too).
When storing the files in a key/value database or SQLite, each file is compressed independently, which is why a precomputed dictionary improves the compression significantly, not to mention that it avoids effectively storing the same dictionary data over and over in every row.
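A sketch of the per-docset dictionary idea, again assuming the third-party zstandard package; the dictionary size, sample selection, and paths are arbitrary illustrative choices:

```python
# Sketch: train one zstd dictionary per docset on a sample of its pages, then
# compress every page against it. Assumes the third-party `zstandard` package.
import glob
import zstandard

sample_paths = glob.glob("Documents/**/*.html", recursive=True)[:2000]
samples = [open(p, "rb").read() for p in sample_paths]

dict_data = zstandard.train_dictionary(112_640, samples)    # ~110 KiB dictionary
cctx = zstandard.ZstdCompressor(level=19, dict_data=dict_data)
dctx = zstandard.ZstdDecompressor(dict_data=dict_data)

compressed = [cctx.compress(s) for s in samples]            # one blob per page/row
assert dctx.decompress(compressed[0]) == samples[0]

# The dictionary itself is stored once per docset (dict_data.as_bytes()),
# not repeated in every row.
```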
Using one dictionary per docset is an interesting idea, definitely worth benchmarking. Regarding LZ4 and zstd, I just mentioned these two as examples; nothing has been decided so far.
Just wanted to say that I feel this is the number 1 issue with Zeal and should be given much higher priority. Tens to hundreds of thousands of files mean that any time I perform any large disk I/O tasks on any of my systems, they get choked on the Zeal docsets. If I try to use a large-directory-finding program (like WinDirStat or KDirStat) I have to wait as Zeal's docsets take up roughly 1/2 to 1/3 of the total search time. Making backups or copies of my home directory takes ages, as the overhead for reading each of these files is incredible. I bet the search cache must be much larger and slower on each of my systems because of having to index all of Zeal's docsets. Even Doom (a very, very early example of a game we have the source code to) solved this problem back in the day. Almost all of the game's data is stored in a few "WAD" files (standing for "Where's All the Data?"). If users want to play user-made mods or back up their game data, they just need to copy and paste a WAD file. Sorry if this comes across as bitching or complaining; I'm just trying to express how much this issue matters to me (and presumably many other users). I'm going to try to dust off my programming skills and work on this too.
Here is a workaround for this issue:

#!/bin/bash
# prepare:
# fstab:
# name the file /usr/local/bin/zeal to have higher priority over /usr/bin/zeal
> Here is a workaround for this issue:

I've got an even bigger/longer one ;)
- migrate your entire root to ZFS
- create a dataset for the docsets, with compression=gzip-9, and decide where to mount it (I use /opt/docs/docsets)
- move all docsets there, and point Zeal to that path in its settings.

However, every solution that uses filesystem-based compression will still be suboptimal, because even the tiniest file in the docset will still occupy the minimum filesystem or disk block - and it is not cross-platform. The way around that would be for Zeal itself to support compressed docsets, or simply to use one of the existing libraries to access a compressed archive as a directory. Compressed archives can be packed much more compactly than a generic filesystem, and they're cross-platform.
Nope, you completely missed the point. :) Look at the year:
and there still wasn't a solution, just speculations like yours. :)
Any progress or formal thoughts on this, or where to start? Text compression would be 🔥. Currently using 1/3 of my expensive 256 GB Mac NVMe.
Maybe you can zip your data directory, mount it using FUSE, and then put an overlay filesystem over it to allow for modifications.
Is R/W access necessary?
Regardless, it strikes me that this is something Zeal could do as an alternative to implementing its own support for compressed docsets.
Currently I don't use Zeal because the docsets take up a lot of hard disk space and I don't have much space available. However, I've seen a format that's used to archive web pages, called WARC (Web ARChive); it has support for compression and indexing. Here I leave some links with information about the format:
Any updates? Will this feature be considered?
FWIW, there's a multi-platform kit, PFM, with which a docset can be transparently turned into a private compressed virtual file system, so hopefully there would be no need to modify the existing code much, at least conceptually.
dwarfs is a read-only file system that could work for this. Here are some compression results:
DwarFS looks interesting and should address the lack of indexing in docset archives. Converting docsets on the client may be quite slow though.
@trollixx Distributing the converted images directly from the server would avoid converting them on the client side.
Hi,
Do you have a plan to support compressed docsets, like a zipped archive? It would help very much in reducing the size and disk reading time.