Compressed docset #138
This is something I tried to add to Dash, so for what it's worth I'll share my progress. I'm writing this from memory, so sorry for any mistakes. What seems to be needed is an indexed archive format which lets you extract individual files really fast. Archive formats I've tried:
If anyone has any experience with archive formats, help would be appreciated 👍
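To illustrate the "indexed archive with fast single-file extraction" requirement, here is a minimal sketch (Python, hypothetical names and paths, not Dash's or Zeal's actual code): build a path-to-offset index for an uncompressed tar once, then serve any page with a single seek and read. An indexed compressed archive (as tarix, mentioned later in this thread, does) additionally needs offsets into the compression stream.

```python
# Minimal sketch: index an *uncompressed* tar once, then extract single members
# by seeking directly to their data. Paths below are illustrative.
import tarfile

def build_index(tar_path):
    """Map member path -> (data offset, size) by scanning the tar headers once."""
    index = {}
    with tarfile.open(tar_path, "r:") as tf:
        for member in tf.getmembers():
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    """Read one member with a single seek + read, without unpacking the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)

index = build_index("Python_3.docset.tar")  # built once, can be cached on disk
page = read_member("Python_3.docset.tar", index,
                   "Python_3.docset/Contents/Resources/Documents/index.html")
```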
Hi, thanks for your explanation on this. Personally I don't think the docset size itself really matters, given the size of current hard disks. The problem I'm facing is the number of small files: the storage used can be much larger than the data itself, because each small file takes at least a filesystem block (4 KB?), even when the file is only 1 KB. Also, moving the docsets takes a very long time. I tried compressing several of the docsets (Yii, J2SE) with zip. While zip does not compress as small as 7z with PPMd, it still achieves a pretty good compression ratio.
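To put a number on the per-file block overhead mentioned above, a short script along these lines (a sketch; the docset path and the 4 KiB block size are assumptions) compares the logical size of a docset tree with the space it occupies once every file is rounded up to whole filesystem blocks:

```python
# Estimate filesystem block "slack" for a docset tree: logical bytes vs. bytes
# actually allocated when each file is rounded up to a 4 KiB block.
import os

BLOCK = 4096  # assumed filesystem block size

def docset_overhead(root):
    logical = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            logical += size
            allocated += ((size + BLOCK - 1) // BLOCK) * BLOCK
    return logical, allocated

logical, allocated = docset_overhead(os.path.expanduser("~/.local/share/Zeal/Zeal/docsets"))
print(f"logical: {logical / 1e9:.2f} GB, allocated: {allocated / 1e9:.2f} GB "
      f"(overhead: {(allocated - logical) / 1e9:.2f} GB)")
```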
Unfortunately, I have no interest in pursuing this other than for file size issues. The size of the current HDDs is not an issue, but the size of SSDs is. Also there are hosting and bandwidth issues, so size matters there as well, but I think one way around that would be to compress using a format (zip) and then recompress that using tgz or other. Zip does work for some docsets, but fails with others. I can't remember which. Sorry.
I understand your reasoning, but speaking about file size, surely for the user, having the docsets in zip format will still take much less space than the uncompressed files. What do you think about distributing the docsets in 7z format (less distribution bandwidth, faster download time; why 7z, because AFAIK only 7z supports the PPMd algorithm, and this is the fastest and smallest compression for text files), and converting them into zip after they are downloaded by the user? Hopefully this can be done without using a temporary file (longer lifetime for the SSD).
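A rough sketch of that download pipeline, assuming the third-party py7zr package (file names are illustrative): the 7z archive is unpacked into memory and repacked as a zip, so no temporary files are written to disk. For very large docsets a streaming, member-by-member approach would be kinder to RAM.

```python
# Sketch: repack a downloaded .7z docset as a .zip without temporary files on disk.
# Assumes the third-party `py7zr` package; names are illustrative.
import zipfile

import py7zr

def sevenzip_to_zip(src_7z, dst_zip):
    with py7zr.SevenZipFile(src_7z, mode="r") as szf, \
         zipfile.ZipFile(dst_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # readall() yields {member name: in-memory file object} for every file member
        for name, data in szf.readall().items():
            zf.writestr(name, data.read())

sevenzip_to_zip("Yii.docset.7z", "Yii.docset.zip")
```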
We use a zip format for storing text documents in Mono's documentation tool. We use our own indexing code. If you're interested, I can point you to the specific code in the mono project.
Adding some thoughts... I am planning to add QCH (Qt Assistant format) and CHM support to Zeal at some point in the future. Both formats provide everything in a single file. CHM files are compressed with the LZX algorithm. As the next step I'd like to evaluate a Zeal-specific format (most likely extended from QCH) which would provide some level of compression for the data. I am not sure if that would work out with the planned full-text search.
Compressing a single row in an SQLite database would be less effective, since the dictionary will be limited to that single text, wouldn't it? I think it's more practical to use zip as the archive format and embed the ToC and index as JSON files inside the zip. Full-text search can be created when the documentation is added for the first time. Converters can be created to convert from CHM/QCH to the zip format. This would also keep the binary smaller, since you would not have to embed the decoding libraries in Zeal. Users who want to create their own documentation can simply zip the HTML files and add them to Zeal.
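A small sketch of what that layout could look like on the reading side (Python; the member names are hypothetical, this is not a defined Zeal format): the ToC and index are plain JSON members next to the HTML pages, and the zip central directory already provides random access to any single member without unpacking the archive.

```python
# Sketch: read the ToC, the index, and a single page from a docset stored as a
# zip with embedded toc.json / index.json. Member names are hypothetical.
import json
import zipfile

with zipfile.ZipFile("Yii.docset.zip") as zf:
    toc = json.loads(zf.read("toc.json"))                    # table of contents
    index = json.loads(zf.read("index.json"))                # symbol -> page path
    page = zf.read(index["CActiveRecord"]).decode("utf-8")   # one page, one read
```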
@Kapeli Could you try testing lrzip?
A lot has changed since I last posted in this issue. I forgot it even exists. Sorry! Anyways, Dash for iOS supports archived docsets right now. Dash for OS X will get support for archived docsets in a future update too. Archived docsets are only supported for my "official" docsets (i.e. the ones at https://kapeli.com/docset_links) and for user-contributed docsets. This is enough, as these are the docsets that can be quite large; others are not really an issue. I still use tgz for the archived docset format; the only difference is that I compress the docsets using tarix, which has proven to be very reliable. Performance-wise, it takes about 5-10 times longer to read a file from the archive than it takes to read it directly from disk. Directly from disk on my Mac it takes up to 0.001s for the larger doc pages, while from an archived docset it takes up to 0.01s. Despite that, there's no noticeable impact, as when a page is loaded the actual read of the files takes very little time compared to the loading of the WebView and the DOM and whatever (the WebView takes up about 90% of the load time).
@Kapeli @trollixx So Dash doesn't need to fully decompress the tgz file? (I'm not sure if that is what you meant.) But Zeal still needs to fully decompress the tgz file, so I think you can see what I'm getting at. :)
Dash does not need to decompress the tgz file anymore, no.
Sounds interesting. I'll look into handling of tarix indices to eliminate docset unpacking. I hadn't heard about tarix before.
Kind reminder: the index file could be extracted in advance, because access to the index file is I/O intensive.
In the meantime, Mac users can use HFS compression, and Linux users can put their docset folder on a filesystem with transparent compression, like Btrfs or ZFS.
About bundling, in numbers:
- a VHD container (NTFS, compression enabled) with the docsets has a size of 19 GB
- the 700 thousand files inside the VHD have a total size of ~9 GB

It seems 10 GB was spent on storing file tables, attributes, etc. I think bundling (compressed or not) is a must.
On Friday March 31 2017 06:35:40 evgeny g likov wrote:
> about bundling, in numbers:
> - a VHD container (NTFS, compression enabled) with the docsets has a size of 19 GB
> - the 700 thousand files inside the VHD have a total size of ~9 GB

That many files will almost unavoidably lead to disk space overhead ("waste"), because chances are slim that the majority will be an exact multiple of the disk block size (4096 for most modern disks). Not to mention the free-space fragmentation they can cause.
I think what you meant was the filesystem block. A disk block (sector) is only used for addressing, while a single file cannot occupy less than a filesystem block.
How about using dar to store and compress the docsets?
If the goal is not to preserve the docset bundle "as is", couldn't you use a lightweight key/value database engine like LMDB? File names (or paths) would be the keys, and then you can use whatever compression gives the desired cost/benefit trade-off to store the values (i.e. the file content). I've used this approach (with LZ4 compression) to replace a file-based data cache in my personal KDevelop fork, and it works quite nicely (with an API that mimics the file I/O API). This gives me 2 files on disk instead of thousands, which is evidently a lot more efficient. FWIW, my docset collection is over 3 GB before HFS compression, and just over 1 GB after. I have enough disk space not to compress, but that doesn't mean I spit on saving 2 GB. "There are no small economies", as they say in France, and following that guideline is probably why I still have lots of free disk space.
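A minimal sketch of that key/value approach, assuming the third-party lmdb and lz4 Python packages (names and paths are illustrative; this is not the KDevelop code mentioned above): paths are keys, LZ4-compressed file contents are values, and LMDB keeps everything in just two files on disk.

```python
# Sketch: store docset files in LMDB with LZ4-compressed values.
# Assumes the third-party `lmdb` and `lz4` packages; paths are illustrative.
import lmdb
import lz4.frame

env = lmdb.open("docset.lmdb", map_size=2 * 1024**3)  # 2 GiB maximum map size

def put_file(path, data):
    with env.begin(write=True) as txn:
        txn.put(path.encode("utf-8"), lz4.frame.compress(data))

def get_file(path):
    with env.begin() as txn:
        blob = txn.get(path.encode("utf-8"))
        return None if blob is None else lz4.frame.decompress(blob)

put_file("Documents/index.html", b"<html>...</html>")
print(get_file("Documents/index.html"))
```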
SQLite with LZ4 or zstd for blob compression is what I have in mind. There are also some larger goals that I hope to achieve by moving to the new docset format, such as embedded metadata, ToC support, etc.
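For concreteness, a hedged sketch of that idea in Python, assuming the third-party zstandard package; the one-table schema is purely illustrative, not the planned Zeal format:

```python
# Sketch: one row per file in SQLite, blobs compressed with Zstandard.
# Assumes the third-party `zstandard` package; the schema is illustrative.
import sqlite3
import zstandard

db = sqlite3.connect("docset.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")

cctx = zstandard.ZstdCompressor(level=19)
dctx = zstandard.ZstdDecompressor()

def put_file(path, data):
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, cctx.compress(data)))

def get_file(path):
    row = db.execute("SELECT data FROM files WHERE path = ?", (path,)).fetchone()
    return None if row is None else dctx.decompress(row[0])

put_file("Documents/index.html", b"<html>...</html>")
db.commit()
print(get_file("Documents/index.html"))
```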
Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.
On 02 Dec 2018, at 10:34, Charles ***@***.***> wrote:
> Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.

I think that argument is largely moot when you combine files into a single compressed file (which doesn't mean there can't be a benefit to using a dictionary; lz4 allows this too).
When storing the files in a key/value database or SQLite, each file is compressed independently, which is why a precomputed dictionary improves the compression significantly, not to mention that it avoids effectively storing the same dictionary data over and over in every row.
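A sketch of the per-docset dictionary idea, again assuming the third-party zstandard package; the dictionary size, sample selection, and paths are arbitrary illustrative choices:

```python
# Sketch: train one zstd dictionary per docset on a sample of its pages, then
# compress every page against it. Assumes the third-party `zstandard` package.
import glob
import zstandard

sample_paths = glob.glob("Documents/**/*.html", recursive=True)[:2000]
samples = [open(p, "rb").read() for p in sample_paths]

dict_data = zstandard.train_dictionary(112_640, samples)    # ~110 KiB dictionary
cctx = zstandard.ZstdCompressor(level=19, dict_data=dict_data)
dctx = zstandard.ZstdDecompressor(dict_data=dict_data)

compressed = [cctx.compress(s) for s in samples]            # one blob per page/row
assert dctx.decompress(compressed[0]) == samples[0]

# The dictionary itself is stored once per docset (dict_data.as_bytes()),
# not repeated in every row.
```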
Using one dictionary per docset is an interesting idea, definitely worth benchmarking. Regarding LZ4 and zstd, I just mentioned these two as examples; nothing has been decided so far.
Just wanted to say that I feel this is the number 1 issue with Zeal and should be given much higher priority. Tens to hundreds of thousands of files mean that any time I perform any large disk I/O tasks on any of my systems, they get choked on the Zeal docsets. If I try to use a large-directory-finding program (like WinDirStat or KDirStat) I have to wait as Zeal's docsets take up roughly 1/2 to 1/3 of the total search time. Making backups or copies of my home directory takes ages, as the overhead for reading each of these files is incredible. I bet the search cache must be much larger and slower on each of my systems because of having to index all of Zeal's docsets. Even Doom (a very, very early example of a game we have the source code to) solved this problem back in the day. Almost all of the game's data is stored in a few "WAD" files (standing for "Where's All the Data?"). If users want to play user-made mods or back up their game data, they just need to copy and paste a WAD file. Sorry if this comes across as bitching or complaining; I'm just trying to express how much this issue matters to me (and presumably many other users). I'm going to try to dust off my programming skills and work on this too.
Here is a workaround for this issue:

#!/bin/bash
# prepare:
# fstab:
# name the file /usr/local/bin/zeal to have higher priority over /usr/bin/zeal
> Here is a workaround for this issue:

I've got an even bigger/longer one ;)
- migrate your entire root to ZFS
- create a dataset for the docsets, with compression=gzip-9, and decide where to mount it (I use /opt/docs/docsets)
- move all docsets there, and point Zeal to that path in its settings.

However, every solution that uses filesystem-based compression will still be suboptimal, because even the tiniest file in the docset will still occupy the minimum filesystem or disk block - and it is not cross-platform. The way around that would be for Zeal itself to support compressed docsets, or simply to use one of the existing libraries to access a compressed archive as a directory. Compressed archives can be packed much more compactly than a generic filesystem, and they're cross-platform.
Nope, you completely missed the point. :) Look at the year:
and there still wasn't a solution, just speculations like yours. :)
Any progress or formal thoughts on this, or where to start? Text compression would be 🔥. Currently using 1/3 of my expensive 256 GB Mac NVMe.
Maybe you can zip your data directory, mount it using FUSE, and then put an overlay filesystem over it to allow for modifications.
Is R/W access necessary?
Regardless, it strikes me that this is something Zeal could do as an alternative to implementing its own support for compressed docsets.
Currently I don't use Zeal because the docsets take up a lot of hard disk space and I don't have much space available. However, I've seen a format that's used to archive web pages, called WARC (Web ARChive); it has support for compression and indexing. Here I leave some links with information about the format:
Any updates? Will this feature be considered?
FWIW, there's a multi-platform kit, PFM, with which a docset can be transparently turned into a private compressed virtual file system, so hopefully there would be no need to modify the existing code much, at least conceptually.
dwarfs is a read-only file system that could work for this. Here are some compression results:
DwarFS looks interesting and should address the lack of indexing in docset archives. Converting docsets on the client may be quite slow though.
@trollixx Distributing the converted images directly from the server would avoid converting them on the client side.
Hi,
Do you have a plan to support compressed docsets, like a zipped archive? It would help very much in reducing the size and disk reading time.