Evaluate efficiency of zsync2 #53

Open
6 tasks
TheAssassin opened this issue Dec 4, 2017 · 3 comments

@TheAssassin
Member

I'd like to spend some time (2+ days) evaluating the current use of zsync2 with type 2 AppImages, i.e., comparing how much has changed file-wise (before calling mksquashfs, that is) with the difference ratio that zsync2 calculates. I have the impression it tends to download more data than has actually changed, but I'd rather perform some measurements before speculating in any way.

User stories:

  • I as a user expect AppImageUpdate to download only the blocks for a file that is added between two releases.

  • I as a user expect AppImageUpdate to download at most the blocks for an entire file if that file has changed, and nothing else.

  • I as a developer hosting releases of my application expect AppImageUpdate to minimize the traffic generated for updates.

At the moment, this is promised and might well be true, but it'd be nice to create a small, meaningful study on it that we can show to people who ask about this. It's also a great way to find potential for optimizations. And since we plan to keep zsync2 as our core functionality (even when using alternatives to the classic client-server architecture, such as peer-to-peer networks; see AppImage/AppImageKit#175), now seems to be the right time to investigate these issues.

Factors that might have to be optimized include unequal block sizes for zsyncmake2 and mksquashfs, compression after generation of the squashfs image (as far as I know, mksquashfs pads files to fill up the remaining bytes in a file's last block), etc. Basically, anything that could cause equal files to be stored such that the hash sums of the occupied blocks differ.

Rule of thumb for block sizes: mksquashfs block size >= zsyncmake2 block size, mksquashfs block size mod zsyncmake2 block size = 0, and both block sizes should be powers of 2.
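The rule of thumb above is easy to express as a small check (a sketch; the function name is made up, and the defaults in the usage example assume mksquashfs's 128 KiB default block size and a small zsyncmake-style block size):

```python
def block_sizes_compatible(squashfs_bs: int, zsync_bs: int) -> bool:
    """Check the rule of thumb: both block sizes are powers of two,
    the squashfs block size is at least as large as the zsync one,
    and is evenly divisible by it."""
    def is_pow2(n: int) -> bool:
        return n > 0 and (n & (n - 1)) == 0

    return (is_pow2(squashfs_bs)
            and is_pow2(zsync_bs)
            and squashfs_bs >= zsync_bs
            and squashfs_bs % zsync_bs == 0)

print(block_sizes_compatible(131072, 2048))    # True: 128 KiB vs 2 KiB
print(block_sizes_compatible(131072, 131072))  # True: equal sizes
print(block_sizes_compatible(4096, 131072))    # False: zsync block larger
```

Note that for two powers of two the divisibility condition follows from the size comparison, so the `mod` check only matters for non-power-of-two sizes.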

To my knowledge, such measurements haven't been performed yet, or at least haven't been set up in a scientific way that captures and visualizes the results repeatably.

I think we could, e.g., use any random Qt application (bundled with linuxdeployqt and the exact same Qt version) with changes in the main binary only, or some CLI application bundling just a few, rarely changing libraries.

The goal is to find potential optimizations in our use of squashfs which would decrease the aforementioned difference ratio to save on bandwidth. I think it should be possible to find an acceptable trade-off of higher file space versus more efficient updates.

TODO:

  • set up test infrastructure (i.e., a zsync2 flag that only calculates the difference ratio without actually downloading the file (an extension to -j), and a repository containing the AppDirs from which we generate AppImages with appimagetool, eliminating external dependencies)
  • describe test methods
  • perform first measurements representing the current state
  • experiment with settings, generating more data, to be compared with first measurements
  • make up list of parameters used initially and the ones gained from the experiments
  • generate final data for all parameters ever used, plot data
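Once the proposed -j extension reports byte counts, the difference ratio itself reduces to a trivial helper; a minimal sketch (function name and numbers are hypothetical, not an existing zsync2 API):

```python
def difference_ratio(bytes_to_download: int, target_size: int) -> float:
    """Fraction of the target AppImage that must be fetched over the
    network: 0.0 means a fully local (no-op) update, 1.0 a full
    re-download."""
    if target_size <= 0:
        raise ValueError("target size must be positive")
    return bytes_to_download / target_size

# Hypothetical numbers: a 60 MiB AppImage of which 9 MiB must be fetched.
ratio = difference_ratio(9 * 1024 * 1024, 60 * 1024 * 1024)
print(f"{ratio:.0%}")  # 15%
```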
@probonopd
Member

probonopd commented Dec 4, 2017

Chopping the file into chunks more intelligently could possibly help, e.g., splitting after each file inside the squashfs. The different compressors might influence this as well. Let's keep in mind that we might want to use the Zstandard compressor now that it is available.

This area of research is especially important when combined with p2p, because if we get chunks with checksums that are the same between different files, then this helps the p2p performance greatly... so let's see if we can get some people from #ipfs and #ipfs-dev interested in this, too...
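The deduplication effect of file-boundary chunking can be illustrated with a toy model (all names and contents here are made up; real chunkers would also split large files internally):

```python
import hashlib

def file_aligned_chunks(files: dict) -> dict:
    """Hash each file's contents separately, simulating an image that is
    chunked at file boundaries instead of at fixed offsets. Identical
    files then always yield identical chunk hashes, regardless of their
    position in the image -- which is what enables p2p deduplication."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

old = {"libfoo.so": b"unchanged library", "app": b"old binary"}
new = {"libfoo.so": b"unchanged library", "app": b"new binary"}

old_hashes = set(file_aligned_chunks(old).values())
# Only chunks whose hash is not already known locally need downloading.
to_fetch = [name for name, h in file_aligned_chunks(new).items()
            if h not in old_hashes]
print(to_fetch)  # ['app']
```

With fixed-offset chunking, inserting one file early in the image would shift every later file and change all subsequent chunk hashes; file-aligned chunks avoid exactly that.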

@TheAssassin
Member Author

@probonopd that's what I thought, too; I've commented in the other issue. For now, this is all about playing with the parameters of both mksquashfs and zsync2. But it should be done properly.

If it turns out that compression ruins block alignment (although I think the only sane way to implement it is file-wise, which is presumably what they did), we'd have to switch to compressing individual files, or chunks larger than the block size, rather than the whole image.

For now, I can tell that the block sizes of mksquashfs and zsync2 currently differ. The first approach will be to equalize them: using a zsync2 block size lower than the one used for the squashfs image doesn't add anything, but costs performance and adds bloat (i.e., additional hashes, by the factor (mksquashfs block size) / (zsync2 block size)). So that's the first thing to optimize. But please keep in mind that, in order to get meaningful results, I wouldn't change either option before performing the evaluations. I might be wrong as well.
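That bloat factor is easy to quantify; a sketch, assuming a fixed-size per-block checksum record (the 8 bytes used here are an illustrative assumption, since the real record size depends on zsyncmake2's weak/strong checksum settings):

```python
def zsync_hash_bloat(squashfs_bs: int, zsync_bs: int, image_size: int,
                     bytes_per_block: int = 8):
    """Return the approximate size of the checksum section of a .zsync
    file and the bloat factor relative to matching the squashfs block
    size. Smaller zsync blocks mean proportionally more hash records."""
    n_blocks = -(-image_size // zsync_bs)   # ceiling division
    checksum_bytes = n_blocks * bytes_per_block
    factor = squashfs_bs / zsync_bs
    return checksum_bytes, factor

# A 64 MiB image with 128 KiB squashfs blocks but 2 KiB zsync blocks:
size, factor = zsync_hash_bloat(131072, 2048, 64 * 1024 * 1024)
print(size, factor)  # 262144 64.0 -- 64x more hash records than needed
```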

@probonopd
Member

Maybe we can get @whyrusleeping's opinion on this one, too.

He suggested that we may need to make the ipfs chunker aware of the compressed file format.

Probably using the right type of compression for the AppImage could help a lot, so let's try to understand that first.

Reference:
AppImage/AppImageKit#175 (comment)
