Evaluate efficiency of zsync2 #53

Open
6 tasks
TheAssassin opened this issue Dec 4, 2017 · 3 comments

@TheAssassin
Member

I'd like to spend some time (2+ days) evaluating the current use of zsync2 with type 2 AppImages, i.e., comparing how much has changed file-wise (before calling mksquashfs, that is) with the difference ratio that zsync2 calculates. I have the impression it tends to download more data than has actually changed, but I'd rather perform some measurements before speculating in any way.

User stories:

  • I as a user expect AppImageUpdate to download only the blocks for a file that is added between two releases.

  • I as a user expect AppImageUpdate to download at most the blocks for an entire file if that file has changed, and nothing else.

  • I as a developer hosting releases of my application expect AppImageUpdate to minimize the traffic generated for updates.

At the moment, this is promised and might well be true, but it'd be nice to create a small, meaningful study on it that we can show to people who ask about this. It's also a great way to find potential for optimizations. And since we plan to keep zsync2 as our core functionality (even when using alternatives to the classic client-server architecture, such as peer-to-peer networks; see AppImage/AppImageKit#175), now seems to be the right time to investigate these issues.

Factors that might have to be optimized include unequal block sizes for zsyncmake2 and mksquashfs, compression after generation of the squashfs image (as far as I know, mksquashfs pads files to fill up the remaining bytes in a file's last block), etc. Basically, anything that could cause equal files to be stored such that the hash sums of the occupied blocks differ.

Rule of thumb for block sizes: mksquashfs block size >= zsyncmake2 block size, mksquashfs block size mod zsyncmake2 block size = 0, and both block sizes should be powers of 2.
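The rule of thumb above is easy to express as a small check (a sketch; the function name is made up, and the defaults in the usage example assume mksquashfs's 128 KiB default block size and a small zsyncmake-style block size):

```python
def block_sizes_compatible(squashfs_bs: int, zsync_bs: int) -> bool:
    """Check the rule of thumb: both block sizes are powers of two,
    the squashfs block size is at least as large as the zsync one,
    and is evenly divisible by it."""
    def is_pow2(n: int) -> bool:
        return n > 0 and (n & (n - 1)) == 0

    return (is_pow2(squashfs_bs)
            and is_pow2(zsync_bs)
            and squashfs_bs >= zsync_bs
            and squashfs_bs % zsync_bs == 0)

print(block_sizes_compatible(131072, 2048))    # True: 128 KiB vs 2 KiB
print(block_sizes_compatible(131072, 131072))  # True: equal sizes
print(block_sizes_compatible(4096, 131072))    # False: zsync block larger
```

Note that for two powers of two the divisibility condition follows from the size comparison, so the `mod` check only matters for non-power-of-two sizes.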

To my knowledge, such measurements haven't been performed yet, or at least haven't been set up in a scientific way that captures and visualizes the results repeatably.

I think we could, e.g., use any random Qt application (bundled with linuxdeployqt and the exact same Qt version) with changes in the main binary only, or some CLI application bundling just a few, rarely changing libraries.

The goal is to find potential optimizations in our use of squashfs which would decrease the aforementioned difference ratio to save on bandwidth. I think it should be possible to find an acceptable trade-off of higher file space versus more efficient updates.

TODO:

  • set up test infrastructure (i.e., a zsync2 flag that only calculates the difference ratio without actually downloading the file (an extension to -j), and a repository containing the AppDirs from which we generate AppImages with appimagetool, eliminating external dependencies)
  • describe test methods
  • perform first measurements representing the current state
  • experiment with settings, generating more data, to be compared with first measurements
  • make up list of parameters used initially and the ones gained from the experiments
  • generate final data for all parameters ever used, plot data
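Once the proposed -j extension reports byte counts, the difference ratio itself reduces to a trivial helper; a minimal sketch (function name and numbers are hypothetical, not an existing zsync2 API):

```python
def difference_ratio(bytes_to_download: int, target_size: int) -> float:
    """Fraction of the target AppImage that must be fetched over the
    network: 0.0 means a fully local (no-op) update, 1.0 a full
    re-download."""
    if target_size <= 0:
        raise ValueError("target size must be positive")
    return bytes_to_download / target_size

# Hypothetical numbers: a 60 MiB AppImage of which 9 MiB must be fetched.
ratio = difference_ratio(9 * 1024 * 1024, 60 * 1024 * 1024)
print(f"{ratio:.0%}")  # 15%
```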
@probonopd
Member

probonopd commented Dec 4, 2017

Chopping the file into chunks more intelligently could possibly help, e.g., splitting after each file inside the squashfs. The different compressors might influence this as well. Let's keep in mind that we might want to use the Zstandard compressor now that it is available.

This area of research is especially important when combined with p2p, because if we get chunks with checksums that are the same between different files, then this helps the p2p performance greatly... so let's see if we can get some people from #ipfs and #ipfs-dev interested in this, too...
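The deduplication effect of file-boundary chunking can be illustrated with a toy model (all names and contents here are made up; real chunkers would also split large files internally):

```python
import hashlib

def file_aligned_chunks(files: dict) -> dict:
    """Hash each file's contents separately, simulating an image that is
    chunked at file boundaries instead of at fixed offsets. Identical
    files then always yield identical chunk hashes, regardless of their
    position in the image -- which is what enables p2p deduplication."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

old = {"libfoo.so": b"unchanged library", "app": b"old binary"}
new = {"libfoo.so": b"unchanged library", "app": b"new binary"}

old_hashes = set(file_aligned_chunks(old).values())
# Only chunks whose hash is not already known locally need downloading.
to_fetch = [name for name, h in file_aligned_chunks(new).items()
            if h not in old_hashes]
print(to_fetch)  # ['app']
```

With fixed-offset chunking, inserting one file early in the image would shift every later file and change all subsequent chunk hashes; file-aligned chunks avoid exactly that.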

@TheAssassin
Member Author

@probonopd that's what I thought, too; I've commented in the other issue. For now, this is all about playing with the parameters of both mksquashfs and zsync2. But it should be done properly.

If it turns out that compression ruins block alignment (although I think the only sane way to implement it is file-wise, which is presumably what they did), we'd have to switch to compressing individual files, or chunks larger than the block size, rather than the whole image.

For now, I can tell that the block sizes of mksquashfs and zsync2 currently differ. The first approach will be to equalize them: using a zsync2 block size lower than the one used for the squashfs image doesn't add anything, but costs performance and adds bloat (i.e., additional hashes, by the factor (mksquashfs block size) / (zsync2 block size)). So that's the first thing to optimize. But please keep in mind that, in order to get meaningful results, I wouldn't change either option before performing the evaluations. I might be wrong as well.
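That bloat factor is easy to quantify; a sketch, assuming a fixed-size per-block checksum record (the 8 bytes used here are an illustrative assumption, since the real record size depends on zsyncmake2's weak/strong checksum settings):

```python
def zsync_hash_bloat(squashfs_bs: int, zsync_bs: int, image_size: int,
                     bytes_per_block: int = 8):
    """Return the approximate size of the checksum section of a .zsync
    file and the bloat factor relative to matching the squashfs block
    size. Smaller zsync blocks mean proportionally more hash records."""
    n_blocks = -(-image_size // zsync_bs)   # ceiling division
    checksum_bytes = n_blocks * bytes_per_block
    factor = squashfs_bs / zsync_bs
    return checksum_bytes, factor

# A 64 MiB image with 128 KiB squashfs blocks but 2 KiB zsync blocks:
size, factor = zsync_hash_bloat(131072, 2048, 64 * 1024 * 1024)
print(size, factor)  # 262144 64.0 -- 64x more hash records than needed
```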

@probonopd
Member

Maybe we can get @whyrusleeping's opinion on this one, too.

He suggested that we may need to make the ipfs chunker aware of the compressed file format.

Probably using the right type of compression for the AppImage could help a lot, so let's try to understand that first.

Reference:
AppImage/AppImageKit#175 (comment)
