-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issues with Python 2.6.6 #1
Comments
I fixed compilation bugs and compare your Python rdiff with original rdiff. Python rdiff: time python pyrdiff.py signature /vz/template/cache/debian-7.0-x86_64-minimal.tar.gz /tmp/signature
real 0m16.551s
user 0m16.431s
sys 0m0.118s Original rdiff: time rdiff signature /vz/template/cache/debian-7.0-x86_64-minimal.tar.gz /tmp/signature
real 0m0.422s
user 0m0.376s
sys 0m0.044s Maybe you can try to do same in Golang? It's really fast and cute :) |
But signatures is really same :) md5sum /tmp/signature..original
3d823f471f88f54c046c5f2408cecd7f /tmp/signature.original
md5sum /tmp/signature.python
3d823f471f88f54c046c5f2408cecd7f /tmp/signature.python Nice work! :) For Python 2.6. compatibility I'm used this recipe http://stackoverflow.com/questions/16022556/has-python-3-2-to-bytes-back-ported-to-python-2-7 def to_bytes(n, length, endianess='big'):
h = '%x' % n
s = ('0'*(len(h) % 2) + h).zfill(length*2).decode('hex')
return s if endianess == 'big' else s[::-1] And add ord conversion to faster_rollsum function: def faster_rollsum(data):
A = 0
B = 0
for d in data:
A += ord(d) + 31
B += A
return (A & 0xffff) | ((B & 0xffff) * 65536) I'm really sure you can do it better because I'm not an python-guy :) |
Hi, This was originally written as part of an effort to understand rdiff properly - to ensure I wasn't talking crap when I logged the other bugs against rdiff (including a critical security bug!). I write all my code for python3 these days - I could make a python2 branch if it were useful, but I'd replace the md4 signatures with sha256. I'll try to make a python2 branch for you, but no guarantees of speed (maybe in pypy it would be ok) - I was aiming for clarity. |
Try grabbing the python2 branch, and let me know how it goes. I tested against python2.7. The delta files will come out slightly different to real rdiff - this is due to an rdiff bug though - my ones will be smaller (but still compatible). I REALLY don't recommend using rdiff (or pyrdiff) on untrusted data - MD4 is horribly broken. It's easy enough to convert it to use sha256, so if you have a real application for this I'll happily do so. I was waiting to see how librsync handles this though (apparently ignoring me is the solution so far). |
Thank you so much, Michael! I checked version for Python 2 and all works fine but signature is not equal to original rdiff: rdiff signature /vz/private/1214/root.hdd/root.hdd /tmp/1214_signature
./pyrdiff.py signature /vz/private/1214/root.hdd/root.hdd /tmp/1214_pysignature LANG=C cmp /tmp/1214_pysignature /tmp/1214_signature
/tmp/1214_pysignature /tmp/1214_signature differ: char 17, line 1
md5sum /tmp/1214_pysignature
1aa907bca16f00f3dd7d4aa4677e180f /tmp/1214_pysignature
md5sum /tmp/1214_signature
7e30877026141179281a19cfa240bc8f /tmp/1214_signature
ls -al /tmp/1214_signature
-rw-r--r-- 1 root root 18382860 Июл 25 10:57 /tmp/1214_signature
ls -al /tmp/1214_pysignature
-rw-r--r-- 1 root root 18382860 Июл 25 10:56 /tmp/1214_pysignature |
I'm not an hash expert but I'm really worry about collisions because my data in 1MB blocks (it's OpenVZ ploop disk images) and any collision will destroy my data. I checked sha1 implementation from openssl package and standard sha1sum and found openssl version is really very fast: https://translate.google.com/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fwww.stableit.ru%2F2014%2F04%2Fsha1sum-openssl-sha1.html&edit-text= And second issue with rdiff is speed. I have servers with 40 cores modern CPU with and ssd drives but rdiff uses only one core for 100%. Maybe you have ideas about parallel version of pyrdiff? Do you have any ideas about future of pyrdiff? Is it only educations interests or you plans do really new project for data backup for drop-out replace rdiff? Absence of really reliable and fast solution for incremental backups is a real pain of almost every system administrator. |
Hi, I'm unable to reproduce the problem on my system, but I suspect it could be related to my abuse of ord() and chr(). Try pulling the latest version. For speed, python2 is really slow. I ran it with pypy and it was reasonable, but still slower than the original rdiff. I think the original rdiff should use blake2b, which would make the signature files bigger, but would avoid the collision problem and would use a modern core very well (SSE etc). If I can't convince the authors of rdiff and rsync to do something, I will roll my own - it won't take me long now that I've done this prototype. |
Amazing! All works fine now :) pyrdiff.py signature /vz/private/1214/root.hdd/root.hdd /tmp/pysignature
rdiff signature /vz/private/1214/root.hdd/root.hdd /tmp/signature md5sum /tmp/pysignature
7e30877026141179281a19cfa240bc8f /tmp/pysignature
md5sum /tmp/signature
7e30877026141179281a19cfa240bc8f /tmp/signature Thank you so much! It will be really nice if you build brand new modern rdiff and rsync. I read about blake2b and it's looks like good. What do you think about librsync source code? I test it and can't understand why it's so slow :( What language you plans use for new rdiff realization? |
I was originally hoping to target duplicity, which is written in python. I just committed mksigs.py - this is 3x slower than rdiff when run in pypy (I tried on a 4G iso). It should be trivial enough to modify normal rdiff to use blake2 signatures, but I don't intend on maintaining that codebase. This is unlikely to speed it up much. If I don't hear anything from the rdiff maintainer soon, I might consider writing something. |
mksigs looks really nice! I checked it for 150M file and got not so good results unfortunately :( time python mksigs.py /vz/template/cache/debian-6.0-x86_64.tar.gz /tmp/resultsignature
real 0m21.468s
user 0m21.426s
sys 0m0.053s
time rdiff signature /vz/template/cache/debian-6.0-x86_64.tar.gz /tmp/resultsignature
real 0m0.323s
user 0m0.281s
sys 0m0.042s I read your code and found very interesting md4 truncation which they used in original rdiff... Worst idea at ever. I can find only one description for this - space efficiency for deltas but it's really strange. I will try to translate mksigs into C or Go and execute performance tests. But C++ is best approach for patch and delta code because hashes/maps and dynamic arrays used widely. |
I tried to do very basic and dirty version in Go: https://gist.github.com/pavel-odintsov/a8f677d76213bddbccd6 It can't generate compatible signatures now but all hash generation work is same. And profiling results for 700MB file is not so bad for first try: time rdiff --block-size 1048576 signature /vz/template/cache/debian-7.0-x86_64-isppro.tar.gz /tmp/rdiff
real 0m3.464s
user 0m3.251s
sys 0m0.217s
time ./gordiff
real 0m3.920s
user 0m3.702s
sys 0m0.223s As you can see result is very closer :) But when I changed md4 to standard md5 from Golang standard library I got amazing speedup: time ./gordiff
real 0m2.040s
user 0m1.828s
sys 0m0.215s And with blake2 from here https://godoc.org/github.com/codahale/blake2 I got: time ./gordiff
real 0m2.194s
user 0m2.026s
sys 0m0.223s |
It seems like rdiff is going to get fixed. Perhaps it's best to wait to see how fast that goes. If you're backing up raw disk images, you could easily get better performance by simply getting rid of the rolling sum, and simply creating signatures for each block. Then the delta file would simply be (blocknum, data) for any block that's changed. |
I did my own fastrdiff signature in plain C with OpenSSL md4 realization and got following results: time rdiff signature --block-size 1048576 root.hdd root.hdd.signature
real 0m24.051s
user 0m23.025s
sys 0m1.022s
time ./fastrdiff signature
real 0m19.266s
user 0m18.261s
sys 0m0.987s :) |
Hello!
I read your rdiff implementation in pure Python and it's amazing! It's really awesome work! Did you execute performance tests? I do it with original rdiff and they are really worst - librsync/librsync#6 :(
Could you add compatibility with Python 2.6?
I fix this lines but scripts still not working:
Thank you!
The text was updated successfully, but these errors were encountered: