Skip to content

Rabin fingerprinting and deduplication library in C

License

Notifications You must be signed in to change notification settings

BeeBreeze/librabinpoly

 
 

Repository files navigation

librabinpoly

BeeBreeze made some modification to be compatibile with Python 3.

Hash is calculated in rabinpoly.c, with append8. Maybe I will modify the comments in source codes latter. For data structure or compatibility problem, you may would like to read ctypes_difference_py23.txt.

If you read orginal repo, in test/test_zeros.py and test/test_ones.py, chr(0) or chr(1) returns a string. However, in src/rabinpoly.c, the 2nd argument should be unsigned char. That is to say, in rabinpoly.py, the 2nd argument of rp_from_buffer should be c_ubyte. If you use Python 2/3, c_ubyte should be c_char(read ctypes difference). Nomatter what function/package you use, we should know that chr() in python2 returns a string of number in [0,255], and bytes([]) in python3 returns bytes, whose integer form is also in [0,255]. Although c_char means char type in C, by testing, it can pass [0,255] to unsigned char value correctly(you may printf src in rp_from_buffer in rabinpoly.c).

For converting bytes to int, I recommond

Python2: int(chr(N).encode('hex'), 16) and Python3: import sys int.from_bytes(bytes([N]), byteorder=sys.byteorder),

where N is a python int in [0,255].

Rabin fingerprinting library in C, for chunking files into content-delimited variable sized blocks.

Includes python bindings. The python library uses ctypes, so the C library is pure C. This makes it possible to add other language bindings; send me pull requests.

The API should not be considered stable until we reach version 1.X.

Install

From tarball

./configure
make 
make test
make install

From git

Change path of libs in rabinpoly.py in all directories.

./Compile.sh
./TestingTest.sh

OR

make -f autotools.mk
./configure  --prefix=/usr
make
2to3 python/rabinpoly.py -w 

If you don't like 2to3, you may add --output-language=py32 in Makefile.am and recompile from the beginning. Edit python/rabinpoly.py, replace c_ubyte in rp_from_buffer with c_char.

make test
sudo make install

After above steps, you may add /usr/local/lib in $PATH by export PATH=$PATH:/usr/local/lib. If you don't want to install it, you may use the library by export LD_LIBRARY_PATH=$(pwd)/src/.libs.

History

(Please send me pull requests for any corrections or additional information anyone wants to add here. I want to set the record straight, and also think that understanding the history helps with understanding the code.)

It all started with Michael O. Rabin's 1981 paper "Fingerprinting by Random Polynomials", TR-15-81, Harvard University. (As of this writing, a copy of the paper can be found at http://www.xmailserver.org/rabin.pdf.)

The earliest implementation I know of is the 1999 code by David Mazieres [email protected] for the Low-Bandwidth File System project:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.8654

The authoritative location for LBFS and SFS appears to be http://pdos.csail.mit.edu/lbfs/dist.html. The LBFS CVS server is down, but the CVS repository is archived at http://www.fs.net/.

David's rabinpoly implementation in SFS is written in C++. I don't know if this is his first implementation, or if David wrote an earlier version in C. Later implementations crediting David are found in both C and C++.

Several later rabinpoly derivatives include some _LPCOX_DEBUG_ debug statements. LPCOX appears to be Landon Cox (http://www.cs.duke.edu/~lpcox/), one of the co-authors on the 2002 paper "Pastiche: Making Backup Cheap and Easy": https://www.usenix.org/legacy/event/osdi02/tech/full_papers/cox/cox.pdf

The last archive.org capture of the Pastiche web page was in 2004: https://web.archive.org/web/20040211222605/http://mobility.eecs.umich.edu/pastiche/

So far I haven't been able to find the Pastiche code itself -- it seems to have dropped off the net.

Sometime around 2002-2004, Niraj Tolia, apparently starting from Landon's Pastiche code, converted rabinpoly to a stand-alone library. I haven't been able to find Niraj's version. Niraj also cited the Pastiche paper in two papers of his own: http://dl.acm.org/citation.cfm?id=1060316&preflayout=flat.

In 2003-2005, Hyang-Ah Kim [email protected], starting from Niraj's version, repackaged and re-released rabinpoly, adding support for a configurable sliding window size. Her code can be found at http://www.cs.cmu.edu/~hakim/software/, and is in C++. It still has Landon's debug statements in it. Hyang-Ah's README is where I got the clue about Niraj.

In 2006, Geert Bosch [email protected] attached a C version of rabinpoly to an email message he sent to the git developers' mailing list, crediting David Mazieres' Rabinpoly code and D. Phillips's fls code. This version appears to share a common ancestor with Hyang-Ah's code, but doesn't include Landon's debug statements. His rabinpoly.h says "Translated to C and simplified by Geert Bosch ([email protected])". Geert's email message includes a description of the internal working of the algorithm. http://www.gelato.unsw.edu.au/archives/git/0604/18872.html

Junio C Hamano did some cleanup and checked Geert's code into git-core at https://code.google.com/p/git-core/source/detail?r=fd2bbdd2386ea0c558aba95711ef53c4552a6146&path=/rabinpoly.c -- it's not clear to me at the moment whether it was ever used though, by git or any other package.

In 2010, Bobby Martin wrote rabin-tool, starting from Hyang-Ah's version: https://github.com/wurp/rabin-tool

In 2010, Dan Stromberg (http://stromberg.dnsalias.org/~dstromberg/) refactored Hyang-Ah's C++ code into a python library, adding test cases. http://stromberg.dnsalias.org/svn/pyrabinf/trunk/

In 2013, Pavan Kumar Alampalli [email protected], (http://www.pdl.cmu.edu/People/pavan.shtml) credits Hyang-Ah in a pure-C version he worked on. This version appears to be either a merge of Hyang-Ah's C++ code into Geert's C code, or a completely new C translation (still trying to figure that out). Unlike other derivatives, this code does not include Landon's debug statements. It can be found at http://www.ece.cmu.edu/~ganger/746.spring13/projects/proj2_fuse/746-handout/src/dedup-lib/.

Pavan's version is used for Greg Ganger's 15-746/18-746 Storage Systems course at CMU (http://www.ece.cmu.edu/~ganger). I'm basing my first version of librabinpoly on this code.

In October 2020, BeeBreeze, a Doctor in Applied Math, modified the branch by stevegt to be compatible with Python 3. Since he is not an engineer and not good at coding, he just made some minor modifications.

Other rabinpoly libraries

An unrelated C implementation by Joel Tucci can be found at https://github.com/joeltucci/rabin-fingerprint-c. This version doesn't use David's ubiquitous bit-shifting algorithms, and I don't know how fast or compliant it is compared to David's code.

AZ Huang converted Joel's version to a Python module: https://github.com/aitjcize/pyrabin. I've forked and contributed to AZ's version at https://github.com/stevegt/pyrabin.

There is a description of Rabin's algorithm in Bill Dwyer's various Java implementations: https://github.com/themadcreator/rabinfingerprint/blob/master/src/main/java/org/rabinfingerprint/fingerprint/

About

Rabin fingerprinting and deduplication library in C

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 75.1%
  • C 20.7%
  • Makefile 2.2%
  • M4 1.5%
  • Shell 0.5%