Commit 8c83e95

updated README.rst from README.md
1 parent 828a7c1 commit 8c83e95

1 file changed

+147
-134
lines changed

README.rst

+147-134
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,7 @@
1-
Bounter -- Counter for large datasets
2-
=====================================
1+
Bounter – Counter for large datasets
2+
====================================
33

4-
|Build Status|\ |GitHub release|\ |Mailing List|\ |Gitter|\ |Follow|
4+
|License| |Build Status| |GitHub release| |Downloads|
55

66
Bounter is a Python library, written in C, for extremely fast
77
probabilistic counting of item frequencies in massive datasets, using
@@ -11,17 +11,17 @@ Why Bounter?
1111
------------
1212

1313
Bounter lets you count how many times an item appears, similar to
14-
Python's built-in ``dict`` or ``Counter``:
14+
Python’s built-in ``dict`` or ``Counter``:
1515

1616
.. code:: python
1717
18-
from bounter import bounter
18+
from bounter import bounter
1919
20-
counts = bounter(size_mb=1024) # use at most 1 GB of RAM
21-
counts.update([u'a', 'few', u'words', u'a', u'few', u'times']) # count item frequencies
20+
counts = bounter(size_mb=1024) # use at most 1 GB of RAM
21+
counts.update([u'a', 'few', u'words', u'a', u'few', u'times']) # count item frequencies
2222
23-
print(counts[u'few']) # query the counts
24-
2
23+
print(counts[u'few']) # query the counts
24+
2
2525
2626
However, unlike ``dict`` or ``Counter``, Bounter can process huge
2727
collections where the items would not even fit in RAM. This commonly
@@ -38,25 +38,27 @@ below, Bounter uses 31x less memory compared to ``Counter``.
3838
Bounter is also marginally faster than the built-in ``dict`` and
3939
``Counter``, so wherever you can represent your **items as strings**
4040
(both byte-strings and unicode are fine, and Bounter works in both
41-
Python2 and Python3), there's no reason not to use Bounter instead
41+
Python2 and Python3), there’s no reason not to use Bounter instead
4242
except:
4343

4444
When not to use Bounter?
4545
------------------------
4646

4747
Beware, Bounter is only a probabilistic frequency counter and cannot be
48-
relied on for exact counting. (You can't expect a data structure with
48+
relied on for exact counting. (You can’t expect a data structure with
4949
finite size to hold infinite data.) Example of Bounter failing:
5050

5151
.. code:: python
5252
53-
from bounter import bounter
54-
bounts = bounter(size_mb=1)
55-
bounts.update(str(i) for i in range(10000000))
56-
bounts['100']
57-
0
53+
from bounter import bounter
54+
bounts = bounter(size_mb=1)
55+
bounts.update(str(i) for i in range(10000000))
56+
bounts['100']
57+
0
5858
59-
Please use ``Counter`` or ``dict`` when such exact counts matter. When they don't matter, like in most NLP and ML applications with huge datasets, Bounter is a very good alternative.
59+
Please use ``Counter`` or ``dict`` when such exact counts matter. When
60+
they don’t matter, like in most NLP and ML applications with huge
61+
datasets, Bounter is a very good alternative.
6062

6163
Installation
6264
------------
@@ -66,15 +68,15 @@ C compiler:
6668

6769
.. code:: bash
6870
69-
pip install bounter # install from PyPI
71+
pip install bounter # install from PyPI
7072
7173
Or, if you prefer to install from the `source
7274
tar.gz <https://pypi.python.org/pypi/bounter>`__:
7375

7476
.. code:: bash
7577
76-
python setup.py test # run unit tests
77-
python setup.py install
78+
python setup.py test # run unit tests
79+
python setup.py install
7880
7981
How does it work?
8082
-----------------
@@ -83,117 +85,119 @@ No magic, just some clever use of approximative algorithms and solid
8385
engineering.
8486

8587
In particular, Bounter implements three different algorithms under the
86-
hood, depending on what type of "counting" you need:
88+
hood, depending on what type of “counting” you need:
8789

88-
1. **`Cardinality
89-
estimation <https://en.wikipedia.org/wiki/Count-distinct_problem>`__:
90-
"How many unique items are there?"**
90+
1. `Cardinality
91+
estimation <https://en.wikipedia.org/wiki/Count-distinct_problem>`__\ **:
92+
“How many unique items are there?”**
9193

92-
.. code:: python
94+
.. code:: python
9395
94-
from bounter import bounter
96+
from bounter import bounter
9597
96-
counts = bounter(need_counts=False)
97-
counts.update(['a', 'b', 'c', 'a', 'b'])
98+
counts = bounter(need_counts=False)
99+
counts.update(['a', 'b', 'c', 'a', 'b'])
98100
99-
print(counts.cardinality()) # cardinality estimation
100-
3
101-
print(counts.total()) # efficiently accumulates counts across all items
102-
5
101+
print(counts.cardinality()) # cardinality estimation
102+
3
103+
print(counts.total()) # efficiently accumulates counts across all items
104+
5
103105
104-
This is the simplest use case and needs the least amount of memory, by
105-
using the `HyperLogLog
106-
algorithm <http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf>`__
107-
(built on top of Joshua Andersen's
108-
`HLL <https://github.com/ascv/HyperLogLog>`__ code).
106+
This is the simplest use case and needs the least amount of memory, by
107+
using the `HyperLogLog
108+
algorithm <http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf>`__
109+
(built on top of Joshua Andersen’s
110+
`HLL <https://github.com/ascv/HyperLogLog>`__ code).
109111

110-
2. **Item frequencies: "How many times did this item appear?"**
112+
2. **Item frequencies: “How many times did this item appear?”**
111113

112-
.. code:: python
114+
.. code:: python
113115
114-
from bounter import bounter
116+
from bounter import bounter
115117
116-
counts = bounter(need_iteration=False, size_mb=200)
117-
counts.update(['a', 'b', 'c', 'a', 'b'])
118-
print(counts.total(), counts.cardinality()) # total and cardinality still work
119-
(5L, 3L)
118+
counts = bounter(need_iteration=False, size_mb=200)
119+
counts.update(['a', 'b', 'c', 'a', 'b'])
120+
print(counts.total(), counts.cardinality()) # total and cardinality still work
121+
(5L, 3L)
120122
121-
print(counts['a']) # supports asking for counts of individual items
122-
2
123+
print(counts['a']) # supports asking for counts of individual items
124+
2
123125
124-
This uses the `Count-min Sketch
125-
algorithm <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__ to
126-
estimate item counts efficiently, in a **fixed amount of memory**. See
127-
the `API
128-
docs <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
129-
for full details and parameters.
126+
This uses the `Count-min Sketch
127+
algorithm <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__ to
128+
estimate item counts efficiently, in a **fixed amount of memory**. See
129+
the `API
130+
docs <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
131+
for full details and parameters.
130132

131-
As a further optimization, Count-min Sketch optionally support a
132-
`logarithmic probabilistic
133-
counter <https://en.wikipedia.org/wiki/Approximate_counting_algorithm>`__:
133+
As a further optimization, Count-min Sketch optionally support a
134+
`logarithmic probabilistic
135+
counter <https://en.wikipedia.org/wiki/Approximate_counting_algorithm>`__:
134136

135-
- ``bounter(need_iteration=False)``: default option. Exact counter, no
136-
probabilistic counting. Occupies 4 bytes (max value 2^32) per bucket.
137-
- ``bounter(need_iteration=False, log_counting=1024)``: an integer
138-
counter that occupies 2 bytes. Values up to 2048 are exact; larger
139-
values are off by +/- 2%. The maximum representable value is around
140-
2^71.
141-
- ``bounter(need_iteration=False, log_counting=8)``: a more aggressive
142-
probabilistic counter that fits into just 1 byte. Values up to 8 are
143-
exact and larger values can be off by +/- 30%. The maximum
144-
representable value is about 2^33.
137+
- ``bounter(need_iteration=False)``: default option. Exact counter, no
138+
probabilistic counting. Occupies 4 bytes (max value 2^32) per bucket.
139+
- ``bounter(need_iteration=False, log_counting=1024)``: an integer
140+
counter that occupies 2 bytes. Values up to 2048 are exact; larger
141+
values are off by +/- 2%. The maximum representable value is around
142+
2^71.
143+
- ``bounter(need_iteration=False, log_counting=8)``: a more aggressive
144+
probabilistic counter that fits into just 1 byte. Values up to 8 are
145+
exact and larger values can be off by +/- 30%. The maximum
146+
representable value is about 2^33.
145147

146-
Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where
147-
being able to handle very large collections is more important than
148-
whether an event occurs exactly 55,482x or 55,519x.
148+
Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where
149+
being able to handle very large collections is more important than
150+
whether an event occurs exactly 55,482x or 55,519x.
149151

150-
3. **Full item iteration: "What are the items and their frequencies?"**
152+
3. **Full item iteration: “What are the items and their frequencies?”**
151153

152-
.. code:: python
154+
.. code:: python
153155
154-
from bounter import bounter
156+
from bounter import bounter
155157
156-
counts = bounter(size_mb=200) # default version, unless you specify need_items or need_counts
157-
counts.update(['a', 'b', 'c', 'a', 'b'])
158-
print(counts.total(), counts.cardinality()) # total and cardinality still work
159-
(5L, 3)
160-
print(counts['a']) # individual item frequency still works
161-
2
158+
counts = bounter(size_mb=200) # default version, unless you specify need_items or need_counts
159+
counts.update(['a', 'b', 'c', 'a', 'b'])
160+
print(counts.total(), counts.cardinality()) # total and cardinality still work
161+
(5L, 3)
162+
print(counts['a']) # individual item frequency still works
163+
2
162164
163-
print(list(counts)) # iterator returns keys, just like Counter
164-
[u'b', u'a', u'c']
165-
print(list(counts.iteritems())) # supports iterating over key-count pairs, etc.
166-
[(u'b', 2L), (u'a', 2L), (u'c', 1L)]
165+
print(list(counts)) # iterator returns keys, just like Counter
166+
[u'b', u'a', u'c']
167+
print(list(counts.iteritems())) # supports iterating over key-count pairs, etc.
168+
[(u'b', 2L), (u'a', 2L), (u'c', 1L)]
167169
168-
Stores the keys (strings) themselves in addition to the total
169-
cardinality and individual item frequency (8 bytes). Uses the most
170-
memory, but supports the widest range of functionality.
170+
Stores the keys (strings) themselves in addition to the total
171+
cardinality and individual item frequency (8 bytes). Uses the most
172+
memory, but supports the widest range of functionality.
171173

172-
This option uses a custom C hash table underneath, with optimized string
173-
storage. It will remove its low-count objects when nearing the maximum
174-
alotted memory, instead of expanding the table.
174+
This option uses a custom C hash table underneath, with optimized string
175+
storage. It will remove its low-count objects when nearing the maximum
176+
alotted memory, instead of expanding the table.
175177

176178
--------------
177179

178180
For more details, see the `API
179-
docstrings <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__.
181+
docstrings <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
182+
or read the
183+
`blog <https://rare-technologies.com/counting-efficiently-with-bounter-pt-1-hashtable/>`__.
180184

181185
Example on the English Wikipedia
182186
--------------------------------
183187

184-
Let's count the frequencies of all bigrams in the English Wikipedia
188+
Let’s count the frequencies of all bigrams in the English Wikipedia
185189
corpus:
186190

187191
.. code:: python
188192
189-
with smart_open('wikipedia_tokens.txt.gz') as wiki:
190-
for line in wiki:
191-
words = line.decode().split()
192-
bigrams = zip(words, words[1:])
193-
counter.update(u' '.join(pair) for pair in bigrams)
193+
with smart_open('wikipedia_tokens.txt.gz') as wiki:
194+
for line in wiki:
195+
words = line.decode().split()
196+
bigrams = zip(words, words[1:])
197+
counter.update(u' '.join(pair) for pair in bigrams)
194198
195-
print(counter[u'czech republic'])
196-
42099
199+
print(counter[u'czech republic'])
200+
42099
197201
198202
The Wikipedia dataset contained 7,661,318 distinct words across
199203
1,860,927,726 total words, and 179,413,989 distinct bigrams across
@@ -202,42 +206,54 @@ would consume over 31 GB RAM.
202206

203207
To test the accuracy of Bounter, we automatically extracted
204208
`collocations <https://en.wikipedia.org/wiki/Collocation>`__ (common
205-
multi-word expressions, such as "New York", "network license", "Supreme
206-
Court" or "elementary school") from these bigram counts.
209+
multi-word expressions, such as “New York”, “network license”, “Supreme
210+
Court” or “elementary school”) from these bigram counts.
207211

208212
We compared the set of collocations extracted from Counter (exact
209213
counts, needs lots of memory) vs Bounter (approximate counts, bounded
210214
memory) and present the precision and recall here:
211215

212-
+----------------------------------------------+----------+---------+-----------+----------+----------+
213-
| Algorithm | Time to | Memory | Precision | Recall | F1 score |
214-
| | build | | | | |
215-
+==============================================+==========+=========+===========+==========+==========+
216-
| ``Counter`` (built-in) | 32m 26s | 31 GB | 100% | 100% | 100% |
217-
+----------------------------------------------+----------+---------+-----------+----------+----------+
218-
| ``bounter(size_mb=128, need_iteration=False, | 19m 53s | **128 | 95.02% | 97.10% | 96.04% |
219-
| log_counting=8)`` | | MB** | | | |
220-
+----------------------------------------------+----------+---------+-----------+----------+----------+
221-
| ``bounter(size_mb=1024)`` | 17m 54s | 1 GB | 100% | 99.27% | 99.64% |
222-
+----------------------------------------------+----------+---------+-----------+----------+----------+
223-
| ``bounter(size_mb=1024, | 19m 58s | 1 GB | 99.64% | 100% | 99.82% |
224-
| need_iteration=False)`` | | | | | |
225-
+----------------------------------------------+----------+---------+-----------+----------+----------+
226-
| ``bounter(size_mb=1024, | 20m 05s | 1 GB | **100%** | **100%** | **100%** |
227-
| need_iteration=False, log_counting=1024)`` | | | | | |
228-
+----------------------------------------------+----------+---------+-----------+----------+----------+
229-
| ``bounter(size_mb=1024, | 19m 59s | 1 GB | 97.45% | 97.45% | 97.45% |
230-
| need_iteration=False, log_counting=8)`` | | | | | |
231-
+----------------------------------------------+----------+---------+-----------+----------+----------+
232-
| ``bounter(size_mb=4096)`` | **16m | 4 GB | 100% | 100% | 100% |
233-
| | 21s** | | | | |
234-
+----------------------------------------------+----------+---------+-----------+----------+----------+
235-
| ``bounter(size_mb=4096, | 20m 14s | 4 GB | 100% | 100% | 100% |
236-
| need_iteration=False)`` | | | | | |
237-
+----------------------------------------------+----------+---------+-----------+----------+----------+
238-
| ``bounter(size_mb=4096, | 20m 14s | 4 GB | 100% | 99.64% | 99.82% |
239-
| need_iteration=False, log_counting=1024)`` | | | | | |
240-
+----------------------------------------------+----------+---------+-----------+----------+----------+
216+
+-------------------------------------+-------+-----+-----+----+----+
217+
| Algorithm | Time | Mem | Pre | Re | F1 |
218+
| | to | ory | cis | ca | s |
219+
| | build | | ion | ll | co |
220+
| | | | | | re |
221+
+=====================================+=======+=====+=====+====+====+
222+
| ``Counter`` (built-in) | 32m | 31 | 1 | 10 | 10 |
223+
| | 26s | GB | 00% | 0% | 0% |
224+
+-------------------------------------+-------+-----+-----+----+----+
225+
| ``bounter(size_mb=128, need | 19m | ** | 95. | 97 | 96 |
226+
| _iteration=False, log_counting=8)`` | 53s | 128 | 02% | .1 | .0 |
227+
| | | M | | 0% | 4% |
228+
| | | B** | | | |
229+
+-------------------------------------+-------+-----+-----+----+----+
230+
| ``bounter(size_mb=1024)`` | 17m | 1 | 1 | 99 | 99 |
231+
| | 54s | GB | 00% | .2 | .6 |
232+
| | | | | 7% | 4% |
233+
+-------------------------------------+-------+-----+-----+----+----+
234+
| ``bounter(si | 19m | 1 | 99. | 10 | 99 |
235+
| ze_mb=1024, need_iteration=False)`` | 58s | GB | 64% | 0% | .8 |
236+
| | | | | | 2% |
237+
+-------------------------------------+-------+-----+-----+----+----+
238+
| ``bounter(size_mb=1024, need_it | 20m | 1 | ** | ** | ** |
239+
| eration=False, log_counting=1024)`` | 05s | GB | 100 | 10 | 10 |
240+
| | | | %** | 0% | 0% |
241+
| | | | | ** | ** |
242+
+-------------------------------------+-------+-----+-----+----+----+
243+
| ``bounter(size_mb=1024, need | 19m | 1 | 97. | 97 | 97 |
244+
| _iteration=False, log_counting=8)`` | 59s | GB | 45% | .4 | .4 |
245+
| | | | | 5% | 5% |
246+
+-------------------------------------+-------+-----+-----+----+----+
247+
| ``bounter(size_mb=4096)`` | **16m | 4 | 1 | 10 | 10 |
248+
| | 21s** | GB | 00% | 0% | 0% |
249+
+-------------------------------------+-------+-----+-----+----+----+
250+
| ``bounter(si | 20m | 4 | 1 | 10 | 10 |
251+
| ze_mb=4096, need_iteration=False)`` | 14s | GB | 00% | 0% | 0% |
252+
+-------------------------------------+-------+-----+-----+----+----+
253+
| ``bounter(size_mb=4096, need_it | 20m | 4 | 1 | 99 | 99 |
254+
| eration=False, log_counting=1024)`` | 14s | GB | 00% | .6 | .8 |
255+
| | | | | 4% | 2% |
256+
+-------------------------------------+-------+-----+-----+----+----+
241257

242258
Bounter achieves a perfect F1 score of 100% at 31x less memory (1GB vs
243259
31GB), compared to a built-in ``Counter`` or ``dict``. It is also 61%
@@ -262,14 +278,11 @@ license <https://github.com/rare-technologies/bounter/blob/master/LICENSE>`__.
262278
Copyright (c) 2017 `RaRe
263279
Technologies <https://rare-technologies.com/>`__
264280

281+
.. |License| image:: https://img.shields.io/pypi/l/bounter.svg
282+
:target: https://github.com/RaRe-Technologies/bounter/blob/master/LICENSE
265283
.. |Build Status| image:: https://travis-ci.org/RaRe-Technologies/bounter.svg?branch=master
266284
:target: https://travis-ci.org/RaRe-Technologies/bounter
267285
.. |GitHub release| image:: https://img.shields.io/github/release/rare-technologies/bounter.svg?maxAge=3600
268286
:target: https://github.com/RaRe-Technologies/bounter/releases
269-
.. |Mailing List| image:: https://img.shields.io/badge/-Mailing%20List-lightgrey.svg
270-
:target: https://groups.google.com/forum/#!forum/gensim
271-
.. |Gitter| image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg
272-
:target: https://gitter.im/RaRe-Technologies/gensim
273-
.. |Follow| image:: https://img.shields.io/twitter/follow/gensim_py.svg?style=social&label=Follow
274-
:target: https://twitter.com/gensim_py
275-
287+
.. |Downloads| image:: https://pepy.tech/badge/bounter/week
288+
:target: https://pepy.tech/project/bounter/week
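
For readers of this diff who want a feel for what the README means by estimating item counts "in a fixed amount of memory", here is a minimal pure-Python sketch of a Count-min Sketch. It is not Bounter's C implementation and is not part of this commit. The class name ``CountMinSketch``, the ``width`` and ``depth`` parameters, and the use of ``hashlib.blake2b`` with a per-row salt are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Toy Count-min Sketch (illustrative only, not Bounter's code)."""

    def __init__(self, width=1024, depth=4):
        # The table size is fixed up front and never grows,
        # no matter how many distinct items are counted.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        """Yield one (row, column) pair per row for the given string."""
        data = item.encode("utf-8")
        for row in range(self.depth):
            # A different salt per row acts as a different hash function.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, items):
        """Increment one counter per row for every item seen."""
        for item in items:
            for row, col in self._buckets(item):
                self.table[row][col] += 1

    def __getitem__(self, item):
        # Collisions can only inflate a counter, so the smallest of the
        # per-row counters is the tightest available estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

After ``sketch = CountMinSketch()`` and ``sketch.update(['a', 'b', 'c', 'a', 'b'])``, the estimate ``sketch['a']`` is never lower than the true count of 2, but hash collisions can push it higher. A smaller ``width`` saves memory at the cost of more collisions, which is the same memory versus accuracy trade-off the README describes.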
