Bounter – Counter for large datasets
====================================

|License| |Build Status| |GitHub release| |Downloads|

Bounter is a Python library, written in C, for extremely fast
probabilistic counting of item frequencies in massive datasets, using
only a small fixed memory footprint.

Why Bounter?
------------

Bounter lets you count how many times an item appears, similar to
Python's built-in ``dict`` or ``Counter``:

.. code:: python

    from bounter import bounter

    counts = bounter(size_mb=1024)  # use at most 1 GB of RAM
    counts.update([u'a', 'few', u'words', u'a', u'few', u'times'])  # count item frequencies

    print(counts[u'few'])  # query the counts
    2

However, unlike ``dict`` or ``Counter``, Bounter can process huge
collections where the items would not even fit in RAM. This commonly
happens in Machine Learning and NLP, with tasks like **dictionary
building** or **word collocations**.

Bounter implements approximative algorithms using optimized low-level
C structures, to avoid the overhead of Python objects. It lets you
specify the maximum amount of memory you want to use. In the Wikipedia
example below, Bounter uses 31x less memory compared to ``Counter``.

Bounter is also marginally faster than the built-in ``dict`` and
``Counter``, so wherever you can represent your **items as strings**
(both byte-strings and unicode are fine, and Bounter works in both
Python 2 and Python 3), there's no reason not to use Bounter instead,
except:

When not to use Bounter?
------------------------

Beware, Bounter is only a probabilistic frequency counter and cannot be
relied on for exact counting. (You can't expect a data structure with
finite size to hold infinite data.) Example of Bounter failing:

.. code:: python

    from bounter import bounter
    bounts = bounter(size_mb=1)
    bounts.update(str(i) for i in range(10000000))
    bounts['100']
    0

Please use ``Counter`` or ``dict`` when such exact counts matter. When
they don't matter, like in most NLP and ML applications with huge
datasets, Bounter is a very good alternative.
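
If you are unsure whether the approximation is acceptable for your data, a
simple option is to spot-check Bounter against an exact ``Counter`` on a
sample that still fits in RAM. This is only a minimal sketch, not part of
Bounter's API; the sample data and the 64 MB budget are arbitrary choices:

.. code:: python

    from collections import Counter

    from bounter import bounter

    sample = [str(i % 1000) for i in range(100000)]  # small enough for exact counting

    exact = Counter(sample)
    approx = bounter(size_mb=64)
    approx.update(sample)

    # largest relative error over the sampled keys
    worst = max(abs(approx[key] - true) / float(true) for key, true in exact.items())
    print('worst relative error on the sample: {:.2%}'.format(worst))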

Installation
------------

Bounter has no dependencies beyond Python >= 2.7 or Python >= 3.3 and a
C compiler:

.. code:: bash

    pip install bounter  # install from PyPI

Or, if you prefer to install from the `source
tar.gz <https://pypi.python.org/pypi/bounter>`__:

.. code:: bash

    python setup.py test  # run unit tests
    python setup.py install

How does it work?
-----------------

No magic, just some clever use of approximative algorithms and solid
engineering.

In particular, Bounter implements three different algorithms under the
hood, depending on what type of "counting" you need:

1. `Cardinality
   estimation <https://en.wikipedia.org/wiki/Count-distinct_problem>`__\ **:
   "How many unique items are there?"**

   .. code:: python

       from bounter import bounter

       counts = bounter(need_counts=False)
       counts.update(['a', 'b', 'c', 'a', 'b'])

       print(counts.cardinality())  # cardinality estimation
       3
       print(counts.total())  # efficiently accumulates counts across all items
       5

   This is the simplest use case and needs the least amount of memory, by
   using the `HyperLogLog
   algorithm <http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf>`__
   (built on top of Joshua Andersen's
   `HLL <https://github.com/ascv/HyperLogLog>`__ code).
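
   To get an intuition for why a small, fixed amount of memory is enough
   here, the sketch below shows the core trick in a drastically simplified
   form: hash every item and remember only the longest run of leading zero
   bits. This is a toy illustration, not Bounter's actual HyperLogLog
   implementation, and a single "register" like this is very noisy:

   .. code:: python

       import hashlib

       def toy_cardinality(items):
           """Very rough distinct-count estimate from a single register."""
           max_zeros = 0
           for item in items:
               h = int(hashlib.md5(item.encode('utf8')).hexdigest(), 16) & 0xFFFFFFFF
               zeros = 32 - h.bit_length()  # leading zero bits of the 32-bit hash
               max_zeros = max(max_zeros, zeros)
           return 2 ** max_zeros  # noisy on its own; HyperLogLog averages many such registers

       print(toy_cardinality(str(i) for i in range(100000)))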

2. **Item frequencies: "How many times did this item appear?"**

   .. code:: python

       from bounter import bounter

       counts = bounter(need_iteration=False, size_mb=200)
       counts.update(['a', 'b', 'c', 'a', 'b'])
       print(counts.total(), counts.cardinality())  # total and cardinality still work
       (5L, 3L)

       print(counts['a'])  # supports asking for counts of individual items
       2

   This uses the `Count-min Sketch
   algorithm <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__ to
   estimate item counts efficiently, in a **fixed amount of memory**. See
   the `API
   docs <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
   for full details and parameters.
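
   The idea behind Count-min Sketch is a small fixed-size 2-D array of
   counters plus a few hash functions: every item increments one counter per
   row, and a query returns the minimum over its counters, which can only
   over-estimate the true count. A minimal pure-Python sketch of that idea
   follows; it is not Bounter's C implementation, and the width, depth and
   hashing scheme are arbitrary choices:

   .. code:: python

       import zlib

       class ToyCountMinSketch(object):
           def __init__(self, width=1000, depth=4):
               self.width, self.depth = width, depth
               self.table = [[0] * width for _ in range(depth)]

           def _buckets(self, item):
               # one bucket per row, derived from a row-salted hash
               for row in range(self.depth):
                   yield row, zlib.crc32(('%d:%s' % (row, item)).encode('utf8')) % self.width

           def increment(self, item):
               for row, col in self._buckets(item):
                   self.table[row][col] += 1

           def count(self, item):
               # the true count is never larger than any of its buckets
               return min(self.table[row][col] for row, col in self._buckets(item))

       cms = ToyCountMinSketch()
       for token in ['a', 'b', 'c', 'a', 'b']:
           cms.increment(token)
       print(cms.count('a'))  # 2 (may over-estimate when buckets collide)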

   As a further optimization, Count-min Sketch optionally supports a
   `logarithmic probabilistic
   counter <https://en.wikipedia.org/wiki/Approximate_counting_algorithm>`__:

   -  ``bounter(need_iteration=False)``: default option. Exact counter, no
      probabilistic counting. Occupies 4 bytes (max value 2^32) per bucket.
   -  ``bounter(need_iteration=False, log_counting=1024)``: an integer
      counter that occupies 2 bytes. Values up to 2048 are exact; larger
      values are off by +/- 2%. The maximum representable value is around
      2^71.
   -  ``bounter(need_iteration=False, log_counting=8)``: a more aggressive
      probabilistic counter that fits into just 1 byte. Values up to 8 are
      exact and larger values can be off by +/- 30%. The maximum
      representable value is about 2^33.

   Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where
   being able to handle very large collections is more important than
   whether an event occurs exactly 55,482x or 55,519x.
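
   For intuition about how a counter of just one or two bytes can reach such
   large values, a logarithmic (Morris-style) counter stores only a small
   exponent and increments it probabilistically. The following toy sketch
   shows the principle; it is not Bounter's exact encoding, and the base is
   an arbitrary choice:

   .. code:: python

       import random

       class ToyLogCounter(object):
           """Approximate counter: stores a small exponent instead of the full count."""
           def __init__(self, base=1.0008):
               self.base = base
               self.exponent = 0

           def increment(self):
               # increment with probability base**-exponent, so large counts
               # require exponentially fewer stored increments
               if random.random() < self.base ** -self.exponent:
                   self.exponent += 1

           def value(self):
               # unbiased estimate of the true number of increments
               return (self.base ** self.exponent - 1) / (self.base - 1)

       counter = ToyLogCounter()
       for _ in range(100000):
           counter.increment()
       print(int(counter.value()))  # roughly 100000, typically within a few percent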

3. **Full item iteration: "What are the items and their frequencies?"**

   .. code:: python

       from bounter import bounter

       counts = bounter(size_mb=200)  # default version, unless you specify need_items or need_counts
       counts.update(['a', 'b', 'c', 'a', 'b'])
       print(counts.total(), counts.cardinality())  # total and cardinality still work
       (5L, 3)
       print(counts['a'])  # individual item frequency still works
       2

       print(list(counts))  # iterator returns keys, just like Counter
       [u'b', u'a', u'c']
       print(list(counts.iteritems()))  # supports iterating over key-count pairs, etc.
       [(u'b', 2L), (u'a', 2L), (u'c', 1L)]

   Stores the keys (strings) themselves in addition to the total
   cardinality and individual item frequency (8 bytes). Uses the most
   memory, but supports the widest range of functionality.

   This option uses a custom C hash table underneath, with optimized string
   storage. It will remove its low-count objects when nearing the maximum
   allotted memory, instead of expanding the table.
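
   Conceptually, that pruning step behaves like the following pure-Python
   sketch; the real structure is a C hash table, and the threshold logic
   here is a made-up simplification:

   .. code:: python

       def prune_low_counts(table, max_items):
           """Drop the rarest keys once the table grows past its budget."""
           while len(table) > max_items:
               cutoff = min(table.values())
               for key in [k for k, v in table.items() if v == cutoff]:
                   del table[key]
           return table

       table = {'a': 120, 'b': 1, 'c': 45, 'd': 1}
       print(prune_low_counts(table, max_items=2))  # {'a': 120, 'c': 45}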

--------------

For more details, see the `API
docstrings <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
or read the
`blog <https://rare-technologies.com/counting-efficiently-with-bounter-pt-1-hashtable/>`__.

Example on the English Wikipedia
--------------------------------

Let's count the frequencies of all bigrams in the English Wikipedia
corpus:

.. code:: python

    from bounter import bounter
    from smart_open import smart_open  # assumption: smart_open streams the gzipped corpus

    counter = bounter(size_mb=1024)  # assumption: reuse the 1 GB configuration from above

    with smart_open('wikipedia_tokens.txt.gz') as wiki:
        for line in wiki:
            words = line.decode().split()
            bigrams = zip(words, words[1:])
            counter.update(u' '.join(pair) for pair in bigrams)

    print(counter[u'czech republic'])
    42099

The Wikipedia dataset contained 7,661,318 distinct words across
1,860,927,726 total words, and 179,413,989 distinct bigrams. Storing
them in a built-in ``dict`` would consume over 31 GB RAM.

To test the accuracy of Bounter, we automatically extracted
`collocations <https://en.wikipedia.org/wiki/Collocation>`__ (common
multi-word expressions, such as "New York", "network license", "Supreme
Court" or "elementary school") from these bigram counts.
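
The collocation extraction itself is not part of Bounter. One simple way to
score candidate collocations from such counts is to compare a bigram's
frequency against its parts' frequencies, in the spirit of gensim's
``Phrases`` scorer. A hypothetical sketch, assuming a unigram bounter
``word_counts`` was filled alongside the bigram ``counter`` above, with an
arbitrary ``min_count`` and threshold:

.. code:: python

    def collocation_score(bigram, word_counts, bigram_counts, vocab_size, min_count=5):
        # (count(a b) - min_count) * vocab_size / (count(a) * count(b))
        a, b = bigram.split(' ')
        return (float(bigram_counts[bigram]) - min_count) * vocab_size / (
            word_counts[a] * word_counts[b])

    score = collocation_score(u'czech republic', word_counts, counter,
                              vocab_size=7661318)
    is_collocation = score > 10.0  # keep bigrams scoring above a chosen threshold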

We compared the set of collocations extracted from Counter (exact
counts, needs lots of memory) vs Bounter (approximate counts, bounded
memory) and present the precision and recall here:

+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| Algorithm                                                          | Time to build | Memory     | Precision | Recall   | F1 score |
+====================================================================+===============+============+===========+==========+==========+
| ``Counter`` (built-in)                                             | 32m 26s       | 31 GB      | 100%      | 100%     | 100%     |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=128, need_iteration=False, log_counting=8)``     | 19m 53s       | **128 MB** | 95.02%    | 97.10%   | 96.04%   |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=1024)``                                          | 17m 54s       | 1 GB       | 100%      | 99.27%   | 99.64%   |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=1024, need_iteration=False)``                    | 19m 58s       | 1 GB       | 99.64%    | 100%     | 99.82%   |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=1024, need_iteration=False, log_counting=1024)`` | 20m 05s       | 1 GB       | **100%**  | **100%** | **100%** |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=1024, need_iteration=False, log_counting=8)``    | 19m 59s       | 1 GB       | 97.45%    | 97.45%   | 97.45%   |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=4096)``                                          | **16m 21s**   | 4 GB       | 100%      | 100%     | 100%     |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=4096, need_iteration=False)``                    | 20m 14s       | 4 GB       | 100%      | 100%     | 100%     |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+
| ``bounter(size_mb=4096, need_iteration=False, log_counting=1024)`` | 20m 14s       | 4 GB       | 100%      | 99.64%   | 99.82%   |
+--------------------------------------------------------------------+---------------+------------+-----------+----------+----------+

Bounter achieves a perfect F1 score of 100% at 31x less memory (1 GB vs
31 GB), compared to a built-in ``Counter`` or ``dict``. It is also 61%
faster.

--------------

Bounter is open source software released under the `MIT
license <https://github.com/rare-technologies/bounter/blob/master/LICENSE>`__.
Copyright (c) 2017 `RaRe
Technologies <https://rare-technologies.com/>`__

.. |License| image:: https://img.shields.io/pypi/l/bounter.svg
   :target: https://github.com/RaRe-Technologies/bounter/blob/master/LICENSE
.. |Build Status| image:: https://travis-ci.org/RaRe-Technologies/bounter.svg?branch=master
   :target: https://travis-ci.org/RaRe-Technologies/bounter
.. |GitHub release| image:: https://img.shields.io/github/release/rare-technologies/bounter.svg?maxAge=3600
   :target: https://github.com/RaRe-Technologies/bounter/releases
.. |Downloads| image:: https://pepy.tech/badge/bounter/week
   :target: https://pepy.tech/project/bounter/week