Commit 92f5ce5

doc tweak: make usage in data-scripts consistent with filenames in data/
1 parent 87b555f commit 92f5ce5

3 files changed, +7 -3 lines changed

data-scripts/count_wikipedia.py

+1 -1
@@ -16,7 +16,7 @@ def usage():
 tokenize a directory of text and count unigrams.

 usage:
-%s input_dir ../data/written_english.txt
+%s input_dir ../data/english_wikipedia.txt

 input_dir is the root directory where sentence files live. Each file should contain
 one sentence per line, with punctuation. This script will walk the directory recursively,
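
The usage text above describes walking a directory recursively and counting unigrams from files that hold one sentence per line. A minimal sketch of that idea follows. The function name count_unigrams and the regex tokenizer are assumptions made for illustration; the script's actual tokenizer is not shown in this diff.

```python
import os
import re
from collections import Counter


def count_unigrams(input_dir):
    """Walk input_dir recursively and count lowercase word tokens.

    Hypothetical helper: the regex below is a stand-in tokenizer,
    not the one used by count_wikipedia.py.
    """
    counts = Counter()
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for sentence in f:  # each file holds one sentence per line
                    counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts
```

Calling count_unigrams on the input directory and writing counts.most_common() to a file such as ../data/english_wikipedia.txt would mirror the usage line, though the real output format may differ.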

data-scripts/count_wiktionary.py

+5 -1
@@ -17,7 +17,7 @@ def usage():

 Put those into a single directory and point it to this script:

-%s wiktionary_html_dir ../data/spoken_english.txt
+%s wiktionary_html_dir ../data/us_tv_and_film.txt

 output.txt will include one line per word in the study, ordered by rank, of the form:

@@ -31,6 +31,7 @@ def parse_wiki_tokens(html_doc_str):
     results = []
     last3 = ['', '', '']
     header = True
+    skipped = 0
     for line in html_doc_str.split('\n'):
         last3.pop(0)
         last3.append(line.strip())
@@ -49,9 +50,12 @@ def parse_wiki_tokens(html_doc_str):
             #
             # otherwise end up with a bunch of duplicates eg victor / victor's
             if token.endswith("'s") and rank > 1000:
+                skipped += 1
                 continue
             count = int(count)
             results.append((rank, token, count))
+    # early docs have 1k entries, later 2k, last 1284
+    assert len(results) + skipped in [1000, 2000, 1284]
     return results

 def normalize(token):
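
The new skipped counter lets the parser check that every table row is accounted for even when possessives are dropped. The sketch below isolates that pattern under some assumptions: keep_common_possessives and its expected_sizes parameter are hypothetical names, and the input is taken to be already-parsed (rank, token, count) tuples rather than raw HTML.

```python
def keep_common_possessives(rows, expected_sizes=(1000, 2000, 1284)):
    """Drop rare tokens ending in 's and verify no row went missing.

    Hypothetical helper: rows is an iterable of (rank, token, count)
    tuples for one page; the sizes come from the comment in the diff.
    """
    results = []
    skipped = 0
    for rank, token, count in rows:
        # Keep common possessives (rank up to 1000), skip the rest.
        if token.endswith("'s") and rank > 1000:
            skipped += 1
            continue
        results.append((rank, token, int(count)))
    # Kept rows plus skipped rows must add up to a known page size.
    assert len(results) + skipped in expected_sizes
    return results
```

Without the counter, a page that loses rows to the possessive filter could not be told apart from a page that failed to parse completely.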

data-scripts/count_xato.coffee

+1 -1
@@ -9,7 +9,7 @@ sprintf = require('sprintf-js').sprintf
 check_usage = () ->
   usage = '''

-Run a frequency count on the raw 10M xato password set and keep the top 40k by
+Run a frequency count on the raw 10M xato password set and keep counts over CUTOFF in
 descending frequency. That file can be found by googling around for:
 "xato 10-million-combos.txt"

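
The updated usage text describes keeping every password whose count exceeds CUTOFF, rather than a fixed top 40k. A rough Python sketch of that filter is shown here for illustration; the script itself is CoffeeScript. The function name count_over_cutoff, the CUTOFF value of 10, and the strict "greater than" comparison are all assumptions, since the diff does not show them.

```python
from collections import Counter

CUTOFF = 10  # hypothetical threshold; the real value is not shown in this diff


def count_over_cutoff(passwords, cutoff=CUTOFF):
    """Return (password, count) pairs with count > cutoff, most frequent first."""
    counts = Counter(passwords)
    kept = [(pw, n) for pw, n in counts.items() if n > cutoff]
    # Stable sort: passwords with equal counts keep first-seen order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

A threshold on counts makes the output size depend on the data, whereas a fixed top-N cut can split passwords that share the same count.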
