
Use a bigger English default dictionary #11

Open
bknowles opened this issue Aug 12, 2015 · 23 comments

Comments

@bknowles

The dictionary provided by default with this program is very small. For enhanced security, it would be better to replace or augment this dictionary with one that has at least ten to fifteen thousand words. There are many dictionary lists available from the page at http://wordlist.aspell.net/other-dicts/, and they'll even help you tune your dictionary list to just the right "size", using the tool at http://app.aspell.net/create

I was able to quickly create a dictionary with over 10,000 common English words based on their information sources.

Check out the list provided by the URL http://app.aspell.net/create?max_size=20&spelling=US&spelling=GBz&spelling=CA&max_variant=0&diacritic=strip&special=hacker&special=roman-numerals&download=wordlist&encoding=utf-8&format=inline and then strip out lines 1-44 (the header), and all lines that contain an apostrophe in them.
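The stripping described above can be sketched in a few lines of Python. This is illustrative only; the 44-line header count is taken from the comment, and the sample data stands in for the real downloaded file:

```python
def clean_wordlist(lines, header_lines=44):
    """Drop the header, blank lines, and any word containing an apostrophe."""
    return [w.strip() for w in lines[header_lines:]
            if w.strip() and "'" not in w]

# Tiny demonstration: 44 fake header lines, then a few entries.
sample = ["# header\n"] * 44 + ["apple\n", "isn't\n", "zebra\n"]
print(clean_wordlist(sample))  # ['apple', 'zebra']
```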

@bknowles
Author

Note that if you use the 2+2gfreq.txt file from version 5.0 of 12Dicts (via the link at http://wordlist.aspell.net/12dicts/), you'll get a file that is generally sorted by frequency of use. You can delete all the lines that begin with a space character, and if you stop at the end of section 16, that gives you about 13,000 words to work with.

If you also strip out all the words with fewer than four letters, along with the lines separating the sections, that leaves 12,774 words to work with.

@PlotCitizen

I would also like to bring attention to Diceware, a project vaguely similar in purpose to HSXKPasswd. The words from there could also be included in the new dictionary, or otherwise bundled with HSXKPasswd in some other way.

@bknowles
Author

Yeah, I meant to mention diceware myself, but I forgot. Thanks for the reminder!

@bbusschots
Owner

My thinking on dictionaries is that the module should move towards a model where users choose a base dictionary, like English, and then select specialist dictionaries for their areas of interest, e.g. Harry Potter words, astronomical terms, US place names, and so on.

Having said that, the base English dictionary could still do with some beefing up, so it's something I'll be working on over the next few months.

@bknowles
Author

I'm happy to contribute the dictionary I created, using the methods described above. That would at least give us a decent baseline with a reasonable number of words in the base dictionary, but not too many.

From there, people could create whatever other dictionaries they might want.

@bbusschots
Owner

@bknowles that would be great - thanks!

Would you be happy to update the file share/sample_dict_EN.txt and issue a pull request? Or would you prefer to share the file in some other way?

@bknowles
Author

Yeah, I'll do a PR.

Is there a particular branch name that you would want me to work from?

@bbusschots
Owner

@bknowles
Copy link
Author

Will do.

@cmrd-senya

@bknowles, are you going to make a PR still?

@bknowles
Author

Sorry, I completely forgot about this one.

@bknowles
Author

Would I be correct in assuming you want me to update the file at hsxkpasswd/lib/Crypt/HSXKPasswd/Dictionary/EN.pm?

@bbusschots
Owner

@bknowles yes please.

@bknowles
Author

Starting with the file at https://github.com/en-wl/wordlist/blob/master/alt12dicts/3esl.txt (see the instructions at https://github.com/en-wl/wordlist/blob/master/alt12dicts/README-orig#L554), we begin with 21877 words. If we strip out all the words that have capital letters or punctuation, that leaves us with 19217 words. If we then strip out all the words that are three characters or shorter, we have 18693 words. If we remove words with 9 characters or more, that gets us down to 11768 words.
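Those three filter passes can be sketched as follows. The word counts quoted above depend on the exact source file; this just shows the mechanics:

```python
def filter_words(words, min_len=4, max_len=8):
    """Apply the three passes described above: drop words containing
    capitals or punctuation, then words shorter than min_len, then
    words longer than max_len."""
    no_caps_or_punct = [w for w in words if w.isalpha() and w.islower()]
    long_enough = [w for w in no_caps_or_punct if len(w) >= min_len]
    return [w for w in long_enough if len(w) <= max_len]

print(filter_words(["Apple", "don't", "cat", "house", "wonderful"]))
# ['house']
```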

I'm having trouble locating a current copy of the 2+2gfreq.txt file that I had referenced earlier. If I can find one, I'll see if that adds anything to the mix, or if maybe I should take the intersection of these two sets instead of the union.

@clsn
Contributor

clsn commented Feb 27, 2017

The whole point of this algorithm is ease of remembering the passwords by using extremely common words. A 1000-word dictionary gives ~10 bits of entropy per word, and with 4 words that's 40 bits already right there.

Which is not to say we mustn't use a larger dictionary, but don't feel pressured to expand it all that much. We should stay within the very frequently used, very memorable English words. Doubling the size of the dictionary adds one more bit of entropy per word, which would be four more bits in a typical four-word password. Just keep the words "normal", common, and not too easily confused.
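The entropy figures being traded here fall out of a one-line formula: each word drawn uniformly from the dictionary contributes log2(dictionary size) bits.

```python
from math import log2

def passphrase_entropy(dict_size, num_words):
    """Bits of entropy for num_words drawn uniformly (with replacement)
    from a dictionary of dict_size words."""
    return num_words * log2(dict_size)

print(round(passphrase_entropy(1000, 4), 1))  # 39.9 -- the ~40 bits above
print(round(passphrase_entropy(2000, 4), 1))  # 43.9 -- doubling adds 1 bit/word
```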

@bknowles
Author

bknowles commented Feb 27, 2017

I was targeting ~10k words, preferably as common as possible for the English language. But finding current data for words sorted by frequency or commonality of use is kind of hard these days -- the old word lists seem to have been "updated" with newer formatting and the simple files lost, and most of the new research is either behind a paywall or the data is stored in a pretty arcane format.

Unfortunately, 40 bits of entropy is not very much these days. That's less entropy than you'd get with a randomly generated 7 character base64 password (6 bits of entropy per base64 character times seven characters = 42). Brute force methods using GPUs can run at speeds of billions of MD5 password cracks per second, and tens of thousands of cracks per second on more advanced algorithms like sha512crypt with 5000 rounds. For 40 bits of entropy, that could be cracked in minutes to hours.

So, if you want to use much smaller dictionaries, you're going to need to use more words in your xkcd-style passphrase to compensate. A ten thousand word dictionary will get you about 13.3 bits of entropy per word, so that four word xkcd-style passphrase would instead be about 53 bits of entropy total, instead of just 40.

Sure, you could have a five-word xkcd-style passphrase with 10 bits of entropy per word, but you'd still be short three bits, and each additional bit of entropy doubles the amount of cracking time required.

EDIT: Let's say you were using that 40-bit password as your wifi password for a network that is secured with WPA/WPA2, and your attacker knows you're using xkcd-style passwords, with the default 1000 word dictionary. The latest GPU benchmarks I can find for hashcat (see https://passwordrecovery.io/nvidia-gtx-1080-hashcat-benchmarks/) indicate that it can do 3177.6 kH/s.

Let's do the math. First, 2^40 = 1099511627776. Divide by 1000 (convert kH/s -> H/s), and you get 1099511627.776. Divide by 3177.6, and you get about 346019.52. That's how many CPU/GPU seconds it would take to crack your password. That's 5766.992 minutes, 96.1165 hours, or 4.005 days.

That's one machine to crack your 40 bit password, and WPA/WPA2 is much more secure than MD5. You can rent a machine of that caliber in AWS for a few dollars per hour.

In contrast, with a standard ~10k word dictionary, that same password would have ~53 bits of entropy, and take 2^13 (8192) times as long to crack. Instead of just over 4 days, it's now just under 32809 days, or 89.89 years.

That's the power of default.
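The arithmetic above can be checked with a few lines; the 3177.6 kH/s figure is the GTX 1080 benchmark cited in the comment:

```python
def crack_seconds(entropy_bits, hashes_per_second):
    """Worst-case time to exhaust a keyspace of 2**entropy_bits guesses."""
    return 2 ** entropy_bits / hashes_per_second

RATE = 3177.6 * 1000  # 3177.6 kH/s expressed in hashes per second

print(crack_seconds(40, RATE))          # ~346019.5 seconds
print(crack_seconds(40, RATE) / 86400)  # ~4.0 days
print(crack_seconds(53, RATE) / 86400)  # ~32800 days, i.e. roughly 90 years
```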

@bknowles
Author

bknowles commented Feb 27, 2017

Okay, I got confirmation from the author that the 2+2+3frq and 2+2+3cmn word lists are available in the official 12Dicts zipfile, currently at http://downloads.sourceforge.net/wordlist/12dicts-6.0.2.zip. This is the zipfile that is linked from http://wordlist.aspell.net/12dicts/.

I could have sworn that I previously tried this zipfile and found that it was somehow corrupted and I could not extract it, but maybe I was just smoking something. ;)

Anyway, now that I have these files, I will re-work my baseline and update the file hsxkpasswd/lib/Crypt/HSXKPasswd/Dictionary/EN.pm and generate a PR as mentioned above.

@bknowles
Author

bknowles commented Feb 27, 2017

Okay, so looking at the wordlist in hsxkpasswd/lib/Crypt/HSXKPasswd/Dictionary/EN.pm, I find four duplicated words, differing only in capitalization: earth, march, mark, and moon (see lines 72, 111, 112, 119, 429, 726, 727, and 763).

For my wordlist, I'm going to assume each word should be in lowercase only, and if you want to capitalize the word as part of the presentation for adding a few additional bits of entropy to the generated password, that's up to you.
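That duplicate check is easy to reproduce with a case-insensitive scan (a hypothetical helper, not part of HSXKPasswd itself):

```python
from collections import Counter

def case_insensitive_duplicates(words):
    """Return the lowercase forms that occur more than once when
    capitalization is ignored."""
    counts = Counter(w.lower() for w in words)
    return sorted(w for w, n in counts.items() if n > 1)

print(case_insensitive_duplicates(["earth", "Earth", "march", "March", "venus"]))
# ['earth', 'march']
```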

@bknowles
Author

Okay, starting with 2+2+3frq.txt, I first chopped off all the words from section 16 and beyond. I then went through and manually removed a few words that were lemmas relating to other common words (e.g., "hast", "hath", "doth", etc...). All the remaining lemmas were promoted to primary words, and excess spacing was removed.

I removed all words of three characters or fewer, or more than eight characters. I removed certain special characters (parentheses, asterisks, plus signs, etc.) because they did not appear to affect the quality of the word in any way. I removed all contractions and hyphenated words, as well as all alternating forms (e.g., "him/her", "his/hers", etc.).

I converted all uppercase characters to lowercase, and eliminated all abbreviations (such as "e.g.,", "a.m.", "p.m.", "etc...", and so on).

After removing the lines that separated sections and doing a sort | uniq, I wound up with 10291 words left.
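Most of that cleanup can be expressed as a single filter; the manual lemma removals (e.g. "hast", "doth") are judgment calls a regex can't make, so they're not reproduced in this sketch:

```python
import re

def keep_word(raw):
    """Keep only 4-8 character purely alphabetic lowercase words, which
    discards contractions, hyphenated and slash-separated forms, and
    dotted abbreviations in one pass."""
    word = raw.strip().lower()
    return bool(re.fullmatch(r"[a-z]{4,8}", word))

sample = ["  Colour", "e.g.", "him/her", "a.m.", "zigzag", "mother-in-law"]
kept = sorted({w.strip().lower() for w in sample if keep_word(w)})
print(kept)  # ['colour', 'zigzag']
```

The final `sorted(set(...))` step plays the role of the `sort | uniq` mentioned above.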

Yes, this includes "zigzag", which I consider to be a hyphenated word, but the dictionary I'm working from doesn't. I also preserved British versus American spelling as much as possible (e.g., "color" and "colour" are both included, as are "armor" and "armour", etc...).

I also found 216 words that are included in the current EN.pm dictionary which are not included in my list of 10291 words. This seems to include a lot of country names, names of US states, names of certain world cities, months, planets, and a few other things. I'm not sure if I should include these in my list or not.

@mshulman
Contributor

I commented on the other duplicate issue, and created a PR based on the most common words in English (https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-no-swears.txt), removed swear words, and removed all 1, 2, and 3 character words. That left about 9,000 words, all of which are common. Apologies if I created a PR on the wrong file - I created it on the sample dictionary.

@bbusschots
Owner

@bknowles sorry for the slow reply - real life has been distracting me a little too much.

I intentionally sought out the geographical names to add to my original list to bulk it out with things that people would find every bit as memorable as official words.

Ideally, what we need is the union of all the words I already had, all the words you've come up with, and all the words @mshulman has come up with, all merged into share/sample_dict_EN.txt to be used as the base for auto-generating lib/Crypt/HSXKPasswd/Dictionary/EN.pm.

@mshulman
Contributor

@bbusschots

The list that I added in PR #28 does have the geographical names; they're just not capitalized, so they're sorted into the combined list. But I'll confirm by pulling your original list, removing capitalization, and then sorting and removing dupes. If there are any words from your original list that aren't in PR #28, I'll add them back in.
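That merge, lowercasing everything, taking the union, and de-duplicating, is straightforward (a sketch, not actual repo tooling):

```python
def merge_wordlists(*lists):
    """Lowercase every word, take the union of all the lists, and
    return the result sorted and de-duplicated."""
    merged = set()
    for wordlist in lists:
        merged.update(w.strip().lower() for w in wordlist if w.strip())
    return sorted(merged)

print(merge_wordlists(["Texas", "apple"], ["apple", "texas", "zebra"]))
# ['apple', 'texas', 'zebra']
```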

@bbusschots
Owner

Thanks @mshulman.
