Use a bigger English default dictionary #11
Note that if you use the 2+2gfreq.txt file from version 5.0 of 12Dicts, via the link at http://wordlist.aspell.net/12dicts/, you get a file that is roughly sorted by frequency of use. You can delete all the lines that begin with a space character, and if you go down to the end of section 16, that will give you about 13000 words to work with. If you also strip out all the words with fewer than four letters, and the lines separating the sections, that results in 12774 words to work with.
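Reproducing that trim is straightforward. Here is a minimal Python sketch; the assumptions about the 12Dicts layout (space-prefixed variant lines, a non-word separator line before each frequency section) are unverified, so check them against the actual file before relying on the output:

```python
# Sketch of the 2+2gfreq.txt trimming described above, NOT a faithful
# parser. ASSUMED (not verified against the real 12Dicts file): variant
# and inflected forms sit on lines beginning with a space, and a
# separator line (anything that is not a plain lowercase word) precedes
# each frequency section, with sections in frequency order.
def trim_2plus2gfreq(path, last_section=16, min_len=4):
    words = []
    separators_seen = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith(" "):        # variant form: skip it
                continue
            token = line.strip()
            if not token:
                continue
            if not (token.isalpha() and token.islower()):
                separators_seen += 1        # assumed section separator
                if separators_seen > last_section:
                    break                   # end of section 16
                continue
            if len(token) >= min_len:       # drop words under 4 letters
                words.append(token)
    return words
```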
I would also like to bring attention to Diceware, a project vaguely similar in purpose to HSXKPasswd. The words from there could also be included in the new dictionary, or otherwise bundled with HSXKPasswd in some other way.
Yeah, I meant to mention Diceware myself, but I forgot. Thanks for the reminder!
My thinking on dictionaries is that the module should move towards a model where users would choose a base dictionary, like English, and then select specialist dictionaries for their areas of interest, e.g. Harry Potter words, astronomical terms, US place names, and so on. Having said that, the base English dictionary could still do with some beefing up, so it's something I'll be working on over the next few months.
I'm happy to contribute the dictionary I created, using the methods described above. That would at least give us a decent baseline with a reasonable number of words in the base dictionary, but not too many. From there, people could create whatever other dictionaries they might want.
@bknowles that would be great - thanks! Would you be happy to update the file share/sample_dict_EN.txt and issue a pull request? Or would you prefer to share the file in some other way?
Yeah, I'll do a PR. Is there a particular branch name that you would want me to work from?
Will do.
@bknowles, are you still going to make a PR?
Sorry, I completely forgot about this one.
Would I be correct in assuming you want me to update the file at hsxkpasswd/lib/Crypt/HSXKPasswd/Dictionary/EN.pm?
@bknowles yes please.
Starting with the file at https://github.com/en-wl/wordlist/blob/master/alt12dicts/3esl.txt (see the instructions at https://github.com/en-wl/wordlist/blob/master/alt12dicts/README-orig#L554), we begin with 21877 words. If we strip out all the words that have capital letters or punctuation, that leaves us with 19217 words. If we then strip out all the words that are three characters or shorter, we have 18693 words. If we remove words with nine characters or more, that gets us down to 11768 words. I'm having trouble locating a current copy of the …
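To check those counts against whatever revision of 3esl.txt you download, the steps reduce to a short script (a sketch; the figures above came from the revision linked):

```python
import re

def step_counts(path):
    """Apply the 3esl.txt trimming steps above, printing the count left
    after each step. Counts will vary with the file revision."""
    with open(path, encoding="utf-8") as fh:
        words = [w.strip() for w in fh if w.strip()]
    no_caps_punct = [w for w in words if re.fullmatch(r"[a-z]+", w)]
    four_plus = [w for w in no_caps_punct if len(w) >= 4]
    four_to_eight = [w for w in four_plus if len(w) <= 8]
    for label, lst in (("start", words),
                       ("no capitals or punctuation", no_caps_punct),
                       ("4+ characters", four_plus),
                       ("4-8 characters", four_to_eight)):
        print(f"{label}: {len(lst)}")
    return four_to_eight
```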
The whole point of this algorithm is ease of remembering the passwords by using extremely common words. A 1000-word dictionary gives ~10 bits of entropy per word, and with 4 words that's 40 bits right there. Which is not to say we mustn't use a larger dictionary, but don't feel pressed to expand it all that much. We should stay within the very frequently-used, very memorable English words. Doubling the size of the dictionary adds one more bit of entropy per word, which would be four more bits in a typical four-word password. Just keep the words "normal" and common and not too confusable.
I was targeting ~10k words, preferably as common as possible for the English language. But finding current data for words sorted by frequency or commonality of use is kind of hard these days -- the old word lists seem to have been "updated" with newer formatting and the simple files lost, and most of the new research is either behind a paywall or the data is stored in a pretty arcane format.

Unfortunately, 40 bits of entropy is not very much these days. That's less entropy than you'd get from a randomly generated 7-character base64 password (6 bits of entropy per base64 character times seven characters = 42). Brute-force methods using GPUs can run at speeds of billions of MD5 password cracks per second, and tens of thousands of cracks per second on more advanced algorithms like sha512crypt with 5000 rounds. At 40 bits of entropy, a password could be cracked in minutes to hours. So, if you want to use much smaller dictionaries, you're going to need to use more words in your xkcd-style passphrase to compensate.

A ten-thousand-word dictionary will get you about 13.3 bits of entropy per word, so that four-word xkcd-style passphrase would instead have about 53 bits of entropy total, instead of just 40. Sure, you could have a five-word xkcd-style passphrase with 10 bits of entropy per word, but you'd still be short three bits, and each additional bit of entropy doubles the amount of cracking time required.

EDIT: Let's say you were using that 40-bit password as your wifi password for a network secured with WPA/WPA2, and your attacker knows you're using xkcd-style passwords with the default 1000-word dictionary. The latest GPU benchmarks I can find for hashcat (see https://passwordrecovery.io/nvidia-gtx-1080-hashcat-benchmarks/) indicate that it can do 3177.6 kH/s. Let's do the math. First, 2^40 = 1099511627776. Divide by 1000 (to account for the kH/s unit), and you get 1099511627.776. Divide by 3177.6, and you get about 346019.52. That's how many GPU-seconds it would take to crack your password: 5766.99 minutes, 96.12 hours, or 4.005 days. That's one machine to crack your 40-bit password, and WPA/WPA2 is a much harder target than MD5. You can rent a machine of that caliber in AWS for a few dollars per hour. In contrast, with a standard ~10k-word dictionary, that same passphrase would have ~53 bits of entropy, and take 2^13 (8192) times as long to crack. Instead of just over 4 days, it's now just under 32809 days, or 89.89 years. That's the power of defaults.
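Those numbers are easy to sanity-check. A minimal sketch, computing the exact keyspace (dictionary size raised to the word count) instead of the rounded powers of two used above, so the results land close to, rather than exactly on, the quoted figures:

```python
import math

KILOHASHES_PER_SEC = 3177.6   # the GTX 1080 hashcat figure cited above

def passphrase_cost(dict_size, num_words, kh_rate=KILOHASHES_PER_SEC):
    """Entropy (bits) and exhaustive-search time (days) for a passphrase
    of num_words words drawn uniformly from a dict_size-word dictionary."""
    bits = num_words * math.log2(dict_size)
    keyspace = dict_size ** num_words       # candidate passphrases
    days = keyspace / (kh_rate * 1000) / 86400
    return bits, days

print(passphrase_cost(1000, 4))    # ~39.9 bits, ~3.6 days
print(passphrase_cost(10000, 4))   # ~53.2 bits, ~36,400 days (about a century)
```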
Okay, I got confirmation from the author. I could have sworn that I previously tried this zipfile and found that it was somehow corrupted and I could not extract it, but maybe I was just smoking something. ;) Anyway, now that I have these files, I will re-work my baseline and update the file mentioned above.
Okay, so looking at the wordlist: for my version, I'm going to assume each word should be in lowercase only; if you want to capitalize words as part of the presentation, to add a few additional bits of entropy to the generated password, that's up to you.
Okay, starting with that list: I removed all words that consisted of three characters or fewer, or more than eight characters. I removed certain special characters like parentheses, asterisks, plus signs, etc., because they did not appear to affect the quality of the word in any way. I removed all contractions and hyphenated words, as well as all alternating words (e.g., "him/her", "his/hers", etc.). I converted all uppercase characters to lowercase, and eliminated all abbreviations (such as "e.g.", "a.m.", "p.m.", "etc.", and so on). I then removed the lines that separated sections and deduplicated what was left.

Yes, this includes "zigzag", which I consider to be a hyphenated word, but the dictionary I'm working from doesn't. I also preserved British versus American spellings as much as possible (e.g., "color" and "colour" are both included, as are "armor" and "armour", etc.). I also found 216 words that are included in the current EN.pm …
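In code, that whole cleanup collapses to a single filter. A sketch, with the quirks of the source file (section separators, abbreviation styles) assumed rather than confirmed:

```python
import re

def clean_wordlist(lines):
    """Keep only plain lowercase words of 4-8 letters, deduplicated.
    The single regex rejects contractions, hyphenated and slash-separated
    alternates, dotted abbreviations, words containing special characters,
    and (assumed non-word) section-separator lines in one pass."""
    seen, out = set(), []
    for raw in lines:
        word = raw.strip().lower()      # fold case before filtering
        if re.fullmatch(r"[a-z]{4,8}", word) and word not in seen:
            seen.add(word)
            out.append(word)
    return sorted(out)
```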
I commented on the other duplicate issue, and created a PR based on the most common words in English (https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-no-swears.txt), with swear words removed and all 1-, 2-, and 3-character words removed. That left about 9,000 words, all of which are common. Apologies if I created the PR on the wrong file - I created it on the sample dictionary.
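For reference, that filtering amounts to a couple of lines, assuming one word per line in the linked file:

```python
# Drop the 1-, 2-, and 3-character words; on the revision linked above
# this leaves roughly 9,000 words.
with open("google-10000-english-no-swears.txt", encoding="utf-8") as fh:
    words = [w.strip() for w in fh if len(w.strip()) >= 4]
print(len(words))
```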
@bknowles sorry for the slow reply - real life has been distracting me a little too much. I intentionally sought out the geographical names to add to my original list, to bulk it out with things that people would find every bit as memorable as official words. Ideally, what we need is the union of all the words I already had, all the words you've come up with, and all the words @mshulman has come up with, and then to get all of that merged in.
The list that I added in PR #28 does have the geographical names; they're just not capitalized, so they're sorted into the combined list. But I'll confirm by pulling your original list, removing capitalization, and then sorting and removing dupes. If there are any words from your original list that aren't in my PR #28, I'll add them back in.
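The union itself is a one-liner once everything is lowercased. A sketch, with hypothetical file names:

```python
def merge_wordlists(*paths):
    """Union of several wordlists: lowercased, deduplicated, sorted."""
    merged = set()
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            merged.update(w.strip().lower() for w in fh if w.strip())
    return sorted(merged)

# e.g. merge_wordlists("original_list.txt", "bknowles_list.txt",
#                      "pr28_list.txt")
```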
Thanks @mshulman.
The dictionary provided by default with this program is very small. For enhanced security, it would be better to replace or augment this dictionary with one that has at least ten to fifteen thousand words. There are many dictionary lists available from the page at http://wordlist.aspell.net/other-dicts/, and they'll even help you tune your dictionary list to just the right "size", using the tool at http://app.aspell.net/create
I was able to quickly create a dictionary with over 10,000 common English words based on their information sources.
Check out the list provided by the URL http://app.aspell.net/create?max_size=20&spelling=US&spelling=GBz&spelling=CA&max_variant=0&diacritic=strip&special=hacker&special=roman-numerals&download=wordlist&encoding=utf-8&format=inline and then strip out lines 1-44 (the header) and all lines that contain an apostrophe.
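A sketch of that last step, assuming the generated header is still exactly 44 lines (worth re-checking against an actual download, since the header length may change):

```python
from urllib.request import urlopen

URL = ("http://app.aspell.net/create?max_size=20&spelling=US&spelling=GBz"
       "&spelling=CA&max_variant=0&diacritic=strip&special=hacker"
       "&special=roman-numerals&download=wordlist&encoding=utf-8"
       "&format=inline")

def fetch_wordlist(url=URL, header_lines=44):
    """Download the SCOWL-generated list, drop the header, and drop
    every line that contains an apostrophe."""
    lines = urlopen(url).read().decode("utf-8").splitlines()
    return [w for w in lines[header_lines:] if "'" not in w]
```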