This repository contains data based on a corpus of texts written in the Hawaiian language (ʻŌlelo Hawaiʻi). The data includes frequency lists, stopwords, and lists of most common n-grams. The text in the corpus was obtained from Ulukau, the Hawaiian Electronic Library.
The corpus contains a total of 10.7 million words and is restricted to modern (post-20th century), non-scriptural text. An overview of corpus statistics (including the most common words and n-grams) can be seen here.
Files included in this repository:
- Hawaiian frequency list: A list of all the words in the corpus, arranged by frequency
- Hawaiian stopwords list: A list of stopwords derived from the frequency file (this is being actively verified and updated for eventual inclusion in the stopwords-json project)
- List of Hawaiian bigrams: A list of the most common sequences of two words, arranged by frequency
- List of Hawaiian 3-grams: A list of the most common sequences of three words, arranged by frequency
- List of Hawaiian 4-grams: A list of the most common sequences of four words, arranged by frequency
- Statistics for the Hawaiian corpus
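
To make the terms above concrete, here is a minimal sketch of how word and n-gram frequency lists of this kind can be computed from raw text. It is an illustration only, not necessarily how the files in this repository were produced. The helper names `tokenize` and `ngram_counts` and the sample sentences are hypothetical, and the tokenizer assumes the ʻokina is encoded as U+02BB (ʻ) rather than as an apostrophe.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its Hawaiian word tokens (hypothetical helper)."""
    # NFC keeps kahakō vowels (ā, ē, ī, ō, ū) as single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Assumption: the ʻokina is written as U+02BB, not as ' or ‘.
    return re.findall(r"[a-zāēīōūʻ]+", text)


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Illustrative sample text, not taken from the corpus.
sample = "Ua hele au i ke kula. Ua hele ʻo ia i ke kai."
tokens = tokenize(sample)

unigrams = ngram_counts(tokens, 1)  # word frequency list
bigrams = ngram_counts(tokens, 2)   # two-word sequences

# Print the bigrams, most frequent first.
for gram, count in bigrams.most_common(5):
    print(count, gram)
```

In this sketch n-grams are counted across sentence boundaries; a stricter version would split the text into sentences first. The most frequent entries of `unigrams` are typically function words, which is the usual starting point for a candidate stopwords list that is then checked by hand.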
License: CC0.