corpus_stats-haw.md


Corpus Statistics

General information about the Hawaiian corpus

  • Total number of files in the corpus: 266

Words

  • Total number of words in the corpus: 10715321
  • Average number of words per file: 40283
  • Longest file in corpus (words): 390161 (keku4.txt)
  • Shortest file in corpus (words): 74 (aplc07.txt)
  • Top 5 most common words in corpus: i (322041 - 3.00%), ka (310362 - 2.89%), ke (132437 - 1.23%), ia (102421 - 0.95%), ma (83400 - 0.77%)
  • Top 5 least common words in corpus: ponaponawaikuamoo (1), ponetine (1), poneeaku (1), poneaikai (1), pone-to (1)
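The averages and percentages above follow from the totals by simple arithmetic. The sketch below reproduces them, assuming the report truncates rather than rounds (an inference from the published figures, not something the report states). The helper names `average_per_file` and `share_percent` are hypothetical.

```python
# Figures copied from the report above.
TOTAL_FILES = 266
TOTAL_WORDS = 10715321

def average_per_file(total, files):
    """Integer average, discarding the fractional part (assumed behaviour)."""
    return total // files

def share_percent(count, total):
    """Share of `count` in `total` as a percentage, truncated to two decimals (assumed)."""
    return (count * 10000 // total) / 100

top_words = {"i": 322041, "ka": 310362, "ke": 132437, "ia": 102421, "ma": 83400}

print(average_per_file(TOTAL_WORDS, TOTAL_FILES))
for word, count in top_words.items():
    print(word, share_percent(count, TOTAL_WORDS))
```

Under these assumptions the computed values agree with the listed ones, for example 2.89% for "ka", where rounding would give 2.90%.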

N-grams

  • Top 5 most common 2-grams: o,ka (4877), i,ka (4119), me,ka (2949), a,me (2876), o,ke (1913)
  • Top 5 most common 3-grams: a,me,ka (12209), no,ka,mea (7566), i,loko,o (5234), a,me,nā (4826), ʻo,ia,i (4389)
  • Top 5 most common 4-grams: no,ka,mea,ua (2270), i,loko,o,ka (1919), he,wahi,moʻolelo,no (1297), moʻolelo,no,kauaʻula,a (1281), no,kauaʻula,a,me (1281)
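For readers unfamiliar with the notation, an n-gram here is a run of n consecutive words, written with commas between the words. The following is a minimal sketch of how such counts could be produced. It assumes whitespace tokenization and a single continuous token stream, which may differ from how the report was actually generated; `count_ngrams` is a hypothetical helper.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, keyed as 'w1,w2,...,wn'."""
    return Counter(
        ",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Toy input, not taken from the corpus files.
tokens = "no ka mea ua hele no ka mea".split()

trigrams = count_ngrams(tokens, 3)
print(trigrams.most_common(1))
```

The same function yields 2-grams or 4-grams by changing `n`.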

Lines

  • Total number of lines in the corpus: 1118176
  • Average number of lines per file: 4203
  • Longest file in corpus (lines): 37281 (kam.txt)
  • Shortest file in corpus (lines): 26 (aplc07.txt)
  • Total number of lines in the frequency file: 96140
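The per-file figures in this report (totals, averages, longest and shortest file by words and by lines) can be derived in one pass over the files. The sketch below illustrates the idea on in-memory strings. It assumes words are whitespace-separated and averages are truncated; the function name `corpus_stats` and the toy file names are hypothetical.

```python
def corpus_stats(texts):
    """Summarize word and line counts for a mapping of file name -> text."""
    word_counts = {name: len(text.split()) for name, text in texts.items()}
    line_counts = {name: len(text.splitlines()) for name, text in texts.items()}

    def summarize(counts):
        total = sum(counts.values())
        longest = max(counts, key=counts.get)
        shortest = min(counts, key=counts.get)
        return {
            "total": total,
            "average": total // len(counts),  # truncated, by assumption
            "longest": (longest, counts[longest]),
            "shortest": (shortest, counts[shortest]),
        }

    return {
        "files": len(texts),
        "words": summarize(word_counts),
        "lines": summarize(line_counts),
    }

# Toy input standing in for corpus files.
toy = {
    "a.txt": "aloha kakou\nno ka mea\n",
    "b.txt": "he wahi moʻolelo\n",
}
stats = corpus_stats(toy)
print(stats)
```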