Skip to content

Handy frequency lists of words in Italian, collected from the ItWaC corpus (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009).

License

Notifications You must be signed in to change notification settings

franfranz/Word_Frequency_Lists_ITA

Repository files navigation

Word_frequency_Lists_ITA

Handy frequency lists for Italian lexical words calculated from the corpus ItWac (Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E., 2009).

Contents:

NOUNS

  • itwac_nouns_lemmas_notail_2_0_0.csv List of word forms tagged as NOUNS. The minimum token frequency in this list is 3. Contains: wordform, lemma, POS, frequency (raw), frequency per million words (fpmw), frequency (zipf). Encoding: utf-8. Calculated using countlemma_v2.0.0

  • itwac_nouns_lemmas_raw_2_0_0.zip List of word forms tagged as NOUNS. The minimum token frequency in this list is 1. Contains: wordform, lemma, POS, frequency (raw), frequency per million words (fpmw), frequency (zipf). Encoding: utf-8. Calculated using countlemma_v2.0.0

VERBS

  • itwac_verbs_lemmas_notail_2_1_0.csv List of word forms tagged as lexical VERBS (no auxiliary verbs). Contains: wordform, lemma, POS, modality, POS2 (ideally, functional verbs), frequency (raw), frequency per million words (fpmw), frequency (zipf). Encoding: utf-8. Calculated using countlemma_verb_2_1_0.

  • itwac_verbs_list_of_lemmas_2_1_0.csv List of lemmas from most to least represented across lexical VERB wordforms. Encoding: utf-8. Calculated using countlemma_verb_2_1_0.

ADJECTIVES

  • itwac_adj_lemmas_notail_2_1_0.csv List of word forms tagged as ADJ. The minimum token frequency in this list is 3. Contains: wordform, lemma, POS, frequency (raw), frequency per million words (fpmw), frequency (zipf). Encoding: utf-8. Calculated using countlemmaADJ

  • itwac_adj_lemmas_raw_2_1_0.zip List of word forms tagged as ADJ. The minimum token frequency in this list is 1. Contains: wordform, lemma, POS, frequency (raw), frequency per million words (fpmw), frequency (zipf). Encoding: utf-8. Calculated using countlemmaADJ

  • countlemma_v2.0.0.R Code used to provide a frequency list of all the NOUN forms present in Itwac, tagged for POS and lemma. This version is less time consuming in handling big files if compared to v_1.

  • countlemma_verb_2_1_0.R Code used to provide a frequency list of all the VERB forms present in Itwac, tagged for POS, lemma, modality.

  • countlemma_adj.R Code used to provide a frequency list of all the ADJ forms present in Itwac, tagged for POS and lemma.

About

Handy frequency lists of words in Italian, collected from the ItWaC corpus (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages