alvations/sugali

sugali

This is the legacy repository of a language identification project covering many (many) languages, built for the software project course "NLP projects for low-resource languages".

The final technical report (poster) is available at http://www.coli.uni-saarland.de/courses/cl4lrl-swp/data/SugaliPoster.pdf

Description

Given a string of text in an arbitrary language, can we train a system to recognize what language the text is written in? The project uses three main sources of data: the Universal Declaration of Human Rights, Wikipedia, and ODIN, plus some portions of the data available from Omniglot. The resulting system covers well over 1000 languages.
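
This README does not say which features or classifier the project used, so the following is only an illustrative sketch of the task, not the project's method. It assumes a common baseline: character trigram profiles compared by cosine similarity. All function names (`char_ngrams`, `cosine`, `train`, `identify`) and the three-language toy training set are hypothetical. The training sentences are the opening of Article 1 of the Universal Declaration of Human Rights.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one n-gram profile per language from {language_code: text}."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (assumption: one sentence per language, for illustration only).
SAMPLES = {
    "eng": "all human beings are born free and equal in dignity and rights",
    "deu": "alle menschen sind frei und gleich an würde und rechten geboren",
    "fra": "tous les êtres humains naissent libres et égaux en dignité et en droits",
}

profiles = train(SAMPLES)
print(identify("the rights of the people", profiles))
```

With so little training text the sketch only separates inputs that share trigrams with one of the sample sentences. A realistic system would need far more data per language and some handling of unseen n-grams.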

As a spin-off, we've also produced the SeedLing corpus, with data from over 1000 languages. The corpus is freely available on the SeedLing GitHub repository. The reference paper for the corpus is at https://www.aclweb.org/anthology/W14-2211/

Credits

  • Susanne Fertmann
  • Guy Emerson
  • Liling Tan
  • Alexis Palmer
  • Michaela Regneri

Cite

If you need to refer to the poster or the code, feel free to cite:

@misc{sugali,
  author = {Susanne Fertmann and Guy Emerson and Liling Tan},
  title = {Language Identification for Low-Resource Languages},
  year = {2014},
  url = "https://github.com/alvations/sugali/",
  institution = {Saarland University, Germany},
  note = "Technical Report for NLP projects for low-resource languages. Saarland, Germany"
}
