louismullie edited this page Mar 22, 2012 · 72 revisions

Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools in C, Ruby and Java for document retrieval, parsing, annotation, and information extraction.

Warning

This GitHub project is currently out of sync with the gem. I am holding off on pushing a new gem until I can put out a well-tested release. This wiki applies to the latest version of Treat, i.e. the one on GitHub. Things are moving fast; even if it builds, this library is still alpha.

**Current features**
  • Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
  • Text retrieval with indexation and full-text search (Ferret)
  • Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
  • Word inflectors, including stemmers, conjugators, declensors, and number inflection
  • Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
  • Language, date/time, general topic and keyword extraction
  • Simple text statistics (frequency, TF*IDF)
  • Serialization of annotated entities to YAML or XML format
  • Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
  • Linguistic resources, including full ISO-639-1 and ISO-639-2 support, and tag alignments for five treebanks
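
To illustrate the TF*IDF statistic mentioned in the list above, here is a minimal plain-Ruby sketch. It does not use Treat's API; the method names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) and the sample documents are hypothetical and chosen only for illustration. It assumes the common unsmoothed definition: term frequency is the term's count divided by document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term.

```ruby
# Plain-Ruby sketch of TF*IDF. This is NOT Treat's API; all names are hypothetical.
# Each document is represented as an array of lowercase word tokens.

# Term frequency: occurrences of the term divided by the document length.
def term_frequency(term, doc)
  return 0.0 if doc.empty?
  doc.count(term).to_f / doc.size
end

# Inverse document frequency: log(N / number of documents containing the term).
# Returns 0.0 when no document contains the term, to avoid dividing by zero.
def inverse_document_frequency(term, docs)
  containing = docs.count { |doc| doc.include?(term) }
  return 0.0 if containing.zero?
  Math.log(docs.size.to_f / containing)
end

# TF*IDF is the product of the two quantities.
def tf_idf(term, doc, docs)
  term_frequency(term, doc) * inverse_document_frequency(term, docs)
end

# Hypothetical sample collection.
docs = [
  %w[the cat sat on the mat],
  %w[the dog sat on the log],
  %w[the cat chased the dog]
]

puts tf_idf('mat', docs[0], docs) # rare term: positive score
puts tf_idf('the', docs[0], docs) # appears in every document: score is 0.0
```

A word that occurs in every document gets an inverse document frequency of log(1) = 0, so common words such as "the" drop out, while words concentrated in a few documents score higher.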

**Caveats/Planned features**
  • The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will need to be ported to inline C.
  • The XML deserializer is currently broken and needs to be fixed.
  • An interface to a faster, Java-based WordNet API is planned.

**License**

This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
