Home
louismullie edited this page Mar 22, 2012 · 72 revisions
Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools in C, Ruby and Java for document retrieval, parsing, annotation, and information extraction.
Warning
This GitHub project is currently out-of-sync with the gem. I am holding back on pushing a new gem until I am able to put out a well-tested release. This Wiki is applicable to the latest version of Treat, i.e. the one on GitHub. Things are currently moving fast; this is still far from a stable library.
Resources
- Read the latest documentation.
- See how to install Treat.
- Learn how to use Treat.
- Help out by contributing to the project.
- View a list of papers about tools included in this toolkit.
- Open an issue and get a quick answer.
**Current features**
- Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
- Text retrieval with indexing and full-text search (Ferret)
- Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
- Word inflectors, including stemmers, conjugators, declensors, and number inflectors
- Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
- Language, date, time and named entity extraction, as well as coreference resolution
- Topic extraction (LDA or Reuters-trained model)
- Simple text statistics (frequency, TF*IDF)
- Serialization of annotated entities to YAML or XML format
- Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
- Linguistic resources, including full ISO 639-1 and ISO 639-2 support, and tag alignments for five treebanks
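
To make the "Simple text statistics" item concrete, here is a minimal sketch of a TF*IDF computation in plain Ruby. It does not use Treat's API, and it is not necessarily how Treat computes these scores: the method names (`tokenize`, `term_frequencies`, `inverse_document_frequencies`, `tf_idf`), the naive tokenizer, and the choice of `log(N / df)` for the inverse document frequency are all assumptions made for illustration.

```ruby
# Plain-Ruby TF*IDF sketch. All method names here are hypothetical
# and are not part of Treat.

# Split a text into lowercase word tokens (a deliberately naive tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Term frequency: occurrences of each term divided by the document length.
def term_frequencies(tokens)
  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 }
  counts.each_with_object({}) { |(term, n), tf| tf[term] = n.to_f / tokens.size }
end

# Inverse document frequency: log(N / number of documents containing the term).
def inverse_document_frequencies(tokenized_docs)
  doc_counts = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |t| doc_counts[t] += 1 } }
  n = tokenized_docs.size.to_f
  doc_counts.each_with_object({}) { |(term, df), idf| idf[term] = Math.log(n / df) }
end

# TF*IDF score of every term in every document (one hash per document).
def tf_idf(docs)
  tokenized = docs.map { |d| tokenize(d) }
  idf = inverse_document_frequencies(tokenized)
  tokenized.map do |tokens|
    term_frequencies(tokens).each_with_object({}) do |(term, tf), scores|
      scores[term] = tf * idf[term]
    end
  end
end

docs = [
  "the cat sat on the mat",
  "the dog sat on the log",
  "cats and dogs"
]
scores = tf_idf(docs)
```

With these three sample documents, a term that occurs in only one document (such as "cat") scores higher in that document than a term shared with another document (such as "the"), which is the behavior TF*IDF is meant to capture.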
**Caveats/Planned features**
- The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will be ported to inline C.
- The XML deserializer is currently broken and needs to be fixed.
- The APIs for the Stanford Coreference Resolver and the NER system need to be integrated with the parser so that coreferences and tags can be retrieved together with the parse tree. Currently, they can only be retrieved separately.
- Tests need to be improved for extractors and processors.
- A faster Java-based WordNet API will be integrated.
**License**
This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.