Contributing
louismullie edited this page Jul 11, 2012 · 6 revisions
Here is a list of ideas for contributing to the project. If you want to write bindings for Java or C libraries, I would prefer that you package them as a gem, to keep the core code as clean as possible. Some of the current core code will probably also be moved to external gems eventually.
Information Extraction
- ABNER - "ABNER is a software tool for molecular biology text analysis." (http://pages.cs.wisc.edu/~bsettles/abner/).
- Mark Watson has a Java utility to identify human names and locations (http://www.markwatson.com/opensource/).
- Ariel - "Ariel is a Ruby library that allows you to extract information from semi-structured documents (such as websites)." (http://ariel.rubyforge.org/)
Crawling and Indexing
- Ferret gem - "Ferret is a port of the Java Lucene project."
- Spidr gem - "Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely." (http://spidr.rubyforge.org/)
- Anemone
- Sphinx
Semantics, RDF and logical inference
- JAWS - "Java API for Wordnet Searching" (http://lyle.smu.edu/~tspell/jaws/index.html)
- YAWNI - "Yawni is a pure Java object-oriented interface to the WordNet database of lexical relationships." (http://yawni.sourceforge.net/wiki/index.php)
- RDF.rb - "A pure-Ruby library for working with Resource Description Framework (RDF) data." (http://rdf.rubyforge.org/)
- Text2rdf library - "A text mining application to extract terms and phrases from the text documents and annotate them with domain specific terminologies." (http://code.google.com/p/text2rdf/)
- OpenCyc - "The world's largest and most complete general knowledge base and common-sense reasoning engine" (http://www.opencyc.org/)
- Mark Watson has JRuby bindings for the PowerLoom AI reasoning and knowledge representation system
Text Statistics
- SWIG bindings for LIBLINEAR, a high performance machine learning library for large scale text mining (https://github.com/tomz/liblinear-ruby-swig)
- PLDA - "A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation" (http://code.google.com/p/plda/)
- SRILM utilities - Many n-gram utilities (http://www.speech.sri.com/projects/srilm/manpages/)
- RSemantic gem - "A document vector search with flexible matrix transforms for Ruby." (https://github.com/josephwilk/rsemantic/wiki/)
Machine Learning and Artificial Intelligence
- Here is a good list of artificial intelligence libraries in Ruby and Java: http://web.media.mit.edu/~dustin/papers/ai_ruby_plugins/
- AI4R - "AI4R is a collection of ruby algorithms implementations, covering several Artificial intelligence fields." (http://ai4r.rubyforge.org/)
- SVMlight - "SVMlight is an implementation of Support Vector Machines (SVMs) in C." (http://svmlight.joachims.org/)
- GHMM - "The General Hidden Markov Model library (GHMM) is a freely available C library implementing efficient data structures and algorithms for basic and extended HMMs with discrete and continuous emissions." (http://ghmm.org/)
Math and Graph
- Tapas Kanungo has written a very nice statistical package in Ruby (http://mastarpj.nict.go.jp/~mutiyama/software.html#docs) as well as a Hidden Markov Toolkit and a K-Means clustering toolkit.
- ROOT - "ROOT is an object-oriented framework aimed at solving the data analysis challenges of high-energy physics" (has very nice Math and Graph modules) (http://root.cern.ch/root/HowtoRuby.html)
Parsers and Chunkers
- Link Parser - "The Link Grammar Parser is a syntactic parser of English, based on link grammar." (http://www.link.cs.cmu.edu/link/api/index.html)
- PET Parser - "A platform for experimentation with efficient HPSG processing techniques." (http://heartofgold.dfki.de/PET.html)
- Berkeley Parser - "A natural language parser from UC Berkeley." (http://code.google.com/p/berkeleyparser/)
- Alpino Parser - "Alpino is a dependency parser for Dutch." (http://www.let.rug.nl/vannoord/alp/Alpino/)
- Tapas Kanungo has written a page chunking utility (http://www.kanungo.com/software/software.html).
Toolkits
- OpenNLP - "The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text." (http://incubator.apache.org/opennlp/)
- NLTK - "Open source Python modules for research and development in natural language processing and text analytics." (http://www.nltk.org/code)
- LingPipe - "LingPipe is tool kit for processing text using computational linguistics." (http://alias-i.com/lingpipe/)
Stemmers
- Fast-stemmer - "Wrappers for a multithreaded Porter stemming algorithm." (https://github.com/romanbsd/fast-stemmer)
- Sphinx Stemmer - (https://github.com/joho/sphinx-stemmer-ruby-port/tree/master/ruby)
Taggers
- Citar - "Citar is a simple part-of-speech tagger, based on a trigram Hidden Markov Model (HMM)." (https://github.com/danieldk/citar)
- Mark Watson has some C++ and C# utilities for text tagging (http://www.markwatson.com/opensource/)
- ACOPOST - "A free and open source collection of part-of-speech taggers." (http://acopost.sourceforge.net/)
- Claws tagger - Part-of-speech tagger for English. (http://ucrel.lancs.ac.uk/claws/)
- FastTag - "FastTag, a fast Java part of speech tagger." (http://www.markwatson.com/opensource/)
Inflection
- NLP gem - "Tools for processing polish language. Tokenization, scanning, categorization." (https://rubygems.org/gems/nlp)