Data Analysis in RUby
daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data.
daru is inspired by pandas, a very mature solution in Python.
daru is written in pure Ruby, so it should work with all Ruby implementations. It has been tested with MRI 2.0, 2.1 and 2.2.
- Data structures:
  - Vector - A basic 1-D vector.
  - DataFrame - A 2-D spreadsheet-like structure for manipulating and storing data sets. This is daru's primary data structure.
- Compatible with IRuby notebook, statsample and statsample-glm.
- Singly and hierarchically indexed data structures.
- Flexible and intuitive API for manipulation and analysis of data.
- Easy plotting, statistics and arithmetic.
- Plentiful iterators.
- Optional speed and space optimization on MRI with NMatrix and GSL.
- Easy splitting, aggregation and grouping of data.
- Quick data summaries with pivot tables.
- Import and export of datasets from and to Excel, CSV, databases and plain text files.
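To make the grouping and pivot-table features above concrete, here is a minimal plain-Ruby sketch of what splitting, aggregating and pivoting compute. It deliberately does not call daru, so it is not a picture of daru's API. The `rows` sample data and the `pivot_sum` helper are hypothetical names introduced only for illustration.

```ruby
# Hypothetical sample data: one hash per row of a small data set.
rows = [
  { city: 'Pune',   year: 2014, sales: 10 },
  { city: 'Pune',   year: 2015, sales: 15 },
  { city: 'Mumbai', year: 2014, sales: 20 },
  { city: 'Mumbai', year: 2015, sales: 5 }
]

# Split the rows by city, then aggregate each group by summing its sales.
sales_by_city = rows.group_by { |r| r[:city] }
                    .map { |city, group| [city, group.map { |r| r[:sales] }.reduce(0, :+)] }
                    .to_h

# Hypothetical helper (not part of daru): pivot rows into a nested hash
# of the form index value -> column value -> summed value.
def pivot_sum(rows, index, column, value)
  table = Hash.new { |h, k| h[k] = Hash.new(0) }
  rows.each { |r| table[r[index]][r[column]] += r[value] }
  table
end

sales_pivot = pivot_sum(rows, :city, :year, :sales)
```

Here `sales_by_city` holds one total per city, while `sales_pivot` keeps one total per city and year, which is the kind of summary a pivot table provides.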
- Basic Creation of Vectors and DataFrame
- Detailed Usage of Daru::Vector
- Detailed Usage of Daru::DataFrame
- Visualizing Data With Daru::DataFrame
- Grouping, Splitting and Pivoting Data
- Logistic Regression Analysis with daru and statsample-glm
- Finding and Plotting most heard artists from a Last.fm dataset
- Data Analysis in RUby: Basic data manipulation and plotting
- Data Analysis in RUby: Splitting, sorting, aggregating data and data types
Docs can be found here.
- Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
- Basic Data manipulation and analysis operations:
- DF concat
- Assignment of a column to a single number should set the entire column to that number.
- == between daru_vector and string/number.
- Multiple column assignment with []=
- Multiple value assignment for vectors with []=.
- A #find_max function that evaluates a block for each row and returns the row for which the block's value is the maximum.
- Function to check if a value of a row/vector is within a specified range.
- Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
- Sort by index.
- Statistics on DataFrame over rows and columns.
- Cumulative sum.
- Calculate percentage change.
- Have some sample data sets for users to play around with. Should be able to load these from the code itself.
- Sorting with missing data present.
- Change internals of indexes to raise errors when a particular index is missing and the passed key is a Fixnum. Right now we just return the Fixnum for convenience.
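Two of the planned operations above, cumulative sum and percentage change, can be sketched in plain Ruby to show the intended results. This is only an illustration under stated assumptions: `cumulative_sum` and `percent_change` are hypothetical helper names, not daru methods, and the handling of missing (`nil`) values is one possible choice rather than the library's decided behaviour.

```ruby
# Hypothetical helper: running total of an array. Missing (nil) entries
# stay nil in the result and do not contribute to the total.
def cumulative_sum(values)
  total = 0
  values.map do |v|
    next nil if v.nil?
    total += v
  end
end

# Hypothetical helper: change of each element relative to the previous one,
# expressed as a fraction (0.1 means +10%). The first element has no
# predecessor, so its change is nil. Assumes no nil or zero predecessors.
def percent_change(values)
  values.each_with_index.map do |v, i|
    next nil if i.zero?
    prev = values[i - 1]
    (v - prev).to_f / prev
  end
end

running = cumulative_sum([1, 2, nil, 4])  # => [1, 3, nil, 7]
changes = percent_change([100, 110, 99])  # first entry is nil
```

Both helpers return a new array of the same length as the input, so the result could be stored alongside the original values under the same index.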
Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!
- Google and the Ruby Science Foundation for the Google Summer of Code 2015 grant for further developing daru and integrating it with other ruby gems.
- Thank you last.fm for making user data accessible to the public.
Copyright (c) 2015, Sameer Deshmukh. All rights reserved.