Tabular data structure implementation for data analysis.
The dataframe library provides an interface for representing numerical data in tables with rows and columns. It is inspired by the various dataframe implementations found in R, Python and Racket.
The dataframe library also provides functions for saving dataframes to ports and loading them back, as well as routines for descriptive statistics and linear regression.
Each dataframe consists of a collection of columns. Each column is an object consisting of a unique key, a data collection, and an association list of properties. The following operations are defined on columns.
(column? obj) Returns true if the given object is a column.
(get-column-properties column) Returns an association list with the properties of the column.
(get-column-key column) Returns the key of the column.
(get-column-collection column) Returns the data collection of the column.
(column-deserialize column port) Loads the data collection of a column from the given port.
(column-serialize column port) Stores the data collection of a column to the given port in an s-expression format.
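The column structure described above (a key, a data collection, and an association list of properties) can be pictured with a small stand-alone model. This is only a sketch: the toy-column record and its accessors are hypothetical names invented for illustration and are not the library's actual representation. It assumes CHICKEN 5, where define-record-type is provided by (chicken base).

```scheme
;; Hypothetical model of a column; NOT the dataframe library's own type.
(import scheme (chicken base))

(define-record-type toy-column
  (make-toy-column key collection properties)
  toy-column?
  (key toy-column-key)                 ; unique key, here a symbol
  (collection toy-column-collection)   ; the data, here a plain list
  (properties toy-column-properties))  ; association list of properties

;; A column keyed by 'base with three values and one property
(define base
  (make-toy-column 'base '(1 2 3) '((units . "s"))))

(toy-column-key base)                            ; => base
(toy-column-collection base)                     ; => (1 2 3)
(cdr (assq 'units (toy-column-properties base))) ; => "s"
```

Because the properties are an ordinary association list, standard procedures such as assq are enough to look up a single property.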
(make-data-frame [column-key-compare: compare-symbol]) Creates and returns a new dataframe. The optional keyword argument column-key-compare is a procedure that specifies how column keys are compared; the default compares symbols.
(df-insert-column df key collection properties) Inserts a new column with the given key, data collection, and properties. Returns a new dataframe with the inserted column.
(df-insert-derived df parent-key key proc properties) Inserts a derived column, that is a column whose data elements are obtained by mapping a procedure onto the elements of an existing (parent) column. Returns a new dataframe with the inserted column.
(df-insert-columns df lseq) Inserts the columns contained in the given lseq of column objects.
(show df) Displays a subset of the rows and columns contained in the dataframe.
(row-count df) Returns the number of rows in the dataframe.
(df-column df key) Returns the column indicated by the given key.
(df-columns df) Returns a lazy sequence containing the columns of the dataframe.
(df-filter-columns df proc) Returns a filtered lseq of the columns of the dataframe according to the given filter predicate procedure.
(df-select-columns df keys) Returns an lseq of the columns of the dataframe that have the keys enumerated in the given list of keys.
(df-keys df) Returns the keys of all columns in the dataframe.
(df-items df) Returns an lseq of the key-column pairs contained in the dataframe.
(apply-collections proc df key ...) Applies the given procedure to the data collections of the named columns of the dataframe and returns the result as a list.
(apply-columns proc df key ...) Applies the given procedure to the named columns of the dataframe and returns the result as a list.
(map-collections proc df key ...) Applies the given procedure to the data collections of the named columns of the dataframe and returns the result as a dataframe.
(map-columns proc df key ...) Applies the given procedure to the named columns of the dataframe and returns the result as a dataframe.
(reduce-collections proc df seed key ...) Folds the given procedure over the data collections of the named columns, starting from the given seed value.
(df-for-each-column df proc) Applies proc to each column.
(df-for-each-collection df proc) Applies proc to the data collection of each column.
(df-gen-rows df) Returns a generator procedure that returns the dataframe rows in succession.
(df-gen-columns df) Returns a generator procedure that returns the dataframe columns in succession.
(describe df port) Displays a table with the min/max/mean/sdev of each column in the dataframe.
(cmin df) Computes the minimum value of each column.
(cmax df) Computes the maximum value of each column.
(mean df) Computes the mean value of each column.
(median df) Computes the median value of each column.
(mode df) Computes the mode value of each column.
(range df) Computes the difference between the maximum and minimum values of each column.
(percentile df) Computes the percentile values of each column.
(variance df) Computes the variance of each column.
(standard-deviation df) Computes the standard deviation of each column.
(coefficient-of-variation df) Computes the coefficient of variation of each column.
(linear-regression df x y) Computes the linear regression between columns x and y.
(correlation-coefficient df x y) Computes the correlation coefficient between columns x and y.
(df-serialize df port) Stores the dataframe in an s-expression format to the given port.
(df-deserialize df port) Loads the data collections of the dataframe columns from the given port.
;; srfi-1 is imported for list-tabulate
(import scheme srfi-1 yasos dataframe dataframe-statistics)

;; Start from an empty dataframe
(define df (make-data-frame))

;; Base column: the integers -10 .. 89
(define df1
  (df-insert-column
   df
   'base
   (list-tabulate 100 (lambda (x) (- x 10)))
   '()))

;; Exponential series derived from the base column
(define df2
  (df-insert-derived
   df1 'base 'exp
   (lambda (x) (* 2.0 (exp (* 0.1 x))))
   '()))

(show df2 #f)
(describe df2 #f)
(linear-regression df2 'base 'exp)
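To make the last step of the example more concrete, the following stand-alone sketch fits a straight line to two lists of numbers by ordinary least squares. It is an assumption that linear-regression performs a fit of this kind; the procedures list-sum, list-mean and least-squares are hypothetical helpers written for this illustration and are not part of the dataframe library, and the form of the value returned by linear-regression is not shown here.

```scheme
;; Illustrative least-squares fit on plain lists; not the library's code.
(import scheme (chicken base))

(define (list-sum xs) (apply + xs))
(define (list-mean xs) (/ (list-sum xs) (length xs)))

;; Returns (slope . intercept) of the line y = intercept + slope * x
;; that minimizes the sum of squared residuals.
(define (least-squares xs ys)
  (let* ((mx (list-mean xs))
         (my (list-mean ys))
         ;; sum of products of deviations from the means
         (sxy (list-sum (map (lambda (x y) (* (- x mx) (- y my))) xs ys)))
         ;; sum of squared deviations of x
         (sxx (list-sum (map (lambda (x) (let ((d (- x mx))) (* d d))) xs)))
         (slope (/ sxy sxx)))
    (cons slope (- my (* slope mx)))))

;; Points that lie exactly on y = 1 + 2x
(define xs '(0 1 2 3 4))
(define ys (map (lambda (x) (+ 1 (* 2 x))) xs))

(define fit (least-squares xs ys))
(car fit) ; => 2 (slope)
(cdr fit) ; => 1 (intercept)
```

Note that the exp column in the example above is exponential in base, so a straight-line fit of exp against base describes it only approximately.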
Copyright 2019 Ivan Raikov
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A full copy of the GPL license can be found at http://www.gnu.org/licenses/.