Statistical Programming

smc77 edited this page Mar 31, 2012 · 8 revisions

Language-level inspiration:

Packages for inspiration:

Here is a likely-incomplete list of early requirements to get to a stage where basic linear models could be easily built in Julia. Some are specific to statistical programming, while others are language-general.

  • New data types that support NA. They might be called IntData, NumData, BoolData, StrData, etc. Issue #470.
  • An updated testing framework to better allow test-driven development. Issue #8.
  • A FactorData type, supporting optionally ordered enumerations with NAs.
  • Either named arguments with defaults (e.g., f(a, b, q=7, x="hi")) or some alternative approach for passing options to functions. Issue #485.
  • A DataFrame (or maybe DataTable is a better name) type, of heterogeneous *Data columns, complete with rownames and colnames. We should find out more about what John Chambers thinks about data.frames in S/R and how they should be done better. We should also look at the data.table implementation and also at what Pandas is doing.
  • The power of reshape2 is severely limited by the asymmetric treatment of row and column variables in a data.frame. A new data type should treat column and row variables symmetrically, and maybe a better name would be data.matrix or even data.array. A related limitation of R's data.frame is that values in a column must have the same type. Pandas corrected this issue in its data frame implementation by treating rows and columns symmetrically.
  • A deep dive into the core libraries of R and Pandas and maybe other languages to learn from previous mistakes and develop a clean, modern, orthogonal set of methods for data manipulation. For the love of god, please let Julia not have a broken sample() function like R's...
  • Formulas will probably be explicitly quoted expressions in Julia, à la lm(:(y ~ x), dat). So we just need a set of conventions (and maybe an extra operator or two).
  • csvread() and dlmread() only generate matrices. There should be similar functions that read into DataFrames, as well as functions that write DataFrames back out.
  • model.matrix and related equivalent methods on formulas.
  • A pure-Julia implementation of lm().
  • Packages/Libraries/Gems/whatever.
  • Date/Time types, inspired by Joda Time (Java) and Lubridate (R)
  • ggplot-like functionality in a core library
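
To make the formula, model.matrix, and lm() items above a bit more concrete, here is a rough sketch of how a quoted formula could be turned into a least-squares fit. It is only an illustration under several assumptions: it uses current Julia syntax (which may differ from the language as it stood when this list was written), a NamedTuple of column vectors stands in for the not-yet-existing DataFrame type, only additive terms plus an intercept are handled, and there is no NA handling. The names parse_formula, collect_terms, model_matrix, and lm_sketch are hypothetical and not part of any existing library.

```julia
using LinearAlgebra  # standard library; provides least-squares `\` for matrices

# Hypothetical helper: collect predictor symbols from the right-hand side
# of a formula. Handles a bare symbol (x) or a sum of symbols (x + z + ...).
collect_terms(s::Symbol) = [s]
function collect_terms(e::Expr)
    @assert e.head == :call && e.args[1] == :+ "only additive terms are supported in this sketch"
    return reduce(vcat, map(collect_terms, e.args[2:end]))
end

# Hypothetical helper: split a quoted formula such as :(y ~ x + z) into the
# response symbol and a vector of predictor symbols.
function parse_formula(f::Expr)
    # Assumption: the quoted formula parses as a call to `~` with two arguments.
    @assert f.head == :call && f.args[1] == :~ "expected a formula like :(y ~ x)"
    return f.args[2], collect_terms(f.args[3])
end

# Build a design matrix: an intercept column of ones, then one column per term.
# `dat` is anything whose columns can be fetched with getproperty (here a NamedTuple).
function model_matrix(terms, dat)
    cols = [Float64.(getproperty(dat, t)) for t in terms]
    n = length(first(cols))
    return hcat(ones(n), cols...)
end

# Fit by ordinary least squares; returns the coefficient vector
# (intercept first, then one coefficient per term in formula order).
function lm_sketch(f::Expr, dat)
    response, terms = parse_formula(f)
    y = Float64.(getproperty(dat, response))
    X = model_matrix(terms, dat)
    return X \ y  # least-squares solution for a tall matrix
end

# Usage: the data below satisfy y = 1 + 2x exactly,
# so the coefficients come out as approximately [1.0, 2.0].
dat = (x = [1.0, 2.0, 3.0, 4.0], y = [3.0, 5.0, 7.0, 9.0])
coef = lm_sketch(:(y ~ x), dat)
```

A real implementation would also need the conventions mentioned above (interactions, dropping the intercept, factor expansion via a FactorData type) and a decision about how rows containing NA are treated before the fit.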

Please add or edit this list as thinking evolves!