TODO

http://www.inference.vc/deep-learning-is-easy/
google 2014 paper: inferring causal impact using bayesian structural time-series models https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41854.pdf
perplexity http://en.wikipedia.org/wiki/Perplexity#Perplexity_of_a_probability_model
generalized additive models (GAM) http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/
"linearizing" variables: http://www.win-vector.com/blog/2012/07/modeling-trick-masked-variables/
adaptive lasso aka iterated reweighted l1 (see also iterated l2)
merge in notes from CS182, 6.867
Why do kernel smoothers divide the norm by λ before applying the weight fn? Isn't it dealt with in the summation in the estimator fn?
kalman filters, smoothing
particle filters
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
separating hyperplane
sigmoid kernel function
k-means++
http://hunch.net/?p=224
http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html
top 10: http://gettinggeneticsdone.blogspot.com/2010/04/top-10-algorithms-in-data-mining.html
k-means/canopy clustering
model selection criteria: akaike information criterion (AIC), minimum description length (MDL), bayesian information criterion (BIC)
common distributions: normal, beta binomial, multinomial
$t$-test: matched pairs and unmatched
mixture model for collab filtering
principal component regression (PCR), partial least squares regression (PLS)
GMMs
navia
- Veritable
  - cross-categorization: group cols as views, then group rows in each view as categories
  - monte core uses CRP
- nonparam models
  - dirichlet process
  - chinese restaurant process (CRP)
- conjugate models
- undirected models, Isings, clique potential
Kullback-Leibler analysis
http://www.youtube.com/user/sanjeev3007#p/u
H-measure: fixes some shortcomings of AUC
http://hunch.net/?p=273
http://hunch.net/~jl/projects/hash_reps/index.html
https://sites.google.com/site/icml2011sparsity/
http://2010.ladisworkshop.org/node/10#keynote1
PLSR
self organizing maps
structural equation modeling
belief networks (90s; DAGs of hidden/visible), RBMs, stacked RBMs into DBNs
bayesian smoothing: generalization of kalman filters
from quora:
mixed models

Compressed Sensing and Dictionary learning: These algorithms both exploit sparsely-activating (few ones in a zero-one vector) signals that appear in many real-world contexts Although sparsity-inducing algorithms have been around for a long time (in the form of "basis pursuit" and lasso" they've exploded recently, with many new advances (largely initiated by Donoho and Tao's proof of effectiveness). So far, these have touched MRI, radar and computer vision, and lead to an enormous amount of cross-talk between signal processing, statistics and machine learning. Watch for future advances.

Advances in linear, structural and multi-kernel Support Vector Machines: SVMs were invented in the 90's and were a huge deal. In the 2000's they've been radically extended, becoming much faster (linear case), much more powerful (multi-kernel case) and much more flexible (structural case). These three extensions have also motivated a lot of research into connections between SVMs and other algorithms and original research in convex optimization methods.

Latent Dirichlet Allocation: Made great use of an existing paradigm, graphical models, to allow for fast inference of human-interpretable (sort of) topic that make up a document collection. It really made people realize the power of graphical models, and has been endlessly extended.

Deep Learning: Deep Belief Nets restarted interest in deep-learning methods, like neural nets. These have also been coming increasingly in contact with graphical models, like LDA above. People have also shown that they can achieve state-of-the-art results, which people doubted for a long time. Deep architectures are still slow and complex, but they can do a small handful of things that other methods simply can't, and their connections to real neural architecture are tantalizing.

Huge strides in online learning: This is more than an algorithm, it's a whole setting. When you're being given data one point and one label at a time and you're not allowed to store much of it (which is common when dealing with huge data), you still want to be able to learn well from it. In the last decade, many people have successfully expanded connections to game theory (you want to minimize the "regret" of your strategy), found new optimization methods that make use of contextual and gradient information, and now are even expanding into the domain of "stochastic convex optimization", where you want to optimize over a convex function but you get one noisy piece of data at a time (the setting has existed a while, but theoretical research has been much stronger in the 2000's).Suggest Edits

wisdom

data dredging aka data fishing aka data snooping: inappropriate use of data mining to uncover misleading relationships in data
- can occur when no hyp formed in advance or reduced data reduces prob of refuting some hyp
- when doing many significance tests, expect that some results are significant by chance alone
no free lunch theorem: many hyps can fit; always assuming something
- bias-free learning is futile
- for search/optimization: all algos perform the same on avg (over all possible cost functions); improving perf requires prior info to match procedures to problems
- for compression: "No compression algorithm can compress data on average."
- http://www.aihorizon.com/essays/generalai/no_free_lunch_machine_learning.htm
curse of dimensionality: volume grows exponentially with dimensions, so everything is far away
to whiten features is to center & rescale them by dividing by SD

learning taxonomies

great resource: http://machine-learning.martinsewell.com/machine-learning.pdf
parametric v nonparametric
- parametric learning: models have params that we're estimating, eg Bayesian learning; hyp complexity fixed
  - once params learned, they're all you need to predict
- nonparametric learning: hyp complexity can grow with data
  - instance-based aka memory-based learning: hyps directly constructed from training instances, eg nearest-neighbor, kernels
  - eg LWR: must keep training set around; "re-training" all the time
- nonparametric learning (2): infinite dimensional hyp (think of as function)
  - rationale: flexibility, performance, more realistic
  - TODO
  - http://learning.eng.cam.ac.uk/zoubin/talks/nips09npb.pdf

data preparation

attribute selection
attribute discretization
data transformation
- interactions (multiplications)
data cleansing
- missing values, duplicates, noise, typos, systematic errors
- outliers, anomalies

semi-supervised

EM (TODO: add some instances of EM)
- train on labeled data, then iteratively label unlabeled data probabilistically (E-step) and re-train on examples weighted by their probabilities (M-step)
- might work w any classifier & iterative clustering algo, but must ensure feedback loop is positive; NB and EM are good pair (both assume conditional independence given class)
- can choose to weigh unlabeled data lower than labeled data
- can assume mixture of multiple components for each class (instead of just one cluster in each class): in E-step, also probabilistically assign to components
co-training
- relies on two separate views where they're
  - complementary: $f(x) = f_1(x_1) = f_2(x_2)$
  - conditionally indep given label
- pseudocode
  
  LABELED = ... UNLABELED = ... UNLABELED' = sample u examples from UNLABELED iterate: H1 = train1(view1(LABELED)) H2 = train2(view2(LABELED)) let each H add p positive and n negative examples to LABELED choose 2(p+n) from UNLABELED to replenish UNLABELED'
- use UNLABELED' bc forces H1/H2 to choose more representative examples of underlying distribution (they try to choose the examples they're most confident of)
- p,n should maintain ratio of underlying distribution
- H1 labels examples for H2, and vice-versa
- co-training incrementally labels; EM iteratively refines labels
- even when no natural split, artificial (even random) splits can work (strive for independence)
co-EM
- co-EM better than co-training better than EM
  - better than co-training probably bc it doesn't commit to labels, but weighs probabilistically
- steps
  - train A on labeled data and probabilistically label unlabeled data
  - train B on all data and probabilistically relabel
  - iterate
- good text classif results with SVMs

expectation-maximization (EM)

invented in statistics
Gibbs sampling a kind of Bayesian or stochastic analogue of (frequentist) EM
- whereas EM computes likelihoods of missing values (in the E step) and maximizes parameters (in the M step), Gibbs sampling treats missing values and parameters as random variables that we sample from and whose full distributions we want to know
- ie while EM does updates by (deterministically) computing maximum-likelihood point estimates of parameters, Gibbs does updates by sampling for the parameters instead.
- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
conn btwn gibbs sampling & LDA

data set balance

class imbalance is common problem in many problems
most common techniques
- random undersampling (RUS) of majority class
  - shouldn't affect ROC curve
  - don't undersample test set
- random oversampling (ROS) of minority class: duplicate samples
synthetic minority over-sampling technique (SMOTE)
- "informed over-sampling": for each minority sample $s$:
  - find $k$ nearest neighbors
  - randomly select $j$ of them ($j$ depends on oversampling desired)
  - randomly generate synthetic samples along the lines joining $s$ and the $j$ neighbors
- shortcomings
  - may overgeneralize bc it disregards majority class; esp problematic for highly skewed class distributions bc minority class is very sparse wrt majority class
  - inflexibility: $j$ is fixed in advance
- adaptive synthetic minority oversampling (ASMO): overcomes shortcomings
- http://www.iro.umontreal.ca/~lisa/workshop2004/slides/japkowicz_slides.ppt
- may combine SMOTE and undersampling
http://metaoptimize.com/qa/questions/232/does-subsampling-bias-the-auc
http://www.machinelearning.org/proceedings/icml2007/papers/62.pdf

ensemble methods

decision stumps aka weak learners: predict based on just one feature
bootstrap aggregating aka bagging
- given a learner (model), trains $t$ classifiers (instances of the learner), each on some bootstrap sample from the training data
- bootstrap sample: sample w replacement (replace each example)
- given a test example, classifiers take a vote (regressions take avg)
  - averages predictions, so not useful for improving linear models
- error estimate: accuracy of the classifiers on their excluded sets
boosting (adaboost): usu applied to decision trees
- given a learner, trains $t$ classifiers (instances of the learner) on samples of the training data
- when a classifier is trained, weights for misclassified instances are increased, so that next classifier must try harder to get those right
- classifiers also assigned a weight based on accuracy on training set
- given a test example, classifiers take a weighted vote
rely on randomization/perturbation to build diverse models, so only unstable learners can be used; not useful for stable learners like SVM, NB
- stable learners: produce similar models upon small perturbations to training set

supervised learning

classification: discrete/categorical
regression: continuous
ordinal regression: ranking (in btwn classification & regression; used in IR)
old statistically inspired linear methods
neural networks
- nodes aka units are connected by links propagating activations
- each node is activation function of weighted inputs
- activation functions: threshold or sigmoid/logistic (differentiable)
- bias weight $W_{0,i}$ connected to fixed input $a_0 = -1$, sets actual threshold
- acyclic/feed-forward networks or cyclic/recurrent networks (memory)
- hidden units/layers: internal units/layers
- FF networks usu arranged in layers
- perceptron: single-layer FF using threshold function; linear
- multilayer backpropagates errors to hidden layers by assuming each hidden node responsible for some fraction of the error in each of the output nodes it connects to
gradient descent for logistic regression is perceptron with sigmoid/logit
decision trees
- pruning: remove irrelevant features; eg $\chi^2$ pruning (math)
- finding split points on continuous features is usu most expensive part (math)
- ID3: standard simple learning algo
  - choose attr w max info gain (or, equivalently, min entropy)
- C4.5: standard modern decision tree learner
  - supports continuous attrs, missing values, diffing costs, pruning irrelevant attrs
support vector machines (SVM) aka kernel machine
- maximum margin classifiers, potentially with slack
- linear: similar to logistic regression
Gaussian processes: TODO
$k$-nearest-neighbors (KNN): classify objects based on closest $k$ neighbors
case-based reasoning
- knowledge-based ai
- statistical ml
anytime learning: online continuous learning
margin infused relaxation algo (MIRA): TODO
- http://aria42.com/blog/?p=216

coordinate descent

no derivative needed; simply pick a direction and move along it

gradient descent

better alternatives: conjugate gradient, BFGS, L-BFGS (limited-memory BFGS)

graph search

BFS, DFS, IDFS
best-first search: explore most promising node according to heuristic fn
- usu.: minimize distance to goal
A* search: best-first where heuristic fn is distance to goal + distance from start
- unlike best-first min-dist, may jump off a certain path you've been heading down to keep from going too far off
beam search: best-first but at ecah step keep limited # of best candidates
dijkstra: exact min cost path
- orig algo $O(|V|^2)$; using Fib heap $O(|E| + |V| \log |V|)$
- for each node, maintain a "min known distance from root"
- starting from root, relax neighbors' distances & put into/pop from PQ
- once all neighbors of a node explored, dist won't relax again
- neg edge weights cause inf loops
- pseudocode:
  
  for x in nodes: x.dist = inf x.prev = null root.dist = 0 q = PQ({root}) while x = q.pop(): for y in x.neighbors: m = min(y.dist, x.dist + edge(x,y)) if m < y.dist: y.dist = m y.prev = x // record path insert or relax y in q
bellman-ford: single-source shortest-paths all-pairs
- same as dijkstra but instead of greedily relaxing min-dist unvisited node, relax all edges, and can report neg edges (no 'shortest path')
  
  for v in vs: v.dist, v.pred = inf, null for i in range(len(vs)): for u,v in es: if u.dist + edge(u,v) < v.dist: v.dist = u.dist + edge(u,v) v.pred = u for u,v in es: if u.dist + edge(u,v) < v.dist: raise Exception('graph contains neg weight edges')

constraint satisfaction

unification
- backtracking

local search/optimization algos

hill climbing: each step adjusts a single element in the vector; various ways to choose successor
simulated annealing: adjusts all elements in the vector according to gradient of the hill
local beam search: hill-climbing in lock-step; share info so that each exploring thread jumps to one of the best successors found by any of the other threads
stochastic beam search: local +
minimax
nearest neighbors
- knn: see above
- cuckoo hashing
genetic algos

instance-based aka memory-based learning

eg nearest-neighbor, kernels
attribute weighting: randomly sample instances and check neighbors
- near hits w diff attr val decreases attr relevance
- near misses w diff attr val boosts attr relevance

nearest-neighbor

density estimation: defines joint dists over input space
unlike Bayes net, no hidden vars/unsupervised clustering
distance functions
- continuous features: don't just use Euclidian; normalize by standard deviation ($Z$-distance)
  - special case of Mahalanobis distance, which uses covariance as well
- discrete features: Hamming distance
can predict some attr $y$ with $P(y|x) = P(y,x) / P(x)$
supervised learning: predict $y=h(x)$ from $k$ nearest neighbors of $x$
- for discrete: majority vote
- for continuous: average, or do local linear regression using the $k$ points
high-dimensional problems
- slow kNN search
- curse of dimensionality on distance functions

clustering

difference from partitioning: clustering is on points that have coordinates or distances, rather than shapeless graphs
hierarchical
- agglomerative: bottom-up merging
- divisive: top-down splitting
partitional: determine all clusters at once; can be used in divisive
- $k$-means: assign each point to cluster with closest centroid; initialize with random centroids
  - can use results to determine clustroids (points closest to centroids)
- self-organizing maps (SOM) aka self-organizing feature map (SOFM)
- locality sensitive hashing (LSH): hash similar values to similar values
  - a form of dimensionality reduction
  - also used in nearest neighbor search
  - feature space vectors are sets
  - jaccard distance
  - random projection:
    - each value is a vector in hyperspace
    - divide hyperspace with a hyperplane to classify the values in 1 bit
    - repeat for more bits
  - min-hash, sim-hash are instances of this
  - min-hash: http://knol.google.com/k/simple-simhashing
  - TODO (gaussian/kernelized) random projections; see eg sklearn's impl
  - sim-hash
    
    pick hash size, eg 32 bits let V = [0] * hash size break up token sequence into features (eg n-char shingles) hash each feature for each hash if bit i is set, V[i]++ if bit i is not set, V[i]-- simhash bit i is set iff V[i] > 0
    - http://matpalm.com/resemblance/simhash/
    - http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
density-based: discover arbitrary shapes
2-way-/co-/bi-clustering: objects are clustered but also their features; if data in matrix, then rows and columns clustered simultaneously
distance measures: euclidian/2-norm, manhattan/1-norm, maximum norm aka infinity norm, hamming, inner product/cosine similarity
many clustering algos require number of clusters as input;

mean shift

"non-param" in that # modes gives # clusters, but still need kernel fn (eg bandwidth $h$)
given data points in $k$-dim space, there's a KDE PDF; find its modes
algo: given kernel fn & data pts, do for each data pt:
- start w window centered at data pt
- until convergence, find kernel-weighted mean over data pts in window (eg hypersphere, an approximation of where kernel is non-zero) and update window center
- clustering: assign data pt to the basin it falls into
theory: this actually is just ascent of gradient of KDE
used in CV: segmentation, tracking
- tracking: create confidence map in new img based on color histogram of obj in prev img; use mean shift to find peaks of confidence map
- segmentation: use discontinuity preserving smoothing
  - combine spatial and range (color) values into $x = [ x_s x_r ]$ (5-dim)
  - use special $K(x) = c K(x_s / h_s) K(x_r / h_r)$
  - filtered pixel uses orig spatial vector but new range vector of convergence pt
slow; some optimizations:
- sampling: sample KDE to get pts for new ("reduced") KDE, and run over these pts, then simply associate each pt to closest
- adaptive mean shift: vary bandwidth for ea pt: let $h$ be dist to nearest neighbor (that's L1 norm, can also use L2)
- actually move each data pt at each iteration

v-measure cluster evaluation measure

"conditional entropy-based external cluster evaluation measure"
harmonic mean of homogoneity score & completeness score (optionally weighing each differently)
other commonly used scores: purity, entropy, F-measure-like "clustering accuracy", more...
http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf

machine learning

linear classif, perceptron, SVM
non-linear classif, kernels
n-way classif, rating, ranking
anomaly detection
(logistic) regression
collab filt
feat sel
ensembles, boosting
active learning
model sel, complexity
VC-dim, generalization guarantees
mixture models, EM
topic models, markov models, HMMs
bayesian nets (and learning them)
random fields, conditional random fields, markov random fields, factor graphs, inference
inference, belief propagation
inference, junction trees
conditional models, structured prediction
- structured prediction: similar to classification but predicting an output with many possible values but still tractable to learn, e.g. structures; eg POS tags for a string of words
learning conditional models

Developing a self-driving car for the 2007 DARPA Urban Challenge (Seth Teller's Angstrom talk, 5/1/08)

40 compute cores
many other teams performed human-assisted map annotation
Seth's team focused on going almost entirely from sensing
many other teams used much fewer cores
uses of sensing
- using vision for finding paint
- using active sensing for depth
stereo algos: reconstruct 3D from cameras
- recent stereo algos have a global component that is harder to parallelize
optical flow: estimating motion across time
groups
- planning & control: how to stay on the white dotted lines; one technical focus
- perception: another technical focus
- working on the car
- software infrastructure: codebase, OS, etc.
power-bound
- also CPU-bound in the sense that they could process rawer images, at higher rates, etc.
- also IO-bound; had 2 GigE and a CAN, all mostly utilized
Anant: near future will yield cameras that have encoders
- Seth: these are not as useful for vision research
- algos need to know gradients, for instance; what does MPEG do to gradients?
looked at GPUs
- programming models still painful
- form factor limits (1U blades)
didn't want DSPs either because they didn't know what algos they ultimately wanted; everything was exploratory
Anant: suggest fifth team, on computation resources

MACHINE LEARNING

variables
- visibility
  - hidden variables, latent variables, model parameters, hypothetical variables
  - observable variables, manifest variables
- causality
  - independent variable, predictor variable, regressor, controlled variable, manipulated variable, explanatory variable
  - dependent variable, response variable, regressand, measured variable, responding variable, explained variable, outcome variable
- TODO any diff btwn visibility/causality?
bayes vs frequentist
- TODO diff btwn | and ;?
models
data mining
analyses
- mathematical
  - real analysis
  - complex analysis
- statistical
  - analysis of variance (ANOVA)
  - time-series analysis
- latent variable model
  - factor analysis
  - latent trait analysis
  - latent profile analysis
  - latent class analysis
- PCA
- exploratory data analysis

feature/dimensionality reduction

feature selection
feature extraction
- principal component analysis (PCA)
autoencoder aka sparse coding

missing values

usu. non-random; missing bc of other attr (or class label)
model-specific treatment
- discriminative, eg log reg, can't handle unknowns, need generic treatment
- generative can marginalize over unknowns
- other models, eg trees, can handle: eg treat missing as separate value, or split instance into new instances (with value filled in) weighted by frequencies of filled-in value
  - http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-11-decision-trees
generic treatments
- discard examples with missing values; should only do this if don't expect them to come up again (in test set)
- treat as special value; new discrete value or new indicator for continuous
- reduced-feature models: build models excluding missing features
- imputation: fill in missing values
  - avg/maj are most common, but generally can use any learner to predict value
  - distribution-based imputation: use the tree splitting technique described above to impute
  - EM: build model (ignoring missing), estimate missing, repeat till convergence
resources

principal component analysis (PCA)

use covariance method or SVD (preferred)
"eval decomp of data covariance matrix" method: rarely used bc inefficient
- row_data_adjusted = row_data - mean(row_data) (normalized to have unit variance?)
- find eig of compute covariance (correlation) matrix
- choose eigenvectors that have biggest eigenvalues; these are the (principal) components
- feature vector is matrix of col evecs $[e_1, e_2, \dots, e_n]
- transformed_data = trans(feature_vector) * row_data_adjusted
"SVD of mean-centered data matrix" method: faster
components optimally account for variance in features
- first component extracted accounts for max amt of total variance
  - i.e., choose new axis covering greatest "spread" in data points
  - i.e., choose new axis minimizing L2 errors
  - typically means it will be correlated w several/many vars
- second component extracted accounts for max amt of variance not accounted for by first component
results depend on scaling of variables
eval directly corresponds to fraction of total variance
- eg if total variance of 10, eval of 6.1 means 61% of total variance
can extract orthogonal components or oblique components (which may produce cleaner results but also may be more complicated to interpret)
not factor analysis; makes no assumptions about underlying causal model or latent vars; simply reduce to components that account for most of the variance
- but may sometimes produce similar results
optimal orthogonal xform but more computationally expensive than eg DCT
refs
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- http://support.sas.com/publishing/pubcat/chaps/55129.pdf

factor analysis

for when you believe latent factors cause observations

linear discriminant analysis

alt to PCA that optimizes for class separability
TODO
TODO canonical correlation analysis (CCA)
- http://www.cs.huji.ac.il/~shashua/papers/class8-PCA-LDA.pdf
despite name, it's generative; name makes sense vs PCA (which ignores labels)

independent components analysis (ICA)

PCA: finds orthogonal basis; suited for normal processes
ICA: suited for strongly non-normal processes
http://www.stanford.edu/class/cs229/notes/cs229-notes11.pdf

eigenfaces

faces = 100x100 images = 10,000D vectors
PCA on faces: keep principal eigenvectors ("eigenfaces")
use PCA trick

PCA transpose trick

rank of covar matrix is limited by # training examples: if there are $N$ examples, then there are at most $N-1$ evecs with non-0 evals
if dimensionality much larger than $N$, then transpose the observation matrix
http://blog.echen.me/2011/03/14/pca-transpose-trick/

locally linear embedding

non-linear dimensionality reduction technique TODO

trending algorithms

simple baseline trend algorithm: daily trend = most recent day's change * log(monthly total)
- http://stackoverflow.com/questions/1635703/understanding-algorithms-for-measuring-trends

decision tree learning algorithms

high variance
unsupervised clustering: put your data all in one class and synthesize a second class that is uniformly distributed throughout the same space
- RF will then try to distinguish the second class by finding the dense clusters
- http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
GINI: TODO
regression trees can fit any model, but if tree built well then constant should do well
- usu. use MSE instead of mutual information (pick attr to split where partitions have minimal variance, as opposed to greatest information gain)
- http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
entropy for discrete, gini for continuous
- may not matter
ID3
C4.5
M5
M5P
- at leaves, train linear regressors
- http://www.opentox.org/dev/documentation/components/m5p
http://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria

advice for applying ML (CS229)

always evaluate a model on holdout
model selection: use three datasets
- train params on training set
- choose hyperparams that minimize CV set
- evaluate on test set, since evaluating on CV set is more optimistic, since hyperparams were chosen to minimize CV error
- use error, which is loss without the regularization
TODO: include plots of error vs model complexity
- training error monotonically decreases
- CV error is convex above this; want to find the min point
- where they converge and error is high: bias (underfitting)
- where they diverge: variance (overfitting)
- similar for plot of error vs regularization $\lambda$ but flipped horizontally
TODO: include example learning curves, plots of error over training set size
- train error increases, CV error decreases
- high bias: curves converge; more data won't help
- high variance: curves remain far apart; more data should help
strategies
- high variance
  - more training data
  - fewer features
  - larger $\lambda$
- high bias
  - more features
  - more complexity (more NN nodes, more polynomial degrees, etc.)
  - smaller $\lambda$

generalization guarantees

bias: underfitting
variance: overfitting
Vapnik-Chervonenkis (VC) dimension
- model shatters some data if for all labelings there's some perfect params (no errs)
- VC dim of model is size of largest set that model can shatter
- eg linear classifiers: VC-dim of 3
- can predict probabilistic upper bound on test err of classification model TODO

comparing classifiers

not sure how trustworthy the following is, but mostly accurate
high-bias/low-var (eg NB) better for small training sets
low-bias/high-var (eg kNN, LR) better for bigger training sets
NB: simple; fast; converges faster than discriminative model like LR; good for small datasets
LR: many ways to regularize, so less worry about correlated features; nice prob interp useful for adjusting thresholds or getting confidence scores; easily update model w online grad desc, unlike SVMs/DTs
DT: easy to explain; non-param; RFs avoid overfitting; RFs beat SVMs
SVM: accurate; slow; too many kernels/params
http://www.quora.com/What-are-the-advantages-of-different-classification-algorithms

generalization error

cross validation, bootstrapping: methods to estimate generalization error; based on "resampling"
- often used to choose among different models; e.g. different neural network architectures; choose one with lowest gen error
cross validation
- doesn't estimate error on future data
- rather, a way to pick best learning algo (and best learning params that aren't already being numerically optimized)
- $k$-fold cross validation: divide data into $k$ subsets of roughly same size (folds)
- train $k$ times, each time leaving out one of the subsets, then testing on one subset
- LOOCV: when $k$ is the sample size ($k=n$ where $n$ is data size)
- leave $v$ out cross validation: more elaborate and expensive that leaves out all possible subsets
- diff from "split-sample"/"hold-out" method commonly used for early stopping in neural networks
  - only 1 fixed subset (the validation set) used to estimate generalization error, not $k$ subsets; no "crossing"
- cross-validation to split-sample is superior for small datasets
- LOOCV good for continuous error functions/model-selection methods, bad for discontinuous ones
- repeated random sub-sampling validation (this may be referred to as bootstrapping)
  - split multiple times randomly; avg over splits
  - split multiple times randomly; avg over splits
- stratification
  - repeated random sub-sampling validation
    - maintain same mean response value is equal in training and test sets TODO understand that sentence
    - useful if responses are dichotomous, where data has unbalanced repr of the two response values
  - stratified $k$-fold CV
    - choose folds so that mean response value is approx equal in all folds TODO understand that sentence
    - for dichotomous classification, each fold has ~same +/- labels
jackknifing: LOOCV but for estimating bias & variance, not gen error
- compute some statistic of interest in each subset
- compare avg of these to avg of entire sample to estimate the latter's bias
bootstrapping
- repeatedly analyze subsamples of the data rather than subsets
- ea subsample is a random sample with replacement from full sample
  - subsamples may be of same cardinality as full sample
- allows us to evaluate estimators
  - example from classical statistics: what's the variance of the mean? what are our confidence intervals?
  - example from machine learning: for a decision tree algo, evaluate mean & variance of (# correct)/(# elements)
- TODO clarify relationship w CV
  - seems CV is better for ML tasks, since test set != train set
  - bootstrapping works better than CV in many cases? http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
ref: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

ML vs statistics

http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf
- statistics too focused on math technique rather than ideas
- empirical validation and mathematics are complementary
- stats: prob theory, real analysis, measure theory, asymptotics, decision theory, markov chains, martingales, ergotic theory
  - tools: hyp testing, experimental design, response surface modeling, ANOVA/MANOVA, linear regression, discriminant analysis, logistic regression, GLM, canonical correlation, principal components, factor analysis
- DM
  - tools: decision trees, rule induction, nearest neighbors, clustering, association rules, feature selection, neural nets, graphical models, genetic algos, self-organizing maps, neuro-fuzzy systems
http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/

computational learning theory

probably approximately correct (PAC)
VC theory
bayesian inference
algorithmic learning theory
online machine learning

dumb learners

majority learner aka ZeroR
OneR aka stump: generate one rule per predictor, then use only the rule w smallest error

receiver operating characteristic (ROC)

based on confusion matrix aka contingency table
true positive rate aka TPR aka sensitivity: TP/P or TP/(TP+FP)
false positive rate aka FPR aka 1 - specificity: FP/N or FP/(TN+FN)
ROC curve: plot of FPR on x and TPR on y while varying threshold
- input: true labels and scores from classifier
- top left is perfect; random tends to be around FPR=TPR diagonal
- python to generate a ROC curve
  
  say sorted(zip(scores, labels), reverse=True) returns:
  
  [(.9, 1), (.8, 1), (.6, 0), (.4, 1), (.2, 0), (.1, 0)]
  
  pos, neg = labels.count(1), labels.count(0)
  
  simple implementation: doesn't sort data first
  
  thresholds = sorted(set(scores), reverse=True) for i, t in enumerate(thresholds): # num examples with predict=1 and label=1 tp = sum(label for score, label in zip(scores, labels) if score >= t and label == 1) # num examples with predict=1 and label=0 fp = sum(label for score, label in zip(scores, labels) if score >= t and label == 0) yield tp/pos, fp/neg
  
  more efficient implementation sorts data first and accumulates data
  
  tp, fp = 0
  
  this descends the threshold to each distinct value of score
  
  for score, label in sorted(zip(scores, labels), reverse=True): if label: tp += 1 else: fp += 1 yield tp/pos, fp/neg
area under curve (AUC) metric
- random classifier: AUC = .5
- perfect classifier: AUC = 1
- different apps have different prefs
  - spam filtering: want low FP
  - medical screening tests: want high TP
AUC has implicit cost functions; diff cost distributions for diff classifiers
- don't use AUC to compare diff classifiers
- http://www.springerlink.com/content/y35743hp7010g354/
http://metaoptimize.com/qa/questions/4370/how-to-generate-a-roc-curve-for-binary-classification
http://metaoptimize.com/qa/questions/232/does-subsampling-bias-the-auc

loss functions

0-1 loss
pairwise cross entropy
lambda loss
pairwise + RMSE

adaboost

idea: combine weak classifiers

initially, all examples equally weighted for some number of rounds, find a weak classifier that minimizes weighted error if weighted error >=50% accuracy, break down-weight correct examples, up-weight incorrect examples combine subordinate models into final model, each weighted inversely by training error

learning bayes net structures

still active research
search algo: modifying an initial structure (nodes, arcs)
- some assume partial topological order over vars is given
criteria
- accuracy after fitting params
- if implicit conditional independence assertions are satisfied
- MLE results in fully connected network; must penalize with bayesian (MAP/MDL) approach over network hyps (prefer simpler networks)
  - usu too many structures to sum over, so sample w MCMC

posterior simulation (TODO)

objective: evaluate integrals over density $p(x)$
- eg estimate expected value
direct sampling methods
- generally given an un-normalized density $p*(x) = c p(x)$
- uniform: grid search; $$ prohibitively many samples req'd for high dims
- importance: uniform w sampler density $q(x)$
  - $q$ should be simple enough to be sampled
  - hope that $q*$ is reasonable approx to $p*$
  - adjust est by weighting "importance" of each sample
  - hard to choose good $q$, esp in high dims
- rejection: like importance but proposal density strictly bounds target density: $c q*(x) > p*(x)$
  - sample $x$ from $q*(x)$,
  - $c$ often too big for poor choice of $q$ or high dims
MCMC
- BUGS, JAGS, etc use M-H not Gibbs; by Gibbs they mean one-var-at-a-time
  - http://stats.stackexchange.com/questions/4191/gibbs-sampling-where-do-the-full-conditionals-come-from
- metropolis-hastings
  - use local density $q(x'; x^{(t)})$, dep on current state $x^{(t)}$
  - build markov chain through state space, converging to target density $P*(x)$
  - $Q$ doesn't need to closely approx $P$
  - at time $t$, generate tentative state $x'$ from $Q$
  - eval $a = \dots$
- gibbs
  - inappropriate if any deterministic nodes (transition prob 1) or non-ergodic (disconnected components)
  - one pseudocode (for discrete dists, from http://dclib.hg.sourceforge.net/hgweb/dclib/dclib/file/474a3c84a175/dlib/bayes_utils/bayes_utils.h)
    
    iterate for node in nodes if node is evidence, skip compute dist P(node | neighbors) = normalize( for value in node.values P(node=value | node.parents) * prod(P(child | child.parents (incl node=value)) for child in node.children) ) node.currvalue = sample from dist
refs
- http://jakehofman.com/talks/nyc_rmeetup_20090709_jmh.pdf
- gelman. bayesian data analysis.

sampling

fwd sampling
- for bayesian networks
- topological order
queries w evidence: est $P(Y = y \mid E = e)$
- rejection sampling: fwd sampling, discard where $E \ne e$
  - $P(E = e)$ usu prohibitively small; too many samples required
- fwd sampling generally infeasible for markov networks; no "roots"
markov chain: probabilistic state machine
TODO

markov chain monte carlo (MCMC)

TODO taxonomy: http://brenocon.com/blog/2013/04/monte-carlo-algorithm-inputs/
award-winning expository paper: http://www.ams.org/journals/bull/2009-46-02/S0273-0979-08-01238-X/S0273-0979-08-01238-X.pdf
variational bayes TODO
- deterministic

graphical models

bayesian network aka belief network aka directed graphical model
- local markov prop: any 2 nodes conditionally indep given parents' values
- edges usu repr causality but not required (either dir)
- plates: for repeating sub vars
- eg naive bayes, neural nets
- markov blanket: neighbors; parents, children, children's parents
  - every node is conditionally indep of all other nodes in network given blanket
- dynamic bayesian network: for sequences of vars, eg time series
  - eg HMMs, kalman filter
markov network aka markov random fields aka undirected graphical model
- reprs some things bayes nets can't (cyclic deps) & vice versa (induced deps)
- eg log-linear models, gaussian markov random field TODO
- must satisfy pairwise, local, global markov props TODO
- conditional random fields TODO
factor graph
- bipartite hypergraph representing factorization of a function TODO
http://cacm.acm.org/magazines/2010/12/102122-bayesian-networks/fulltext

linear classifiers

naive bayes
- generative
- it's actually linear: http://nlp.stanford.edu/IR-book/html/htmledition/linear-versus-nonlinear-classifiers-1.html
  - choose $c_0$ iff $\log \frac{P(c_0|d)}{P(c_1|d)} = \log \frac{P(c_0)}{P(c_1)} + \sum_{k=K} \log \frac{P(t_k|c_0)}{P(t_k|c_1)} > 0$ where $K$ is the set of terms in the document
logistic regression
- discriminative
- for $K$ classes, $P(y|x) = \frac{ \exp(w_y \cdot x) }{ \sum_{y'\in{1,\dots,K}} \exp(w_{y'} \cdot x) }$
http://www.quora.com/What-is-the-difference-between-logistic-regression-and-Naive-Bayes

generative vs discriminative

discriminative: LR, perceptron, SVM
- good generalization when much data
generative: LDA, NB
- more interest in semisupervised learning bc harder to get labels
  - can train discriminatively, but no longer make use of unlabeled data
  - http://research.microsoft.com/en-us/um/people/minka/papers/LasserreBishopMinka06.pdf

GLM

families
- Bernoulli and binomial: logistic regression
- Gaussian: OLS regression
- multinomial: softmax regression
- Poisson: for modelling count-data
- gamma and exponential: for modelling continuous, non-neg RVs, eg time intervals
- beta and Dirichlet: for dists over probs
- many more...
residual deviance, null deviance TODO

maximum entropy

max entropy classifier aka multinomial logit
- generalizes logistic regression
- assumes independence of irrelevant alternatives (IIA): eg choices are blue bus, yellow bus, car, but most ppl choose btwn (any) bus and car
- unlike alts like naive bayes, doesn't require feature independence, but learning is much slower
- $P(y_i=j) = \frac{\exp(X_i \beta_j)}{1 + \sum_{j=1}^J \exp(X_i \beta_j)}$ and $P(y_i=0) = \frac{1}{1 + \sum_{j=1}^J \exp(X_i \beta_j)}$
- $\beta_j$ typically estimated using maximum a posteriori (MAP), which is an extension of max likelihood using regularization of weights

feature selection

why?
- most methods implicitly do feature selection
  - decision trees: use info gain or gain ration to decide what attrs to use as tests; many features don't get used
  - neural nets: backprop learns strong connections to some inputs, near-0 connections to others
  - kNN, MBL: weights in weighted euclidian distance
  - SVMs: max margin hyperplane may focus on important features, ignore irrelevant features
- answer: empirically degrading performance with more features; cf curse of dimensionality
  - DT (and many others, esp eg kNN) vuln to irrelevant attrs
    - as you go down the tree, less data avail bc of splitting
    - even relevant attrs can hurt; eg adding a high-signal attr deteriorates perf by being chosen high up the tree, reducing the data down the tree
  - NB robust to irrelevant attrs but vuln to redundant attrs by the same principal
- also, simpler models: faster, easier to understand, see most important
techniques
- subset search algos (see below); some subset evaluation measures:
  - filter: indep assessment based on general characteristics of data, eg simple model
  - wrapper: evaluate subset using the learner that will be ultimately used
  - keep attrs that have high class cor but low inter-cor
    - using symmetric uncertainty $U$, goodness of feature set is $\sum_j U(A_j, C) / \sqrt{\sum_{i,j} U(A_i, A_j)}$
- use learner: eg, use attrs used by DT, or use top-coef attrs in linear model, or use top 1R attrs, or use attribute weighting in instance-based learning like kNN
  - but error-based method like 1R may not be best for ranking attrs TODO supervised discretization
search algos
- metrics for filters/wrappers: F-test (sig tests are discouraged; not even sure what this is testing), AIC, BIC, $R^2$, CV accuracy
- best-first/greedy stepwise regression: most popular; data dredging?
  - forward selection: add features one/some at a time
  - backward elimination aka recursive elimination: eg this faster/sloppier version of naive backward (which is opposite of forward)
    
    start with all features in model candidates (for removal) = all features for each iteration, remove from candidates any feature whose exclusion yields no acc drop remove (from model and candidates) feature in candidates with biggest acc drop
  - combination: add & remove features, eg forward while optionally removing a feature at each step
- forward stagewise regression: "partially add" vars by increasing weight by epsilon in correct direction
  - increase weight along most-correlated var by $\epsilon$
    
    let vector r = y let vector beta = 0 iterate: find x[j] most correlated with r let delta = epsilon * sign(r, x[j]) set beta[j] += delta set r -= delta * x[j]
  - identical to LASSO for orthogonal predictors, and similar in general case
- least angle regression (LARS): fwd stagewise made fast
  - increase weight along most-correlated var until another var just as correlated, then move along the bisection of both vars, etc.
  - similar results as LASSO and fwd stagewise, which can be thought of as restricted versions of LARS; slight modifications yield LASSO/fwd stagewise
  - same complexity as OLS on full data set; also yields full path; no reason to not try it
  - http://blog.echen.me/tag/least-angle-regression/
  - http://www.stanford.edu/~hastie/TALKS/larstalk.pdf
- L1 (LASSO) regression/regularization
  - LASSO = least absolute shrinkage and selection operator
  - impl w quad prog, coordinate descent, or LARS for exact solution
  - assume attrs and response are centered at 0 and scaled to unit variance
  - minimize RMSE where L1 norm of $\beta$ bounded by $s$ (diamond in 2D)
  - (for small $s$) does selection as well as shrinkage, since it may be optimal to put all weight on fewer attrs than a bit on each
relevance determination: rank by information gain and evaluate first 10, 20, ... features
- information gain: reduction in entropy due to attribute: $G(S,A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
limitations
- feature selection can overfit
- wrapper methods can be expensive
- non-selected != non-predictor (eg redundant features)
- may discard features that domain experts want to keep
- most feature selection greedy, non-optimal
http://www.quora.com/Regression-statistics/What-is-Least-Angle-Regression-and-when-should-it-be-used
TODO stepwise regression
TODO optimality criteria
http://www.cs.cornell.edu/courses/cs578/2007fa/missing_featsel_lecture.pdf

recommender systems

some flavors: depends on how much data you have
- collaborative filtering: works best in isolation but requires enough data to combat sparsity
- content-based
- social/trust: based on network's items/activity
- usually mixing is good
"the science and the magic of user and expert feedback for improving recommendations" (telefonica)
- feedback is noisy; users inconsistent, esp. for mild ratings; re-ratings very effective in increasing prediction acc (choose milder re-rating)
- expert-based recs have many benefits
- video
- slides
if $M$ is ratings matrix with users in rows and movies in cols, then:
- for $M M^T$: one evec may be for "horror" movies, and $i$th component is user $i$'s pref for horror movies; evals are feature importance
- for $M^T M$: one evec may be for "horror" movies, and $j$th component is how horrific movie $j$ is

collaborative filtering

various similarity measures
- jaccard for binary features
- cosine similarity
- pearson correlation: useful if want insensitivity to users' differing scales
item-item similarity (first pub by glinden et al)
- predict a user's rating of an item as the weighted sum of other items' ratings by the user, where weights are item similarities:
  
  score[usr,itm] = avg(sim(itm,itm2) * score[usr,itm2] for itm2 in itms)
user-user similarity
- predict a user's rating of an item as the weighted avg of other users' ratings for the item, where weights are user similarities:
  
  score[usr,itm] = avg(sim(usr,usr2) * score[usr2,itm] for usr2 in usrs)
- http://software-carpentry.org/4_0/matrix/recommend/
evaluation metrics: RMSE is typical
TODO slope one

netflix prize

http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
identifying biases/features is just as important as modeling
- baseline rating
- user-specific effect
- movie-specific effect
- time since user's first rating (harsher critic over time)
- time since movie's first rating by anyone (early watchers bigger fans)
- let user's rating depend on # ratings for movie (user may be anti-popular)
- let user's rating depend on movie's overall rating
neighborhood models
- problems
  - neighbors aren't indep
  - diff movies should use diff # neighbors
- so
  - find neighbors using standard sim metric
  - sparse linear regression: movie rating ~ neighbors' ratings
implicit data: similar to neighborhood model, learn offset ~ whether user rated neighbor (for each neighbor)
matrix factorization: most important
- std SVD
- asymmetric SVD: don't maintain user factors; just use (possibly weighted) sum of item factors of items she rated
- SVD++: use both above
regression: can derive predictors from PCA/MDS/SVD
RBMs were applied
temporal effects: can bin movies into buckets spanning some months, and allow movie biases to change within each bin (eg Time nominated Titanic as best movie ever, causing burst in glowing reviews)
regularization: heavily used ridge regression; lasso less effective
ensemble methods: winners used GBDTs to combine 500 models
- add some useful features for clustering
  - movies ea user rated
  - users that rated ea movie
  - factor vectors of users and movies
  - hidden units of an RBM
- Bell/Koren found using earlier ensemble method that RBMs more useful when movie/user has low # ratings
- previous sols use linear regressions to combine models

random forests

bagged ensemble of decision trees (orange uses 100 by default)
each decision tree uses subset of all features (param $m$)
each decision tree may be grown fully without pruning
each decision tree trained on a bootstrapped sample of the data (as per bagging)
their reported error (confidence?) does take into account their performance on the training set, somehow
may be obvious but: despite feature sampling, feature selection still helps, since the trees are not weighted at all by their performance (if they were, then they'd implicitly be weighing their own subset of features)
efficient, parallelizable
RFs is very competitive, but also has danger of overfitting
can rank features with this rough technique:
- each tree is trained over a bootstrapped sample, so check tree accuracy on the complement ("test set")
- for each feature, randomly permute the values over the records, and re-measure accuracy
- rank/score features by how much its randomization affected tree accuracy
- accumulate scores over all trees

extremely randomized trees aka extra trees

same as RF but also split thresholds drawn randomly, then best chosen (instead of looking for most discriminative thresh)

multiclass classification

some classifiers handle it naturally
- dec trees, bayes methods, log reg, ...
generic meta learners
- must ensure that output scores are comparable! problematic for many classifiers
  - not "Bayes consistent"
  - http://metaoptimize.com/qa/questions/6543/one-vs-all-multi-class-svm
- 1 vs rest aka 1 vs all aka OVA: build classifier for each class where pos iff labeled w that class; in test, choose class w highest classifier score
- 1 vs 1 aka pairwise aka all vs all aka AVA: build classifier for each pair of classes ($\choose{n,2}$ classifiers); in test, choose class w most wins
- AVA builds more classifiers but pairwise classification much faster (?) and more balanced (easier to find best regularization), hence often faster than OVA http://www.cs.berkeley.edu/~pliang/cs294-spring08/lectures/multiclass/slides.pdf
- error-correcting output codes (ECOC)
  - each class given unique binary str of length $n$
  - train $n$ classifiers, one per bit
  - in test, run $n$ classifiers on $x$ to get $n$-bit str
    - coding must be carefully chosen to maximize hamming dist among codewords and make sure each bit is uncorrelated w other bits
    - choose coding using exhaustive codes (all $2^{k-1} - 1$ bits), column selection from exhaustive codes, randomized hill climbing, BCH codes, more
  - choose class w closest str (eg by hamming dist)
- tutorial http://courses.washington.edu/ling572/winter2011/teaching_slides/2_15_multiclass_to_binary.pdf
- generalization: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B6C31DC721E6484E27B9380C764CBF98?doi=10.1.1.73.2154&rep=rep1&type=pdf
  - ECOC is a less-general intermediate
  - orthogonal probs: how to classify example or assign it a prob dist given classifiers' results
- In Defense of One-Vs-All Classification http://jmlr.csail.mit.edu/papers/volume5/rifkin04a/rifkin04a.pdf

latent dirichlet allocation (LDA)

learns from body of entities, inferring that a particular attr belongs to a topic
to generate a document:
- num words ~ Poisson
- topic mix ~ Dirichlet (over fixed set of topics)
- to generate a word:
  - topic ~ doc's multinomial of topics
  - word ~ topic's multinomial of words
learn with MCMC (collapsed Gibbs sampling)

misc

ANOVA: continuous response
LDA: categorical response
- unlike logit/probit regression, assumes vars are normally distributed
PCA: doesn't consider class
factor analysis: builds feature combos based on diffs not similarities
transfer learning: storing knowledge gained while solving one problem and applying it to a different but related problem
- eg useful for personalization
- simple form: final prediction = sum of global and local (user-specific) models

support vector machine (SVM)

hyperparameters
- eg $C$ penalty param
- eg kernel-specific params; eg poly kernel has more hyperparams than others
slack variables introduce soft margin

bayesian optimization TODO

not entirely clear yet how this works, but rough notes follow
way to globally optimize a costly-to-evaluate function
set prior over function and combine w evidence to get posterior
permits utility-based selection of next obs; considers both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement)
how it works
- assume evaluations of a bunch of arbitrary points
- fit gaussian process and search for best next point (optimize an "acquisition function")
  - can be expected value + predicted variance (upper-confidence-bound acqusition fn)
  - or expected improvement
http://hboa.deg511.com/boa.pdf
http://atpassos.posterous.com/bayesian-optimization

kernels

from http://anyall.org/blog/2011/01/please-report-your-svms-kernel/:

An un-kernelized, linear SVM is nearly the same as logistic regression...But a quadratic kernelized SVM is much more like boosted depth-2 decision trees. It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others. (And of course, more complicated kernels do progressively more complicated and non-linear things.)

I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regression. (Which is way faster and simpler to train! You can implement SGD for it in a few lines of code!)

A linear SVM sometimes has a tiny bit better accuracy than logistic regression, because hinge loss is a tiny bit more like error rate than is log-loss. But I really doubt this would matter in any real-world application, where much bigger issues are happening (like data cleanliness, feature engineering, etc.)

If a linear classifier is doing better than non-linear ones, that’s saying something pretty important about your problem. Saying that you’re using an SVM is missing the point. An SVM is interesting only when it’s kernelized. Otherwise it’s just a needlessly complicated variant of logistic regression.

regression

multivariate regression: multiple dependent vars
multinomial regression: multiple categorical dependent vars
multiple regression: multiple indep vars, one dep var
question: what does the objective function mean?
http://kernelsvm.tripod.com/
Tutorial on SVM Regression

multicollinearity

correlation btwn predictors
causes problems for LMs, neural nets, others
trees are resistant but may affect interpretability of model

pragmatics/usage

scale params to avoid feature imbalance and numeric difficulties
kernels
- RBF: good first choice
  - in (0,1], so fewer numerical difficulties
  - just one hyperparam ($\gamma$)
  - linear is special case of RBF; behaves like sigmoid for certain params
  - "great when it works, but it's highly vulnerable both to 'junk' dimensions which carry a large amount of noise and the choice of metric"
- linear: good when many features
simplest approach to finding hyperparams: grid search

multiclass

http://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
generic approaches (OVA, AVA, etc) "not very elegant"
instead, build 2-class SVM over new feature vector that includes the class as a feature (or in general some function of it, $\Phi(x,y')$)
- in test, choose class w largest margin $y = \arg \max_{y'} w^T \Phi(x,y')$
- in train, margin is gap btwn this value for correct class and for the nearest other class, so new quadprog formulation required (see URL)

ranking

http://nlp.stanford.edu/IR-book/html/htmledition/result-ranking-by-machine-learning-1.html#sec:ranking-svm

assoc rule mining/learning, frequent itemset mining

assoc rules are just superficial variations on itemsets
apriori: best-known algo for association rule mining
- for items that are grouped in txns
- build up candidate itemsets one item at a time using BFS, but on each iteration prune out candidates, keeping only those with minimum support (frequencies) and (optionally) minimum confidence (# in subset including new item / # in current set)
- historically significant but has inefficiencies & trade-offs
FP-growth: frequent pattern growth
- param: min support
- build FP-tree data structure
  - sort each txn (itemset) by item frequency
  - build prefix tree from txns
  - chain identical nodes so we can find all tree paths containing that item
- to find frequent itemsets, for each distinct item $x$, find if it has enough support (sum of counts across all its nodes >= min support)
  - if so, yield that (+ suffix if currently recursing) as a frequent itemset
  - for each item $x$, build conditional FP-tree for $x$
    - FP-tree of itemsets ending in $x$, excluding $x$
    - build new FP-tree over paths collected by walking up parent ptrs
    - update node counts: sum of children's supports (anchor w $x$ node supports)
- fast: no candidate generation/test; use compact tree; scan DB once
- http://www.engr.uvic.ca/~seng474/association3.ppt
- http://www.engr.uvic.ca/~seng474/fptree_complete_example.ppt
- https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
- orig paper http://www.inf.unibz.it/dis/teaching/DWDM/project2010/FP-growth.pdf
similarity-based: calculate (e.g.) jaccard similarity between indicator attributes
- faster using minhash for first pass
- cohen et al, 2001

rule learning/induction

subgroup discovery: variation that focuses on discovering individual patterns of interest rather than model generation (classification/prediction rule learning)
- don't mind overlap; use downweighting instead of removal of covered items, and use quality metric like weighted relative accuracy instead of confidence/support
weighted relative accuracy: $P(X) (P(Y|X) - P(Y))$ or P(cond) (P(class|cond) - P(class))
- probs are freqs, but weighted
- instead of focusing on accuracy, focus on finding big groups
- $P(X)$ is generality; find big groups
- $P(Y|X) - P(Y)$ is relative accuracy
- $= P(X,Y) - P(Y) P(X)$ (diff btwn actual joint and expected joint if indep)
- laplace accuracy can be used for $P(Y|X)$
- better WRA: use unweighted $P(Y) = n(Y)/N$
apriori-C: apriori for classification
- discretize cont vars, binarize disc vars
- apriori considering only rules/itemsets containing class value
- post-process: keep just $n$ highest-support rules (for each class)
  - iteratively select best rule, remove covered examples, repeat sorting
- some optimizations, eg:
  - suppress generating rule if already have more general rule w satisfactory support & confidence
apriori-SD
- apriori-c w new post-process:
  - order by weighted relative accuracy quality function
  - downweight covered examples, repeat sorting
- downweighting of covered examples: $w = 1/(i+1)$ (init $w=1$)
AQ: outputs sequence of "if COND then CLASS" rules
- beam search (best-first search with pruning)
- once find rule, remove covered data, repeat
- TODO diff from CN2?
CN2: rule induction algo
- based on ideas from the ID3 decision tree algo
- significance test filters out rules that cover too few examples
- possible heuristic criteria to rank rules in beam (best-first) search
  - entropy: used in ID3 and orig CN2
    - tends to select very specific rules covering few examples; significance test is only a threshold; tihs is called downward bias
  - accuracy: behaves similarly to entropy; $P(class|cond) = n_c/n$, where:
    - $n_c$ is # examples in predicted class $c$ covered by rule
    - $n$ is total # examples covered by rule
  - laplace accuracy: $\frac{n_c + 1}{n + k}$, where $k$ is # classes
    - not sure about rationale here
  - weighted relative accuracy
- ordered v unordered: whether rule is defined conditioned on rules above it having failed
  - unordered algo change: iterate the search for each class in turn, removing only covered examples of that class when rule is found
- TODO how to know when to stop?
CN2-SD (subgroup discovery): instead of removing, downweight
- bc in CN2 only first few rules may be of interest
- can decrease multiplicatively or additively
C4.5rules
RIPPER
- faster/worse than C4.5rules, better than IREP http://www.let.rug.nl/~nerbonne/teach/learning/cohen95fast.pdf
- worse than NB for mail classif http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3844
SLIPPER: competitive w/better than ripper, c4.5rules
SD-map: based on FP-growth

backoff

m-estimate: backoff to handle sparsity
- instead of n_c / n where n_c is # matching examples in class c and n is # matching examples, use (n_c + m p_c) / (n + m) where p_c is relative class frequency (over all)
- mitchell, 1997 / lidstone's law

getting in shape for the sport of data science (jeremy howard, kaggle chief scientist)

subject: how to compete in kaggle
coding to prep data for models is most important
- else, you're just applying models
- regexes get you very far
excel is very powerful for data understanding
- cell color scales
- pivot tables
programming platform recommendations
- python, numpy, scipy, matplotlib, ipython
- c#, alglib (ML), math.net numerics (linalg), chartcontrol (charts)
- c++, eigen (powerful linalg)
- java, weka
R packages
- caret: auto find models, grid search hyperparams, etc.
  
  mdl <- train(data, "svm", center, pca)
- ggplot2: transparent scatterplot points
- ggobi: like tableau
  - brushing: select pits in one plot, highlight in others
  - not used as much in practice bc excel/ggplot2 so good
viz must be flexible more than fancy
- eg plotting hundreds of time series is surprisingly useful (won that comp)
models
- parametric models: GLM
- non-parametric models: SVM, RF
R's RF impl limitations: 32 factors, no NA handling
NA handling: fill in w median and add indicator var
decision trees use info gain or GINI to find break points
traffic prediction: excel color scales helped see the pattern/propagation of jams
cran recommendations problem
- P(user installing any pkg) * P(pkg getting installed) -> 84% AUC
- use max of probs -> 95% AUC
- use RF and a bunch more data -> 93% AUC
- use RF on max -> 83% AUC
- GLM on a bunch more data -> 97% AUC
- GLM + mark deps as installed (type of programming that's cumbersome in R, a "shit programming language"; used C#) -> 98%
- 6th place
ensemble: varying high-var learners; rely on one/few learners working well while all others cancel each other out
- can increase variance by choosing your own (set of possible) break points in continuous variables
- the smaller the subspaces/samples, the more variance, but the longer it takes
- don't prune trees; that introduces bias
http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

deep learning TODO

sparse coding aka dictionary learning aka autoencoder
learn low-level features; similar to PCA, but yields not necessarily orthogonal basis
multilayer feed-fwd neural networks using back-prop (with small hidden layer)
RBMs
https://venefrombucharest.wordpress.com/2011/04/15/an-overview-of-dictionary-learning-terminology/
http://www.stanford.edu/class/cs294a/

restricted Boltzmann machines (RBMs)

a stochastic neural net w 1 hidden layer; undirected graphical model
essentially a binary version of factor analysis (continuous)
training: contrastive divergence, which is approximate gradient descent
- randomly init weights (0-mean 0.1-SD Gaussian)
- update hidden from visible
  - activate each node with prob p_j = sigmoid(sum of weighted activations)
  - set $P = x_i x_j$ or $x_i p_j$ (can use either state or prob)
- update visible from hidden
- update hidden from visible; set $N$ same way as $P$
- adjust by $\alpha (P - N)$ where $\alpha$ is learning rate
- intuition: $P$ is rooted in observations, $N$ is RBM's own tendencies
- when iterating, discard updated visibles; just use orig training data
- TODO better understand why this works (esp $-N$)

gradient boosted decision trees

related terms: GEM algo, treeboost, additive model
favors many shallow trees (depth 2 or 3, maybe even stumps)
train tree $k$ on sample of training set where labels are residuals from model-so-far $z$
- regression: $r = y - F_{k-1}(\vec{x})$
- classif: $r = y - \frac{1}{1 + \exp(-F_{k-1}(\vec{x}))}$
- initial $r = y$
sum the outputs
general formulation defined in terms of abstract loss function
- can't actually calculate gradient, but follow residuals for steepest descent
usage
- yahoo has used this extensively: GBRank, SmoothDCG
- blender in BellKor
pros
- feature normalization not required
- auto feature selection; not prone to colinear/identical features
- easy to interpret
- easy to use diff loss functions
cons
- boosting is sequential
- compute intensive: in DTs, for each feature/value, must compute gain
- can perform poorly on high-dim sparse data (eg bag of words) bc too many nodes/trees needed
variations
- initialized GB: init $F_0$ is output of random forest

bayesian blocks

nonparametric unsupervised discretization
flexible piecewise shapes & fitness functions
cute DP algo
refs
- http://jakevdp.github.com/blog/2012/09/12/dynamic-programming-in-python/
- http://arxiv.org/pdf/1207.5578v3.pdf

systems/applications

gmail priority inbox

http://research.google.com/pubs/archive/36955.pdf
concise paper
use simple linear logistic regression for scalable learning & prediction
features: social (sender/recip metrics), msg content, thread, labels
importance metric: $p=P(a \in A, t \in (T_{min}, T_{max}) | \mathbf{f}, s)$
- action $a$ within time $t$ of delivery given whether seen ($s$)
- $T_{min}$ under 24 hours, constrained by model update frequency and gives users time to react
- $T_{max}$ in days; longer interaction periods are not considered
- [prediction error strange]
personalizing model uses transfer learning
- $s = \sum_{i=1}^n f_i g_i + \sum_{i=1}^{n+k} f_i w_i, p=\frac{1}{1 + \exp^{-s}}$
- $k$ user-specific features
use PA-II online passive-aggressive regression variant
misc
- continuous -> binary features via ID3-style algo [?]

Running Large Graph Algorithms: Evaluation of Current State-Of-the-Art and Lessons Learned (Dr. Andy Yoo, Lawrence Livermore National Laboratory, Google Tech Talk 2/18/2010)

scale-free graphs aka power-law graphs aka complex networks
- random graphs that can be constructed by adding vertices to highly connected vertices
- hubs: high-deg vertices
- small diameter
- communities
- self-similarity: recursive structural similarities within communities as overall graph structure
apps: internet/web, social/citations, bioinformatics (protein-protein interaction)
- graph mining
  - spectral graph theory: finding eigenvals/eignevecs; web mining, pagerank, personalized web search, etc
  - social network analysis: structure of the graph
    - community detection, role finding, centrality study (who's important), influence study, organization study
    - counter-terrorism
  - pattern analysis: apply subgraph pattern matching
    - fraud detection, cyber security
big graph processing challenges
- big
- complex and hard to parallelize: even basic BFS
- lesser-known: large intermediate data; eg space complexity of BFS
used bluegene
social network analysis: http://mlg.ucd.ie/files/summer/tutorial.pdf

Google ML Infrastructure

SmartASS: adwords, adsense, doubleclick; worked on by topcoder ppl
SETI: general ML infra; simon tong (acclaimed scientist, google sets)
S___: content ads
Prediction API: much smaller
learn to rank not really used in ranking algo; want explainable/tweakable results

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AI.page

AI.page

say sorted(zip(scores, labels), reverse=True) returns:

[(.9, 1), (.8, 1), (.6, 0), (.4, 1), (.2, 0), (.1, 0)]

simple implementation: doesn't sort data first

more efficient implementation sorts data first and accumulates data

this descends the threshold to each distinct value of score

movies ea user rated

users that rated ea movie

systems/applications

Files

AI.page

Latest commit

History

AI.page

File metadata and controls

say sorted(zip(scores, labels), reverse=True) returns:

[(.9, 1), (.8, 1), (.6, 0), (.4, 1), (.2, 0), (.1, 0)]

simple implementation: doesn't sort data first

more efficient implementation sorts data first and accumulates data

this descends the threshold to each distinct value of score

movies ea user rated

users that rated ea movie

systems/applications