TODO
-
google 2014 paper: inferring causal impact using bayesian structural time-series models https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41854.pdf
-
perplexity http://en.wikipedia.org/wiki/Perplexity#Perplexity_of_a_probability_model
-
generalized additive models (GAM) http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/
-
"linearizing" variables: http://www.win-vector.com/blog/2012/07/modeling-trick-masked-variables/
-
adaptive lasso aka iterated reweighted l1 (see also iterated l2)
-
merge in notes from CS182, 6.867
-
Why do kernel smoothers divide the norm by λ before applying the weight fn? Isn't it dealt with in the summation in the estimator fn?
-
kalman filters, smoothing
-
particle filters
-
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
-
separating hyperplane
-
sigmoid kernel function
-
k-means++
-
http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html
-
top 10: http://gettinggeneticsdone.blogspot.com/2010/04/top-10-algorithms-in-data-mining.html
-
k-means/canopy clustering
-
model selection criteria: akaike information criterion (AIC), minimum description length (MDL), bayesian information criterion (BIC)
-
common distributions: normal, beta binomial, multinomial
-
$t$ -test: matched pairs and unmatched -
mixture model for collab filtering
-
principal component regression (PCR), partial least squares regression (PLS)
-
GMMs
-
navia
- Veritable
- cross-categorization: group cols as views, then group rows in each view as categories
- monte core uses CRP
- nonparam models
- dirichlet process
- chinese restaurant process (CRP)
- conjugate models
- undirected models, Isings, clique potential
- Veritable
-
Kullback-Leibler analysis
-
H-measure: fixes some shortcomings of AUC
-
PLSR
-
self organizing maps
-
structural equation modeling
-
belief networks (90s; DAGs of hidden/visible), RBMs, stacked RBMs into DBNs
-
bayesian smoothing: generalization of kalman filters
-
from quora:
-
mixed models
Compressed Sensing and Dictionary learning: These algorithms both exploit sparsely-activating (few ones in a zero-one vector) signals that appear in many real-world contexts Although sparsity-inducing algorithms have been around for a long time (in the form of "basis pursuit" and lasso" they've exploded recently, with many new advances (largely initiated by Donoho and Tao's proof of effectiveness). So far, these have touched MRI, radar and computer vision, and lead to an enormous amount of cross-talk between signal processing, statistics and machine learning. Watch for future advances.
Advances in linear, structural and multi-kernel Support Vector Machines: SVMs were invented in the 90's and were a huge deal. In the 2000's they've been radically extended, becoming much faster (linear case), much more powerful (multi-kernel case) and much more flexible (structural case). These three extensions have also motivated a lot of research into connections between SVMs and other algorithms and original research in convex optimization methods.
Latent Dirichlet Allocation: Made great use of an existing paradigm, graphical models, to allow for fast inference of human-interpretable (sort of) topic that make up a document collection. It really made people realize the power of graphical models, and has been endlessly extended.
Deep Learning: Deep Belief Nets restarted interest in deep-learning methods, like neural nets. These have also been coming increasingly in contact with graphical models, like LDA above. People have also shown that they can achieve state-of-the-art results, which people doubted for a long time. Deep architectures are still slow and complex, but they can do a small handful of things that other methods simply can't, and their connections to real neural architecture are tantalizing.
Huge strides in online learning: This is more than an algorithm, it's a whole setting. When you're being given data one point and one label at a time and you're not allowed to store much of it (which is common when dealing with huge data), you still want to be able to learn well from it. In the last decade, many people have successfully expanded connections to game theory (you want to minimize the "regret" of your strategy), found new optimization methods that make use of contextual and gradient information, and now are even expanding into the domain of "stochastic convex optimization", where you want to optimize over a convex function but you get one noisy piece of data at a time (the setting has existed a while, but theoretical research has been much stronger in the 2000's).Suggest Edits
wisdom
- data dredging aka data fishing aka data snooping: inappropriate use of data
mining to uncover misleading relationships in data
- can occur when no hyp formed in advance or reduced data reduces prob of refuting some hyp
- when doing many significance tests, expect that some results are significant by chance alone
- no free lunch theorem: many hyps can fit; always assuming something
- bias-free learning is futile
- for search/optimization: all algos perform the same on avg (over all possible cost functions); improving perf requires prior info to match procedures to problems
- for compression: "No compression algorithm can compress data on average."
- http://www.aihorizon.com/essays/generalai/no_free_lunch_machine_learning.htm
- curse of dimensionality: volume grows exponentially with dimensions, so everything is far away
- to whiten features is to center & rescale them by dividing by SD
learning taxonomies
- great resource: http://machine-learning.martinsewell.com/machine-learning.pdf
- parametric v nonparametric
- parametric learning: models have params that we're estimating, eg Bayesian
learning; hyp complexity fixed
- once params learned, they're all you need to predict
- nonparametric learning: hyp complexity can grow with data
- instance-based aka memory-based learning: hyps directly constructed from training instances, eg nearest-neighbor, kernels
- eg LWR: must keep training set around; "re-training" all the time
- nonparametric learning (2): infinite dimensional hyp (think of as function)
- rationale: flexibility, performance, more realistic
- TODO
- http://learning.eng.cam.ac.uk/zoubin/talks/nips09npb.pdf
- parametric learning: models have params that we're estimating, eg Bayesian
learning; hyp complexity fixed
data preparation
- attribute selection
- attribute discretization
- data transformation
- interactions (multiplications)
- data cleansing
- missing values, duplicates, noise, typos, systematic errors
- outliers, anomalies
semi-supervised
-
EM (TODO: add some instances of EM)
- train on labeled data, then iteratively label unlabeled data probabilistically (E-step) and re-train on examples weighted by their probabilities (M-step)
- might work w any classifier & iterative clustering algo, but must ensure feedback loop is positive; NB and EM are good pair (both assume conditional independence given class)
- can choose to weigh unlabeled data lower than labeled data
- can assume mixture of multiple components for each class (instead of just one cluster in each class): in E-step, also probabilistically assign to components
-
co-training
-
relies on two separate views where they're
- complementary:
$f(x) = f_1(x_1) = f_2(x_2)$ - conditionally indep given label
- complementary:
-
pseudocode
LABELED = ... UNLABELED = ... UNLABELED' = sample u examples from UNLABELED iterate: H1 = train1(view1(LABELED)) H2 = train2(view2(LABELED)) let each H add p positive and n negative examples to LABELED choose 2(p+n) from UNLABELED to replenish UNLABELED'
-
use
UNLABELED'
bc forces H1/H2 to choose more representative examples of underlying distribution (they try to choose the examples they're most confident of) -
p,n should maintain ratio of underlying distribution
-
H1 labels examples for H2, and vice-versa
-
co-training incrementally labels; EM iteratively refines labels
-
even when no natural split, artificial (even random) splits can work (strive for independence)
-
-
co-EM
- co-EM better than co-training better than EM
- better than co-training probably bc it doesn't commit to labels, but weighs probabilistically
- steps
- train A on labeled data and probabilistically label unlabeled data
- train B on all data and probabilistically relabel
- iterate
- good text classif results with SVMs
- co-EM better than co-training better than EM
expectation-maximization (EM)
- invented in statistics
- Gibbs sampling a kind of Bayesian or stochastic analogue of (frequentist) EM
- whereas EM computes likelihoods of missing values (in the E step) and maximizes parameters (in the M step), Gibbs sampling treats missing values and parameters as random variables that we sample from and whose full distributions we want to know
- ie while EM does updates by (deterministically) computing maximum-likelihood point estimates of parameters, Gibbs does updates by sampling for the parameters instead.
- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
- conn btwn gibbs sampling & LDA
data set balance
- class imbalance is common problem in many problems
- most common techniques
- random undersampling (RUS) of majority class
- shouldn't affect ROC curve
- don't undersample test set
- random oversampling (ROS) of minority class: duplicate samples
- random undersampling (RUS) of majority class
- synthetic minority over-sampling technique (SMOTE)
- "informed over-sampling": for each minority sample
$s$ :- find
$k$ nearest neighbors - randomly select
$j$ of them ($j$ depends on oversampling desired) - randomly generate synthetic samples along the lines joining
$s$ and the$j$ neighbors
- find
- shortcomings
- may overgeneralize bc it disregards majority class; esp problematic for highly skewed class distributions bc minority class is very sparse wrt majority class
- inflexibility:
$j$ is fixed in advance
- adaptive synthetic minority oversampling (ASMO): overcomes shortcomings
- http://www.iro.umontreal.ca/~lisa/workshop2004/slides/japkowicz_slides.ppt
- may combine SMOTE and undersampling
- "informed over-sampling": for each minority sample
- http://metaoptimize.com/qa/questions/232/does-subsampling-bias-the-auc
- http://www.machinelearning.org/proceedings/icml2007/papers/62.pdf
ensemble methods
- decision stumps aka weak learners: predict based on just one feature
- bootstrap aggregating aka bagging
- given a learner (model), trains
$t$ classifiers (instances of the learner), each on some bootstrap sample from the training data - bootstrap sample: sample w replacement (replace each example)
- given a test example, classifiers take a vote (regressions take avg)
- averages predictions, so not useful for improving linear models
- error estimate: accuracy of the classifiers on their excluded sets
- given a learner (model), trains
- boosting (adaboost): usu applied to decision trees
- given a learner, trains
$t$ classifiers (instances of the learner) on samples of the training data - when a classifier is trained, weights for misclassified instances are increased, so that next classifier must try harder to get those right
- classifiers also assigned a weight based on accuracy on training set
- given a test example, classifiers take a weighted vote
- given a learner, trains
- rely on randomization/perturbation to build diverse models, so only unstable
learners can be used; not useful for stable learners like SVM, NB
- stable learners: produce similar models upon small perturbations to training set
supervised learning
-
classification: discrete/categorical
-
regression: continuous
-
ordinal regression: ranking (in btwn classification & regression; used in IR)
-
old statistically inspired linear methods
-
neural networks
- nodes aka units are connected by links propagating activations
- each node is activation function of weighted inputs
- activation functions: threshold or sigmoid/logistic (differentiable)
- bias weight
$W_{0,i}$ connected to fixed input$a_0 = -1$ , sets actual threshold - acyclic/feed-forward networks or cyclic/recurrent networks (memory)
- hidden units/layers: internal units/layers
- FF networks usu arranged in layers
- perceptron: single-layer FF using threshold function; linear
- multilayer backpropagates errors to hidden layers by assuming each hidden node responsible for some fraction of the error in each of the output nodes it connects to
-
gradient descent for logistic regression is perceptron with sigmoid/logit
-
decision trees
- pruning: remove irrelevant features; eg
$\chi^2$ pruning (math) - finding split points on continuous features is usu most expensive part (math)
- ID3: standard simple learning algo
- choose attr w max info gain (or, equivalently, min entropy)
- C4.5: standard modern decision tree learner
- supports continuous attrs, missing values, diffing costs, pruning irrelevant attrs
- pruning: remove irrelevant features; eg
-
support vector machines (SVM) aka kernel machine
- maximum margin classifiers, potentially with slack
- linear: similar to logistic regression
-
Gaussian processes: TODO
-
$k$ -nearest-neighbors (KNN): classify objects based on closest$k$ neighbors -
case-based reasoning
- knowledge-based ai
- statistical ml
-
anytime learning: online continuous learning
-
margin infused relaxation algo (MIRA): TODO
coordinate descent
- no derivative needed; simply pick a direction and move along it
gradient descent
- better alternatives: conjugate gradient, BFGS, L-BFGS (limited-memory BFGS)
graph search
-
BFS, DFS, IDFS
-
best-first search: explore most promising node according to heuristic fn
- usu.: minimize distance to goal
-
A* search: best-first where heuristic fn is distance to goal + distance from start
- unlike best-first min-dist, may jump off a certain path you've been heading down to keep from going too far off
-
beam search: best-first but at ecah step keep limited # of best candidates
-
dijkstra: exact min cost path
-
orig algo
$O(|V|^2)$ ; using Fib heap$O(|E| + |V| \log |V|)$ -
for each node, maintain a "min known distance from root"
-
starting from root, relax neighbors' distances & put into/pop from PQ
-
once all neighbors of a node explored, dist won't relax again
-
neg edge weights cause inf loops
-
pseudocode:
for x in nodes: x.dist = inf x.prev = null root.dist = 0 q = PQ({root}) while x = q.pop(): for y in x.neighbors: m = min(y.dist, x.dist + edge(x,y)) if m < y.dist: y.dist = m y.prev = x // record path insert or relax y in q
-
-
bellman-ford: single-source shortest-paths all-pairs
-
same as dijkstra but instead of greedily relaxing min-dist unvisited node, relax all edges, and can report neg edges (no 'shortest path')
for v in vs: v.dist, v.pred = inf, null for i in range(len(vs)): for u,v in es: if u.dist + edge(u,v) < v.dist: v.dist = u.dist + edge(u,v) v.pred = u for u,v in es: if u.dist + edge(u,v) < v.dist: raise Exception('graph contains neg weight edges')
-
constraint satisfaction
- unification
- backtracking
local search/optimization algos
- hill climbing: each step adjusts a single element in the vector; various ways to choose successor
- simulated annealing: adjusts all elements in the vector according to gradient of the hill
- local beam search: hill-climbing in lock-step; share info so that each exploring thread jumps to one of the best successors found by any of the other threads
- stochastic beam search: local +
- minimax
- nearest neighbors
- knn: see above
- cuckoo hashing
- genetic algos
instance-based aka memory-based learning
- eg nearest-neighbor, kernels
- attribute weighting: randomly sample instances and check neighbors
- near hits w diff attr val decreases attr relevance
- near misses w diff attr val boosts attr relevance
nearest-neighbor
- density estimation: defines joint dists over input space
- unlike Bayes net, no hidden vars/unsupervised clustering
- distance functions
- continuous features: don't just use Euclidian; normalize by standard
deviation (
$Z$ -distance)- special case of Mahalanobis distance, which uses covariance as well
- discrete features: Hamming distance
- continuous features: don't just use Euclidian; normalize by standard
deviation (
- can predict some attr
$y$ with$P(y|x) = P(y,x) / P(x)$ - supervised learning: predict
$y=h(x)$ from$k$ nearest neighbors of$x$ - for discrete: majority vote
- for continuous: average, or do local linear regression using the
$k$ points
- high-dimensional problems
- slow kNN search
- curse of dimensionality on distance functions
clustering
- difference from partitioning: clustering is on points that have coordinates or distances, rather than shapeless graphs
- hierarchical
- agglomerative: bottom-up merging
- divisive: top-down splitting
- partitional: determine all clusters at once; can be used in divisive
-
$k$ -means: assign each point to cluster with closest centroid; initialize with random centroids- can use results to determine clustroids (points closest to centroids)
- self-organizing maps (SOM) aka self-organizing feature map (SOFM)
- locality sensitive hashing (LSH): hash similar values to similar values
-
a form of dimensionality reduction
-
also used in nearest neighbor search
-
feature space vectors are sets
-
jaccard distance
-
random projection:
- each value is a vector in hyperspace
- divide hyperspace with a hyperplane to classify the values in 1 bit
- repeat for more bits
-
min-hash, sim-hash are instances of this
-
TODO (gaussian/kernelized) random projections; see eg sklearn's impl
-
sim-hash
pick hash size, eg 32 bits let V = [0] * hash size break up token sequence into features (eg n-char shingles) hash each feature for each hash if bit i is set, V[i]++ if bit i is not set, V[i]-- simhash bit i is set iff V[i] > 0
-
-
- density-based: discover arbitrary shapes
- 2-way-/co-/bi-clustering: objects are clustered but also their features; if data in matrix, then rows and columns clustered simultaneously
- distance measures: euclidian/2-norm, manhattan/1-norm, maximum norm aka infinity norm, hamming, inner product/cosine similarity
- many clustering algos require number of clusters as input;
mean shift
- "non-param" in that # modes gives # clusters, but still need kernel fn (eg
bandwidth
$h$ ) - given data points in
$k$ -dim space, there's a KDE PDF; find its modes - algo: given kernel fn & data pts, do for each data pt:
- start w window centered at data pt
- until convergence, find kernel-weighted mean over data pts in window (eg hypersphere, an approximation of where kernel is non-zero) and update window center
- clustering: assign data pt to the basin it falls into
- theory: this actually is just ascent of gradient of KDE
- used in CV: segmentation, tracking
- tracking: create confidence map in new img based on color histogram of obj in prev img; use mean shift to find peaks of confidence map
- segmentation: use discontinuity preserving smoothing
- combine spatial and range (color) values into
$x = [ x_s x_r ]$ (5-dim) - use special
$K(x) = c K(x_s / h_s) K(x_r / h_r)$ - filtered pixel uses orig spatial vector but new range vector of convergence pt
- combine spatial and range (color) values into
- slow; some optimizations:
- sampling: sample KDE to get pts for new ("reduced") KDE, and run over these pts, then simply associate each pt to closest
- adaptive mean shift: vary bandwidth for ea pt: let
$h$ be dist to nearest neighbor (that's L1 norm, can also use L2) - actually move each data pt at each iteration
v-measure cluster evaluation measure
- "conditional entropy-based external cluster evaluation measure"
- harmonic mean of homogoneity score & completeness score (optionally weighing each differently)
- other commonly used scores: purity, entropy, F-measure-like "clustering accuracy", more...
- http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf
machine learning
- linear classif, perceptron, SVM
- non-linear classif, kernels
- n-way classif, rating, ranking
- anomaly detection
- (logistic) regression
- collab filt
- feat sel
- ensembles, boosting
- active learning
- model sel, complexity
- VC-dim, generalization guarantees
- mixture models, EM
- topic models, markov models, HMMs
- bayesian nets (and learning them)
- random fields, conditional random fields, markov random fields, factor graphs, inference
- inference, belief propagation
- inference, junction trees
- conditional models, structured prediction
- structured prediction: similar to classification but predicting an output with many possible values but still tractable to learn, e.g. structures; eg POS tags for a string of words
- learning conditional models
Developing a self-driving car for the 2007 DARPA Urban Challenge (Seth Teller's Angstrom talk, 5/1/08)
- 40 compute cores
- many other teams performed human-assisted map annotation
- Seth's team focused on going almost entirely from sensing
- many other teams used much fewer cores
- uses of sensing
- using vision for finding paint
- using active sensing for depth
- stereo algos: reconstruct 3D from cameras
- recent stereo algos have a global component that is harder to parallelize
- optical flow: estimating motion across time
- groups
- planning & control: how to stay on the white dotted lines; one technical focus
- perception: another technical focus
- working on the car
- software infrastructure: codebase, OS, etc.
- power-bound
- also CPU-bound in the sense that they could process rawer images, at higher rates, etc.
- also IO-bound; had 2 GigE and a CAN, all mostly utilized
- Anant: near future will yield cameras that have encoders
- Seth: these are not as useful for vision research
- algos need to know gradients, for instance; what does MPEG do to gradients?
- looked at GPUs
- programming models still painful
- form factor limits (1U blades)
- didn't want DSPs either because they didn't know what algos they ultimately wanted; everything was exploratory
- Anant: suggest fifth team, on computation resources
MACHINE LEARNING
-
variables
- visibility
- hidden variables, latent variables, model parameters, hypothetical variables
- observable variables, manifest variables
- causality
- independent variable, predictor variable, regressor, controlled variable, manipulated variable, explanatory variable
- dependent variable, response variable, regressand, measured variable, responding variable, explained variable, outcome variable
- TODO any diff btwn visibility/causality?
- visibility
-
bayes vs frequentist
- TODO diff btwn | and ;?
-
models
-
data mining
-
analyses
- mathematical
- real analysis
- complex analysis
- statistical
- analysis of variance (ANOVA)
- time-series analysis
- latent variable model
- factor analysis
- latent trait analysis
- latent profile analysis
- latent class analysis
- PCA
- exploratory data analysis
- mathematical
feature/dimensionality reduction
- feature selection
- feature extraction
- principal component analysis (PCA)
- autoencoder aka sparse coding
missing values
- usu. non-random; missing bc of other attr (or class label)
- model-specific treatment
- discriminative, eg log reg, can't handle unknowns, need generic treatment
- generative can marginalize over unknowns
- other models, eg trees, can handle: eg treat missing as separate value, or split instance into new instances (with value filled in) weighted by frequencies of filled-in value
- generic treatments
- discard examples with missing values; should only do this if don't expect them to come up again (in test set)
- treat as special value; new discrete value or new indicator for continuous
- reduced-feature models: build models excluding missing features
- imputation: fill in missing values
- avg/maj are most common, but generally can use any learner to predict value
- distribution-based imputation: use the tree splitting technique described above to impute
- EM: build model (ignoring missing), estimate missing, repeat till convergence
- resources
principal component analysis (PCA)
- use covariance method or SVD (preferred)
- "eval decomp of data covariance matrix" method: rarely used bc inefficient
row_data_adjusted = row_data - mean(row_data)
(normalized to have unit variance?)- find eig of compute covariance (correlation) matrix
- choose eigenvectors that have biggest eigenvalues; these are the (principal) components
- feature vector is matrix of col evecs $[e_1, e_2, \dots, e_n]
transformed_data = trans(feature_vector) * row_data_adjusted
- "SVD of mean-centered data matrix" method: faster
- components optimally account for variance in features
- first component extracted accounts for max amt of total variance
- i.e., choose new axis covering greatest "spread" in data points
- i.e., choose new axis minimizing L2 errors
- typically means it will be correlated w several/many vars
- second component extracted accounts for max amt of variance not accounted for by first component
- first component extracted accounts for max amt of total variance
- results depend on scaling of variables
- eval directly corresponds to fraction of total variance
- eg if total variance of 10, eval of 6.1 means 61% of total variance
- can extract orthogonal components or oblique components (which may produce cleaner results but also may be more complicated to interpret)
- not factor analysis; makes no assumptions about underlying causal model or
latent vars; simply reduce to components that account for most of the variance
- but may sometimes produce similar results
- optimal orthogonal xform but more computationally expensive than eg DCT
- refs
factor analysis
- for when you believe latent factors cause observations
linear discriminant analysis
- alt to PCA that optimizes for class separability
- TODO
- TODO canonical correlation analysis (CCA)
- despite name, it's generative; name makes sense vs PCA (which ignores labels)
independent components analysis (ICA)
- PCA: finds orthogonal basis; suited for normal processes
- ICA: suited for strongly non-normal processes
- http://www.stanford.edu/class/cs229/notes/cs229-notes11.pdf
eigenfaces
- faces = 100x100 images = 10,000D vectors
- PCA on faces: keep principal eigenvectors ("eigenfaces")
- use PCA trick
PCA transpose trick
- rank of covar matrix is limited by # training examples: if there are
$N$ examples, then there are at most$N-1$ evecs with non-0 evals - if dimensionality much larger than
$N$ , then transpose the observation matrix - http://blog.echen.me/2011/03/14/pca-transpose-trick/
locally linear embedding
- non-linear dimensionality reduction technique TODO
trending algorithms
- simple baseline trend algorithm: daily trend = most recent day's change * log(monthly total)
decision tree learning algorithms
- high variance
- unsupervised clustering: put your data all in one class and synthesize a
second class that is uniformly distributed throughout the same space
- RF will then try to distinguish the second class by finding the dense clusters
- http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
- GINI: TODO
- regression trees can fit any model, but if tree built well then constant
should do well
- usu. use MSE instead of mutual information (pick attr to split where partitions have minimal variance, as opposed to greatest information gain)
- http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
- entropy for discrete, gini for continuous
- may not matter
- ID3
- C4.5
- M5
- M5P
- at leaves, train linear regressors
- http://www.opentox.org/dev/documentation/components/m5p
- http://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
advice for applying ML (CS229)
- always evaluate a model on holdout
- model selection: use three datasets
- train params on training set
- choose hyperparams that minimize CV set
- evaluate on test set, since evaluating on CV set is more optimistic, since hyperparams were chosen to minimize CV error
- use error, which is loss without the regularization
- TODO: include plots of error vs model complexity
- training error monotonically decreases
- CV error is convex above this; want to find the min point
- where they converge and error is high: bias (underfitting)
- where they diverge: variance (overfitting)
- similar for plot of error vs regularization
$\lambda$ but flipped horizontally
- TODO: include example learning curves, plots of error over training set size
- train error increases, CV error decreases
- high bias: curves converge; more data won't help
- high variance: curves remain far apart; more data should help
- strategies
- high variance
- more training data
- fewer features
- larger
$\lambda$
- high bias
- more features
- more complexity (more NN nodes, more polynomial degrees, etc.)
- smaller
$\lambda$
- high variance
generalization guarantees
- bias: underfitting
- variance: overfitting
- Vapnik-Chervonenkis (VC) dimension
- model shatters some data if for all labelings there's some perfect params (no errs)
- VC dim of model is size of largest set that model can shatter
- eg linear classifiers: VC-dim of 3
- can predict probabilistic upper bound on test err of classification model TODO
comparing classifiers
- not sure how trustworthy the following is, but mostly accurate
- high-bias/low-var (eg NB) better for small training sets
- low-bias/high-var (eg kNN, LR) better for bigger training sets
- NB: simple; fast; converges faster than discriminative model like LR; good for small datasets
- LR: many ways to regularize, so less worry about correlated features; nice prob interp useful for adjusting thresholds or getting confidence scores; easily update model w online grad desc, unlike SVMs/DTs
- DT: easy to explain; non-param; RFs avoid overfitting; RFs beat SVMs
- SVM: accurate; slow; too many kernels/params
- http://www.quora.com/What-are-the-advantages-of-different-classification-algorithms
generalization error
- cross validation, bootstrapping: methods to estimate generalization error;
based on "resampling"
- often used to choose among different models; e.g. different neural network architectures; choose one with lowest gen error
- cross validation
- doesn't estimate error on future data
- rather, a way to pick best learning algo (and best learning params that aren't already being numerically optimized)
-
$k$ -fold cross validation: divide data into$k$ subsets of roughly same size (folds) - train
$k$ times, each time leaving out one of the subsets, then testing on one subset - LOOCV: when
$k$ is the sample size ($k=n$ where$n$ is data size) - leave
$v$ out cross validation: more elaborate and expensive that leaves out all possible subsets - diff from "split-sample"/"hold-out" method commonly used for early stopping
in neural networks
- only 1 fixed subset (the validation set) used to estimate generalization
error, not
$k$ subsets; no "crossing"
- only 1 fixed subset (the validation set) used to estimate generalization
error, not
- cross-validation to split-sample is superior for small datasets
- LOOCV good for continuous error functions/model-selection methods, bad for discontinuous ones
- repeated random sub-sampling validation (this may be referred to as
bootstrapping)
- split multiple times randomly; avg over splits
- split multiple times randomly; avg over splits
- stratification
- repeated random sub-sampling validation
- maintain same mean response value is equal in training and test sets TODO understand that sentence
- useful if responses are dichotomous, where data has unbalanced repr of the two response values
- stratified
$k$ -fold CV- choose folds so that mean response value is approx equal in all folds TODO understand that sentence
- for dichotomous classification, each fold has ~same +/- labels
- repeated random sub-sampling validation
- jackknifing: LOOCV but for estimating bias & variance, not gen error
- compute some statistic of interest in each subset
- compare avg of these to avg of entire sample to estimate the latter's bias
- bootstrapping
- repeatedly analyze subsamples of the data rather than subsets
- ea subsample is a random sample with replacement from full sample
- subsamples may be of same cardinality as full sample
- allows us to evaluate estimators
- example from classical statistics: what's the variance of the mean? what are our confidence intervals?
- example from machine learning: for a decision tree algo, evaluate mean & variance of (# correct)/(# elements)
- TODO clarify relationship w CV
- seems CV is better for ML tasks, since test set != train set
- bootstrapping works better than CV in many cases? http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
- ref: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
ML vs statistics
- http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf
- statistics too focused on math technique rather than ideas
- empirical validation and mathematics are complementary
- stats: prob theory, real analysis, measure theory, asymptotics, decision
theory, markov chains, martingales, ergotic theory
- tools: hyp testing, experimental design, response surface modeling, ANOVA/MANOVA, linear regression, discriminant analysis, logistic regression, GLM, canonical correlation, principal components, factor analysis
- DM
- tools: decision trees, rule induction, nearest neighbors, clustering, association rules, feature selection, neural nets, graphical models, genetic algos, self-organizing maps, neuro-fuzzy systems
- http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/
computational learning theory
- probably approximately correct (PAC)
- VC theory
- bayesian inference
- algorithmic learning theory
- online machine learning
dumb learners
- majority learner aka ZeroR
- OneR aka stump: generate one rule per predictor, then use only the rule w smallest error
receiver operating characteristic (ROC)
-
based on confusion matrix aka contingency table
-
true positive rate aka TPR aka sensitivity: TP/P or TP/(TP+FP)
-
false positive rate aka FPR aka 1 - specificity: FP/N or FP/(TN+FN)
-
ROC curve: plot of FPR on x and TPR on y while varying threshold
-
input: true labels and scores from classifier
-
top left is perfect; random tends to be around FPR=TPR diagonal
-
python to generate a ROC curve
pos, neg = labels.count(1), labels.count(0)
thresholds = sorted(set(scores), reverse=True) for i, t in enumerate(thresholds): # num examples with predict=1 and label=1 tp = sum(label for score, label in zip(scores, labels) if score >= t and label == 1) # num examples with predict=1 and label=0 fp = sum(label for score, label in zip(scores, labels) if score >= t and label == 0) yield tp/pos, fp/neg
tp, fp = 0
for score, label in sorted(zip(scores, labels), reverse=True): if label: tp += 1 else: fp += 1 yield tp/pos, fp/neg
-
-
area under curve (AUC) metric
- random classifier: AUC = .5
- perfect classifier: AUC = 1
- different apps have different prefs
- spam filtering: want low FP
- medical screening tests: want high TP
-
AUC has implicit cost functions; diff cost distributions for diff classifiers
- don't use AUC to compare diff classifiers
- http://www.springerlink.com/content/y35743hp7010g354/
-
http://metaoptimize.com/qa/questions/4370/how-to-generate-a-roc-curve-for-binary-classification
-
http://metaoptimize.com/qa/questions/232/does-subsampling-bias-the-auc
loss functions
- 0-1 loss
- pairwise cross entropy
- lambda loss
- pairwise + RMSE
adaboost
-
idea: combine weak classifiers
initially, all examples equally weighted for some number of rounds, find a weak classifier that minimizes weighted error if weighted error >=50% accuracy, break down-weight correct examples, up-weight incorrect examples combine subordinate models into final model, each weighted inversely by training error
learning bayes net structures
- still active research
- search algo: modifying an initial structure (nodes, arcs)
- some assume partial topological order over vars is given
- criteria
- accuracy after fitting params
- if implicit conditional independence assertions are satisfied
- MLE results in fully connected network; must penalize with bayesian
(MAP/MDL) approach over network hyps (prefer simpler networks)
- usu too many structures to sum over, so sample w MCMC
posterior simulation (TODO)
-
objective: evaluate integrals over density
$p(x)$ - eg estimate expected value
-
direct sampling methods
- generally given an un-normalized density
$p*(x) = c p(x)$ - uniform: grid search; $$ prohibitively many samples req'd for high dims
- importance: uniform w sampler density
$q(x)$ -
$q$ should be simple enough to be sampled - hope that
$q*$ is reasonable approx to$p*$ - adjust est by weighting "importance" of each sample
- hard to choose good
$q$ , esp in high dims
-
- rejection: like importance but proposal density strictly bounds target
density:
$c q*(x) > p*(x)$ - sample
$x$ from$q*(x)$ , -
$c$ often too big for poor choice of$q$ or high dims
- sample
- generally given an un-normalized density
-
MCMC
- BUGS, JAGS, etc use M-H not Gibbs; by Gibbs they mean one-var-at-a-time
- metropolis-hastings
- use local density
$q(x'; x^{(t)})$ , dep on current state$x^{(t)}$ - build markov chain through state space, converging to target density
$P*(x)$ -
$Q$ doesn't need to closely approx$P$ - at time
$t$ , generate tentative state$x'$ from$Q$ - eval
$a = \dots$
- use local density
- gibbs
-
inappropriate if any deterministic nodes (transition prob 1) or non-ergodic (disconnected components)
-
one pseudocode (for discrete dists, from http://dclib.hg.sourceforge.net/hgweb/dclib/dclib/file/474a3c84a175/dlib/bayes_utils/bayes_utils.h)
iterate for node in nodes if node is evidence, skip compute dist P(node | neighbors) = normalize( for value in node.values P(node=value | node.parents) * prod(P(child | child.parents (incl node=value)) for child in node.children) ) node.currvalue = sample from dist
-
-
refs
- http://jakehofman.com/talks/nyc_rmeetup_20090709_jmh.pdf
- gelman. bayesian data analysis.
sampling
- fwd sampling
- for bayesian networks
- topological order
- queries w evidence: est
$P(Y = y \mid E = e)$ - rejection sampling: fwd sampling, discard where
$E \ne e$ -
$P(E = e)$ usu prohibitively small; too many samples required
-
- fwd sampling generally infeasible for markov networks; no "roots"
- rejection sampling: fwd sampling, discard where
- markov chain: probabilistic state machine
- TODO
markov chain monte carlo (MCMC)
- TODO taxonomy: http://brenocon.com/blog/2013/04/monte-carlo-algorithm-inputs/
- award-winning expository paper: http://www.ams.org/journals/bull/2009-46-02/S0273-0979-08-01238-X/S0273-0979-08-01238-X.pdf
- variational bayes TODO
- deterministic
graphical models
- bayesian network aka belief network aka directed graphical model
- local markov prop: any 2 nodes conditionally indep given parents' values
- edges usu repr causality but not required (either dir)
- plates: for repeating sub vars
- eg naive bayes, neural nets
- markov blanket: neighbors; parents, children, children's parents
- every node is conditionally indep of all other nodes in network given blanket
- dynamic bayesian network: for sequences of vars, eg time series
- eg HMMs, kalman filter
- markov network aka markov random fields aka undirected graphical model
- reprs some things bayes nets can't (cyclic deps) & vice versa (induced deps)
- eg log-linear models, gaussian markov random field TODO
- must satisfy pairwise, local, global markov props TODO
- conditional random fields TODO
- factor graph
- bipartite hypergraph representing factorization of a function TODO
- http://cacm.acm.org/magazines/2010/12/102122-bayesian-networks/fulltext
linear classifiers
- naive bayes
- generative
- it's actually linear: http://nlp.stanford.edu/IR-book/html/htmledition/linear-versus-nonlinear-classifiers-1.html
- choose
$c_0$ iff$\log \frac{P(c_0|d)}{P(c_1|d)} = \log \frac{P(c_0)}{P(c_1)} + \sum_{k=K} \log \frac{P(t_k|c_0)}{P(t_k|c_1)} > 0$ where$K$ is the set of terms in the document
- choose
- logistic regression
- discriminative
- for
$K$ classes,$P(y|x) = \frac{ \exp(w_y \cdot x) }{ \sum_{y'\in{1,\dots,K}} \exp(w_{y'} \cdot x) }$
- http://www.quora.com/What-is-the-difference-between-logistic-regression-and-Naive-Bayes
generative vs discriminative
- discriminative: LR, perceptron, SVM
- good generalization when much data
- generative: LDA, NB
- more interest in semisupervised learning bc harder to get labels
- can train discriminatively, but no longer make use of unlabeled data
- http://research.microsoft.com/en-us/um/people/minka/papers/LasserreBishopMinka06.pdf
- more interest in semisupervised learning bc harder to get labels
GLM
- families
- Bernoulli and binomial: logistic regression
- Gaussian: OLS regression
- multinomial: softmax regression
- Poisson: for modelling count-data
- gamma and exponential: for modelling continuous, non-neg RVs, eg time intervals
- beta and Dirichlet: for dists over probs
- many more...
- residual deviance, null deviance TODO
maximum entropy
-
max entropy classifier aka multinomial logit
- generalizes logistic regression
- assumes independence of irrelevant alternatives (IIA): eg choices are blue bus, yellow bus, car, but most ppl choose btwn (any) bus and car
- unlike alts like naive bayes, doesn't require feature independence, but learning is much slower
-
$P(y_i=j) = \frac{\exp(X_i \beta_j)}{1 + \sum_{j=1}^J \exp(X_i \beta_j)}$ and$P(y_i=0) = \frac{1}{1 + \sum_{j=1}^J \exp(X_i \beta_j)}$ -
$\beta_j$ typically estimated using maximum a posteriori (MAP), which is an extension of max likelihood using regularization of weights
feature selection
- why?
- most methods implicitly do feature selection
- decision trees: use info gain or gain ration to decide what attrs to use as tests; many features don't get used
- neural nets: backprop learns strong connections to some inputs, near-0 connections to others
- kNN, MBL: weights in weighted euclidian distance
- SVMs: max margin hyperplane may focus on important features, ignore irrelevant features
- answer: empirically degrading performance with more features; cf curse of
dimensionality
- DT (and many others, esp eg kNN) vuln to irrelevant attrs
- as you go down the tree, less data avail bc of splitting
- even relevant attrs can hurt; eg adding a high-signal attr deteriorates perf by being chosen high up the tree, reducing the data down the tree
- NB robust to irrelevant attrs but vuln to redundant attrs by the same principal
- DT (and many others, esp eg kNN) vuln to irrelevant attrs
- also, simpler models: faster, easier to understand, see most important
- most methods implicitly do feature selection
- techniques
- subset search algos (see below); some subset evaluation measures:
- filter: indep assessment based on general characteristics of data, eg simple model
- wrapper: evaluate subset using the learner that will be ultimately used
- keep attrs that have high class cor but low inter-cor
- using symmetric uncertainty
$U$ , goodness of feature set is$\sum_j U(A_j, C) / \sqrt{\sum_{i,j} U(A_i, A_j)}$
- using symmetric uncertainty
- use learner: eg, use attrs used by DT, or use top-coef attrs in linear
model, or use top 1R attrs, or use attribute weighting in instance-based
learning like kNN
- but error-based method like 1R may not be best for ranking attrs TODO supervised discretization
- subset search algos (see below); some subset evaluation measures:
- search algos
- metrics for filters/wrappers: F-test (sig tests are discouraged; not even
sure what this is testing), AIC, BIC,
$R^2$ , CV accuracy - best-first/greedy stepwise regression: most popular; data dredging?
-
forward selection: add features one/some at a time
-
backward elimination aka recursive elimination: eg this faster/sloppier version of naive backward (which is opposite of forward)
start with all features in model candidates (for removal) = all features for each iteration, remove from candidates any feature whose exclusion yields no acc drop remove (from model and candidates) feature in candidates with biggest acc drop
-
combination: add & remove features, eg forward while optionally removing a feature at each step
-
- forward stagewise regression: "partially add" vars by increasing weight by
epsilon in correct direction
-
increase weight along most-correlated var by
$\epsilon$ let vector r = y let vector beta = 0 iterate: find x[j] most correlated with r let delta = epsilon * sign(r, x[j]) set beta[j] += delta set r -= delta * x[j]
-
identical to LASSO for orthogonal predictors, and similar in general case
-
- least angle regression (LARS): fwd stagewise made fast
- increase weight along most-correlated var until another var just as correlated, then move along the bisection of both vars, etc.
- similar results as LASSO and fwd stagewise, which can be thought of as restricted versions of LARS; slight modifications yield LASSO/fwd stagewise
- same complexity as OLS on full data set; also yields full path; no reason to not try it
- http://blog.echen.me/tag/least-angle-regression/
- http://www.stanford.edu/~hastie/TALKS/larstalk.pdf
- L1 (LASSO) regression/regularization
- LASSO = least absolute shrinkage and selection operator
- impl w quad prog, coordinate descent, or LARS for exact solution
- assume attrs and response are centered at 0 and scaled to unit variance
- minimize RMSE where L1 norm of
$\beta$ bounded by$s$ (diamond in 2D) - (for small
$s$ ) does selection as well as shrinkage, since it may be optimal to put all weight on fewer attrs than a bit on each
- metrics for filters/wrappers: F-test (sig tests are discouraged; not even
sure what this is testing), AIC, BIC,
- relevance determination: rank by information gain and evaluate first 10, 20,
... features
- information gain: reduction in entropy due to attribute:
$G(S,A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
- information gain: reduction in entropy due to attribute:
- limitations
- feature selection can overfit
- wrapper methods can be expensive
- non-selected != non-predictor (eg redundant features)
- may discard features that domain experts want to keep
- most feature selection greedy, non-optimal
- http://www.quora.com/Regression-statistics/What-is-Least-Angle-Regression-and-when-should-it-be-used
- TODO stepwise regression
- TODO optimality criteria
- http://www.cs.cornell.edu/courses/cs578/2007fa/missing_featsel_lecture.pdf
recommender systems
- some flavors: depends on how much data you have
- collaborative filtering: works best in isolation but requires enough data to combat sparsity
- content-based
- social/trust: based on network's items/activity
- usually mixing is good
- "the science and the magic of user and expert feedback for improving recommendations" (telefonica)
- if
$M$ is ratings matrix with users in rows and movies in cols, then:- for
$M M^T$ : one evec may be for "horror" movies, and $i$th component is user$i$ 's pref for horror movies; evals are feature importance - for
$M^T M$ : one evec may be for "horror" movies, and $j$th component is how horrific movie$j$ is
- for
collaborative filtering
-
various similarity measures
- jaccard for binary features
- cosine similarity
- pearson correlation: useful if want insensitivity to users' differing scales
-
item-item similarity (first pub by glinden et al)
-
predict a user's rating of an item as the weighted sum of other items' ratings by the user, where weights are item similarities:
score[usr,itm] = avg(sim(itm,itm2) * score[usr,itm2] for itm2 in itms)
-
-
user-user similarity
-
predict a user's rating of an item as the weighted avg of other users' ratings for the item, where weights are user similarities:
score[usr,itm] = avg(sim(usr,usr2) * score[usr2,itm] for usr2 in usrs)
-
-
evaluation metrics: RMSE is typical
-
TODO slope one
netflix prize
- http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
- identifying biases/features is just as important as modeling
- baseline rating
- user-specific effect
- movie-specific effect
- time since user's first rating (harsher critic over time)
- time since movie's first rating by anyone (early watchers bigger fans)
- let user's rating depend on # ratings for movie (user may be anti-popular)
- let user's rating depend on movie's overall rating
- neighborhood models
- problems
- neighbors aren't indep
- diff movies should use diff # neighbors
- so
- find neighbors using standard sim metric
- sparse linear regression:
movie rating ~ neighbors' ratings
- problems
- implicit data: similar to neighborhood model, learn
offset ~ whether user rated neighbor (for each neighbor)
- matrix factorization: most important
- std SVD
- asymmetric SVD: don't maintain user factors; just use (possibly weighted) sum of item factors of items she rated
- SVD++: use both above
- regression: can derive predictors from PCA/MDS/SVD
- RBMs were applied
- temporal effects: can bin movies into buckets spanning some months, and allow movie biases to change within each bin (eg Time nominated Titanic as best movie ever, causing burst in glowing reviews)
- regularization: heavily used ridge regression; lasso less effective
- ensemble methods: winners used GBDTs to combine 500 models
- add some useful features for clustering
- factor vectors of users and movies
- hidden units of an RBM
- Bell/Koren found using earlier ensemble method that RBMs more useful when movie/user has low # ratings
- previous sols use linear regressions to combine models
- add some useful features for clustering
random forests
- bagged ensemble of decision trees (orange uses 100 by default)
- each decision tree uses subset of all features (param
$m$ ) - each decision tree may be grown fully without pruning
- each decision tree trained on a bootstrapped sample of the data (as per bagging)
- their reported error (confidence?) does take into account their performance on the training set, somehow
- may be obvious but: despite feature sampling, feature selection still helps, since the trees are not weighted at all by their performance (if they were, then they'd implicitly be weighing their own subset of features)
- efficient, parallelizable
- RFs is very competitive, but also has danger of overfitting
- can rank features with this rough technique:
- each tree is trained over a bootstrapped sample, so check tree accuracy on the complement ("test set")
- for each feature, randomly permute the values over the records, and re-measure accuracy
- rank/score features by how much its randomization affected tree accuracy
- accumulate scores over all trees
extremely randomized trees aka extra trees
- same as RF but also split thresholds drawn randomly, then best chosen (instead of looking for most discriminative thresh)
multiclass classification
- some classifiers handle it naturally
- dec trees, bayes methods, log reg, ...
- generic meta learners
- must ensure that output scores are comparable! problematic for many
classifiers
- not "Bayes consistent"
- http://metaoptimize.com/qa/questions/6543/one-vs-all-multi-class-svm
- 1 vs rest aka 1 vs all aka OVA: build classifier for each class where pos iff labeled w that class; in test, choose class w highest classifier score
- 1 vs 1 aka pairwise aka all vs all aka AVA: build classifier for each pair
of classes (
$\choose{n,2}$ classifiers); in test, choose class w most wins - AVA builds more classifiers but pairwise classification much faster (?) and more balanced (easier to find best regularization), hence often faster than OVA http://www.cs.berkeley.edu/~pliang/cs294-spring08/lectures/multiclass/slides.pdf
- error-correcting output codes (ECOC)
- each class given unique binary str of length
$n$ - train
$n$ classifiers, one per bit - in test, run
$n$ classifiers on$x$ to get$n$ -bit str- coding must be carefully chosen to maximize hamming dist among codewords and make sure each bit is uncorrelated w other bits
- choose coding using exhaustive codes (all
$2^{k-1} - 1$ bits), column selection from exhaustive codes, randomized hill climbing, BCH codes, more
- choose class w closest str (eg by hamming dist)
- each class given unique binary str of length
- tutorial http://courses.washington.edu/ling572/winter2011/teaching_slides/2_15_multiclass_to_binary.pdf
- generalization:
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B6C31DC721E6484E27B9380C764CBF98?doi=10.1.1.73.2154&rep=rep1&type=pdf
- ECOC is a less-general intermediate
- orthogonal probs: how to classify example or assign it a prob dist given classifiers' results
- In Defense of One-Vs-All Classification http://jmlr.csail.mit.edu/papers/volume5/rifkin04a/rifkin04a.pdf
- must ensure that output scores are comparable! problematic for many
classifiers
latent dirichlet allocation (LDA)
- learns from body of entities, inferring that a particular attr belongs to a topic
- to generate a document:
- num words ~ Poisson
- topic mix ~ Dirichlet (over fixed set of topics)
- to generate a word:
- topic ~ doc's multinomial of topics
- word ~ topic's multinomial of words
- learn with MCMC (collapsed Gibbs sampling)
misc
- ANOVA: continuous response
- LDA: categorical response
- unlike logit/probit regression, assumes vars are normally distributed
- PCA: doesn't consider class
- factor analysis: builds feature combos based on diffs not similarities
- transfer learning: storing knowledge gained while solving one problem and
applying it to a different but related problem
- eg useful for personalization
- simple form: final prediction = sum of global and local (user-specific) models
support vector machine (SVM)
- hyperparameters
- eg
$C$ penalty param - eg kernel-specific params; eg poly kernel has more hyperparams than others
- eg
- slack variables introduce soft margin
bayesian optimization TODO
- not entirely clear yet how this works, but rough notes follow
- way to globally optimize a costly-to-evaluate function
- set prior over function and combine w evidence to get posterior
- permits utility-based selection of next obs; considers both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement)
- how it works
- assume evaluations of a bunch of arbitrary points
- fit gaussian process and search for best next point (optimize an
"acquisition function")
- can be expected value + predicted variance (upper-confidence-bound acqusition fn)
- or expected improvement
- http://hboa.deg511.com/boa.pdf
- http://atpassos.posterous.com/bayesian-optimization
kernels
-
from http://anyall.org/blog/2011/01/please-report-your-svms-kernel/:
An un-kernelized, linear SVM is nearly the same as logistic regression...But a quadratic kernelized SVM is much more like boosted depth-2 decision trees. It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others. (And of course, more complicated kernels do progressively more complicated and non-linear things.)
I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regression. (Which is way faster and simpler to train! You can implement SGD for it in a few lines of code!)
A linear SVM sometimes has a tiny bit better accuracy than logistic regression, because hinge loss is a tiny bit more like error rate than is log-loss. But I really doubt this would matter in any real-world application, where much bigger issues are happening (like data cleanliness, feature engineering, etc.)
If a linear classifier is doing better than non-linear ones, that’s saying something pretty important about your problem. Saying that you’re using an SVM is missing the point. An SVM is interesting only when it’s kernelized. Otherwise it’s just a needlessly complicated variant of logistic regression.
regression
- multivariate regression: multiple dependent vars
- multinomial regression: multiple categorical dependent vars
- multiple regression: multiple indep vars, one dep var
- question: what does the objective function mean?
- http://kernelsvm.tripod.com/
- Tutorial on SVM Regression
multicollinearity
- correlation btwn predictors
- causes problems for LMs, neural nets, others
- trees are resistant but may affect interpretability of model
pragmatics/usage
- scale params to avoid feature imbalance and numeric difficulties
- kernels
- RBF: good first choice
- in (0,1], so fewer numerical difficulties
- just one hyperparam (
$\gamma$ ) - linear is special case of RBF; behaves like sigmoid for certain params
- "great when it works, but it's highly vulnerable both to 'junk' dimensions which carry a large amount of noise and the choice of metric"
- linear: good when many features
- RBF: good first choice
- simplest approach to finding hyperparams: grid search
multiclass
- http://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
- generic approaches (OVA, AVA, etc) "not very elegant"
- instead, build 2-class SVM over new feature vector that includes the class
as a feature (or in general some function of it, $\Phi(x,y')$)
- in test, choose class w largest margin
$y = \arg \max_{y'} w^T \Phi(x,y')$ - in train, margin is gap btwn this value for correct class and for the nearest other class, so new quadprog formulation required (see URL)
- in test, choose class w largest margin
ranking
assoc rule mining/learning, frequent itemset mining
- assoc rules are just superficial variations on itemsets
- apriori: best-known algo for association rule mining
- for items that are grouped in txns
- build up candidate itemsets one item at a time using BFS, but on each iteration prune out candidates, keeping only those with minimum support (frequencies) and (optionally) minimum confidence (# in subset including new item / # in current set)
- historically significant but has inefficiencies & trade-offs
- FP-growth: frequent pattern growth
- param: min support
- build FP-tree data structure
- sort each txn (itemset) by item frequency
- build prefix tree from txns
- chain identical nodes so we can find all tree paths containing that item
- to find frequent itemsets, for each distinct item
$x$ , find if it has enough support (sum of counts across all its nodes >= min support)- if so, yield that (+ suffix if currently recursing) as a frequent itemset
- for each item
$x$ , build conditional FP-tree for $x$- FP-tree of itemsets ending in
$x$ , excluding$x$ - build new FP-tree over paths collected by walking up parent ptrs
- update node counts: sum of children's supports (anchor w
$x$ node supports)
- FP-tree of itemsets ending in
- fast: no candidate generation/test; use compact tree; scan DB once
- http://www.engr.uvic.ca/~seng474/association3.ppt
- http://www.engr.uvic.ca/~seng474/fptree_complete_example.ppt
- https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
- orig paper http://www.inf.unibz.it/dis/teaching/DWDM/project2010/FP-growth.pdf
- similarity-based: calculate (e.g.) jaccard similarity between indicator
attributes
- faster using minhash for first pass
- cohen et al, 2001
rule learning/induction
- subgroup discovery: variation that focuses on discovering individual patterns
of interest rather than model generation (classification/prediction rule
learning)
- don't mind overlap; use downweighting instead of removal of covered items, and use quality metric like weighted relative accuracy instead of confidence/support
- weighted relative accuracy:
$P(X) (P(Y|X) - P(Y))$ orP(cond) (P(class|cond) - P(class))
- probs are freqs, but weighted
- instead of focusing on accuracy, focus on finding big groups
-
$P(X)$ is generality; find big groups -
$P(Y|X) - P(Y)$ is relative accuracy -
$= P(X,Y) - P(Y) P(X)$ (diff btwn actual joint and expected joint if indep) - laplace accuracy can be used for
$P(Y|X)$ - better WRA: use unweighted
$P(Y) = n(Y)/N$
- apriori-C: apriori for classification
- discretize cont vars, binarize disc vars
- apriori considering only rules/itemsets containing class value
- post-process: keep just
$n$ highest-support rules (for each class)- iteratively select best rule, remove covered examples, repeat sorting
- some optimizations, eg:
- suppress generating rule if already have more general rule w satisfactory support & confidence
- apriori-SD
- apriori-c w new post-process:
- order by weighted relative accuracy quality function
- downweight covered examples, repeat sorting
- downweighting of covered examples:
$w = 1/(i+1)$ (init$w=1$ )
- apriori-c w new post-process:
- AQ: outputs sequence of "if COND then CLASS" rules
- beam search (best-first search with pruning)
- once find rule, remove covered data, repeat
- TODO diff from CN2?
- CN2: rule induction algo
- based on ideas from the ID3 decision tree algo
- significance test filters out rules that cover too few examples
- possible heuristic criteria to rank rules in beam (best-first) search
- entropy: used in ID3 and orig CN2
- tends to select very specific rules covering few examples; significance test is only a threshold; tihs is called downward bias
- accuracy: behaves similarly to entropy;
$P(class|cond) = n_c/n$ , where:-
$n_c$ is # examples in predicted class$c$ covered by rule -
$n$ is total # examples covered by rule
-
- laplace accuracy:
$\frac{n_c + 1}{n + k}$ , where$k$ is # classes- not sure about rationale here
- weighted relative accuracy
- entropy: used in ID3 and orig CN2
- ordered v unordered: whether rule is defined conditioned on rules above it
having failed
- unordered algo change: iterate the search for each class in turn, removing only covered examples of that class when rule is found
- TODO how to know when to stop?
- CN2-SD (subgroup discovery): instead of removing, downweight
- bc in CN2 only first few rules may be of interest
- can decrease multiplicatively or additively
- C4.5rules
- RIPPER
- faster/worse than C4.5rules, better than IREP http://www.let.rug.nl/~nerbonne/teach/learning/cohen95fast.pdf
- worse than NB for mail classif http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3844
- SLIPPER: competitive w/better than ripper, c4.5rules
- SD-map: based on FP-growth
backoff
- m-estimate: backoff to handle sparsity
- instead of n_c / n where n_c is # matching examples in class c and n is # matching examples, use (n_c + m p_c) / (n + m) where p_c is relative class frequency (over all)
- mitchell, 1997 / lidstone's law
getting in shape for the sport of data science (jeremy howard, kaggle chief scientist)
- subject: how to compete in kaggle
- coding to prep data for models is most important
- else, you're just applying models
- regexes get you very far
- excel is very powerful for data understanding
- cell color scales
- pivot tables
- programming platform recommendations
- python, numpy, scipy, matplotlib, ipython
- c#, alglib (ML), math.net numerics (linalg), chartcontrol (charts)
- c++, eigen (powerful linalg)
- java, weka
- R packages
-
caret: auto find models, grid search hyperparams, etc.
mdl <- train(data, "svm", center, pca)
-
ggplot2: transparent scatterplot points
-
ggobi: like tableau
- brushing: select pits in one plot, highlight in others
- not used as much in practice bc excel/ggplot2 so good
-
- viz must be flexible more than fancy
- eg plotting hundreds of time series is surprisingly useful (won that comp)
- models
- parametric models: GLM
- non-parametric models: SVM, RF
- R's RF impl limitations: 32 factors, no NA handling
- NA handling: fill in w median and add indicator var
- decision trees use info gain or GINI to find break points
- traffic prediction: excel color scales helped see the pattern/propagation of jams
- cran recommendations problem
- P(user installing any pkg) * P(pkg getting installed) -> 84% AUC
- use max of probs -> 95% AUC
- use RF and a bunch more data -> 93% AUC
- use RF on max -> 83% AUC
- GLM on a bunch more data -> 97% AUC
- GLM + mark deps as installed (type of programming that's cumbersome in R, a "shit programming language"; used C#) -> 98%
- 6th place
- ensemble: varying high-var learners; rely on one/few learners working well
while all others cancel each other out
- can increase variance by choosing your own (set of possible) break points in continuous variables
- the smaller the subspaces/samples, the more variance, but the longer it takes
- don't prune trees; that introduces bias
- http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
deep learning TODO
- sparse coding aka dictionary learning aka autoencoder
- learn low-level features; similar to PCA, but yields not necessarily orthogonal basis
- multilayer feed-fwd neural networks using back-prop (with small hidden layer)
- RBMs
- https://venefrombucharest.wordpress.com/2011/04/15/an-overview-of-dictionary-learning-terminology/
- http://www.stanford.edu/class/cs294a/
restricted Boltzmann machines (RBMs)
- a stochastic neural net w 1 hidden layer; undirected graphical model
- essentially a binary version of factor analysis (continuous)
- training: contrastive divergence, which is approximate gradient descent
- randomly init weights (0-mean 0.1-SD Gaussian)
- update hidden from visible
- activate each node with prob
p_j = sigmoid(sum of weighted activations)
- set
$P = x_i x_j$ or$x_i p_j$ (can use either state or prob)
- activate each node with prob
- update visible from hidden
- update hidden from visible; set
$N$ same way as$P$ - adjust by
$\alpha (P - N)$ where$\alpha$ is learning rate - intuition:
$P$ is rooted in observations,$N$ is RBM's own tendencies - when iterating, discard updated visibles; just use orig training data
- TODO better understand why this works (esp
$-N$ )
gradient boosted decision trees
- related terms: GEM algo, treeboost, additive model
- favors many shallow trees (depth 2 or 3, maybe even stumps)
- train tree
$k$ on sample of training set where labels are residuals from model-so-far$z$ - regression:
$r = y - F_{k-1}(\vec{x})$ - classif:
$r = y - \frac{1}{1 + \exp(-F_{k-1}(\vec{x}))}$ - initial
$r = y$
- regression:
- sum the outputs
- general formulation defined in terms of abstract loss function
- can't actually calculate gradient, but follow residuals for steepest descent
- usage
- yahoo has used this extensively: GBRank, SmoothDCG
- blender in BellKor
- pros
- feature normalization not required
- auto feature selection; not prone to colinear/identical features
- easy to interpret
- easy to use diff loss functions
- cons
- boosting is sequential
- compute intensive: in DTs, for each feature/value, must compute gain
- can perform poorly on high-dim sparse data (eg bag of words) bc too many nodes/trees needed
- variations
- initialized GB: init
$F_0$ is output of random forest
- initialized GB: init
bayesian blocks
- nonparametric unsupervised discretization
- flexible piecewise shapes & fitness functions
- cute DP algo
- refs
gmail priority inbox
- http://research.google.com/pubs/archive/36955.pdf
- concise paper
- use simple linear logistic regression for scalable learning & prediction
- features: social (sender/recip metrics), msg content, thread, labels
- importance metric:
$p=P(a \in A, t \in (T_{min}, T_{max}) | \mathbf{f}, s)$ - action
$a$ within time$t$ of delivery given whether seen ($s$ ) -
$T_{min}$ under 24 hours, constrained by model update frequency and gives users time to react -
$T_{max}$ in days; longer interaction periods are not considered - [prediction error strange]
- action
- personalizing model uses transfer learning
$s = \sum_{i=1}^n f_i g_i + \sum_{i=1}^{n+k} f_i w_i, p=\frac{1}{1 + \exp^{-s}}$ -
$k$ user-specific features
- use PA-II online passive-aggressive regression variant
- misc
- continuous -> binary features via ID3-style algo [?]
Running Large Graph Algorithms: Evaluation of Current State-Of-the-Art and Lessons Learned (Dr. Andy Yoo, Lawrence Livermore National Laboratory, Google Tech Talk 2/18/2010)
-
scale-free graphs aka power-law graphs aka complex networks
- random graphs that can be constructed by adding vertices to highly connected vertices
- hubs: high-deg vertices
- small diameter
- communities
- self-similarity: recursive structural similarities within communities as overall graph structure
-
apps: internet/web, social/citations, bioinformatics (protein-protein interaction)
- graph mining
- spectral graph theory: finding eigenvals/eignevecs; web mining, pagerank, personalized web search, etc
- social network analysis: structure of the graph
- community detection, role finding, centrality study (who's important), influence study, organization study
- counter-terrorism
- pattern analysis: apply subgraph pattern matching
- fraud detection, cyber security
- graph mining
-
big graph processing challenges
- big
- complex and hard to parallelize: even basic BFS
- lesser-known: large intermediate data; eg space complexity of BFS
-
used bluegene
-
social network analysis: http://mlg.ucd.ie/files/summer/tutorial.pdf
Google ML Infrastructure
- SmartASS: adwords, adsense, doubleclick; worked on by topcoder ppl
- SETI: general ML infra; simon tong (acclaimed scientist, google sets)
- S___: content ads
- Prediction API: much smaller
- learn to rank not really used in ranking algo; want explainable/tweakable results