- stanford linear systems & optimization | convex optimization I: http://see.stanford.edu/see/courseinfo.aspx?coll=2db7ced4-39d1-4fdb-90e8-364129597c87
- finite element method
- KKT conditions
- http://www.geometer.org/mathcircles/graphprobs.pdf
- fourier, z, laplace, wavelet, hilbert xforms
- butterworth filter
-
frequentist statistics
- Point estimates trade off bias and variance
- Interval estimates have the goal of achieving nominal coverage and the goal of being informative
- Tests have the goals of calibration and power
-
frequentists express prior info not in dist but in choice of methods---may still be too restrictive
-
prospective vs retrospective studies
- prospective: cohort is random sample of population; watch them; may be very skewed (eg if studying rare disease)
- retrospective: already know cases & controls; sample both evenly
-
ascertainment: just means sampling (ascertainment bias, ascertainment method)
-
TODO mixed effects
-
characteristic function = fourier transform
-
TODO http://itunes.apple.com/us/itunes-u/statistics-110-introduction/id495213607
-
bootstrap hypothesis testing: for 2 groups, merge them, generate a bunch of bootstrap samples, and see how frequently they generate means as different as what's seen (the p-value)
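a minimal sketch of this procedure (assuming NumPy; the function name and the 10k resample count are arbitrary illustrative choices):
```python
import numpy as np

def bootstrap_pvalue(x, y, n_boot=10_000, rng=None):
    """Two-sample bootstrap test: how often does resampling from the merged
    groups produce a difference in means at least as large as observed?"""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = abs(x.mean() - y.mean())
    merged = np.concatenate([x, y])
    count = 0
    for _ in range(n_boot):
        bx = rng.choice(merged, size=len(x), replace=True)
        by = rng.choice(merged, size=len(y), replace=True)
        if abs(bx.mean() - by.mean()) >= observed:
            count += 1
    return count / n_boot   # the bootstrap p-value
```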
-
areas: descriptive, inferential
-
moments
- positive skew: lumped to the left (counter-intuitive)
- positive kurtosis: pointier peak
-
sample variance: same as var but divide by $n-1$ instead of $n$
-
mean deviation: the mean absolute deviation from the mean
- harder to eval in closed form
- used by TCP to estimate deviation from the RTT by calcing an EWM (exponentially weighted mean) deviation
-
weak law of large numbers (prove using Chebyshev's ineq): $\lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1$
-
strong law of large numbers (more complex proof): $P( \lim_{n \rightarrow \infty} \bar{X_n} = \mu ) = 1$
-
central limit theorem: mean/sum of sufficiently many independent random variables, each with finite mean & variance, is approximately normal
- as sample size increases, sample mean distribution approaches normal
-
standard error of the mean: standard deviation of the sampling distribution of the sample mean ("sampling distribution of the sample mean" is the "distribution of the mean of samples")
-
confidence interval: at a certain confidence level (eg 95%), the prob this experiment yielded a sample that contains the true pop param
- "Were this procedure to be repeated on multiple samples, the calculated confidence interval (which would differ for each sample) would encompass the true population parameter 95% of the time."
- not for predicting prob of true pop param being in interval; pop param has actual value, not a prob dist
- instead, indicates reliability of estimate
- find lowest value of param (e.g. population mean or population binomial
proportion) such that estimate (e.g. sample mean or sample binomial
proportion) is still in innermost 95% of the sampling distribution of the
estimate
- innermost means excluding lowest/highest 2.5%
- sampling distribution:
  - for normal data $\sim N(\mu, \sigma)$, it's $\sim N(\mu, \sigma / \sqrt{n})$, where $\mu$ is unknown
  - for bernoulli data $\sim Bern(p)$, it's $\sim Bin(n,p)$, where $p$ is unknown
- do the same for highest such value of param
- closely related to significance testing: testing null hyp $\theta=0$ can be done by seeing if confidence interval for $\theta$ includes 0 (w confidence level $\gamma=1-\alpha$)
- intervals calculated from (functions of) sample, so they're statistics, functions of RVs, hence RVs themselves
- margin of error: radius (half) of confidence interval
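a small sketch of the normal-approximation interval described above (the function name and z=1.96 for ~95% coverage are illustrative choices, not from the notes):
```python
import math

def mean_confidence_interval(xs, z=1.96):
    """Normal-approximation 95% CI for the population mean."""
    n = len(xs)                                        # assumes n >= 2
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
    sem = math.sqrt(var / n)                           # standard error of the mean
    return mean - z * sem, mean + z * sem              # margin of error is z * sem
```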
-
statistical error: deviation of a sample from unobservable true/theoretical value (eg person's height vs population mean)
-
residual: deviation of a sample from observable estimated value (eg person's height vs sample mean)
-
survival function: complement of CDF
-
2 time series are cointegrated if linear combo is stationary
- 2 time series are cointegrated if they share common stochastic drift
- eg drunkard and his dog: both randomly constantly moving but periodically error-correcting; cointegrated of order zero
- eg stock market index and its futures contract, pairs trading
- eg, R-squared fails when dealing w non-stationary vars, leading to spurious regressions suggesting relationships when there's none
- Engle-Granger test: simplest detection approach TODO
-
order of integration $I(p)$: min num diffs required to obtain a stationary series
- eg 1,1,1: already stationary
- eg 1,2,3: diff w lag-1 is stationary
- eg 1,4,9,16,25: diff w lag-1 gives 3,5,7,9; second diff is stationary
- similar to calculus integration/differentiation
-
simpson's paradox: a correlation present in diff groups is reversed when groups are combined; often in social-/med-sciences when frequency data hastily given causal interps
-
TODO
-
(paired) student t-test
-
distributions
- student t-distribution: something about normals for small populations where variance is unknown
- chi-square distribution
- Gaussian distribution, Cauchy, Exponential, Gamma, Generalized Extreme Value, Rician, Wrapped Cauchy, Von Mises, Binomial, and Beta
-
$\mathrm{Correlation}(X, Y) = \frac{ \mathrm{Covariance}(X, Y) }{ \sigma_X \sigma_Y }$ where $\sigma$ is the SD
-
seq of RVs has heteroscedasticity if they have diff variances
- usu. error terms vary w each obs, often true in cross-sectional/time-series
-
copula, Gaussian copula: TODO
-
Kendall tau rank correlation coefficient $\tau$
- similarity of orderings of data
- TODO uses p-value somewhere
bayesians vs frequentists
- views of probability
  - frequentist: coin w $P(H)=.5$ turns up heads .5 of the time
  - bayesian: coin w $P(H)=.5$ deserves even-odds betting
- views of stats
  - frequentist: ask what's $P(H=h)$ given the null hyp that true odds are .5
  - bayesian: estimate $P(\theta=.5)$ vs $P(\theta\ne.5)$ given #H & prior
- frequentist
- believes pop mean is real, but unknown and unknowable
- frequentist estimator
- find sample mean; construct confidence interval around it
- can't say "95% prob that true mean is in interval"; must say "95% prob over intervals that an interval contains true mean", i.e. "if we were to repeatedly sample data, find mean, and construct confidence interval, 95% of intervals would contain the true mean"
- frequentist tester
- form hypothesis: "mean is blah"
- accept or reject hypothesis with test to significance level
- bayesian
- only data is real; pop mean is an abstraction; construct a credible interval, centered near the sample mean and tempered by prior beliefs (sometimes very non-informative)
- can say "95% prob that this interval contains the mean"
- http://www.statisticalengineering.com/frequentists_and_bayesians.htm
- http://www.rasmusen.org/x/2007/09/25/bayesian-vs-frequentist-statistical-theory/
conjugate priors
- conjugate priors produce posteriors of the same algebraic form
- an algebraic convenience that yields closed-form posteriors
- beta dist is conjugate prior for binomial & bernoulli
- dirichlet dist is generalization of beta and conjugate prior for multinomial & categorical
- TODO
non-informative aka objective priors
- not nec. flat/uniform priors
- eg for bernoulli/binomial it's beta(.5,.5)
- construct according to Jeffreys' rule: prior proportional to the square root of the Fisher information (invariant under reparameterization)
- http://www.amstat.org/publications/jse/v12n2/zhu.pdf
probability vs statistics
- probabilists: hypothesis to data
  - very pure, "subjunctive"; strict if-thens; if die is fair, then $P(1)=1/6$; "is die fair?" -> "not my dept"
  - "in 100 tosses of fair coin, what's $P(H=40)$?"
- statistician: data to hypothesis
- deal w not just if-then, but whether die fair, what the bias is
- more science-y; nature must be observed to get at the underlying rules
- "if 40 of 100 tosses were heads, what's
$P(\theta=.5)$ ?": much harder question, such that no agreement among math/science/philos/prob/stats
- TODO finish
- http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/Metrics%2FStatistics/Bayesian-Likelihood-and-Frequentist-Approaches-to-/ArticleLong/Article/detail/84710
test accuracy
- precision = relevant & retrieved / retrieved
- recall = relevant & retrieved / relevant
- F1 score aka F-measure:
F = 2 * (precision * recall) / (precision + recall)
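a tiny illustration of these three formulas (hypothetical helper; assumes both the relevant and retrieved sets are non-empty):
```python
def precision_recall_f1(relevant, retrieved):
    """Precision, recall, and F1 from sets of relevant and retrieved items."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # relevant & retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```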
statistical significance
- standard score: number of SDs; aka z-value, z-score, normal score
- this is frequentist hypothesis testing; Bayesian hypothesis testing is Bayesian inference
- alt hyps
- Fisherian testing: no alt hyp, only accept/reject null hyp
- Neyman-Pearson: alt hyps
- in practice: alt hyp is just negation of null hyp
- p-value: significance level; P(observation or more extreme|null hypothesis)
- assuming null hyp (no underlying relationship or diff btwn the variables),
what's prob we'll see a diff at least as extreme as the one observed just
by chance?
- not same as saying, given that the alt hyp is true, we can replicate the results 5% of the time or 95% of the time; that's called the statistical power of the design
- the lower, the less likely the result if the null hyp is true, thus the more statistically significant the result is
- if < .05 or .01, corresponding to a 5% or 1% chance of rejecting the null hyp when it's actually true (type I error)
- 5% is a typical value (border-line acceptability); in the end this is all arbitrary, no way to say something is/isn't "really" significant
- eg P(14+ heads of 20 flips|fair coin)
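a quick way to compute the coin example above (and the "prob of at least $h$ H in $n$ flips" question below) directly from the binomial distribution; plain Python sketch:
```python
from math import comb

def binomial_tail(n, h, p=0.5):
    """P(at least h heads in n flips of a coin with P(H)=p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(h, n + 1))

print(binomial_tail(20, 14))   # one-tailed p-value for 14+ heads of 20, ~0.058
```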
- simple hypothesis: of the form $\theta = \theta_0$; specifies pop dist completely
- composite hypothesis: of the form $\theta > \theta_0$; doesn't specify population dist completely
- two-tail: p-value is $P(X>a \vee X<b | \theta)$
  - $H_0 : \theta = \theta_0$ vs $H_1 : \theta \ne \theta_0$
- one-tail: p-value is $P(X>x|\theta)$
  - $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$ (or vice versa)
- problem is that $\theta$ is unknown (just something under $\theta_0$); TODO how do we justify the p-value, then, which must assume some value of $\theta$?
- E-value: expected num tests until a certain observation, assuming null hyp
- how to efficiently calc prob of at least $h$ H in $n$ flips?
- type I error: false pos; rej null hyp when it's actually true
- type II error: false neg; acc null hyp when it's actually false
- mustn't stop experiment as soon as significant results are detected
- common "general format" of significance tests
- they're some ratio of (some measure of the differentiation common in the variables) / (overall differentiation of those vars)
- eg: (part of overall differentiation of the WCC scores) / (overall differentiation of the WCC scores)
- called ratio of "explained variation" to "total variation"
- can only get to P(null hypothesis|observation) if we're Bayesian and have a
prior
- Fisher violently opposes the concept of an alt hyp
- many different and widespread interpretation pitfalls
- quantile regression: estimate median or other quantile instead of mean
- good for risk models
- uses linear programming
- robust/resistant statistics: around for 20+ yrs; not widely used; p-values reign for many non-ideal real-world reasons
- z-test: appropriate only if the population SD is known, else use t-test
statistical tests
- many different statistical measures
-
power of a test is its probability of correctly rejecting null hyp (complement of false negative rate, $\beta$)
- power function $\beta(\theta) = P_\theta(X \in R)$, where $R$ is rejection region (this is a diff $\beta$)
- size/significance level of a test ($\alpha$): prob of incorrectly rejecting null hyp (false positive rate), or upper bound of prob of rejecting the null hyp over all cases covered by the null hyp
- always want to choose the most powerful test (highest power under $H_1$) for a given size/significance level
  - finding most powerful tests is hard; most powerful tests don't exist for many cases
- 4 widely used tests: Wald, $\chi^2$, permutation, likelihood ratio
- contingency tables: used by many statistical tests
- test selection decision table: http://www.graphpad.com/www/book/choose.htm
- the complexity here is a critique of the frequentist approach
- chi-square: typically Pearson's chi-square
- Fisher's exact test: for small categorical data; compares two groups
- a type of exact permutation test; based on hypergeom dist
- for 2x2 contingency tables (1 DOF, since margins are held fixed)
  - for each variation of table, calculate $p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}$
    - sum these up to get the $p$-value
  - use fisher if possible; easy w computers; chi square is approximation (sketch below)
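a from-scratch sketch of the table-probability formula above, summed over all tables with the same margins to get a two-sided p-value (this "sum tables at most as probable" rule is one common convention; helper names are made up):
```python
from math import comb

def table_prob(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_exact_pvalue(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    p_obs = table_prob(a, b, c, d)
    row1, row2, col1 = a + b, c + d, a + c
    p = 0.0
    # enumerate every table with the same margins (1 degree of freedom)
    for a2 in range(max(0, col1 - row2), min(row1, col1) + 1):
        b2, c2 = row1 - a2, col1 - a2
        d2 = row2 - c2
        p_i = table_prob(a2, b2, c2, d2)
        if p_i <= p_obs + 1e-12:   # "at least as extreme" = at most as probable
            p += p_i
    return p
```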
- Pearson's chi-square: for large categorical data
- assesses goodness of fit and tests of independence (TODO)
-
$X^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$, where
  - $X^2$ is Pearson's cumulative test statistic which asymptotically approaches a $\chi^2$ distribution
  - $O_i$ is observed freq
  - $E_i$ is expected (theoretical) freq
  - $n$ is number of cells in contingency table
- goodness of fit: compare group against theoretical expectation
- $\chi^2$ test approximates Fisher
  - $X^2$ becomes $(ad-bc)^2 (a+b+c+d) / ((a+b)(c+d)(b+d)(a+c))$
- degrees of freedom
  - for goodness of fit, it's number of terms in $X^2$ expr - 1
  - for indep test, it's (num cols - 1)(num rows - 1)
  - for 1 DOF, the chi square statistic is the square of a standard gaussian z, so it's the same as the "equal proportions z test" (z test over the difference of 2 binomial random variables)
- http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html
- http://math.hws.edu/javamath/ryan/ChiSquare.html
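a minimal goodness-of-fit illustration of the $X^2$ formula above (the die counts are made-up data):
```python
def chi_square_statistic(observed, expected):
    """Pearson's cumulative test statistic X^2 = sum (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# goodness of fit for a die rolled 60 times (expected 10 per face);
# compare the statistic against a chi-square distribution with 5 DOF
x2 = chi_square_statistic([5, 8, 9, 8, 10, 20], [10] * 6)
```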
- binomial test: for small categorical data
- compare group against theoretical value (i.e. with-replacement; see binomial dist vs hypergeom dist)
- non-parametric tests TODO: mann-whitney, wilcoxon
- Kolmogorov-Smirnov (K-S) test: nonparam test for equality of continuous 1D
prob dists; often used to test for GOF
- "The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples."
- sequential analysis: sample size isn't fixed in advance; data is evaluated
during collection; may conclude exp earlier
- alt is bayesian exp design
- Mann-Whitney-Wilcoxon RankSum test: useful to determine if 2 distributions are significantly different or not. Unlike the t-test, the RankSum test does not assume that the data are normally distributed, potentially providing a more accurate assessment of the data sets.
- Chow test: whether 2 linear regressions on diff data sets are equal
- also (in econometrics) for whether 2 time series over different times exhibit a structural break
data
- censoring: limiting values to be within bounds
- truncation: omitting values that are out of bounds
  - eg: if policyholders subject to policy limit $u$ and deductible $d$, then losses are left-truncated (unreported if under $d$) and right-censored (limited to $u-d$)
time series (TODO also cover math)
- stationary process: stochastic process whose joint prob dist doesn't change
when shifted in time/space, so mean/variance/autocorr don't change
- white noise is stationary; cymbal clash variance diminishes over time
- i.e. "flat", not trending over time
- for a non-stationary signal, either fitting a curve and then taking residuals or differencing may give stationary data
- autocorrelation: cross-correlation of signal w itself; a lagged
self-correlation; to detect non-randomness; works if fixed intervals
  - IID implies 0 autocorrelation (but 0 autocorrelation doesn't imply IID)
- Durbin-Watson test: conventional test to detect autocorrelation
- 0 to 4; 2 means no autocorrelation
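a sketch of the Durbin-Watson statistic as described (values near 2 mean no lag-1 autocorrelation in the residuals):
```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    over sum of squared residuals; ranges roughly 0 to 4, ~2 = no autocorr."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den
```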
- cross-correlation: sliding dot product of two signals; like convolution but no reversal
- linear prediction: future values of discrete-time signal estimated as linear function of previous samples
- autoregressive (AR) model: random process used to model natural phenomena; a type of linear prediction formula
- vector autoregressive (VAR) model: popular in econometrics
- Dickey Fuller test: whether unit root is present in autoregressive model
- TODO: (partial) autocorrelation plots, ARMA, ARIMA, Ljung-Box for zero autocorr, unit root test for coint, Granger-causality, whiteness (iid-ness) & normality, VAR impulse response analysis and forecast error variance decomp,
- TODO filtering: Hodrick-Prescott (HP), Baxter-King, Christiano-Fitzgerald (all popular in finance)
- TODO Bayesian dynamic linear models (DLM)
- repetition
- seasonal aka periodic: fixed-length cycles
- cyclic: variable-length cycles
- cyclic/seasonal models
- TODO cyclic vs seasonal vs periodic http://robjhyndman.com/researchtips/cyclicts/
statistical inference
-
analysis of variance (ANOVA or AOV): TODO
-
canonical (variate) analysis: TODO
- 1 predictor, 1 criterion: coef of corr $r$, coef of det $r^2$, coef $\beta$
- $n$ predictors, 1 criterion: multiple corr $R$, multiple coef of det $R^2$, weights $\beta$
- $n$ predictors, $m$ criteria: canonical corrs $\rho_1, \dots$, canonical weights $C,D$
  - find linear combo of predictors and linear combo of criteria w max corr
-
kernel density estimator: basically place a small kernel everywhere you have a sample, resulting in something much smoother than a histogram (where bins can make all the difference in interpretation); sketch below
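a bare-bones sketch of a Gaussian KDE evaluated at a single point (the bandwidth value is an arbitrary illustrative choice):
```python
import math

def kde(samples, x, bandwidth=1.0):
    """Gaussian kernel density estimate at point x: average of one small
    Gaussian bump centered at each sample."""
    def k(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    n = len(samples)
    return sum(k((x - s) / bandwidth) for s in samples) / (n * bandwidth)
```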
-
mean absolute percentage error (MAPE): $M = \frac{1}{n} \sum_{t=1}^n \left\vert \frac{A_t - F_t}{A_t} \right\vert$ where $A_t$ is actual and $F_t$ is forecast
relationship btwn mean squared error and gaussian dist? (TODO cs229 notes)
outlier elimination
- chauvenet's criterion
- peirce's criterion: "if you need rigor, generality, and the ability to reject more than one data point"
- process involves multiple steps and lookup tables
- https://web.chemistry.ohio-state.edu/~coe/research/documents/error_distributions_new2.pdf
exponential smoothing for time series analysis
- exponential moving average (EMA) aka exponentially weighted moving average (EWMA) aka single exponential smoothing (SES): does not spot trends
- double exponential moving average (DEMA) aka linear exponential smoothing (LES): incorporates trend (Holt's method)
- triple exponential moving average (TEMA) aka triple exponential smoothing (TES): incorporates trend & seasonality (Holt-Winters)
- http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc43.htm
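a sketch of single exponential smoothing (SES/EWMA) as described above; alpha is an arbitrary smoothing constant, and double/triple variants add trend and seasonal terms on top of this:
```python
def single_exponential_smoothing(xs, alpha=0.3):
    """SES/EWMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    s = [xs[0]]                       # seed with the first observation
    for x in xs[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s
```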
likelihood
- likelihoods for continuous distributions use PDFs instead of probabilities (since $P(X=x) = 0$); this is justified, see http://en.wikipedia.org/wiki/Likelihood_function#Likelihoods_for_continuous_distributions
misc
- benford's law: distribution of leading digits is non-uniform (exp decay)
- unbiased estimators can be terrible
- drug X results may have $p=.04$ and drug Y results may have $p=.06$ but difference may well NOT be significant (diff btwn sig & insig may well be insig)
- $p$-value is not prob of hyp being right; to compute this (posterior), more info is needed (prior)
- statistical significance != significance
- data dredging: many sig tests will inevitably result in false pos results
- example takedown: http://www.americanscientist.org/issues/id.14344,y.0,no.,content.true,page.1,css.print/issue.aspx
- from andrew gelman http://andrewgelman.com/2010/12/what_do_practit/
  - (wouldn't take all these as gospel)
- pre-process your vars; discretize numeric, avg several related
- usu fine to just treat discrete outcomes (eg 1-5 scale) as numeric; don't worry about ordered logit/probit/etc., just run the reg already
- take the 2 most important input vars in your reg & throw in their interaction
- The key assumptions of a regression model are validity and additivity. Except when you’re focused on predictions, don’t spend one minute worrying about distributional issues such as normality or equal variance of the errors.
- Likert scale aka rating scale: psychometric scale (5/7-point)
- ordinal data, but not interval/linear data; requires nonparam tests
- factorial design: 2+ factors, ea w discrete values/levels; experimental units
take on all combos
- eg compare 2 motors at 2 speeds (4 combos)
- permutations without replacement: P(n,r) = n! / (n-r)!
- permutations with replacement: n^r
- combinations without replacement (partitioning sets): C(n,k) = n! / (k! (n-k)!)
- this is the binomial coefficient
- multinomial coefficient: partitioning into multiple sets
- binomial is actually C(n; k, n-k)
- C(n; k1, k2, ..., kp) = n! / (k1! k2! ... kp!)
- combinations with replacement (multisets/bags):
C(n-1 + k; n-1, k) = (n-1+k)! / (k! (n-1)!)
- proof using "stars and bars"
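the counting formulas above, checked with Python's math module (n=5, r=3 and the 2,2,2 partition are arbitrary example values):
```python
from math import comb, factorial, perm

n, r = 5, 3
perm_no_repl = perm(n, r)            # n! / (n-r)!            = 60
perm_repl    = n ** r                # n^r                    = 125
comb_no_repl = comb(n, r)            # n! / (r! (n-r)!)       = 10
comb_repl    = comb(n - 1 + r, r)    # multiset coefficient   = 35
# multinomial coefficient C(6; 2, 2, 2) = 6! / (2! 2! 2!)     = 90
multinomial  = factorial(6) // (factorial(2) * factorial(2) * factorial(2))
```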
- abstract algebra: study of algebraic structures such as groups, rings, fields, modules, vector spaces, and algebras
- topology: extension of geometry that focuses on structure
- algebraic topology: uses abstract algebra
- topological graph theory: topology + graph theory; studies embeddings of graphs in surfaces
- number theory
- computational number theory, algorithmic number theory: applications in crypto
\begin{equation} e^x = 1 + x \left( 1 + x \left( \frac{1}{2} + x \left( \frac{1}{6} + x \left( \frac{1}{24} + x \left( \dots \right) \right) \right) \right) \right) \end{equation}
- Pythagorean theorem
- easy proof
Interesting questions that came up during ICFPC08
-
find the tangent lines (up to four) between a pair of circles
- find the tangent lines between a circle and a point
-
if we roll a circle around the outside of an ellipse and trace the path of its center, is the resulting oval an ellipse?
-
circumcenter of a polygon = point equidistant to all vertices = center of circumscribed circle = intersection of all edge-bisecting lines
inverting matrices
- Gaussian elim yields row echelon form (0s in lower-left)
- Gauss-Jordan elim yields reduced row echelon form ($I$)
From http://en.wikipedia.org/wiki/Sparse_matrix:
Conceptually, sparsity corresponds to systems which are loosely coupled. Consider a line of balls connected by springs from one to the next; this is a sparse system. By contrast, if the same line of balls had springs connecting every ball to every other ball, the system would be represented by a dense matrix. The concept of sparsity is useful in combinatorics and application areas such as network theory, of a low density of significant data or connections.
eigenvalues and eigenvectors
- treating $M$ as linear xform, which vectors would keep same direction (but possibly scaled)? ($Mx=\lambda x$)
- $M$ xforms unit sphere $x^T x = 1$ into ellipsoid where evecs are axes, evals are radii
- if $M$ has (centered) data pts as row vecs, covariance matrix $M^T M$ has evecs as principal components, evals (up to scaling) as their variances
- shear: 1 evec; uniform scale: $\infty$ evecs; non-uniform scale: 2 evecs; rotate: 0 evecs
-
power method or power iteration: eigenvalue algo (see sketch after this list)
- given mat $A$, find scalar $\lambda$ (eigenvalue) and non-0 vector $v$ (eigenvector) s.t. $A v = \lambda v$
- doesn't compute matrix decomposition so suitable for large sparse $A$
- but will only find one eigenvalue (with max abs val, aka dominant eigenpair) and may converge slowly
- start with random vector and iteratively multiply by $A$ and normalize
- used in pagerank, HITS
- http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf
- for transition matrix of markov chain, evec is stationary state $\pi$
- for ratings matrix (eg rows are users, cols are movies), evecs of $M M^T$ are row features, of $M^T M$ are col features
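the power-iteration sketch referenced above (assuming NumPy; a rough sketch that converges to the dominant eigenpair when one eigenvalue strictly dominates in magnitude):
```python
import numpy as np

def power_iteration(A, iters=1000, tol=1e-10):
    """Repeatedly multiply by A and normalize to find the dominant eigenpair."""
    v = np.random.default_rng(0).standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        w /= np.linalg.norm(w)          # assumes A @ v is nonzero
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    lam = v @ (A @ v)                   # Rayleigh quotient = dominant eigenvalue
    return lam, v
```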
symmetric positive definite, positive semi-definite, negative definite, indefinite, etc.
-
$M$ is positive definite iff any of:
- $z^T M z > 0$ for all nonzero vector $z$
- all evals of $M$ are pos
- all leading principal minors are pos
  - i.e. upper left $k \times k$ submatrix of $M$ has pos det for all $k = 1,2,\dots,n$
  - eg: if $M$ is diagonal, then $M$ is pos def iff all diag elts are pos (diag elts are evals)
-
$M$ is neg def iff any of:
- $-M$ is pos def, i.e. $x^T M x < 0$ for all nonzero vector $x$
- all evals of $M$ are neg
- upper left $1 \times 1$ submatrix has neg det, upper left $2 \times 2$ submatrix has pos det, upper left $3 \times 3$ submatrix has neg det, etc.
-
$M$ is pos semi-def iff any of:
- $z^T M z \ge 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-neg
- note: if $M$ is pos semi-def, then all upper left $k \times k$ submatrices have non-neg dets, but this is not a sufficient condition any more; same for neg semi-def below
-
$M$ is neg semi-def iff any of:
- $z^T M z \le 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-pos
-
$M$ is indef iff any of:
- $M$ is neither pos semi-def nor neg semi-def
- $M$ has pos and neg evals
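a quick numerical check of the equivalent conditions above (assuming NumPy and a symmetric $M$; the eigenvalue test and Sylvester's leading-minor criterion should agree):
```python
import numpy as np

def is_positive_definite(M):
    """Check positive definiteness two equivalent ways (symmetric M assumed)."""
    M = np.asarray(M, dtype=float)
    evals_positive = np.all(np.linalg.eigvalsh(M) > 0)          # all evals pos
    minors_positive = all(np.linalg.det(M[:k, :k]) > 0          # all leading
                          for k in range(1, M.shape[0] + 1))    # principal minors pos
    return evals_positive and minors_positive
```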
definitions
- vectors linearly dependent iff
  - any of the vectors can be written as a linear combo of the others (redundancy)
  - some linear combo with not-all-0 coefs equals 0 (indep means only the all-0 coefs give 0)
  - for $n$ vectors in $\Re^n$: they fail to span $\Re^n$
  - determinant of vectors as cols is 0
- basis of vector space: linearly indep spanning set of vectors
- subspace of $\Re^n$: set of vectors closed under linear combination (hence must incl. origin)
- nullspace of $A$ is set of vectors $x$ s.t. $Ax=0$; a valid subspace
- only way to represent a line in >2D is parametrically ($t$)
- A matrix $A$ is diagonalizable if there exists a non-singular matrix $B$ and a diagonal matrix $D$ such that $A = BDB^{-1}$.
singular value decomposition (SVD): important factorization of a
rectangular real or complex matrix
- many applications in signal processing, statistics
- eg computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix
- singularity
- matrix is singular iff it has no inverse iff any eigenvalue is zero iff determinant is 0
- matrix is nonsingular iff invertible iff all eigenvalues are non-zero iff determinant is non-0
- characteristic polynomial: $p(\lambda) = \det(A - \lambda I)$
  - eigenvalues are roots of characteristic polynomial
- eigenspace: for a given eigenvalue, linear subspace spanned by its eigenvectors
- if $M$ is pos def then $z^* M z > 0$ for any nonzero vector $z$
- a pos def matrix has pos def inverse
- all eigenvalues of pos def are positive
- a real sym matrix has real sym inverse
- kernel of a mapping: elts that map to the zero elt
- kernel aka null space of a matrix $A$: vectors $x$ with $Ax=0$; its dimension is the nullity of $A$
foundations
-
vector space: eg $\mathbf{R}^2$ (set of all 2-dim vecs)
-
subspace: subset of vector space closed under linear combination
- subspaces of $\mathbf{R}^2$: $\mathbf{R}^2$, any line through origin, and ${(0,0)}$
- subspaces of $\mathbf{R}^3$: $\mathbf{R}^3$, any plane through origin, any line through origin, and ${(0,0)}$
-
column space $C(A)$ is span of $A$'s col vecs
- if not fully linearly indep, then fewer dimensions spanned than #columns
-
$Ax=b$ can be solved iff$b \in C(A)$
-
kernel aka null space $N(A)$ is set of all solutions to $Ax=0$
- solutions $x$ always form a subspace, i.e. linear combos of solutions are also solutions
$N(A)=0$ means col vecs of$A$ lin indep ($C(A)$ is a basis)
- solutions
-
subspace: subset of vector space closed under linear combination
-
TODO
- stochastic matrix
- eigen*, singular, etc.
- http://www.cut-the-knot.org/arithmetic/algebra/eigens.shtml
- spectral decomposition aka eigendecomposition
- spectral theorem
norms/distance functions
- taxicab geometry aka rectilinear dist aka L1 distance aka $\ell_1$ norm aka city block dist aka manhattan dist aka manhattan length
  - sum of absolute diffs in coordinates
- hamming dist is manhattan dist of unit hypercube (since hamming dist components are either 0 or 1; hamming dist is # positions that are diff)
- euclidian dist aka $L^2$ / $\ell^2$ dist/norm
- $p$ norm aka $L^p$ norm: $( \sum_i |x_i|^p )^{1/p}$
- maximum norm aka infinity norm aka uniform norm aka supremum norm: $\max( |x_1|, \dots, |x_n| )$
- triangle inequality holds for $n$-dim vectors
RSA (TODO REVIEW!)
- choose primes p, q
- primality testing: wheel factorization is certain; there are also probabilistic tests
- find $e$ such that $\gcd(e, (p-1)(q-1)) = 1$
  - i.e., find $a$ such that $ae + b(p-1)(q-1) = 1$. This is a diophantine equation and can be solved using the Euclidian algorithm.
- exponentiation: find $M^e \mod n$, where $n = pq$
  - $M$ is data, $(e, n)$ is the pub key; the priv key is the exponent $d = e^{-1} \mod (p-1)(q-1)$
  - square and multiply algorithm
  - Chinese Remainder Theorem (CRT) for more complex, faster algos
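a toy end-to-end sketch with tiny primes (the classic 61/53 textbook numbers; real keys use huge primes), using Python's built-in modular exponentiation and modular inverse:
```python
# toy RSA; not secure, just illustrates the arithmetic
p, q = 61, 53
n = p * q                      # 3233, public modulus
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, gcd(e, phi) = 1
d = pow(e, -1, phi)            # private exponent via extended Euclid (2753)
M = 65                         # message
C = pow(M, e, n)               # encrypt (square-and-multiply under the hood)
assert pow(C, d, n) == M       # decrypt recovers M
```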
-
How To Share A Secret (Shamir 79)
- split into $n$ pieces such that any $k$ can reconstruct but $k-1$ can't
Audio & Video Compression (Keith Winstein's SIPB cluedump 2008/11/06)
- huffman code: recurse: least freq are same len, differ only in last bit
- keithw: slides
- entropy coding
- arithmetic coding can get us flatter?
general
- Shannon's theorem: describes channel capacity, the max information rate at
which reliable comm is possible over a channel that has a certain error prob
or SNR
- only existential, not constructive, hence actual capacity depends on algo
maximum entropy
- given constraints ("testable information"), the dist that best represents the
current state of knowledge is the one w largest entropy
- i.e., choose the least informative (and thus most generalizable) dist given what we know, since we don't know anything else
- also, many physical systems tend to move toward maxent configs over time
- makes explicit our freedom in using different forms of prior information
- applications
- obtain priors for bayesian inference
- maxent models: observed data itself is the testable information
- examples
- given just mean and sd, use normal
- given just the support [a,b], use uniform over [a,b]
  - exponential dist w mean $1/\lambda$ is maxent dist over all dists on $[0,\infty)$ w mean $1/\lambda$
shannon entropy
-
$H(X) = E(I(X)) = \sum_i p(x_i) I(x_i) = - \sum_i p(x_i) \log p(x_i)$, assuming $X$ has possible values $x_i$
- $I(X)$ is the information content of $X$
- if $p_i = 0$, the $0 \log 0$ is taken to be 0
- eg binary entropy: $H(X)$ for varying $P(X=1)$
- distinguish btwn entropy of a distribution of messages/symbols and the
entropy rate of a stochastic process
- eg need 7 bits to distinguish the first 128 symbols generated in the fibonacci sequence
- the fib sequence formula itself has a much lower entropy and applies to any length fib seq
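a minimal entropy function matching the formula above (log base 2, with the $0 \log 0 = 0$ convention):
```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum p_i log2 p_i, skipping p_i = 0 terms."""
    return -sum(p * log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])   # 1 bit (fair coin)
entropy([0.9, 0.1])   # ~0.469 bits (binary entropy of a biased coin)
```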
Fisher information
- reflects amt of (pos) info of observer
- depends on PDF & its first derivatives
- FIM is the curvature of the KL divergence
minimum description length (MDL)
- best hyp yields best compression; occam's razor
- minimum message length (MML) is very similar
stochastic processes
- bernoulli process: stream of discrete & independent events, usu binary; discrete-time, unlike poisson
- poisson process: stream of continuous & independent events occurring at const
avg rate; continuous-time (infinite-granularity) bernoulli process
- characterized by rate param
$\lambda$
- characterized by rate param
distributions
- bernoulli: heads or tails
- binomial: number of successes in $n$ draws with replacement (multiple bernoulli trials)
- hypergeometric: number of successes in $n$ draws without replacement (unlike binomial)
- poisson: # events over some amount of time according to poisson process; infinite-granularity binomial
- exponential: time btwn events in poisson process
- gaussian/normal
- log-normal distribution: dist of a RV whose log is normal
- in Black-Scholes model, changes in the log of exchange rates, price indices, and stock market indices are assumed normal
- Cauchy: occurs in spectroscopy
- gamma: sum of fixed # exponentially dist vars
- eg total time for some # ppl to arrive
- great reference illustrating relationships among distributions: http://www.johndcook.com/distribution_chart.html
misc
- law of total expectation: $E(E(X|Y)) = E(X)$
- law of total variance: $V(X) = V(E(X|Y)) + E(V(X|Y))$
- $P(X < \min(Y,Z)) = P(X > \max(Y, Z)) + 1 - P(X > Y) - P(X > Z)$
- Chebyshev's inequality: $P(|X-\mu| \ge k \sigma) \le \frac{1}{k^2}$ for any real $k>0$
- Jensen's inequality: If $f$ is convex and $X$ is a random variable, $f(E(X)) \le E(f(X))$.
problems
-
expected # flips before HTH > that of HTT, because HTH overlaps with itself
- sketch: with a million flips, same # HTH and HTT, but HTH are more clumped, so more space btwn them
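a quick Monte Carlo check of this claim (sample size and use of the stdlib random module are arbitrary choices):
```python
import random

def flips_until(pattern):
    """Number of coin flips until the given pattern first appears."""
    seq = ""
    while not seq.endswith(pattern):
        seq += random.choice("HT")
    return len(seq)

n = 20_000
print(sum(flips_until("HTH") for _ in range(n)) / n)   # ~10 flips on average
print(sum(flips_until("HTT") for _ in range(n)) / n)   # ~8 flips on average
```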
-
st. petersburg's paradox http://mindyourdecisions.com/blog/2010/06/07/the-st-petersburg-paradox-a-flimsy-critique-of-expectation-theory-by-people-who-dont-know-math-or-economics/
- game: toss coin until H; payoff is $2 for 0T, $4 for 1T, $8 for 2T, ...
- expected payout is $1/2 \cdot 2 + 1/4 \cdot 4 + 1/8 \cdot 8 + ... = \infty$
- but must consider realistic payouts (nobody has $\infty$), utility of payout (e.g. log scale), and weight-vs-risk (why you don't play the lottery)
-
kelly criterion: optimal fraction to bet in a series of bets to maximize the long-run growth rate, i.e. the expected log of wealth
- shapley shubik power index: measure power of players in weighted voting game
- of all possible permutations, when votes in favor happen in order, which player tilts the vote? rank them by the frequency this applies
- pareto efficiency aka pareto optimality: a possibly local optimum
- can't improve one person without worsening another
- nash equilibrium: nobody can improve by changing only own strategy
unilaterally
- assumes each player knows equilibrium strategies of others
- doesn't mean there aren't strategies that are better for all
- eg competing businesses forming cartel to increase profits
- may appear irrational to 3rd party, since not necessarily pareto optimal
puzzles
- pirates
- prisoner's dilemma
- if A testifies against B and B is silent, A goes free and B gets full 10yr
- if both silent, both get 6mo
- if both testify, both get 5yr
- regardless of what opponent chooses, you always are better off betraying
- nash equilibrium not nec pareto-optimal
- non-0-sum game
- hotelling's game
  - 2 competing hot dog stands $(i,j)$ on a beach that ranges over $[-1, 1]$
  - customers go to nearest stand
  - stands will naturally go to $(0,0)$, even though social opt is $(-.5,.5)$
- 2 competing hot dogs
- fermat's last theorem: no 3 pos ints $a,b,c$ can satisfy $a^n+b^n=c^n$ for any int $n>2$
  - long-standing conjecture; proven in 1995
-
set is well-ordered iff has total ordering and any non-empty subset has a least elt
-
harmonic series: $\sum_{n=1}^{\infty}\frac{1}{n}=\infty$
-
for density function $f(x)$ with one hump, choose $a,b$ on opposite sides of the hump with $f(a) = f(b)$, then $[a, b]$ is the shortest interval with its mass
- http://www.johndcook.com/blog/2009/02/23/finding-shortest-confidence-interval/
- little's law: $L=\lambda W$ where $L$ is avg number of customers in system, $\lambda$ is avg arrival rate, $W$ is avg time each customer spends in system
- Kendall's notation: A/B/C where A is arrival process, B is service time dist, C is # servers
- M/M/1 queue: Poisson arrivals, exp'l service time, 1 server
- http://en.wikipedia.org/wiki/Kendall%27s_notation
measure theory
- studies ways of generalizing the notions of length/area/volume
- measure theory asks questions like:
- How do we define a measure on our space? (Jordan measure and Lebesgue measure are two different options in Euclidean space.)
- What properties does our measure satisfy? (For example, does it satisfy translational invariance, rotational invariance, additivity?)
- Which objects are measurable/which objects can we say it's okay not to measure in order to preserve nice properties of our measure? (The Banach-Tarski ball can be rigidly reassembled into two copies of the same shape and size as the original, so we don't want it to be measurable, since then we would lose additivity properties.)
- TODO URL
- TODO name: $(1-1/x)^k \approx e^{-k/x}$, $(1-k/m)^n \approx e^{-kn/m}$ (last seen http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
-
graph theory: a branch of combinatorics
-
eulerian paths
- eg bridges of Konigsberg
- poly-time algos
- euler trail aka eulerian path aka eulerian walk: visit each edge once; iff only 2 nodes have odd degrees
- eulerian circuit aka eulerian cycle aka eulerian tour: also return to start; iff all nodes have even degrees
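a sketch of the degree-parity check above for undirected connected (multi)graphs; the adjacency-dict format and the Konigsberg encoding are just for illustration:
```python
def eulerian_status(adj):
    """adj: dict node -> list of neighbors (repeat entries allowed for multi-edges).
    Assumes the graph is connected; returns 'circuit', 'path', or 'none'."""
    odd = sum(1 for nbrs in adj.values() if len(nbrs) % 2 == 1)
    if odd == 0:
        return "circuit"   # all degrees even -> eulerian circuit
    if odd == 2:
        return "path"      # exactly two odd-degree endpoints -> eulerian path
    return "none"

# bridges of Konigsberg: four land masses, seven bridges, all degrees odd
konigsberg = {"A": ["B", "B", "C", "C", "D"], "B": ["A", "A", "D"],
              "C": ["A", "A", "D"], "D": ["A", "B", "C"]}
print(eulerian_status(konigsberg))   # 'none'
```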
-
hamiltonian path: visit each vertex once; NP-complete
-
overview: http://arxiv.org/abs/condmat/0303516
-
spectral graph theory
  - study of graphs and the characteristic polynomials, evals, and evecs of $A$ or $L$
- study of graphs and the characteristic polynomials, evals, and evecs of
-
digraph: directed graph
-
sparse graph: # edges close to min # edges
-
dense graph: # edges close to max # edges
-
finite graph: finite # edges and # nodes
-
locally finite graph: each node has finite degree
-
maximal clique: clique that's not a subset of any other clique
-
maximum clique: biggest clique(s) in the graph
-
multigraph: a graph with multiple edges between same node pairs
-
spectrum of a graph: spectrum (eigenvalues) of graph's adj mat $A$
- isomorphic graphs have same spectrum
-
$d$-regular graph is connected iff $\lambda_1 > \lambda_2$
$d$ -regular graph has largest eval$d$
-
laplacian matrix of a graph (of n nodes): $L = D - A$ where
- $D = diag(d_1, ..., d_n)$, where $d_i$ = degree of node i
- $A$ = adj mat
-
-
spectral embedding: plotting a graph using some of its spectral eigenvectors as coordinates
-
spectral gap: difference between largest evals, $\lambda_1 - \lambda_2$
expander graph: sparse graph with high connectivity
- name: subsets "expand" quickly
- standard constraint:
$d$ -regular - expansion parameters: edge expansion, vertex expansion, spectral expansion
- hard to compute (exponentially many subsets)
- lower- and upper-bounded by expressions on the gap of the graph
- many apps involve random walks on expanders
- markov transition matrix is just proportional to the adj mat
- applications
- fault tolerance: networks with guaranteed connectivity (removing random edges does not reduce property of an expander by much)
- more surprisingly: ECCs, PRNGs, crypto hashes, PCP theorem
-
$k$-regular graph: each node has degree $k$
- max degree is an eval
-
clustering coefficient of some node is (# links between neighbors / # possible); best to refer to wikipedia illustration
- intuitively: how close a node + neighbors are to being a clique
-
centrality: importance of a node; various measures
- degree centrality: just the node degree
  - nodes that occur on many shortest paths between pairs of nodes in the graph have a high betweenness centrality score
-
eigenvector centrality: proportional to sum of centrality scores of
neighbors
- calc eigendecomp of adj matrix, select evec w biggest eval
  - elt $i$ in evec gives centrality of node $i$
-
complex networks: exhibit non-trivial topological features
- e.g.:
- heavy tail in degree distribution
- high clustering coefficient TODO
- assortativity: preference for a network's nodes to attach to others that are similar or different in some way
- clustering at many scales
- hierarchical structure
- e.g.:
-
scale-free network: describes many "real-world networks"
- some nodes are highly connected hubs (high degree)
- most nodes are of low degree
- degree distribution follows power law
- structure and dynamics are independent of # nodes
- preferential attachment
- fault-tolerance: assuming uniformly random node failures, unlikely that a hub fails
-
hypergraph: generalization of a graph, where edges can connect any number of vertices
-
de Bruijn graph: with n dimensions and m symbols, this is a directed graph of m^n vertices, one for each length-n sequence of symbols
- rule for edges is more complicated; captures overlaps between sequences of symbols
-
butterfly graph: recursive criss-crosses
- found in fft
-
butterfly network: a network that's almost partitioned, connected only by a few edges
- different from butterfly graph
-
network theory: application of graph theory to networks, which are graphs that represent relations between objects
-
network coding: "combining" values to maximize information flow
-
Euclidian graph: edge weights = Euclidian distance
-
random graph: graph generated by randomly adding edges to a node set
- there are different random processes used
- Erdos number: collaborative distance to ("degrees away from") Paul Erdos
- eg: start w all disconnected nodes, then create an edge btwn each pair w some prob
-
small world networks
- randomly adding few edges to high-diameter graph drops diameter drastically
- typical properties:
- high clustering coefficient
- short avg path length
- over-abundance of hub nodes
- many local links and few long range "shortcuts"
  - create ring of nodes, each connected to $k$ nearest neighbors; then w some prob, rewire each edge to an existing dst node
-
planar graph
-
$O(n)$ to determine planarity - sparse:
$O(v)$ edges, asymptotically smaller than max ($O(v^2)$) - polyhedra can be "unrolled" into planar graphs
-
-
Euler characteristic: number that describes an aspect of a topological space's shape or structure
-
Euler's formula: for finite, connected, planar graphs, $v - e + f = 2$, where $f$ is # faces
- i.e., Euler characteristic is 2
category theory
- a category consists of
- a class of objects
- a class of morphisms aka arrows
  - for each morphism $f$, a domain object and a codomain object
  - for each object $A$, an identity morphism $1_A: A \rightarrow A$
  - for each pair of morphisms $f: A \rightarrow B$ and $g: B \rightarrow C$, a composite morphism $g \circ f: A \rightarrow C$
- with these rules:
  - identity composition: for each $f: A \rightarrow B$, $f \circ 1_A = f$ and $1_B \circ f = f$
  - associativity
- functors: morphisms from category to category
  - define functor $F: C1 \rightarrow C2$
    - every object $A \in C1$ maps to an object $F(A) \in C2$
    - every map $A \rightarrow B \in C1$ maps to $F(A) \rightarrow F(B) \in C2$
  - must preserve rules:
    - $F(1_A) = 1_{F(A)} \forall A \in C1$
    - $F(g \circ f) = F(g) \circ F(f) \forall f: A \rightarrow B, g: B \rightarrow C$ where $A,B,C \in C1$