We all understand that a sample mean--the average value of X in our sample--estimates E(X), the mean value of X in the sampled population. But for a general estimator, the meaning is less clear. What does "T estimates θ" mean?
Say we have a sample of size n, meaning that we have n datapoints. To emphasize the dependence on n, write T as Tn.
Consider the case of T and θ being the sample mean and population mean. One can show that T is an unbiased estimate of θ here, i.e.
E(Tn) = θ
The expected value here is over all possible samples of size n from the population. Sometimes Tn is too large, sometimes too small, but on average it is the right value.
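To make "over all possible samples" concrete, here is a minimal simulation sketch (not from the original text; the population distribution, θ, n, and the number of replications are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 70.0     # population mean E(X), chosen arbitrarily for illustration
n = 25           # sample size

# draw many samples of size n and average the resulting sample means
sample_means = [rng.normal(theta, 15, n).mean() for _ in range(100_000)]
print(np.mean(sample_means))   # close to theta: the sample mean is unbiased
```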
Or, say we have a linear model of human weight vs. height,
mean(weight | height) = β0 + β1 height
Denote our sample estimates of the βi, from the usual least-squares method, by bi. The latter also turn out to be unbiased,
E(bi) = βi
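The same kind of check can be simulated for the regression case; in this sketch the "population" values of β0 and β1, the height distribution, and the error variance are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = -100.0, 3.9        # hypothetical population coefficients
n, reps = 50, 20_000

ests = np.empty((reps, 2))
for r in range(reps):
    height = rng.normal(67, 3, n)                        # heights in inches
    weight = beta0 + beta1 * height + rng.normal(0, 10, n)
    X = np.column_stack([np.ones(n), height])
    ests[r] = np.linalg.lstsq(X, weight, rcond=None)[0]  # least squares

print(ests.mean(axis=0))   # averages near (beta0, beta1): the bi are unbiased
```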
If in addition to being unbiased, Var(Tn) is small, then Tn doesn't vary much from one sample to another. Coupling this with the fact that it is "centered" around θ, we have that Tn should "usually" be near θ--a very desirable property.
Actually most estimators are not unbiased. Say we have a logistic model
P(employed | age) = 1 / [1 + exp(-{β0 + β1 age})]
Due to the nonlinearity of the model, here the bi will NOT average out to βi over all possible samples.
We do, however, hope that the amount of bias is small. We have no usable expression for the bias in the logit case, but one can show that for this and many other estimators
E(Tn) - θ = O(1/n) as n -> ∞.
We thus have faith that bi has reasonably small bias even for small n.
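For readers who want to see the shrinking bias numerically, here is a rough simulation sketch; the logistic fit uses a small hand-rolled Newton-Raphson routine rather than any particular library, and the "true" β0, β1 are made-up values:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([-3.0, 0.08])        # hypothetical true (beta0, beta1)

def fit_logit(X, y, iters=25):
    """Unpenalized logistic MLE via Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ b, -30, 30)            # guard against overflow
        p = 1 / (1 + np.exp(-eta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        b += np.linalg.solve(hess, grad)
    return b

for n in (50, 200, 800):             # bias should fade roughly like 1/n
    ests = []
    for _ in range(4000):
        age = rng.uniform(20, 65, n)
        X = np.column_stack([np.ones(n), age])
        y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
        ests.append(fit_logit(X, y))
    print(n, np.mean(ests, axis=0) - beta)   # average estimate minus truth
```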
In a similar spirit, we say that Tn is statistically consistent for θ if
lim Tn = θ as n -> ∞
and then, again, consider this to signify that Tn is probably reasonable for small n too.
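In simulation terms (again a toy sketch, with an arbitrary population and arbitrary sample sizes), consistency just says that a single Tn settles down near θ as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 70.0                          # population mean, arbitrary choice

for n in (100, 10_000, 1_000_000):
    Tn = rng.normal(theta, 15, n).mean()
    print(n, Tn)                      # drifts toward theta as n grows
```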
Many ML specialists are wary of classification problems, say with 2 classes, in which there is much more data in one class than the other. Their main concern is:
Almost all of our predictions of future cases will be to guess Y = the dominant class.
That of course is true, but on the other hand, it SHOULD be that way, if the data is representative of the class proportions in the population.
Moreover, if say we use a logistic model, the bi will estimate the βi in the above sense. Here is why.
Say first we wish to model mean weight as a linear function of height in the above example. There are two possibilities:
- Random-X regression: We sample n random people from the population, so both the heights and weights in our sample data are random.
- Fixed-X regression: We decide to sample people of specific heights. Here only the weights will be random.
Since our model conditions on height anyway, it really doesn't matter which of the above two sampling schemes is used. Either way, the bi will estimate the βi, e.g. will be statistically consistent. (And as noted, they are even unbiased.) But the fixed-X setting will better illustrate the issues.
In a fixed-X setting, consider two possibilities:
- We sample mostly at the shorter height levels.
- We sample mostly at the taller height levels.
Either way, the bi will estimate the βi, in the above sense of consistency. Actually, we will probably get smaller standard errors from the second approach, but either way those standard errors will go to 0 as the sample size grows.
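Here is a sketch of the fixed-X comparison (the two height designs, the coefficient values, and the error variance are all invented; the standard errors come from the usual OLS formula):

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, sigma = -100.0, 3.9, 10.0      # hypothetical population values
n = 200

designs = {                                  # height levels chosen by the experimenter
    "mostly short": np.linspace(60, 66, n),
    "mostly tall":  np.linspace(68, 76, n),
}

for name, height in designs.items():
    X = np.column_stack([np.ones(n), height])
    weight = beta0 + beta1 * height + rng.normal(0, sigma, n)  # only Y is random
    b, *_ = np.linalg.lstsq(X, weight, rcond=None)
    resid = weight - X @ b
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))         # OLS std. errors
    print(name, b, se)   # both designs give b near (beta0, beta1); SEs differ
```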
Say we are interested in the gender pay gap, and have a model
mean(wage | age, gender) = β0 + β1 age - β2 1male,
where 1male is an indicator variable for men.
Now suppose we have many more men than women, say
nmen = 10 nwomen
and n = nmen + nwomen.
For the same reason as above, the resulting bi will be statistically consistent estimators of the βi, IN SPITE OF THE IMBALANCE. The imbalance has nothing to do with it.
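To close, here is one more sketch, with made-up coefficient values and the 10-to-1 male/female imbalance from above, illustrating that the imbalance does not hurt consistency:

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, beta2 = 20.0, 0.5, 8.0     # hypothetical population values

for n_women in (100, 1_000, 10_000):
    n_men = 10 * n_women                 # heavy imbalance, as in the text
    male = np.r_[np.ones(n_men), np.zeros(n_women)]
    age = rng.uniform(25, 60, male.size)
    wage = beta0 + beta1 * age - beta2 * male + rng.normal(0, 10, male.size)
    X = np.column_stack([np.ones_like(age), age, male])
    b, *_ = np.linalg.lstsq(X, wage, rcond=None)
    print(male.size, b)    # approaches (beta0, beta1, -beta2) despite imbalance
```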