We all understand that a sample mean--the average value of X in our sample--estimates E(X), the mean value of X in the sampled population. But for a general estimator, the meaning is less clear. What does "T estimates θ" mean?
Say we have a sample of size n, meaning that we have n datapoints. To emphasize the dependence on n, write T as Tn.
Consider the case of T and θ being the sample mean and population mean. One can show that T is an unbiased estimate of θ here, i.e.
E(Tn) = θ
The expected value here is over all possible samples of size n from the population. Sometimes Tn is too large, sometimes too small, but on average it is the right value.
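To make "over all possible samples" concrete, here is a minimal simulation sketch (not from the original text; the population distribution, θ, n, and the number of replications are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 70.0     # population mean E(X), chosen arbitrarily for illustration
n = 25           # sample size

# draw many samples of size n and average the resulting sample means
sample_means = [rng.normal(theta, 15, n).mean() for _ in range(100_000)]
print(np.mean(sample_means))   # close to theta: the sample mean is unbiased
```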
Or, say we have a linear model of human weight vs. height,
mean(weight | height) = β0 + β1 height
Denote our sample estimates of the βi, from the usual least-squares method, by bi. The latter also turn out to be unbiased,
E(bi) = βi
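The same kind of check can be simulated for the regression case; in this sketch the "population" values of β0 and β1, the height distribution, and the error variance are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = -100.0, 3.9        # hypothetical population coefficients
n, reps = 50, 20_000

ests = np.empty((reps, 2))
for r in range(reps):
    height = rng.normal(67, 3, n)                        # heights in inches
    weight = beta0 + beta1 * height + rng.normal(0, 10, n)
    X = np.column_stack([np.ones(n), height])
    ests[r] = np.linalg.lstsq(X, weight, rcond=None)[0]  # least squares

print(ests.mean(axis=0))   # averages near (beta0, beta1): the bi are unbiased
```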
If in addition to being unbiased, Var(Tn) is small, then Tn doesn't vary much from one sample to another. Coupling this with the fact that it is "centered" around θ, we have that Tn should "usually" be near θ--a very desirable property.
Actually most estimators are not unbiased. Say we have a logistic model
P(employed | age) = 1 / [1 + exp(-{β0 + β1 age})]
Due to the nonlinearity of the model, here the bi will NOT average out to βi over all possible samples.
We do, however, hope that the amount of bias is small. We have no usable expression for the bias in the logit case, but one can show that for this and many other estimators
E(Tn) - θ = O(1/n) as n -> ∞.
We thus have faith that bi has reasonably small bias even for small n.
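For readers who want to see the shrinking bias numerically, here is a rough simulation sketch; the logistic fit uses a small hand-rolled Newton-Raphson routine rather than any particular library, and the "true" β0, β1 are made-up values:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([-3.0, 0.08])        # hypothetical true (beta0, beta1)

def fit_logit(X, y, iters=25):
    """Unpenalized logistic MLE via Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ b, -30, 30)            # guard against overflow
        p = 1 / (1 + np.exp(-eta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        b += np.linalg.solve(hess, grad)
    return b

for n in (50, 200, 800):             # bias should fade roughly like 1/n
    ests = []
    for _ in range(4000):
        age = rng.uniform(20, 65, n)
        X = np.column_stack([np.ones(n), age])
        y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
        ests.append(fit_logit(X, y))
    print(n, np.mean(ests, axis=0) - beta)   # average estimate minus truth
```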
In a similar spirit, we say that Tn is statistically consistent for θ if
lim Tn = θ as n -> ∞
and then, again, consider this to signify that Tn is probably reasonable for small n too.
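In simulation terms (again a toy sketch, with an arbitrary population and arbitrary sample sizes), consistency just says that a single Tn settles down near θ as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 70.0                          # population mean, arbitrary choice

for n in (100, 10_000, 1_000_000):
    Tn = rng.normal(theta, 15, n).mean()
    print(n, Tn)                      # drifts toward theta as n grows
```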
Many ML specialists are wary of classification problems, say with 2 classes, in which there is much more data in one class than the other. Their main concern is:
Almost all of our predictions of future cases will be to guess Y = the dominant class.
That of course is true, but on the other hand, it SHOULD be that way, if the data is representative of the class proportions in the population.
Moreover, if say we use a logistic model, the bi will estimate the βi in the above sense. Here is why.
Say first we wish to model mean weight as a linear function of height in the above example. There are two possibilities:
- Random-X regression: We sample n random people from the population, so both the heights and weights in our sample data are random.
- Fixed-X regression: We decide to sample people of specific heights. Here only the weights will be random.
Since our model conditions on height anyway, it really doesn't matter which of the above two sampling schemes is used. Either way, the bi will estimate the βi, e.g. will be statistically consistent. (And as noted, they are even unbiased.) But the fixed-X setting will better illustrate the issues.
In a fixed-X setting, consider two possibilities:
- We sample mostly at the shorter height levels.
- We sample mostly at the taller height levels.
Either way, the bi will estimate the βi, in the above sense of consistency. Actually, we will probably get smaller standard errors from the second approach, but either way those standard errors will go to 0 as the sample size grows.
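Here is a sketch of the fixed-X comparison (the two height designs, the coefficient values, and the error variance are all invented; the standard errors come from the usual OLS formula):

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, sigma = -100.0, 3.9, 10.0      # hypothetical population values
n = 200

designs = {                                  # height levels chosen by the experimenter
    "mostly short": np.linspace(60, 66, n),
    "mostly tall":  np.linspace(68, 76, n),
}

for name, height in designs.items():
    X = np.column_stack([np.ones(n), height])
    weight = beta0 + beta1 * height + rng.normal(0, sigma, n)  # only Y is random
    b, *_ = np.linalg.lstsq(X, weight, rcond=None)
    resid = weight - X @ b
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))         # OLS std. errors
    print(name, b, se)   # both designs give b near (beta0, beta1); SEs differ
```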
Say we are interested in the gender pay gap, and have a model
mean(wage | age, gender) = β0 + β1 age - β2 1male,
where 1male is an indicator variable for men.
Now suppose we have many more men than women, say
nmen = 10 nwomen
and n = nmen + nwomen.
For the same reason as above, the resulting bi will be statistically consistent estimators of the βi, IN SPITE OF THE IMBALANCE. The imbalance has nothing to do with it.
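To close, here is one more sketch, with made-up coefficient values and the 10-to-1 male/female imbalance from above, illustrating that the imbalance does not hurt consistency:

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, beta2 = 20.0, 0.5, 8.0     # hypothetical population values

for n_women in (100, 1_000, 10_000):
    n_men = 10 * n_women                 # heavy imbalance, as in the text
    male = np.r_[np.ones(n_men), np.zeros(n_women)]
    age = rng.uniform(25, 60, male.size)
    wage = beta0 + beta1 * age - beta2 * male + rng.normal(0, 10, male.size)
    X = np.column_stack([np.ones_like(age), age, male])
    b, *_ = np.linalg.lstsq(X, wage, rcond=None)
    print(male.size, b)    # approaches (beta0, beta1, -beta2) despite imbalance
```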