- Our major goal in machine learning is to generalize beyond the examples seen during training.
- Memorisation is not a good strategy for learners because we often see only a fraction of the examples we will encounter in real-world settings.
- Luckily, induction works pretty well: by making certain assumptions, ML algorithms are able to extract general knowledge from a small number of examples.
"The ability to perform well on previously unobserved inputs is called generalization".
- How hard is this objective? Well, generally speaking, the more data we have, the better our chance of figuring out the ground-truth function 'f'. Our success also depends on the number of functions that could be the ground truth (the fewer, the better). But again, machine learning is not just about optimization, where we reduce the training error. What we really care about is the generalization error (or test error), defined as "the expected value of the error on a new input, i.e. the probability that g disagrees with the ground truth f on the label of a random sample" [2][3].
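In symbols (a rough sketch in the notation of these notes, with f the ground truth, g the learned hypothesis, and D the input distribution), this definition could be written as:

```latex
% Generalization (test) error of g under 0-1 loss:
% probability of disagreeing with the ground truth f on a fresh sample from D
\mathrm{err}(g) \;=\; \Pr_{x \sim D}\big[\, g(x) \neq f(x) \,\big]
            \;=\; \mathbb{E}_{x \sim D}\big[\, \mathbf{1}\{ g(x) \neq f(x) \} \,\big]
```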
In machine learning we are interested in finding out how well our model will perform on instances it has never seen before.
We divide the dataset into a train set and a test set; the error rate on new cases is called the generalization error, and we can get a rough estimate of it by evaluating our model on the test set.
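A minimal sketch of this in Python (assuming scikit-learn; the toy data, split ratio, and classifier are illustrative, not prescriptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy data standing in for a real dataset (X: features, y: labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set that the learner never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Error on the held-out test set is our rough estimate of the generalization error.
train_error = 1 - model.score(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)
print(f"train error: {train_error:.3f}, estimated generalization error: {test_error:.3f}")
```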
We have the option to alter the hypothesis space (i.e. the set of functions the learning algorithm is allowed to select as the solution). Our challenge here is to choose a model with the right 'capacity': we pick models that match the complexity of the task to be solved and the amount of training data available. For example, a quadratic model will generalize well if the underlying function is quadratic.
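As a rough sketch of the capacity idea (illustrative only: a quadratic ground truth plus noise, fit by polynomial models of increasing degree):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
# Quadratic ground truth plus noise.
y = 2.0 * x[:, 0] ** 2 - x[:, 0] + rng.normal(scale=1.0, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)

# Hypothesis spaces of increasing capacity:
# degree 1 is too simple, degree 2 matches the ground truth, degree 15 is more flexible than needed.
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    tr = mean_squared_error(y_train, model.predict(x_train))
    te = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {tr:.2f}, test MSE {te:.2f}")
```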
To better understand generalization, we can decompose the generalization error into bias and variance.
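For squared-error regression this is the standard textbook decomposition (sketched here for reference, not quoted from the cited sources; expectations are over training sets and noise, and σ² is the irreducible noise):

```latex
\mathbb{E}\big[(\hat{g}(x) - y)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{g}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{g}(x) - \mathbb{E}[\hat{g}(x)]\big)^2\big]}_{\text{variance}}
  + \sigma^2
```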
High Variance: When a learning algorithm outputs a prediction function g whose generalization error is much higher than its training error (i.e. the gap between the training error and the test error is too large), it is said to have overfit the training data [2].
High Bias: Means the learning algorithm is underfitting the data. In practice we can look at the train-set error and the dev-set error: if the algorithm is not fitting the training set well (as shown by the train-set error), it is said to have high bias. An algorithm can have both problems if the train-set error is high and the gap between the train-set and dev-set error is large as well.
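A small sketch of this diagnosis, assuming we already have train and dev error numbers (the thresholds and example values are made up; what counts as "high" really depends on the baseline/Bayes error of the task):

```python
def diagnose(train_error, dev_error, baseline_error=0.0, tolerance=0.02):
    """Rough bias/variance read-out from train vs. dev error (illustrative heuristics only)."""
    high_bias = (train_error - baseline_error) > tolerance   # not fitting the training set well
    high_variance = (dev_error - train_error) > tolerance    # large train/dev gap
    if high_bias and high_variance:
        return "high bias AND high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "looks fine"

print(diagnose(train_error=0.15, dev_error=0.16))  # -> high bias (underfitting)
print(diagnose(train_error=0.01, dev_error=0.11))  # -> high variance (overfitting)
print(diagnose(train_error=0.15, dev_error=0.30))  # -> high bias AND high variance
```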
Overall, we have to strike a balance: keep the training error low while at the same time keeping the gap between training and test error small (i.e. generalizing well). Keep in mind that measuring generalization error this way is a proxy, since we know neither the ground truth 'f' nor the distribution 'D' (that we talked about earlier).
According to the “no free lunch” theorem, no learner can beat random guessing over all possible functions to be learned.