Training a Model (for Supervised Learning)
Model training is the process of fitting a specific model to a data set. In supervised learning, this involves using labeled input data to "teach" the algorithm the relationship (or target function) between the input features and the known output labels.
This process isn't fully automatic. It requires human-provided hyperparameters (non-learnable parameters, like the 'k' in kNN) to guide the learning process.
A primary challenge is determining whether the model is actually effective before using it on new, unknown data. To do this, we must first partition our labeled data using a sampling method such as holdout, k-fold cross-validation, or bootstrap sampling.
Data Partitioning Strategies
To evaluate a model, we can't just test it on the same data it used for training. We need to hold some data back.
The sections below describe the main model training strategies, such as holdout and k-fold cross-validation, and analyse the impact of each.
1. The Holdout Method
The input data is randomly partitioned into two or three separate sets. A portion of the labelled input data is held back (hence the name "holdout") for evaluation, acting as the test data for the purpose of validating the trained model.
Training Set (e.g., ~70%): The largest portion, used to actually train the model.
Validation Set (e.g., ~15%): A separate set used to tune hyperparameters and select the best-performing model (e.g., "Should I use k=3 or k=5?").
Test Set (e.g., ~15%): This data is held back and used only once at the very end. Its purpose is to provide a final, unbiased assessment of the chosen model's performance on unseen data.
Note: In simpler examples, a two-way 70/30 or 80/20 split into just "training" and "test" sets is used, where the "test" set also doubles as the validation set.
The model is trained on the training set, and its performance is measured by comparing its predictions on the test set against the known, actual labels.
- To ensure that the data in the training and test partitions are similar in nature, data items are assigned to the partitions at random.
Process:
The model is trained using the allocated training data.
The target function of the trained model is used to predict the labels of the test data.
The predicted values are compared with the actual label values of the test data.
The performance of the model is measured primarily by how accurately it predicts the label values.
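A minimal sketch of this process using scikit-learn: the dataset, the 70/15/15 proportions, and the kNN classifier are illustrative choices, not a prescribed setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # any labeled dataset

# First hold out the test set (15%), then carve a validation set
# (15% of the total) out of the remaining data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=42)

# Tune the hyperparameter k on the validation set.
best_k, best_acc = None, 0.0
for k in (3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Final, unbiased assessment on the test set, used only once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```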
2. K-fold Cross-validation (K-fold CV)
This is a more robust and computationally intensive alternative to the simple holdout method. It's essentially a repeated holdout process.
Process:
The entire dataset is randomly divided into k equal-sized non-overlapping partitions (or folds).
One fold (e.g., Fold 1) is held out as the test set.
The model is trained on the remaining k-1 folds.
The process is repeated k times, with each fold getting exactly one turn as the test set.
The model's final performance is the average of the performance scores from all k trials.
10-fold CV: The most common approach, where the data is divided into 10 folds; each fold serves as the test set once while the model is trained on the remaining 9.
Leave-one-out (LOOCV): An extreme version where k is set to the total number of data instances.
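A minimal sketch of 10-fold cross-validation with scikit-learn; the dataset and the kNN model are illustrative, and cross_val_score handles the fold loop internally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold CV: each fold is the test set exactly once; the model is
# retrained on the other 9 folds each time.
# (LeaveOneOut from sklearn.model_selection sets k = n for LOOCV.)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```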
3. Bootstrap Sampling
This technique relies on Simple Random Sampling with Replacement (SRSWR): data instances are picked at random from the input data set, with the possibility of the same instance being picked multiple times.
Process:
A data instance is randomly selected from the original dataset and added to the new training set.
Crucially, that same instance is put back into the original set (this is the "with replacement" part).
This selection process is repeated n times, where n is the size of the original dataset.
The resulting training set also has n instances, but some original data points will be repeated multiple times, while others (on average, ~36.8%) will be left out entirely. These "out-of-bag" samples are then used as the test set.
This method is the foundation for ensemble techniques like Bagging (Bootstrap Aggregating).
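A minimal sketch of bootstrap sampling with NumPy, assuming a generic labeled dataset of n rows (the iris data and kNN model are placeholders); the out-of-bag rows serve as the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(42)

# Sample n indices with replacement: some rows appear several times,
# others not at all.
boot_idx = rng.choice(n, size=n, replace=True)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)  # out-of-bag rows

model = KNeighborsClassifier(n_neighbors=5).fit(X[boot_idx], y[boot_idx])
print("out-of-bag fraction:", len(oob_idx) / n)   # ~0.368 on average
print("OOB accuracy:", accuracy_score(y[oob_idx], model.predict(X[oob_idx])))
```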
Eager vs. Lazy Learners
Learning algorithms can also be categorized based on how they handle information acquisition and generalization during the training process, i.e., when they do the work of building a model:
Eager Learners: These algorithms abstract and generalize information during the learning phase. They take more time to learn but do not need to reference the training data later when making decisions (classification).
Lazy Learners: These models skip the abstraction and generalization processes altogether. They essentially do not ‘learn’ in the strict sense and instead operate through rote learning (memorization based on repetition), classifying unlabelled data using the training data exactly as-is. They are also known as instance-based or non-parametric learning methods. Training time is short, but classification time is long because a comparison-based assignment occurs for every tuple of test data.
| Learner Type | Training Characteristics | Decision/Classification Process | Examples |
|---|---|---|---|
| Eager Learners | Take time to abstract and generalize information during the learning phase. | Classification is fast as the model is already built. They do not need the original training data. | Decision Tree, SVM, Neural Network |
| Lazy Learners | Training is fast (or non-existent). They skip abstraction and use rote learning (memorization). | Classification is slow. They must compare the new data point to the entire training dataset. | k-Nearest Neighbor (kNN) |
This distinction is also related to parametric vs. non-parametric models:
Eager Learners are typically parametric models (e.g., Linear Regression, SVM). They have a fixed number of parameters, regardless of the size of the training data.
Lazy Learners are typically non-parametric models (e.g., kNN). The "model" is the data, so the model's complexity and number of parameters grow with the size of the training data.
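As a concrete illustration of lazy learning, here is a minimal from-scratch 1-nearest-neighbor classifier (purely illustrative, not any library's API): "training" just stores the data, and all the work happens at prediction time by comparing each query against every stored instance.

```python
import numpy as np

class OneNearestNeighbor:
    """Lazy learner: fit() only memorizes the data (no abstraction);
    predict() compares each query to every stored training instance."""

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)  # the "model" is the data itself
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # For each query point, find the closest stored training instance.
        dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        return self.y_[dists.argmin(axis=1)]

# Usage: training is instant, but prediction cost grows with the size of
# the stored training set.
clf = OneNearestNeighbor().fit([[0, 0], [1, 1], [5, 5]], [0, 0, 1])
print(clf.predict([[0.2, 0.1], [4.8, 5.2]]))   # -> [0 1]
```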
