Model Representation and Interpretability
The primary objective of supervised machine learning is to learn or derive a target function that can most effectively map a set of input variables to a target variable.
- Generalization is key: it is the model's ability to apply knowledge learned from a finite training set to make accurate predictions on new, previously unseen data.
- This is difficult because the input data provides only a limited, specific view of the underlying problem. A model that fails to generalize correctly will exhibit either underfitting or overfitting.
Explain underfitting and overfitting with respect to the bias–variance trade-off.
1. Underfitting (High Bias)
Underfitting occurs when a model is too simple to capture the underlying trend in the data. The model fails to learn the relationships between features and the target, performing poorly with high error on both the training data and the test data.
Underfitting often happens when sufficient training data is not available.
Common Remedies:
Increase Model Complexity: Switch from a simple model to a more powerful one (e.g., from linear regression to polynomial regression); see the sketch after this list.
Feature Engineering: Add more relevant features that help the model capture the underlying relationships in the data.
Reduce Regularization: Decrease the penalty for complexity, allowing the model to fit the data more closely.
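The sketch below is a minimal illustration of the first remedy, assuming scikit-learn is available; the quadratic data is synthetic and purely illustrative, not part of these notes.

```python
# Minimal sketch: fixing underfitting by increasing model complexity.
# Assumes scikit-learn; the quadratic data is synthetic and illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # underlying pattern is quadratic

# A plain linear model is too simple (high bias) for this curved relationship.
linear = LinearRegression().fit(X, y)
print("linear R^2:", round(linear.score(X, y), 3))           # poor fit

# Adding polynomial features increases model complexity and reduces bias.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("degree-2 R^2:", round(poly.score(X, y), 3))           # much better fit
```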
2. Overfitting (High Variance)
Overfitting occurs when a model is excessively complex and fits the training data too closely, effectively memorizing it, including its noise and outliers. This leads to very low error on the training set but very high error on the test set.
- The model's decision boundary is tailored too closely to the specific deviations in the training data. When applied to new data that does not share these exact deviations, the model makes incorrect classifications.
Overfitting can be avoided by:
Resampling Techniques: Use methods such as k-fold cross-validation to obtain a reliable estimate of generalization error (see the sketch after this list).
Hold-out Validation Set: Keep back a validation data set that the model is not trained on, and check performance on it.
Feature Selection: Remove irrelevant features which have no predictive power to simplify the model.
Use More Training Data: Providing more examples can help the model learn the true underlying pattern instead of the noise.
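The snippet below is a minimal sketch of the first two points, assuming scikit-learn; the breast-cancer dataset and the decision-tree model are illustrative choices, not prescribed by these notes.

```python
# Sketch: detecting overfitting with a hold-out validation set and k-fold CV.
# Assumes scikit-learn; dataset and model are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained decision tree is prone to memorizing the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:     ", tree.score(X_train, y_train))   # close to 1.0
print("validation accuracy:", tree.score(X_val, y_val))       # noticeably lower: overfitting

# 5-fold cross-validation gives a more stable estimate of generalization error.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold CV accuracy: ", cv_scores.mean())
```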
The Bias–Variance Trade-off
This is the central conflict in model selection, describing the relationship between underfitting and overfitting.
Error in learning can be of two types:
- error due to bias, and
- error due to variance.
Bias: the error caused by wrong simplifying assumptions in the learning algorithm.
- High bias means the model is too simple, makes wrong assumptions, and leads to underfitting.
Variance: the error from a model's sensitivity to small fluctuations in the training data. It arises from differences between the training data sets used to train the model.
- High variance means the model is too complex and leads to overfitting.
The Trade-off: As model complexity increases, bias decreases (it fits the training data better), but variance increases (it becomes more likely to overfit).
Conversely, increasing the bias will decrease the variance. The goal is therefore to find an optimal balance between bias and variance that minimizes the total error on the test data, as the sketch below illustrates.
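The sketch below makes the trade-off concrete, assuming scikit-learn; the noisy sine data and the particular polynomial degrees are illustrative choices.

```python
# Sketch: the bias-variance trade-off as model complexity (polynomial degree) grows.
# Training error keeps falling, but test error rises again once the model overfits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 3, 15):  # underfit (high bias), balanced, overfit (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}  "
          f"train MSE: {mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE: {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```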
Evaluating Model Performance
After a model is selected, trained (for supervised learning), and applied, its performance must be evaluated. The quality of a model is relative and depends on the algorithm, the data, and the specific task requirements.
Different performance measures reflect the varied demands of tasks.
1. Supervised Learning: Classification
In classification, the model's performance is evaluated by recording the number of correct and incorrect predictions it makes. A prediction is correct if the predicted class label matches the actual (known) class label.
Accuracy
Accuracy is the most common metric, calculated as the ratio of correct predictions (whether predicting the class of interest or not) to the total number of predictions.
Accuracy alone can be highly misleading, especially on imbalanced datasets. A model with lower overall accuracy might actually be preferred if it has higher Sensitivity (the ability to correctly identify all positive cases), ensuring no malignant tumors are missed, even if that means raising more false alarms (False Positives).
A 99% accuracy is not impressive if the model is predicting a rare event (like a disease) that only occurs 1% of the time, and the model simply predicts "no disease" every time.
The percentage of misclassifications is given by the error rate: Error Rate = 1 - Accuracy.
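A minimal sketch of this pitfall, assuming scikit-learn; the 1%-positive labels and the "always predict negative" model are made up for illustration.

```python
# Sketch: accuracy looks excellent on imbalanced data even for a useless model.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)  # 1% positives (e.g., disease cases)
y_pred = np.zeros_like(y_true)           # a model that always predicts "no disease"

acc = accuracy_score(y_true, y_pred)
print("accuracy:  ", acc)                             # 0.99, looks impressive
print("error rate:", 1 - acc)                         # 0.01
print("recall:    ", recall_score(y_true, y_pred))    # 0.0: every positive case is missed
```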
Propose appropriate metrics for an imbalanced dataset.
The following metrics are proposed for imbalanced datasets:
Confusion Matrix and Key Metrics
Confusion Matrix provides a detailed breakdown of correct and incorrect predictions, categorizing them into True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). This granular view helps identify specific types of errors the model is making.
The performance of a classification model can be calculated using a confusion matrix to derive the following key metrics (a worked sketch follows the list):
- Precision (P): The proportion of positive predictions that were actually correct (truly positive). A model with higher precision is generally perceived to be more reliable in predicting the class of interest.
- It answers: "Of all the times the model predicted 'Positive,' how often was it right?"
- Recall (or Sensitivity): The proportion of actual positive cases that the model correctly identified, i.e., True Positives over all actual positives.
- Recall indicates what proportion of the total positives were predicted correctly.
- It answers: "Of all the actual 'Positive' cases, how many did the model find?"
- Specificity: Measures the proportion of actual negative cases that were correctly classified.
- It answers: "Of all the actual 'Negative' cases, how many did the model correctly identify?"
- F1-Score (F-Measure): The harmonic mean of Precision and Recall. It provides a single score that balances both metrics, which is useful when there is an uneven class distribution.
Kappa Value: This metric adjusts the accuracy score to account for the possibility of a correct prediction (both TP and TN) happening by mere chance. A value of 1 represents perfect agreement, while 0 indicates agreement equivalent to chance.
- P(observed): the proportion of observed agreement between the actual and predicted labels over the whole data set.
- P(expected): the proportion of agreement expected by chance between the actual and predicted labels, for the class of interest as well as the other classes.
- Kappa = (P(observed) - P(expected)) / (1 - P(expected)).
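The sketch below works through these metrics in plain Python; the TP/FP/FN/TN counts are hypothetical numbers chosen for illustration.

```python
# Sketch: deriving the key metrics from confusion-matrix counts (counts are hypothetical).
TP, FP, FN, TN = 40, 10, 5, 945
total = TP + FP + FN + TN

precision   = TP / (TP + FP)                 # how often a "positive" prediction is right
recall      = TP / (TP + FN)                 # sensitivity: share of actual positives found
specificity = TN / (TN + FP)                 # share of actual negatives correctly identified
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / total

# Cohen's kappa: observed agreement (accuracy) corrected for chance agreement.
p_observed = accuracy
p_expected = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / total**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f} "
      f"F1={f1:.3f} accuracy={accuracy:.3f} kappa={kappa:.3f}")
```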
Visualization: ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graph that visualizes the performance of a classification model.
It plots the True Positive Rate (Recall) against the False Positive Rate at various classification thresholds, illustrating the trade-off between detecting true positives and avoiding false positives.
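A minimal sketch of producing the curve points with scikit-learn; the breast-cancer dataset and the scaled logistic-regression model are illustrative assumptions.

```python
# Sketch: computing ROC points (FPR, TPR per threshold) and the area under the curve.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))      # area under the ROC curve
```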
2. Supervised Learning: Regression
In regression, the goal is to check if the predicted numerical values are close to the actual observed values. A good regression model should always perform significantly better than a "mean model" (a basic model that just predicts the average of all target values).
Mean Squared Error (MSE): Average of the squared differences between the predicted and actual values.
R-Squared (R²): Known as the coefficient of determination. It measures the proportion of the variance in the target variable that is predictable from the input features.
Checking for Overfitting: If the MSE on the test data is substantially higher than the MSE on the training data, there is overfitting.
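A minimal sketch of these checks, assuming scikit-learn; the diabetes dataset and the plain linear regression are illustrative choices.

```python
# Sketch: regression evaluation with MSE, R-squared, a mean-model baseline,
# and a train-vs-test comparison to check for overfitting.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)

train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))
print("train MSE:", train_mse, "test MSE:", test_mse)  # a much higher test MSE suggests overfitting
print("test R^2: ", r2_score(y_te, model.predict(X_te)))

# Baseline "mean model": always predict the training-set average.
baseline_mse = mean_squared_error(y_te, [y_tr.mean()] * len(y_te))
print("mean-model MSE:", baseline_mse)                 # a useful model should beat this
```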
3. Unsupervised Learning: Clustering
Evaluating clustering (using so-called validity indices) is more subjective, as there are no "correct" labels.
A good clustering result has high intra-cluster similarity (samples within the same cluster are very similar) and low inter-cluster similarity (samples in different clusters are very different).
Silhouette Coefficient: An internal evaluation method. It calculates a score for each sample based on its similarity to its own cluster versus its similarity to the next-nearest cluster. A high average silhouette width indicates good clustering.
Purity: A measure used when external "ground truth" labels are known (e.g., for benchmarking). It measures the extent to which each cluster contains samples from a single class.
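A minimal sketch of both measures, assuming scikit-learn; the iris dataset, k-means with k=3, and the hand-rolled purity calculation are illustrative assumptions.

```python
# Sketch: internal (silhouette) and external (purity) evaluation of a clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, y_true = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))  # closer to 1 means tight, well-separated clusters

# Purity: for each cluster, count its most common true class, then divide by the total.
majority_counts = [np.bincount(y_true[labels == c]).max() for c in np.unique(labels)]
print("purity:", sum(majority_counts) / len(y_true))
```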
Improving Model Performance
This section covers procedures for enhancing a model's performance after the initial phases of selection, training, and evaluation are complete.
Before focusing on performance improvement, the initial model selection must be finalized based on several aspects:
- The type of learning task (e.g., supervised, unsupervised).
- The nature of the data (e.g., categorical, numerical).
- The specific problem domain.
The primary objective is to understand the techniques available to boost model effectiveness, primarily by adjusting its configurable parameters or by combining it with other models.
Primary Avenues for Performance Improvement
Once a baseline model is established, there are two primary avenues for improving its performance: adjusting the model's internal settings (Hyperparameter Tuning) or combining it with other models (Ensemble Learning).
1. Hyperparameter Tuning (Model Parameter Tuning)
This is the process of adjusting the model's "fitting options" known as hyperparameters. These are the user-configurable parameters of the learning algorithm, as opposed to the internal model parameters (like weights) that are learned from the data. Nearly all machine learning models have at least one hyperparameter that can be tuned.
- The goal of tuning is to find the optimal combination of hyperparameter values that achieves the best balance between bias and variance, leading to the best generalization on unseen data.
Examples:
k-Nearest Neighbour (kNN): The hyperparameter 'k' (the number of neighbors to consider) can be adjusted to tune performance and trade off bias against variance (see the sketch after these examples).
- A small k can lead to high variance (overfitting), while a large k can lead to high bias (underfitting).
Neural Networks: Key hyperparameters include the number of hidden layers, the number of neurons per layer, the learning rate, and the activation function.
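A minimal sketch of tuning k for kNN with a cross-validated grid search, assuming scikit-learn; the dataset and the candidate k values are illustrative choices.

```python
# Sketch: hyperparameter tuning for kNN, searching over k with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Small k -> flexible, high-variance model; large k -> smoother, high-bias model.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 11, 21, 51]},
                    cv=5)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"])
print("cross-validated accuracy:", grid.best_score_)
```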
2. Ensemble Learning
Ensemble learning is an alternative approach that increases performance by combining the predictions of several individual models. The technique functions best when the combined models are complementary. This means the models are diverse and make different kinds of errors; the weakness of one model is offset by the strength of another.
- Ensemble methods effectively combine multiple "weaker learners" (models that perform slightly better than random guessing) to create a single, robust "stronger learner."
Benefits:
Helps in averaging out the biases of the different underlying models.
Aids in reducing the variance, making the final model more stable and less sensitive to the specific training data.
Often yields a significant performance boost, even if the component models are built conventionally.
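A minimal sketch of combining complementary learners by majority vote, assuming scikit-learn; the three base models and the dataset are illustrative choices, not prescribed by these notes.

```python
# Sketch: a voting ensemble of three diverse base learners, compared with each learner alone.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base_models = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
]
ensemble = VotingClassifier(estimators=base_models)  # hard (majority) voting by default

for name, model in base_models + [("ensemble", ensemble)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```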
Analogy
Improving model performance is like optimizing a car for a race.
Hyperparameter Tuning is like adjusting the individual parts of a single engine (e.g., spark plug gap, fuel injection timing) to get the best speed and stability.
Ensemble Learning is like combining the best features of several different engines or running multiple specialized cars (one for corners, one for straights) and blending their results to get the best overall race time.
