Modelling and Evaluation
A machine learning algorithm builds its cognitive capability by developing a mathematical formulation, or target function, based on the features in the input data. This process is not fully autonomous.
Just as a child learning a new skill needs guidance, an algorithm requires human-provided, non-learnable parameters known as hyperparameters. These parameters (e.g., the 'k' in k-Nearest Neighbors) are essential for guiding the model and ensuring its success.
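A minimal sketch of what this looks like in practice, assuming scikit-learn and its bundled Iris dataset (the candidate values of k tried here are arbitrary):

```python
# A minimal sketch of hyperparameter selection, assuming scikit-learn
# and its bundled Iris dataset; the values of k tried are arbitrary.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 'k' (n_neighbors) is a hyperparameter: the algorithm cannot learn it
# from the data, so a human must choose and tune it.
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k}: mean cross-validated accuracy = {score:.3f}")
```

Nothing in the data tells the algorithm which k to use; the human supplies candidate values and compares the resulting models.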
Selecting a Model
Modelling is the process of transforming raw input data into a structured, meaningful pattern, represented by a "target function" appropriate for the machine learning task.
The first major challenge in a machine learning project is selecting the appropriate model. This is often called the "first dilemma" of machine learning.
The "No Free Lunch" Theorem
Justify why no single model works best for all problems.
This selection process is a dilemma because of the "No Free Lunch" (NFL) theorem, which states that no single model works best for every machine learning problem.
Assumptions: Every learning model simplifies the real world based on a set of underlying assumptions. A model's effectiveness depends entirely on whether its assumptions hold true for your specific problem and data characteristics.
Consequence of Assumptions: A model that produces remarkable results in one situation (e.g., on image data) might fail completely in another (e.g., on text data).
This is why thorough data exploration is a critical prerequisite: the data's characteristics must be understood before choosing a model whose assumptions align with them and with the specific problem.
The broadest division in model selection is based on the primary learning objective: predictive (Supervised Learning) or descriptive (Unsupervised Learning).
Model Selection: Predictive vs. Descriptive
Compare predictive models and descriptive models with suitable reasoning.
| Feature | Predictive Model | Descriptive Model |
|---|---|---|
| Learning Type | Supervised Learning | Unsupervised Learning |
| Primary Goal | Prediction: Forecast a specific value or class. | Discovery: Find hidden patterns, groups, or insights. |
| Target Feature | Yes. The model maps inputs (X) to a known output (Y). | No. The model analyzes all input features (X) to find structures. |
| Example Tasks | Classification, Regression. | Clustering, Association Analysis. |
| Example Question | "What will the stock price be tomorrow?" | "What natural customer segments exist in my data?" |
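To make the "Target Feature" row concrete, here is a minimal sketch, assuming scikit-learn and its Iris dataset purely for illustration: the supervised model is fitted on both X and y, while the unsupervised model sees X alone.

```python
# A minimal sketch contrasting the two model families on the same data,
# assuming scikit-learn's Iris dataset purely for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Predictive (supervised): the model maps inputs X to a known target y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Descriptive (unsupervised): no target feature; the model looks for
# structure (here, three clusters) in X alone.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster label:", km.labels_[0])
```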
1. Predictive Models (Supervised Learning)
Predictive models are the foundation of supervised learning. Their primary objective is to predict a specific target value from an input dataset, i.e., to infer how a target feature (the output) changes based on one or more predictor features (the inputs). For example:
- Predicting criminal incidents (target) based on average income and population density (predictors).
Sub-types of Predictive Models:
Classification: The target feature is categorical (e.g., "Malignant" or "Benign").
Regression: The target feature is numerical (e.g., "Price" or "Temperature").
Some advanced models, such as Support Vector Machines (SVM) and Neural Networks, are versatile enough to be used for both classification and regression tasks.
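As a minimal sketch of this versatility, assuming scikit-learn's SVC/SVR classes and synthetic toy data:

```python
# A minimal sketch of one model family handling both task types,
# assuming scikit-learn's SVC/SVR and synthetic toy data.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Classification: the target is categorical (two classes here).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = SVC().fit(X, y_class)
print("Predicted class:", clf.predict([[0.5, 0.5]]))

# Regression: the target is a continuous number.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = SVR().fit(X, y_reg)
print("Predicted value:", reg.predict([[0.5, 0.5]]))
```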
Critically evaluate different model performance metrics for classification and regression.
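A minimal sketch of the standard metrics, assuming scikit-learn and hard-coded toy predictions purely to show the API:

```python
# A minimal sketch of common evaluation metrics, assuming scikit-learn
# and hard-coded toy predictions purely to show the API.
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification: compare predicted classes against true classes.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are right
print("recall   :", recall_score(y_true, y_pred))     # of actual positives, how many were found
print("f1       :", f1_score(y_true, y_pred))

# Regression: measure how far predictions fall from true numeric values.
y_true_r = [2.5, 0.0, 2.1, 7.8]
y_pred_r = [3.0, -0.1, 2.0, 7.5]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))
```

No single metric is sufficient: accuracy can mislead on imbalanced classes (hence precision, recall, and F1), while MSE penalizes large errors heavily and R² reports the fraction of variance explained.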
2. Descriptive Models (Unsupervised Learning)
Descriptive models are the foundation of unsupervised learning. Their primary objective is to discover hidden patterns or gain insights from a dataset, i.e., to describe the data by finding natural groupings or associations.
These models operate with no target feature or single feature of interest. They derive patterns based on the values and relationships of all features together.
Sub-types: A key example is clustering models, which group together data instances whose values are similar across their features.
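A minimal sketch of such a grouping, assuming scikit-learn's KMeans and made-up customer data (annual income, spending score):

```python
# A minimal sketch of a clustering model, assuming scikit-learn's KMeans
# and made-up customer data (annual income, spending score).
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [15, 80], [16, 85], [17, 78],   # low income, high spending
    [70, 20], [75, 15], [72, 25],   # high income, low spending
    [40, 50], [42, 55], [45, 48],   # mid income, mid spending
])

# No target feature: the model groups rows whose feature values are similar.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("Segment of each customer:", km.labels_)
print("Segment centers:\n", km.cluster_centers_)
```

Note that no target column is supplied; the segments emerge purely from similarity between rows.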
Criteria involved in selecting a Machine Learning model.
There is no single model that works best for every problem, a concept known as the "No Free Lunch" theorem.
The primary criteria for selecting a model include:
1. Type of Problem
What question are you trying to answer? The nature of the learning task is the most fundamental criterion:
Predictive vs. Descriptive: Determine if the goal is to predict a value based on input data (Supervised Learning) or to find patterns and groupings within the data (Unsupervised Learning).
Classification vs. Regression:
Classification: Are you predicting a class? (e.g., Is this tumor malignant or benign?).
- For predictive tasks, if the target variable is categorical, a classification model like Naïve Bayes or Decision Tree is suitable.
Regression: Are you predicting a number? (e.g., What is the expected stock price?).
- If the target is a continuous numerical value, a regression model like Linear Regression is required.
Clustering vs. Association:
Clustering: Are you looking for groups? (e.g., What are my customer segments?).
- For descriptive tasks involving grouping similar objects without labeled data, clustering models like k-Means are used.
Association: Are you looking for relationships? (e.g., What items are bought together?) See the sketch after this list.
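A minimal sketch of the association idea, using only the standard library and a made-up list of market-basket transactions (full algorithms such as Apriori refine this counting):

```python
# A minimal sketch of association analysis via pairwise co-occurrence
# counting, assuming a made-up list of market-basket transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "diapers"},
    {"bread", "butter"},
    {"beer", "diapers", "milk"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```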
2. Nature of Input Data
The characteristics of the dataset play a significant role in model suitability:
What are the data types (categorical, numerical, text, image)?
How much data is available? (Some models, like deep learning, require massive datasets).
What is the data quality? (Are there many missing values? Are there outliers?).
Data Size: The volume of training data influences model choice.
- Small Datasets: Models with low variance (e.g., Naïve Bayes) are often preferred to avoid overfitting.
- Large Datasets: Models with low bias (e.g., Decision Trees or k-Nearest Neighbors) are better suited, as they can represent complex relationships effectively (see the sketch after this list).
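One way to probe this trade-off, sketched below assuming scikit-learn (the models and sample sizes are illustrative only), is to compare a low-variance model against a flexible, low-bias model at different training sizes:

```python
# A minimal sketch of how data volume interacts with model flexibility,
# assuming scikit-learn; models and sizes here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (50, 2000):  # small vs. large training set
    for model in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
        score = cross_val_score(model, X[:n], y[:n], cv=5).mean()
        print(f"n={n:4d} {type(model).__name__:22s} accuracy={score:.3f}")
```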
3. Problem Domain and Business Requirements
The specific context of the problem dictates certain requirements:
What are the operational needs? Does the model need to be extremely fast, or is batch processing acceptable?
- Inference Speed: Applications such as fraud detection in banking may require models that can provide immediate inferences on high-velocity data.
Does the model's decision need to be explained?
- Interpretability: In some fields, it is crucial to understand how a model reached a decision, potentially favoring simpler models over complex "black box" algorithms.
- A bank must explain why a loan was denied, making a "black box" model like a neural network difficult to use.
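A minimal sketch of the contrast, assuming scikit-learn: a shallow decision tree's learned rules can be printed and traced, which is exactly what a "black box" model cannot offer.

```python
# A minimal sketch of an interpretable model, assuming scikit-learn:
# a decision tree's learned rules can be printed and explained, unlike
# the internals of a typical neural network.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Human-readable if/else rules: each decision can be traced and justified.
print(export_text(tree, feature_names=list(iris.feature_names)))
```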
4. Model Assumptions and Complexity
Explain how model complexity affects generalization.
Every model relies on simplifications and assumptions about the data.
Matching Characteristics: The selected model's assumptions must align with the data's characteristics. A model might perform well in one scenario but fail in another if the data violates its underlying assumptions.
Bias-Variance Trade-off: The choice involves balancing the model's ability to learn complex patterns (low bias) against its sensitivity to fluctuations in the training set (low variance). A model that is too simple underfits and misses real patterns, while one that is too complex overfits the training data and generalizes poorly to unseen data, as the sketch below illustrates.
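A minimal sketch of this trade-off, assuming scikit-learn and synthetic data (the polynomial degrees are chosen purely for illustration):

```python
# A minimal sketch of the complexity/generalization trade-off, assuming
# scikit-learn and synthetic data; degrees chosen purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # A large gap between train and test score signals overfitting
    # (high variance); low scores on both signal underfitting (high bias).
    print(f"degree={degree:2d} train R^2={model.score(X_tr, y_tr):.2f} "
          f"test R^2={model.score(X_te, y_te):.2f}")
```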
