Supervised Learning

Supervised learning (SL) is defined by its requirement for labeled training data, which serves as the experience (E) for the specific task a machine has to execute. (Experience as Labeled Data)

Guided Learning: The process is "supervised" because the algorithm learns from a training set with labels, which is analogous to a teacher providing correct answers.

The fundamental goal of supervised learning is to build a statistical model that learns a mapping function from inputs to outputs.

Labelled training data -> Supervised learning -> Prediction Model -> Test Data -> Prediction

Examples of supervised learning are:

  • Predicting the results of a game
  • Predicting whether a tumour is malignant or benign
  • Predicting prices in domains like real estate, stocks, etc.
  • Classifying texts such as classifying a set of emails as spam or non-spam

To solve an image segregation problem, the training data would contain features of many images, with each image having a label (e.g., "round" or "triangular"). Based on this, the machine builds a predictive model to assign labels to new, unlabeled test data.

Input Data

The "experience" is provided as a training set of NN examples where xx is the input feature vector and yy is the known output label.

\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}

Mathematical Mapping

The algorithm seeks to learn a relationship, or mapping function, from the input space (X, the predictor features) to the output space (Y, the target feature).

f: X \rightarrow Y
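As a purely illustrative sketch (the data and the trivial learner below are made up, not part of the source material), the labelled training pairs and a learned mapping f can be written directly in Python:

```python
# Hypothetical labelled training set: each example pairs a feature vector x with a label y.
training_set = [
    ([2.0, 1.0], "benign"),
    ([2.2, 0.9], "benign"),
    ([7.5, 6.1], "malignant"),
    ([8.0, 5.8], "malignant"),
]

def fit_nearest_mean(examples):
    """Learn a toy mapping f: X -> Y by storing the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in examples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    means = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def f(x):
        # Predict the class whose mean feature vector is closest (squared Euclidean distance).
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))

    return f

f = fit_nearest_mean(training_set)
print(f([7.8, 6.0]))  # -> "malignant" for this toy data
```

A real algorithm (decision tree, kNN, SVM, etc.) replaces this toy nearest-class-mean rule, but the interface is the same: learn f from labelled pairs, then apply f to unseen inputs.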

Sub-Types of Supervised Learning

Supervised learning problems are categorized based on the nature of their output variable (Y).

A. Classification

Classification is used when the output variable (target feature) for an unseen instance is categorical or nominal (i.e., it belongs to a finite set of discrete classes or labels). The target categorical feature is known as the class.

  • Binary Classification: The output has only two possible classes (e.g., "Spam" / "Not Spam," "Malignant" / "Benign").

  • Multiclass Classification: The output has more than two classes (e.g., handwritten digit recognition (0-9), "Round" / "Triangular" / "Square").

Examples: Email spam filtering, image classification, fraud detection, medical diagnosis.

Labelled training data -> Classifier -> Classification Model -> Test Data -> Class Label
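As a small hedged illustration (the feature vectors and labels below are invented, and scikit-learn is assumed to be available), the same classifier interface covers both binary and multiclass targets:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 1], [5, 6], [6, 5], [9, 0], [8, 1]]

# Binary classification: exactly two possible class labels.
y_binary = ["spam", "spam", "not spam", "not spam", "spam", "spam"]

# Multiclass classification: more than two possible class labels.
y_multi = ["round", "round", "triangular", "triangular", "square", "square"]

for y in (y_binary, y_multi):
    model = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(model.predict([[5, 5]]))  # the output is always a discrete class label
```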

B. Regression

Regression is used when the output variable is continuous (i.e., a numerical, real-valued quantity) rather than a class. The objective is to predict a numerical feature or real-valued output.

Examples: Predicting real estate prices, stock market values, temperature, or sales revenue.

Regression problem

A sales manager predicts next year's sales revenue (a continuous value) based on the previous year's sales and investment figures (also continuous values).
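A minimal sketch of this sales scenario, assuming scikit-learn is available and using made-up figures (the numbers below are not from the source):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: [previous year's sales, investment] -> next year's sales revenue.
X = [[100, 10], [150, 12], [200, 18], [250, 20]]   # predictor features (continuous)
y = [120, 170, 230, 280]                           # target feature (continuous)

model = LinearRegression().fit(X, y)
print(model.predict([[220, 19]]))  # predicted revenue (a real-valued number, not a class)
```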

Regression is essentially finding a relationship or association between the dependent variable (Y) and the independent variable(s) (X), i.e., finding the function ‘f’ for the association Y = f(X).

Common Algorithms:

  • Linear Regression fits a straight-line relationship between the predictor and target variables using the least squares method.

  • Simple Linear Regression uses one predictor variable. The model is represented as y = a + bx, where ‘x’ is the predictor variable and ‘y’ is the target variable.

  • Multiple Linear Regression uses multiple predictor variables.

Slope of the linear regression model

The slope of a straight line represents how much the line changes in the vertical direction (Y-axis) over a change in the horizontal direction (X-axis).

slope = Change in Y/Change in X

Slope = \frac{Rise}{Run} = \frac{Y_2 - Y_1}{X_2 - X_1} = \frac{\Delta Y}{\Delta X}
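For example, with the made-up points (X_1, Y_1) = (1, 2) and (X_2, Y_2) = (3, 8):

Slope = \frac{8 - 2}{3 - 1} = \frac{6}{2} = 3

i.e., the line rises 3 units in Y for every 1 unit increase in X.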

In simple linear regression, the line is drawn using the regression formula.

Y = (a + bX) + e, where ‘e’ is the error (residual) term.

b = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}

a = \bar{Y} - b\bar{X}
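A minimal sketch of these least-squares formulas in Python (the data points are made up):

```python
# Hypothetical data that follows y = 2x + 1 exactly.
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]

x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)

# Slope: b = sum((Xi - X_bar)(Yi - Y_bar)) / sum((Xi - X_bar)^2)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)

# Intercept: a = Y_bar - b * X_bar
a = y_bar - b * x_bar

print(a, b)        # -> 1.0 2.0 for this data
print(a + b * 6)   # predicted y for x = 6 -> 13.0
```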

Steps in the Supervised Learning Workflow

  1. Problem Identification: Define the business problem and the goal of the model.

  2. Data identification and Collection: Identify and gather the required data, ensuring it accurately represents the problem.

  3. Define Training/Test Sets: Decide on the data configuration, typically splitting the data into training and testing sets.

  4. Data Pre-processing: Clean, transform, and prepare the data (e.g., handling missing values, feature scaling).

  5. Definition of Training Data Set: Before starting the analysis, the user should decide what kind of data set is to be used as a training set.

  6. Algorithm Selection: Choose a suitable learning algorithm (e.g., Decision Tree, SVM). This is often considered the most critical step.

  7. Model Training: Run the algorithm on the training data. This may involve tuning hyperparameters (control parameters) to optimize the model.

  8. Model Evaluation: Measure the trained model's performance on the unseen test data. If results are unsatisfactory, return to step 5 or 6.
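A hedged end-to-end sketch of this workflow, assuming scikit-learn is available and using made-up data (the dataset, algorithm choice, and hyperparameter below are illustrative only):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 2-4: identified (hypothetical) data, cleaned and ready for splitting.
X = [[1, 0], [2, 1], [3, 1], [2, 0], [8, 7], [9, 8], [10, 9], [9, 9]]
y = ["benign", "benign", "benign", "benign",
     "malignant", "malignant", "malignant", "malignant"]

# Steps 3/5: define the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 6-7: select an algorithm and train it (hyperparameters such as max_depth can be tuned).
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Step 8: evaluate on the unseen test data.
print(accuracy_score(y_test, model.predict(X_test)))
```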

Common Supervised Classification Learning Algorithms

i. Decision Tree

  • Makes decisions using a tree-like structure. The learning process splits the data based on features that provide the most information, often measured using concepts like Information Gain or Entropy.

  • Decision trees are generally considered non-parametric models, meaning the number of parameters grows with the size of the training data.

A decision tree consists of three types of nodes: the root node, branch nodes, and leaf nodes.

Building a decision tree

Decision trees are built corresponding to the training data following an approach called recursive partitioning.

The approach splits the data into multiple subsets on the basis of the feature values. It starts from the root node, which is nothing but the entire data set.

It first selects the feature which predicts the target class in the strongest way. The decision tree then splits the data set into multiple partitions, where the data in each partition has a distinct value of the feature on which the partitioning was done. This forms the first set of branches.

Likewise, the algorithm continues splitting the nodes on the basis of the feature that gives the best partition. This continues until a stopping criterion is reached.

The usual stopping criteria are:

  1. All or most of the examples at a particular node have the same class
  2. All features have been used up in the partitioning
  3. The tree has grown to a pre-defined threshold limit
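To make "selects the feature which predicts the target class in the strongest way" concrete, here is a small sketch of entropy and information gain on a made-up dataset (the feature names and counts are hypothetical, and real implementations add many refinements):

```python
import math
from collections import Counter

# Hypothetical training data: two predictor features and a target class "play".
data = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
]

def entropy(rows):
    """Entropy of the class distribution in a set of rows."""
    counts = Counter(r["play"] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature):
    """Entropy before splitting minus the weighted entropy of the partitions after splitting."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[feature], []).append(r)
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(rows) - after

for feature in ("outlook", "windy"):
    print(feature, information_gain(data, feature))
# The feature with the highest gain ("outlook" here) would be chosen for the root-node split,
# and the same procedure is then applied recursively to each partition.
```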

ii. k-Nearest Neighbour (kNN)

  • Philosophy: "An instance is likely to be similar to its nearby neighbors."
  • Class label of the unknown element is assigned on the basis of the class labels of the similar training data set elements.

But there are two challenges:

  1. What is the basis of this similarity or when can we say that two data elements are similar?
  2. How many similar elements should be considered for deciding the class label of each test data element?

kNN adopts a distance metric between data elements (e.g., Euclidean distance) and classifies a new data point by a majority vote of its k nearest neighbours in the training data.

The value of ‘k’ indicates the number of neighbours that need to be considered.

  • kNN is a classic example of instance-based learning, or lazy learning, with fast training times but slower prediction times.
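A minimal from-scratch sketch of kNN (the training points and the choice k = 3 below are made up):

```python
import math
from collections import Counter

# Hypothetical stored training data: (feature vector, class label) pairs.
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([0.9, 1.1], "A"),
         ([5.0, 5.0], "B"), ([5.2, 4.8], "B"), ([4.9, 5.1], "B")]

def knn_predict(x, train, k=3):
    # Lazy learning: no model is built; the stored examples are ranked by Euclidean distance.
    neighbours = sorted(train, key=lambda ex: math.dist(x, ex[0]))[:k]
    # Majority vote among the k nearest neighbours decides the class label.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict([1.1, 0.9], train))  # -> "A"
```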

Learning Type: Lazy Learner

Lazy learner algorithms completely skip the abstraction and generalization processes and use the training data "as-is" for classification; they simply store the entire training dataset.

They perform no abstraction during training, so they require very little time for training.

They do the work of generalization only at the time of classification, so they take more time to predict new instances.

Strengths of the kNN algorithm

  • Extremely simple algorithm: easy to understand

  • Very effective in certain situations, e.g., in recommender system design to find similar items, or in searching for documents with similar content (information retrieval).

  • Very fast: almost no time is required for the training phase.

Weaknesses of the kNN algorithm

  • Does not learn anything in the real sense. Classification is done completely on the basis of the training data. So, it has a heavy reliance on the training data. If the training data does not represent the problem domain comprehensively, the algorithm fails to make an effective classification.

  • Because no model is trained in the real sense and classification is done entirely on the basis of the training data, the classification (prediction) process is very slow.

  • A large amount of memory is required to load the training data for classification.

iii. Support Vector Machines (SVM)

SVM is a model that can perform linear classification as well as regression. SVM is based on the concept of a surface, called a hyperplane, which draws a boundary between data instances plotted in the multi-dimensional feature space.

The output prediction of an SVM is one of the two possible classes already defined in the training data. The SVM algorithm builds an N-dimensional hyperplane model that assigns future instances to one of the two possible output classes.

  • Classifier that finds a separating hyperplane (a decision boundary) in a high-dimensional space.

  • Goal is to find the hyperplane that maximizes the margin (the distance) between the closest data points (the support vectors) of the different classes, which promotes better generalization.

Eager Learners

These algorithms construct a generalized model from the training data during the training phase. The original training data can then be discarded (e.g., SVM, Decision Trees).

  • Support vectors are the data points (representing the classes) that lie nearest to the identified hyperplane; they are the critical elements of the data set. If the support vectors are removed, the position of the dividing hyperplane will change.

  • Hyperplane: For an N-dimensional feature space, a hyperplane is a flat subspace of dimension (N−1) that separates and classifies a set of data.

  • For a three-dimensional feature space (a data set having three features and a class variable), the hyperplane is a two-dimensional subspace, i.e., a simple plane.

  • Margin: The distance between the hyperplane and the nearest data points is known as the margin. If a hyperplane has a smaller margin (distance), there is a higher probability of misclassification.

When a new test data point is presented, the side of the hyperplane it lands on decides the class that we assign to it.

A hyperplane should:

  1. Segregate the data instances belonging to the two classes in the best possible way.
  2. It should maximize the distances between the nearest data points of both the classes, i.e. maximize the margin.
  3. If there is a trade-off between a higher margin and fewer misclassifications, the hyperplane should prioritize reducing misclassifications.
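A hedged sketch of fitting a linear SVM, assuming scikit-learn is available and using made-up two-class data:

```python
from sklearn.svm import SVC

# Hypothetical, linearly separable two-class data.
X = [[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel looks for the separating hyperplane with the maximum margin.
model = SVC(kernel="linear", C=1.0).fit(X, y)

print(model.support_vectors_)            # the points lying closest to the hyperplane
print(model.predict([[3, 3], [7, 6]]))   # the side of the hyperplane decides the class
```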

Weaknesses:

  • SVM is applicable only for binary classification, i.e. when there are only two classes in the problem domain.
  • The SVM model is very complex – almost like a black box when it deals with a high-dimensional data set.
  • It is slow for a large dataset and quite memory-intensive.

iv. Naïve Bayes Classifier

  • A probabilistic classifier based on Bayes' Theorem of conditional probability.

  • It operates on the "naïve" (and simplifying) assumption that all predictor features are conditionally independent of one another given the class. Despite this simplicity, it is often very effective.
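A small illustrative sketch of the naive conditional-independence assumption for spam filtering (the word counts, priors, and smoothing choice below are invented for illustration):

```python
# Hypothetical per-class word counts learned from a labelled email corpus.
word_counts = {
    "spam":     {"offer": 30, "winner": 20, "meeting": 2},
    "not spam": {"offer": 3,  "winner": 1,  "meeting": 40},
}
class_priors = {"spam": 0.4, "not spam": 0.6}

def predict(words):
    scores = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        score = class_priors[c]  # P(class)
        for w in words:
            # Naive assumption: P(w1, w2, ... | class) = P(w1 | class) * P(w2 | class) * ...
            score *= (counts.get(w, 0) + 1) / (total + len(counts))  # with Laplace smoothing
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(["offer", "winner"]))  # -> "spam"
print(predict(["meeting"]))          # -> "not spam"
```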

v. Ensemble Learning (e.g., Random Forest)

  • An ensemble method that combines multiple "weaker" individual models to create a single, "stronger" prediction model. This averages out biases and reduces variance.

  • Random Forest: A popular ensemble model that operates by constructing a multitude of Decision Tree classifiers during training.
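A hedged sketch using scikit-learn's RandomForestClassifier on made-up data (the dataset and parameters are illustrative only):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data.
X = [[1, 0], [2, 1], [3, 1], [8, 7], [9, 8], [10, 9]]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

# n_estimators controls how many individual decision trees are combined into the ensemble;
# each tree sees a bootstrap sample of the data and a random subset of features.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[2, 0], [9, 9]]))  # the trees' votes are aggregated into one prediction
```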