
Data Structures and Attribute Types in Machine Learning

Understanding the structure of input data is a fundamental concept in machine learning. This begins with defining the components of a dataset and the various types of data (attributes) encountered.

Dataset Fundamentals

A data set is a collection of related information or records, typically pertaining to a specific entity or subject area (e.g., student records, product sales).

  • Record: Each row in a data set is a record (also called an instance, observation, or sample).

  • Attribute: Each column represents an attribute, which provides information on a specific characteristic of the record. Attributes are also commonly referred to as features, variables, dimensions, or fields.

Data attributes are broadly classified in two primary ways:

  1. Based on their measurement scale (Qualitative vs. Quantitative).

  2. Based on the number of values they can assume (Discrete vs. Continuous).


Classification 1: By Measurement Scale (Data Types)

This classification is based on the nature and properties of the values.

Hierarchy of Attribute Data Types

Qualitative (Categorical)

  • Nominal
  • Ordinal

Quantitative (Numerical)

  • Interval
  • Ratio

A. Qualitative (Categorical) Data

Qualitative data, also known as categorical data, describes a quality or characteristic. It provides information that cannot be measured on a numerical scale.

Further subdivided based on whether the data can be logically ordered:

  1. Nominal Data: This data consists of named values (labels) that cannot be logically ordered, ranked, or quantified. Mathematical operations (addition, subtraction, etc.) cannot be performed on nominal data.

    • Blood Type (A, B, AB, O), Gender (Male, Female), Zip Code, Student Name.
  2. Ordinal Data: This data consists of named values that can be naturally ordered or ranked. The precise difference between values is not defined, but we can determine whether one value is greater or better than another.

    • Student Grades (A, B, C, D), Performance Quality (Good, Average, Poor), Customer Rating (1-5 stars).

Dichotomous Data

A subtype of nominal data where only two labels are possible (e.g., Pass/Fail, Yes/No, True/False).

Measures of Central Tendency for Qualitative Data

  • Nominal Data: Since no ordering or math is possible, only the mode (most frequently occurring value) can be calculated.

  • Ordinal Data: Because the data can be counted and ordered, both the mode and the median (the middle value) can be identified, along with quartiles. The mean cannot be calculated.
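
A minimal sketch in pandas, using illustrative values and labels (not taken from any particular dataset), of how these measures are obtained for nominal and ordinal attributes:

```python
import pandas as pd

# Nominal attribute: only the mode (most frequent label) is meaningful.
blood_type = pd.Series(["A", "B", "O", "O", "AB", "O"])
print(blood_type.mode()[0])  # -> "O"

# Ordinal attribute: make the natural order explicit, then both the mode
# and the median (middle value in the ordering) can be identified.
grades = pd.Categorical(["B", "A", "C", "B", "D", "B", "C"],
                        categories=["D", "C", "B", "A"], ordered=True)
grades = pd.Series(grades)
print(grades.mode()[0])  # most frequent grade -> "B"

# Median: sort by the defined order and take the middle value.
ordered_grades = grades.sort_values().reset_index(drop=True)
print(ordered_grades.iloc[len(ordered_grades) // 2])  # middle grade -> "B"
```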

B. Quantitative (Numerical) Data

Quantitative data, also known as numerical data, represents information that can be measured on a numerical scale. It is subdivided based on the properties of the measurement scale:

  1. Interval Data: This is numerical data where the order is known and the difference between values is meaningful and consistent. However, interval data has no true zero point.

    • Temperature (Celsius or Fahrenheit), Dates, Time.
  2. Ratio Data: This is numerical data where the order is known, the difference between values is meaningful, and there is a true zero point. The "true zero" indicates the complete absence of the attribute.

    • Age, Salary, Weight, Height, Price.
    • Operations: All mathematical operations (addition, subtraction, multiplication, and division/ratios) are valid.

True Zero

Interval data lacks a "true zero".

0°C does not mean the absence of temperature; it is just another point on the scale. Therefore, while addition and subtraction are valid (e.g., 40°C = 20°C + 20°C), ratios are not. We cannot say 40°C is "twice as hot" as 20°C.

Classification 2: By Count (Value Types)

Attributes can also be categorized based on the number of values they can assume.

  • Discrete Attributes: These assume a finite or countably infinite number of values.
    • Examples: Roll number, Zip code, or student Rank.
  • Continuous Attributes: These can assume any real number value.
    • Examples: Length, Height, Weight, and Price.

| Attribute Type | Description | Relation to Measurement Scale | Examples |
| --- | --- | --- | --- |
| Discrete | Can assume a finite or countably infinite number of values. | Nominal and ordinal attributes are always discrete. | Nominal attributes, ordinal attributes, numerical counts (e.g., "number of children"). |
| Binary | A specialized discrete attribute that can assume only two values. | A specialized form of discrete data. | Male/Female, Positive/Negative, Yes/No. |
| Continuous | Can assume any real-number value within a given range. | Interval and ratio attributes are generally continuous, with some exceptions (e.g., a count is ratio data but discrete). | Length, Height, Weight, Price, Temperature. |

Importance of Data Typing and the Data Dictionary

It is crucial to correctly identify attributes as numeric or categorical because the approach used for data exploration, pre-processing, and modeling is fundamentally different for each type.

The Data Dictionary

In a standard project, a data dictionary should be available for reference. A data dictionary is a metadata repository that provides detailed information on each attribute, including:

  • Its description
  • Data type (e.g., nominal, ratio, integer, string)
  • Allowed values
  • Other relevant details

If a data dictionary is not available, standard library functions within a machine learning tool (e.g., info() in pandas) must be used to infer these details.
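
As a sketch, assuming a hypothetical file students.csv and pandas as the tool, these details can be inferred roughly as follows:

```python
import pandas as pd

# Hypothetical student dataset; the file name and columns are illustrative.
df = pd.read_csv("students.csv")

# info() lists every attribute with its inferred dtype and non-null count,
# acting as a rough substitute for a missing data dictionary.
df.info()

# dtypes can then separate numeric attributes from object/categorical ones,
# since the two families are explored and pre-processed differently.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns
print("Numeric:", list(numeric_cols))
print("Categorical:", list(categorical_cols))
```

Note that inferred dtypes are only a starting point: a zip code stored as an integer is still nominal, so the inferred types should be reviewed against domain knowledge.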

Common Data Repository

The UCI (University of California, Irvine) Machine Learning Repository is a well-known collection of over 400 datasets that serve as benchmarks for researchers and practitioners in the machine learning community.

(http://archive.ics.uci.edu/ml/index.php)

Preparing to Model

Data Remediation

Data quality issues need to be remediated if the learning activity is to be efficient and effective. Proper remedial steps must be taken, particularly for issues arising from human error, such as outliers and missing values.

Major Data Quality Issues Requiring Remediation

The need for remediation stems from common data problems encountered in the process of preparing to model:

  1. Missing Values: Certain data elements are without a value or possess a missing value.

  2. Outliers: Data elements have values that are surprisingly different from the other values in that attribute.

The primary remediation measures for outliers and missing values include (see the sketch after this list):

  1. Removal: Removing specific rows that contain outliers or missing values.

  2. Imputation (Standard Values): Imputing the missing value with a standard statistical measure for that attribute, such as the mean, median, or mode.

  3. Estimation (Similar Records): Estimating the missing value based on the values of that attribute in records considered similar, and then replacing the missing value with the estimated value.

    • For example, if the weight of a Russian student (age 12, height 5 ft.) is missing, the weight of another Russian student with a similar age and height could be assigned.
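
A minimal pandas sketch of the three measures, using a toy table whose columns and values are purely illustrative:

```python
import pandas as pd

# Toy records with one missing weight; all values are illustrative.
df = pd.DataFrame({
    "nationality": ["Russian", "Russian", "Indian", "Russian"],
    "age":         [12, 12, 14, 13],
    "height_ft":   [5.0, 5.1, 5.4, 4.9],
    "weight_kg":   [40.0, None, 52.0, 41.0],
})

# 1. Removal: drop rows that contain the missing value.
removed = df.dropna(subset=["weight_kg"])

# 2. Imputation: replace the missing value with a standard measure (here the mean).
imputed = df.copy()
imputed["weight_kg"] = imputed["weight_kg"].fillna(imputed["weight_kg"].mean())

# 3. Estimation from similar records: fill with the mean of records that share
#    the same nationality and age, rather than the global mean.
estimated = df.copy()
estimated["weight_kg"] = (
    estimated.groupby(["nationality", "age"])["weight_kg"]
    .transform(lambda s: s.fillna(s.mean()))
)
print(estimated)
```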

Data Pre-Processing

Data pre-processing consists of transformations applied to the identified data before it is fed into the learning algorithm. This step ensures that the data is prepared for modeling.

The main data pre-processing activities are:

  1. Dimensionality reduction.
  2. Feature subset selection.

1. Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of attributes or features in the data set.

  • High-dimensional data sets require a large amount of computational space and time.

  • Not all features are useful; some degrade the performance of machine learning algorithms. Most machine learning algorithms perform better if the dimensionality of the data set is reduced.

  • Dimensionality reduction helps in reducing irrelevance and redundancy in features.

  • It is easier to understand a model if the number of features involved in the learning activity is less.

Feature transformation is considered an effective tool for dimensionality reduction, which consequently helps in boosting learning model performance.

  • PCA (Principal Component Analysis) is noted as one technique that can be used for dimensionality reduction.
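
A short sketch of PCA with scikit-learn, assuming the library is available and using synthetic data purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data: 200 records, 50 features (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))

# PCA derives new, uncorrelated components from standardized features.
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print("Variance explained:", pca.explained_variance_ratio_.sum())
```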

2. Feature Subset Selection

Feature subset selection is the process of selecting a subset of attributes or features that make the most meaningful contribution to the machine learning activity.

  • This process is arguably the most critical pre-processing activity in any machine learning project. It serves as a method for dimensionality reduction.

The objective of feature selection is threefold:

  1. Achieving a faster and more cost-effective learning model (i.e., less need for computational resources).
  2. Improving the efficiency of the learning model.
  3. Gaining a better understanding of the underlying model that generated the data.

When predicting student weight, eliminating an irrelevant feature like "Roll Number" helps build a feature subset that is expected to give better results than the full set.
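
One way to sketch this idea is with a simple filter method from scikit-learn (SelectKBest is used here only as an example technique; the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data for predicting student weight; values are illustrative.
df = pd.DataFrame({
    "roll_number": [101, 102, 103, 104, 105, 106],
    "age":         [12, 13, 12, 14, 13, 12],
    "height_ft":   [4.9, 5.1, 5.0, 5.5, 5.2, 4.8],
    "weight_kg":   [40, 45, 42, 55, 48, 39],
})

X = df.drop(columns=["weight_kg"])
y = df["weight_kg"]

# Keep the k features most strongly related to the target; an identifier
# such as roll_number scores poorly in this toy example and is dropped.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))
```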

Questions

a) Explain different data types and data structures used in ML with examples.

b) Explain the role of data preprocessing in improving model performance.

c) Explain why data exploration is mandatory before preprocessing.

d) Analyse how improper data preprocessing can mislead a learning algorithm.

e) Discuss common data quality issues and how they affect learning.

Why data exploration is mandatory before preprocessing

The nature of the data must be understood to determine the appropriate preprocessing actions and modeling techniques.

  • Understanding Data Characteristics: A thorough review is necessary to understand the data types (numeric vs. categorical), data quality, and relationships between elements. This understanding dictates which preprocessing activities are required.

    • For example, the approach to exploring and processing numeric data differs significantly from that of categorical data (see the sketch after this list).
  • Model Selection: Models rely on simplifications and assumptions about real-world data. Data exploration helps identify whether the specific characteristics of a dataset meet these assumptions. Without this step, a model that works well in one situation might fail completely in another due to mismatched data characteristics.
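
A brief sketch (hypothetical column names and values) of how the two families of attributes are typically explored differently in pandas:

```python
import pandas as pd

# Hypothetical dataset; column names and values are illustrative.
df = pd.DataFrame({
    "height_ft": [5.0, 5.2, 4.9, 5.5, 5.1],
    "grade":     ["A", "B", "B", "C", "A"],
})

# Numeric attribute: summary statistics reveal range, spread, and potential outliers.
print(df["height_ft"].describe())

# Categorical attribute: frequency counts reveal the labels and their balance.
print(df["grade"].value_counts())
```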

Common data quality issues and how they affect learning

The success of machine learning depends largely on data quality; poor-quality data leads to imprecise predictions.

Common Issues:

  • Missing Values: Data elements without values, often caused by omission during collection or non-response in surveys.
  • Outliers: Data elements with values surprisingly different (abnormally high or low) from others.

Causes:

  • Incorrect Sample Selection: Data may not reflect normal conditions if the sample is drawn from a non-representative time period or segment.
  • Data Collection Errors: Manual errors in recording values or units of measurement can create outliers.

Effects on Learning:

  • Degraded Accuracy: If training data is of poor quality, the resulting predictions will lack precision.
  • Exponential Impact: In situations with small training datasets, the negative impact of bad data is exponentially worse.
  • Skewed Predictions: Outliers can specifically impact prediction accuracy in regression models.

How improper data preprocessing can mislead a learning algorithm

Improper handling of data during the preprocessing phase can introduce bias or reduce the model's ability to generalize to real-world scenarios:

  • Incorrect Sampling: If the training data does not reflect the actual population due to incorrect sample selection (e.g., using festive season sales to predict regular future sales, or a non-representative demographic mix), the model's predictions will be far removed from reality.

  • Mishandling Outliers: While outliers can be errors, they can also be "natural" values with valid reasons. Improperly amending or removing valid natural outliers can mislead the algorithm regarding the true variability of the data.

  • Excessive Data Removal: A simple approach to handling missing values is removing the affected records. However, if this is done excessively when a high proportion of data is missing, it reduces the training data size, thereby diminishing the power of the model.

The role of data preprocessing in improving model performance

Data preprocessing plays a critical role in enhancing the efficiency and accuracy of machine learning models by preparing raw data for analysis.

  • Dimensionality Reduction: High-dimensional datasets (those with many attributes) require significant computational time and space. Furthermore, not all features are useful; some can degrade algorithm performance.

  • Preprocessing techniques like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) create new attributes to reduce dimensionality, which helps algorithms perform better.

  • Feature Subset Selection: This process identifies an optimal subset of features by eliminating those that are irrelevant (contribute no information) or redundant (contribute similar information to other features). This reduces computational cost and improves model interpretability without negatively impacting learning accuracy.

  • Data Remediation: Preprocessing involves handling data issues that can skew results. For instance, outliers (abnormally high or low values) can impact prediction accuracy, especially in regression models. Addressing missing values ensures the model retains sufficient training data to maintain its predictive power.