Feature Engineering
Explain the concept of feature engineering and its role in ML success.
Unstructured data is raw, unorganized data that doesn't follow a specific format or hierarchy. Typical examples include text data from social networks (e.g., Twitter, Facebook) or data from server logs.
A feature is an attribute of a dataset used in a machine learning process. Features are also called dimensions. A dataset with 'n' features is called an n-dimensional dataset.
Feature engineering is the preparatory process of translating a raw dataset into features that represent it more effectively, so that ML models built on those features achieve better learning performance.
Selecting a meaningful subset of features is one sub-area of feature engineering. Overall, feature engineering has two major elements:
- Feature transformation
- Feature subset selection
Analyse feature transformation vs feature extraction.
Compare feature transformation and feature subset selection.
Feature Transformation
Feature transformation converts data, whether structured or unstructured, into a new set of features that represent the underlying problem the machine learning model is trying to solve.
Engineering a good feature space is crucial for machine learning success. Often, it's unclear which features are more important. Feature transformation is used for dimensionality reduction to boost model performance.
For example, to classify documents as spam or non-spam, each document may be represented as a bag of words, which can lead to hundreds of thousands of features.
There are two goals:
- Achieving best reconstruction of original features
- Achieving highest efficiency in the learning task
There are two variants:
- Feature construction
- Feature extraction
Both are sometimes known as feature discovery.
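Returning to the spam example above, a minimal bag-of-words sketch (a reasonably recent scikit-learn is assumed; the two toy documents are invented): every distinct word becomes a feature, which is why real document collections yield very large feature spaces.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win a free prize now", "meeting agenda for next week"]

vectorizer = CountVectorizer()          # each distinct word becomes one feature
X = vectorizer.fit_transform(docs)      # document-term count matrix

print(vectorizer.get_feature_names_out())  # the constructed word features
print(X.toarray())
```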
1. Feature Construction
Feature Construction is about discovering missing information about relationships between features and creating additional features in feature space.
Feature construction involves transforming a given set of input features to generate a new set of more powerful features. If there are 'n' features, after construction, 'm' more may be added, making it 'n + m' dimensional.
For example, a three-dimensional apartment data set whose features include length and breadth can be transformed into a four-dimensional data set by adding the newly 'discovered' feature apartment area (length × breadth).
Feature construction is an essential activity before starting a machine learning task, typically in the following situations (a short sketch follows this list):
- When features have categorical values and machine learning needs numeric value inputs
- When features have numeric (continuous) values and need to be converted to ordinal values
- When text-specific feature construction needs to be done
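As a minimal sketch of these situations (the column names and values below are invented for illustration; pandas is assumed to be available), the snippet constructs an 'area' feature from length and breadth, one-hot encodes a categorical attribute, and bins a continuous attribute into ordinal bands:

```python
import pandas as pd

# Hypothetical apartment data set (illustrative values only).
df = pd.DataFrame({
    "length":  [10.0, 12.5, 9.0],      # metres
    "breadth": [8.0, 7.0, 6.5],        # metres
    "city":    ["Pune", "Delhi", "Pune"],
    "price":   [55.0, 62.0, 40.0],     # arbitrary units
})

# 1. Construct a new numeric feature from existing ones (3-D -> 4-D data set).
df["area"] = df["length"] * df["breadth"]

# 2. Categorical -> numeric: one-hot (dummy) encoding of 'city'.
df = pd.get_dummies(df, columns=["city"])

# 3. Numeric (continuous) -> ordinal: bin 'price' into ordered bands.
df["price_band"] = pd.cut(df["price"], bins=[0, 45, 60, 100],
                          labels=["low", "medium", "high"])

print(df)
```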
2. Feature Extraction
This involves extracting or creating new features from the original set of features using functional mapping from a combination of original features.
Commonly used operators for combining the original features include:
- For Boolean features: Conjunctions, Disjunctions, Negation, etc.
- For nominal features: Cartesian product, M of N, etc.
- For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division, Average, Equivalence, Inequality, etc.
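A minimal numerical illustration (the matrix below is made up; NumPy assumed): new features are extracted as the min, max, average and difference of the original features.

```python
import numpy as np

# Hypothetical feature matrix: rows = instances, columns = original numeric features.
X = np.array([[2.0, 5.0, 1.0],
              [4.0, 3.0, 7.0],
              [6.0, 8.0, 2.0]])

# New features as functional combinations of the original ones.
x_min  = X.min(axis=1)        # per-instance minimum
x_max  = X.max(axis=1)        # per-instance maximum
x_avg  = X.mean(axis=1)       # per-instance average
x_diff = X[:, 0] - X[:, 1]    # subtraction of two original features

X_new = np.column_stack([x_min, x_max, x_avg, x_diff])
print(X_new)
```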
Feature Subset Selection
Explain feature subset selection and its objectives.
In feature subset selection (or simply feature selection), no new features are generated. The aim is to select, from the full set of attributes, the subset of features that makes the most meaningful contribution to the machine learning problem.
- Feature construction expands the feature space, while feature extraction and feature selection reduce it.
Essentially, from the full feature set F = {F1, F2, …, Fn}, derive a subset F' = {F1, F2, …, Fm}, where m < n, such that F' makes the most meaningful contribution.
- For example, a student's roll number has no bearing on predicting the student's weight, so the feature 'roll number' can be eliminated (see the sketch below).
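As a trivial sketch (data invented for illustration, pandas assumed), the irrelevant roll-number column is simply excluded from the feature subset used to predict weight:

```python
import pandas as pd

# Hypothetical student data set; 'roll_number' carries no information about weight.
students = pd.DataFrame({
    "roll_number": [101, 102, 103],
    "age":         [14, 15, 14],
    "height_cm":   [155, 162, 150],
    "weight_kg":   [48, 55, 45],
})

# Keep only the subset of features that is meaningful for predicting weight.
X = students.drop(columns=["roll_number", "weight_kg"])
y = students["weight_kg"]
print(list(X.columns))   # ['age', 'height_cm']
```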
Popular Feature Extraction Algorithms
Most machine learning algorithms perform better when the number of mutually related attributes or features is reduced. A key to machine learning success lies in having fewer features with minimal similarity between them. Every dataset has multiple attributes or dimensions, many of which may be similar to one another.
Principal Component Analysis (PCA)
In PCA, a new set of features is extracted from the original features; the new features are orthogonal in nature, i.e., mutually uncorrelated. Thus, an n-dimensional feature space is transformed into an m-dimensional feature space in which the dimensions are orthogonal to each other.
Two orthogonal vectors in a 2-D space are perpendicular and hence linearly independent of each other. The transformation decomposes the original feature vectors into a set of orthogonal basis vectors, the principal components.
These principal components capture the variability of the original feature space, with their number being much smaller than the original features.
The objectives of PCA are:
- The new features (principal components) are distinct, with zero covariance between them.
- The principal components are ordered by the amount of variability they capture: the first captures the maximum, the second the next highest, and so on.
- The sum of the variances of the principal components equals the sum of the variances of the original features.
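A minimal sketch of these objectives using scikit-learn's PCA on the standard Iris data (library and dataset chosen purely for illustration): the explained-variance ratios show the ordering of the components, and their sample covariance matrix is (near) diagonal, i.e., zero covariance between components.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 4 original, partly correlated features
X_std = StandardScaler().fit_transform(X)   # PCA is usually applied to standardized data

pca = PCA(n_components=2)                   # keep the first two principal components
Z = pca.fit_transform(X_std)

# Components are ordered by the variability they capture ...
print(pca.explained_variance_ratio_)        # first component captures the most
# ... and have (near) zero covariance between them.
print(np.cov(Z, rowvar=False))              # off-diagonal terms are ~0
```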
Issues in High-Dimensional Data
High-dimensional data refers to datasets with a large number of variables, attributes, or features, often hundreds or thousands.
Challenges include:
- High computational cost: significant computational resources and time are required.
- Performance degradation: models may overfit to noise, leading to poor generalization in both supervised and unsupervised tasks.
- Poor interpretability: models with many features are difficult to understand and analyze.
To address these, feature selection is essential to reduce dimensionality while retaining meaningful information.
The objectives of feature selection are three-fold:
- Faster and more cost-effective models (reduced computational needs)
- Improved model efficiency and performance
- Better interpretability of the underlying data-generating model
Discuss relevance and redundancy in features.
Feature relevance and feature redundancy are the key drivers of feature selection.
Feature Relevance
Each of the predictor variables in a training dataset is expected to contribute information to decide the value of the predicted class label.
Feature relevance determines how useful a feature is for the learning task.
Supervised learning: Features are evaluated based on their contribution to predicting the target class label.
- Irrelevant: No contribution to prediction.
- Weakly relevant: Minimal contribution.
- Strongly relevant: Significant contribution.
Unsupervised learning: Has no labels; relevance is based on contribution to grouping similar instances.
- Variables that don't help in similarity assessment are marked as irrelevant.
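In the supervised case, relevance can be estimated with a statistical score such as mutual information (information gain). A minimal sketch on the Iris data, assuming scikit-learn: scores near zero flag irrelevant features, while higher scores indicate stronger relevance.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

# Higher score = stronger contribution to predicting the class label.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(iris.feature_names, scores):
    print(f"{name}: {score:.3f}")
```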
Feature Redundancy
Redundancy occurs when features provide similar information.
- If two features correlate strongly (e.g., Age and Height in weight prediction), one may be redundant.
- Removing redundant features reduces dimensionality without losing information.
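A common sketch for detecting redundancy (synthetic data; pandas and NumPy assumed): compute pairwise correlations and drop one feature from every highly correlated pair.

```python
import numpy as np
import pandas as pd

# Synthetic data in which 'height_in' duplicates the information in 'height_cm'.
rng = np.random.default_rng(0)
height_cm = rng.normal(165, 10, 200)
df = pd.DataFrame({
    "age":       rng.integers(12, 18, 200),
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,   # perfectly correlated duplicate
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)                  # ['height_in']
df_reduced = df.drop(columns=redundant)
```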
Feature Selection Approaches
There are four types of approaches for feature selection:
- Filter approach
- Wrapper approach
- Hybrid approach
- Embedded approach
Filter Approach
The feature subset is selected based on statistical measures to assess the merits of the features from the data perspective. No learning algorithm is employed to evaluate the goodness of the selected features.
Common statistical tests conducted on features are: Pearson's correlation, information gain, Fisher score, analysis of variance (ANOVA), Chi-Square, etc.
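A minimal sketch of the filter approach with scikit-learn's SelectKBest and the ANOVA F-test (one of the statistical measures listed above); note that no learning algorithm takes part in judging the features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature statistically and keep the top two.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # per-feature ANOVA F-scores
print(selector.get_support(indices=True))  # indices of the selected features
```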
Wrapper Approach
Identification of the best feature subset is done using the induction algorithm as a black box. The feature selection algorithm searches for a good feature subset using the induction algorithm itself as part of the evaluation function.
Since for every candidate subset, the learning model is trained and the result is evaluated by running the learning algorithm, the wrapper approach is computationally very expensive. However, the performance is generally superior compared to the filter approach.
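A minimal wrapper-style sketch using scikit-learn's SequentialFeatureSelector (a reasonably recent scikit-learn is assumed): each candidate subset triggers a cross-validated fit of the induction algorithm, which is exactly why the approach is expensive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)      # scale so the estimator converges easily

# The induction algorithm (logistic regression here) is used as a black box:
# candidate subsets are judged by the cross-validated performance of the trained model.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))       # indices of the selected feature subset
```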
Hybrid Approach
This approach takes advantage of both filter and wrapper approaches. A typical hybrid algorithm makes use of both the statistical tests as used in the filter approach to decide the best subsets for a given cardinality and a learning algorithm to select the final best subset among the best subsets across different cardinalities.
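A simplified two-stage sketch of the hybrid idea (not the exact per-cardinality procedure described above; scikit-learn assumed): a cheap statistical filter first prunes the feature space, and a wrapper then picks the final subset from the survivors.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Filter stage: keep the 10 features with the best ANOVA F-scores.
filter_stage = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_filtered = filter_stage.transform(X)

# Wrapper stage: a learning algorithm selects the final 5 from the survivors.
wrapper_stage = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                          n_features_to_select=5, cv=3)
wrapper_stage.fit(X_filtered, y)
print(wrapper_stage.get_support(indices=True))
```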
Embedded Approach
This approach is quite similar to the wrapper approach as it also uses an inductive algorithm to evaluate the generated feature subsets. However, the difference is that it performs feature selection and classification simultaneously.
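A minimal embedded-style sketch with an L1-regularized classifier in scikit-learn: unhelpful features receive exactly zero weight while the classifier itself is being trained, so selection and classification happen together.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 regularization drives the weights of unhelpful features to exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)

print(selector.get_support(indices=True))   # features kept (non-zero weights)
```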
Analyse trade-offs involved in aggressive feature reduction.
