Feature Selection in Machine Learning

In our previous blog on regression analysis, we have discussed how machine learning models can predict continuous outcomes, such as house prices or sales revenue, based on input data. While regression models are powerful, their success depends heavily on the quality and relevance of the features or variables used. Poor performance, overfitting, or even misleading results may be the result of including irrelevant or redundant features.

This brings us to the next important step in machine learning: feature selection. Feature selection helps us identify the most important variables in a dataset, improving the model's accuracy and efficiency. Whether working on regression or moving into the next kind of algorithms, classification—being able to master feature selection is key to building effective models.

In machine learning, the selection of appropriate features would give rise to:

Improves Model Performance: Reduces overfitting by removing noise.
Enhances Interpretability: Makes models easier to understand.
Speeds up Computation: Reduces training time.

In this guide, we’ll explore:

What feature selection is, and why it is so important.
Feature selection methods, such as Filter, Wrapper, and Embedded methods.
Python code with visualizations and interpretations.
In practice: How to apply feature selection to your projects.

What are Features in Machine Learning?

Features are the independent variables in your dataset that your machine learning model uses to make predictions. They describe the characteristics of your data points. For example:

In a dataset predicting house prices, features might include the square footage, number of bedrooms, and distance to the nearest school.
In a dataset predicting customer churn, features could include the customer’s monthly bill, contract type, and tenure.

In machine learning, we aim to use these features to predict a target variable (the dependent variable). For house prices, this target variable might be price, and for customer churn, it could be whether the customer will leave

Why Do We Need Feature Selection?

Not all features in a dataset are equally valuable. Some may be irrelevant, redundant, or noisy, which can negatively impact your model. Here’s why feature selection is important:

Improves Model Performance
- Irrelevant or noisy features corrupts the model and leads to lower accuracy. For example, a feature like the homeowner’s favorite TV show has no relevance to predicting house prices.
Reduces Overfitting
- Overfitting occurs when a model learns patterns that exist only in the training data but don’t generalize to new data. By removing unnecessary features, you can reduce the risk of overfitting.
Simplifies the Model
- With fewer features, models are easier to interpret and debug. For example, a model trained on just 5 key features is much easier to explain than one trained on 50 features.
Saves Computational Resources
- Feature selection reduces the number of variables the model has to process, leading to faster training and prediction.

Types of Feature Selection Methods

Feature selection can be broadly categorized into three types:

Filter Methods: Use statistical measures to score features.
Wrapper Methods: Use the performance of a model to evaluate feature subsets.
Embedded Methods: Feature selection occurs as part of the model training process.

Filter Methods

Filter methods are one of the simplest and most widely used feature selection techniques. They rely on statistical properties of the features, such as correlation, variance, or mutual information, to assess the relevance of each feature independently of any machine learning model.

What are Filter Methods?

Filter methods rank features based on their relationship with the target variable. These methods use statistical tests or scoring techniques to evaluate the importance of each feature, and then you can:

Select the top k features (e.g., top 5 most relevant features).
Remove features with low scores or thresholds (e.g., features with near-zero variance).

Key Techniques in Filter Methods

Let’s briefly discuss some of the most common filter methods, their mathematical formulation, and where they are used.

Pearson Correlation (for numerical features)

This method measures the linear relationship between each feature and the target variable.

Mathematical Formulation

Where:

and are data points for the feature and the target.
and are their respective means

Interpretation:

: Strong positive correlation (e.g., as feature increases, target also increases).
: Strong negative correlation (e.g., as feature increases, target decreases).
: No correlation (feature is unrelated to target).

Use Case:
For numerical features in regression problems (e.g., predicting house prices).

Code Example: Pearson Correlation

Let's calculate a correlation matrix to identify numerical features that are strongly correlated with the target (and with each other). Visualize the correlation matrix using a heatmap for easy interpretation, and discuss how to decide which features to keep or remove.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
data = fetch_california_housing(as_frame=True)
df = data['frame']

# Add the target variable to the DataFrame
df['Target'] = data['target']

# Compute the correlation matrix
correlation_matrix = df.corr()

# Visualize the correlation matrix with a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Features")
plt.show()

Heatmap Visualization

Deciding on Features based on Correlation

Look for features with high positive or negative correlation with the target (Target). For example:
1. MedInc (Median Income): Correlation = 0.69 → Strong positive relationship (higher income tends to predict higher house prices).
2. AveOccup (Average Occupants per Household): Correlation = -0.02 → Very weak relationship (can likely be discarded).
Handle Multicollinearity:
1. Features that are highly correlated with each other (e.g., AveRooms and AveBedrms) may contain redundant information. You can choose one of them to avoid multicollinearity.
Threshold Selection:
1. Decide on a threshold for correlation. For example, keep features with correlation with the target.

Chi-Square Test (for categorical features)

The Chi-Square Test helps us determine if there’s a significant relationship between a categorical feature and the target variable. It works by comparing the observed frequency of values in a feature to the expected frequency (assuming no relationship with the target). Features with a stronger relationship to the target will have a higher chi-square score.

Mathematical Formulation

Where:

is the observed frequency for a category.
is the expected frequency under the null hypothesis (independence).

Use Case:
For categorical features in classification problems (e.g., predicting whether a customer will churn).

Code Example: Chi-Square Test

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Create a toy dataset
data = {
'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
'SubscriptionType': ['Basic', 'Premium', 'Basic', 'Basic', 'Premium', 'Basic', 'Premium', 'Basic'],
'Tenure': [1, 3, 2, 1, 4, 2, 3, 1], # Numerical feature
'Churn': [0, 1, 0, 0, 1, 0, 1, 0] # Target variable
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Encode categorical features (convert to numbers)
df_encoded = pd.get_dummies(df[['Gender', 'SubscriptionType']], drop_first=True)
X = pd.concat([df_encoded, df[['Tenure']]], axis=1) # Features
y = df['Churn'] # Target

# Perform Chi-Square test
chi_selector = SelectKBest(score_func=chi2, k='all') # Select all features for ranking
chi_selector.fit(X, y)

# Display scores for each feature
chi_scores = pd.DataFrame({
'Feature': X.columns,
'Chi-Square Score': chi_selector.scores_
}).sort_values(by='Chi-Square Score', ascending=False)

print(chi_scores)

Feature Chi-Square Score

1 SubscriptionType_Premium 5.000000

2 Tenure 3.298039

0 Gender_Male 0.266667

Chi-Square Visualization

# Plot the scores
plt.figure(figsize=(8, 5))
plt.barh(chi_scores['Feature'], chi_scores['Chi-Square Score'], color='skyblue')
plt.xlabel('Chi-Square Score')
plt.title('Chi-Square Scores for Features')
plt.gca().invert_yaxis()
plt.show()

Deciding on Features based on Chi-Square test

Tenure (Chi-Square Score = 2.500):
1. This feature has the highest score, indicating it is the most important for predicting churn. Customers with longer tenure might be less likely to churn, so this feature shows a stronger relationship with the target variable.
Gender_Male (Chi-Square Score = 0.125):
1. The score is very low, suggesting that a customer’s gender has little to no impact on whether they churn.
SubscriptionType_Premium (Chi-Square Score = 0.083):
1. Similarly, this feature has a low score, meaning that whether a customer is on a premium subscription has minimal relevance to churn prediction.

Wrapper Methods

Wrapper methods select features by evaluating subsets of features based on their contribution to the model’s performance. These methods are computationally expensive but more accurate than filter methods because they account for feature interactions.

Recursive Feature Elimination (RFE)

What is RFE?
Recursive Feature Elimination (RFE) is a popular wrapper method that iteratively removes the least important features based on a model’s performance. It work's as follows:

A machine learning model (e.g., linear regression or random forest) is trained on the full dataset.
Features are ranked based on their importance (e.g., coefficients or feature importance).
The least important feature is removed, and the process is repeated until the desired number of features is reached.

Code Example: RFE with Logistic Regression

We’ll use a dataset to predict whether a customer will churn (binary classification) and apply RFE with logistic regression as the model.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import pandas as pd

# Generate a synthetic binary classification dataset
X, y = make_classification(
n_samples=200, n_features=10, n_informative=5, n_redundant=2, random_state=42
)
feature_names = [f"Feature_{i}" for i in range(1, 11)]
df = pd.DataFrame(X, columns=feature_names)

# Initialize logistic regression model
model = LogisticRegression()

# Apply Recursive Feature Elimination (RFE)
rfe = RFE(estimator=model, n_features_to_select=5) # Select top 5 features
rfe.fit(X, y)

# Get feature rankings
feature_ranking = pd.DataFrame({
'Feature': feature_names,
'Ranking': rfe.ranking_
}).sort_values(by='Ranking')

print(feature_ranking)

Feature Ranking

0 Feature_1 1

1 Feature_2 1

4 Feature_5 1

5 Feature_6 1

6 Feature_7 1

7 Feature_8 2

2 Feature_3 3

3 Feature_4 4

9 Feature_10 5

8 Feature_9 6

RFE Visualization

# Plot feature rankings
plt.figure(figsize=(8, 5))
plt.barh(feature_ranking['Feature'], feature_ranking['Ranking'], color='skyblue')
plt.xlabel('Ranking')
plt.ylabel('Features')
plt.title('Feature Rankings by RFE')
plt.gca().invert_yaxis()
plt.show()

Deciding on Features based on RFE

Selected Features:
1. Features ranked as 1 are the most important and were selected (e.g., Feature_6, Feature_8, etc.).
2. RFE iteratively removed the least important features until only the top 5 were left
Unselected Features:
1. Features with higher rankings (e.g., Feature_1, Feature_9) were deemed less important and were excluded from the final subset.
Practical Use:
1. You can now retrain your model using only the selected features (Feature_6, Feature_8, etc.) for better performance and reduced complexity.

Embedded Methods

Embedded methods combine feature selection with the model training process. The feature importance is determined as part of model training, making these methods computationally efficient compared to wrapper methods.

Lasso Regression (L1 Regularization)

What is Lasso Regression?
Lasso regression is a type of linear regression that includes a penalty for the absolute value of feature coefficients. This penalty forces some coefficients to become exactly zero, effectively removing less important features.

Mathematical Formulation

Where:

is the loss function.
is the mean squared error.
is the regularization parameter (controls how strongly features are penalized).
are the absolute values of the feature coefficients.

Code Example: Lasso Regression

We’ll use Lasso regression to predict house prices and identify the most important features.

from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(
n_samples=200, n_features=10, n_informative=5, noise=0.1, random_state=42
)
feature_names = [f"Feature_{i}" for i in range(1, 11)]
df = pd.DataFrame(X, columns=feature_names)

# Train Lasso regression model
lasso = Lasso(alpha=0.1) # Regularization strength (λ)
lasso.fit(X, y)

# Get feature importance (coefficients)
feature_importance = pd.DataFrame({
'Feature': feature_names,
'Coefficient': lasso.coef_
}).sort_values(by='Coefficient', ascending=False)

print(feature_importance)

Feature Coefficient

3 Feature_4 80.711485

7 Feature_8 40.580834

5 Feature_6 34.804164

2 Feature_3 10.970076

6 Feature_7 6.525735

0 Feature_1 0.000000

1 Feature_2 -0.000000

4 Feature_5 -0.000000

8 Feature_9 0.000000

9 Feature_10 -0.000000

Lasso Feature Importance Visualization

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Feature Importance from Lasso Regression')
plt.gca().invert_yaxis()
plt.show()

Deciding on Features based on Lasso

Non-Zero Coefficients:
1. Features with non-zero coefficients (e.g., Feature_3, Feature_1, Feature_5) are important for predicting the target variable.
2. These features will be retained in the final model.
Zero Coefficients:
1. Features with coefficients of 0 (e.g., Feature_7, Feature_10) contribute nothing to the model and can be dropped.
Regularization Parameter (λ):
1. The alpha parameter controls the strength of regularization. Higher values of alpha lead to more aggressive feature selection, potentially dropping more features.

Conclusion

Feature selection is one of the most crucial steps in machine learning, helping to improve model performance, reduce overfitting, and simplify models by narrowing the focus to only the most relevant features. By doing so, it ensures that your model learns from meaningful patterns in the data rather than noise or irrelevant information.

Best Practices for Effective Feature Selection

Understand Your Data
Begin by performing Exploratory Data Analysis (EDA) to understand the relationships between features and the target variable. Visualize correlations, distributions, and interactions to gain insights into the data structure.
Start with Simple Methods
Use filter methods, such as correlation analysis, as a quick and cost-effective first step to eliminate irrelevant or redundant features.
Combine Multiple Techniques
For a more robust feature selection process, combine filter, wrapper, and embedded methods. For example, start with filter methods to pre-select features, then refine your selection using wrapper or embedded approaches.
Leverage Domain Knowledge
Incorporate domain expertise to identify critical features and exclude irrelevant ones. This step can significantly improve the effectiveness of feature selection.
Avoid Overfitting
Be cautious with wrapper methods, such as Recursive Feature Elimination (RFE), as they can lead to overfitting, particularly when working with small datasets. Always validate your selected features using cross-validation techniques.
Monitor Regularization
When using embedded methods like Lasso regression, carefully tune the regularization parameter (λ) to strike the right balance between feature selection and model accuracy.
Test Your Final Model
Evaluate your model's performance with and without feature selection to ensure that the removal of features genuinely improves outcomes.

By following these best practices, you can effectively streamline your machine learning models, enhance their interpretability, and improve overall performance.