DSS Blog

Introduction: From Simple to Multiple Linear Regression

Written by Ashwin Viswanathan Kannan, Ph.D. | Oct 1, 2024 6:25:53 PM

In our previous post, we introduced Linear Regression, a fundamental technique used to predict outcomes based on a single factor—such as estimating house prices based on square footage. We also discussed essential performance metrics like Mean Squared Error (MSE) and R-squared, which help assess how well our model fits the data.

By the end of that post, we touched on other types of regression beyond simple linear regression, like Multiple Linear Regression, which handles scenarios where more than one factor influences the outcome. For instance, house prices are not only affected by size but also by location, age, and the number of bedrooms. This is where Multiple Linear Regression (MLR) becomes invaluable.

In this blog post, we will build on what we’ve learned and explore Multiple Linear Regression—a method that allows us to predict an outcome by considering multiple factors at once. 

Let’s get started by diving into how MLR can help us model more complex, real-world situations.

Understanding Multiple Linear Regression

What is Regression?

Before jumping into Multiple Linear Regression (MLR), let's briefly review what regression is. In simple terms, regression is a method used to predict one variable based on the values of another. Imagine you’re a farmer trying to predict how much wheat your field will yield based on the amount of rainfall. You could use regression to estimate the relationship between rainfall and wheat yield.

However, in real life, outcomes often depend on more than just one factor. This is where Multiple Linear Regression comes in. Instead of just looking at one factor, MLR allows us to consider several factors simultaneously.

What is Multiple Linear Regression?

Multiple Linear Regression (MLR) is a technique used to predict the value of a dependent variable (the outcome) based on multiple independent variables (predictors).

Let’s consider a relatable example: predicting the price of a car. The price of a car doesn’t depend on just one thing, such as the engine size; it also depends on other predictors like:

  • The brand of the car
  • The year it was made
  • The mileage (how far it’s been driven)
  • Whether it’s a sedan, SUV, or hatchback

MLR allows us to understand how much each of these factors influences the car's price. It helps answer questions like, "How do different variables or predictors (brand, year, mileage) impact the price of a car?"

Why Use Multiple Linear Regression?

Typically, MLR is used in situations where you want to predict an outcome influenced by more than one predictor. For example:

  • A company predicting sales based on advertising spend across TV, radio, and social media.
  • A student trying to understand how their study habits, sleep, and attendance affect their exam scores.

MLR doesn’t just give you a prediction; it also helps you understand which variables are more important. For instance, is the mileage of a car more important than its brand when determining price?

Mathematical Formulation of Multiple Linear Regression

Multiple Linear Regression (MLR) is an extension of Simple Linear Regression, where instead of one predictor (independent variable), there are multiple predictors. The goal of MLR is to model the relationship between two or more predictors and a single outcome (dependent variable). Let’s break down the equation:

General Form of the MLR Equation

The equation for MLR looks like this:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$

Let’s explain each component in this equation:

  • $Y$: This is the dependent variable (the outcome we want to predict). In many real-world applications, this could be something like car price, house prices, customer satisfaction scores, or sales revenue.
  • $X_1, X_2, \dots, X_n$: These are the independent variables (also called predictors, features, or explanatory variables). These are the factors that influence the outcome. For example, if we're predicting house prices, our predictors might be the number of bedrooms, square footage, and the age of the house.
  • $\beta_1, \beta_2, \dots, \beta_n$: These are the coefficients (sometimes called regression coefficients or weights). Each coefficient represents the change in the dependent variable $Y$ for a one-unit change in the corresponding independent variable $X$, assuming all other variables remain constant. This is key: the coefficients tell us how strongly each predictor affects the outcome.
  • $\beta_0$: This is the intercept. It represents the expected value of $Y$ when all the independent variables are equal to zero. In many cases, the intercept might not have a clear real-world interpretation, but mathematically, it's a baseline value for the outcome.
  • $\epsilon$: This is the error term (sometimes called the residual). The error term accounts for variability in $Y$ that the predictors can't explain. In other words, it's the difference between the actual value and the value predicted by the model. Real-life data is rarely perfectly predictable, so this term represents those imperfections.
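
As a rough sketch of how this equation turns into a computation: leaving out the unobservable error term, a prediction is the intercept plus the dot product of the coefficients and the predictor values. The function name `mlr_predict` and the numbers below are illustrative assumptions, not part of any library or dataset.

```python
import numpy as np

def mlr_predict(intercept, coefficients, x):
    """Return beta_0 + beta_1*x_1 + ... + beta_n*x_n for one observation."""
    coefficients = np.asarray(coefficients, dtype=float)
    x = np.asarray(x, dtype=float)
    if coefficients.shape != x.shape:
        raise ValueError("need exactly one coefficient per predictor")
    return float(intercept + coefficients @ x)

# Made-up values purely for illustration: two predictors
example = mlr_predict(intercept=2.0, coefficients=[0.5, -1.0], x=[4.0, 3.0])
print(example)  # 2.0 + 0.5*4.0 - 1.0*3.0 = 1.0
```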

Illustrative Example of using MLR

Let’s break down each component of the equation using a practical example: predicting the price of a car based on several features.

Suppose we want to predict the price of a used car. We have three variables or predictors that influence the price:

  • $X_1$: Mileage (the number of miles the car has been driven)
  • $X_2$: Age of the car (how many years old the car is)
  • $X_3$: Brand (whether it's a premium or non-premium brand, coded as 1 for premium and 0 for non-premium)

The car price ($Y$) depends on these three variables. We’ll use MLR to determine how each factor (mileage, age, and brand) influences the final price.

The MLR equation for this situation would be:

$$\text{Price} = \beta_0 + \beta_1 \cdot \text{Mileage} + \beta_2 \cdot \text{Age} + \beta_3 \cdot \text{Brand} + \epsilon$$

  • Price: The dependent variable ($Y$), representing the car’s price we are trying to predict.
  • Mileage, Age, Brand: The independent variables ($X_1$, $X_2$, $X_3$), representing the factors that we believe affect the price.
  • $\beta_1, \beta_2, \beta_3$: The coefficients, representing how much each factor influences the price.
  • $\beta_0$: The intercept, representing the baseline price when all factors are zero (for instance, a brand-new, non-premium car with zero mileage).
  • $\epsilon$: The error term, representing any unpredictable factors or random variation in the data.

Understanding the Coefficients ($\beta_1, \beta_2, \beta_3$)

The coefficients ($\beta_1, \beta_2, \beta_3$) are the key to understanding how much each independent variable influences the dependent variable.

  • $\beta_1$ (coefficient for Mileage): This tells us how much the car’s price changes as mileage increases. If $\beta_1 = -0.05$, it means for every additional mile driven, the car’s price decreases by $0.05.
  • $\beta_2$ (coefficient for Age): This shows the impact of the car’s age. If $\beta_2 = -500$, it means that for every year the car gets older, its price drops by $500.
  • $\beta_3$ (coefficient for Brand): This coefficient represents the price difference between a premium brand and a non-premium brand. If $\beta_3 = 10{,}000$, it means that premium cars tend to be $10,000 more expensive than non-premium cars, all else being equal.
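
The "all else being equal" reading of a coefficient can be checked numerically: change one predictor by one unit, hold the others fixed, and the prediction moves by exactly that predictor's coefficient. The sketch below uses the illustrative coefficient values from the bullets above; the function name `price`, the placeholder intercept, and the reference car are assumptions made only for this demonstration.

```python
def price(mileage, age, premium, b0=0.0, b1=-0.05, b2=-500.0, b3=10_000.0):
    """Linear price model with the illustrative coefficients from the text.

    b0 is a placeholder: the intercept cancels out when two cars are compared.
    """
    return b0 + b1 * mileage + b2 * age + b3 * premium

# Hypothetical reference car: 40,000 miles, 4 years old, non-premium
base = price(mileage=40_000, age=4, premium=0)

# Change one predictor at a time while holding the others fixed
one_more_mile = price(40_001, 4, 0) - base   # approximately -0.05
one_more_year = price(40_000, 5, 0) - base   # approximately -500
premium_gap = price(40_000, 4, 1) - base     # approximately +10,000

print(round(one_more_mile, 2), round(one_more_year, 2), round(premium_gap, 2))
```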

The Intercept ($\beta_0$)

The intercept ($\beta_0$) represents the predicted price of a car if all the independent variables are zero. In our case, this would be the theoretical price of a car with zero mileage, zero age (a brand-new car), and a non-premium brand. This intercept is often viewed as the "starting point" from which the effects of mileage, age, and brand adjust the final price.

The Error Term ($\epsilon$)

Even though we include several factors in our model, there will always be other things that affect the price which we can’t measure or didn’t account for. This is captured by the error term ($\epsilon$). For example, two cars with the same mileage, age, and brand might still have different prices due to differences in their condition, the seller’s urgency, or market trends.

An Example Equation

Let’s say we gather data and find the following coefficients from the model:

$$\text{Price} = 30{,}000 - 0.05 \cdot \text{Mileage} - 500 \cdot \text{Age} + 10{,}000 \cdot \text{Brand}$$

In this case:

  • $\beta_0 = 30{,}000$: The base price of a brand-new non-premium car with zero mileage.
  • $\beta_1 = -0.05$: For every additional mile the car has been driven, the price decreases by $0.05.
  • $\beta_2 = -500$: For every year older the car gets, the price drops by $500.
  • $\beta_3 = 10{,}000$: If the car is a premium brand, it costs $10,000 more than a non-premium brand.

Using the Model to Predict Car Prices

Let’s use this equation to predict the price of a car.

  • Mileage: 50,000 miles
  • Age: 5 years
  • Brand: Premium (we’ll use 1 for premium brands and 0 for non-premium brands)

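Plugging these values into the equation:

$$\text{Price} = 30{,}000 + (-0.05 \times 50{,}000) + (-500 \times 5) + (10{,}000 \times 1)$$

Step-by-step calculation:

$$\text{Price} = 30{,}000 - 2{,}500 - 2{,}500 + 10{,}000 = 35{,}000$$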

According to our model, a premium car that has 50,000 miles and is 5 years old would have an estimated price of $35,000.

Interpretation

  • The mileage caused the price to drop by $2,500.
  • The age of the car caused an additional $2,500 drop.
  • Since it’s a premium brand, the price increased by $10,000.
  • The base price for a brand-new car (with zero miles and a non-premium brand) is $30,000.
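
The breakdown above can be reproduced in a few lines. This is only a sketch using the example coefficients from this section (they are illustrative values, not estimates from real data); the dictionary layout and variable names are our own choices.

```python
# Illustrative coefficients and intercept from the worked example
intercept = 30_000.0
coefficients = {"mileage": -0.05, "age": -500.0, "brand_premium": 10_000.0}

# The car being priced: 50,000 miles, 5 years old, premium brand (1)
car = {"mileage": 50_000, "age": 5, "brand_premium": 1}

# Each predictor's contribution is its coefficient times its value
contributions = {name: coefficients[name] * car[name] for name in coefficients}
predicted_price = intercept + sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>14}: {value:+,.0f}")
print(f"{'intercept':>14}: {intercept:+,.0f}")
print(f"{'predicted':>14}: {predicted_price:,.0f}")
```

Printing the contributions separately makes it easy to see which predictor moves the estimate the most for a particular car.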

This is what makes Multiple Linear Regression useful: it allows us to combine multiple variables and understand how each one contributes to the overall prediction, which typically gives more accurate results than looking at each factor individually.

 

Code Implementation: Building a Multiple Linear Regression Model to Predict Car Prices

Now that we’ve covered the theory behind Multiple Linear Regression, it's time to see how this works in practice. In this section, we'll walk through how to build, train, and evaluate a Multiple Linear Regression (MLR) model using Python.

Our goal is to predict the price of a car based on three key features: mileage, age, and whether the car belongs to a premium brand.

Import Libraries and Create the Dataset

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Simulated dataset
data = {
    'mileage': [5000, 30000, 45000, 70000, 120000, 80000, 60000, 100000, 40000, 20000],
    'age': [1, 3, 5, 8, 10, 7, 6, 9, 4, 2],
    'brand_premium': [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],  # 1 = premium brand, 0 = non-premium brand
    'price': [35000, 22000, 20000, 15000, 8000, 12000, 19000, 9000, 21000, 26000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataset (print so it also works outside a notebook)
print(df.head())


 

In this dataset:

  • mileage: The number of miles the car has been driven.
  • age: The age of the car in years.
  • brand_premium: A binary variable where 1 represents a premium brand and 0 represents a non-premium brand.
  • price: The actual price of the car.

 

Create and Train the Model

# Define the independent variables (predictors) and dependent variable (target)
X = df[['mileage', 'age', 'brand_premium']]  # Independent variables
Y = df['price']  # Dependent variable (car price)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X, Y)

# Output the model's coefficients and intercept
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Coefficients: [-1.79806175e-01 -3.54349786e+02  2.78662384e+03]
# Intercept: 29504.563894523322

 

  • Coefficients: These represent the impact of each independent variable on the car price, listed in the same order as the columns of X (mileage, age, brand_premium).
  • Intercept: The baseline price when all independent variables are zero.
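
scikit-learn's LinearRegression fits the model by ordinary least squares. As a minimal sketch of what that means, and assuming only NumPy is available, the same least-squares problem can be solved directly by adding a column of ones for the intercept. The variable names below are our own; the data is the simulated dataset from this post.

```python
import numpy as np

# Simulated dataset from the post
mileage = np.array([5000, 30000, 45000, 70000, 120000,
                    80000, 60000, 100000, 40000, 20000], dtype=float)
age = np.array([1, 3, 5, 8, 10, 7, 6, 9, 4, 2], dtype=float)
brand_premium = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 1], dtype=float)
price = np.array([35000, 22000, 20000, 15000, 8000,
                  12000, 19000, 9000, 21000, 26000], dtype=float)

# Design matrix: a leading column of ones lets the solver estimate the intercept
A = np.column_stack([np.ones_like(mileage), mileage, age, brand_premium])

# Least-squares solution minimizing the sum of squared residuals
beta, *_ = np.linalg.lstsq(A, price, rcond=None)
intercept, coefs = beta[0], beta[1:]

print(f"Intercept: {intercept:.2f}")
print(f"Coefficients: {coefs}")
```

The values printed here should agree with the intercept and coefficients reported by the model above, up to floating-point rounding.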




Make Predictions

 

# Predict car prices based on the input data
predicted_prices = model.predict(X)

# Compare actual and predicted prices
comparison = pd.DataFrame({'Actual Price': Y, 'Predicted Price': predicted_prices})
print(comparison)

   Actual Price  Predicted Price
0         35000     31037.807077
1         22000     23047.329277
2         20000     22428.160920
3         15000     14083.333333
4          8000      7170.948839
5         12000     12639.621366
6         19000     19376.718503
7          9000      8334.798287
8         21000     20894.917737
9         26000     27986.364661



Evaluate the Model

We’ll use two primary evaluation metrics to assess how well the model fits the data:

  1. Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.
  2. R-squared: Measures the proportion of the variance in the dependent variable that is explained by the independent variables.

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
print(f"Mean Squared Error (MSE): {mse:.2f}")
# Output: Mean Squared Error (MSE): 2916965.29

# Calculate the R-squared value
r_squared = r2_score(Y, predicted_prices)
print(f"R-squared: {r_squared:.2f}")
# Output: R-squared: 0.95

  • A lower MSE indicates a better fit (the error between the actual and predicted values is smaller).
  • An R-squared value closer to 1 means the model explains most of the variability in the target variable (car price).
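
To make the two metrics less of a black box, here is a sketch that computes them by hand from their definitions, using the actual and predicted prices printed in the comparison table above (so the predictions are rounded to six decimals). The array names are our own, and the root mean squared error (RMSE) is added only because it is in dollars rather than squared dollars.

```python
import numpy as np

# Actual and predicted prices copied from the comparison table
actual = np.array([35000, 22000, 20000, 15000, 8000,
                   12000, 19000, 9000, 21000, 26000], dtype=float)
predicted = np.array([31037.807077, 23047.329277, 22428.160920, 14083.333333,
                      7170.948839, 12639.621366, 19376.718503, 8334.798287,
                      20894.917737, 27986.364661])

residuals = actual - predicted

# MSE: the mean of the squared residuals
mse = np.mean(residuals ** 2)

# RMSE: square root of MSE, in the same units as the price
rmse = np.sqrt(mse)

# R-squared: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R-squared: {r_squared:.2f}")
```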

 

Visualize the Results

To better understand how well the model predicts car prices, let’s create a scatter plot comparing the actual and predicted prices:

# Scatter plot of actual vs. predicted prices
plt.scatter(Y, predicted_prices)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs. Predicted Car Prices')
plt.plot([min(Y), max(Y)], [min(Y), max(Y)], color='red', linestyle='--')
plt.show()

This plot shows how closely the predicted prices align with the actual prices. The red dashed line represents a perfect prediction, where the actual price equals the predicted price.

 

Model Evaluation and Interpretation

 

Coefficients Interpretation: After training the model, we observe the following coefficient values:

 

Coefficients: [-1.79806175e-01 -3.54349786e+02  2.78662384e+03]
Intercept: 29504.563894523322

 

Mileage Coefficient (-0.1798): For every additional mile driven, the car’s price drops by approximately $0.18, holding all other factors constant.

Age Coefficient (-354.35): For every year older the car gets, its price decreases by about $354, holding all other factors constant.

Brand Premium Coefficient (2786.62): A premium brand car is expected to be about $2,787 more expensive than a non-premium car, holding all other factors constant.

Intercept (29504.56): This represents the baseline price of a car with zero miles, zero age, and a non-premium brand (theoretical value).
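
As a quick illustration of how these fitted values are used, the sketch below plugs the reported coefficients and intercept into the regression equation for the same car as in the earlier worked example (50,000 miles, 5 years old, premium brand). The variable names are our own, and the coefficients are copied from the printed output, so they are rounded.

```python
# Fitted values copied from the model output above (rounded as printed)
intercept = 29504.563894523322
coef_mileage = -1.79806175e-01
coef_age = -3.54349786e+02
coef_brand_premium = 2.78662384e+03

# Car from the earlier worked example
mileage, age, brand_premium = 50_000, 5, 1

estimated_price = (intercept
                   + coef_mileage * mileage
                   + coef_age * age
                   + coef_brand_premium * brand_premium)

print(f"Estimated price: ${estimated_price:,.2f}")
```

With these coefficients the estimate comes out at roughly $21,529, well below the $35,000 from the hand-picked example coefficients, which shows how much the prediction depends on what the model learned from the data.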

 

Model Evaluation: The MSE and R-squared calculations gave the following values:

Mean Squared Error (MSE): 2916965.29
R-squared: 0.95

 

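MSE of 2916965.29: On average, the squared difference between the actual and predicted prices is about 2,916,965. Because MSE is expressed in squared dollars, the number looks large; its square root, the root mean squared error (RMSE), is roughly $1,708, which is easier to compare with car prices in the $8,000 to $35,000 range.

R-squared of 0.95: This means that 95% of the variance in car prices in this dataset can be explained by mileage, age, and whether the car is a premium brand or not. An R-squared of 0.95 indicates a close fit, though it was measured on the same 10 cars used to train the model, so it says little about how well the model would predict prices for new cars.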

Takeaways

In this example, we built a Multiple Linear Regression model to predict car prices based on mileage, age, and whether the car belongs to a premium brand. We saw how each of these factors impacts the price by examining the coefficients of the model.

We also evaluated the model using Mean Squared Error and R-squared, finding that the model fits the training data well, with an R-squared of 0.95, meaning it explains 95% of the variance in the car prices in this dataset. Visualizing the results further confirmed that our predictions are reasonably close to the actual values.

Conclusion and Next Steps

In this post, we explored how Multiple Linear Regression (MLR) can help us predict car prices by considering multiple factors like mileage, age, and whether the car is a premium brand. By building and evaluating the model, we gained a deeper understanding of how each feature influences the outcome and how to assess the model's performance.

In the upcoming posts, we’ll examine more advanced topics such as:

  • Feature Selection: How to identify the most important variables to make your model more efficient and accurate.
  • Multicollinearity: Understanding the impact of closely related independent variables and how to address this issue to improve model reliability.
  • Regularization: Techniques like Ridge and Lasso Regression to prevent overfitting, especially when working with larger datasets.
  • Cross-Validation: A method that helps test how well your model generalizes to new, unseen data, ensuring it's ready for real-world applications.