DSS Blog

Linear regression basics guide - part 1

Written by Ashwin Viswanathan Kannan, Ph.D. | Jul 16, 2024 2:32:20 PM

In today's data-intensive world, data scientists play a crucial role. Their main job is to make sense of large amounts of data, turning it into useful information that can guide decisions. This involves both creative and technical skills, using a range of methods and tools. For beginners, it's important to understand these tools thoroughly.

Linear regression is a basic but essential technique for anyone in data science. It's used for making predictions and understanding the relationship between different things, like predicting sales or weather patterns.

In this blog, we'll start with linear regression, showing you the essential algorithms you need to know as a data scientist. We'll cover both the theory and the practical side, making sure you understand how to use linear regression effectively. With clear examples and explanations, we aim to teach you not just how to do it, but also why it works, preparing you for more advanced topics in data science.

Linear Regression

Linear Regression is a foundational statistical method used to model the relationship between a dependent variable and one or more independent variables. The method assumes a linear relationship between the variables, which can be used for predictions, trend analysis, and data insights.

Application Areas

Linear Regression has broad applications across industries:
In finance, it's used for risk assessment and stock market predictions.
In healthcare, it's employed to understand the relationships between medical parameters.
In marketing, it helps in predicting sales and the effectiveness of advertisement expenditures.

Mathematical framework

Linear Regression models the relationship between a dependent variable, y, and one or more independent variables, X. For a single independent variable, the line that best describes the dependence of y on X is:

$$y = \beta_0 + \beta_1 X + \epsilon$$

where:

y is the dependent variable,

X is the independent variable,

$\beta_0$ is the y-intercept,

$\beta_1$ is the slope of the line, and indicates how much y changes for a one-unit change in X,

$\epsilon$ is the error term.

In linear regression, the best-fit line is the one that minimizes the sum of the squared residuals, where a residual is the difference between an observed value and the value the model predicts. This is known as the least squares criterion.
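For readers who want to see where the coefficient formulas come from, the least squares criterion can be written out for the single-variable case. This is a sketch of the standard textbook derivation; the symbol S for the sum of squared residuals is introduced here for illustration and does not appear elsewhere in this post.

```latex
% Sum of squared residuals as a function of the two coefficients
S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0

\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0
```

The first equation says that the residuals sum to zero, which means the fitted line passes through the point of means. Solving the pair simultaneously gives closed-form expressions for the slope and intercept.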

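The slope ($\beta_1$) and intercept ($\beta_0$) can be calculated using the following equations:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

where:
$\bar{x}$ is the mean of the X values,
$\bar{y}$ is the mean of the y values.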
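The slope is the sum of the cross-deviations of x and y divided by the sum of the squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. A small example with numbers that can be checked by hand may make this concrete. The five data points below are made up for illustration and are not the dataset used later in this post.

```python
import numpy as np

# Small hand-checkable dataset (illustrative values, not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean = x.mean()  # 3.0
y_mean = y.mean()  # 4.0

# Slope: sum of cross-deviations over sum of squared x-deviations
numerator = np.sum((x - x_mean) * (y - y_mean))  # 6.0
denominator = np.sum((x - x_mean) ** 2)          # 10.0
slope = numerator / denominator

# Intercept: the fitted line passes through (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# slope = 0.60, intercept = 2.20
```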

Cost Function

Mean Squared Error (MSE) is a common metric for evaluating a regression model. It is the average of the squared errors, where each error is the difference between the value predicted by the model and the actual observed value:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where:
n is the number of data points,
$y_i$ is the actual value,
$\hat{y}_i$ is the predicted value.
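As a quick illustration, MSE can be computed step by step on a handful of values: take the differences between actual and predicted values, square them, and average. The arrays below are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from this post's dataset.

```python
import numpy as np

# Illustrative actual values and predictions (not from the article's dataset)
y_actual = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_predicted = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

errors = y_actual - y_predicted   # residuals: -0.8, 0.6, 1.0, -0.6, -0.2
squared_errors = errors ** 2      # 0.64, 0.36, 1.00, 0.36, 0.04
mse = squared_errors.mean()       # 2.40 / 5

print(f"MSE = {mse:.2f}")
# MSE = 0.48
```

Because the errors are squared, the single residual of 1.0 contributes more to the total than the two residuals of 0.6 combined, which is why MSE is sensitive to large individual errors.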

Practical Illustration of Linear Regression

Dataset preparation

Before implementing Linear Regression, it's essential to prepare your dataset. This preparation includes choosing pertinent data, addressing any anomalies or missing values, and, if necessary, normalizing or scaling the features. In this tutorial, we create a simple synthetic dataset, so no cleaning or scaling is required.
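As a rough illustration of what these preparation steps can look like on messier data, here is a minimal sketch using pandas. The column names (ad_spend, sales), the values, and the choice of an interquartile-range rule for anomalies are all assumptions made for this example; real projects may call for different choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value and one extreme value in ad_spend
raw = pd.DataFrame({
    "ad_spend": [10.0, 12.0, np.nan, 15.0, 11.0, 300.0],
    "sales": [25.0, 30.0, 28.0, 36.0, 27.0, 33.0],
})

# 1. Address missing values (here: drop incomplete rows)
clean = raw.dropna()

# 2. Address anomalies (here: keep rows within 1.5 * IQR of the quartiles)
q1 = clean["ad_spend"].quantile(0.25)
q3 = clean["ad_spend"].quantile(0.75)
iqr = q3 - q1
clean = clean[clean["ad_spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# 3. Scale the feature to zero mean and unit variance (standardization)
mean = clean["ad_spend"].mean()
std = clean["ad_spend"].std(ddof=0)
clean["ad_spend_scaled"] = (clean["ad_spend"] - mean) / std

print(clean)
```

Dropping rows is the simplest way to handle missing values and outliers, but it discards data; imputing or investigating the unusual rows may be more appropriate depending on the dataset.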

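Let's create a dataset where the dependent variable y has a linear relationship with the independent variable x: a true intercept of 2, a true slope of 3, and Gaussian noise with a standard deviation of 0.5. Because we know the true coefficients, this synthetic dataset lets us check whether Linear Regression recovers them.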

Code snippet to create a linear regression dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

# number of records
num_rec = 100
# Generate an array of 100 rows
x = 2.5 * np.random.randn(num_rec) + 1.5

# Create noise term
err_noise = 0.5 * np.random.randn(num_rec)      

# Create a linear relationship between x & y
y = 2 + 3 * x + err_noise 

Plotting the data

# Plot the results
plt.scatter(x, y, color='navy')        
plt.title('Generated Data', fontsize = 15)
plt.xlabel('X: independent variable', fontsize = 12)
plt.ylabel('Y: response variable', fontsize = 12)
plt.tight_layout()
plt.show()

Calculating $\beta_0$ and $\beta_1$

# Calculating means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculating the coefficients
numer_term_xy = np.sum((x - x_mean) * (y - y_mean))
denom_term_x = np.sum((x - x_mean) ** 2)
beta_1 = numer_term_xy / denom_term_x
beta_0 = y_mean - (beta_1 * x_mean)

# Verify coefficients
print(f"Coefficients: beta_0 = {beta_0}, beta_1 = {beta_1}")

# Coefficients: beta_0 = 2.0031670124623444, beta_1 = 3.0229396867092766

Calculating using Scikit-learn

Let's validate our coefficient values using the popular machine learning library scikit-learn.

from sklearn.linear_model import LinearRegression
X = x.reshape(-1, 1)
model = LinearRegression()
model.fit(X, y)

print(f"Coefficients (sklearn): beta_0 = {model.intercept_}, beta_1 = {model.coef_[0]}")
# Coefficients (sklearn): beta_0 = 2.0031670124623453, beta_1 = 3.022939686709276

Both approaches give the same values for the coefficients $\beta_0$ and $\beta_1$ (up to floating-point rounding), which define the best-fit line for our dataset.
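The two printed results differ only in their last digits, which is expected with floating-point arithmetic, so a tolerance-based comparison is safer than exact equality. The sketch below repeats the manual calculation and cross-checks it against NumPy's degree-1 polynomial fit. Using np.polyfit here is an extra check added for illustration; it is not part of the original walkthrough.

```python
import numpy as np

# Recreate the tutorial's synthetic data (same seed and formulas as above)
np.random.seed(0)
x = 2.5 * np.random.randn(100) + 1.5
y = 2 + 3 * x + 0.5 * np.random.randn(100)

# Closed-form least squares estimates
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Independent check: a degree-1 polynomial fit returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# Compare with a tolerance rather than exact equality
print(np.isclose(beta_1, slope_np), np.isclose(beta_0, intercept_np))
```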

Making Predictions

The linear regression model obtained has the following parameters:

Slope ($\beta_1$): 3.0229
Intercept ($\beta_0$): 2.0032

This means the best-fit line that models the relationship between x and y can be represented as:

$$\hat{y} = 2.0032 + 3.0229x$$

# Predict for x
y_pred = beta_0 + beta_1 * x

# Plotting the regression line and the data points
plt.scatter(x, y, color='navy', alpha = 0.4, label = 'Data points')  
plt.plot(x, y_pred, color='red', label = 'Best-fit line')
plt.legend(loc = 'best')
plt.title('Fitting Regression line', fontsize = 15)
plt.tight_layout()
plt.show()

To understand the effectiveness of our Linear Regression model, it's essential to visualize the results. Plotting the best-fit line alongside our data points can give us insights into how well our model has captured the underlying relationship.
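Beyond plotting the fitted line over the data it was trained on, the same equation can be applied to inputs the model has not seen. The sketch below uses the rounded coefficients reported in this section; the predict helper and the new x values are hypothetical and introduced only for this example.

```python
import numpy as np

# Coefficients as reported in this tutorial, rounded to four decimals
beta_0 = 2.0032  # intercept
beta_1 = 3.0229  # slope

def predict(x_values):
    """Apply the fitted line y_hat = beta_0 + beta_1 * x to new inputs."""
    return beta_0 + beta_1 * np.asarray(x_values, dtype=float)

# Hypothetical new inputs that were not part of the generated data
x_new = [0.0, 1.0, 5.0]
for x_i, y_i in zip(x_new, predict(x_new)):
    print(f"x = {x_i:.1f} -> predicted y = {y_i:.4f}")
```

At x = 0 the prediction is simply the intercept, and each one-unit increase in x raises the prediction by the slope. Predictions for x values far outside the range of the training data should be treated with more caution.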

Measuring the performance of the model

Calculating the Mean Squared Error

from sklearn.metrics import mean_squared_error

# Calculate MSE using sklearn
mse_sklearn = mean_squared_error(y, y_pred)
print(f"MSE using sklearn: {mse_sklearn}")
# MSE using sklearn: 0.26429296354765713

# Calculate MSE manually
mse_manual = np.mean((y - y_pred) ** 2)
print(f"Manually Calculated MSE: {mse_manual}")
# Manually Calculated MSE: 0.26429296354765713


The Mean Squared Error (MSE) for our model is roughly 0.2643. MSE is the average of the squared differences between the actual and predicted values, so a smaller MSE signifies a closer match between the model's predictions and the data. An MSE of 0.2643 is small relative to the scale of the y values, and it is close to the variance of the noise we added when generating the data (0.5² = 0.25). The remaining error is therefore almost entirely the noise itself, which indicates a good model fit.
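One way to judge an MSE against the scale of y is to take its square root, which puts the error back in the same units as y, and compare it with the spread of y and with the noise level. The sketch below does this for the synthetic data; reporting the root mean squared error (RMSE) is an addition made for illustration and is not part of the original walkthrough.

```python
import numpy as np

# Recreate the tutorial's synthetic data and fit
np.random.seed(0)
x = 2.5 * np.random.randn(100) + 1.5
y = 2 + 3 * x + 0.5 * np.random.randn(100)

beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
y_pred = beta_0 + beta_1 * x

mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)  # same units as y

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f} (noise standard deviation used to generate y: 0.5)")
print(f"Standard deviation of y: {y.std():.4f}")
```

If the RMSE is close to the noise level and much smaller than the standard deviation of y, the line explains most of the variation in y and what remains is mostly noise.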

Conclusion

In this tutorial, we covered the basics of Linear Regression, including its mathematical foundation and how to implement it in Python. By fitting a linear equation to observed data, Linear Regression allows us to make predictions about new data points. This method is fundamental in the field of machine learning and data analysis, serving as a building block for more complex algorithms.