Regression is one of the most important techniques in predictive modelling. It is usually the gateway into the world of data science and machine learning, offering a perfect balance of simplicity and power. Whether you're forecasting next month's sales figures or diagnosing medical conditions, regression techniques lay the groundwork for understanding and analyzing relationships between variables.
Despite being one of the earliest methods developed in statistical modeling, regression continues to thrive in modern machine learning pipelines. From detecting linear trends in data to tackling more complex relationships with advanced techniques, regression helps us unlock actionable insights. Its ability to model both numerical outcomes and categorical probabilities makes it versatile.
Understanding regression lays the foundation for more advanced topics like neural networks, LLMs, and diffusion models. In this text, we'll explore two of the most widely used regression techniques: linear and logistic regression. The models are explored using the advanced housing price dataset from Kaggle. The aim of this exercise is not to compare the efficiency of each type of supervised learning, but rather to see the subtle differences between the two algorithms.
Supervised Learning
Supervised learning is a class of machine learning where the model or algorithm learns from input-output pairs of labeled training data. During training, the model makes predictions on the input data and is corrected based on the actual labels. This process continues until the model achieves the desired level of accuracy. Supervised learning is typically used for classification or regression tasks.
Supervised learning models can achieve high accuracy, especially when the training data is well-labeled and representative of the problem. These models are generally reliable for making predictions on new, unseen data, as long as the new data is similar to the training data. These models are usually interpretable, allowing users to understand how predictions are made.
However, supervised learning requires a large amount of labeled data, which usually has to be labeled manually, making it time-consuming and expensive. If not properly regularized, models can also easily overfit the training data, i.e. they perform well on the training data but poorly on unseen data. This is especially likely when the training data is not diverse enough, in which case the model may not generalize well to new data that differs from the training set.
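One simple way to detect this is to hold out part of the labeled data and compare performance on the training and test portions. Below is a minimal sketch of that check using scikit-learn; the synthetic feature values, the choice of a linear model, and the 80/20 split are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for a labeled dataset: one feature (living area) plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))                  # hypothetical living area in sq ft
y = 50_000 + 120 * X[:, 0] + rng.normal(0, 20_000, 200)    # hypothetical price

# Hold back 20% of the data so the model is scored on examples it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# A large gap between the two scores is a common symptom of overfitting.
print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```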
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting straight line, known as the regression line, that predicts the dependent variable based on the independent variables. As the name suggests, the model aims to find a linear relationship between input variables (features) and a continuous output variable. Linear regression has its roots in coordinate geometry, especially the geometry of a straight line.
A line is represented by the equation y = mx + b, where y is the vertical coordinate (how far up the point is), x is the horizontal coordinate (how far along the point is), m is the slope or gradient, and b is the y-intercept, i.e. the value of y when x is zero. Here the value of 'y' (the 'y coordinate') of any point on the line can be easily determined if we have the value of 'x' (the 'x coordinate') and know the values of 'm' and 'b'. Note that 'm' and 'b' are the same for a given line; only 'x' and 'y' change as we move along the line. The same principle is used in linear regression to determine or predict a value. If we look at the plot below, which represents house prices vs the living area, we can fit a line that contains or is closest to all the data points. We can then use this line to approximate or predict the price of a house not represented in the data, if we know its living area.
Figure: Plot of house price vs living area
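As a rough sketch of this idea, the snippet below fits such a line to made-up living-area/price pairs with NumPy's polyfit and uses it to predict the price of an unseen house; the numbers are invented for illustration and do not come from the Kaggle dataset.

```python
import numpy as np

# Made-up living-area / price pairs standing in for the plotted data.
rng = np.random.default_rng(1)
area = rng.uniform(800, 3500, 100)
price = 40_000 + 110 * area + rng.normal(0, 15_000, 100)

# Fit price = m * area + b by least squares; np.polyfit returns [m, b] for degree 1.
m, b = np.polyfit(area, price, deg=1)

# Use the fitted line to approximate the price of a house not in the data.
new_area = 2_000
print(f"predicted price for {new_area} sq ft: {m * new_area + b:,.0f}")
```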
This line is represented by the equation Y = β₀ + β₁X + ϵ, also known as the Hypothesis Function, where:
- Y is the dependent variable or the value to be predicted. In this case, the house price.
- X is the independent variable. In this case, it's the living area.
- β₀ is the y-intercept.
- β₁ is the slope or gradient (also referred to as the weight).
- ϵ is the error term. This represents the delta between the prediction and the actual value.
This type of linear regression is known as simple linear regression, where we have one dependent variable and one independent variable. We can also have multiple independent variables and one dependent variable. Such linear regression is known as multiple linear regression and is represented by the equation Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ϵ
where X₁, X₂, …, Xₙ represent the independent variables and β₁, β₂, …, βₙ represent their respective weights. The higher the number of independent variables, the higher the dimension of the data. In 3-dimensional data (2 independent variables and 1 dependent variable), the best-fitting 'line' is a 2-dimensional plane.
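As a small worked example, the prediction of a multiple linear regression model is simply the intercept plus a dot product between the weight vector and the feature vector. The coefficients below are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical fitted weights for two features, e.g. living area (sq ft) and bedroom count.
beta_0 = 30_000.0                   # intercept
beta = np.array([105.0, 8_000.0])   # beta_1, beta_2

# One house described by its feature vector [living area, bedrooms].
x = np.array([1_800.0, 3.0])

# Y = beta_0 + beta_1 * X1 + beta_2 * X2  (ignoring the error term epsilon)
y_hat = beta_0 + beta @ x
print(f"predicted price: {y_hat:,.0f}")
```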
Training Process
The training process begins with gathering labelled data representing the value to be predicted and the corresponding features. The dataset may contain several features that do not contribute to the value to be predicted. For example, in the dataset used here, the feature 'Garage Year Built' may have very minimal or no effect on the price of the house, and thus can be ignored when training the model. Rather than relying on intuition or gut feeling, there are several techniques that help us determine the impact of a feature on the value to be predicted. One such technique is PCA (Principal Component Analysis), which can be used to reduce the feature set to its most informative components; a minimal sketch of this kind of feature screening follows.
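Below is a rough illustration on a made-up miniature table whose column names only mimic the housing data: a simple correlation check highlights features with little linear relationship to the target, and PCA shows how much variance each principal component explains.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up miniature housing table; the column names only mimic the real dataset.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3500, 100),
    "OverallQual": rng.integers(1, 10, 100),
    "GarageYrBlt": rng.integers(1950, 2010, 100),
})
df["SalePrice"] = (40_000 + 110 * df["GrLivArea"]
                   + 9_000 * df["OverallQual"]
                   + rng.normal(0, 15_000, 100))

# A quick correlation check shows which features have little linear relationship to the target.
print(df.corr()["SalePrice"].sort_values(ascending=False))

# PCA on the standardized features shows how much variance each component explains,
# which can justify dropping or combining low-impact inputs before training.
features = df.drop(columns="SalePrice")
scaled = (features - features.mean()) / features.std()
pca = PCA().fit(scaled)
print(pca.explained_variance_ratio_)
```

Once we have the necessary data for training, the steps below are followed.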
- Choosing a Hypothesis Function: An initial function with random weights is chosen and assumed to be the optimal solution. This function is known as the hypothesis function and is represented by y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ, where x₁, x₂, …, xₙ are the features and β₀, β₁, β₂, …, βₙ are the corresponding coefficients or weights.
- Calculating the Cost Function: The cost function is a measure of how well the chosen hypothesis fits or represents the data. A perfect model, which represents the entire data with no error, has a cost of 0. Though theoretically possible, in reality the cost function of a model is never 0. The cost function can be calculated using the expression J(β₀, β₁, …, βₙ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)², where m is the number of training examples, ŷᵢ is the i-th prediction, and yᵢ is the actual value of the i-th data sample.
- Optimizing the Model / Hypothesis Function: The goal of training is to minimize the cost function, which corresponds to finding the best-fitting line. In other words, find the line or plane that best represents the data. This can be achieved in several ways. Gradient descent is one of the most popular algorithms; it iteratively adjusts the coefficients or weights of the model to reduce the error (see the sketch after this list). The gradient descent update is represented by
βⱼ := βⱼ − α · ∂J(β₀, β₁, …, βₙ)/∂βⱼ
where α is the learning rate and ∂J/∂βⱼ is the partial derivative of the cost function with respect to βⱼ. The learning rate represents the step size of each update while optimizing the cost function. The larger the step size, the quicker the cost function approaches its minimum. However, choosing a very large learning rate may cause the cost function to oscillate from one side of the minimum to the other. At the same time, if the learning rate chosen is very small, it may take a very large number of passes over the data (known as epochs) to reach the minimum. Gradient descent continues until the change in the cost function is negligible or the target cost is reached, indicating convergence.
- Evaluating the Model: The model is evaluated on a test dataset to assess its performance, typically using the Mean Squared Error (MSE), which is the average squared difference between the predicted and actual values. MSE can be calculated using the expression below.
MSE = (1/m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)²
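To make these steps concrete, here is a minimal NumPy sketch of the whole loop on synthetic single-feature data: a hypothesis with two weights, the cost function J, gradient descent updates with a fixed learning rate, and a final MSE check. The data, learning rate, and epoch count are illustrative assumptions, not values from the original experiment.

```python
import numpy as np

# Synthetic single-feature data (price vs living area) standing in for the real dataset.
rng = np.random.default_rng(3)
x = rng.uniform(800, 3500, 200)
y = 40_000 + 110 * x + rng.normal(0, 15_000, 200)

# Scale the feature so a simple fixed learning rate converges.
x_scaled = (x - x.mean()) / x.std()

m = len(y)
beta0, beta1 = 0.0, 0.0   # initial weights of the hypothesis function
alpha = 0.1               # learning rate
epochs = 500

for _ in range(epochs):
    y_hat = beta0 + beta1 * x_scaled           # hypothesis h(x) for every sample
    error = y_hat - y
    cost = (error ** 2).sum() / (2 * m)        # J(beta0, beta1)
    grad0 = error.sum() / m                    # dJ/dbeta0
    grad1 = (error * x_scaled).sum() / m       # dJ/dbeta1
    beta0 -= alpha * grad0                     # simultaneous gradient descent update
    beta1 -= alpha * grad1

# Evaluate with Mean Squared Error (a real run would score a held-out test set instead).
mse = ((beta0 + beta1 * x_scaled - y) ** 2).mean()
print(f"final cost J: {cost:,.0f}   MSE: {mse:,.0f}")
```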
Logistic Regression
Logistic regression is a supervised learning algorithm used for binary classification. It predicts the probability that an instance belongs to a particular class. Unlike linear regression, which predicts continuous values, logistic regression is used to predict categorical outcomes, typically binary (e.g., 0 or 1, true or false, yes or no). Logistic regression uses the logistic function, also known as the sigmoid function, to map predicted values to probabilities.
In binary logistic regression there is a single binary dependent variable, known as the indicator variable, with values 0 or 1 (or TRUE or FALSE), while the independent variables can be binary (0 or 1) or continuous. The corresponding probability of the predicted value being 1 (or TRUE) can vary between 0 and 1. The function that converts log-odds to probability is the logistic function, hence the name.
The logistic function or sigmoid function is represented by:
y = 1 / (1 + e^−(β₀ + β₁x))
Where:
- y is the probability of the output being TRUE (or 1)
- β₀ and β₁ are the coefficients (β₀ is the intercept and β₁ is the weight of the independent variable)
- e is Euler’s Number, a mathematical constant. This is an irrational number approximately equal to 2.71828182.
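A small sketch of the sigmoid function makes this mapping concrete; the coefficients below are arbitrary choices for illustration, and a larger β₁ simply makes the S-curve steeper.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative coefficients; a larger beta1 makes the S-curve steeper.
beta0, beta1 = -4.0, 2.0

for x in np.linspace(-1.0, 5.0, 7):
    p = sigmoid(beta0 + beta1 * x)
    print(f"x = {x:+.1f}  ->  P(y = 1) = {p:.3f}")
```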
Figure: Sigmoid function for different coefficients
We can see that for different values of the independent variable, the probability of the output belonging to one of the two categories is very close to either 0 or 1 over most of the distribution. The point at which the predicted probability crosses 0.5, and the prediction switches from 0 to 1 (or FALSE to TRUE), is known as the decision boundary.
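As a hedged example of locating that boundary, the sketch below fits scikit-learn's LogisticRegression to synthetic labels (whether a made-up house price lies above the median) and solves β₀ + β₁x = 0 for the living area at which the predicted probability equals 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary labels: whether a made-up house price is above the median,
# predicted from the living area alone.
rng = np.random.default_rng(4)
area = rng.uniform(800, 3500, 300)
price = 40_000 + 110 * area + rng.normal(0, 25_000, 300)
above_median = (price > np.median(price)).astype(int)

# Standardize the feature so the solver converges quickly.
mu, sigma = area.mean(), area.std()
X = ((area - mu) / sigma).reshape(-1, 1)

clf = LogisticRegression().fit(X, above_median)

# The decision boundary is where the predicted probability equals 0.5,
# i.e. where beta0 + beta1 * x = 0; convert it back from scaled units to square feet.
beta0, beta1 = clf.intercept_[0], clf.coef_[0, 0]
boundary_sqft = (-beta0 / beta1) * sigma + mu
print(f"decision boundary at roughly {boundary_sqft:,.0f} sq ft")
```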
Conclusion
Linear and logistic regression are foundational techniques in the world of statistical modeling and machine learning. They offer simple yet powerful ways to understand and predict relationships within data. Linear regression excels in modeling continuous outcomes, providing a clear and interpretable way to predict numerical values. Logistic regression, on the other hand, empowers us to classify data into categories by leveraging the probability-driven sigmoid function. As next steps, one can consider experimenting with real-world datasets and exploring enhancements like feature scaling, regularization, and alternative optimization techniques to fully harness the power of regression models.