An introduction to linear regression for machine learning
In this post, I will go over the concept of simple linear regression, delve into the underlying mathematical principles of the algorithm, and explore its practical application in the field of machine learning.
What is Linear Regression?
Linear regression is a type of regression analysis used to make predictions based on labeled data. In this post, we're going to focus on the simplest application of the linear regression algorithm which is referred to as Simple Linear Regression, or simply Linear Regression.
Use Cases
- House Price Prediction: Linear regression can be used to predict the price of a house based on its square footage (sqft).
  - Note: the predicted price depends on the value of the square footage, so we say that price is the dependent variable and square footage is the independent variable.
- Sales Forecasting: Companies can predict marketing return on investment (ROI) based on previous advertising spend.
  - Note: the predicted ROI depends on the advertising spend, so ROI is the dependent variable and advertising spend is the independent variable.
- Medical Research: Linear regression is often used to predict disease risk in patients given their age.
  - Can you guess the independent variable in this case?
Making Predictions
Let's take the first example and see how we can apply linear regression to predict a house price. Here is a table showing house prices, given their area size (sqft):
| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| 1       | 1000                  | 100,000   |
| 2       | 1200                  | 150,000   |
| 3       | 1500                  | 200,000   |
| 4       | 1800                  | 250,000   |
| 5       | 2200                  | 300,000   |
We're asked if we can predict the price of a 1300 sqft house given the data we already have. Could we use linear regression in this case?
Of course. Let's add the new house to our table:
| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| ...     | ...                   | ...       |
| 6       | 1300                  | ??        |
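It may help to follow the calculations below in code. Here is the table above as plain Python lists (the variable names are my own, chosen for this walkthrough):

```python
# Square footage (independent variable) for the five known houses
sqft = [1000, 1200, 1500, 1800, 2200]

# Sale price in dollars (dependent variable) for the same houses
price = [100000, 150000, 200000, 250000, 300000]

# The new house whose price we want to predict
new_sqft = 1300
```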
Simple Linear Regression Formula
$$ \hat{y} = b_0 + b_1 x + e $$
We're going to break this down to understand how it works, then we'll compute our predicted house price ŷ.
- ŷ (y-hat) is the predicted house price (dependent variable)
- x is our independent variable, 1300 (sqft)
- b_{0} is the intercept
  - It is the value of ŷ when x = 0
- b_{1} is the slope
  - It tells us how much ŷ changes for every 1-unit change in x
- e is the error term; we'll cover e in another post, so for this one let's assume it's zero
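The formula maps directly to a one-line function. This is just a sketch of the prediction step, with the error term taken as zero as stated above:

```python
def predict(x, b0, b1):
    """Simple linear regression prediction: yhat = b0 + b1 * x (error term assumed zero)."""
    return b0 + b1 * x
```

We still need values for b_{0} and b_{1}; computing them is what the rest of the post is about.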
Calculating the slope
We know x is 1,300, but what about b_{0} and b_{1}? We'll need them to predict the house price (ŷ).
Let's start by looking closer at b_{1} :
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
I know, I know. It looks ugly and complicated. But, you'll see that it's quite simple as you read further.

x_{i} is the value of x for the ith observation (or house #), which in our case means:
- For the first house in the table, x is 1000 (x_{1})
- For the second house in the table, x is 1200 (x_{2})
- For the third house in the table, x is 1500 (x_{3})
- For the fourth house in the table, x is 1800 (x_{4})
- For the fifth house in the table, x is 2200 (x_{5})

x̄ (x-bar) is the average square footage of the known houses. It is calculated by dividing the sum of the square footages by the total number of houses.
$$ \frac{1000 + 1200 + 1500 + 1800 + 2200}{5} = 1540 $$
y_{i} is the value of y for the ith observation, which in our case means:
- For the first house in the table, y is 100000 (y_{1})
- For the second house in the table, y is 150000 (y_{2})
- For the third house in the table, y is 200000 (y_{3})
- For the fourth house in the table, y is 250000 (y_{4})
- For the fifth house in the table, y is 300000 (y_{5})

ȳ (y-bar) is the average price of the known houses. It is calculated by dividing the sum of the known prices by the total number of houses.
$$ \frac{100000 + 150000 + 200000 + 250000 + 300000}{5} = 200000 $$
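Both averages are one line each in Python. As a sanity check, this reproduces the 1540 and 200000 computed above:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

x_bar = sum(sqft) / len(sqft)    # average square footage
y_bar = sum(price) / len(price)  # average price
```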
We have all we need to replace and solve the formula. We'll start with the numerator:
$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$
$$ i_1 = (1000 - 1540) \times (100000 - 200000) = 54000000 $$
$$ i_2 = (1200 - 1540) \times (150000 - 200000) = 17000000 $$
$$ i_3 = (1500 - 1540) \times (200000 - 200000) = 0 $$
$$ i_4 = (1800 - 1540) \times (250000 - 200000) = 13000000 $$
$$ i_5 = (2200 - 1540) \times (300000 - 200000) = 66000000 $$
We then add them up:
$$ 54000000 + 17000000 + 0 + 13000000 + 66000000 = 150000000 $$
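The per-house products above can be computed and summed in one expression (assuming the same sqft/price lists and means defined earlier):

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Sum of (x_i - x̄)(y_i - ȳ) over the five houses
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
```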
Let's calculate the denominator:
$$ i_1 = (1000 - 1540)^2 = 291600 $$
$$ i_2 = (1200 - 1540)^2 = 115600 $$
$$ i_3 = (1500 - 1540)^2 = 1600 $$
$$ i_4 = (1800 - 1540)^2 = 67600 $$
$$ i_5 = (2200 - 1540)^2 = 435600 $$
We then add them up:
$$ 291600 + 115600 + 1600 + 67600 + 435600 = 912000 $$
Perfect! Let's replace:
$$b_1 = \frac{150000000}{912000} \approx 164.5 $$
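Putting the numerator and denominator together gives the slope. Note that full precision yields roughly 164.47, which the hand calculation rounds to 164.5:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
denominator = sum((x - x_bar) ** 2 for x in sqft)

b1 = numerator / denominator  # slope, ≈ 164.47
```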
Calculating the y-intercept
Great, now let's solve for b_{0}
$$b_0 = \bar{y} - b_1\bar{x}$$
We previously computed b_{1} = 164.5, ȳ = 200000, and x̄ = 1540, so we can easily substitute:
$$b_0 = 200000 - 164.5 \times 1540 = -53330$$
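In code (the lists and intermediate values are repeated so the snippet runs on its own). Keeping full precision for the slope gives an intercept of about -53,289; rounding b_{1} to 164.5 first, as in the hand calculation, shifts it slightly:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) \
    / sum((x - x_bar) ** 2 for x in sqft)

b0 = y_bar - b1 * x_bar  # intercept, ≈ -53289.47
```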
Predicting the price
Now that we have calculated all the values, we can go ahead and apply the simple linear regression formula to predict our new house price:
$$ \hat{y} = b_0 + b_1 x + 0 $$
$$ price = -53330 + 164.5 \times 1300 = 160520 $$
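The whole calculation, end to end. Keeping full precision throughout gives a prediction of roughly 160,526; rounding the coefficients by hand, as above, shifts the result slightly:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Slope and intercept via the closed-form (least-squares) formulas
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) \
    / sum((x - x_bar) ** 2 for x in sqft)
b0 = y_bar - b1 * x_bar

predicted = b0 + b1 * 1300  # predicted price for a 1300 sqft house
```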
Let's see what our table looks like with the prediction, with the houses re-sorted by square footage (our new house is now #3):
| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| 1       | 1000                  | 100,000   |
| 2       | 1200                  | 150,000   |
| 3       | 1300                  | 160,520   |
| 4       | 1500                  | 200,000   |
| 5       | 1800                  | 250,000   |
| 6       | 2200                  | 300,000   |
For a new 1300 sqft house, and based on the data previously available, our simple linear regression model was able to predict a reasonable price of $160,520.
Conclusion
That's all there is to it. In this post, we explored the closed-form solution because it provides a straightforward way to find the best-fit line in linear regression. Keep in mind, however, that this method will not be applicable or efficient for non-linear problems or large datasets. In such cases, iterative and other methods are preferred as they can handle the complexity and computational challenges more effectively.
We looked at predicting the outcome given a single variable, however, in the real world you will most likely work with an extension of simple linear regression that predicts an outcome given more than a single variable.
It is called Multiple Linear Regression, and I discussed it in this post. If you'd like to see how Simple and Multiple Linear Regression can be implemented in Python, please check out this post.
Thanks for reading!