The main task in a Linear Regression problem is to find the coefficients which, in the 1-dimensional case, are able to create a line where the differences between the datapoints and the created line are as small as possible.
These differences are called "residuals" and they need to be minimized. This problem can be solved by using a procedure called Ordinary Least Squares (OLS), where the sum of all the squared residuals mentioned before is minimized.
In the case of $n$ datapoints, the datapoint $x_i$ is associated with the output $y_i$ and the line formula is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i$ is an error term which takes into account all the variations of the output which are not associated with the known input variables.
Mathematically, the OLS in 1-dimension can be written as:

$$\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$
Note that the $\beta_0$ and $\beta_1$ coefficients found are estimates, so we call them respectively $\hat{\beta}_0$ and $\hat{\beta}_1$; however, for the rest of this explanation they will simply be called $\beta_0$ and $\beta_1$ to simplify the notation.
In order to find these coefficients we can take the derivative of this quadratic function and set it equal to 0. This is a common way of finding the minima and maxima of a function.
For a quadratic function, like in this case, there is only one minimum or maximum point. Moreover, in this case the function is also convex, so setting the derivative to 0 leads to the minimum point.
Find $\beta_0$
Let's start with the intercept $\beta_0$. We can take the derivative with respect to $\beta_0$, remembering that the derivative of a squared function is 2 multiplied by the function and then multiplied by the derivative of the function (i.e. $\frac{d}{dx}\, f(x)^2 = 2\, f(x)\, f'(x)$).
The first step is to set the derivative to 0:

$$\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 = 0$$
Remembering that, for a datapoint $i$, the term within the brackets is $y_i - \beta_0 - \beta_1 x_i$, you can see that the derivative of the terms within the brackets with respect to $\beta_0$ is simply $-1$, because $y_i$ and $\beta_1 x_i$ do not have terms in $\beta_0$ and the derivative of $-\beta_0$ with respect to $\beta_0$ is simply $-1$.
Equation 3 shows that the derivative can be done for all the points and, given that they are not interacting with each other, all these derivatives can simply be summed together.
Also, given that we set this function equal to 0, we can remove the multiplicative factor $-2$ because it is not going to change the point where the function is 0.
Equation 4 shows that we can split the sum into three different terms. In particular, in the second term we can put $\beta_1$ outside the sum, given that it does not depend on $i$, and the third term is simply the $\beta_0$ coefficient summed $n$ times, so we can simply write $n\beta_0$.
Equation 5 shows that if we put the first 2 terms on the other side of the equation and divide by $n$, then we obtain an equation in $\beta_0$. In particular, $\beta_0$ is equal to the sum of all the $y_i$ values divided by $n$, which is the mean of the $y$ terms, minus $\beta_1$ multiplied by the mean of the $x$ terms.
We can calculate the mean of the $x$ terms and the mean of the $y$ terms from the data, so the only missing value is $\beta_1$.
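Written out with the notation defined above, the steps just described (a reconstruction of Equations 3, 4 and 5) are:

$$-2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \tag{3}$$

$$\sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - n \beta_0 = 0 \tag{4}$$

$$\beta_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \beta_1 \frac{\sum_{i=1}^{n} x_i}{n} \tag{5}$$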
Find $\beta_1$
To find $\beta_1$, we now take the derivative with respect to $\beta_1$ and set it to 0:

$$\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 = 0$$
In this case, the derivative of the terms within the brackets for a datapoint $i$ is $-x_i$, because $y_i$ and $\beta_0$ do not have terms in $\beta_1$, but the derivative of $-\beta_1 x_i$ with respect to $\beta_1$ is $-x_i$.
So, for each datapoint the full derivative is $-2 x_i \left( y_i - \beta_0 - \beta_1 x_i \right)$. Equation 7 is similar to the previous case: we can again do the sum of all the derivatives and remove the $-2$.
Equation 8 shows that we can multiply $x_i$ by all the terms within the brackets and then split the sums, similarly to what we did before. In this case, note that the second term is a squared term, $\sum_{i=1}^{n} x_i^2$, while the third term is not simply $n\beta_0$ anymore, given that $\beta_0$ is multiplied by $x_i$, which depends on $i$.
Equation 9 shows that we can replace the $\beta_0$ value using the equation found before, and now the equation contains only terms in $x_i$, $y_i$ and $\beta_1$.
In Equation 10 we move all the terms in $\beta_1$ to the other side of the equation.
Then, in Equation 11 we divide by all the terms that are not $\beta_1$, and we obtain a formula to find $\beta_1$ which only depends on the $x$ and $y$ data. This is what we are looking for, given that only by knowing the datapoints we can find the optimal $\beta_1$ value.
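Similarly, a reconstruction of Equations 7 to 11 with the same notation is:

$$\sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \tag{7}$$

$$\sum_{i=1}^{n} x_i y_i - \beta_1 \sum_{i=1}^{n} x_i^2 - \beta_0 \sum_{i=1}^{n} x_i = 0 \tag{8}$$

$$\sum_{i=1}^{n} x_i y_i - \beta_1 \sum_{i=1}^{n} x_i^2 - \left( \frac{\sum_{i=1}^{n} y_i}{n} - \beta_1 \frac{\sum_{i=1}^{n} x_i}{n} \right) \sum_{i=1}^{n} x_i = 0 \tag{9}$$

$$\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} = \beta_1 \left( \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right) \tag{10}$$

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}} \tag{11}$$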
The formulas are a bit long, but we can simplify them. In particular, we can substitute the mean values of $x$ and $y$ respectively with $\bar{x}$ and $\bar{y}$ (i.e. $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$), and this simplifies the two equations for $\beta_1$ and $\beta_0$.
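Using these means, a reconstruction of the two final formulas (Equations 12 and 13) is:

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \tag{12}$$

$$\beta_0 = \bar{y} - \beta_1 \bar{x} \tag{13}$$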
Equations 12 and 13 show the calculations that need to be performed, using the $x$ and $y$ datapoints, to obtain the optimal coefficient values for $\beta_1$ and $\beta_0$. Click the button below to see the implementation of these formulas in Python.
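As a reference, here is a minimal NumPy sketch of Equations 12 and 13 (the function and variable names are illustrative and not necessarily those used in the original implementation):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Estimate the intercept beta_0 and slope beta_1 using Equations 12 and 13."""
    n = len(x)
    x_mean = x.mean()
    y_mean = y.mean()
    # Equation 12: beta_1 = (sum(x_i * y_i) - n * x_mean * y_mean) / (sum(x_i^2) - n * x_mean^2)
    beta_1 = (np.sum(x * y) - n * x_mean * y_mean) / (np.sum(x**2) - n * x_mean**2)
    # Equation 13: beta_0 = y_mean - beta_1 * x_mean
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

# Example usage with synthetic data generated from y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)
beta_0, beta_1 = fit_simple_ols(x, y)
print(beta_0, beta_1)  # values close to 2 and 3
```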
Linear Regression Math in Matrix Form
The linear regression problem can also be defined for more than one independent variable, and in that case the equations are typically written in matrix form.
For a generic point $i$ with output $y_i$, there are $p$ different input variables $x_{i1}, x_{i2}, \dots, x_{ip}$, each of them associated with a specific coefficient $\beta_1, \dots, \beta_p$; then there is also the coefficient $\beta_0$, not associated with any variable, because this is the intercept. The equation for a generic data point is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$$
This can be written in matrix form, taking into account all data points, in this way:

$$y = X\beta + \epsilon$$
Where:
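Written out explicitly (a reconstruction based on the dimensions described in the rest of this section), the vectors and the matrix are:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$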
Given that there are $n$ data points, $y$ is a vector of $n$ by 1 dimensions, and also the error term $\epsilon$ is a vector of $n$ by 1 dimensions.
On the other hand, as previously mentioned, $\beta$ is a vector of $p + 1$ by 1 dimensions, where $p$ is the number of input variables and then we add 1 for the intercept.
For the $X$ data, this is a matrix of $n$ by $p + 1$ dimensions, given that there are $n$ data points, and for each data point there is a multiplication with the vector of coefficients $\beta$, which has $p + 1$ dimensions. In the $X$ matrix one column is all ones: that is the intercept column, which multiplies the $\beta_0$ coefficient, and this is a compact way to include the intercept in the input matrix.
The Ordinary Least Squares (OLS) problem defined in matrix form is similar to the 1-dimension case.
The equation below shows how the problem is formally posed: it is essentially the same equation used previously, where the sum of all the squared residuals is calculated and then minimized, but in this case the aim is to find a vector $\beta$ with all its coefficients.
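In the notation above, this objective can be written as (a reconstruction):

$$\min_{\beta} \left\lVert y - X\beta \right\rVert^2$$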
The two vertical bars at the left and right of the vector mean that the euclidean norm of the vector is calculated.
Given a vector $a = [a_1, a_2, \dots, a_n]$, the euclidean norm of $a$ is defined as $\lVert a \rVert = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2}$.
So, the squared euclidean norm $\lVert a \rVert^2$ can be written as the multiplication between the vector transpose and the original vector: $a^\top a$.
The euclidean norm has the square root, but the function that we are trying to minimize is squared, so we are simply considering the sum of the squared terms without the need of calculating the square root.
To find the optimal coefficients, Equation 19 shows that we can write the equation using the transpose, and then we take the derivative and set it equal to 0. In this case the derivative is with respect to a vector, and not single variables as previously shown.
Equation 20 focuses on performing the multiplications of the terms within the brackets, taking into account that the transpose of the product of two matrices (or of a matrix and a vector) is equal to the product of the transposes in reversed order (i.e. $(AB)^\top = B^\top A^\top$).
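A reconstruction of Equations 19 and 20 in this notation is:

$$\frac{\partial}{\partial \beta} \left[ \left( y - X\beta \right)^\top \left( y - X\beta \right) \right] = 0 \tag{19}$$

$$\frac{\partial}{\partial \beta} \left[ y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta \right] = 0 \tag{20}$$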
If you look at all these 4 terms and apply the matrix multiplications, you can see that dimensionally each of them is a scalar, so we are essentially obtaining a single value out of this OLS equation, and this value is the minimum error possible given the optimal coefficients that will be found. We can take the derivative of these 4 terms by remembering some of the derivative identities with respect to a vector.
Given 2 vectors $a$ and $b$, $\dfrac{\partial \left( a^\top b \right)}{\partial b} = a^\top$.
Given a matrix $A$ and a vector $x$, $\dfrac{\partial \left( x^\top A x \right)}{\partial x} = 2 x^\top A$ ($A$ must be symmetric).
Now let's use these identities:
The first term $y^\top y$ does not have $\beta$ terms, so its derivative is 0.
The second term is $-y^\top X\beta$. This can be seen as $a^\top b$ with $a^\top = -y^\top X$ and $b = \beta$; using the previous identity, in this case the result is $-y^\top X$.
The third term, $-\beta^\top X^\top y$, is similar to the second; in this case it can be seen as $b^\top a$, but since this term is a scalar (and a scalar is equal to its own transpose) the result is still the same: $-y^\top X$.
The fourth term is $\beta^\top X^\top X \beta$. As shown in the second identity, this derivative is equal to $2 x^\top A$ with a symmetric matrix $A$, which in this case is $X^\top X$, so the result is $2 \beta^\top X^\top X$.
Equation 21 shows that the sum of all these terms should be equal to 0:

$$0 - y^\top X - y^\top X + 2 \beta^\top X^\top X = 0 \tag{21}$$
Equation 22 shows that we can remove the first term and sum together the second and third terms, which are both negative, so that the equation has only 2 terms.
We can divide both terms by 2 and then move the first term to the other side of the equation, and this leads to Equation 23.
We can transpose both sides as you can see from Equation 24.
Remember that $X^\top X$ is symmetric, so the transpose of this matrix has no effect and returns again $X^\top X$. Also, as mentioned before, for two matrices $A$ and $B$, if the product gets transposed then $(AB)^\top = B^\top A^\top$. This means that in Equation 24 we can swap the $\beta^\top$ and $X^\top X$ terms: $\beta$ is no longer transposed, while $X^\top X$ remains the same. On the right-hand side of the equation we can swap $y^\top$ and $X$, and in this case they become respectively $y$ and $X^\top$. You can see this in Equation 25.
Now, remember that the inverse of the matrix $X^\top X$ (i.e. $\left( X^\top X \right)^{-1}$) multiplied by $X^\top X$ is equal to the identity matrix, which is composed only of ones on the diagonal (and zeros elsewhere). The identity matrix behaves like the number 1 in a single dimension: if you multiply a number by 1, you get the same number. In the same way, if an identity matrix multiplies a vector, you simply obtain the vector. So the last thing to do is to multiply both sides by $\left( X^\top X \right)^{-1}$. In this way the left-hand side is the $\beta$ vector, while the right-hand side is composed only of terms in $X$ and $y$, as you can see from Equation 26.
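A reconstruction of Equations 22 to 26 is:

$$-2 y^\top X + 2 \beta^\top X^\top X = 0 \tag{22}$$

$$\beta^\top X^\top X = y^\top X \tag{23}$$

$$\left( \beta^\top X^\top X \right)^\top = \left( y^\top X \right)^\top \tag{24}$$

$$X^\top X \beta = X^\top y \tag{25}$$

$$\beta = \left( X^\top X \right)^{-1} X^\top y \tag{26}$$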
This is a closed-form solution which you can apply to obtain an estimate of the coefficients. This method assumes that we can compute the inverse of the $X^\top X$ matrix; this is not always the case, but here we are assuming that we can. Click the button below to see the implementation of these formulas in Python.
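As a reference, here is a minimal NumPy sketch of Equation 26 (the names are illustrative; in practice `np.linalg.solve` is usually preferred over the explicit inverse for numerical stability):

```python
import numpy as np

def fit_ols_matrix(X, y):
    """Estimate the coefficient vector beta using Equation 26: beta = (X^T X)^-1 X^T y."""
    # Add the column of ones so that the intercept beta_0 is included in the X matrix.
    X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
    # Closed-form solution; it assumes X^T X is invertible.
    beta = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y
    return beta

# Example usage with synthetic data generated from y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 1, size=200)
print(fit_ols_matrix(X, y))  # values close to [1, 2, -3]
```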