The main task in a Linear Regression problem is to find the coefficients that, in the 1-dimensional case, define a line for which the differences between the data points and the line are as small as possible. These differences are called "residuals" and they need to be minimized. The problem can be solved with a procedure called Ordinary Least Squares (OLS), in which the sum of all the squared residuals is minimized. With $n$ data points, the $i^{th}$ data point $x_i$ is associated with the output $y_i$ and the line formula is $y_i=mx_i+b+\epsilon_i$, where $\epsilon_i$ is an error term that accounts for all the variation in the output not explained by the known input variables. Mathematically, OLS in 1 dimension can be written as: \begin{equation} \min_{\hat{m},\hat{b}} \sum_{i=1}^{n} (y_i-\hat{m}x_i - \hat{b})^2 \end{equation} Note that the $m$ and $b$ coefficients found in this way are estimates, so we call them $\hat{m}$ and $\hat{b}$ respectively; however, for the rest of this explanation they will simply be called $m$ and $b$ to simplify the notation. To find these coefficients we can take the derivative of this quadratic function and set it equal to 0. This is a common way of finding the minima and maxima of a function: a quadratic function has a single stationary point, and since this function is convex, that point is the minimum.
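As a small illustration of the quantity being minimized, the sketch below computes the residuals and the sum of squared residuals for a hand-picked candidate line; the data points and the choice of $m$ and $b$ are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data points and a candidate line y = m*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
m, b = 2.0, 0.0

residuals = y - (m * x + b)   # differences between the data and the line
rss = np.sum(residuals ** 2)  # the quantity OLS minimizes over m and b
print(rss)
```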
Let's start with the intercept $b$. We can take the derivative with respect to $b$, remembering that the derivative of a squared function $f(x)^2$ is 2 multiplied by the function and then multiplied by the derivative of the function (i.e. $2f(x)f'(x)$). The first step is to set the derivative to 0: \begin{equation} \frac{\partial }{\partial b}[\sum_{i=1}^{n} (y_i-mx_i - b)^2]=0 \end{equation} Remembering that, for a data point $i$, $\frac{\partial }{\partial b}(y_i-mx_i - b)^2=2(y_i-mx_i - b)(-1)$, you can see that the derivative of the term inside the parentheses is simply $-1$ because $y_i$ and $mx_i$ do not depend on $b$, while the derivative of $-b$ with respect to $b$ is $-1$.
Equation 3 shows that the derivative can be taken for all $n$ points and, since the derivative of a sum is the sum of the derivatives, all these derivatives can simply be summed together. Also, since we set this expression equal to 0, we can drop the multiplicative factor $-2$ because it does not change where the expression is 0. \begin{equation} \frac{\partial }{\partial b}[\sum_{i=1}^{n} (y_i-mx_i - b)^2]=-2\sum_{i=1}^{n}(y_i-mx_i - b)=0 \end{equation}
Equation 4 shows that we can split the sum into three separate terms. In particular, in the second term we can move $m$ outside the sum because it does not depend on $i$, and the third term is simply the coefficient $-b$ summed $n$ times, so we can write it as $-nb$. \begin{equation} \sum_{i=1}^{n}(y_i-mx_i - b)=\sum_{i=1}^{n}y_i-m\sum_{i=1}^{n}x_i - nb=0 \end{equation}
Equation 5 shows that if we move the first two terms to the other side of the equation and divide by $-n$, we obtain an expression for $b$. In particular, $b$ is equal to the sum of all the $y$ values divided by $n$, which is the mean of the $y$ values, minus $m$ multiplied by the mean of the $x$ values. \begin{equation} b=\frac{\sum_{i=1}^{n}y_i}{n}-m\frac{\sum_{i=1}^{n}x_i}{n} \end{equation} We can calculate the mean of the $x$ values and the $y$ values, so the only missing value is $m$.
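As a sanity check of the steps in equations 2 to 5, the sketch below uses SymPy (an assumption on my part, not part of the original derivation) to differentiate the sum of squared residuals with respect to $b$ for a small concrete $n$ and confirms that solving for $b$ gives the mean of $y$ minus $m$ times the mean of $x$.

```python
import sympy as sp

n = 5                          # a small concrete number of points for the symbolic check
m, b = sp.symbols('m b')
xs = sp.symbols('x1:6')        # x_1 .. x_5
ys = sp.symbols('y1:6')        # y_1 .. y_5

rss = sum((y - m * x - b) ** 2 for x, y in zip(xs, ys))
b_hat = sp.solve(sp.diff(rss, b), b)[0]     # set d(rss)/db = 0 and solve for b

# The difference with mean(y) - m*mean(x) simplifies to 0, confirming equation 5
print(sp.simplify(b_hat - (sum(ys) / n - m * sum(xs) / n)))
```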
To find $m$, we now take the derivative with respect to $m$ and set it to 0: \begin{equation} \frac{\partial }{\partial m}[\sum_{i=1}^{n} (y_i-mx_i - b)^2]=0 \end{equation} In this case, the derivative of the term inside the parentheses for a data point $i$ is $-x_i$, because $y_i$ and $b$ do not depend on $m$, while the derivative of $-mx_i$ with respect to $m$ is $-x_i$. So, $\frac{\partial }{\partial m}(y_i-mx_i - b)^2=2(y_i-mx_i - b)(-x_i)$. Equation 7 is similar to the previous case: we can again sum all the derivatives and drop the $-2$. \begin{equation} \frac{\partial }{\partial m}[\sum_{i=1}^{n} (y_i-mx_i - b)^2]=-2\sum_{i=1}^{n}(y_i-mx_i - b)x_i=0 \end{equation} Equation 8 shows that we can multiply $x_i$ by each term inside the parentheses and then split the sums as we did before. Note that in this case the second term contains a squared $x_i$, while the third term is no longer $-nb$ because $b$ is multiplied by $x_i$, which depends on $i$. \begin{equation} \sum_{i=1}^{n}(y_i-mx_i - b)x_i=\sum_{i=1}^{n}y_{i}x_i-m\sum_{i=1}^{n}x_i^2 - b\sum_{i=1}^{n}x_i=0 \end{equation} Equation 9 shows that we can substitute the expression for $b$ found before, so the equation now contains only terms in $m$, $x$ and $y$. \begin{equation} \sum_{i=1}^{n}y_{i}x_i-m\sum_{i=1}^{n}x_i^2 - (\frac{\sum_{i=1}^{n}y_i}{n}-m\frac{\sum_{i=1}^{n}x_i}{n}) \sum_{i=1}^{n}x_i=0 \end{equation} In equation 10 we move all the terms in $m$ to the other side of the equation. \begin{equation} \sum_{i=1}^{n}y_{i}x_i - (\frac{\sum_{i=1}^{n}y_i\sum_{i=1}^{n}x_i}{n})=m\sum_{i=1}^{n}x_i^2 -m\frac{(\sum_{i=1}^{n}x_i)^2}{n} \end{equation} Then, in equation 11 we divide by the terms multiplying $m$ and obtain a formula for $m$ that depends only on the $x$ and $y$ data. This is exactly what we are looking for: knowing only the data points, we can compute the optimal value of $m$. \begin{equation} m=\frac{\sum_{i=1}^{n}y_{i}x_i - (\frac{\sum_{i=1}^{n}y_i\sum_{i=1}^{n}x_i}{n})}{\sum_{i=1}^{n}x_i^2 -\frac{(\sum_{i=1}^{n}x_i)^2}{n}} \end{equation} The formulas are a bit long but we can simplify them. In particular, we can denote the mean values of $x$ and $y$ by $\bar{x}$ and $\bar{y}$ respectively (i.e. $\bar{x}=\frac{\sum_{i=1}^{n}x_i}{n}$ and $\bar{y}=\frac{\sum_{i=1}^{n}y_i}{n}$), which simplifies the two equations for $b$ and $m$. \begin{equation} m=\frac{\sum_{i=1}^{n}y_{i}x_i - \bar{x}\sum_{i=1}^{n}y_i}{\sum_{i=1}^{n}x_i^2 -\bar{x}\sum_{i=1}^{n}x_i}=\frac{\sum_{i=1}^{n}y_{i}(x_i-\bar{x})}{\sum_{i=1}^{n}x_i(x_i-\bar{x})} \end{equation} \begin{equation} b=\bar{y}-m\bar{x} \end{equation} Equations 12 and 13 show the calculations that need to be performed, using the $x$ and $y$ data points, to obtain the optimal values of $m$ and $b$. Click the button below to see the implementation of these formulas in Python.
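As a minimal NumPy sketch of how equations 12 and 13 translate to code (the function name `fit_line` and the synthetic data are my own, and the exact implementation behind the button may differ), it could look like this:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form OLS estimates of the slope m and intercept b (equations 12 and 13)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    m = np.sum(y * (x - x_bar)) / np.sum(x * (x - x_bar))  # equation 12
    b = y_bar - m * x_bar                                  # equation 13
    return m, b

# Synthetic data generated around y = 2x + 1 with some noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
print(fit_line(x, y))  # estimates close to (2, 1)
```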
The linear regression problem can also be defined for more than one independent variable, and in that case the equations are typically written in matrix form. For a generic data point with output $y_i$ there are $p$ different input variables, each associated with a specific $\beta$ coefficient, plus the coefficient $\beta_0$, which is not associated with any variable because it is the intercept. The equation for a generic data point $i$ is: \begin{equation} y_i=\beta_0+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip}+\epsilon_i \end{equation}
This can be written in matrix form, taking into account all $n$ data points, in this way: \begin{equation} \boldsymbol{y}=\boldsymbol{X\beta}+\boldsymbol{\epsilon} \end{equation} Where: \begin{equation} \boldsymbol{y}= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \boldsymbol{\beta}= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \boldsymbol{\epsilon}= \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \end{equation} \begin{equation} \boldsymbol{X}= \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \end{equation}
Given that there are $n$ data points, $\boldsymbol{y}$ is a vector of dimension $n$ by 1, and the error term $\boldsymbol{\epsilon}$ is also a vector of dimension $n$ by 1. On the other hand, $\boldsymbol{\beta}$, as previously mentioned, is a vector of dimension $p+1$ by 1, where $p$ is the number of input variables and the extra element is the intercept. The data matrix $\boldsymbol{X}$ has dimension $n$ by $p+1$: there are $n$ data points, and each row is multiplied by the coefficient vector $\boldsymbol{\beta}$, which has $p+1$ elements. One column of the matrix is all ones: this is the intercept column, which multiplies the $\beta_0$ coefficient, and it is a compact way to include the intercept in the input matrix.
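To illustrate these shapes, the sketch below builds the design matrix $\boldsymbol{X}$ for a hypothetical dataset with $n=4$ points and $p=2$ input variables, prepending the column of ones for the intercept (the numbers are made up and NumPy is assumed).

```python
import numpy as np

# Hypothetical dataset: n = 4 data points, p = 2 input variables
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([3.1, 2.9, 5.2, 7.8])

# Prepend a column of ones so that beta_0 (the intercept) is handled
# by the same matrix multiplication as the other coefficients
X = np.column_stack((np.ones(X_raw.shape[0]), X_raw))
print(X.shape, y.shape)  # (4, 3) and (4,): n rows, p + 1 columns
```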
Ordinary Least Squares (OLS) in matrix form is similar to the 1-dimensional case. The equation below shows how the problem is formally posed: it is essentially the same equation used previously, where the sum of all the squared residuals is calculated and then minimized, but in this case the aim is to find a vector $\boldsymbol{\beta}$ with all its $p+1$ coefficients. \begin{equation} \min_{\boldsymbol{\beta}}\|\boldsymbol{y-X\beta}\|^2 \end{equation} The two vertical bars at the left and right mean that the Euclidean norm of the vector is calculated. Given a vector $\boldsymbol{a}$, its Euclidean norm is defined as $\| \boldsymbol{a} \| = \sqrt{a_1^2+a_2^2+\cdots+a_p^2} = \sqrt{\boldsymbol{a^T a}} $. So, the squared Euclidean norm can be written as the product of the transpose of $\boldsymbol{a}$ and $\boldsymbol{a}$ itself. The Euclidean norm contains a square root, but the function we are minimizing is the squared norm, so we are simply summing the squared terms without needing to compute the square root.
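The snippet below (arbitrary random numbers, NumPy assumed) checks numerically that the squared Euclidean norm of the residual vector equals both the dot product $\boldsymbol{a^Ta}$ and the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack((np.ones(4), rng.normal(size=(4, 2))))  # 4 points, intercept + 2 inputs
y = rng.normal(size=4)
beta = rng.normal(size=3)                                   # an arbitrary coefficient vector

a = y - X @ beta                                            # residual vector
print(np.linalg.norm(a) ** 2, a @ a, np.sum(a ** 2))        # three ways to get the same value
```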
To find the optimal coefficients, Equation 19 shows that we can rewrite the squared norm using the transpose, then take the derivative and set it equal to 0. In this case the derivative is taken with respect to the vector $\boldsymbol{\beta}$, not with respect to single variables as before. \begin{equation} \frac{\partial}{\partial \boldsymbol{\beta}}\|\boldsymbol{y-X\beta}\|^2=\frac{\partial}{\partial \boldsymbol{\beta}}\boldsymbol{(y-X\beta)^T(y-X\beta)}=0 \end{equation} Equation 20 expands the product inside the parentheses, using the fact that the transpose of a product of two matrices (or of a matrix and a vector) is the product of the transposes in reverse order (i.e. $\boldsymbol{(AB)^T=B^TA^T}$). \begin{equation} \frac{\partial}{\partial \boldsymbol{\beta}}\boldsymbol{(y-X\beta)^T(y-X\beta)}=\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{y^Ty-\beta^TX^Ty-y^TX\beta+\beta^TX^TX\beta})=0 \end{equation} If you check the dimensions of these four terms, each of them is a scalar, so the OLS objective evaluates to a single value; at the optimal coefficients this value is the minimum possible error. We can differentiate these four terms by recalling some derivative identities with respect to a vector.
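As a quick numerical check of the expansion in equation 20, the sketch below (arbitrary random data, NumPy assumed) compares $\boldsymbol{(y-X\beta)^T(y-X\beta)}$ with the sum of the four expanded terms.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack((np.ones(5), rng.normal(size=(5, 2))))   # 5 points, intercept + 2 inputs
y = rng.normal(size=5)
beta = rng.normal(size=3)

lhs = (y - X @ beta) @ (y - X @ beta)                        # (y - X beta)^T (y - X beta)
rhs = y @ y - beta @ X.T @ y - y @ X @ beta + beta @ X.T @ X @ beta
print(np.isclose(lhs, rhs))                                  # True: the expansion holds
```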