Finds the optimal $\theta$ values for a linear regression problem.

Pros:

  1. Solves for $\theta$ in a single step (no iterations, unlike Gradient Descent)
  2. Does not require Feature Scaling
  3. Does not require a Learning Rate

Cons:

  1. Slow when there are many features: $X^TX$ is an $n \times n$ matrix, and inverting it costs roughly $O(n^3)$
  2. Doesn't extend to other algorithms (e.g. classification with logistic regression), which have no closed-form solution

<aside> 💡 Begin considering gradient descent at $n=10,000$ features.

</aside>

Equation

  1. Get an $n \times n$ matrix by multiplying the transpose of $X$ by $X$ itself
  2. Take the inverse of that matrix, then multiply by $X^T$ and by $y$

$$ \theta = (X^TX)^{-1}X^Ty $$
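
For context, this formula is the closed-form least-squares solution: setting the gradient of the squared-error cost $\|X\theta - y\|^2$ to zero gives

$$ \nabla_\theta \|X\theta - y\|^2 = 2X^TX\theta - 2X^Ty = 0 \quad\Rightarrow\quad X^TX\theta = X^Ty $$

which, when $X^TX$ is invertible, rearranges to the expression above.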

Python

import numpy as np
from numpy.linalg import pinv

def normalEquation(X, y):
    Xt = np.transpose(X)
    return pinv(Xt @ X) @ Xt @ y  # theta = (X^T X)^-1 X^T y
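
A minimal usage sketch (the tiny dataset below is made up purely for illustration; the column of ones in $X$ is the usual intercept term):

import numpy as np

# made-up data generated from y = 1 + 2x, with a bias column of ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = normalEquation(X, y)
print(theta)  # approximately [1. 2.]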

<aside> 💡 You can view a Jupyter Notebook using normalEquation here.

</aside>

MATLAB

function theta = normalEquation(X, y)
    % theta = (X'X)^-1 X'y, using the pseudo-inverse for numerical stability
    theta = pinv(X' * X) * X' * y;
end

<aside> 💡 You can view the code example for normalEquation with added comments here.

</aside>