Gradient descent reduces the error produced by a Cost Function in order to determine the best parameters for a hypothesis.

See also Normal Equation for faster, but less efficient at higher number of features.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d23c06d6-db3e-434a-b777-f4fcfa1f6b32/gradient_descent_graph.png

Equation

  1. Choose a good value for your $\theta$
  2. Choose a good Learning Rate
  3. If your input variable sizes range greatly, consider Feature Scaling
  4. Calculate the partial derivative of the Cost Function with your hypothesis (in this case we will choose the Linear Hypothesis
  5. Subtract the partial derivative $\times$ learning rate from your previous $\theta$ to find a new theta that causes the cost of the hypothesis to be lower

Repeat these steps for a set number of repetitions, or until the reduction between runs becomes negligible.

The equation before calculating the partial derivative:

$$ \text{Repeat} \left\{ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) \right\} $$

Once you calculate the partial derivative, the equation is:

$$ \text{Repeat} \left\{ \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right\} $$

These should be repeated simultaneously for all $\theta$. Keep in mind that $x_0$ is assumed to be equal to 1, so for $\theta_0$ you can remove $x_j^{(i)}$.

Vectorized

To run gradient descent in a vectorized way you first need to prep a matrix $X$ with a column for each training set. You will also need a vector $y$ with all of the output variables.

Given a matrix $X$ where each column is an item of a training set and each row contains the variables $x_0 \rightarrow x_n$ (where $x_0=1$), you can calculate the new $\theta$ values using the following formula (which uses the Linear Hypothesis).

$$ \theta = \theta - \frac{\alpha}{m} X^T(X\theta - y) $$

Or the following using the Logistic Hypothesis :

$$ \theta = \theta - \frac{\alpha}{m} X^T(g(X\theta) - y) $$

Python