Feature scaling increases the efficiency of Gradient Descent by making the contours more circular. This causes each "step" to take a more direct path towards the global minimum.

For example, if we have an $x_1$ = size (0-2000 feet$^2$) and $x_2$ = # of bedrooms (0-5):

$$ x_1 = \frac{\text{size (feet$^2$)}}{2000} \quad x_2 = \frac{\text{number of bedrooms}}{5} $$

Goal: get every feature to approximately a $-1 \leq x_i \leq 1$ range.

Andrew Ng's rule of thumb:

$-3 \leq x_i \leq 3$ is max

$-\frac{1}{3} \leq x_i \leq \frac{1}{3}$ is min

<aside> 💡 For a vectorized MATLAB example, see meanStandardNormalize.

</aside>