Gradient Descent Visualizer

LAB02: Machine Learning Foundations

Interactive 3D Loss Surface

This simulation visualizes how gradient descent navigates a loss surface to find optimal parameters.

Concept from LAB02

See Section 2.3: Optimization in the PDF book for the mathematical foundations.

The Visualization

Code
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create loss surface
def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

# Generate surface data
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = loss_function(X, Y)

# Gradient descent path
def gradient(x, y):
    dx = 2*x + 1.5*np.cos(3*x)
    dy = 2*y - 1.5*np.sin(3*y)
    return dx, dy

# Run gradient descent
path_x, path_y, path_z = [1.5], [1.5], [loss_function(1.5, 1.5)]
lr = 0.1
for _ in range(50):
    dx, dy = gradient(path_x[-1], path_y[-1])
    new_x = path_x[-1] - lr * dx
    new_y = path_y[-1] - lr * dy
    path_x.append(new_x)
    path_y.append(new_y)
    path_z.append(loss_function(new_x, new_y))

# Create figure
fig = plt.figure(figsize=(12, 5))

# 3D surface plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7, edgecolor='none')
ax1.plot(path_x, path_y, path_z, 'r.-', linewidth=2, markersize=8, label='GD path')
ax1.scatter([path_x[0]], [path_y[0]], [path_z[0]], color='green', s=100, label='Start')
ax1.scatter([path_x[-1]], [path_y[-1]], [path_z[-1]], color='red', s=100, label='End')
ax1.set_xlabel('Parameter 1')
ax1.set_ylabel('Parameter 2')
ax1.set_zlabel('Loss')
ax1.set_title('3D Loss Surface')
ax1.legend()

# Contour plot
ax2 = fig.add_subplot(122)
contour = ax2.contour(X, Y, Z, levels=20, cmap='viridis')
ax2.plot(path_x, path_y, 'r.-', linewidth=2, markersize=8)
ax2.scatter([path_x[0]], [path_y[0]], color='green', s=100, zorder=5, label='Start')
ax2.scatter([path_x[-1]], [path_y[-1]], color='red', s=100, zorder=5, label='End')
ax2.set_xlabel('Parameter 1')
ax2.set_ylabel('Parameter 2')
ax2.set_title('Contour View')
ax2.legend()
plt.colorbar(contour, ax=ax2, label='Loss')

plt.tight_layout()
plt.show()
Figure 22.1: Gradient descent on a 2D loss surface

Understanding the Visualization

Loss Surface

The colored surface represents the loss function \(L(\theta_1, \theta_2)\). Lower values (darker colors) indicate better parameter combinations.

Gradient Descent Path

The red dots show the path taken by gradient descent:

  1. Start (green): the initial parameters, here (1.5, 1.5)
  2. Steps: each update moves in the direction of steepest descent
  3. End (red): the final parameters after convergence

The Update Rule

At each step, parameters update according to:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

where:

  - \(\eta\) is the learning rate, which controls the step size
  - \(\nabla L\) is the gradient, the direction of steepest ascent; subtracting it moves the parameters downhill
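As a concrete check, a single update from the starting point (1.5, 1.5) used in Figure 22.1 can be computed directly. This is a minimal sketch that restates the loss and gradient from the code above so it runs on its own:

```python
import numpy as np

# Loss surface and its gradient, as defined in the visualization above
def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

# One application of theta_{t+1} = theta_t - eta * grad L(theta_t)
theta = np.array([1.5, 1.5])
eta = 0.1
g = np.array(gradient(*theta))
theta_next = theta - eta * g

print("gradient:       ", g)
print("new parameters: ", theta_next)
print("loss before/after:", loss_function(*theta), loss_function(*theta_next))
```

A single step already lowers the loss; the loop in the visualization simply repeats this update 50 times.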

Experiment: Learning Rate

Code
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
learning_rates = [0.01, 0.1, 0.5]
titles = ['Too Small (0.01)', 'Good (0.1)', 'Too Large (0.5)']

for ax, lr, title in zip(axes, learning_rates, titles):
    # Run gradient descent
    path_x, path_y = [1.5], [1.5]
    for _ in range(50):
        dx, dy = gradient(path_x[-1], path_y[-1])
        new_x = path_x[-1] - lr * dx
        new_y = path_y[-1] - lr * dy
        # Clip to prevent explosion
        new_x = np.clip(new_x, -3, 3)
        new_y = np.clip(new_y, -3, 3)
        path_x.append(new_x)
        path_y.append(new_y)

    # Plot
    ax.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.7)
    ax.plot(path_x, path_y, 'r.-', linewidth=1, markersize=4)
    ax.scatter([path_x[0]], [path_y[0]], color='green', s=100, zorder=5)
    ax.scatter([path_x[-1]], [path_y[-1]], color='red', s=100, zorder=5)
    ax.set_title(title)
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.set_xlim(-2.5, 2.5)
    ax.set_ylim(-2.5, 2.5)

plt.tight_layout()
plt.show()
Figure 22.2: Effect of different learning rates

Observations

Learning Rate       Behavior
Too small (0.01)    Slow convergence; may not reach the minimum
Good (0.1)          Smooth convergence to the minimum
Too large (0.5)     Oscillation; may diverge
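The same comparison can be made numerically, without plots. The sketch below reruns the loop from Figure 22.2 and reports only the final loss for each learning rate; the loss and gradient definitions are repeated so the snippet is self-contained:

```python
import numpy as np

def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

def run_gd(lr, steps=50, start=(1.5, 1.5)):
    """Run gradient descent and return the final loss."""
    x, y = start
    for _ in range(steps):
        dx, dy = gradient(x, y)
        # Clip to prevent explosion, as in the experiment above
        x = np.clip(x - lr * dx, -3, 3)
        y = np.clip(y - lr * dy, -3, 3)
    return loss_function(x, y)

for lr in [0.01, 0.1, 0.5]:
    print(f"lr={lr}: final loss = {run_gd(lr):.4f}")
```

With 50 steps, lr=0.1 reaches a noticeably lower loss than lr=0.01, matching the table above.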

Try It Yourself

Exercise
  1. Open the LAB02 notebook in Colab
  2. Modify the learning rate and observe convergence
  3. Try different starting points
  4. Compare SGD with Adam optimizer
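For step 4 of the exercise, a minimal from-scratch Adam loop on this same loss surface might look like the sketch below. This is not LAB02 code; the hyperparameter values are the commonly used defaults, and grad mirrors the gradient function defined earlier:

```python
import numpy as np

def grad(theta):
    x, y = theta
    return np.array([2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)])

def adam(theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    theta = np.array(theta0, dtype=float)
    m = np.zeros(2)  # first moment: running mean of gradients
    v = np.zeros(2)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print("Adam result:", adam([1.5, 1.5]))
```

Recording theta at each iteration and overlaying it on the contour plot makes for a direct visual comparison with the plain gradient descent path.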

Key Takeaways

  1. Gradient descent follows the direction of steepest descent
  2. Learning rate is crucial: too small = slow, too large = unstable
  3. Local minima can trap the optimizer (advanced optimizers help)
  4. Momentum helps escape saddle points and smooths convergence
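Takeaway 4 can be illustrated in a few lines: classical momentum keeps a running velocity, so steps in a consistent direction accumulate speed while back-and-forth oscillations partly cancel. A sketch on the same loss surface, with mu = 0.9 as a typical choice:

```python
import numpy as np

def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

def momentum_gd(start=(1.5, 1.5), lr=0.05, mu=0.9, steps=300):
    theta = np.array(start, dtype=float)
    v = np.zeros(2)  # running velocity
    for _ in range(steps):
        g = np.array(gradient(*theta))
        v = mu * v - lr * g   # accumulate downhill velocity, decay old motion
        theta = theta + v
    return theta

theta = momentum_gd()
print("final parameters:", theta, "loss:", loss_function(*theta))
```

Compared with plain gradient descent at the same learning rate, the velocity term lets the optimizer coast through shallow, flat stretches of the surface where raw gradients are small.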