Gradient Descent Visualizer

LAB02: Machine Learning Foundations

Interactive 3D Loss Surface

This simulation visualizes how gradient descent navigates a loss surface to find optimal parameters.

Concept from LAB02

See Section 2.3: Optimization in the PDF book for the mathematical foundations.

The Visualization

Code
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create loss surface
def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

# Generate surface data
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = loss_function(X, Y)

# Gradient descent path
def gradient(x, y):
    dx = 2*x + 1.5*np.cos(3*x)
    dy = 2*y - 1.5*np.sin(3*y)
    return dx, dy

# Run gradient descent
path_x, path_y, path_z = [1.5], [1.5], [loss_function(1.5, 1.5)]
lr = 0.1
for _ in range(50):
    dx, dy = gradient(path_x[-1], path_y[-1])
    new_x = path_x[-1] - lr * dx
    new_y = path_y[-1] - lr * dy
    path_x.append(new_x)
    path_y.append(new_y)
    path_z.append(loss_function(new_x, new_y))

# Create figure
fig = plt.figure(figsize=(12, 5))

# 3D surface plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7, edgecolor='none')
ax1.plot(path_x, path_y, path_z, 'r.-', linewidth=2, markersize=8, label='GD path')
ax1.scatter([path_x[0]], [path_y[0]], [path_z[0]], color='green', s=100, label='Start')
ax1.scatter([path_x[-1]], [path_y[-1]], [path_z[-1]], color='red', s=100, label='End')
ax1.set_xlabel('Parameter 1')
ax1.set_ylabel('Parameter 2')
ax1.set_zlabel('Loss')
ax1.set_title('3D Loss Surface')
ax1.legend()

# Contour plot
ax2 = fig.add_subplot(122)
contour = ax2.contour(X, Y, Z, levels=20, cmap='viridis')
ax2.plot(path_x, path_y, 'r.-', linewidth=2, markersize=8)
ax2.scatter([path_x[0]], [path_y[0]], color='green', s=100, zorder=5, label='Start')
ax2.scatter([path_x[-1]], [path_y[-1]], color='red', s=100, zorder=5, label='End')
ax2.set_xlabel('Parameter 1')
ax2.set_ylabel('Parameter 2')
ax2.set_title('Contour View')
ax2.legend()
plt.colorbar(contour, ax=ax2, label='Loss')

plt.tight_layout()
plt.show()
Figure 22.1: Gradient descent on a 2D loss surface

Understanding the Visualization

Loss Surface

The colored surface represents the loss function \(L(\theta_1, \theta_2)\). Lower values (darker colors) indicate better parameter combinations.

Gradient Descent Path

The red dots show the path taken by gradient descent:

  1. Start (green): the initial parameters, here (1.5, 1.5)
  2. Steps: each update moves in the direction of steepest descent
  3. End (red): the final parameters after convergence

The Update Rule

At each step, parameters update according to:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

where:

  - \(\eta\) is the learning rate, which controls the step size
  - \(\nabla L\) is the gradient, the direction of steepest ascent; subtracting it moves the parameters downhill
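As a concrete check, a single update from the starting point (1.5, 1.5) used in Figure 22.1 can be computed directly. This is a minimal sketch that restates the loss and gradient from the code above so it runs on its own:

```python
import numpy as np

# Loss surface and its gradient, as defined in the visualization above
def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

# One application of theta_{t+1} = theta_t - eta * grad L(theta_t)
theta = np.array([1.5, 1.5])
eta = 0.1
g = np.array(gradient(*theta))
theta_next = theta - eta * g

print("gradient:       ", g)
print("new parameters: ", theta_next)
print("loss before/after:", loss_function(*theta), loss_function(*theta_next))
```

A single step already lowers the loss; the loop in the visualization simply repeats this update 50 times.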

Experiment: Learning Rate

Code
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
learning_rates = [0.01, 0.1, 0.5]
titles = ['Too Small (0.01)', 'Good (0.1)', 'Too Large (0.5)']

for ax, lr, title in zip(axes, learning_rates, titles):
    # Run gradient descent
    path_x, path_y = [1.5], [1.5]
    for _ in range(50):
        dx, dy = gradient(path_x[-1], path_y[-1])
        new_x = path_x[-1] - lr * dx
        new_y = path_y[-1] - lr * dy
        # Clip to prevent explosion
        new_x = np.clip(new_x, -3, 3)
        new_y = np.clip(new_y, -3, 3)
        path_x.append(new_x)
        path_y.append(new_y)

    # Plot
    ax.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.7)
    ax.plot(path_x, path_y, 'r.-', linewidth=1, markersize=4)
    ax.scatter([path_x[0]], [path_y[0]], color='green', s=100, zorder=5)
    ax.scatter([path_x[-1]], [path_y[-1]], color='red', s=100, zorder=5)
    ax.set_title(title)
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.set_xlim(-2.5, 2.5)
    ax.set_ylim(-2.5, 2.5)

plt.tight_layout()
plt.show()
Figure 22.2: Effect of different learning rates

Observations

Learning Rate       Behavior
Too small (0.01)    Slow convergence; may not reach the minimum
Good (0.1)          Smooth convergence to the minimum
Too large (0.5)     Oscillation; may diverge
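The same comparison can be made numerically, without plots. The sketch below reruns the loop from Figure 22.2 and reports only the final loss for each learning rate; the loss and gradient definitions are repeated so the snippet is self-contained:

```python
import numpy as np

def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

def run_gd(lr, steps=50, start=(1.5, 1.5)):
    """Run gradient descent and return the final loss."""
    x, y = start
    for _ in range(steps):
        dx, dy = gradient(x, y)
        # Clip to prevent explosion, as in the experiment above
        x = np.clip(x - lr * dx, -3, 3)
        y = np.clip(y - lr * dy, -3, 3)
    return loss_function(x, y)

for lr in [0.01, 0.1, 0.5]:
    print(f"lr={lr}: final loss = {run_gd(lr):.4f}")
```

With 50 steps, lr=0.1 reaches a noticeably lower loss than lr=0.01, matching the table above.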

Try It Yourself

Exercise
  1. Open the LAB02 notebook in Colab
  2. Modify the learning rate and observe convergence
  3. Try different starting points
  4. Compare SGD with Adam optimizer
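For step 4 of the exercise, a minimal from-scratch Adam loop on this same loss surface might look like the sketch below. This is not LAB02 code; the hyperparameter values are the commonly used defaults, and grad mirrors the gradient function defined earlier:

```python
import numpy as np

def grad(theta):
    x, y = theta
    return np.array([2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)])

def adam(theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    theta = np.array(theta0, dtype=float)
    m = np.zeros(2)  # first moment: running mean of gradients
    v = np.zeros(2)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print("Adam result:", adam([1.5, 1.5]))
```

Recording theta at each iteration and overlaying it on the contour plot makes for a direct visual comparison with the plain gradient descent path.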

Key Takeaways

  1. Gradient descent follows the direction of steepest descent
  2. Learning rate is crucial: too small = slow, too large = unstable
  3. Local minima can trap the optimizer (advanced optimizers help)
  4. Momentum helps escape saddle points and smooths convergence
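Takeaway 4 can be illustrated in a few lines: classical momentum keeps a running velocity, so steps in a consistent direction accumulate speed while back-and-forth oscillations partly cancel. A sketch on the same loss surface, with mu = 0.9 as a typical choice:

```python
import numpy as np

def loss_function(x, y):
    return x**2 + y**2 + 0.5*np.sin(3*x) + 0.5*np.cos(3*y)

def gradient(x, y):
    return 2*x + 1.5*np.cos(3*x), 2*y - 1.5*np.sin(3*y)

def momentum_gd(start=(1.5, 1.5), lr=0.05, mu=0.9, steps=300):
    theta = np.array(start, dtype=float)
    v = np.zeros(2)  # running velocity
    for _ in range(steps):
        g = np.array(gradient(*theta))
        v = mu * v - lr * g   # accumulate downhill velocity, decay old motion
        theta = theta + v
    return theta

theta = momentum_gd()
print("final parameters:", theta, "loss:", loss_function(*theta))
```

Compared with plain gradient descent at the same learning rate, the velocity term lets the optimizer coast through shallow, flat stretches of the surface where raw gradients are small.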