LAB02: Machine Learning Foundations

Neural Networks and Training Basics

PDF Textbook Reference

For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 2: Neural Network Training Fundamentals in the PDF textbook.

The PDF chapter includes:

  • Complete mathematical derivations of backpropagation
  • Detailed loss function formulations and proofs
  • In-depth coverage of gradient descent variants
  • Comprehensive CNN architecture theory
  • Extended theoretical examples and convergence analysis

Open In Colab

Download Notebook

Learning Objectives

By the end of this lab you will be able to:

  • Define and compute common loss functions for regression and classification
  • Implement gradient descent and understand how learning rate affects convergence
  • Build and train simple neural networks and CNNs in TensorFlow/Keras
  • Interpret training/validation curves and diagnose under/overfitting

Theory Summary

How Neural Networks Learn

Neural networks learn through an iterative optimization process that adjusts their parameters - from a handful in toy models to millions in deep networks - to minimize prediction errors. Understanding this process is essential for diagnosing training problems and building effective edge ML models.

Loss Functions - Measuring Mistakes: A loss function quantifies how wrong your model’s predictions are. For regression tasks (predicting numbers), we use Mean Squared Error (MSE), which penalizes large errors more heavily. For classification (predicting categories), we use Cross-Entropy Loss, which measures the difference between predicted probabilities and true labels. Lower loss means the model’s predictions are closer to the training targets.
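A quick sketch of both losses on toy arrays (values are illustrative, not from any real dataset):

```python
import numpy as np

# Regression: MSE between true and predicted values
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.0])
mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0.25 + 0) / 3

# Classification: cross-entropy between one-hot labels
# and predicted probabilities (e.g. softmax outputs)
labels = np.array([[1, 0], [0, 1]])
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
ce = -np.mean(np.sum(labels * np.log(probs), axis=1))

print(f"MSE: {mse:.4f}")            # 0.1667
print(f"Cross-entropy: {ce:.4f}")   # ≈ 0.1643
```

Note that cross-entropy only looks at the probability assigned to the correct class, while MSE compares every predicted number to its target.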

Gradient Descent - The Optimization Engine: Gradient descent is an algorithm that automatically finds parameter values that minimize loss. It works by computing the gradient (derivative) of the loss with respect to each parameter, then taking small steps in the opposite direction. The learning rate controls step size - too large causes oscillation or divergence, too small causes slow convergence. Think of it like walking downhill in fog: you can’t see the bottom, but you can feel the slope and take steps downward.

Neural Network Architecture: A neural network stacks multiple layers of simple operations (\(y = wx + b\)) with non-linear activation functions between them. Each layer learns increasingly abstract features: early layers detect edges and textures, middle layers combine them into shapes, and final layers recognize complete objects. This hierarchical feature learning is what makes neural networks powerful.
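The "stacked layers of simple operations" idea can be sketched in a few lines of NumPy (layer sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))            # one sample with 4 input features

# Layer 1: linear map (y = wx + b) followed by a ReLU non-linearity
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
h = np.maximum(0, x @ W1 + b1)             # hidden representation, shape (1, 8)

# Layer 2: another linear map producing 2 outputs (e.g. class scores)
W2, b2 = rng.standard_normal((8, 2)) * 0.1, np.zeros(2)
logits = h @ W2 + b2

print(logits.shape)  # (1, 2)
```

Without the non-linearity between them, the two matrix multiplies would collapse into a single linear map - the activation function is what lets depth add expressive power.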

Key Concepts at a Glance

Core Concepts
  • Loss Function: Single number measuring prediction error (lower = better)
  • Gradient: Derivative showing direction to adjust parameters for improvement
  • Learning Rate: Step size for gradient descent updates (typically 0.001 - 0.1)
  • Epoch: One complete pass through the entire training dataset
  • Activation Functions: Non-linearities (ReLU, sigmoid, softmax) enabling complex patterns
  • Overfitting: Model memorizes training data but fails on new data
  • Regularization: Techniques (dropout, L2 penalty) to prevent overfitting
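As one concrete example of the regularization techniques above, inverted dropout can be sketched as follows (the function name and rate are illustrative; frameworks like Keras provide this as a built-in layer):

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero out a fraction `rate` of units during training,
    scaling survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations          # no-op at inference time
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

a = np.ones((2, 4))
print(dropout(a, rate=0.5, rng=np.random.default_rng(0)))  # entries are 0 or 2
```

Because different random subsets of neurons are active on each training step, no single neuron can dominate, which discourages memorization of the training set.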

Common Pitfalls

Mistakes to Avoid

Forgetting to Normalize Inputs: The most common training failure is feeding unnormalized data to neural networks. If pixel values are 0-255 instead of 0-1, gradients become 255× larger, causing training instability or divergence. Always normalize: x = x / 255.0 for images.
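Both common normalization recipes fit in a few lines (the arrays here are synthetic stand-ins for real data):

```python
import numpy as np

# Image-style data: raw pixel values in 0-255
images = np.random.default_rng(0).integers(0, 256, size=(10, 28, 28)).astype(np.float32)
images_norm = images / 255.0                     # now in [0, 1]

# Tabular data: standardize each feature to mean 0, std 1
X = np.random.default_rng(1).normal(loc=50, scale=10, size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(images_norm.min(), images_norm.max())      # within [0.0, 1.0]
print(X_std.mean(axis=0).round(6))               # ≈ [0, 0, 0]
```

For standardization, compute the mean and std on the training split only, then reuse those statistics for validation and test data.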

Using Wrong Loss for Task Type: Mean Squared Error (MSE) is for regression (predicting continuous values). Cross-Entropy is for classification (predicting categories). Using MSE for classification (or cross-entropy for regression) will usually train without raising any error, yet converge to poor results - a silent failure.

Learning Rate Too High or Too Low: Learning rate = 1.0 typically causes wild oscillation and divergence. Learning rate = 0.00001 makes training painfully slow (thousands of epochs). Start with 0.001 or 0.01 and adjust based on loss curves.

Not Splitting Train/Validation Data: Training and evaluating on the same data gives misleadingly high accuracy. Always hold out 10-20% of data for validation to detect overfitting. Use validation_split=0.2 in Keras or manually split your dataset.
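A manual 80/20 split takes only a few lines (synthetic data for illustration; `validation_split=0.2` in Keras does the holdout for you, but without shuffling):

```python
import numpy as np

X = np.arange(100).reshape(100, 1).astype(float)
y = (X[:, 0] > 50).astype(int)

# Shuffle indices, then hold out the last 20% for validation
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, val_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(len(X_train), len(X_val))  # 80 20
```

Shuffling before splitting matters when the data is ordered (e.g. sorted by class); a tail-of-file holdout on sorted data can contain only one class.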

Ignoring Training Curves: If validation loss increases while training loss decreases, you’re overfitting. If both losses are high, your model is underfitting. Always plot loss curves to diagnose issues early.

Quick Reference

Key Formulas

Mean Squared Error (Regression): \[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Cross-Entropy Loss (Classification): \[\text{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})\]

Gradient Descent Update Rule: \[w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}\] where \(\alpha\) is the learning rate

Parameter Count for Dense Layer: \[\text{params} = (\text{input\_size} + 1) \times \text{output\_size}\] The “+1” accounts for bias terms

Memory for Float32 Model: \[\text{Memory (bytes)} = \text{parameters} \times 4\]
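Putting the last two formulas together for a hypothetical dense layer with 128 inputs and 64 outputs:

```python
# Dense layer: params = (input_size + 1) * output_size  (the +1 is the bias)
input_size, output_size = 128, 64
params = (input_size + 1) * output_size   # 8256

# float32 stores 4 bytes per parameter; int8 quantization stores 1
float32_bytes = params * 4                # 33024 bytes (~32.25 KB)
int8_bytes = params * 1                   # 8256 bytes (~8.06 KB)

print(params, float32_bytes, int8_bytes)  # 8256 33024 8256
```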

Important Parameter Values

| Hyperparameter    | Typical Range | Notes                                            |
|-------------------|---------------|--------------------------------------------------|
| Learning Rate     | 0.001 - 0.1   | Start with 0.01, adjust by 10×                   |
| Batch Size        | 16 - 128      | Smaller = noisier gradients but less memory      |
| Epochs            | 5 - 100       | Stop when validation loss stops improving        |
| Hidden Layer Size | 16 - 512      | Larger = more capacity but slower                |
| Dropout Rate      | 0.2 - 0.5     | Prevents overfitting (0.5 = drop 50% of neurons) |

Common Activation Functions:

  • ReLU: max(0, x) - Default choice, fast, works well
  • Sigmoid: 1/(1+e^-x) - Output 0-1, used for binary classification
  • Softmax: Normalizes to probabilities - used for multi-class output
  • Tanh: tanh(x) - Output -1 to 1, centered around zero
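All four fit in a few lines of NumPy (a minimal sketch; frameworks ship optimized versions of these):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))                   # [0. 0. 2.]
print(sigmoid(x).round(3))       # sigmoid(0) = 0.5
print(softmax(x).round(3))       # entries sum to 1
print(np.tanh(x).round(3))       # symmetric around 0
```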

Interactive Learning Tools

Explore Visually

Before diving into code, build intuition with the interactive tools described under Level 2 below: TensorFlow Playground and the gradient descent visualizer.

Try It Yourself: Executable Python Examples

Run these interactive examples directly in your browser to build intuition before diving into the full notebook.

Example 1: Gradient Descent from Scratch

Code
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 3x + 2 + noise
np.random.seed(42)
X = np.linspace(0, 10, 100)
y_true = 3 * X + 2
y = y_true + np.random.randn(100) * 2

# Initialize parameters
w = 0.0  # weight
b = 0.0  # bias
learning_rate = 0.01
epochs = 100

# Track loss history
loss_history = []

# Gradient descent
for epoch in range(epochs):
    # Forward pass: predictions
    y_pred = w * X + b

    # Compute loss (MSE)
    loss = np.mean((y - y_pred) ** 2)
    loss_history.append(loss)

    # Compute gradients
    grad_w = -2 * np.mean((y - y_pred) * X)
    grad_b = -2 * np.mean(y - y_pred)

    # Update parameters
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Data and fitted line
ax1.scatter(X, y, alpha=0.5, label='Data')
ax1.plot(X, y_true, 'g--', label='True line (y=3x+2)', linewidth=2)
ax1.plot(X, w * X + b, 'r-', label=f'Fitted line (y={w:.2f}x+{b:.2f})', linewidth=2)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Gradient Descent Fit')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Loss over epochs
ax2.plot(loss_history, linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss (MSE)')
ax2.set_title('Training Loss Curve')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final parameters: w={w:.4f}, b={b:.4f}")
print(f"True parameters:  w=3.0000, b=2.0000")
print(f"Final loss: {loss_history[-1]:.4f}")
print(f"\nKey insight: Gradient descent found parameters close to the true values!")

Final parameters: w=3.1348, b=0.9415
True parameters:  w=3.0000, b=2.0000
Final loss: 3.3899

Key insight: Gradient descent found parameters close to the true values!

Example 2: Loss Function Visualization

Code
import numpy as np
import matplotlib.pyplot as plt

# Generate classification and regression examples
np.random.seed(42)

# Regression example
y_true_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_reg = np.array([1.2, 2.1, 2.8, 4.3, 4.9])

# Classification example (3 classes)
y_true_class = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
y_pred_class = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7],
                          [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])

# Compute losses
mse = np.mean((y_true_reg - y_pred_reg) ** 2)
mae = np.mean(np.abs(y_true_reg - y_pred_reg))

# Cross-entropy loss (clip to avoid log(0))
y_pred_clipped = np.clip(y_pred_class, 1e-7, 1 - 1e-7)
cross_entropy = -np.mean(np.sum(y_true_class * np.log(y_pred_clipped), axis=1))

# Visualize loss surfaces for simple case
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# MSE loss surface for y = wx (simple linear)
w_range = np.linspace(-2, 6, 100)
mse_surface = (2.0 - w_range) ** 2  # MSE for a single data point (x=1, y=2)

ax1.plot(w_range, mse_surface, linewidth=2)
ax1.axvline(x=2.0, color='r', linestyle='--', label='Optimal w=2.0')
ax1.set_xlabel('Weight (w)')
ax1.set_ylabel('MSE Loss')
ax1.set_title('Mean Squared Error Loss Surface')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Compare different loss magnitudes
predictions = np.linspace(0, 5, 100)
true_value = 3.0
mse_curve = (predictions - true_value) ** 2
mae_curve = np.abs(predictions - true_value)

ax2.plot(predictions, mse_curve, label='MSE', linewidth=2)
ax2.plot(predictions, mae_curve, label='MAE', linewidth=2, linestyle='--')
ax2.axvline(x=true_value, color='g', linestyle=':', alpha=0.5, label='True value')
ax2.set_xlabel('Predicted Value')
ax2.set_ylabel('Loss')
ax2.set_title('MSE vs MAE (true value = 3.0)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("LOSS FUNCTION COMPARISON")
print("="*50)
print(f"\nRegression Losses:")
print(f"  MSE (Mean Squared Error):     {mse:.4f}")
print(f"  MAE (Mean Absolute Error):    {mae:.4f}")
print(f"\nClassification Loss:")
print(f"  Cross-Entropy Loss:           {cross_entropy:.4f}")
print(f"\nKey insight: MSE penalizes large errors more heavily (squared term)!")

LOSS FUNCTION COMPARISON
==================================================

Regression Losses:
  MSE (Mean Squared Error):     0.0380
  MAE (Mean Absolute Error):    0.1800

Classification Loss:
  Cross-Entropy Loss:           0.3341

Key insight: MSE penalizes large errors more heavily (squared term)!

Example 3: Simple Neural Network Training Loop

Code
import numpy as np
import matplotlib.pyplot as plt

# Generate XOR-like nonlinear data
np.random.seed(42)
n_samples = 200

X = np.random.randn(n_samples, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # 1 when the features share a sign (XOR-style checkerboard)

# Simple 2-layer neural network
class SimpleNN:
    def __init__(self, input_size=2, hidden_size=8, output_size=2):
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros(output_size)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_derivative(self, x):
        return (x > 0).astype(float)

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.softmax(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]

        # One-hot encode y
        y_one_hot = np.zeros((m, 2))
        y_one_hot[np.arange(m), y] = 1

        # Output layer gradients
        dz2 = self.a2 - y_one_hot
        dW2 = (self.a1.T @ dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        # Hidden layer gradients
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_derivative(self.z1)
        dW1 = (X.T @ dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        # Update weights
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

# Train the network
model = SimpleNN()
epochs = 200
losses = []
accuracies = []

for epoch in range(epochs):
    # Forward pass
    predictions = model.forward(X)

    # Compute loss (cross-entropy)
    y_one_hot = np.zeros((len(y), 2))
    y_one_hot[np.arange(len(y)), y] = 1
    loss = -np.mean(np.sum(y_one_hot * np.log(predictions + 1e-8), axis=1))
    losses.append(loss)

    # Compute accuracy
    pred_labels = np.argmax(predictions, axis=1)
    accuracy = np.mean(pred_labels == y)
    accuracies.append(accuracy)

    # Backward pass
    model.backward(X, y, learning_rate=0.1)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Training curves
ax1.plot(losses, label='Loss', linewidth=2)
ax1_twin = ax1.twinx()
ax1_twin.plot(accuracies, 'g-', label='Accuracy', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss', color='b')
ax1_twin.set_ylabel('Accuracy', color='g')
ax1.set_title('Training Progress')
ax1.grid(True, alpha=0.3)

# Plot 2: Decision boundary
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
Z = np.argmax(Z, axis=1).reshape(xx.shape)

ax2.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=50)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.set_title(f'Decision Boundary (Accuracy: {accuracies[-1]:.2%})')

plt.tight_layout()
plt.show()

print(f"Final loss: {losses[-1]:.4f}")
print(f"Final accuracy: {accuracies[-1]:.2%}")
print(f"\nKey insight: Neural networks learn nonlinear decision boundaries!")

Final loss: 0.4738
Final accuracy: 88.50%

Key insight: Neural networks learn nonlinear decision boundaries!

Example 4: Learning Rate Comparison

Code
import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic loss surface: L(w) = (w - 5)^2
def loss_fn(w):
    return (w - 5) ** 2

def gradient_fn(w):
    return 2 * (w - 5)

# Test different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.1]
colors = ['blue', 'green', 'orange', 'red']
n_steps = 50

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, (lr, color) in enumerate(zip(learning_rates, colors)):
    w = 0.0  # Start far from optimum
    w_history = [w]
    loss_history = [loss_fn(w)]

    for step in range(n_steps):
        grad = gradient_fn(w)
        w = w - lr * grad
        w_history.append(w)
        loss_history.append(loss_fn(w))

        # Stop if diverging
        if abs(w) > 100:
            break

    # Plot loss trajectory
    ax = axes[idx]
    ax.plot(loss_history, linewidth=2, color=color)
    ax.set_xlabel('Step')
    ax.set_ylabel('Loss')
    ax.set_title(f'Learning Rate = {lr}')
    ax.grid(True, alpha=0.3)

    # Add status text
    if lr < 0.1:
        status = "TOO SLOW"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
    elif lr > 1.0:
        status = "DIVERGING!"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='red', alpha=0.5))
    else:
        status = "GOOD"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

plt.tight_layout()
plt.show()

print("LEARNING RATE COMPARISON")
print("="*60)
for lr in learning_rates:
    w = 0.0
    for step in range(50):
        w = w - lr * gradient_fn(w)
        if abs(w) > 100:
            print(f"LR={lr:4.2f}: DIVERGED at step {step}")
            break
    else:
        final_loss = loss_fn(w)
        print(f"LR={lr:4.2f}: Converged to w={w:.4f}, loss={final_loss:.4f}")

print("\nKey insights:")
print("  • LR=0.01: Too slow, needs many iterations")
print("  • LR=0.1:  Good balance, stable convergence")
print("  • LR=0.5:  Fast but may oscillate")
print("  • LR=1.1:  Too large, diverges!")

LEARNING RATE COMPARISON
============================================================
LR=0.01: Converged to w=3.1792, loss=3.3155
LR=0.10: Converged to w=4.9999, loss=0.0000
LR=0.50: Converged to w=5.0000, loss=0.0000
LR=1.10: DIVERGED at step 16

Key insights:
  • LR=0.01: Too slow, needs many iterations
  • LR=0.1:  Good balance, stable convergence
  • LR=0.5:  Fast but may oscillate
  • LR=1.1:  Too large, diverges!

Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

LAB02: Machine Learning Foundations for Edge

Open In Colab View on GitHub

Learning Objectives:

  • Understand neural network architecture fundamentals
  • Train simple models on synthetic datasets
  • Explore loss functions and gradient descent
  • Export models to TensorFlow Lite format

Three-Tier Approach:

  • Level 1 (This Notebook): Train and visualize models on laptop
  • Level 2 (Simulator): Use TensorFlow Playground for visualization
  • Level 3 (Device): Deploy quantized model to microcontroller

1. Setup

2. Understanding Neural Networks

A neural network consists of:

  • Input layer: Receives raw data features
  • Hidden layers: Transform features through weighted connections
  • Output layer: Produces predictions

Each connection has a weight that is learned during training.

3. Create a Toy Dataset

We’ll use a simple 2D classification problem that’s easy to visualize.

4. Build a Simple Neural Network

We’ll create a tiny network suitable for edge deployment:

  • 2 input features
  • 1 hidden layer with 8 neurons
  • 2 output classes

5. Understanding Loss Functions

The loss function measures how wrong our predictions are:

  • Cross-entropy loss: For classification problems
  • Mean squared error: For regression problems

Goal: Minimize the loss by adjusting weights.

6. Training: Gradient Descent in Action

Training process:

  1. Forward pass: Compute predictions
  2. Compute loss
  3. Backward pass: Compute gradients
  4. Update weights
  5. Repeat
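The five steps above map one-to-one onto a minimal loop (a sketch with a single weight and MSE loss; the data and learning rate are illustrative):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X                # target relationship: y = 2x
w, lr = 0.0, 0.1           # single weight, learning rate

for epoch in range(100):
    y_pred = w * X                          # 1. forward pass: compute predictions
    loss = np.mean((y - y_pred) ** 2)       # 2. compute loss (MSE)
    grad = -2 * np.mean((y - y_pred) * X)   # 3. backward pass: dL/dw
    w -= lr * grad                          # 4. update weight
                                            # 5. repeat

print(round(w, 3))  # ≈ 2.0
```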

7. Visualize Decision Boundary

8. Export to TensorFlow Lite

For edge deployment, we convert to TFLite format, which is:

  • Smaller (optimized for mobile/embedded)
  • Faster (quantized operations)
  • Compatible with microcontrollers

9. Test TFLite Model

10. Checkpoint Questions

  1. Why does the loss decrease during training?

  2. What happens if you increase the number of hidden neurons from 8 to 32?

    • How does training time change?
    • How does accuracy change?
    • How does model size change?
  3. What’s the purpose of the validation set?

  4. Why is the INT8 model smaller than the float32 model?

11. Next Steps

Level 2: Simulator

  • Use TensorFlow Playground to visualize neural networks
  • Experiment with different architectures and datasets

Level 3: Device

  • Deploy lab02_model_int8.tflite to Raspberry Pi
  • Convert to C array for Arduino deployment

See Chapter 2 in the textbook for detailed deployment instructions.

Three-Tier Activities

Run the embedded notebook above. Key exercises:

  1. Follow along with the code cells
  2. Modify parameters and observe results
  3. Complete the checkpoint questions

This lab focuses on foundational training concepts. For Level 2 we use interactive visual tools to deepen your intuition:

TensorFlow Playground – Visualize neural networks in real-time:

  • Experiment with different architectures (layers, neurons)
  • Watch gradient descent optimize the loss surface
  • Compare activation functions (ReLU, Tanh, Sigmoid)
  • Try the “Spiral” dataset to understand model capacity

Our Gradient Descent Visualizer – 3D loss surface exploration for different learning rates and initializations

Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

Question: Why use Cross-Entropy loss rather than MSE for classification?

Answer: Cross-Entropy loss is designed for classification tasks and measures the difference between predicted probability distributions and true labels. MSE treats class labels as numeric values (0, 1, 2…) which doesn’t make sense - the “distance” between class 0 and class 2 has no meaning. Cross-Entropy properly handles probability outputs from softmax and provides better gradients for classification. Using MSE for classification leads to poor convergence and lower accuracy.

Question: How many parameters does a dense layer with 128 inputs and 64 outputs have, and how much memory does it need?

Answer: Parameters = (input_size + 1) × output_size = (128 + 1) × 64 = 129 × 64 = 8,256 parameters. The “+1” accounts for bias terms (one bias per output neuron). In float32, this layer requires 8,256 × 4 = 33,024 bytes (32.25 KB). After int8 quantization, it reduces to 8,256 bytes (8.06 KB).

Question: Training loss keeps decreasing, but validation loss starts rising after epoch 10. What is happening, and how do you fix it?

Answer: This is overfitting - the model is memorizing training data instead of learning generalizable patterns. After epoch 10, it performs worse on unseen validation data. Solutions: (1) Stop training at epoch 10 (early stopping), (2) Add regularization (L2 penalty, dropout), (3) Increase training data size, (4) Reduce model complexity (fewer layers/neurons), or (5) Add data augmentation. Always monitor validation loss and stop when it stops improving.

Question: Your loss oscillates wildly or even grows during training. What is the most likely cause?

Answer: The learning rate is too high. Large steps cause gradient descent to overshoot the minimum and bounce around the loss surface wildly. The loss may even diverge to infinity. Solution: Reduce learning rate by 10× or 100×. Start with 0.01 or 0.001 and adjust based on loss curves. If loss decreases smoothly, the rate is good. If it plateaus quickly, try increasing slightly. A good learning rate shows steady decrease without oscillation.

Question: Why must inputs (e.g. image pixels) be normalized before training?

Answer: Neural networks expect inputs in a consistent, small range (typically 0-1 or -1 to 1). Raw pixel values (0-255) cause several problems: (1) Gradients become 255× larger, leading to training instability and divergence, (2) Initial random weights (typically -0.1 to 0.1) are completely wrong for 0-255 scale, (3) Optimization is much slower because the loss surface is poorly conditioned. Normalization ensures all input features have similar scales, enabling stable and efficient training.

On-Device Deployment

There is no dedicated on-device deployment in LAB02. Instead:

  • LAB03 introduces TFLite conversion and quantization for edge deployment
  • LAB05 shows how to take a trained model and deploy it to Arduino/MCUs

If you want an early challenge, you can:

  1. Train a small MNIST model in this lab’s notebook
  2. Follow LAB03 to convert and quantize it to .tflite
  3. Follow LAB05 to integrate that .tflite model into an MCU project

Visual Troubleshooting

Training Loss Not Decreasing

flowchart TD
    A[Loss not decreasing] --> B{Loss value?}
    B -->|NaN| C[Gradient explosion:<br/>Reduce learning rate 10x<br/>Add gradient clipping<br/>Check for NaN in data]
    B -->|Constant high| D{Learning rate?}
    D -->|Too small| E[Increase LR:<br/>Try 1e-3 for Adam<br/>Try 0.01 for SGD]
    D -->|Reasonable| F{Data normalized?}
    F -->|No| G[Normalize inputs:<br/>x = x / 255.0 images<br/>StandardScaler tabular<br/>Mean 0 std 1]
    F -->|Yes| H{Check labels}
    H -->|Wrong format| I[Fix labels:<br/>One-hot encode<br/>Balance classes<br/>Verify ground truth]

    style A fill:#ff6b6b
    style C fill:#4ecdc4
    style E fill:#4ecdc4
    style G fill:#4ecdc4
    style I fill:#4ecdc4

Overfitting Problems

flowchart TD
    A[Train acc high<br/>Val acc low] --> B{Gap size?}
    B -->|>20%| C[Severe overfitting]
    B -->|10-20%| D[Moderate]
    C --> E{Dataset size?}
    E -->|<100/class| F[Collect more data:<br/>Aim 500+ per class<br/>Critical for deep learning]
    E -->|Adequate| G{Using augmentation?}
    G -->|No| H[Add augmentation:<br/>Flips rotations<br/>Noise injection<br/>Time warping]
    G -->|Yes| I[Add regularization:<br/>L2 weight decay 1e-4<br/>Dropout 0.3-0.5<br/>Early stopping]
    D --> G

    style A fill:#ff6b6b
    style F fill:#4ecdc4
    style H fill:#4ecdc4
    style I fill:#4ecdc4

For complete troubleshooting flowcharts, see: