For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 2: Neural Network Training Fundamentals in the PDF textbook.
The PDF chapter includes:

- Complete mathematical derivations of backpropagation
- Detailed loss function formulations and proofs
- In-depth coverage of gradient descent variants
- Comprehensive CNN architecture theory
- Extended theoretical examples and convergence analysis
Learning Objectives
By the end of this lab you will be able to:
Define and compute common loss functions for regression and classification
Implement gradient descent and understand how learning rate affects convergence
Build and train simple neural networks and CNNs in TensorFlow/Keras
Interpret training/validation curves and diagnose under/overfitting
Theory Summary
How Neural Networks Learn
Neural networks learn through an iterative optimization process that adjusts millions of parameters to minimize prediction errors. Understanding this process is essential for diagnosing training problems and building effective edge ML models.
Loss Functions - Measuring Mistakes: A loss function quantifies how wrong your model’s predictions are. For regression tasks (predicting numbers), we use Mean Squared Error (MSE), which penalizes large errors more heavily. For classification (predicting categories), we use Cross-Entropy Loss, which measures the difference between predicted probabilities and true labels. For a given task and dataset, lower loss means better predictions.
Gradient Descent - The Optimization Engine: Gradient descent is an algorithm that automatically finds parameter values that minimize loss. It works by computing the gradient (derivative) of the loss with respect to each parameter, then taking small steps in the opposite direction. The learning rate controls step size - too large causes oscillation, too small causes slow convergence. Think of it like walking downhill in fog: you can’t see the bottom, but you can feel the slope and take steps downward.
Neural Network Architecture: A neural network stacks multiple layers of simple operations (\(y = wx + b\)) with non-linear activation functions between them. Each layer learns increasingly abstract features: early layers detect edges and textures, middle layers combine them into shapes, and final layers recognize complete objects. This hierarchical feature learning is what makes neural networks powerful.
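As a sketch of that stacking (hypothetical layer sizes, plain NumPy rather than a framework), two linear layers with a ReLU between them look like:

```python
import numpy as np

def relu(x):
    # Non-linear activation applied between layers
    return np.maximum(0, x)

# Hypothetical tiny network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    h = relu(x @ W1 + b1)  # first layer: y = wx + b, then activation
    return h @ W2 + b2     # output layer: another y = wx + b

print(forward(np.array([1.0, -0.5, 2.0])).shape)  # (2,)
```

Without the ReLU, the two layers would collapse into a single linear map; the non-linearity is what lets depth add expressive power.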
Key Concepts at a Glance
Core Concepts
Loss Function: Single number measuring prediction error (lower = better)
Gradient: Derivative showing direction to adjust parameters for improvement
Learning Rate: Step size for gradient descent updates (typically 0.001 - 0.1)
Epoch: One complete pass through the entire training dataset
Activation Functions: Non-linearities (ReLU, sigmoid, softmax) enabling complex patterns
Overfitting: Model memorizes training data but fails on new data
Regularization: Techniques (dropout, L2 penalty) to prevent overfitting
Common Pitfalls
Mistakes to Avoid
Forgetting to Normalize Inputs: The most common training failure is feeding unnormalized data to neural networks. If pixel values are 0-255 instead of 0-1, gradients become 255× larger, causing training instability or divergence. Always normalize: x = x / 255.0 for images.
Using Wrong Loss for Task Type: Mean Squared Error (MSE) is for regression (predicting continuous values). Cross-Entropy is for classification (predicting categories). Using MSE for classification or cross-entropy for regression will cause training to fail silently with poor results.
Learning Rate Too High or Too Low: Learning rate = 1.0 typically causes wild oscillation and divergence. Learning rate = 0.00001 makes training painfully slow (thousands of epochs). Start with 0.001 or 0.01 and adjust based on loss curves.
Not Splitting Train/Validation Data: Training and evaluating on the same data gives misleadingly high accuracy. Always hold out 10-20% of data for validation to detect overfitting. Use validation_split=0.2 in Keras or manually split your dataset.
Ignoring Training Curves: If validation loss increases while training loss decreases, you’re overfitting. If both losses are high, your model is underfitting. Always plot loss curves to diagnose issues early.
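The hold-out split mentioned above can also be done by hand; a minimal NumPy sketch (toy data, with the common 80/20 ratio as an example):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))   # toy features
y = (X[:, 0] > 0).astype(int)   # toy labels

# Shuffle indices, then hold out the last 20% for validation
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, val_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(len(X_train), len(X_val))  # 80 20
```

Evaluate the model only on `X_val` / `y_val`; an upward-drifting validation loss is the overfitting signal described above.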
Quick Reference
Key Formulas
Mean Squared Error (Regression):

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Cross-Entropy Loss (Classification):

\[\text{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})\]

Gradient Descent Update Rule:

\[w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}\]

where \(\alpha\) is the learning rate.

Parameter Count for Dense Layer:

\[\text{params} = (\text{input\_size} + 1) \times \text{output\_size}\]

The “+1” accounts for bias terms.

Memory for Float32 Model:

\[\text{Memory (bytes)} = \text{parameters} \times 4\]
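The last two formulas can be checked with a couple of helper functions (the layer sizes here are arbitrary examples):

```python
def dense_params(input_size: int, output_size: int) -> int:
    # One weight per input-output pair, plus one bias per output neuron
    return (input_size + 1) * output_size

def float32_bytes(params: int) -> int:
    # Each float32 parameter occupies 4 bytes
    return params * 4

p = dense_params(128, 64)
print(p, float32_bytes(p))  # 8256 33024
```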
Important Parameter Values
| Hyperparameter | Typical Range | Notes |
|----------------|---------------|-------|
| Learning Rate | 0.001 - 0.1 | Start with 0.01, adjust by 10× |
| Batch Size | 16 - 128 | Smaller = noisier gradients but less memory |
| Epochs | 5 - 100 | Stop when validation loss stops improving |
| Hidden Layer Size | 16 - 512 | Larger = more capacity but slower |
| Dropout Rate | 0.2 - 0.5 | Prevents overfitting (0.5 = drop 50% neurons) |
Common Activation Functions:

- ReLU: max(0, x) - Default choice, fast, works well
- Sigmoid: 1/(1+e^-x) - Outputs 0-1, used for binary classification
- Softmax: Normalizes to probabilities - used for multi-class output
- Tanh: tanh(x) - Outputs -1 to 1, centered around zero
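These can be written directly in NumPy (a sketch for intuition; frameworks ship optimized, batched versions):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # [0. 0. 2.]
print(sigmoid(0.0)) # 0.5
print(softmax(x))   # sums to 1
print(np.tanh(x))   # values in (-1, 1)
```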
Links to PDF Sections
For deeper understanding, see these sections in Chapter 2 PDF:
Section 2.1: Exploring Loss Functions (pages 21-24)
Section 2.2: Gradient Descent Implementation (pages 25-29)
Section 2.3: Building Neural Networks (pages 30-34)
Section 2.4: Classification with Softmax (pages 35-38)
Exercises: Practice problems with solutions (pages 39-40)
Try It Yourself: Executable Python Examples
Run these interactive examples directly in your browser to build intuition before diving into the full notebook.
Example 1: Gradient Descent from Scratch
```{python}
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 3x + 2 + noise
np.random.seed(42)
X = np.linspace(0, 10, 100)
y_true = 3 * X + 2
y = y_true + np.random.randn(100) * 2

# Initialize parameters
w = 0.0  # weight
b = 0.0  # bias
learning_rate = 0.01
epochs = 100

# Track loss history
loss_history = []

# Gradient descent
for epoch in range(epochs):
    # Forward pass: predictions
    y_pred = w * X + b

    # Compute loss (MSE)
    loss = np.mean((y - y_pred) ** 2)
    loss_history.append(loss)

    # Compute gradients
    grad_w = -2 * np.mean((y - y_pred) * X)
    grad_b = -2 * np.mean(y - y_pred)

    # Update parameters
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Data and fitted line
ax1.scatter(X, y, alpha=0.5, label='Data')
ax1.plot(X, y_true, 'g--', label='True line (y=3x+2)', linewidth=2)
ax1.plot(X, w * X + b, 'r-', label=f'Fitted line (y={w:.2f}x+{b:.2f})', linewidth=2)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Gradient Descent Fit')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Loss over epochs
ax2.plot(loss_history, linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss (MSE)')
ax2.set_title('Training Loss Curve')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final parameters: w={w:.4f}, b={b:.4f}")
print(f"True parameters: w=3.0000, b=2.0000")
print(f"Final loss: {loss_history[-1]:.4f}")
print(f"\nKey insight: Gradient descent found parameters close to the true values!")
```
Final parameters: w=3.1348, b=0.9415
True parameters: w=3.0000, b=2.0000
Final loss: 3.3899
Key insight: Gradient descent found parameters close to the true values!
Example 2: Loss Function Visualization
```{python}
import numpy as np
import matplotlib.pyplot as plt

# Generate classification and regression examples
np.random.seed(42)

# Regression example
y_true_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_reg = np.array([1.2, 2.1, 2.8, 4.3, 4.9])

# Classification example (3 classes)
y_true_class = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
y_pred_class = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7],
                         [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])

# Compute losses
mse = np.mean((y_true_reg - y_pred_reg) ** 2)
mae = np.mean(np.abs(y_true_reg - y_pred_reg))

# Cross-entropy loss (clip to avoid log(0))
y_pred_clipped = np.clip(y_pred_class, 1e-7, 1 - 1e-7)
cross_entropy = -np.mean(np.sum(y_true_class * np.log(y_pred_clipped), axis=1))

# Visualize loss surfaces for a simple case
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# MSE loss surface for y = wx (simple linear)
w_range = np.linspace(-2, 6, 100)
mse_surface = [np.mean((np.array([2.0]) - w * np.array([1.0])) ** 2) for w in w_range]
ax1.plot(w_range, mse_surface, linewidth=2)
ax1.axvline(x=2.0, color='r', linestyle='--', label='Optimal w=2.0')
ax1.set_xlabel('Weight (w)')
ax1.set_ylabel('MSE Loss')
ax1.set_title('Mean Squared Error Loss Surface')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Compare different loss magnitudes
predictions = np.linspace(0, 5, 100)
true_value = 3.0
mse_curve = (predictions - true_value) ** 2
mae_curve = np.abs(predictions - true_value)
ax2.plot(predictions, mse_curve, label='MSE', linewidth=2)
ax2.plot(predictions, mae_curve, label='MAE', linewidth=2, linestyle='--')
ax2.axvline(x=true_value, color='g', linestyle=':', alpha=0.5, label='True value')
ax2.set_xlabel('Predicted Value')
ax2.set_ylabel('Loss')
ax2.set_title('MSE vs MAE (true value = 3.0)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("LOSS FUNCTION COMPARISON")
print("=" * 50)
print(f"\nRegression Losses:")
print(f"  MSE (Mean Squared Error): {mse:.4f}")
print(f"  MAE (Mean Absolute Error): {mae:.4f}")
print(f"\nClassification Loss:")
print(f"  Cross-Entropy Loss: {cross_entropy:.4f}")
print(f"\nKey insight: MSE penalizes large errors more heavily (squared term)!")
```
LOSS FUNCTION COMPARISON
==================================================
Regression Losses:
MSE (Mean Squared Error): 0.0380
MAE (Mean Absolute Error): 0.1800
Classification Loss:
Cross-Entropy Loss: 0.3341
Key insight: MSE penalizes large errors more heavily (squared term)!
Example 3: Simple Neural Network Training Loop

```{python}
import numpy as np
import matplotlib.pyplot as plt

# Generate XOR-like nonlinear data
np.random.seed(42)
n_samples = 200
X = np.random.randn(n_samples, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR pattern

# Simple 2-layer neural network
class SimpleNN:
    def __init__(self, input_size=2, hidden_size=8, output_size=2):
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros(output_size)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_derivative(self, x):
        return (x > 0).astype(float)

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.softmax(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]

        # One-hot encode y
        y_one_hot = np.zeros((m, 2))
        y_one_hot[np.arange(m), y] = 1

        # Output layer gradients
        dz2 = self.a2 - y_one_hot
        dW2 = (self.a1.T @ dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        # Hidden layer gradients
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_derivative(self.z1)
        dW1 = (X.T @ dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        # Update weights
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

# Train the network
model = SimpleNN()
epochs = 200
losses = []
accuracies = []

for epoch in range(epochs):
    # Forward pass
    predictions = model.forward(X)

    # Compute loss (cross-entropy)
    y_one_hot = np.zeros((len(y), 2))
    y_one_hot[np.arange(len(y)), y] = 1
    loss = -np.mean(np.sum(y_one_hot * np.log(predictions + 1e-8), axis=1))
    losses.append(loss)

    # Compute accuracy
    pred_labels = np.argmax(predictions, axis=1)
    accuracy = np.mean(pred_labels == y)
    accuracies.append(accuracy)

    # Backward pass
    model.backward(X, y, learning_rate=0.1)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Training curves
ax1.plot(losses, label='Loss', linewidth=2)
ax1_twin = ax1.twinx()
ax1_twin.plot(accuracies, 'g-', label='Accuracy', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss', color='b')
ax1_twin.set_ylabel('Accuracy', color='g')
ax1.set_title('Training Progress')
ax1.grid(True, alpha=0.3)

# Plot 2: Decision boundary
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
Z = np.argmax(Z, axis=1).reshape(xx.shape)
ax2.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=50)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.set_title(f'Decision Boundary (Accuracy: {accuracies[-1]:.2%})')

plt.tight_layout()
plt.show()

print(f"Final loss: {losses[-1]:.4f}")
print(f"Final accuracy: {accuracies[-1]:.2%}")
print(f"\nKey insight: Neural networks learn nonlinear decision boundaries!")
```

Final loss: 0.4738
Final accuracy: 88.50%
Key insight: Neural networks learn nonlinear decision boundaries!
Example 4: Learning Rate Comparison
```{python}
import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic loss surface: L(w) = (w - 5)^2
def loss_fn(w):
    return (w - 5) ** 2

def gradient_fn(w):
    return 2 * (w - 5)

# Test different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.1]
colors = ['blue', 'green', 'orange', 'red']
n_steps = 50

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, (lr, color) in enumerate(zip(learning_rates, colors)):
    w = 0.0  # Start far from optimum
    w_history = [w]
    loss_history = [loss_fn(w)]

    for step in range(n_steps):
        grad = gradient_fn(w)
        w = w - lr * grad
        w_history.append(w)
        loss_history.append(loss_fn(w))

        # Stop if diverging
        if abs(w) > 100:
            break

    # Plot loss trajectory
    ax = axes[idx]
    ax.plot(loss_history, linewidth=2, color=color)
    ax.set_xlabel('Step')
    ax.set_ylabel('Loss')
    ax.set_title(f'Learning Rate = {lr}')
    ax.grid(True, alpha=0.3)

    # Add status text
    if lr < 0.1:
        status = "TOO SLOW"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
    elif lr > 1.0:
        status = "DIVERGING!"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='red', alpha=0.5))
    else:
        status = "GOOD"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

plt.tight_layout()
plt.show()

print("LEARNING RATE COMPARISON")
print("=" * 60)
for lr in learning_rates:
    w = 0.0
    for step in range(50):
        w = w - lr * gradient_fn(w)
        if abs(w) > 100:
            print(f"LR={lr:4.2f}: DIVERGED at step {step}")
            break
    else:
        final_loss = loss_fn(w)
        print(f"LR={lr:4.2f}: Converged to w={w:.4f}, loss={final_loss:.4f}")

print("\nKey insights:")
print("  • LR=0.01: Too slow, needs many iterations")
print("  • LR=0.1: Good balance, stable convergence")
print("  • LR=0.5: Fast but may oscillate")
print("  • LR=1.1: Too large, diverges!")
```
LEARNING RATE COMPARISON
============================================================
LR=0.01: Converged to w=3.1792, loss=3.3155
LR=0.10: Converged to w=4.9999, loss=0.0000
LR=0.50: Converged to w=5.0000, loss=0.0000
LR=1.10: DIVERGED at step 16
Key insights:
• LR=0.01: Too slow, needs many iterations
• LR=0.1: Good balance, stable convergence
• LR=0.5: Fast but may oscillate
• LR=1.1: Too large, diverges!
Interactive Notebook
The notebook below contains runnable code for all Level 1 activities.
LAB02: Machine Learning Foundations for Edge
Learning Objectives:

- Understand neural network architecture fundamentals
- Train simple models on synthetic datasets
- Explore loss functions and gradient descent
- Export models to TensorFlow Lite format
Three-Tier Approach:

- Level 1 (This Notebook): Train and visualize models on laptop
- Level 2 (Simulator): Use TensorFlow Playground for visualization
- Level 3 (Device): Deploy quantized model to microcontroller
1. Setup
2. Understanding Neural Networks
A neural network consists of:

- Input layer: Receives raw data features
- Hidden layers: Transform features through weighted connections
- Output layer: Produces predictions
Each connection has a weight that is learned during training.
3. Create a Toy Dataset
We’ll use a simple 2D classification problem that’s easy to visualize.
4. Build a Simple Neural Network
We’ll create a tiny network suitable for edge deployment:

- 2 input features
- 1 hidden layer with 8 neurons
- 2 output classes
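Using the parameter-count formula from the Quick Reference, this network really is tiny:

```python
def dense_params(n_in: int, n_out: int) -> int:
    return (n_in + 1) * n_out  # the "+1" is the bias

# Hidden layer (2 -> 8) plus output layer (8 -> 2)
total = dense_params(2, 8) + dense_params(8, 2)
print(total)  # 42 parameters, i.e. 168 bytes in float32
```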
5. Understanding Loss Functions
The loss function measures how wrong our predictions are:

- Cross-entropy loss: For classification problems
- Mean squared error: For regression problems
Goal: Minimize the loss by adjusting weights.
6. Training: Gradient Descent in Action
Training process:

1. Forward pass: Compute predictions
2. Compute loss
3. Backward pass: Compute gradients
4. Update weights
5. Repeat
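For a single weight on a toy quadratic loss, the whole loop fits in a few lines (a sketch for intuition, not a cell from the notebook):

```python
# Toy loss L(w) = (w - 3)^2, so dL/dw = 2(w - 3); the minimum is at w = 3
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)  # steps 1-3: forward pass and gradient are analytic here
    w -= lr * grad      # step 4: update the weight
print(round(w, 4))  # 3.0
```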
7. Visualize Decision Boundary
8. Export to TensorFlow Lite
For edge deployment, we convert to TFLite format, which is:

- Smaller (optimized for mobile/embedded)
- Faster (quantized operations)
- Compatible with microcontrollers
9. Test TFLite Model
10. Checkpoint Questions
Why does the loss decrease during training?
What happens if you increase the number of hidden neurons from 8 to 32?
How does training time change?
How does accuracy change?
How does model size change?
What’s the purpose of the validation set?
Why is the INT8 model smaller than the float32 model?
Self-Assessment Checkpoints
Test your understanding before proceeding to the exercises.
Question 1: Why would you use Cross-Entropy loss instead of MSE for a 10-class image classification problem?
Answer: Cross-Entropy loss is designed for classification tasks and measures the difference between predicted probability distributions and true labels. MSE treats class labels as numeric values (0, 1, 2…) which doesn’t make sense - the “distance” between class 0 and class 2 has no meaning. Cross-Entropy properly handles probability outputs from softmax and provides better gradients for classification. Using MSE for classification leads to poor convergence and lower accuracy.
Question 2: Calculate the number of parameters in a Dense layer with 128 inputs and 64 outputs.
Answer: Parameters = (input_size + 1) × output_size = (128 + 1) × 64 = 129 × 64 = 8,256 parameters. The “+1” accounts for bias terms (one bias per output neuron). In float32, this layer requires 8,256 × 4 = 33,024 bytes (32.25 KB). After int8 quantization, it reduces to 8,256 bytes (8.06 KB).
Question 3: Your model’s training loss decreases steadily but validation loss starts increasing after epoch 10. What’s happening and what should you do?
Answer: This is overfitting - the model is memorizing training data instead of learning generalizable patterns. After epoch 10, it performs worse on unseen validation data. Solutions: (1) Stop training at epoch 10 (early stopping), (2) Add regularization (L2 penalty, dropout), (3) Increase training data size, (4) Reduce model complexity (fewer layers/neurons), or (5) Add data augmentation. Always monitor validation loss and stop when it stops improving.
Question 4: You set learning_rate=1.0 and the loss oscillates wildly between 0.1 and 100. What’s wrong?
Answer: The learning rate is too high. Large steps cause gradient descent to overshoot the minimum and bounce around the loss surface wildly. The loss may even diverge to infinity. Solution: Reduce learning rate by 10× or 100×. Start with 0.01 or 0.001 and adjust based on loss curves. If loss decreases smoothly, the rate is good. If it plateaus quickly, try increasing slightly. A good learning rate shows steady decrease without oscillation.
Question 5: Why must you normalize image inputs (pixel_values / 255.0) before training?
Answer: Neural networks expect inputs in a consistent, small range (typically 0-1 or -1 to 1). Raw pixel values (0-255) cause several problems: (1) Gradients become 255× larger, leading to training instability and divergence, (2) Initial random weights (typically -0.1 to 0.1) are completely wrong for 0-255 scale, (3) Optimization is much slower because the loss surface is poorly conditioned. Normalization ensures all input features have similar scales, enabling stable and efficient training.
There is no dedicated on-device deployment in LAB02. Instead:
LAB03 introduces TFLite conversion and quantization for edge deployment
LAB05 shows how to take a trained model and deploy it to Arduino/MCUs
If you want an early challenge, you can:
Train a small MNIST model in this lab’s notebook
Follow LAB03 to convert and quantize it to .tflite
Follow LAB05 to integrate that .tflite model into an MCU project
Visual Troubleshooting
Training Loss Not Decreasing
```{mermaid}
flowchart TD
    A[Loss not decreasing] --> B{Loss value?}
    B -->|NaN| C[Gradient explosion:<br/>Reduce learning rate 10x<br/>Add gradient clipping<br/>Check for NaN in data]
    B -->|Constant high| D{Learning rate?}
    D -->|Too small| E[Increase LR:<br/>Try 1e-3 for Adam<br/>Try 0.01 for SGD]
    D -->|Reasonable| F{Data normalized?}
    F -->|No| G[Normalize inputs:<br/>x = x / 255.0 images<br/>StandardScaler tabular<br/>Mean 0 std 1]
    F -->|Yes| H{Check labels}
    H -->|Wrong format| I[Fix labels:<br/>One-hot encode<br/>Balance classes<br/>Verify ground truth]
    style A fill:#ff6b6b
    style C fill:#4ecdc4
    style E fill:#4ecdc4
    style G fill:#4ecdc4
    style I fill:#4ecdc4
```
Overfitting Problems
```{mermaid}
flowchart TD
    A[Train acc high<br/>Val acc low] --> B{Gap size?}
    B -->|>20%| C[Severe overfitting]
    B -->|10-20%| D[Moderate]
    C --> E{Dataset size?}
    E -->|<100/class| F[Collect more data:<br/>Aim 500+ per class<br/>Critical for deep learning]
    E -->|Adequate| G{Using augmentation?}
    G -->|No| H[Add augmentation:<br/>Flips rotations<br/>Noise injection<br/>Time warping]
    G -->|Yes| I[Add regularization:<br/>L2 weight decay 1e-4<br/>Dropout 0.3-0.5<br/>Early stopping]
    D --> G
    style A fill:#ff6b6b
    style F fill:#4ecdc4
    style H fill:#4ecdc4
    style I fill:#4ecdc4
```
---title: "LAB02: Machine Learning Foundations"subtitle: "Neural Networks and Training Basics"---::: {.callout-note}## PDF Textbook ReferenceFor detailed theoretical foundations, mathematical proofs, and algorithm derivations, see **Chapter 2: Neural Network Training Fundamentals** in the [PDF textbook](../downloads/Edge-Analytics-Lab-Book-v1.0.0.pdf).The PDF chapter includes:- Complete mathematical derivations of backpropagation- Detailed loss function formulations and proofs- In-depth coverage of gradient descent variants- Comprehensive CNN architecture theory- Extended theoretical examples and convergence analysis:::[](https://colab.research.google.com/github/ngcharithperera/edge-analytics-lab-book/blob/main/notebooks/LAB02_ml_foundations.ipynb)[Download Notebook](https://raw.githubusercontent.com/ngcharithperera/edge-analytics-lab-book/main/notebooks/LAB02_ml_foundations.ipynb)## Learning ObjectivesBy the end of this lab you will be able to:- Define and compute common loss functions for regression and classification- Implement gradient descent and understand how learning rate affects convergence- Build and train simple neural networks and CNNs in TensorFlow/Keras- Interpret training/validation curves and diagnose under/overfitting## Theory Summary### How Neural Networks LearnNeural networks learn through an iterative optimization process that adjusts millions of parameters to minimize prediction errors. Understanding this process is essential for diagnosing training problems and building effective edge ML models.**Loss Functions - Measuring Mistakes:** A loss function quantifies how wrong your model's predictions are. For regression tasks (predicting numbers), we use Mean Squared Error (MSE) which penalizes large errors more heavily. For classification (predicting categories), we use Cross-Entropy Loss which measures the difference between predicted probabilities and true labels. 
Lower loss always means better predictions.**Gradient Descent - The Optimization Engine:** Gradient descent is an algorithm that automatically finds parameter values that minimize loss. It works by computing the gradient (derivative) of the loss with respect to each parameter, then taking small steps in the opposite direction. The learning rate controls step size - too large causes oscillation, too small causes slow convergence. Think of it like walking downhill in fog: you can't see the bottom, but you can feel the slope and take steps downward.**Neural Network Architecture:** A neural network stacks multiple layers of simple operations ($y = wx + b$) with non-linear activation functions between them. Each layer learns increasingly abstract features: early layers detect edges and textures, middle layers combine them into shapes, and final layers recognize complete objects. This hierarchical feature learning is what makes neural networks powerful.## Key Concepts at a Glance::: {.callout-note icon=false}## Core Concepts- **Loss Function**: Single number measuring prediction error (lower = better)- **Gradient**: Derivative showing direction to adjust parameters for improvement- **Learning Rate**: Step size for gradient descent updates (typically 0.001 - 0.1)- **Epoch**: One complete pass through the entire training dataset- **Activation Functions**: Non-linearities (ReLU, sigmoid, softmax) enabling complex patterns- **Overfitting**: Model memorizes training data but fails on new data- **Regularization**: Techniques (dropout, L2 penalty) to prevent overfitting:::## Common Pitfalls::: {.callout-warning}## Mistakes to Avoid**Forgetting to Normalize Inputs**: The most common training failure is feeding unnormalized data to neural networks. If pixel values are 0-255 instead of 0-1, gradients become 255× larger, causing training instability or divergence. 
Always normalize: `x = x / 255.0` for images.**Using Wrong Loss for Task Type**: Mean Squared Error (MSE) is for regression (predicting continuous values). Cross-Entropy is for classification (predicting categories). Using MSE for classification or cross-entropy for regression will cause training to fail silently with poor results.**Learning Rate Too High or Too Low**: Learning rate = 1.0 typically causes wild oscillation and divergence. Learning rate = 0.00001 makes training painfully slow (thousands of epochs). Start with 0.001 or 0.01 and adjust based on loss curves.**Not Splitting Train/Validation Data**: Training and evaluating on the same data gives misleadingly high accuracy. Always hold out 10-20% of data for validation to detect overfitting. Use `validation_split=0.2` in Keras or manually split your dataset.**Ignoring Training Curves**: If validation loss increases while training loss decreases, you're overfitting. If both losses are high, your model is underfitting. Always plot loss curves to diagnose issues early.:::## Quick Reference### Key Formulas**Mean Squared Error (Regression):**$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$**Cross-Entropy Loss (Classification):**$$\text{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$**Gradient Descent Update Rule:**$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$where $\alpha$ is the learning rate**Parameter Count for Dense Layer:**$$\text{params} = (\text{input\_size} + 1) \times \text{output\_size}$$The "+1" accounts for bias terms**Memory for Float32 Model:**$$\text{Memory (bytes)} = \text{parameters} \times 4$$### Important Parameter Values| Hyperparameter | Typical Range | Notes ||----------------|---------------|-------|| Learning Rate | 0.001 - 0.1 | Start with 0.01, adjust by 10× || Batch Size | 16 - 128 | Smaller = noisier gradients but less memory || Epochs | 5 - 100 | Stop when validation loss stops improving || Hidden Layer Size | 16 - 512 | 
Larger = more capacity but slower || Dropout Rate | 0.2 - 0.5 | Prevents overfitting (0.5 = drop 50% neurons) |**Common Activation Functions:**- **ReLU**: `max(0, x)` - Default choice, fast, works well- **Sigmoid**: `1/(1+e^-x)` - Output 0-1, used for binary classification- **Softmax**: Normalizes to probabilities - used for multi-class output- **Tanh**: `tanh(x)` - Output -1 to 1, centered around zero### Links to PDF SectionsFor deeper understanding, see these sections in [Chapter 2 PDF](../downloads/Edge-Analytics-Lab-Book-v1.0.0.pdf#page=20):- **Section 2.1**: Exploring Loss Functions (pages 21-24)- **Section 2.2**: Gradient Descent Implementation (pages 25-29)- **Section 2.3**: Building Neural Networks (pages 30-34)- **Section 2.4**: Classification with Softmax (pages 35-38)- **Exercises**: Practice problems with solutions (pages 39-40)### Interactive Learning Tools::: {.callout-tip}## Explore VisuallyBefore diving into code, build intuition with these interactive tools:- **[TensorFlow Playground](https://playground.tensorflow.org)** - Visualize how neural networks learn in real-time. 
Try the "spiral" dataset with different architectures!- **[Gradient Descent Visualizer](../simulations/gradient-descent.qmd)** - See how learning rate affects convergence on 3D loss surfaces- **[Our Loss Function Explorer](../simulations/loss-function-viz.qmd)** - Compare MSE vs Cross-Entropy interactively:::## Try It Yourself: Executable Python ExamplesRun these interactive examples directly in your browser to build intuition before diving into the full notebook.### Example 1: Gradient Descent from Scratch```{python}import numpy as npimport matplotlib.pyplot as plt# Generate synthetic data: y = 3x + 2 + noisenp.random.seed(42)X = np.linspace(0, 10, 100)y_true =3* X +2y = y_true + np.random.randn(100) *2# Initialize parametersw =0.0# weightb =0.0# biaslearning_rate =0.01epochs =100# Track loss historyloss_history = []# Gradient descentfor epoch inrange(epochs):# Forward pass: predictions y_pred = w * X + b# Compute loss (MSE) loss = np.mean((y - y_pred) **2) loss_history.append(loss)# Compute gradients grad_w =-2* np.mean((y - y_pred) * X) grad_b =-2* np.mean(y - y_pred)# Update parameters w = w - learning_rate * grad_w b = b - learning_rate * grad_b# Visualize resultsfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))# Plot 1: Data and fitted lineax1.scatter(X, y, alpha=0.5, label='Data')ax1.plot(X, y_true, 'g--', label='True line (y=3x+2)', linewidth=2)ax1.plot(X, w * X + b, 'r-', label=f'Fitted line (y={w:.2f}x+{b:.2f})', linewidth=2)ax1.set_xlabel('x')ax1.set_ylabel('y')ax1.set_title('Gradient Descent Fit')ax1.legend()ax1.grid(True, alpha=0.3)# Plot 2: Loss over epochsax2.plot(loss_history, linewidth=2)ax2.set_xlabel('Epoch')ax2.set_ylabel('Loss (MSE)')ax2.set_title('Training Loss Curve')ax2.grid(True, alpha=0.3)plt.tight_layout()plt.show()print(f"Final parameters: w={w:.4f}, b={b:.4f}")print(f"True parameters: w=3.0000, b=2.0000")print(f"Final loss: {loss_history[-1]:.4f}")print(f"\nKey insight: Gradient descent found parameters close to the true 
values!")```### Example 2: Loss Function Visualization```{python}import numpy as npimport matplotlib.pyplot as plt# Generate classification and regression examplesnp.random.seed(42)# Regression exampley_true_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0])y_pred_reg = np.array([1.2, 2.1, 2.8, 4.3, 4.9])# Classification example (3 classes)y_true_class = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])y_pred_class = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])# Compute lossesmse = np.mean((y_true_reg - y_pred_reg) **2)mae = np.mean(np.abs(y_true_reg - y_pred_reg))# Cross-entropy loss (clip to avoid log(0))y_pred_clipped = np.clip(y_pred_class, 1e-7, 1-1e-7)cross_entropy =-np.mean(np.sum(y_true_class * np.log(y_pred_clipped), axis=1))# Visualize loss surfaces for simple casefig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))# MSE loss surface for y = wx (simple linear)w_range = np.linspace(-2, 6, 100)mse_surface = [(np.mean(([2.0] - w * np.array([1.0])) **2)) for w in w_range]ax1.plot(w_range, mse_surface, linewidth=2)ax1.axvline(x=2.0, color='r', linestyle='--', label='Optimal w=2.0')ax1.set_xlabel('Weight (w)')ax1.set_ylabel('MSE Loss')ax1.set_title('Mean Squared Error Loss Surface')ax1.legend()ax1.grid(True, alpha=0.3)# Compare different loss magnitudespredictions = np.linspace(0, 5, 100)true_value =3.0mse_curve = (predictions - true_value) **2mae_curve = np.abs(predictions - true_value)ax2.plot(predictions, mse_curve, label='MSE', linewidth=2)ax2.plot(predictions, mae_curve, label='MAE', linewidth=2, linestyle='--')ax2.axvline(x=true_value, color='g', linestyle=':', alpha=0.5, label='True value')ax2.set_xlabel('Predicted Value')ax2.set_ylabel('Loss')ax2.set_title('MSE vs MAE (true value = 3.0)')ax2.legend()ax2.grid(True, alpha=0.3)plt.tight_layout()plt.show()print("LOSS FUNCTION COMPARISON")print("="*50)print(f"\nRegression Losses:")print(f" MSE (Mean Squared Error): {mse:.4f}")print(f" MAE (Mean 
Absolute Error): {mae:.4f}")print(f"\nClassification Loss:")print(f" Cross-Entropy Loss: {cross_entropy:.4f}")print(f"\nKey insight: MSE penalizes large errors more heavily (squared term)!")```### Example 3: Simple Neural Network Training Loop```{python}import numpy as npimport matplotlib.pyplot as plt# Generate XOR-like nonlinear datanp.random.seed(42)n_samples =200X = np.random.randn(n_samples, 2)y = (X[:, 0] * X[:, 1] >0).astype(int) # XOR pattern# Simple 2-layer neural networkclass SimpleNN:def__init__(self, input_size=2, hidden_size=8, output_size=2):self.W1 = np.random.randn(input_size, hidden_size) *0.1self.b1 = np.zeros(hidden_size)self.W2 = np.random.randn(hidden_size, output_size) *0.1self.b2 = np.zeros(output_size)def relu(self, x):return np.maximum(0, x)def relu_derivative(self, x):return (x >0).astype(float)def softmax(self, x): exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))return exp_x / np.sum(exp_x, axis=1, keepdims=True)def forward(self, X):self.z1 = X @self.W1 +self.b1self.a1 =self.relu(self.z1)self.z2 =self.a1 @self.W2 +self.b2self.a2 =self.softmax(self.z2)returnself.a2def backward(self, X, y, learning_rate=0.01): m = X.shape[0]# One-hot encode y y_one_hot = np.zeros((m, 2)) y_one_hot[np.arange(m), y] =1# Output layer gradients dz2 =self.a2 - y_one_hot dW2 = (self.a1.T @ dz2) / m db2 = np.sum(dz2, axis=0) / m# Hidden layer gradients da1 = dz2 @self.W2.T dz1 = da1 *self.relu_derivative(self.z1) dW1 = (X.T @ dz1) / m db1 = np.sum(dz1, axis=0) / m# Update weightsself.W1 -= learning_rate * dW1self.b1 -= learning_rate * db1self.W2 -= learning_rate * dW2self.b2 -= learning_rate * db2# Train the networkmodel = SimpleNN()epochs =200losses = []accuracies = []for epoch inrange(epochs):# Forward pass predictions = model.forward(X)# Compute loss (cross-entropy) y_one_hot = np.zeros((len(y), 2)) y_one_hot[np.arange(len(y)), y] =1 loss =-np.mean(np.sum(y_one_hot * np.log(predictions +1e-8), axis=1)) losses.append(loss)# Compute accuracy pred_labels = 
np.argmax(predictions, axis=1)
    accuracy = np.mean(pred_labels == y)
    accuracies.append(accuracy)

    # Backward pass
    model.backward(X, y, learning_rate=0.1)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Training curves
ax1.plot(losses, label='Loss', linewidth=2)
ax1_twin = ax1.twinx()
ax1_twin.plot(accuracies, 'g-', label='Accuracy', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss', color='b')
ax1_twin.set_ylabel('Accuracy', color='g')
ax1.set_title('Training Progress')
ax1.grid(True, alpha=0.3)

# Plot 2: Decision boundary
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
Z = np.argmax(Z, axis=1).reshape(xx.shape)
ax2.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=50)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.set_title(f'Decision Boundary (Accuracy: {accuracies[-1]:.2%})')

plt.tight_layout()
plt.show()

print(f"Final loss: {losses[-1]:.4f}")
print(f"Final accuracy: {accuracies[-1]:.2%}")
print("\nKey insight: Neural networks learn nonlinear decision boundaries!")
```

### Example 4: Learning Rate Comparison

```{python}
import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic loss surface: L(w) = (w - 5)^2
def loss_fn(w):
    return (w - 5) ** 2

def gradient_fn(w):
    return 2 * (w - 5)

# Test different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.1]
colors = ['blue', 'green', 'orange', 'red']
n_steps = 50

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, (lr, color) in enumerate(zip(learning_rates, colors)):
    w = 0.0  # Start far from optimum
    w_history = [w]
    loss_history = [loss_fn(w)]
    for step in range(n_steps):
        grad = gradient_fn(w)
        w = w - lr * grad
        w_history.append(w)
        loss_history.append(loss_fn(w))
        # Stop if diverging
        if abs(w) > 100:
            break

    # Plot loss trajectory
    ax = axes[idx]
    ax.plot(loss_history, linewidth=2, color=color)
    ax.set_xlabel('Step')
    ax.set_ylabel('Loss')
    ax.set_title(f'Learning Rate = {lr}')
    ax.grid(True, alpha=0.3)

    # Add status text
    if lr < 0.1:
        status = "TOO SLOW"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
    elif lr > 1.0:
        status = "DIVERGING!"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='red', alpha=0.5))
    else:
        status = "GOOD"
        ax.text(0.5, 0.9, status, transform=ax.transAxes,
                bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

plt.tight_layout()
plt.show()

print("LEARNING RATE COMPARISON")
print("=" * 60)
for lr in learning_rates:
    w = 0.0
    for step in range(50):
        w = w - lr * gradient_fn(w)
        if abs(w) > 100:
            print(f"LR={lr:4.2f}: DIVERGED at step {step}")
            break
    else:
        final_loss = loss_fn(w)
        print(f"LR={lr:4.2f}: Converged to w={w:.4f}, loss={final_loss:.4f}")

print("\nKey insights:")
print("  • LR=0.01: Too slow, needs many iterations")
print("  • LR=0.1: Good balance, stable convergence")
print("  • LR=0.5: Fast but may oscillate")
print("  • LR=1.1: Too large, diverges!")
```

## Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

{{< embed ../../notebooks/LAB02_ml_foundations.ipynb >}}

## Three-Tier Activities

::: {.panel-tabset}

### Level 1: Notebook

Run the embedded notebook above. Key exercises:

1. Follow along with the code cells
2. Modify parameters and observe the results
3. Complete the checkpoint questions

### Level 2: Simulator

This lab focuses on foundational training concepts.
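The update rule those tools animate, \(w \leftarrow w - \eta \, \nabla L(w)\), fits in a few lines of NumPy. The sketch below is not part of the lab notebook; the bowl-shaped loss \(L(w_1, w_2) = w_1^2 + 10\,w_2^2\) is an illustrative choice whose uneven curvature produces the zig-zag trajectories the visualizer displays:

```python
import numpy as np

# Ill-conditioned quadratic bowl: L(w) = w1^2 + 10 * w2^2, minimum at the origin.
# The curvature along w2 is 10x steeper than along w1.
def loss(w):
    return w[0] ** 2 + 10 * w[1] ** 2

def grad(w):
    return np.array([2 * w[0], 20 * w[1]])

def descend(w0, lr, n_steps=100):
    """Run plain gradient descent and return the visited points."""
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(n_steps):
        w = w - lr * grad(w)
        path.append(w.copy())
    return np.array(path)

path = descend(w0=[4.0, 2.0], lr=0.04, n_steps=100)
print(f"start loss: {loss(path[0]):.4f}, final loss: {loss(path[-1]):.6f}")
```

Because the curvature along \(w_2\) is 10× steeper, any learning rate above 0.1 diverges along that axis first, even though the \(w_1\) direction could tolerate a rate up to 1.0 — the same imbalance you can explore interactively below.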
For Level 2 we use interactive visual tools to deepen your intuition:

**[TensorFlow Playground](https://playground.tensorflow.org)** – Visualize neural networks in real time:

- Experiment with different architectures (layers, neurons)
- Watch gradient descent optimize the loss surface
- Compare activation functions (ReLU, Tanh, Sigmoid)
- Try the "Spiral" dataset to understand model capacity

**[Our Gradient Descent Visualizer](../simulations/gradient-descent.qmd)** – 3D loss-surface exploration for different learning rates and initializations

## Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

::: {.callout-note collapse="true" title="Question 1: Why would you use Cross-Entropy loss instead of MSE for a 10-class image classification problem?"}
**Answer:** Cross-Entropy loss is designed for classification tasks: it measures the difference between the predicted probability distribution and the true labels. MSE treats class labels as numeric values (0, 1, 2, ...), which doesn't make sense because the "distance" between class 0 and class 2 has no meaning. Cross-Entropy properly handles probability outputs from softmax and provides better gradients for classification. Using MSE for classification leads to poor convergence and lower accuracy.
:::

::: {.callout-note collapse="true" title="Question 2: Calculate the number of parameters in a Dense layer with 128 inputs and 64 outputs."}
**Answer:** Parameters = (input_size + 1) × output_size = (128 + 1) × 64 = 129 × 64 = 8,256 parameters. The "+1" accounts for the bias terms (one bias per output neuron). In float32, this layer requires 8,256 × 4 = 33,024 bytes (32.25 KB). After int8 quantization, it reduces to 8,256 bytes (8.06 KB).
:::

::: {.callout-note collapse="true" title="Question 3: Your model's training loss decreases steadily but validation loss starts increasing after epoch 10. What's happening and what should you do?"}
**Answer:** This is overfitting: the model is memorizing the training data instead of learning generalizable patterns, so after epoch 10 it performs worse on unseen validation data. Solutions: (1) stop training at epoch 10 (early stopping), (2) add regularization (L2 penalty, dropout), (3) increase the training data size, (4) reduce model complexity (fewer layers/neurons), or (5) add data augmentation. Always monitor validation loss and stop when it stops improving.
:::

::: {.callout-note collapse="true" title="Question 4: You set learning_rate=1.0 and the loss oscillates wildly between 0.1 and 100. What's wrong?"}
**Answer:** The learning rate is too high. Large steps cause gradient descent to overshoot the minimum and bounce around the loss surface; the loss may even diverge to infinity. Solution: reduce the learning rate by 10× or 100×. Start with 0.01 or 0.001 and adjust based on the loss curves: if the loss decreases smoothly, the rate is good; if it plateaus quickly, try increasing it slightly. A good learning rate shows a steady decrease without oscillation.
:::

::: {.callout-note collapse="true" title="Question 5: Why must you normalize image inputs (pixel_values / 255.0) before training?"}
**Answer:** Neural networks expect inputs in a consistent, small range (typically 0 to 1 or -1 to 1). Raw pixel values (0-255) cause several problems: (1) gradients become roughly 255× larger, leading to training instability and divergence, (2) typical initial random weights (roughly -0.1 to 0.1) are completely wrong for a 0-255 scale, and (3) optimization is much slower because the loss surface is poorly conditioned. Normalization ensures all input features have similar scales, enabling stable and efficient training.
:::

### Level 3: Device

There is no dedicated on-device deployment in LAB02.
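Part of the reason deployment is deferred is sheer model size. Extending the arithmetic from Question 2 above, the short sketch below (pure Python; the 784 → 128 → 64 → 10 layer sizes are illustrative, not a specific lab model) estimates a dense network's weight storage in float32 versus int8:

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron: (n_in + 1) * n_out."""
    return (n_in + 1) * n_out

# Illustrative MLP: 784 -> 128 -> 64 -> 10 (e.g., a small MNIST classifier)
layers = [(784, 128), (128, 64), (64, 10)]
total = sum(dense_params(n_in, n_out) for n_in, n_out in layers)

print(f"Total parameters: {total:,}")
print(f"float32 size: {total * 4 / 1024:.1f} KB")  # 4 bytes per parameter
print(f"int8 size:    {total * 1 / 1024:.1f} KB")  # 1 byte per parameter
```

At roughly 427 KB in float32, even this small network strains an MCU with a few hundred KB of flash; the int8 version (about 107 KB) is what quantization buys you.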
Instead:

- LAB03 introduces TFLite conversion and quantization for edge deployment
- LAB05 shows how to take a trained model and deploy it to Arduino/MCUs

If you want an early challenge, you can:

1. Train a small MNIST model in this lab’s notebook
2. Follow LAB03 to convert and quantize it to `.tflite`
3. Follow LAB05 to integrate that `.tflite` model into an MCU project

:::

## Visual Troubleshooting

### Training Loss Not Decreasing

```{mermaid}
flowchart TD
    A[Loss not decreasing] --> B{Loss value?}
    B -->|NaN| C[Gradient explosion:<br/>Reduce learning rate 10x<br/>Add gradient clipping<br/>Check for NaN in data]
    B -->|Constant high| D{Learning rate?}
    D -->|Too small| E[Increase LR:<br/>Try 1e-3 for Adam<br/>Try 0.01 for SGD]
    D -->|Reasonable| F{Data normalized?}
    F -->|No| G[Normalize inputs:<br/>x = x / 255.0 images<br/>StandardScaler tabular<br/>Mean 0 std 1]
    F -->|Yes| H{Check labels}
    H -->|Wrong format| I[Fix labels:<br/>One-hot encode<br/>Balance classes<br/>Verify ground truth]

    style A fill:#ff6b6b
    style C fill:#4ecdc4
    style E fill:#4ecdc4
    style G fill:#4ecdc4
    style I fill:#4ecdc4
```

### Overfitting Problems

```{mermaid}
flowchart TD
    A[Train acc high<br/>Val acc low] --> B{Gap size?}
    B -->|>20%| C[Severe overfitting]
    B -->|10-20%| D[Moderate]
    C --> E{Dataset size?}
    E -->|<100/class| F[Collect more data:<br/>Aim 500+ per class<br/>Critical for deep learning]
    E -->|Adequate| G{Using augmentation?}
    G -->|No| H[Add augmentation:<br/>Flips rotations<br/>Noise injection<br/>Time warping]
    G -->|Yes| I[Add regularization:<br/>L2 weight decay 1e-4<br/>Dropout 0.3-0.5<br/>Early stopping]
    D --> G

    style A fill:#ff6b6b
    style F fill:#4ecdc4
    style H fill:#4ecdc4
    style I fill:#4ecdc4
```

For complete troubleshooting flowcharts, see:

- [Training Loss Not Decreasing](../troubleshooting/index.qmd#training-loss-not-decreasing)
- [Overfitting Detection and Solutions](../troubleshooting/index.qmd#overfitting-detection-and-solutions)
- [All Visual Troubleshooting Guides](../troubleshooting/index.qmd)

## Related Labs

::: {.callout-tip}
## Foundations Track

- **LAB01: Introduction** - Prerequisites for this lab
- **LAB03: Quantization** - Next step: optimize models for edge devices
- **LAB04: Keyword Spotting** - Apply ML foundations to audio classification
:::

::: {.callout-tip}
## Advanced ML

- **LAB07: CNNs & Vision** - Deep dive into convolutional neural networks
- **LAB14: Anomaly Detection** - Unsupervised learning techniques
:::

## Related Resources

- [Hardware Guide](../resources/hardware.qmd) - Equipment needed for Level 3
- [Troubleshooting](../resources/troubleshooting.qmd) - Common issues and solutions
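As a closing self-check, the early-stopping fix from Question 3 can be stated as a framework-free rule: remember the epoch with the best validation loss, and stop once no improvement has been seen for `patience` consecutive epochs. A minimal sketch (the loss curve below is synthetic, invented purely to exercise the rule):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the index of the best epoch, scanning until `patience`
    consecutive epochs pass with no new best validation loss."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Synthetic curve: improves until epoch 10, then rises (overfitting)
val = [1.0 - 0.08 * e for e in range(11)] + [0.2 + 0.03 * e for e in range(1, 8)]
print("stop at epoch", early_stop_epoch(val, patience=3))
```

In Keras, the equivalent behavior comes from `tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)`, as used in later labs' training scripts.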