LAB07: CNNs and Computer Vision

Convolutional Neural Networks for Images

PDF Textbook Reference

For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 7: Convolutional Neural Networks for Computer Vision in the PDF textbook.

The PDF chapter includes:

  • Complete mathematical derivation of convolution operations
  • Detailed analysis of receptive fields and feature hierarchies
  • In-depth coverage of pooling and normalization layers
  • Theoretical foundations of data augmentation
  • Comprehensive CNN architecture design principles

Open In Colab

Download Notebook

Learning Objectives

By the end of this lab you will be able to:

  • Explain how convolution and pooling operations extract features from images
  • Build and train simple CNNs in TensorFlow/Keras
  • Visualize feature maps and understand what CNN layers “see”
  • Use data augmentation to improve generalization
  • Assess CNN architectures for suitability on edge devices (size, FLOPs, latency)

Theory Summary

Convolutional Neural Networks (CNNs) revolutionize computer vision by solving three fundamental problems of dense neural networks: parameter explosion, loss of spatial relationships, and lack of translation invariance. Instead of connecting every pixel to every neuron, CNNs use small learned filters (typically 3×3) that slide across images, detecting patterns like edges, textures, and eventually complex features. This local connectivity dramatically reduces parameters—a 3×3 filter has only 9 weights regardless of image size.

The key innovation is parameter sharing: the same filter weights are reused across all image positions, making CNNs inherently translation-invariant. A filter that detects vertical edges in the top-left will also detect them anywhere else in the image. CNNs build hierarchical representations through stacked convolutional layers: early layers detect simple patterns (edges, gradients), middle layers combine these into shapes (circles, corners), and deep layers recognize complex objects (faces, cars, animals). This hierarchy emerges automatically during training—you never explicitly program edge detectors.

Pooling layers complement convolutions by reducing spatial dimensions while preserving important features. MaxPooling takes the maximum value in each region (typically 2×2), halving width and height while keeping the strongest activations. This provides translation invariance (small shifts don’t change output), dimension reduction (75% memory savings per pooling layer), and noise robustness (weak activations are discarded). For edge deployment, this compression is critical: a CNN with aggressive pooling can process images with far fewer parameters and memory than a dense network.
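The MaxPooling operation described above can be sketched in a few lines of NumPy. This is a simplified illustration of non-overlapping 2×2 pooling, not the TensorFlow implementation:

```python
import numpy as np

def maxpool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling: keep the strongest activation per window."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*pool_size:(i+1)*pool_size,
                                 j*pool_size:(j+1)*pool_size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
print(maxpool2d(fm))  # [[4. 2.] [2. 8.]]
```

Note that a one-pixel shift of the input often leaves the pooled output unchanged, which is exactly the small-shift invariance described above.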

Key Concepts at a Glance

Core CNN Principles
  • Convolution: Learned filters (kernels) slide across images computing weighted sums
  • Local Connectivity: Each neuron connects only to a small receptive field, not the entire image
  • Parameter Sharing: Same weights used across all spatial positions → translation invariance
  • Hierarchical Features: Early layers detect edges → middle layers detect shapes → deep layers detect objects
  • Pooling: Downsamples spatial dimensions (typically 2×2 MaxPooling) for efficiency and robustness
  • Feature Maps: Each convolutional filter produces one feature map detecting specific patterns
  • Edge Constraints: Model size, tensor arena, and inference latency determine deployability

Common Pitfalls

Mistakes to Avoid

Not Using Data Augmentation: Training on unaugmented data leads to overfitting. If your model sees only centered, well-lit images, it will fail on rotated, cropped, or shadowed inputs. Always use augmentation (rotation, flipping, zooming) for training—but never for validation!
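One common way to apply augmentation only during training in Keras is with preprocessing layers, which transform images in training mode and pass them through unchanged at inference. The specific transforms and factors below are illustrative choices, not values prescribed by this lab:

```python
import tensorflow as tf

# Augmentation layers are active only when called with training=True
# (as happens automatically inside model.fit); validation data is unaffected.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),  # factor is a fraction of 2*pi, ~±36 degrees
    tf.keras.layers.RandomZoom(0.1),
])

# Applied to a tf.data training pipeline (leave the validation pipeline untouched):
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```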

Ignoring Model Size for Edge Deployment: A 4.5 million parameter model requires ~17 MB in Float32 format—too large for most microcontrollers. Always check model size with model.summary() and convert to TFLite with int8 quantization for edge deployment.
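A minimal sketch of int8 TFLite conversion is shown below. The tiny model here is a stand-in for your trained network, and the random calibration samples would be replaced by real training images:

```python
import numpy as np
import tensorflow as tf

# Stand-in model (illustrative; substitute your trained Keras model).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Representative samples calibrate the int8 quantization ranges.
def representative_data():
    for _ in range(10):
        yield [np.random.rand(1, 28, 28, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

print(f"Int8 TFLite size: {len(tflite_model) / 1024:.1f} KB")
```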

Dense Layers Dominate Parameters: Most parameters in CNNs are in the final dense (fully-connected) layers, not the convolutional layers. Minimize dense layer size for edge deployment—or replace with global average pooling.
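A back-of-the-envelope comparison, using the 5×5×64 feature map from this lab's Fashion MNIST example, shows how much Global Average Pooling saves:

```python
# Classification-head parameters: Flatten vs GlobalAveragePooling,
# for the 5x5x64 feature map produced by this lab's Fashion MNIST CNN.
feat_h, feat_w, feat_c = 5, 5, 64
hidden, classes = 128, 10

# Flatten -> Dense(128) -> Dense(10)
flatten_head = (feat_h * feat_w * feat_c) * hidden + hidden + hidden * classes + classes
# GlobalAveragePooling -> Dense(10): one value per channel, no hidden layer
gap_head = feat_c * classes + classes

print(f"Flatten head: {flatten_head:,} params")  # 206,218
print(f"GAP head:     {gap_head:,} params")      # 650
```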

Wrong Input Shape: CNNs expect 4D input: (batch, height, width, channels). Grayscale images need reshaping to add the channel dimension: train_images.reshape(60000, 28, 28, 1).

Overfitting Without Regularization: Large gap between training (97%) and validation (91%) accuracy signals overfitting. Use data augmentation, dropout, or early stopping to improve generalization.
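Early stopping can be added with a standard Keras callback; the patience value here is an illustrative choice:

```python
import tensorflow as tf

# Stop training when validation loss stops improving, and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,                 # illustrative: wait 3 epochs before stopping
    restore_best_weights=True,
)

# Passed to fit() alongside a validation set:
# model.fit(train_images, train_labels,
#           validation_data=(val_images, val_labels),
#           epochs=50, callbacks=[early_stop])
```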

Quick Reference

Key Formulas and Parameters

Convolution Output Size (no padding): \[\text{Output Size} = \text{Input Size} - \text{Kernel Size} + 1\] Example: \(26 = 28 - 3 + 1\)

Convolution Parameters: \[\text{Params} = (\text{Kernel Height} \times \text{Kernel Width} \times \text{Input Channels} + 1) \times \text{Num Filters}\] Example: \((3 \times 3 \times 1 + 1) \times 64 = 640\) parameters

Pooling Output Size: \[\text{Output Size} = \frac{\text{Input Size}}{\text{Pool Size}}\] Example: \(13 = \frac{26}{2}\) for 2×2 MaxPooling

Model Size Estimation:

  • Float32: num_params × 4 bytes
  • Int8 quantized: num_params × 1 byte
  • Tensor Arena: typically 2-5× model size for intermediate activations

Parameter Budget for Microcontrollers:

  • Arduino Nano 33 BLE (256 KB SRAM): ~25,000 params (Float32) or ~100,000 params (Int8)
  • Total budget: model size + tensor arena < available RAM
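The size estimate above can be sketched as a small helper. The 3× arena multiplier is an assumed rule of thumb picked from the 2-5× range given here:

```python
def model_footprint_kb(num_params, bytes_per_param=1, arena_factor=3):
    """Estimate model size and a rule-of-thumb tensor arena.

    arena_factor is an assumed multiplier from the 2-5x range above;
    real arena size depends on the largest intermediate activations.
    """
    model_kb = num_params * bytes_per_param / 1024
    return model_kb, model_kb * arena_factor

# The lab's 243,786-parameter Fashion MNIST CNN, int8-quantized:
model_kb, arena_kb = model_footprint_kb(243_786)
print(f"model ~{model_kb:.0f} KB + arena ~{arena_kb:.0f} KB (assumed 3x)")
# Compare model_kb + arena_kb against the target device's available memory.
```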

Related PDF Sections:

  • Section 7.2: Understanding Convolution Operations
  • Section 7.3: Pooling Layers
  • Section 7.4: Building Your First CNN
  • Section 7.6: Data Augmentation
  • Section 7.7: Edge Deployment Considerations

Interactive Elements

Try the CNN Explainer to see how convolution filters work in real-time. Watch how:

  • Different filters detect different features (edges, textures, patterns)
  • Activations flow through the network layer by layer
  • Pooling reduces spatial dimensions while preserving features
  • Feature maps become more abstract in deeper layers

Use the 3D CNN Visualization to:

  • Draw digits and see real-time classification
  • Observe how the network processes your input in 3D
  • Understand the depth of feature maps at each layer

Experiment with ConvNetJS to:

  • Train CNNs directly in your browser on CIFAR-10
  • Observe overfitting vs underfitting in real-time
  • Adjust hyperparameters and see immediate effects
  • Compare different architectures (depth, width, pooling)

Code Example: Efficient Edge CNN

# Efficient CNN for microcontrollers (~52 KB int8-quantized)
efficient_model = tf.keras.models.Sequential([
    # Small input, aggressive pooling
    tf.keras.layers.Conv2D(8, (3,3), activation='relu',
                           input_shape=(48, 48, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),

    tf.keras.layers.Conv2D(16, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),

    # Minimal dense layers
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# About 52,810 parameters - fits comfortably on a Cortex-M4 once int8-quantized
efficient_model.summary()

Try It Yourself: Executable Python Examples

Run these interactive Python examples directly in your browser or Jupyter environment. Each demonstrates a core CNN concept with hands-on code you can modify and experiment with.

Convolution Operation from Scratch

Understanding convolution at the NumPy level builds intuition for what CNNs actually compute. This implementation shows the element-wise multiplication and summation that occurs at each position.

Code
import numpy as np
import matplotlib.pyplot as plt

def conv2d_simple(image, kernel):
    """
    Perform 2D convolution (cross-correlation) with no padding and stride 1.

    Args:
        image: 2D numpy array (H x W)
        kernel: 2D numpy array (K x K), typically 3x3

    Returns:
        Feature map: 2D array of size (H-K+1) x (W-K+1)
    """
    img_h, img_w = image.shape
    ker_h, ker_w = kernel.shape

    out_h = img_h - ker_h + 1
    out_w = img_w - ker_w + 1

    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i+ker_h, j:j+ker_w]
            output[i, j] = np.sum(region * kernel)

    return output

# Create a simple test image with gradient
test_image = np.array([
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50]
])

# Vertical edge detector (Sobel kernel)
vertical_edge_kernel = np.array([
    [1,  0, -1],
    [2,  0, -2],
    [1,  0, -1]
])

# Horizontal edge detector
horizontal_edge_kernel = np.array([
    [ 1,  2,  1],
    [ 0,  0,  0],
    [-1, -2, -1]
])

# Apply convolutions
feature_map_v = conv2d_simple(test_image, vertical_edge_kernel)
feature_map_h = conv2d_simple(test_image, horizontal_edge_kernel)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(test_image, cmap='gray')
axes[0].set_title('Input Image (5x5)')
axes[0].axis('off')

axes[1].imshow(feature_map_v, cmap='RdBu')
axes[1].set_title('Vertical Edge Detection (3x3)')
axes[1].axis('off')

axes[2].imshow(feature_map_h, cmap='RdBu')
axes[2].set_title('Horizontal Edge Detection (3x3)')
axes[2].axis('off')

plt.tight_layout()
plt.show()

print(f"Input shape: {test_image.shape}")
print(f"Kernel shape: {vertical_edge_kernel.shape}")
print(f"Output shape: {feature_map_v.shape}")
print(f"\nVertical edge detection values:\n{feature_map_v}")

Input shape: (5, 5)
Kernel shape: (3, 3)
Output shape: (3, 3)

Vertical edge detection values:
[[-80. -80. -80.]
 [-80. -80. -80.]
 [-80. -80. -80.]]

Key Insight: The image's left-to-right intensity gradient forms a vertical edge pattern, so it produces strong responses from the vertical edge detector but zero response from the horizontal detector (every row of the image is identical). In CNNs, these kernels are learned automatically during training rather than hand-designed.

Parameter Counting and Memory Estimation

Understanding where parameters come from is critical for edge deployment. Most CNN parameters are in dense layers, not convolutions.

Code
def count_conv2d_params(input_channels, output_filters, kernel_size, use_bias=True):
    """Calculate parameters in a Conv2D layer."""
    weights = kernel_size * kernel_size * input_channels * output_filters
    biases = output_filters if use_bias else 0
    return weights + biases

def count_dense_params(input_size, output_size, use_bias=True):
    """Calculate parameters in a Dense layer."""
    weights = input_size * output_size
    biases = output_size if use_bias else 0
    return weights + biases

# Analyze Fashion MNIST CNN architecture
print("=== Fashion MNIST CNN Parameter Breakdown ===\n")

layers = [
    ("Conv2D(64, 3x3, in=1)", count_conv2d_params(1, 64, 3)),
    ("MaxPooling2D", 0),
    ("Conv2D(64, 3x3, in=64)", count_conv2d_params(64, 64, 3)),
    ("MaxPooling2D", 0),
    ("Flatten", 0),
    ("Dense(5*5*64 -> 128)", count_dense_params(5*5*64, 128)),
    ("Dense(128 -> 10)", count_dense_params(128, 10))
]

total = sum(params for _, params in layers)

for layer_name, params in layers:
    pct = (params / total * 100) if total > 0 else 0
    print(f"{layer_name:30s}: {params:7,d} params ({pct:5.1f}%)")

print(f"\n{'Total':30s}: {total:7,d} params")
print(f"\nMemory Requirements:")
print(f"  Float32: {total * 4 / 1024:.1f} KB")
print(f"  Int8 (quantized): {total / 1024:.1f} KB")

conv_params = layers[0][1] + layers[2][1]
dense_params = layers[5][1] + layers[6][1]

print(f"\nParameter Distribution:")
print(f"  Convolutional layers: {conv_params:,d} ({conv_params/total*100:.1f}%)")
print(f"  Dense layers: {dense_params:,d} ({dense_params/total*100:.1f}%)")
print(f"  --> Dense layers dominate! Minimize these for edge deployment.")

# Visualize parameter distribution
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart of layer parameters
layer_names = [name for name, params in layers if params > 0]   # full names keep bars distinct
layer_params = [params for _, params in layers if params > 0]

ax1.barh(layer_names, layer_params, color=['steelblue', 'steelblue', 'coral', 'coral'])
ax1.set_xlabel('Parameters')
ax1.set_title('Parameters by Layer')
ax1.grid(axis='x', alpha=0.3)

# Pie chart of conv vs dense
ax2.pie([conv_params, dense_params],
        labels=['Convolutional', 'Dense'],
        autopct='%1.1f%%',
        colors=['steelblue', 'coral'],
        startangle=90)
ax2.set_title('Parameter Distribution')

plt.tight_layout()
plt.show()
=== Fashion MNIST CNN Parameter Breakdown ===

Conv2D(64, 3x3, in=1)         :     640 params (  0.3%)
MaxPooling2D                  :       0 params (  0.0%)
Conv2D(64, 3x3, in=64)        :  36,928 params ( 15.1%)
MaxPooling2D                  :       0 params (  0.0%)
Flatten                       :       0 params (  0.0%)
Dense(5*5*64 -> 128)          : 204,928 params ( 84.1%)
Dense(128 -> 10)              :   1,290 params (  0.5%)

Total                         : 243,786 params

Memory Requirements:
  Float32: 952.3 KB
  Int8 (quantized): 238.1 KB

Parameter Distribution:
  Convolutional layers: 37,568 (15.4%)
  Dense layers: 206,218 (84.6%)
  --> Dense layers dominate! Minimize these for edge deployment.

Key Insight: Dense layers contain 84.6% of the parameters despite being only 2 of the 7 layers! For edge deployment, use Global Average Pooling instead of Flatten to eliminate this bottleneck.

Output Size Calculator

Predicting the output dimensions of convolutional and pooling layers is critical for designing CNN architectures that fit edge device memory constraints.

Code
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Calculate output size for convolution or pooling layer."""
    return ((input_size - kernel_size + 2 * padding) // stride) + 1

def calculate_cnn_dimensions(input_shape, architecture):
    """Track spatial dimensions through a CNN architecture."""
    h, w, c = input_shape
    shapes = [input_shape]

    for layer in architecture:
        if layer['type'] in ['conv', 'pool']:
            h = conv_output_size(h, layer['kernel'],
                               layer.get('padding', 0),
                               layer.get('stride', 1))
            w = conv_output_size(w, layer['kernel'],
                               layer.get('padding', 0),
                               layer.get('stride', 1))

            if layer['type'] == 'conv':
                c = layer['filters']

            shapes.append((h, w, c))

    return shapes

# Example: Fashion MNIST CNN
input_shape = (28, 28, 1)

architecture = [
    {'type': 'conv', 'kernel': 3, 'stride': 1, 'padding': 0, 'filters': 64},
    {'type': 'pool', 'kernel': 2, 'stride': 2, 'padding': 0},
    {'type': 'conv', 'kernel': 3, 'stride': 1, 'padding': 0, 'filters': 64},
    {'type': 'pool', 'kernel': 2, 'stride': 2, 'padding': 0}
]

shapes = calculate_cnn_dimensions(input_shape, architecture)

print("CNN Dimension Flow:")
print(f"Input:              {shapes[0]}")
print(f"After Conv2D(64,3): {shapes[1]}")
print(f"After MaxPool(2,2): {shapes[2]}")
print(f"After Conv2D(64,3): {shapes[3]}")
print(f"After MaxPool(2,2): {shapes[4]}")

# Calculate memory at each stage
print("\nMemory Requirements (Float32):")
for i, shape in enumerate(shapes):
    h, w, c = shape
    memory_kb = (h * w * c * 4) / 1024
    stage = ["Input", "Conv1", "Pool1", "Conv2", "Pool2"][i]
    print(f"{stage:6s}: {h:2d}x{w:2d}x{c:3d} = {h*w*c:6,d} values = {memory_kb:6.1f} KB")

print("\n--- Common Scenarios ---")
print(f"28x28 image, 3x3 conv (no padding):   {conv_output_size(28, 3, 0, 1)}x{conv_output_size(28, 3, 0, 1)}")
print(f"28x28 image, 3x3 conv (same padding): {conv_output_size(28, 3, 1, 1)}x{conv_output_size(28, 3, 1, 1)}")
print(f"26x26 image, 2x2 maxpool:             {conv_output_size(26, 2, 0, 2)}x{conv_output_size(26, 2, 0, 2)}")
print(f"224x224 image, 7x7 conv, stride=2:    {conv_output_size(224, 7, 3, 2)}x{conv_output_size(224, 7, 3, 2)}")

# Visualize dimension reduction
import matplotlib.pyplot as plt

stages = ["Input", "Conv1", "Pool1", "Conv2", "Pool2"]
spatial_sizes = [h * w for h, w, _ in shapes]

plt.figure(figsize=(10, 4))
plt.plot(stages, spatial_sizes, marker='o', linewidth=2, markersize=10)
plt.ylabel('Spatial Size (H × W)')
plt.title('Spatial Dimension Reduction Through CNN')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nSpatial reduction: {shapes[0][0]*shapes[0][1]} -> {shapes[-1][0]*shapes[-1][1]} ({shapes[-1][0]*shapes[-1][1]/(shapes[0][0]*shapes[0][1])*100:.1f}%)")
CNN Dimension Flow:
Input:              (28, 28, 1)
After Conv2D(64,3): (26, 26, 64)
After MaxPool(2,2): (13, 13, 64)
After Conv2D(64,3): (11, 11, 64)
After MaxPool(2,2): (5, 5, 64)

Memory Requirements (Float32):
Input : 28x28x  1 =    784 values =    3.1 KB
Conv1 : 26x26x 64 = 43,264 values =  169.0 KB
Pool1 : 13x13x 64 = 10,816 values =   42.2 KB
Conv2 : 11x11x 64 =  7,744 values =   30.2 KB
Pool2 :  5x 5x 64 =  1,600 values =    6.2 KB

--- Common Scenarios ---
28x28 image, 3x3 conv (no padding):   26x26
28x28 image, 3x3 conv (same padding): 28x28
26x26 image, 2x2 maxpool:             13x13
224x224 image, 7x7 conv, stride=2:    112x112

Spatial reduction: 784 -> 25 (3.2%)

Key Insight: Pooling layers reduce spatial dimensions by 75% (from 26x26 to 13x13), dramatically reducing memory requirements for subsequent layers.

Building a Simple CNN with TensorFlow/Keras

This example shows a complete CNN architecture for image classification, suitable for MNIST/Fashion MNIST.

Code
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Build CNN architecture
def build_simple_cnn(input_shape=(28, 28, 1), num_classes=10):
    """Build a simple CNN for image classification."""
    model = tf.keras.models.Sequential([
        # First convolutional block
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                               input_shape=input_shape, name='conv1'),
        tf.keras.layers.MaxPooling2D((2, 2), name='pool1'),

        # Second convolutional block
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', name='conv2'),
        tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),

        # Classification head
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu', name='fc1'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation='softmax', name='output')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Create and inspect the model
model = build_simple_cnn()
print("=== CNN Architecture Summary ===\n")
model.summary()

# Load Fashion MNIST dataset
print("\n=== Loading Fashion MNIST Dataset ===")
(train_images, train_labels), (val_images, val_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Preprocess
train_images = train_images.reshape(-1, 28, 28, 1) / 255.0
val_images = val_images.reshape(-1, 28, 28, 1) / 255.0

print(f"Training samples: {len(train_images)}")
print(f"Validation samples: {len(val_images)}")

# Train the model (short training for demo)
print("\n=== Training CNN (5 epochs) ===")
history = model.fit(
    train_images[:12000], train_labels[:12000],  # Subset for speed
    validation_data=(val_images, val_labels),
    epochs=5,
    batch_size=128,
    verbose=1
)

# Evaluate
val_loss, val_accuracy = model.evaluate(val_images, val_labels, verbose=0)
print(f"\nValidation Accuracy: {val_accuracy:.2%}")

# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history.history['accuracy'], label='Training')
ax1.plot(history.history['val_accuracy'], label='Validation')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Accuracy')
ax1.legend()
ax1.grid(alpha=0.3)

ax2.plot(history.history['loss'], label='Training')
ax2.plot(history.history['val_loss'], label='Validation')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.set_title('Model Loss')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Visualize predictions
class_names = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

predictions = model.predict(val_images[:9])

fig, axes = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(val_images[i].reshape(28, 28), cmap='gray')
    pred_label = class_names[np.argmax(predictions[i])]
    true_label = class_names[val_labels[i]]
    color = 'green' if pred_label == true_label else 'red'
    ax.set_title(f'Pred: {pred_label}\nTrue: {true_label}', color=color)
    ax.axis('off')

plt.tight_layout()
plt.show()

print(f"\nModel has {model.count_params():,} parameters")
print(f"Model size (Float32): {model.count_params() * 4 / 1024:.1f} KB")
print(f"Estimated Int8 size: {model.count_params() / 1024:.1f} KB")
=== CNN Architecture Summary ===

=== Loading Fashion MNIST Dataset ===
Training samples: 60000
Validation samples: 10000

=== Training CNN (5 epochs) ===
Epoch 1/5 - accuracy: 0.5634 - loss: 1.2213 - val_accuracy: 0.7532 - val_loss: 0.6716
Epoch 2/5 - accuracy: 0.7272 - loss: 0.7519 - val_accuracy: 0.7880 - val_loss: 0.5677
Epoch 3/5 - accuracy: 0.7682 - loss: 0.6447 - val_accuracy: 0.8047 - val_loss: 0.5182
Epoch 4/5
(training log truncated)
1s 47ms/step - accuracy: 0.7798 - loss: 0.593056/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7799 - loss: 0.593258/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7800 - loss: 0.593560/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7801 - loss: 0.593662/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7802 - loss: 0.593864/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7803 - loss: 0.593966/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7803 - loss: 0.594168/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7804 - loss: 0.594270/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7805 - loss: 0.594472/94 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.7805 - loss: 0.594574/94 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step - accuracy: 0.7805 - loss: 0.594776/94 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step - accuracy: 0.7805 - loss: 0.594878/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7805 - loss: 0.594980/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7806 - loss: 0.594982/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7806 - loss: 0.594984/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7807 - loss: 0.594986/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7807 - loss: 0.594988/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7808 - loss: 0.594990/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7808 - loss: 0.594892/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7809 - loss: 0.594894/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7810 - loss: 0.594794/94 ━━━━━━━━━━━━━━━━━━━━ 10s 60ms/step - accuracy: 0.7847 - loss: 0.5896 - val_accuracy: 0.8146 - val_loss: 0.4891
Epoch 5/5
 1/94 ━━━━━━━━━━━━━━━━━━━━ 5s 58ms/step - accuracy: 0.7969 - loss: 0.5440 3/94 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step - accuracy: 0.7999 - loss: 0.5974 5/94 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step - accuracy: 0.8012 - loss: 0.5891 7/94 ━━━━━━━━━━━━━━━━━━━━ 3s 45ms/step - accuracy: 0.8003 - loss: 0.5819 9/94 ━━━━━━━━━━━━━━━━━━━━ 3s 45ms/step - accuracy: 0.7998 - loss: 0.575811/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7987 - loss: 0.574113/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7981 - loss: 0.573415/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7975 - loss: 0.572517/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7966 - loss: 0.572519/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7960 - loss: 0.573021/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7952 - loss: 0.574123/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7945 - loss: 0.575125/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7940 - loss: 0.575927/94 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - accuracy: 0.7937 - loss: 0.576229/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7934 - loss: 0.576231/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7929 - loss: 0.576533/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7926 - loss: 0.576935/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7922 - loss: 0.577537/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7920 - loss: 0.577939/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.578141/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.578243/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.578145/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.578147/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.578049/94 ━━━━━━━━━━━━━━━━━━━━ 2s 46ms/step - accuracy: 0.7918 - loss: 0.577751/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7919 - loss: 0.577453/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7920 - loss: 0.577155/94 ━━━━━━━━━━━━━━━━━━━━ 
1s 46ms/step - accuracy: 0.7921 - loss: 0.576757/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7922 - loss: 0.576459/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7923 - loss: 0.576161/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7924 - loss: 0.575863/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7925 - loss: 0.575465/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7925 - loss: 0.575167/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7926 - loss: 0.574869/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7926 - loss: 0.574471/94 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.7926 - loss: 0.574173/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7927 - loss: 0.573875/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7927 - loss: 0.573577/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7927 - loss: 0.573279/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7927 - loss: 0.572981/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7928 - loss: 0.572683/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7929 - loss: 0.572285/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7929 - loss: 0.571987/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7930 - loss: 0.571589/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7931 - loss: 0.571291/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7932 - loss: 0.570893/94 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7933 - loss: 0.570594/94 ━━━━━━━━━━━━━━━━━━━━ 6s 60ms/step - accuracy: 0.7983 - loss: 0.5536 - val_accuracy: 0.8245 - val_loss: 0.4785

Validation Accuracy: 82.45%
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 55ms/step

Model has 121,930 parameters
Model size (Float32): 476.3 KB
Estimated Int8 size: 119.1 KB
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv1 (Conv2D)                  │ (None, 26, 26, 32)     │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ pool1 (MaxPooling2D)            │ (None, 13, 13, 32)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2 (Conv2D)                  │ (None, 11, 11, 64)     │        18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ pool2 (MaxPooling2D)            │ (None, 5, 5, 64)       │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten)               │ (None, 1600)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ fc1 (Dense)                     │ (None, 64)             │       102,464 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ output (Dense)                  │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 121,930 (476.29 KB)
 Trainable params: 121,930 (476.29 KB)
 Non-trainable params: 0 (0.00 B)

Key Insight: This compact CNN reaches over 82% validation accuracy on Fashion MNIST after only 5 epochs of training (and exceeds 90% with longer training). The hierarchical feature extraction (edges → shapes → objects) emerges automatically during backpropagation.

Essential CNN Code Examples

The following examples demonstrate core CNN concepts with hands-on implementations. Each example is designed to build your intuition for how CNNs work under the hood.

Understanding how convolution works at the NumPy level helps demystify what CNNs actually compute. This implementation shows the element-wise multiplication and summation that occurs at each position.

import numpy as np

def conv2d_simple(image, kernel):
    """
    Perform 2D convolution without padding or stride > 1.

    Args:
        image: 2D numpy array (H x W)
        kernel: 2D numpy array (K x K), typically 3x3

    Returns:
        Feature map: 2D array of size (H-K+1) x (W-K+1)
    """
    # Get dimensions
    img_h, img_w = image.shape
    ker_h, ker_w = kernel.shape

    # Calculate output dimensions
    out_h = img_h - ker_h + 1
    out_w = img_w - ker_w + 1

    # Initialize output feature map
    output = np.zeros((out_h, out_w))

    # Slide kernel across image
    for i in range(out_h):
        for j in range(out_w):
            # Extract the region of interest
            region = image[i:i+ker_h, j:j+ker_w]

            # Element-wise multiplication and sum (dot product)
            output[i, j] = np.sum(region * kernel)

    return output

# Demo: Edge detection on a simple 5x5 image
test_image = np.array([
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 50]
])

# Vertical edge detector kernel
vertical_edge_kernel = np.array([
    [1,  0, -1],
    [2,  0, -2],
    [1,  0, -1]
])

# Apply convolution
feature_map = conv2d_simple(test_image, vertical_edge_kernel)
print("Input shape:", test_image.shape)
print("Kernel shape:", vertical_edge_kernel.shape)
print("Output shape:", feature_map.shape)  # (3, 3)
print("\nFeature map (detects vertical edges):")
print(feature_map)

# Try a horizontal edge detector
horizontal_edge_kernel = np.array([
    [ 1,  2,  1],
    [ 0,  0,  0],
    [-1, -2, -1]
])

feature_map_h = conv2d_simple(test_image, horizontal_edge_kernel)
print("\nHorizontal edge detection:")
print(feature_map_h)

Key Insights:

  • Output size shrinks: 5×5 image with 3×3 kernel → 3×3 output
  • Different kernels detect different features (edges, textures, etc.)
  • In real CNNs, kernel weights are learned, not hand-designed
  • This operation happens millions of times during CNN inference!

For deeper theory on convolution mathematics, see Section 7.2 in the PDF book.
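The nested loop above is easy to read but slow. Assuming NumPy ≥ 1.20, the same "valid" cross-correlation can be vectorized with `sliding_window_view`; the helper name `conv2d_vectorized` is illustrative. (Strictly speaking, what deep-learning frameworks call "convolution" is cross-correlation, since the kernel is not flipped, which is exactly what both implementations compute.)

```python
import numpy as np

def conv2d_vectorized(image, kernel):
    """'Valid' cross-correlation, vectorized with sliding_window_view."""
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

# Same gradient image as above: each row is [10, 20, 30, 40, 50]
image = np.tile(np.array([10., 20., 30., 40., 50.]), (5, 1))
sobel_v = np.array([[1., 0., -1.],
                    [2., 0., -2.],
                    [1., 0., -1.]])

out = conv2d_vectorized(image, sobel_v)
print(out)  # every entry is -80.0: a constant horizontal gradient of 10/pixel
```

Because the image brightens by a constant 10 per column, the vertical-edge response is the same (-80) at every position, which makes this a handy sanity check for any convolution implementation.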

Predicting the output dimensions of convolutional and pooling layers is critical for designing CNN architectures that fit edge device memory constraints.

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """
    Calculate output size for convolution or pooling layer.

    Formula: output_size = floor((input_size - kernel_size + 2*padding) / stride) + 1

    Args:
        input_size: Height or width of input (int)
        kernel_size: Size of kernel/filter (int)
        padding: Number of pixels padded on each side (int)
        stride: Step size for sliding window (int)

    Returns:
        Output dimension (int)
    """
    return ((input_size - kernel_size + 2 * padding) // stride) + 1

def calculate_cnn_dimensions(input_shape, architecture):
    """
    Track spatial dimensions through a CNN architecture.

    Args:
        input_shape: Tuple (height, width, channels)
        architecture: List of layer dicts with 'type', 'kernel', 'stride', 'padding', 'filters'

    Returns:
        List of output shapes at each layer
    """
    h, w, c = input_shape
    shapes = [input_shape]

    for layer in architecture:
        if layer['type'] in ['conv', 'pool']:
            h = conv_output_size(h, layer['kernel'], layer.get('padding', 0), layer.get('stride', 1))
            w = conv_output_size(w, layer['kernel'], layer.get('padding', 0), layer.get('stride', 1))

            if layer['type'] == 'conv':
                c = layer['filters']  # Conv changes number of channels
            # Pooling keeps same number of channels

            shapes.append((h, w, c))

    return shapes

# Example: Fashion MNIST CNN
input_shape = (28, 28, 1)

architecture = [
    {'type': 'conv', 'kernel': 3, 'stride': 1, 'padding': 0, 'filters': 64},
    {'type': 'pool', 'kernel': 2, 'stride': 2, 'padding': 0},
    {'type': 'conv', 'kernel': 3, 'stride': 1, 'padding': 0, 'filters': 64},
    {'type': 'pool', 'kernel': 2, 'stride': 2, 'padding': 0}
]

shapes = calculate_cnn_dimensions(input_shape, architecture)

print("CNN Dimension Flow:")
print(f"Input:              {shapes[0]}")
print(f"After Conv2D(64,3): {shapes[1]}")
print(f"After MaxPool(2,2): {shapes[2]}")
print(f"After Conv2D(64,3): {shapes[3]}")
print(f"After MaxPool(2,2): {shapes[4]}")

# Quick reference examples
print("\n--- Common Scenarios ---")
print(f"28×28 image, 3×3 conv (no padding): {conv_output_size(28, 3, 0, 1)}×{conv_output_size(28, 3, 0, 1)}")
print(f"28×28 image, 3×3 conv (same padding): {conv_output_size(28, 3, 1, 1)}×{conv_output_size(28, 3, 1, 1)}")
print(f"26×26 image, 2×2 maxpool: {conv_output_size(26, 2, 0, 2)}×{conv_output_size(26, 2, 0, 2)}")
print(f"224×224 image, 7×7 conv, stride=2: {conv_output_size(224, 7, 3, 2)}×{conv_output_size(224, 7, 3, 2)}")

Key Insights:

  • No padding: Output shrinks by (kernel_size - 1) per dimension
  • Same padding: Output size equals input size (when stride=1)
  • Pooling: Typically halves dimensions (2×2 pool, stride 2)
  • Memory impact: 2×2 pooling (26×26 → 13×13) reduces spatial memory by 75%!

See Section 7.2.4 in the PDF for mathematical derivations and edge deployment considerations.

This example shows a complete CNN architecture using Keras, suitable for MNIST/Fashion MNIST or similar small image datasets.

import tensorflow as tf

# Build CNN architecture
def build_simple_cnn(input_shape=(28, 28, 1), num_classes=10):
    """
    Build a simple CNN for image classification.

    Args:
        input_shape: Input dimensions (height, width, channels)
        num_classes: Number of output classes

    Returns:
        Compiled Keras model
    """
    model = tf.keras.models.Sequential([
        # First convolutional block
        # Input: 28×28×1 → Output: 26×26×64
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
                               input_shape=input_shape,
                               name='conv1'),
        # Output: 13×13×64
        tf.keras.layers.MaxPooling2D((2, 2), name='pool1'),

        # Second convolutional block
        # Output: 11×11×64
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', name='conv2'),
        # Output: 5×5×64
        tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),

        # Classification head
        # Flatten: 5×5×64 = 1600 features
        tf.keras.layers.Flatten(),

        # Dense layers for classification
        tf.keras.layers.Dense(128, activation='relu', name='fc1'),
        tf.keras.layers.Dropout(0.5),  # Prevent overfitting
        tf.keras.layers.Dense(num_classes, activation='softmax', name='output')
    ])

    # Compile with optimizer and loss function
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Create and inspect the model
model = build_simple_cnn()
model.summary()

# Load Fashion MNIST dataset
(train_images, train_labels), (val_images, val_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Preprocess: Add channel dimension and normalize
train_images = train_images.reshape(-1, 28, 28, 1) / 255.0
val_images = val_images.reshape(-1, 28, 28, 1) / 255.0

# Train the model
print("\n--- Training CNN ---")
history = model.fit(
    train_images, train_labels,
    validation_data=(val_images, val_labels),
    epochs=5,  # Use 20 for full training
    batch_size=128,
    verbose=1
)

# Evaluate
val_loss, val_accuracy = model.evaluate(val_images, val_labels, verbose=0)
print(f"\nValidation Accuracy: {val_accuracy:.2%}")

# Compare with a simple dense network
def build_dense_baseline(input_shape=(28, 28, 1), num_classes=10):
    """Dense network baseline for comparison."""
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=input_shape),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

print("\n--- Dense Network Baseline ---")
baseline = build_dense_baseline()
baseline.summary()

# Typical results (after full 20-epoch training):
# Dense Network: ~87-89% validation accuracy
# CNN: ~91-93% validation accuracy (4-5 point improvement!)

Key Insights:

  • Convolution blocks: Conv2D + MaxPooling pattern extracts hierarchical features
  • Dropout: Prevents overfitting by randomly dropping 50% of neurons during training
  • Parameter efficiency: Weight sharing makes the convolutional layers cheap; most of this CNN's parameters sit in the dense classification head
  • Architecture pattern: Spatial dimensions decrease (28→13→5), channels increase (1→64→64)

See Section 7.4 in the PDF for building your first CNN with complete explanations.

Understanding where parameters come from is critical for edge deployment. Most CNN parameters are in dense layers, not convolutions!

import numpy as np

def count_conv2d_params(input_channels, output_filters, kernel_size, use_bias=True):
    """
    Count parameters in a Conv2D layer.

    Formula: (kernel_h × kernel_w × input_channels + bias) × output_filters

    Args:
        input_channels: Number of input feature maps
        output_filters: Number of output feature maps (filters)
        kernel_size: Size of kernel (assume square)
        use_bias: Whether layer has bias terms

    Returns:
        Total parameter count
    """
    weights = kernel_size * kernel_size * input_channels * output_filters
    biases = output_filters if use_bias else 0
    return weights + biases

def count_dense_params(input_size, output_size, use_bias=True):
    """
    Count parameters in a Dense (fully-connected) layer.

    Formula: (input_size + bias) × output_size
    """
    weights = input_size * output_size
    biases = output_size if use_bias else 0
    return weights + biases

def analyze_cnn_parameters(architecture_name="Fashion MNIST CNN"):
    """
    Analyze parameter distribution in a typical CNN.
    """
    print(f"=== {architecture_name} Parameter Breakdown ===\n")

    layers = [
        ("Conv2D(64, 3×3)", count_conv2d_params(1, 64, 3)),
        ("MaxPooling2D", 0),  # Pooling has no parameters
        ("Conv2D(64, 3×3)", count_conv2d_params(64, 64, 3)),
        ("MaxPooling2D", 0),
        ("Flatten", 0),
        ("Dense(128)", count_dense_params(5*5*64, 128)),
        ("Dense(10)", count_dense_params(128, 10))
    ]

    total = sum(params for _, params in layers)
    for layer_name, params in layers:
        pct = (params / total) * 100
        print(f"{layer_name:20s}: {params:7,d} params ({pct:5.1f}%)")

    print(f"\n{'Total':20s}: {total:7,d} params")
    print(f"\nMemory Requirements:")
    print(f"  Float32: {total * 4 / 1024:.1f} KB")
    print(f"  Int8 (quantized): {total / 1024:.1f} KB")

    # Highlight where parameters come from
    conv_params = layers[0][1] + layers[2][1]
    dense_params = layers[5][1] + layers[6][1]

    print(f"\nParameter Distribution:")
    print(f"  Convolutional layers: {conv_params:,d} ({conv_params/total*100:.1f}%)")
    print(f"  Dense layers: {dense_params:,d} ({dense_params/total*100:.1f}%)")
    print(f"  → Dense layers dominate! Minimize these for edge deployment.")

# Run the analysis
analyze_cnn_parameters()

# Show individual calculations
print("\n=== Detailed Parameter Calculations ===\n")

print("Conv2D(64, 3×3, input_channels=1):")
print(f"  Weights: 3 × 3 × 1 × 64 = {3*3*1*64}")
print(f"  Biases:  64")
print(f"  Total:   {count_conv2d_params(1, 64, 3)}")

print("\nConv2D(64, 3×3, input_channels=64):")
print(f"  Weights: 3 × 3 × 64 × 64 = {3*3*64*64}")
print(f"  Biases:  64")
print(f"  Total:   {count_conv2d_params(64, 64, 3)}")

print("\nDense(1600 → 128):")
print(f"  Weights: 1600 × 128 = {1600*128}")
print(f"  Biases:  128")
print(f"  Total:   {count_dense_params(1600, 128)}")

print("\nDense(128 → 10):")
print(f"  Weights: 128 × 10 = {128*10}")
print(f"  Biases:  10")
print(f"  Total:   {count_dense_params(128, 10)}")

# Edge deployment optimization tip
print("\n=== Edge Optimization Strategy ===")
print("\nProblem: Dense(1600 → 128) has 204,928 parameters (73% of total!)")
print("\nSolution 1: Use Global Average Pooling instead of Flatten")
print("  Before: Flatten() → Dense(1600, 128)")
print("  After:  GlobalAveragePooling2D() → Dense(64, 10)")
print("  Savings: 204,928 → 650 params (99.7% reduction!)")

print("\nSolution 2: Reduce dense layer size")
print("  Dense(1600 → 32) = 51,232 params (75% reduction)")
print("  Dense(32 → 10) = 330 params")

print("\nSolution 3: More pooling, smaller spatial dimensions")
print("  Aggressive pooling → smaller feature maps before Flatten")
print("  Example: 5×5×64 → 2×2×64 reduces dense input from 1600 to 256")

Key Insights:

  • Parameter explosion: Dense layers dominate parameter counts
  • Conv2D efficiency: Parameter sharing makes convolutions very efficient
  • Edge strategy: Minimize or eliminate dense layers using Global Average Pooling
  • Quantization: Int8 quantization reduces model size by 4× (critical for microcontrollers)

For complete parameter formulas and edge deployment constraints, see Section 7.5 in the PDF book.

Practice Exercise: Design Your Own Edge CNN

Using the examples above, design a CNN that:

  1. Processes 32×32 RGB images (CIFAR-10 style)
  2. Has fewer than 50,000 parameters
  3. Uses at least 3 convolutional layers
  4. Achieves reasonable accuracy (>70% on CIFAR-10)

Constraints for edge deployment:

  • Total model size < 200 KB (quantized)
  • Use GlobalAveragePooling2D instead of large Dense layers
  • Test your design with the parameter counting function above

Hint: Start with 16 filters in early layers, gradually increase to 32 or 64. Use 2×2 MaxPooling after every 1-2 conv layers.
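As a starting point, here is a back-of-envelope parameter budget for one possible design; the layer names and filter counts below are illustrative, not a prescribed solution.

```python
# Hypothetical starter design (names and filter counts are illustrative):
# Conv(3->16) -> Pool -> Conv(16->32) -> Pool -> Conv(32->64) -> GAP -> Dense(10)

def conv_params(c_in, c_out, k=3):
    """k x k kernel weights plus one bias per filter."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

budget = {
    'conv1 (3->16)':  conv_params(3, 16),
    'conv2 (16->32)': conv_params(16, 32),
    'conv3 (32->64)': conv_params(32, 64),
    'head (GAP->10)': dense_params(64, 10),
}
total = sum(budget.values())
for name, p in budget.items():
    print(f"{name:15s}: {p:6,d}")
print(f"{'total':15s}: {total:6,d}")          # 24,234 -- under the 50,000 cap
print(f"int8 size ~ {total / 1024:.1f} KB")   # ~23.7 KB, far below 200 KB
```

Global average pooling keeps the classification head at 650 parameters, so almost the entire budget goes to feature extraction.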

Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

Answer: Parameters = (Kernel Height × Kernel Width × Input Channels + 1) × Num Filters = (3 × 3 × 3 + 1) × 32 = (27 + 1) × 32 = 28 × 32 = 896 parameters. The “+1” accounts for bias terms (one per filter). In float32, this layer requires 896 × 4 = 3,584 bytes. After int8 quantization, it reduces to 896 bytes. Note: This is just the weights—runtime requires additional memory for input/output feature maps.

Answer: After Conv2D: Output Size = Input Size - Kernel Size + 1 = 28 - 3 + 1 = 26×26. After MaxPooling: Output Size = 26 / 2 = 13×13. If the Conv2D layer has 64 filters, the final output shape is (batch_size, 13, 13, 64). Each pooling layer reduces spatial dimensions by 2× while preserving the number of feature maps (channels).

Answer: Convolutional layers use parameter sharing—the same 3×3 filter (9 weights) is reused across all spatial positions. A Conv2D layer with 32 filters has only ~300 parameters. Dense layers connect every input to every output with no sharing. A Dense(512, 128) layer has 512 × 128 = 65,536 parameters! For edge deployment, minimize dense layers: use GlobalAveragePooling2D instead of Flatten → Dense, or keep final dense layers small (Dense(32) instead of Dense(512)).

Answer: This is overfitting—the model memorizes training data instead of learning general patterns. The 9% gap indicates the model is too complex for the available training data. Solutions: (1) Add data augmentation (rotation, flipping, zooming) during training to artificially increase dataset size and variety, (2) Add dropout layers (0.2-0.5) to prevent co-adaptation, (3) Reduce model complexity (fewer filters or layers), (4) Use early stopping to halt training when validation accuracy plateaus, (5) Collect more training data if possible. Never augment validation data—it must represent real deployment conditions.
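The augmentation suggestion above can be sketched in plain NumPy. Real pipelines would use framework layers such as `tf.keras.layers.RandomFlip` and `RandomTranslation`; the `augment` helper below is only a minimal stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(batch, max_shift=2):
    """Random horizontal flip and small wrap-around translation for a
    batch of (N, H, W) grayscale images -- a minimal augmentation sketch."""
    out = batch.copy()
    for i in range(len(out)):
        if rng.random() < 0.5:
            out[i] = out[i, :, ::-1].copy()              # horizontal flip
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        out[i] = np.roll(out[i], (dy, dx), axis=(0, 1))  # shift a few pixels
    return out

batch = rng.random((8, 28, 28))
aug = augment(batch)
print(aug.shape)  # (8, 28, 28): same shape, perturbed content
```

Applied on the fly each epoch, transforms like these present the network with a slightly different dataset every pass, which is why augmentation acts as a regularizer.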

Answer: Maybe, but unlikely. In int8 quantized form, 200,000 parameters = 200KB of Flash (fits easily in 1MB). However, runtime requires SRAM for the tensor arena, which typically needs 2-4× the model size for intermediate activations. Estimated arena = 200KB × 3 = 600KB, which exceeds 256KB SRAM. Solutions: (1) Reduce model to ~50,000 parameters (50KB model, 150KB arena, fits in 256KB), (2) Use aggressive pooling to reduce activation sizes, (3) Switch to ESP32 with 520KB SRAM, or (4) Use depthwise separable convolutions (MobileNet-style) which reduce memory requirements.

Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

LAB 07: Convolutional Neural Networks for Computer Vision

Open In Colab View on GitHub


Overview

Property Value
Book Chapter Chapter 07
Execution Levels Level 1 (Notebook) | Level 2 (TFLite) | Level 3 (MCU)
Estimated Time 60 minutes
Prerequisites LAB 02-03, basic neural networks

Learning Objectives

  1. Understand convolution mathematics - how filters extract features
  2. Master CNN architecture concepts - receptive fields, feature hierarchies
  3. Build and train CNNs for image classification
  4. Apply regularization techniques to prevent overfitting
  5. Deploy CNNs on edge devices

Prerequisites Check

Before You Begin

Make sure you have completed:

  - [ ] LAB 02: ML Foundations with TensorFlow
  - [ ] LAB 03: Model Quantization
  - [ ] Understanding of matrix operations

Part 1: The Mathematics of Convolution

1.1 Why CNNs for Images?

Dense neural networks have fundamental problems with images:

┌─────────────────────────────────────────────────────────────────┐
│           DNN vs CNN for Image Processing                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Dense Neural Network (DNN):                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 28×28 image = 784 pixels                                 │   │
│  │                                                          │   │
│  │ First hidden layer (128 neurons):                        │   │
│  │ Parameters = 784 × 128 = 100,352 weights                 │   │
│  │                                                          │   │
│  │ Problems:                                                 │   │
│  │ • No spatial awareness (pixel order doesn't matter)      │   │
│  │ • No translation invariance                              │   │
│  │ • Huge parameter count                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Convolutional Neural Network (CNN):                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 28×28 image                                              │   │
│  │                                                          │   │
│  │ First conv layer (32 filters, 3×3):                      │   │
│  │ Parameters = 3×3×1×32 = 288 weights                      │   │
│  │                                                          │   │
│  │ Advantages:                                               │   │
│  │ • Spatial structure preserved                            │   │
│  │ • Translation invariance via weight sharing              │   │
│  │ • 350× fewer parameters!                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1.2 The Convolution Operation

A 2D convolution slides a kernel (filter) across an image, computing dot products:

\(y[i,j] = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x[i+m, j+n] \cdot w[m,n] + b\)

Where:

  • \(x\): input image
  • \(w\): kernel weights (K×K)
  • \(b\): bias
  • \(y\): output feature map

Example: 3×3 Edge Detection Kernel

Input Image (5×5):           Kernel (3×3):        Output (3×3):
┌───┬───┬───┬───┬───┐       ┌────┬────┬────┐     ┌───┬───┬───┐
│ 0 │ 0 │ 0 │ 0 │ 0 │       │ -1 │ -1 │ -1 │     │   │   │   │
├───┼───┼───┼───┼───┤       ├────┼────┼────┤     ├───┼───┼───┤
│ 0 │ 1 │ 1 │ 1 │ 0 │   ⊛   │ -1 │  8 │ -1 │  =  │   │ 0 │   │
├───┼───┼───┼───┼───┤       ├────┼────┼────┤     ├───┼───┼───┤
│ 0 │ 1 │ 1 │ 1 │ 0 │       │ -1 │ -1 │ -1 │     │   │   │   │
├───┼───┼───┼───┼───┤                            └───┴───┴───┘
│ 0 │ 1 │ 1 │ 1 │ 0 │
├───┼───┼───┼───┼───┤        Computation at center:
│ 0 │ 0 │ 0 │ 0 │ 0 │        (-1×1)+(-1×1)+(-1×1)+
└───┴───┴───┴───┴───┘        (-1×1)+(8×1)+(-1×1)+
                             (-1×1)+(-1×1)+(-1×1) = 0

Key insight: The same kernel is applied everywhere → weight sharing → efficiency!
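The worked example can be verified in a few lines of NumPy:

```python
import numpy as np

img = np.zeros((5, 5))
img[1:4, 1:4] = 1                 # the 3x3 block of ones from the diagram
kernel = -np.ones((3, 3))
kernel[1, 1] = 8                  # edge-detection (Laplacian-style) kernel

out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)

print(out)
# [[5. 3. 5.]
#  [3. 0. 3.]
#  [5. 3. 5.]]
# The center is 0 (uniform region); the kernel responds only where
# intensity changes, i.e. along the edges of the white square.
```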

1.3 Output Dimensions

For an input of size \(H_{in} \times W_{in}\) with kernel size \(K\), stride \(S\), and padding \(P\):

\(H_{out} = \lfloor \frac{H_{in} + 2P - K}{S} \rfloor + 1\)

\(W_{out} = \lfloor \frac{W_{in} + 2P - K}{S} \rfloor + 1\)

Common configurations:

Input    Kernel   Stride   Padding      Output
28×28    3×3      1        0 (valid)    26×26
28×28    3×3      1        1 (same)     28×28
28×28    3×3      2        0            13×13
28×28    5×5      1        0            24×24
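The formula is a one-liner in Python; this small helper (the name `conv_out` is ours) checks every row of the table above:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output size along one dimension: floor((H + 2P - K) / S) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# The four configurations from the table:
conv_out(28, 3, 1, 0)   # valid padding  -> 26
conv_out(28, 3, 1, 1)   # same padding   -> 28
conv_out(28, 3, 2, 0)   # stride 2       -> 13
conv_out(28, 5, 1, 0)   # 5x5 kernel     -> 24
```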

1.4 Receptive Field: What Each Neuron “Sees”

The receptive field is the region of input that affects a neuron’s output:

┌─────────────────────────────────────────────────────────────────┐
│                    RECEPTIVE FIELD GROWTH                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Image (28×28)                                            │
│  ┌─────────────────────────────────┐                           │
│  │ ┌─────────────┐                 │                           │
│  │ │ ┌───────┐   │                 │                           │
│  │ │ │ ┌───┐ │   │                 │   Layer 1: 3×3 RF         │
│  │ │ │ │ • │ │   │  ◄──────────────┤   Layer 2: 5×5 RF         │
│  │ │ │ └───┘ │   │                 │   Layer 3: 7×7 RF         │
│  │ │ └───────┘   │                 │   (stride 1, no pooling)  │
│  │ └─────────────┘                 │                           │
│  └─────────────────────────────────┘                           │
│                                                                 │
│  Receptive Field Formula (stride-1 convs, no pooling):         │
│  RF_n = RF_{n-1} + (K_n - 1)                                   │
│                                                                 │
│  General form, with J_{n-1} = product of all earlier strides:  │
│  RF_n = RF_{n-1} + (K_n - 1) × J_{n-1}                         │
│                                                                 │
│  With 2×2 (stride-2) pooling after each 3×3 conv:              │
│  Conv 1: RF = 3              (J = 1)                           │
│  Pool 1: RF = 3 + 1×1 = 4    (J = 2)                           │
│  Conv 2: RF = 4 + 2×2 = 8    (J = 2)                           │
│  Pool 2: RF = 8 + 1×2 = 10   (J = 4)                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Deeper networks have larger receptive fields → can detect larger patterns!
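The recursion can be computed exactly by tracking the cumulative stride (the "jump") of each layer. A small sketch (the helper name `receptive_field` is ours):

```python
def receptive_field(layer_specs):
    """layer_specs: list of (kernel, stride) pairs, in order.
    Applies RF_n = RF_{n-1} + (K_n - 1) * J_{n-1}, where J_{n-1} is the
    product of all strides before layer n. Returns the RF after each layer."""
    rf, jump = 1, 1
    trace = []
    for k, s in layer_specs:
        rf += (k - 1) * jump   # each new tap reaches `jump` input pixels apart
        jump *= s              # downsampling widens all later layers' reach
        trace.append(rf)
    return trace

# conv 3x3 -> pool 2x2/s2 -> conv 3x3 -> pool 2x2/s2
trace = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])
```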

1.5 Feature Hierarchy

CNNs learn a hierarchy of features from simple to complex:

┌─────────────────────────────────────────────────────────────────┐
│                    FEATURE HIERARCHY                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Layer 1 (Early):          Layer 2 (Mid):       Layer 3+ (Deep):│
│  ┌───────────────┐         ┌──────────────┐    ┌──────────────┐│
│  │ ─── edges     │    →    │ ╔══╗ corners │ →  │ 👁️ eyes     ││
│  │ │ │ │ lines   │         │ ╚══╝ shapes  │    │ 👃 nose     ││
│  │ ╱ ╲ gradients │         │ ○ ● textures │    │ 🐾 paws     ││
│  └───────────────┘         └──────────────┘    └──────────────┘│
│                                                                 │
│  Small receptive field     Medium RF            Large RF       │
│  Local features            Combinations         Semantic parts │
│                                                                 │
│  Example for Face Detection:                                   │
│  Edges → Eye corners → Eyes → Eye pair → Face                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Part 2: Pooling Operations

2.1 Why Pooling?

Pooling provides:

  1. Dimensionality reduction - fewer parameters in later layers
  2. Translation invariance - small shifts don’t change output
  3. Larger receptive field - see more of the image

┌─────────────────────────────────────────────────────────────────┐
│                    POOLING OPERATIONS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Max Pooling (2×2):              Average Pooling (2×2):        │
│                                                                 │
│  ┌───┬───┬───┬───┐              ┌───┬───┬───┬───┐              │
│  │ 1 │ 3 │ 2 │ 4 │              │ 1 │ 3 │ 2 │ 4 │              │
│  ├───┼───┼───┼───┤    ┌───┬───┐ ├───┼───┼───┼───┤    ┌───┬───┐│
│  │ 5 │ 6 │ 1 │ 2 │ →  │ 6 │ 4 │ │ 5 │ 6 │ 1 │ 2 │ →  │3.75│2.25││
│  ├───┼───┼───┼───┤    ├───┼───┤ ├───┼───┼───┼───┤    ├───┼───┤│
│  │ 3 │ 2 │ 1 │ 0 │    │ 3 │ 3 │ │ 3 │ 2 │ 1 │ 0 │    │1.75│1.5 ││
│  ├───┼───┼───┼───┤    └───┴───┘ ├───┼───┼───┼───┤    └───┴───┘│
│  │ 1 │ 1 │ 2 │ 3 │              │ 1 │ 1 │ 2 │ 3 │              │
│  └───┴───┴───┴───┘              └───┴───┴───┴───┘              │
│                                                                 │
│  Max: Keeps strongest activation (edges, features)             │
│  Avg: Smooths features (less common in modern CNNs)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
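Both pooling variants can be verified on the 4×4 input from the diagram. A minimal NumPy sketch (the helper name `pool2x2` is ours) that reduces non-overlapping 2×2 blocks:

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling via block reshape (assumes even H and W).
    Reshaping to (H/2, 2, W/2, 2) groups each 2x2 block on axes 1 and 3."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

# The 4x4 input from the diagram
x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 1, 0],
              [1, 1, 2, 3]], dtype=float)

max_out = pool2x2(x, "max")   # strongest activation per block
avg_out = pool2x2(x, "avg")   # mean activation per block
```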

2.2 Global Pooling

Global Average Pooling (GAP) replaces fully connected layers:

\(y_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c,i,j}\)

Benefits:

  • No additional parameters
  • Reduces overfitting
  • Used in MobileNet, ResNet
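In NumPy, GAP is just a mean over the spatial axes of a channels-last tensor, exactly as in the formula above (a minimal sketch with made-up data):

```python
import numpy as np

# A tiny H=3, W=3, C=2 feature-map tensor (channels-last layout)
x = np.arange(3 * 3 * 2, dtype=float).reshape(3, 3, 2)

# Global average pooling: collapse each channel's HxW map to one scalar
gap = x.mean(axis=(0, 1))   # shape (C,) -- one value per channel
```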

Part 3: DNN Baseline - Fashion MNIST

Let’s establish a baseline using a Dense Neural Network to see why CNNs are better.
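A minimal Keras sketch of such a dense baseline (layer sizes are a typical choice, not prescribed by the lab; it assumes Fashion-MNIST inputs normalized to [0, 1]):

```python
from tensorflow.keras import layers, models

# Dense baseline: every pixel connects to every hidden unit,
# so spatial structure is discarded at the Flatten step.
dnn = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),                        # 784 inputs
    layers.Dense(128, activation="relu"),    # 784*128 + 128 = 100,480 params
    layers.Dense(10, activation="softmax"),  # 128*10 + 10  = 1,290 params
])
dnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```

Note that the single hidden layer alone holds ~100k weights, the count that makes the 288-weight convolutional filter from Part 1 look so cheap by comparison.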

Part 4: Building a CNN

4.1 CNN Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    CNN ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input (28×28×1)                                                │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────────┐  Conv2D(64, 3×3, relu)                        │
│  │  26×26×64   │  Parameters: 3×3×1×64 + 64 = 640              │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  MaxPool(2×2)                                 │
│  │  13×13×64   │  No parameters (just downsampling)            │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  Conv2D(64, 3×3, relu)                        │
│  │  11×11×64   │  Parameters: 3×3×64×64 + 64 = 36,928          │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  MaxPool(2×2)                                 │
│  │   5×5×64    │                                               │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  Flatten                                      │
│  │    1600     │  5×5×64 = 1600                                │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  Dense(20, relu)                              │
│  │     20      │  Parameters: 1600×20 + 20 = 32,020            │
│  └──────┬──────┘                                                │
│         ▼                                                       │
│  ┌─────────────┐  Dense(10, softmax)                           │
│  │     10      │  Parameters: 20×10 + 10 = 210                 │
│  └─────────────┘                                                │
│                                                                 │
│  Total: ~70,000 parameters                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
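The diagram above translates directly into Keras. A sketch of that architecture (the optimizer and loss are typical choices for this task, not fixed by the diagram):

```python
from tensorflow.keras import layers, models

# CNN matching the architecture diagram; per-layer shapes and
# parameter counts are noted in the comments.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), activation="relu"),   # 26x26x64,   640 params
    layers.MaxPooling2D((2, 2)),                    # 13x13x64,     0 params
    layers.Conv2D(64, (3, 3), activation="relu"),   # 11x11x64, 36,928 params
    layers.MaxPooling2D((2, 2)),                    # 5x5x64,       0 params
    layers.Flatten(),                               # 1600
    layers.Dense(20, activation="relu"),            # 32,020 params
    layers.Dense(10, activation="softmax"),         # 210 params
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Summing the comments gives 69,798 parameters, the "~70,000" in the diagram.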

Part 5: Visualizing What CNNs Learn

Let’s examine the intermediate feature maps to understand what the CNN “sees”.
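The standard Keras pattern is to wrap the trained model in a second `Model` whose outputs are every layer's activations. A sketch using a tiny untrained stand-in (in the lab you would pass in your trained CNN instead):

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in CNN; substitute your trained model here.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
])

# One forward pass through this wrapper returns every layer's feature maps.
extractor = models.Model(inputs=model.inputs,
                         outputs=[layer.output for layer in model.layers])
maps = extractor.predict(np.random.rand(1, 28, 28, 1).astype("float32"),
                         verbose=0)
# maps[0]: conv activations (1, 26, 26, 8); maps[1]: pooled (1, 13, 13, 8).
# Plot e.g. maps[0][0, :, :, k] with matplotlib to see what filter k responds to.
```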

Part 6: Regularization Techniques

6.1 Overfitting Theory

┌─────────────────────────────────────────────────────────────────┐
│                    OVERFITTING                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Training Loss         Training & Val Loss                      │
│       │                        │                                │
│       │ \                      │  \   /← Overfitting!          │
│       │  \                     │   \ /                          │
│       │   \___                 │    X                           │
│       │       \___             │   / \                          │
│       │           \___         │  /   \___                      │
│       └──────────────────►     └──────────────────►             │
│              Epochs                   Epochs                    │
│                                                                 │
│  Underfitting      Good Fit      Overfitting                   │
│  ┌─────────┐      ┌─────────┐   ┌─────────┐                    │
│  │    ─    │      │   ───   │   │ ~~~~~~~ │                    │
│  │ o o o o │      │ o o o o │   │ o o o o │                    │
│  └─────────┘      └─────────┘   └─────────┘                    │
│  Too simple        Just right    Too complex                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

6.2 Dropout Regularization

Dropout randomly sets neurons to 0 during training:

\(h_i^{(l)} = \begin{cases} 0 & \text{with probability } p \\ \frac{h_i^{(l)}}{1-p} & \text{with probability } 1-p \end{cases}\)

This forces the network to learn redundant representations and prevents co-adaptation.
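The case expression above is "inverted dropout": survivors are scaled by 1/(1−p) during training so the expected activation is unchanged and nothing needs to happen at inference time. A minimal NumPy sketch (the helper name `dropout` is ours; frameworks do this for you):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: zero each unit with probability p and scale the
    survivors by 1/(1-p), keeping E[output] == input. At inference the
    layer is the identity."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p        # True = unit survives
    return h * mask / (1.0 - p)

h = np.ones(10000)
y = dropout(h, p=0.5)   # each entry is either 0.0 or 2.0; mean stays near 1
```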

Part 7: Edge Deployment

7.1 Model Size Considerations for Edge

┌─────────────────────────────────────────────────────────────────┐
│              EDGE DEPLOYMENT CONSIDERATIONS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Device          RAM       Recommended Model Size               │
│  ─────────────────────────────────────────────────              │
│  Arduino Nano 33  256KB    < 100KB (int8)                      │
│  ESP32            520KB    < 200KB (int8)                      │
│  Raspberry Pi     1-8GB    < 50MB                              │
│                                                                 │
│  Optimization Techniques:                                       │
│  1. Quantization: Float32 → Int8 (4× smaller)                  │
│  2. Pruning: Remove small weights (2-3× smaller)               │
│  3. Knowledge Distillation: Train smaller student model        │
│  4. Architecture: Use depthwise separable convolutions         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
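A back-of-the-envelope size check ties the table to this lab's CNN. The sketch below (the helper name `model_size_kb` is ours) estimates weight storage only, ignoring activations and runtime overhead:

```python
def model_size_kb(n_params, bytes_per_weight):
    """Rough weight-storage estimate: parameter count x bytes per weight."""
    return n_params * bytes_per_weight / 1024

n = 70_000                          # ~ parameter count of this lab's CNN
float32_kb = model_size_kb(n, 4)    # ~273 KB: over the Arduino budget
int8_kb = model_size_kb(n, 1)       # ~68 KB: under the < 100 KB (int8) limit
```

This is exactly the 4× reduction from technique 1 (float32 → int8): quantization alone moves this model from "too big for an Arduino Nano 33" to deployable.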

Checkpoint: Self-Assessment

Knowledge Check

Before proceeding, make sure you can answer:

  1. What is the convolution operation mathematically? Write the formula.
  2. How do you calculate output dimensions after convolution and pooling?
  3. What is a receptive field and how does it grow with depth?
  4. Why do CNNs outperform DNNs for images?
  5. How does dropout prevent overfitting?
  6. What is the size reduction from int8 quantization?
Common Pitfalls
  • Forgetting to add channel dimension for grayscale images
  • Not normalizing pixel values to [0, 1]
  • Using too many filters (increases model size)
  • Ignoring early stopping (leads to overfitting)

Three-Tier Activities

Level 1: Run the embedded notebook above. Key exercises:

  1. Follow along with the code cells
  2. Modify parameters and observe results
  3. Complete the checkpoint questions

Use Level 2 to develop strong CNN intuition without needing hardware:

CNN Explainer – Interactive CNN visualization:

  • See how convolution filters extract features
  • Watch activations flow through the network
  • Understand pooling operations visually

3D CNN Visualization – Draw digits and watch the network classify them in 3D

ConvNetJS – Train CNNs in your browser on CIFAR-10 and observe overfitting/underfitting

Level 3: In this foundational CNN lab we do not fully deploy a model yet, but you can:

  • Export a small CNN or MobileNet variant trained in the notebook
  • Follow LAB03 to quantize it to .tflite
  • Follow LAB05 to deploy it to:
    • A Raspberry Pi with a USB camera, or
    • An embedded board such as ESP32-CAM (with appropriate camera firmware)

Lab 16 will provide a full, end-to-end deployment path for real-time CV on edge devices.