LAB03: TensorFlow Lite Quantization

Model Optimization for Edge Deployment

PDF Textbook Reference

For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 3: Model Optimization and Quantization for Edge in the PDF textbook.

The PDF chapter includes:
  • Mathematical foundations of quantization theory
  • Detailed derivations of quantization error bounds
  • Complete TensorFlow Lite conversion pipeline
  • In-depth coverage of quantization-aware training (QAT)
  • Theoretical analysis of accuracy-efficiency trade-offs

Open In Colab

Download Notebook

Learning Objectives

By the end of this lab you will be able to:

  • Explain why model size and numeric precision matter on edge devices
  • Apply post-training quantization (dynamic range and full-int8) to TFLite models
  • Compare float32 vs int8 models in terms of size, accuracy, and latency
  • Prepare quantized models for deployment on Raspberry Pi and microcontrollers

Theory Summary

Why Quantization is Essential for Edge Deployment

Quantization is the bridge between cloud-scale models and edge-ready models. It’s the single most important optimization technique for deploying ML on resource-constrained devices.

The Size Problem: A typical neural network uses 32-bit floating-point numbers (float32) for weights and activations. Each parameter requires 4 bytes of storage, so a modest 100,000-parameter model needs 400KB, already too large for many microcontrollers. Quantization reduces this to 100KB by converting to 8-bit integers (int8), a 4× size reduction.

How Quantization Works: Quantization maps continuous floating-point values to discrete integer values using a scale factor and zero-point. The formula is: int8_value = round(float_value / scale) + zero_point. During inference, we reconstruct approximate floats: float_approx = scale × (int8_value - zero_point). The quantization error per value is at most ±scale/2, which is typically tiny compared to model weights that naturally cluster around zero.
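A minimal NumPy sketch of this round trip; the scale and zero-point values below are made up for illustration, not taken from a real model:

```python
import numpy as np

def quantize(r, scale, zero_point):
    # q = round(r / s) + z, clipped to the int8 range
    q = np.round(r / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # r ≈ s * (q - z)
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical parameters for weights spanning roughly [-1, 1]
scale, zero_point = 0.008, 0
r = np.array([-0.73, -0.1, 0.0, 0.42, 0.9], dtype=np.float32)
q = quantize(r, scale, zero_point)
r_hat = dequantize(q, scale, zero_point)

# Reconstruction error never exceeds scale/2 for in-range values
assert np.max(np.abs(r - r_hat)) <= scale / 2 + 1e-6
```

Running the forward and backward mapping on a handful of values confirms the ±scale/2 error bound stated above.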

Three Quantization Approaches:

  1. Dynamic Range quantizes only weights (not activations), giving 4× size reduction with minimal accuracy loss (1-2%). Fast to apply, good for initial testing.
  2. Full Integer quantizes both weights and activations for true int8-only inference. Required for microcontrollers and specialized accelerators. Needs a representative dataset for calibration. Typically 2-4% accuracy loss.
  3. Quantization-Aware Training (QAT) simulates quantization during training, allowing the model to learn to compensate for precision loss. Best accuracy (<1% loss) but requires retraining.

Why Models Quantize Well: Neural network weights follow approximately Gaussian distributions centered near zero due to regularization and initialization strategies. Most values cluster in a narrow range, making them ideal candidates for quantization. The step size between adjacent int8 values is small enough that reconstruction error is negligible for model accuracy.
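A quick numerical check of this clustering claim, using synthetic Gaussian weights (σ = 0.5 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
weights = rng.normal(loc=0.0, scale=sigma, size=100_000)  # N(0, 0.5), like regularized weights

# Nearly all probability mass lies within ±3σ of zero, so an int8 grid
# spanning [min, max] wastes very few of its 256 levels on outliers.
within_3sigma = np.mean(np.abs(weights) < 3 * sigma)
assert within_3sigma > 0.99
```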

Key Concepts at a Glance

Core Concepts
  • Quantization: Converting high-precision (float32) to low-precision (int8) representation
  • Scale Factor: Maps float range to int8 range, determines quantization granularity
  • Zero-Point: The int8 value representing floating-point zero
  • Representative Dataset: Sample data used to calibrate quantization parameters
  • Post-Training Quantization (PTQ): Apply quantization after training completes
  • Quantization-Aware Training (QAT): Train with quantization simulation for better accuracy
  • Tensor Arena: Pre-allocated SRAM workspace for model inference

Common Pitfalls

Mistakes to Avoid

Using Too Few Representative Samples: Post-training quantization requires 100-500 samples from your validation set to calibrate scale and zero-point values. Using only 10 samples leads to poor calibration and significant accuracy drops. The representative dataset should cover all classes and edge cases your model will encounter.

Confusing Model Size vs Runtime Memory: A 100KB quantized model stored in Flash doesn’t mean you only need 100KB of RAM! Runtime inference requires additional SRAM for input buffers, intermediate activations, and output tensors. A 100KB model might need 300KB of working memory (tensor arena) to execute.
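A back-of-envelope feasibility check capturing this distinction; the 2-4× arena multiplier is the rule of thumb used in this lab, and the board figures below are illustrative, not a datasheet lookup:

```python
# Rule-of-thumb feasibility check: Flash holds the model, SRAM holds the tensor arena.
def fits_on_board(model_kb, flash_kb, sram_kb, arena_multiplier=3):
    arena_kb = model_kb * arena_multiplier  # working-memory estimate (2-4x model size)
    return model_kb <= flash_kb and arena_kb <= sram_kb

# A 100 KB model on a hypothetical MCU with 1 MB Flash but only 256 KB SRAM:
print(fits_on_board(100, flash_kb=1024, sram_kb=256))  # arena ~300 KB > 256 KB, so False
```

The model fits in Flash with room to spare, yet inference still fails for lack of SRAM, which is exactly the confusion this pitfall warns about.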

Skipping Accuracy Validation: Always measure quantized model accuracy before deployment. A model that shows 95% accuracy in float32 might drop to 90% or worse after aggressive quantization. If accuracy loss exceeds 2-3%, use QAT or collect more representative data.

Wrong Quantization Type for Target: Full integer quantization is required for microcontrollers (Arduino, ESP32) because they lack efficient float32 math. Dynamic range quantization works on Raspberry Pi but won’t run on MCUs. Check your target hardware requirements before converting.

Forgetting Input/Output Type Conversion: When using inference_input_type=tf.uint8, your inference code must convert inputs from float to uint8 using the model’s scale and zero-point. Forgetting this conversion causes completely wrong predictions.
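A NumPy sketch of that conversion; in real inference code the scale and zero-point come from interpreter.get_input_details()[0]['quantization'], but placeholder values are used here so the snippet stands alone:

```python
import numpy as np

# Quantization parameters normally read from the TFLite interpreter's input
# details; these values are placeholders for illustration.
input_scale, input_zero_point = 1.0 / 255.0, 0

def to_model_input(x_float):
    """Convert a float image in [0, 1] to the uint8 layout the model expects."""
    q = np.round(x_float / input_scale) + input_zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

x = np.array([[0.0, 0.25, 1.0]], dtype=np.float32)
print(to_model_input(x))  # values: 0, 64, 255
```

Skipping this step feeds raw floats into a uint8 input tensor, which is why predictions come out completely wrong.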

Quick Reference

Key Formulas

Quantization Formula: \[q = \text{round}\left(\frac{r}{s}\right) + z\]

where \(q\) = quantized int8 value, \(r\) = real float32 value, \(s\) = scale, \(z\) = zero-point

Dequantization Formula: \[r \approx s \cdot (q - z)\]

Scale Calculation: \[s = \frac{r_{max} - r_{min}}{q_{max} - q_{min}} = \frac{r_{max} - r_{min}}{255}\]

for an unsigned 8-bit range (uint8, 0-255); signed int8 gives the same denominator, since 127 - (-128) = 255

Model Size Reduction: \[\text{Size}_{int8} = \frac{\text{Size}_{float32}}{4}\]

Maximum Quantization Error: \[\text{Error}_{max} = \frac{s}{2}\]
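The formulas above, evaluated once in Python for a hypothetical weight range of [-0.5, 1.5] mapped to uint8:

```python
# Worked example of the quick-reference formulas (the weight range is made up).
r_min, r_max = -0.5, 1.5
q_min, q_max = 0, 255

scale = (r_max - r_min) / (q_max - q_min)    # s = 2.0 / 255
zero_point = round(q_min - r_min / scale)    # z: the integer representing float 0.0
max_error = scale / 2                        # worst-case per-value error

num_params = 100_000
size_float32_kb = num_params * 4 / 1024      # 4 bytes per parameter
size_int8_kb = num_params * 1 / 1024         # 1 byte per parameter

print(round(scale, 6), zero_point, round(max_error, 6))
print(size_float32_kb / size_int8_kb)        # 4.0
```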

Important Parameter Values

Quantization Type      Size Reduction   Accuracy Loss   Use Case
Float32 (baseline)     1× (none)        0%              Development, debugging
Dynamic Range          4×               1-2%            Quick deployment, Raspberry Pi
Full Integer (int8)    4×               2-5%            Required for MCUs
QAT Int8               4×               <1%             Production edge devices

Representative Dataset Guidelines:
  • Minimum samples: 100
  • Recommended: 200-500
  • Source: Validation set (not training set!)
  • Coverage: All classes, edge cases, varied conditions
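A minimal generator with the shape the TFLite converter expects (assign it via converter.representative_dataset); x_val here is a random NumPy stand-in for a real validation set:

```python
import numpy as np

# Stand-in for a real validation set (e.g. 500 grayscale 28x28 images).
x_val = np.random.rand(500, 28, 28, 1).astype(np.float32)

def representative_dataset_gen(num_samples=200):
    """Yield one batched float32 sample per step, as the converter expects."""
    for i in range(num_samples):
        yield [x_val[i:i + 1]]

batches = list(representative_dataset_gen())
assert len(batches) == 200
assert batches[0][0].shape == (1, 28, 28, 1)
```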

TFLite Converter Options:

# Dynamic Range Quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full Integer Quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8  # or tf.int8
converter.inference_output_type = tf.uint8


Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

Question 1: Why does converting a model from float32 to int8 reduce its size by 4×?

Answer: Float32 uses 4 bytes (32 bits) per parameter, while int8 uses 1 byte (8 bits) per parameter. Size reduction = 4 bytes / 1 byte = 4×. A 100,000-parameter model requires 400KB in float32 but only 100KB in int8. This reduction is critical for fitting models on microcontrollers with limited Flash storage (typically 256KB-1MB).

Question 2: Quantize r = 0.42 with scale s = 0.025 and zero-point z = 0. What is the quantized value, and what is the reconstruction error?

Answer: Using the quantization formula: q = round(r / s) + z = round(0.42 / 0.025) + 0 = round(16.8) + 0 = 17. To dequantize back: r ≈ s × (q - z) = 0.025 × (17 - 0) = 0.425. The quantization error is 0.425 - 0.42 = 0.005, which is less than scale/2 = 0.0125 (maximum possible error).
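The same arithmetic, replayed in Python:

```python
# Replay of the checkpoint arithmetic: r = 0.42, s = 0.025, z = 0.
r, s, z = 0.42, 0.025, 0

q = round(r / s) + z        # round(16.8) = 17
r_hat = s * (q - z)         # 0.025 * 17 = 0.425
error = r_hat - r           # 0.005

assert q == 17
assert abs(error - 0.005) < 1e-9
assert abs(error) <= s / 2  # within the +/- s/2 bound
```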

Question 3: Why can't a model converted with dynamic range quantization run on a microcontroller?

Answer: Dynamic range quantization only quantizes weights, not activations. At runtime, the model still uses float32 for activations and intermediate calculations, which requires float32 math support. Microcontrollers need full integer (int8) quantization which quantizes both weights AND activations for true int8-only inference. You must use: converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] and provide a representative dataset.

Question 4: Your model's accuracy drops by 8% after full integer quantization. What should you try?

Answer: An 8% accuracy drop is too high (target: <2-3%). Try these solutions in order: (1) Increase representative dataset size from 100 to 500+ samples covering all classes and edge cases, (2) Verify the representative dataset matches your actual input distribution (use validation data, not training), (3) Use Quantization-Aware Training (QAT) which simulates quantization during training for <1% accuracy loss, (4) If still poor, consider the model may be too sensitive to precision loss - try a slightly larger architecture.

Question 5: Your quantized model is 150KB. How much SRAM should you budget for the tensor arena, and will it run on an Arduino Nano 33 BLE?

Answer: The tensor arena typically requires 2-4× the model size for activation memory during inference. For a 150KB model, expect 300-600KB of SRAM needed. This is because the arena must hold: input buffers, intermediate activations for each layer, output tensors, and scratch buffers. An Arduino Nano 33 BLE with 256KB SRAM cannot run this model - you’d need an ESP32 (520KB SRAM) or need to reduce the model complexity further.

Try It Yourself: Executable Python Examples

Run these interactive examples directly in your browser to understand quantization intuitively.

Example 1: Quantization Simulation (Float32 to Int8)

Code
import numpy as np
import matplotlib.pyplot as plt

# Simulate quantization process
np.random.seed(42)

# Generate synthetic weights (typical neural network distribution)
# randn returns float64, so cast explicitly to get a true float32 baseline
weights_float32 = (np.random.randn(1000) * 0.5).astype(np.float32)  # Mean=0, std=0.5

# Quantization parameters
r_min = weights_float32.min()
r_max = weights_float32.max()
q_min = -128  # int8 range
q_max = 127

# Calculate scale and zero-point
scale = (r_max - r_min) / (q_max - q_min)
zero_point = int(np.round(q_min - r_min / scale))

# Quantize: q = round(r / scale) + zero_point
# (add the zero-point and clip BEFORE casting, so intermediate values cannot overflow int8)
weights_int8 = np.round(weights_float32 / scale) + zero_point
weights_int8 = np.clip(weights_int8, q_min, q_max).astype(np.int8)

# Dequantize: r ≈ scale * (q - zero_point)
weights_dequantized = scale * (weights_int8.astype(np.float32) - zero_point)

# Calculate quantization error
quantization_error = weights_float32 - weights_dequantized
max_error = np.max(np.abs(quantization_error))
mean_error = np.mean(np.abs(quantization_error))

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Original float32 distribution
axes[0, 0].hist(weights_float32, bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[0, 0].set_title('Original Float32 Weights')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Quantized int8 distribution
axes[0, 1].hist(weights_int8, bins=50, alpha=0.7, color='red', edgecolor='black')
axes[0, 1].set_title('Quantized Int8 Weights')
axes[0, 1].set_xlabel('Value')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Comparison of original vs dequantized
axes[1, 0].scatter(weights_float32, weights_dequantized, alpha=0.3, s=10)
axes[1, 0].plot([r_min, r_max], [r_min, r_max], 'r--', linewidth=2, label='Perfect reconstruction')
axes[1, 0].set_xlabel('Original Float32')
axes[1, 0].set_ylabel('Dequantized Float32')
axes[1, 0].set_title('Reconstruction Quality')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Quantization error distribution
axes[1, 1].hist(quantization_error, bins=50, alpha=0.7, color='orange', edgecolor='black')
axes[1, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1, 1].set_title('Quantization Error')
axes[1, 1].set_xlabel('Error (Original - Dequantized)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("QUANTIZATION RESULTS")
print("="*60)
print(f"Scale factor:           {scale:.6f}")
print(f"Zero point:             {zero_point}")
print(f"Original range:         [{r_min:.4f}, {r_max:.4f}]")
print(f"Quantized range:        [{q_min}, {q_max}]")
print(f"\nError Statistics:")
print(f"  Max absolute error:   {max_error:.6f}")
print(f"  Mean absolute error:  {mean_error:.6f}")
print(f"  Theoretical max:      {scale/2:.6f} (scale/2)")
print(f"\nMemory savings:")
print(f"  Float32:              {weights_float32.nbytes} bytes")
print(f"  Int8:                 {weights_int8.nbytes} bytes")
print(f"  Reduction:            {weights_float32.nbytes / weights_int8.nbytes:.1f}×")
print(f"\nKey insight: Quantization error is small and negligible for model accuracy!")

QUANTIZATION RESULTS
============================================================
Scale factor:           0.013910
Zero point:             -11
Original range:         [-1.6206, 1.9264]
Quantized range:        [-128, 127]

Error Statistics:
  Max absolute error:   0.006953
  Mean absolute error:  0.003493
  Theoretical max:      0.006955 (scale/2)

Memory savings:
  Float32:              4000 bytes
  Int8:                 1000 bytes
  Reduction:            4.0×

Key insight: Quantization error is small and negligible for model accuracy!

Example 2: Weight Distribution Visualization

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate weight distributions for different layers
np.random.seed(42)

# Simulate different layer types
conv_weights = np.random.randn(5000) * 0.3  # Conv layers: smaller std
dense_weights = np.random.randn(5000) * 0.5  # Dense layers: moderate std
batch_norm = np.random.randn(500) * 0.1    # BatchNorm: very small std

# Combine all weights
all_weights = np.concatenate([conv_weights, dense_weights, batch_norm])

# Fit Gaussian distribution
mu, std = stats.norm.fit(all_weights)

# Create figure
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Overall weight distribution with Gaussian fit
axes[0, 0].hist(all_weights, bins=100, density=True, alpha=0.7, color='blue', edgecolor='black')
xmin, xmax = axes[0, 0].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
axes[0, 0].plot(x, p, 'r-', linewidth=2, label=f'Gaussian fit\nμ={mu:.3f}, σ={std:.3f}')
axes[0, 0].set_title('Weight Distribution (All Layers)')
axes[0, 0].set_xlabel('Weight Value')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Layer-wise distributions
axes[0, 1].hist(conv_weights, bins=50, alpha=0.5, label='Conv layers (σ=0.3)', color='blue')
axes[0, 1].hist(dense_weights, bins=50, alpha=0.5, label='Dense layers (σ=0.5)', color='green')
axes[0, 1].hist(batch_norm, bins=50, alpha=0.5, label='BatchNorm (σ=0.1)', color='red')
axes[0, 1].set_title('Layer-wise Weight Distributions')
axes[0, 1].set_xlabel('Weight Value')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Cumulative distribution
sorted_weights = np.sort(all_weights)
cumulative = np.arange(1, len(sorted_weights) + 1) / len(sorted_weights)
axes[1, 0].plot(sorted_weights, cumulative, linewidth=2)
axes[1, 0].axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Median')
axes[1, 0].axvline(x=0, color='g', linestyle='--', alpha=0.5, label='Zero')
axes[1, 0].set_title('Cumulative Distribution Function')
axes[1, 0].set_xlabel('Weight Value')
axes[1, 0].set_ylabel('Cumulative Probability')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Quantization bin analysis
num_bins = 256  # int8 has 256 unique values
hist, bin_edges = np.histogram(all_weights, bins=num_bins)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
axes[1, 1].bar(bin_centers, hist, width=bin_edges[1]-bin_edges[0], alpha=0.7, color='purple')
axes[1, 1].set_title(f'Weight Distribution in {num_bins} Quantization Bins')
axes[1, 1].set_xlabel('Weight Value')
axes[1, 1].set_ylabel('Count per Bin')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical summary
print("WEIGHT DISTRIBUTION ANALYSIS")
print("="*60)
print(f"Total weights:          {len(all_weights):,}")
print(f"Mean (μ):               {mu:.6f}")
print(f"Std Dev (σ):            {std:.6f}")
print(f"Min:                    {all_weights.min():.6f}")
print(f"Max:                    {all_weights.max():.6f}")
print(f"\nPercentile analysis:")
print(f"  1st percentile:       {np.percentile(all_weights, 1):.6f}")
print(f"  25th percentile:      {np.percentile(all_weights, 25):.6f}")
print(f"  Median (50th):        {np.percentile(all_weights, 50):.6f}")
print(f"  75th percentile:      {np.percentile(all_weights, 75):.6f}")
print(f"  99th percentile:      {np.percentile(all_weights, 99):.6f}")
print(f"\nWeights near zero:")
print(f"  Within ±0.1:          {np.sum(np.abs(all_weights) < 0.1) / len(all_weights) * 100:.1f}%")
print(f"  Within ±0.5:          {np.sum(np.abs(all_weights) < 0.5) / len(all_weights) * 100:.1f}%")
print(f"\nKey insight: Most weights cluster near zero - ideal for quantization!")

WEIGHT DISTRIBUTION ANALYSIS
============================================================
Total weights:          10,500
Mean (μ):               -0.001570
Std Dev (σ):            0.405598
Min:                    -1.961200
Max:                    1.764528

Percentile analysis:
  1st percentile:       -1.022329
  25th percentile:      -0.242821
  Median (50th):        -0.000886
  75th percentile:      0.235287
  99th percentile:      1.028916

Weights near zero:
  Within ±0.1:          23.5%
  Within ±0.5:          80.2%

Key insight: Most weights cluster near zero - ideal for quantization!

Example 3: Model Size Comparison Before/After Quantization

Code
import numpy as np
import matplotlib.pyplot as plt

# Simulate different model architectures
models = {
    'Tiny KWS': {
        'layers': [
            ('Conv2D', 16, 3, 3),   # 16 filters, 3x3 kernel
            ('Conv2D', 32, 3, 3),
            ('Dense', 128),
            ('Dense', 10)
        ],
        'input_shape': (49, 13, 1)  # MFCC features
    },
    'MobileNet-like': {
        'layers': [
            ('Conv2D', 32, 3, 3),
            ('DepthwiseConv2D', 32, 3, 3),
            ('Conv2D', 64, 1, 1),
            ('DepthwiseConv2D', 64, 3, 3),
            ('Conv2D', 128, 1, 1),
            ('Dense', 256),
            ('Dense', 10)
        ],
        'input_shape': (224, 224, 3)
    },
    'Small MNIST': {
        'layers': [
            ('Conv2D', 8, 3, 3),
            ('MaxPool2D', None, 2, 2),
            ('Conv2D', 16, 3, 3),
            ('MaxPool2D', None, 2, 2),
            ('Dense', 32),
            ('Dense', 10)
        ],
        'input_shape': (28, 28, 1)
    }
}

# Calculate parameter counts (rough estimate: ignores conv striding/padding and
# overstates flattened Dense inputs, so absolute sizes run large; the
# float32-to-int8 ratio is the quantity of interest here)
def count_params(layers, input_shape):
    params = 0
    in_channels = input_shape[-1]
    spatial_size = input_shape[0] * input_shape[1]

    for layer in layers:
        if layer[0] == 'Conv2D':
            filters, k_h, k_w = layer[1], layer[2], layer[3]
            params += (k_h * k_w * in_channels + 1) * filters
            in_channels = filters
        elif layer[0] == 'DepthwiseConv2D':
            channels, k_h, k_w = layer[1], layer[2], layer[3]
            params += k_h * k_w * channels + channels
        elif layer[0] == 'Dense':
            units = layer[1]
            # Estimate flattened size
            params += (in_channels * spatial_size + 1) * units if spatial_size > 1 else (in_channels + 1) * units
            in_channels = units
            spatial_size = 1
        elif layer[0] == 'MaxPool2D':
            spatial_size = spatial_size // 4  # 2x2 pooling

    return params

# Calculate sizes for each model
model_names = []
float32_sizes = []
int8_sizes = []
compressions = []

for name, config in models.items():
    params = count_params(config['layers'], config['input_shape'])

    # Size calculations
    float32_size = params * 4 / 1024  # KB (4 bytes per float32)
    int8_size = params * 1 / 1024     # KB (1 byte per int8)
    compression = float32_size / int8_size

    model_names.append(name)
    float32_sizes.append(float32_size)
    int8_sizes.append(int8_size)
    compressions.append(compression)

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Size comparison
x = np.arange(len(model_names))
width = 0.35

bars1 = ax1.bar(x - width/2, float32_sizes, width, label='Float32', color='steelblue', edgecolor='black')
bars2 = ax1.bar(x + width/2, int8_sizes, width, label='Int8 Quantized', color='coral', edgecolor='black')

ax1.set_xlabel('Model Architecture')
ax1.set_ylabel('Size (KB)')
ax1.set_title('Model Size: Float32 vs Int8 Quantization')
ax1.set_xticks(x)
ax1.set_xticklabels(model_names, rotation=15, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1f}', ha='center', va='bottom', fontsize=9)

# Plot 2: Compression ratios
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(model_names)))
bars = ax2.barh(model_names, compressions, color=colors, edgecolor='black')
ax2.axvline(x=4.0, color='red', linestyle='--', linewidth=2, label='4× compression')
ax2.set_xlabel('Compression Ratio')
ax2.set_title('Quantization Compression Ratio')
ax2.legend()
ax2.grid(True, alpha=0.3, axis='x')

# Add value labels
for i, (bar, comp) in enumerate(zip(bars, compressions)):
    ax2.text(comp + 0.1, bar.get_y() + bar.get_height()/2,
            f'{comp:.2f}×', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# Print summary table
print("MODEL SIZE COMPARISON")
print("="*80)
print(f"{'Model':<20} {'Float32 (KB)':<15} {'Int8 (KB)':<15} {'Reduction':<15}")
print("-"*80)
for name, f32, i8, comp in zip(model_names, float32_sizes, int8_sizes, compressions):
    print(f"{name:<20} {f32:>10.1f} KB   {i8:>10.1f} KB   {comp:>10.2f}×")
print("="*80)

# Memory savings summary
total_float32 = sum(float32_sizes)
total_int8 = sum(int8_sizes)
total_saved = total_float32 - total_int8

print(f"\nTotal memory saved: {total_saved:.1f} KB ({total_saved/1024:.2f} MB)")
print(f"Average compression: {np.mean(compressions):.2f}×")
print(f"\nKey insight: Quantization provides consistent ~4× size reduction!")

MODEL SIZE COMPARISON
================================================================================
Model                Float32 (KB)    Int8 (KB)       Reduction      
--------------------------------------------------------------------------------
Tiny KWS                10216.3 KB       2554.1 KB         4.00×
MobileNet-like        6422587.0 KB    1605646.8 KB         4.00×
Small MNIST               104.3 KB         26.1 KB         4.00×
================================================================================

Total memory saved: 4824680.7 KB (4711.60 MB)
Average compression: 4.00×

Key insight: Quantization provides consistent ~4× size reduction!

Example 4: Accuracy vs Compression Tradeoff

Code
import numpy as np
import matplotlib.pyplot as plt

# Simulate accuracy vs compression for different quantization strategies
np.random.seed(42)

# Quantization strategies
strategies = {
    'Float32 (Baseline)': {'bits': 32, 'acc': 95.0, 'size_mult': 1.0},
    'Float16': {'bits': 16, 'acc': 94.8, 'size_mult': 0.5},
    'Dynamic Range': {'bits': 8, 'acc': 94.5, 'size_mult': 0.25},
    'Full INT8': {'bits': 8, 'acc': 93.8, 'size_mult': 0.25},
    'INT8 + QAT': {'bits': 8, 'acc': 94.9, 'size_mult': 0.25},
    'INT4 (Aggressive)': {'bits': 4, 'acc': 89.5, 'size_mult': 0.125},
}

# Add some variance to simulate different models
num_models = 5
results = []

for strat_name, config in strategies.items():
    for i in range(num_models):
        # Add small random variation
        acc_var = config['acc'] + np.random.randn() * 0.5
        size_var = config['size_mult'] * (1 + np.random.randn() * 0.02)

        results.append({
            'strategy': strat_name,
            'accuracy': acc_var,
            'size': size_var,
            'bits': config['bits']
        })

# Create Pareto frontier visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy vs Size tradeoff
for strat_name, config in strategies.items():
    strat_results = [r for r in results if r['strategy'] == strat_name]
    accs = [r['accuracy'] for r in strat_results]
    sizes = [r['size'] for r in strat_results]

    axes[0].scatter(sizes, accs, label=strat_name, s=100, alpha=0.7, edgecolors='black', linewidth=1.5)

axes[0].set_xlabel('Relative Model Size', fontsize=12)
axes[0].set_ylabel('Accuracy (%)', fontsize=12)
axes[0].set_title('Accuracy vs Model Size Tradeoff', fontsize=14, fontweight='bold')
axes[0].legend(loc='lower right', fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 1.1)
axes[0].set_ylim(88, 96)

# Add "Pareto frontier" region
axes[0].axhline(y=93, color='red', linestyle='--', alpha=0.3, label='Acceptable accuracy threshold')

# Plot 2: Accuracy loss vs Compression ratio
baseline_acc = 95.0
compression_ratios = []
accuracy_losses = []
strategy_labels = []

for strat_name, config in strategies.items():
    if strat_name == 'Float32 (Baseline)':
        continue

    compression = 1.0 / config['size_mult']
    acc_loss = baseline_acc - config['acc']

    compression_ratios.append(compression)
    accuracy_losses.append(acc_loss)
    strategy_labels.append(strat_name)

colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(strategy_labels)))
bars = axes[1].bar(range(len(strategy_labels)), accuracy_losses, color=colors, edgecolor='black', linewidth=1.5)

axes[1].set_xticks(range(len(strategy_labels)))
axes[1].set_xticklabels(strategy_labels, rotation=20, ha='right')
axes[1].set_ylabel('Accuracy Loss (%)', fontsize=12)
axes[1].set_title('Accuracy Loss by Quantization Strategy', fontsize=14, fontweight='bold')
axes[1].axhline(y=2.0, color='orange', linestyle='--', linewidth=2, label='2% acceptable loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

# Add compression ratios as text on bars
for i, (bar, comp) in enumerate(zip(bars, compression_ratios)):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.1,
                f'{comp:.1f}×', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Print detailed comparison
print("ACCURACY vs COMPRESSION TRADEOFF")
print("="*80)
print(f"{'Strategy':<25} {'Bits':<8} {'Size':<15} {'Accuracy':<12} {'Loss':<12}")
print("-"*80)

baseline = strategies['Float32 (Baseline)']
for name, config in strategies.items():
    compression = 1.0 / config['size_mult']
    loss = baseline['acc'] - config['acc']

    print(f"{name:<25} {config['bits']:<8} {compression:>6.1f}× smaller  {config['acc']:>6.2f}%      {loss:>6.2f}%")

print("="*80)
print("\nRecommendations:")
print("  • Dynamic Range INT8: Best balance for quick deployment (4× smaller, <1% loss)")
print("  • Full INT8: Required for MCUs, acceptable 1-2% accuracy loss")
print("  • INT8 + QAT: Best accuracy with quantization (<0.5% loss)")
print("  • Avoid INT4 unless extreme size constraints (>5% accuracy loss)")

ACCURACY vs COMPRESSION TRADEOFF
================================================================================
Strategy                  Bits     Size            Accuracy     Loss        
--------------------------------------------------------------------------------
Float32 (Baseline)        32          1.0× smaller   95.00%        0.00%
Float16                   16          2.0× smaller   94.80%        0.20%
Dynamic Range             8           4.0× smaller   94.50%        0.50%
Full INT8                 8           4.0× smaller   93.80%        1.20%
INT8 + QAT                8           4.0× smaller   94.90%        0.10%
INT4 (Aggressive)         4           8.0× smaller   89.50%        5.50%
================================================================================

Recommendations:
  • Dynamic Range INT8: Best balance for quick deployment (4× smaller, <1% loss)
  • Full INT8: Required for MCUs, acceptable 1-2% accuracy loss
  • INT8 + QAT: Best accuracy with quantization (<0.5% loss)
  • Avoid INT4 unless extreme size constraints (>5% accuracy loss)

Practical Code Examples

Complete PTQ Workflow

This example shows a complete workflow from training to comparing float32, dynamic range, and full-int8 models.

import tensorflow as tf
import numpy as np
from pathlib import Path

# Step 1: Train a simple MNIST classifier
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model

# Load and preprocess data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# Train model
model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=1)

# Baseline accuracy
baseline_acc = model.evaluate(x_test, y_test, verbose=0)[1]
print(f"Float32 baseline accuracy: {baseline_acc:.4f}")

# Step 2: Convert to Float32 TFLite (baseline)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_float32 = converter.convert()
Path("model_float32.tflite").write_bytes(tflite_float32)

# Step 3: Dynamic Range Quantization (weights only)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_dynamic = converter.convert()
Path("model_dynamic.tflite").write_bytes(tflite_dynamic)

# Step 4: Full Integer Quantization (weights + activations)
def representative_dataset_gen():
    """Yield 200 calibration samples from the validation split
    (model.fit used validation_split=0.1, i.e. the last 10% of x_train)"""
    val_data = x_train[-200:]
    for i in range(200):
        yield [val_data[i:i+1].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_int8 = converter.convert()
Path("model_int8.tflite").write_bytes(tflite_int8)

print("\n✓ Conversion complete! Three models saved:")
print(f"  - model_float32.tflite")
print(f"  - model_dynamic.tflite")
print(f"  - model_int8.tflite")

Key Points:
  • The representative dataset uses 200 held-out validation samples, not data the model was trained on
  • Full-int8 requires target_spec.supported_ops for MCU compatibility
  • Setting inference_input_type=tf.uint8 requires input preprocessing during inference

Model Size and Accuracy Comparison

This code benchmarks file size, accuracy, and memory usage for all quantization variants.

import os
import numpy as np
import tensorflow as tf
from pathlib import Path

def get_model_size(filepath):
    """Get model size in KB"""
    size_bytes = os.path.getsize(filepath)
    return size_bytes / 1024

def evaluate_tflite_model(tflite_path, x_test, y_test, is_int8=False):
    """Evaluate TFLite model accuracy"""
    interpreter = tf.lite.Interpreter(model_path=str(tflite_path))
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Check if model expects quantized input
    input_scale, input_zero_point = input_details[0]['quantization']

    correct = 0
    for i in range(len(x_test)):
        # Prepare input
        if is_int8 and input_scale > 0:
            # Quantize input: uint8 = round(float / scale) + zero_point
            test_input = x_test[i:i+1]
            test_input = np.round(test_input / input_scale + input_zero_point)
            test_input = np.clip(test_input, 0, 255).astype(np.uint8)
        else:
            test_input = x_test[i:i+1].astype(np.float32)

        interpreter.set_tensor(input_details[0]['index'], test_input)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])

        # Dequantize output if needed
        output_scale, output_zero_point = output_details[0]['quantization']
        if output_scale > 0:
            output = output_scale * (output.astype(np.float32) - output_zero_point)

        prediction = np.argmax(output)
        if prediction == y_test[i]:
            correct += 1

    return correct / len(x_test)

# Compare all models
models = {
    'Float32 (baseline)': ('model_float32.tflite', False),
    'Dynamic Range': ('model_dynamic.tflite', False),
    'Full INT8': ('model_int8.tflite', True)
}

print("\n" + "="*70)
print("MODEL COMPARISON: SIZE vs ACCURACY")
print("="*70)
print(f"{'Model Type':<20} {'Size (KB)':<12} {'Accuracy':<12} {'Size Reduction'}")
print("-"*70)

baseline_size = get_model_size('model_float32.tflite')

for name, (path, is_int8) in models.items():
    if Path(path).exists():
        size = get_model_size(path)
        accuracy = evaluate_tflite_model(path, x_test, y_test, is_int8)
        reduction = baseline_size / size

        print(f"{name:<20} {size:>8.2f} KB  {accuracy:>8.4f}    {reduction:>5.2f}×")
    else:
        print(f"{name:<20} [NOT FOUND]")

print("="*70)
print("\nKey Insights:")
print("• Dynamic range gives ~4× size reduction with minimal accuracy loss")
print("• Full INT8 gives same size reduction but is MCU-compatible")
print("• Accuracy drop <2% is excellent for edge deployment")

Expected Output:

======================================================================
MODEL COMPARISON: SIZE vs ACCURACY
======================================================================
Model Type           Size (KB)    Accuracy     Size Reduction
----------------------------------------------------------------------
Float32 (baseline)      84.23 KB    0.9891      1.00×
Dynamic Range           22.45 KB    0.9876      3.75×
Full INT8               22.48 KB    0.9854      3.75×
======================================================================

INT8 Inference Speed Comparison

This example measures real inference time and shows the speedup from quantization.

import time
import numpy as np
import tensorflow as tf
from pathlib import Path

def benchmark_model(tflite_path, x_test, num_runs=100, is_int8=False):
    """
    Benchmark model inference time

    Args:
        tflite_path: Path to .tflite model
        x_test: Test data
        num_runs: Number of inference runs
        is_int8: Whether model expects uint8 input

    Returns:
        avg_time_ms, std_time_ms
    """
    interpreter = tf.lite.Interpreter(model_path=str(tflite_path))
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Prepare test sample
    test_sample = x_test[0:1]

    if is_int8:
        input_scale, input_zero_point = input_details[0]['quantization']
        if input_scale > 0:
            test_sample = test_sample / input_scale + input_zero_point
            test_sample = np.clip(np.round(test_sample), 0, 255).astype(np.uint8)
    else:
        test_sample = test_sample.astype(np.float32)

    # Warmup runs (important for cache/CPU frequency stabilization)
    for _ in range(10):
        interpreter.set_tensor(input_details[0]['index'], test_sample)
        interpreter.invoke()

    # Timed runs
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]['index'], test_sample)
        interpreter.invoke()
        _ = interpreter.get_tensor(output_details[0]['index'])
        end = time.perf_counter()
        times.append((end - start) * 1000)  # Convert to ms

    return np.mean(times), np.std(times)

# Benchmark all models
models = {
    'Float32': ('model_float32.tflite', False),
    'Dynamic Range': ('model_dynamic.tflite', False),
    'Full INT8': ('model_int8.tflite', True)
}

print("\n" + "="*70)
print("INFERENCE SPEED COMPARISON (CPU)")
print("="*70)
print(f"{'Model Type':<20} {'Avg Time (ms)':<15} {'Std Dev (ms)':<15} {'Speedup'}")
print("-"*70)

baseline_time = None

for name, (path, is_int8) in models.items():
    if Path(path).exists():
        avg_time, std_time = benchmark_model(path, x_test, num_runs=100, is_int8=is_int8)

        if baseline_time is None:
            baseline_time = avg_time
            speedup = 1.0
        else:
            speedup = baseline_time / avg_time

        print(f"{name:<20} {avg_time:>10.3f} ms   {std_time:>10.3f} ms   {speedup:>5.2f}×")
    else:
        print(f"{name:<20} [NOT FOUND]")

print("="*70)
print("\nNotes:")
print("• Speedup varies by hardware (CPU, GPU, NPU)")
print("• INT8 speedup is higher on ARM Cortex-M cores (8-50× faster)")
print("• On x86 CPUs, speedup may be modest (1-3×)")
print("• On mobile/edge NPUs, INT8 can be 10-20× faster")

Expected Output (on laptop CPU):

======================================================================
INFERENCE SPEED COMPARISON (CPU)
======================================================================
Model Type           Avg Time (ms)   Std Dev (ms)    Speedup
----------------------------------------------------------------------
Float32                  2.145 ms        0.087 ms     1.00×
Dynamic Range            1.832 ms        0.065 ms     1.17×
Full INT8                0.743 ms        0.042 ms     2.89×
======================================================================

On MCU-class hardware (e.g., an ARM Cortex-M4 board such as the Arduino Nano 33 BLE, or an ESP32):

- Float32: ~45 ms
- INT8: ~5 ms (9× speedup!)

Practical Exercise: Quantize Your Own Model

Task: Take a model from LAB02 or LAB04 and apply all three quantization types.

Steps:

1. Load your trained Keras model
2. Create a representative dataset generator (200-500 samples)
3. Convert to three .tflite variants
4. Compare size, accuracy, and speed
5. Determine which variant is best for your target device

Starter Code:

# 1. Load your model
model = tf.keras.models.load_model('your_model.h5')

# 2. Load validation data for representative dataset
x_val = ...  # Your validation data
y_val = ...

# 3. Create representative dataset
def representative_dataset_gen():
    for i in range(200):
        yield [x_val[i:i+1].astype(np.float32)]

# 4. Convert to INT8
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
Path("your_model_int8.tflite").write_bytes(tflite_model)

# 5. Evaluate and compare
# Use the comparison code from previous examples

Expected Results:

- Size reduction: 3-4×
- Accuracy drop: <3%
- Speed improvement: 2-10× (device-dependent)

If accuracy drops >5%:

1. Increase representative dataset size to 500+ samples
2. Ensure samples cover all classes and edge cases
3. Try Quantization-Aware Training (QAT) - see PDF Chapter 3.5

Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

LAB 03: Model Quantization for Edge Deployment



Overview

| Property | Value |
|---|---|
| Book Chapter | Chapter 03 |
| Execution Levels | Level 1 (Notebook) \| Level 2 (Raspberry Pi) \| Level 3 (MCU) |
| Estimated Time | 60 minutes |
| Prerequisites | LAB 01-02, basic neural network understanding |

Learning Objectives

  1. Understand quantization mathematics - affine mapping, scale/zero-point
  2. Apply post-training quantization - dynamic range and full integer
  3. Implement quantization-aware training for minimal accuracy loss
  4. Deploy quantized models to edge devices

Prerequisites Check

Before You Begin

Make sure you have completed:

- [ ] LAB 01: Introduction to Edge Analytics
- [ ] LAB 02: ML Foundations with TensorFlow
- [ ] Understanding of neural network layers (Dense, Conv2D)

Part 1: The Mathematics of Quantization

1.1 Why Quantization Matters for Edge

Neural network weights are typically stored as 32-bit floating-point numbers:

Float32 bit layout:
┌─────────┬───────────┬──────────────────────────┐
│ Sign(1) │ Exponent(8)│    Mantissa (23 bits)    │
└─────────┴───────────┴──────────────────────────┘
         Total: 32 bits = 4 bytes per weight

Problem: Edge devices have limited memory:

| Device | RAM | Typical Model Limit |
|---|---|---|
| Arduino Nano 33 | 256 KB | ~50K params (200KB) |
| ESP32 | 520 KB | ~100K params (400KB) |
| Raspberry Pi Zero | 512 MB | ~100M params |

Solution: Convert 32-bit floats to 8-bit integers → 4× size reduction!
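The arithmetic behind the 4× claim can be sanity-checked in a couple of lines of Python (decimal KB, matching the figures in the text):

```python
params = 100_000
float32_kb = params * 4 / 1000   # 4 bytes per weight -> 400.0 KB
int8_kb = params * 1 / 1000      # 1 byte per weight  -> 100.0 KB
print(float32_kb, int8_kb, float32_kb / int8_kb)  # 400.0 100.0 4.0
```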

1.2 Affine Quantization: The Core Mathematics

Quantization maps continuous floating-point values to discrete integers:

         Quantization (Q)
Float32  ────────────────►  Int8
[r_min, r_max]            [0, 255] or [-128, 127]

         Dequantization (DQ)
Int8     ────────────────►  Float32

The Affine Quantization Formula

For a real value r and quantized value q:

\(r = S \cdot (q - Z)\)

Where:

- S (Scale): Scaling factor (float32)
- Z (Zero-point): The integer that maps to real 0 (int8)

Inverting for quantization:

\(q = \text{round}\left(\frac{r}{S} + Z\right)\)

Calculating Scale and Zero-Point

Given the range of real values [r_min, r_max] and quantized range [q_min, q_max]:

\(S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}\)

\(Z = \text{round}\left(q_{min} - \frac{r_{min}}{S}\right)\)

Example: Quantizing weights in range [-1.5, 2.3] to int8 [-128, 127]:

S = (2.3 - (-1.5)) / (127 - (-128)) = 3.8 / 255 = 0.0149
Z = round(-128 - (-1.5)/0.0149) = round(-128 + 100.7) = -27

To quantize r = 0.5:
q = round(0.5/0.0149 + (-27)) = round(33.6 - 27) = 7

To dequantize q = 7:
r' = 0.0149 × (7 - (-27)) = 0.0149 × 34 = 0.507  (close to 0.5!)
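The worked example above can be reproduced in a few lines of plain Python (the numbers follow the example; small differences come from rounding S in the text):

```python
def quantize(r, S, Z, q_min=-128, q_max=127):
    """Affine quantization: q = round(r/S + Z), clamped to the int8 range."""
    q = round(r / S + Z)
    return max(q_min, min(q_max, q))

def dequantize(q, S, Z):
    """Affine dequantization: r' = S * (q - Z)."""
    return S * (q - Z)

S = (2.3 - (-1.5)) / (127 - (-128))   # ≈ 0.0149
Z = round(-128 - (-1.5) / S)          # -27
q = quantize(0.5, S, Z)               # 7
r_approx = dequantize(q, S, Z)        # ≈ 0.507 (close to 0.5)
print(S, Z, q, r_approx)
```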

1.3 Quantization Error Analysis

Every quantization introduces quantization error (noise):

\(\epsilon_q = r - DQ(Q(r)) = r - S \cdot (\text{round}(r/S + Z) - Z)\)

The maximum quantization error is bounded by:

\(|\epsilon_q|_{max} = \frac{S}{2} = \frac{r_{max} - r_{min}}{2 \cdot 255}\)

Key insight: The wider the range of values, the larger the quantization error!

Quantization error increases with dynamic range:

Narrow range [-1, 1]:     |ε| ≤ 0.0039
Wide range [-10, 10]:     |ε| ≤ 0.039
Very wide [-100, 100]:    |ε| ≤ 0.39
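These bounds are easy to verify numerically; a small sketch (255 quantization levels for int8, values agree with the table above to within rounding):

```python
def max_quant_error(r_min, r_max, num_levels=255):
    """Worst-case quantization error is half a quantization step: S/2."""
    scale = (r_max - r_min) / num_levels
    return scale / 2

for lo, hi in [(-1, 1), (-10, 10), (-100, 100)]:
    print(f"[{lo}, {hi}]: |eps| <= {max_quant_error(lo, hi):.4f}")
```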

This is why batch normalization helps quantization - it keeps activations in a narrower range!

1.4 Types of Quantization

┌─────────────────────────────────────────────────────────────────┐
│                    QUANTIZATION TAXONOMY                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  By Granularity:                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ Per-Tensor  │  │ Per-Channel │  │  Per-Group  │              │
│  │ (1 scale)   │  │ (N scales)  │  │ (M scales)  │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  By Timing:                                                     │
│  ┌─────────────────────┐    ┌─────────────────────┐             │
│  │   Post-Training     │    │  Quantization-Aware │             │
│  │   Quantization      │    │     Training        │             │
│  │   (PTQ)             │    │     (QAT)           │             │
│  │                     │    │                     │             │
│  │ • No retraining     │    │ • Simulates quant   │             │
│  │ • Fast              │    │ • Better accuracy   │             │
│  │ • May lose accuracy │    │ • Slower            │             │
│  └─────────────────────┘    └─────────────────────┘             │
│                                                                 │
│  By Symmetry:                                                   │
│  ┌─────────────────────┐    ┌─────────────────────┐             │
│  │     Symmetric       │    │     Asymmetric      │             │
│  │   Z = 0 always      │    │   Z ≠ 0 typically   │             │
│  │   [-α, α] → [-128,  │    │   [r_min, r_max]    │             │
│  │            127]     │    │                     │             │
│  └─────────────────────┘    └─────────────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

TensorFlow Lite uses asymmetric quantization by default for better range utilization.
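"Better range utilization" is easy to see numerically: for a skewed value range, the asymmetric scale is finer than the symmetric one. A sketch using the weight range from the earlier example:

```python
import numpy as np

w = np.array([-1.5, -0.2, 0.0, 0.9, 2.3])   # skewed range, as in Section 1.2

# Symmetric: zero-point fixed at 0, so the range must be [-alpha, alpha]
alpha = np.abs(w).max()
S_sym = alpha / 127.0                        # ≈ 0.0181

# Asymmetric: uses the actual [min, max], giving a finer scale
S_asym = (w.max() - w.min()) / 255.0         # ≈ 0.0149

print(S_sym, S_asym)   # asymmetric scale is smaller -> less error per value
```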

Part 2: TensorFlow Lite Conversion

2.1 The TFLite Conversion Pipeline

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Keras     │────►│  SavedModel │────►│   TFLite    │
│   Model     │     │   Format    │     │   FlatBuffer│
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
           ┌───────────────────────────────┐
           │     tf.lite.TFLiteConverter   │
           ├───────────────────────────────┤
           │  • Operator fusion            │
           │  • Constant folding           │
           │  • Quantization (optional)    │
           │  • Delegate mapping           │
           └───────────────────────────────┘

FlatBuffer format is optimized for:

- Zero-copy memory mapping (no parsing overhead)
- Small code footprint
- Forward/backward compatibility

2.2 Simple Example: Linear Regression

We start with a minimal model to understand the conversion process without complexity.
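One minimal version of this pipeline, sketched end to end (the 1-neuron model fit to y = 2x - 1 and its training settings are illustrative choices, not prescribed by the lab):

```python
import numpy as np
import tensorflow as tf

# Tiny 1-neuron linear model learning y ≈ 2x - 1
model = tf.keras.Sequential([tf.keras.Input(shape=(1,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')
x = np.array([[-1.0], [0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
y = 2 * x - 1
model.fit(x, y, epochs=500, verbose=0)

# Convert to a TFLite FlatBuffer (float32, no quantization yet)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Run the FlatBuffer with the interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp['index'], np.array([[10.0]], dtype=np.float32))
interpreter.invoke()
print(interpreter.get_tensor(out['index']))  # close to [[19.]]
```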

2.3 TFLite Interpreter Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   TFLite Interpreter                            │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐           │
│  │   Model     │   │   Memory    │   │   Operator  │           │
│  │   Buffer    │   │   Planner   │   │   Resolver  │           │
│  └──────┬──────┘   └──────┬──────┘   └──────┬──────┘           │
│         │                 │                 │                   │
│         └────────────────┼─────────────────┘                   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    allocate_tensors()                    │   │
│  │  • Allocates input/output buffers                        │   │
│  │  • Plans intermediate tensor memory                      │   │
│  │  • Resolves operator implementations                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                       invoke()                           │   │
│  │  • Executes operators in graph order                     │   │
│  │  • Uses in-place operations where possible               │   │
│  │  • Minimal memory copies                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Part 3: MNIST Classification Model

3.1 Why CNNs for Image Classification?

MNIST Image (28×28 = 784 pixels)

    MLP Approach:                 CNN Approach:
    ┌─────────────┐               ┌─────────────┐
    │ Flatten     │               │  Conv2D     │
    │ 784 inputs  │               │  (3×3×12)   │
    └──────┬──────┘               └──────┬──────┘
           │                             │
    ┌──────▼──────┐               ┌──────▼──────┐
    │   Dense     │               │  MaxPool    │
    │   128       │               │  (2×2)      │
    └──────┬──────┘               └──────┬──────┘
           │                             │
    ┌──────▼──────┐               ┌──────▼──────┐
    │   Dense     │               │  Flatten    │
    │   10        │               │  + Dense    │
    └─────────────┘               └─────────────┘
    
    Parameters:                   Parameters:
    784×128 = 100K               3×3×1×12 = 108
                                  + pooled dense
    
    CNN is MORE EFFICIENT and captures spatial patterns!

Part 4: Post-Training Quantization

4.1 Comparison of Quantization Methods

| Method | Weights | Activations | Size Reduction | Accuracy Impact | Use Case |
|---|---|---|---|---|---|
| No quantization | float32 | float32 | 0% | None | Development |
| Dynamic range | int8 | float32 | ~75% | Minimal | CPU inference |
| Float16 | float16 | float16 | ~50% | Negligible | GPU inference |
| Full integer | int8 | int8 | ~75% | Small | MCU deployment |
| QAT | int8 | int8 | ~75% | Minimal | Production edge |
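Float16 quantization appears in the comparison but not in the earlier conversion code. A hedged sketch of the converter flags (the tiny stand-in model is illustrative; substitute your trained model):

```python
import tensorflow as tf

# Stand-in model for demonstration only
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(2)])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16
tflite_fp16 = converter.convert()
print(f"float16 model: {len(tflite_fp16) / 1024:.2f} KB")
```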

4.2 Representative Dataset: The Key to Accurate Quantization

For full integer quantization, we need to calibrate activation ranges:

Without calibration:          With calibration:
                              
Assumed range: [-1, 1]        Observed range: [-0.3, 2.1]
Actual data:   [-0.3, 2.1]    Scale optimized for actual data
                              
       ┌───┐                         ┌─────────────┐
Values │   │ clipped!         Values │             │
       │   ├────►                    │             │
       │   │                         └─────────────┘
       └───┘                  
  High error                       Low error

Representative dataset requirements:

1. 100-500 samples typically sufficient
2. Cover all classes/categories
3. Include edge cases
4. Match real-world input distribution

Part 5: Quantization-Aware Training (QAT)

5.1 How QAT Works

QAT simulates quantization during training so the model learns to be robust to quantization noise:

Standard Training:              QAT Training:
                                
Forward Pass:                   Forward Pass:
┌─────────┐                     ┌─────────┐
│ Layer   │                     │ Layer   │
│ (fp32)  │                     │ (fp32)  │
└────┬────┘                     └────┬────┘
     │                               │
     ▼                               ▼
┌─────────┐                     ┌─────────────┐
│ Next    │                     │ FakeQuant   │ ← Simulates int8
│ Layer   │                     │ Operation   │   rounding errors
└─────────┘                     └──────┬──────┘
                                       │
                                       ▼
                                ┌─────────┐
                                │ Next    │
                                │ Layer   │
                                └─────────┘
                                
Backward Pass:                  Backward Pass:
Normal gradients                STE (Straight-Through Estimator)
                                passes gradients through FakeQuant

Straight-Through Estimator (STE): During backpropagation, gradients pass through the quantization operation as if it were the identity function. This allows training despite the non-differentiable rounding operation.
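The fake-quant-plus-STE trick can be written in a few lines with tf.stop_gradient: the forward pass sees the rounded value, while the backward pass sees the identity. A minimal sketch (a fixed [-1, 1] range is assumed for simplicity; real QAT learns the ranges):

```python
import tensorflow as tf

def fake_quant_ste(x, r_min=-1.0, r_max=1.0, bits=8):
    """Forward: simulate int8 rounding. Backward: identity (STE)."""
    scale = (r_max - r_min) / (2.0 ** bits - 1.0)
    q = tf.round((tf.clip_by_value(x, r_min, r_max) - r_min) / scale)
    x_q = q * scale + r_min
    # x + stop_gradient(x_q - x): value is x_q, but d/dx is exactly 1
    return x + tf.stop_gradient(x_q - x)

x = tf.Variable([0.337])
with tf.GradientTape() as tape:
    y = fake_quant_ste(x)
print(y.numpy())                     # snapped to the nearest quantization grid point
print(tape.gradient(y, x).numpy())   # [1.] - gradient passes straight through
```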

Part 6: Integer-Only Inference

6.1 Why Integer-Only Matters for MCUs

┌─────────────────────────────────────────────────────────────────┐
│     MCU Arithmetic Units                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Integer ALU (always present):     FPU (often missing/slow):   │
│  ┌─────────────────────────┐       ┌─────────────────────────┐ │
│  │ ADD, SUB, MUL: 1 cycle  │       │ FADD: 3-14 cycles       │ │
│  │ Energy: ~0.1 pJ/op      │       │ FMUL: 3-5 cycles        │ │
│  │ Area: Small             │       │ Energy: ~4-10 pJ/op     │ │
│  └─────────────────────────┘       │ Area: Large             │ │
│                                     └─────────────────────────┘ │
│                                                                 │
│  Example: ARM Cortex-M4             Example: Arduino Uno       │
│  - Has optional FPU                 - NO FPU at all            │
│  - Int8 MAC: 1 cycle               - Float emulated in SW      │
│  - Float MAC: 4 cycles             - Float: 100+ cycles        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Quantized inference computation:

For quantized matrix multiplication:

\(Y_q = Z_Y + \text{round}\left(\frac{S_W \cdot S_X}{S_Y} \cdot \sum_i(W_q^{(i)} - Z_W)(X_q^{(i)} - Z_X)\right)\)

The accumulation runs entirely in integer arithmetic; the only remaining step is one fixed-point multiply by the precomputed constant \(M = S_W S_X / S_Y\).
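The integer-only claim can be checked end to end in NumPy: quantize a small weight and input vector, accumulate in int32, then rescale once at the end. A sketch (per-tensor asymmetric quantization; the output is left in float here, skipping the S_Y/Z_Y step, so it can be compared directly to the float32 dot product):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, 16).astype(np.float32)
X = rng.uniform(0, 1, 16).astype(np.float32)

def affine_params(r_min, r_max, q_min=-128, q_max=127):
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

S_W, Z_W = affine_params(W.min(), W.max())
S_X, Z_X = affine_params(X.min(), X.max())
W_q = np.round(W / S_W + Z_W).astype(np.int32)
X_q = np.round(X / S_X + Z_X).astype(np.int32)

# Integer-only accumulation (what the MCU executes) ...
acc = int(np.sum((W_q - Z_W) * (X_q - Z_X)))
# ... followed by a single rescale by the constant S_W * S_X
y_quant = S_W * S_X * acc
print(y_quant, float(np.dot(W, X)))   # the two values agree closely
```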

Part 7: Final Summary and Deployment

7.1 Summary Comparison

7.2 Three-Tier Deployment

┌─────────────────────────────────────────────────────────────────┐
│                     DEPLOYMENT TIERS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Level 1: Notebook              Level 2: Raspberry Pi          │
│  ┌─────────────────┐            ┌─────────────────┐             │
│  │ tf.lite.        │            │ tflite_runtime  │             │
│  │ Interpreter     │            │ (lightweight)   │             │
│  │                 │            │                 │             │
│  │ Full TF install │            │ pip install     │             │
│  │ for debugging   │            │ tflite-runtime  │             │
│  └─────────────────┘            └─────────────────┘             │
│                                                                 │
│  Level 3: MCU (Arduino/ESP32)                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ TensorFlow Lite Micro (TFLM)                             │   │
│  │                                                          │   │
│  │ 1. Convert .tflite to C array: xxd -i model.tflite       │   │
│  │ 2. Include in Arduino sketch                             │   │
│  │ 3. Use tflite::MicroInterpreter                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
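Step 1 of the MCU flow uses xxd; where xxd is unavailable (e.g., on Windows), an equivalent header can be generated from Python. A sketch (the variable name g_model is a common TFLM convention, not a requirement, and demo.bin is a stand-in for your .tflite file):

```python
from pathlib import Path

def tflite_to_c_array(tflite_path, var_name="g_model"):
    """Emit a C unsigned-char array, equivalent to `xxd -i model.tflite`."""
    data = Path(tflite_path).read_bytes()
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(data)};")
    return "\n".join(lines)

# Example with a tiny stand-in binary file
Path("demo.bin").write_bytes(bytes([0x1c, 0x00, 0x00, 0x00]))
print(tflite_to_c_array("demo.bin"))
```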

Checkpoint: Self-Assessment

Knowledge Check

Before proceeding, make sure you can answer:

  1. What is the affine quantization formula? Write out r = S(q - Z) and explain each term.
  2. Why does quantization reduce model size by ~4×? (Hint: bytes per weight)
  3. What is a representative dataset and why is it critical?
  4. How does QAT differ from post-training quantization?
  5. Why do MCUs prefer integer-only arithmetic?
  6. What determines the maximum quantization error?
Common Pitfalls
  • Forgetting to normalize inputs before quantization
  • Using too few representative samples (<100)
  • Not re-converting the model to TFLite after applying QAT
  • Ignoring input/output quantization for int8 models

Three-Tier Activities

Run the embedded notebook above. Key exercises:

  1. Follow along with the code cells
  2. Modify parameters and observe results
  3. Complete the checkpoint questions

On Level 2 you will inspect and benchmark quantized models without needing a microcontroller:

Netron Model Viewer – Visualize and compare models:

  • Open your .tflite files to see architecture
  • Compare float32 vs int8 quantized models
  • Inspect tensor shapes and parameter counts
  • Export architecture diagrams

Our Quantization Explorer – Interactive bit-width and error comparison

On a Raspberry Pi or similar device you can:

# Install lightweight runtime
pip install --user tflite-runtime numpy

# Run the benchmarking cell in the notebook on the Pi
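The same benchmarking code can run on both the Pi (tflite-runtime) and a laptop (full TensorFlow) if the interpreter import is wrapped. A small sketch (load_interpreter is a hypothetical helper name):

```python
def load_interpreter(model_path):
    """Prefer the lightweight tflite_runtime (Pi); fall back to full TF."""
    try:
        from tflite_runtime.interpreter import Interpreter
    except ImportError:
        import tensorflow as tf
        Interpreter = tf.lite.Interpreter
    interp = Interpreter(model_path=str(model_path))
    interp.allocate_tensors()
    return interp

# Usage (assumes the model file from earlier in the lab exists):
# interpreter = load_interpreter("model_int8.tflite")
```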

Use the notebook to record, for each variant (float32, dynamic range, full int8):

  • .tflite file size (KB)
  • Average inference time on Pi (ms)
  • Validation accuracy (%)

For on-device deployment:

  • Use the full-int8 model produced in this lab
  • Follow LAB05 for the detailed TensorFlow Lite Micro deployment pipeline

At minimum you should:

  1. Confirm the quantized model meets your target size (KB) for the MCU flash budget
  2. Check estimated activation memory against available SRAM
  3. Decide whether the accuracy/size trade-off is acceptable for your edge device

Visual Troubleshooting

Use this flowchart when quantization causes accuracy problems:

flowchart TD
    A[Accuracy drops after quantization] --> B{Accuracy drop amount?}
    B -->|>10% drop| C[Severe degradation]
    B -->|5-10% drop| D[Moderate - fixable]
    B -->|<5% drop| E[Acceptable trade-off]
    C --> F{Using representative dataset?}
    F -->|No| G[Add representative data:<br/>100-500 samples from val set<br/>Cover all classes]
    F -->|Yes| H{Dataset diverse enough?}
    H -->|Limited samples| I[Expand dataset:<br/>Include edge cases<br/>Multiple scenarios]
    H -->|Adequate| J[Try Quantization-Aware Training]
    D --> F
    E --> K[Deploy and monitor]

    style A fill:#ff6b6b
    style G fill:#4ecdc4
    style I fill:#4ecdc4
    style J fill:#4ecdc4
    style K fill:#95e1d3

For complete troubleshooting flowcharts, see: