LAB05: Edge Deployment Pipeline

TensorFlow Lite Micro on Microcontrollers

PDF Textbook Reference

For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 5: Embedded Systems Deployment with TensorFlow Lite in the PDF textbook.

The PDF chapter includes:

  • Detailed memory architecture analysis (Flash vs SRAM)
  • Complete TensorFlow Lite Micro interpreter design
  • In-depth coverage of tensor arena allocation
  • Mathematical memory budget calculations
  • Comprehensive deployment workflow and optimization strategies

Open In Colab


Download Notebook

Learning Objectives

By the end of this lab you will be able to:

  • Convert trained models into quantized TFLite form suitable for microcontrollers
  • Configure the TensorFlow Lite Micro runtime (tensor arena, ops resolver)
  • Integrate inference into a C++ firmware sketch (sensor → features → model → output)
  • Deploy and test the same model on simulator and real hardware

Theory Summary

Deploying ML Models to Microcontrollers

Microcontroller deployment is the final frontier of edge ML - running intelligent models on tiny devices with kilobytes of memory and milliwatts of power. This requires understanding memory constraints, embedded programming, and the TensorFlow Lite Micro runtime.

Understanding MCU Memory Architecture: Microcontrollers have two distinct memory regions with fundamentally different characteristics. Flash (ROM) is non-volatile storage (persists when powered off) where your compiled code and model weights live - typically 1-4 MB. SRAM (RAM) is volatile working memory (lost when powered off) used for runtime computations - typically 256-520 KB. A critical insight: model weights stored in Flash don’t consume SRAM, but inference operations require SRAM for input buffers, intermediate activations, and output tensors.

The Tensor Arena Concept: Unlike desktop systems with dynamic memory allocation (malloc), microcontrollers use a pre-allocated “tensor arena” - a fixed-size SRAM buffer where all inference calculations happen. TensorFlow Lite Micro cleverly reuses this space, overwriting intermediate results from earlier layers once they’re no longer needed. A 20KB model might need 50KB of arena space because intermediate activations can be larger than the model itself. Always profile arena_used_bytes() to find the minimum safe size.
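The reuse idea can be sketched numerically. The following is a deliberate simplification (TFLM's real memory planner tracks full tensor lifetimes, not just adjacent layers), but it shows why the arena peak can exceed the model size: each layer needs its input and output buffers live at the same time, while earlier buffers are recycled. The layer sizes below are hypothetical.

```python
# Simplified model of TFLM arena reuse: layer i needs its input and
# output buffers simultaneously; earlier buffers are reused.
def estimate_arena_peak(activation_sizes):
    """activation_sizes: bytes of each layer's output, in execution order.
    Returns the peak of (input + output) over consecutive layers, which
    approximates the arena requirement under buffer reuse."""
    peak = 0
    for i in range(1, len(activation_sizes)):
        peak = max(peak, activation_sizes[i - 1] + activation_sizes[i])
    return peak

# Hypothetical activation sizes (bytes) for a small CNN whose weights
# total only ~20 KB:
sizes = [9216, 36864, 18432, 4608, 640, 40]
print(estimate_arena_peak(sizes))  # 55296 bytes (~54 KB) - arena > model size
```

Note the peak is set by the two largest adjacent activations, not by the sum of all of them: reuse is what keeps the arena far smaller than the total activation volume.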

Model Conversion Pipeline: Converting a trained model for MCU deployment involves several steps: (1) Quantize to int8 using TFLite converter (LAB03), (2) Save as .tflite binary file, (3) Convert binary to C array using xxd or similar tool, (4) Include array in your firmware as const unsigned char model_data[], (5) Configure ops resolver to include only operations your model uses, (6) Allocate tensor arena, (7) Initialize interpreter and run inference. Each step is critical - missing even one causes cryptic runtime errors.
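Step 3 of the pipeline does not strictly require xxd; a minimal Python stand-in, assuming you just want the same `const unsigned char` array that `xxd -i` emits, looks like this:

```python
# Convert raw .tflite bytes into a C array header, equivalent to
# `xxd -i model.tflite` (step 3 of the conversion pipeline).
def tflite_to_c_array(model_bytes, var_name="model_data"):
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(model_bytes), 12):
        chunk = model_bytes[i:i + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(model_bytes)};")
    return "\n".join(lines)

# Example with dummy bytes; in practice pass open('model.tflite', 'rb').read()
print(tflite_to_c_array(bytes([0x1c, 0x00, 0x00, 0x00])))
```

Write the returned string to `model_data.h` and `#include` it in your sketch (step 4).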

Integration with Sensors: The complete embedded ML pipeline: (1) Read raw sensor data (accelerometer, microphone, camera), (2) Buffer and preprocess (normalize, downsample, extract features), (3) Copy to model input tensor with correct format and scaling, (4) Invoke interpreter, (5) Read output tensor and apply post-processing (argmax for classification, threshold for detection), (6) Act on results (LED, motor, BLE notification). Preprocessing must exactly match training - any mismatch causes silent accuracy degradation.
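The scaling in step 3 is where mismatches usually creep in. These reference helpers, sketched in Python, can be used to sanity-check your C++ implementation; the scale and zero-point below are assumed example values, and you should read the real ones from your model's input and output tensors.

```python
# Reference quantize/dequantize helpers mirroring the C++ input-filling
# code. scale/zero_point here are assumed values for illustration.
def quantize(value, scale, zero_point, lo=-128, hi=127):
    """Float -> quantized int: q = round(value / scale) + zero_point, clamped."""
    q = round(value / scale) + zero_point
    return max(lo, min(hi, q))

def dequantize(q, scale, zero_point):
    """Quantized int -> float: value = scale * (q - zero_point)."""
    return scale * (q - zero_point)

scale, zero_point = 1.0 / 255.0, -128  # assumed int8 input parameters
q = quantize(1.0, scale, zero_point)
print(q, round(dequantize(q, scale, zero_point), 3))  # 127 1.0
```

Running the same test vectors through this and through your firmware's quantization loop is a cheap way to catch the silent-degradation bugs described above.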

Key Concepts at a Glance

Core Concepts
  • Flash Memory: Non-volatile storage for code and model weights (1-4 MB typical)
  • SRAM: Volatile working memory for runtime computations (256-520 KB typical)
  • Tensor Arena: Pre-allocated SRAM buffer for all inference calculations
  • TensorFlow Lite Micro (TFLM): Embedded ML runtime with no dependencies
  • Ops Resolver: Defines which operations your model uses (reduces code size)
  • Model Array: C/C++ array containing .tflite binary data
  • Inference Latency: Time from input to output (target <100ms for real-time)
  • AllocateTensors(): One-time setup allocating memory for all model tensors

Common Pitfalls

Mistakes to Avoid

Tensor Arena Too Small = Cryptic Crashes: The most frustrating deployment bug is when AllocateTensors() fails with no helpful error message. This almost always means your tensor arena is too small. The arena must hold all intermediate activations, not just model weights. A 20KB model might need 50KB of arena! Start large (50KB), then reduce based on arena_used_bytes() output.

Model Size vs Runtime Memory Confusion: Students check if their 30KB model fits in 1MB Flash and assume success. But runtime needs SRAM: model arena (50KB), input buffer (32KB for 1s audio), stack (20KB), etc. A device with 256KB SRAM might only have 150KB available after system overhead. Always calculate total SRAM budget.

Preprocessing Mismatch Between Training and Deployment: If you trained with pixel values 0-1 but deploy with 0-255, your model outputs garbage. If you normalized with ImageNet mean but deploy with raw values, accuracy tanks. Document preprocessing steps and implement them identically in C++. Even small differences (float32 vs float64) can cause issues.

Wrong Ops in Ops Resolver: If your model uses Conv2D but you only register FullyConnected in the ops resolver, you get a runtime error. Use AllOpsResolver during development to avoid this, then switch to MicroMutableOpResolver with only needed ops to save Flash space. Check model architecture to know which ops to include.

Forgetting to Call AllocateTensors(): After creating the interpreter, you must call interpreter.AllocateTensors() before the first inference. Forgetting this causes segmentation faults or silent failures. This is a one-time setup step - don’t call it every inference loop!

Quick Reference

Key Formulas

SRAM Budget Calculation: \[\text{Available SRAM} = \text{Total SRAM} - \text{System Overhead}\] \[\text{Required} = \text{Tensor Arena} + \text{Input Buffer} + \text{Output Buffer} + \text{Stack}\]

Tensor Arena Sizing: \[\text{Min Arena} = \text{arena\_used\_bytes()} \times 1.2\] (20% safety margin recommended)

Inference Latency Budget: \[\text{Max Latency} < \frac{1}{\text{sampling rate}}\] For 10 Hz gesture recognition: latency must be <100ms

Flash Usage: \[\text{Total Flash} = \text{Code} + \text{Model Data} + \text{Constants}\]
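A worked example of these formulas, using assumed numbers for a hypothetical keyword-spotting deployment on a 256 KB-SRAM device (the 47,832-byte arena_used_bytes() figure matches the self-assessment question later in this lab):

```python
import math

# Quick-reference formulas applied to assumed example numbers.
total_sram_kb, system_overhead_kb = 256, 30
available_kb = total_sram_kb - system_overhead_kb   # Available SRAM
required_kb = 50 + 32 + 2 + 20                      # arena + input + output + stack
arena_used = 47832                                  # bytes, from arena_used_bytes()
min_arena_kb = math.ceil(arena_used * 1.2 / 1024)   # 20% safety margin, rounded up
max_latency_ms = 1000 / 10                          # 10 Hz sampling rate
print(available_kb, required_kb, min_arena_kb, max_latency_ms)  # 226 104 57 100.0
```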

Important Parameter Values

Memory Specifications:

| Device              | Flash | SRAM   | Good For                    |
|---------------------|-------|--------|-----------------------------|
| Arduino Nano 33 BLE | 1 MB  | 256 KB | Audio, motion, small models |
| ESP32               | 4 MB  | 520 KB | WiFi + ML, larger models    |
| Raspberry Pi Pico   | 2 MB  | 264 KB | General TinyML              |
| STM32F4             | 1 MB  | 192 KB | Industrial apps             |

Typical Memory Usage:

| Component            | SRAM Usage | Notes                        |
|----------------------|------------|------------------------------|
| Tensor Arena         | 20-100 KB  | Model-dependent, profile it! |
| Audio Buffer (1 s)   | 32 KB      | 16 kHz × 2 bytes             |
| Image Buffer (96×96) | 27 KB      | 96×96×3 RGB                  |
| Stack                | 8-20 KB    | System + local variables     |
| System Overhead      | 10-30 KB   | Arduino core, BLE, etc.      |

Code Patterns:

// Model setup (in setup())
const tflite::Model* model = tflite::GetModel(model_data);
static tflite::MicroMutableOpResolver<5> resolver;
resolver.AddConv2D();
resolver.AddFullyConnected();
// ... add other ops

constexpr int kTensorArenaSize = 30 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

static tflite::MicroInterpreter interpreter(
    model, resolver, tensor_arena, kTensorArenaSize);
interpreter.AllocateTensors();

// Inference (in loop())
TfLiteTensor* input = interpreter.input(0);
// Fill input->data.f or input->data.uint8
interpreter.Invoke();
TfLiteTensor* output = interpreter.output(0);
// Read output->data.f or output->data.uint8

Essential Code Examples

These practical code snippets prepare you for successful deployment to microcontrollers. Run them in the notebook or adapt for your target hardware.

1. Tensor Arena Sizing Calculator

Before deploying, you need to estimate memory requirements. This Python function helps you calculate the tensor arena size based on profiling data.

def calculate_tensor_arena_size(model_path, safety_margin=1.2):
    """
    Estimate required tensor arena size for TFLM deployment.

    Args:
        model_path: Path to .tflite model file
        safety_margin: Multiplier for safety (default 1.2 = 20% extra)

    Returns:
        Dictionary with arena size recommendations
    """
    import tensorflow as tf
    import numpy as np

    # Load the TFLite model
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    # Get input and output details
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Rough activation estimate: sum of all tensor buffers.
    # This includes weight tensors too, so treat it as an upper bound.
    tensor_details = interpreter.get_tensor_details()
    activation_bytes = sum(
        int(np.prod(t['shape'])) * np.dtype(t['dtype']).itemsize
        for t in tensor_details
    )

    # Model size (weights in Flash, not SRAM)
    import os
    model_size = os.path.getsize(model_path)

    # Estimate arena size (activations + buffers + overhead)
    # Rule of thumb: 2-3x model size for activations
    estimated_arena = int(activation_bytes * safety_margin)

    # Round up to next KB for alignment
    arena_kb = ((estimated_arena + 1023) // 1024)

    results = {
        'model_size_bytes': model_size,
        'model_size_kb': model_size / 1024,
        'activation_bytes': activation_bytes,
        'safety_margin': safety_margin,
        'recommended_arena_bytes': estimated_arena,
        'recommended_arena_kb': arena_kb,
        'input_shape': input_details[0]['shape'].tolist(),
        'output_shape': output_details[0]['shape'].tolist(),
    }

    # Print human-readable summary
    print(f"=== Tensor Arena Calculator ===")
    print(f"Model size: {results['model_size_kb']:.1f} KB (stored in Flash)")
    print(f"Estimated activations: {activation_bytes / 1024:.1f} KB")
    print(f"Recommended arena size: {arena_kb} KB (SRAM)")
    print(f"\nIn your Arduino sketch, use:")
    print(f"  constexpr int kTensorArenaSize = {arena_kb} * 1024;")
    print(f"\nInput shape: {results['input_shape']}")
    print(f"Output shape: {results['output_shape']}")

    return results

# Example usage:
# arena_info = calculate_tensor_arena_size('model_quantized.tflite')
Using the Calculator
  1. Run this function on your quantized .tflite model
  2. The recommended arena size includes a 20% safety margin
  3. Start with this value in your MCU code, then profile with arena_used_bytes()
  4. Reduce incrementally if you need to save SRAM

2. TFLite Model Loading and Inspection

Before deployment, inspect your model to understand its structure and requirements.

def inspect_tflite_model(model_path):
    """
    Load and display detailed information about a TFLite model.
    Helps verify quantization and understand input/output requirements.
    """
    import tensorflow as tf
    import numpy as np

    # Load interpreter
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    # Get model metadata
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    tensor_details = interpreter.get_tensor_details()

    print("=== TFLite Model Inspection ===\n")

    # Input information
    print("INPUT TENSOR:")
    for i, inp in enumerate(input_details):
        print(f"  Index {i}:")
        print(f"    Name: {inp['name']}")
        print(f"    Shape: {inp['shape']}")
        print(f"    Type: {inp['dtype']}")
        if 'quantization_parameters' in inp:
            scales = inp['quantization_parameters']['scales']
            zero_points = inp['quantization_parameters']['zero_points']
            if len(scales) > 0:
                print(f"    Quantization - Scale: {scales[0]:.6f}, Zero-point: {zero_points[0]}")

    # Output information
    print("\nOUTPUT TENSOR:")
    for i, out in enumerate(output_details):
        print(f"  Index {i}:")
        print(f"    Name: {out['name']}")
        print(f"    Shape: {out['shape']}")
        print(f"    Type: {out['dtype']}")
        if 'quantization_parameters' in out:
            scales = out['quantization_parameters']['scales']
            zero_points = out['quantization_parameters']['zero_points']
            if len(scales) > 0:
                print(f"    Quantization - Scale: {scales[0]:.6f}, Zero-point: {zero_points[0]}")

    # Count operations (layers)
    print(f"\nMODEL STATISTICS:")
    print(f"  Total tensors: {len(tensor_details)}")

    # Total elements across all tensors (includes activation tensors,
    # so this overestimates the true weight/parameter count)
    total_params = sum(int(np.prod(t['shape'])) for t in tensor_details if len(t['shape']) > 0)
    print(f"  Total tensor elements: {total_params:,}")

    # Check if fully quantized
    is_quantized = input_details[0]['dtype'] in [np.uint8, np.int8]
    print(f"  Input quantized: {is_quantized}")
    print(f"  Quantization type: {'int8' if is_quantized else 'float32'}")

    # File size
    import os
    file_size = os.path.getsize(model_path)
    print(f"  Model file size: {file_size / 1024:.1f} KB")

    print("\n=== Deployment Checklist ===")
    if is_quantized:
        print("✓ Model is quantized (suitable for MCU)")
    else:
        print("✗ Model is float32 (quantize before MCU deployment!)")

    if file_size < 200 * 1024:  # 200KB
        print("✓ Model size OK for Arduino Nano 33 BLE (1MB Flash)")
    else:
        print("⚠ Large model - check Flash capacity of target device")

    return {
        'input_details': input_details,
        'output_details': output_details,
        'is_quantized': is_quantized,
        'file_size_kb': file_size / 1024,
        'total_params': total_params
    }

# Example usage:
# model_info = inspect_tflite_model('model_quantized.tflite')
What to Check
  • Input/Output dtype: Must be uint8 or int8 for microcontrollers
  • Quantization parameters: Verify scale and zero-point are present
  • Model size: Must fit in target Flash memory (e.g., 1MB for Arduino Nano)
  • Input shape: Document this - you’ll need it for preprocessing

3. Memory Budget Worksheet

Planning your memory allocation prevents runtime failures. Use this template to budget SRAM usage.

def create_memory_budget(device_name, total_sram_kb, components):
    """
    Create a memory budget worksheet for MCU deployment.

    Args:
        device_name: Name of target MCU (e.g., "Arduino Nano 33 BLE")
        total_sram_kb: Total SRAM available in KB
        components: Dictionary of component names and sizes in KB

    Example:
        budget = create_memory_budget(
            "Arduino Nano 33 BLE",
            256,
            {
                'system_overhead': 30,
                'tensor_arena': 80,
                'audio_buffer': 32,
                'feature_buffer': 10,
                'stack': 20,
            }
        )
    """
    import pandas as pd

    # Calculate totals
    total_used = sum(components.values())
    available = total_sram_kb - total_used
    utilization = (total_used / total_sram_kb) * 100

    # Create budget table
    budget_data = []
    for component, size_kb in components.items():
        percentage = (size_kb / total_sram_kb) * 100
        budget_data.append({
            'Component': component.replace('_', ' ').title(),
            'Size (KB)': size_kb,
            'Percentage': f"{percentage:.1f}%"
        })

    # Add totals
    budget_data.append({
        'Component': '─' * 30,
        'Size (KB)': '─' * 8,
        'Percentage': '─' * 8
    })
    budget_data.append({
        'Component': 'TOTAL USED',
        'Size (KB)': total_used,
        'Percentage': f"{utilization:.1f}%"
    })
    budget_data.append({
        'Component': 'AVAILABLE',
        'Size (KB)': available,
        'Percentage': f"{(available/total_sram_kb)*100:.1f}%"
    })

    df = pd.DataFrame(budget_data)

    # Print results
    print(f"=== Memory Budget: {device_name} ===")
    print(f"Total SRAM: {total_sram_kb} KB\n")
    print(df.to_string(index=False))
    print()

    # Status check
    if available < 0:
        print("❌ BUDGET EXCEEDED! Reduce component sizes or choose larger MCU.")
        print(f"   Over budget by: {-available:.1f} KB")
    elif available < 20:
        print("⚠️  WARNING: Less than 20KB available margin.")
        print("   Consider reducing sizes for safety.")
    else:
        print(f"✓ Budget OK with {available:.1f} KB safety margin")

    return {
        'device': device_name,
        'total_sram_kb': total_sram_kb,
        'total_used_kb': total_used,
        'available_kb': available,
        'utilization_percent': utilization,
        'status': 'OK' if available >= 20 else ('TIGHT' if available >= 0 else 'EXCEEDED')
    }

# Example: Budget for keyword spotting on Arduino Nano 33 BLE
# budget = create_memory_budget(
#     "Arduino Nano 33 BLE",
#     256,  # Total SRAM
#     {
#         'system_overhead': 30,    # Arduino core, BLE stack
#         'tensor_arena': 50,       # Model inference workspace
#         'audio_buffer': 32,       # 1 second at 16kHz int16
#         'mfcc_features': 8,       # Feature extraction buffer
#         'stack': 20,              # Function calls, local vars
#     }
# )

Common Device SRAM Sizes:

| Device              | Total SRAM | Usable SRAM | Notes                                      |
|---------------------|------------|-------------|--------------------------------------------|
| Arduino Nano 33 BLE | 256 KB     | ~220 KB     | BLE stack uses ~30 KB                      |
| ESP32               | 520 KB     | ~480 KB     | WiFi uses an additional ~40 KB when active |
| Raspberry Pi Pico   | 264 KB     | ~250 KB     | Minimal system overhead                    |
| STM32F4             | 192 KB     | ~180 KB     | Depends on HAL configuration               |
Memory Budget Best Practices
  1. Always include system overhead (20-40 KB for Arduino/BLE)
  2. Add 20% safety margin to avoid edge cases
  3. Profile on real hardware - simulators may not catch all allocation
  4. Test worst-case scenarios - maximum input sizes, all buffers full

4. TFLM Inference Template

This C++ template demonstrates the complete TFLM inference pipeline for microcontrollers. Adapt this for your specific model and sensors.

// ========================================
// TensorFlow Lite Micro Inference Template
// Adapted for Arduino and similar MCUs
// ========================================

#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Include your model data (converted from .tflite to C array)
#include "model_data.h"  // Contains: const unsigned char model_data[]

// ========================================
// 1. GLOBAL CONFIGURATION
// ========================================

// Tensor arena size - adjust based on your model
// Use the tensor arena calculator to determine this value
constexpr int kTensorArenaSize = 50 * 1024;  // 50 KB

// Number of elements in the model's input tensor - set for your model
constexpr int INPUT_SIZE = 128;  // example value, replace with your input size

// Align to 16 bytes for optimal performance
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

// Global pointers for TFLM components
const tflite::Model* model = nullptr;
tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* input_tensor = nullptr;
TfLiteTensor* output_tensor = nullptr;

// ========================================
// 2. SETUP - ONE-TIME INITIALIZATION
// ========================================

void setup() {
  Serial.begin(115200);
  while (!Serial) { delay(10); }

  Serial.println("=== TensorFlow Lite Micro Setup ===");

  // Step 1: Load the model from Flash memory
  model = tflite::GetModel(model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    Serial.print("Model schema version ");
    Serial.print(model->version());
    Serial.print(" doesn't match runtime version ");
    Serial.println(TFLITE_SCHEMA_VERSION);
    while (1);  // Halt - incompatible model
  }
  Serial.println("✓ Model loaded");

  // Step 2: Configure operations resolver
  // Only include operations your model actually uses to save Flash
  // Use AllOpsResolver during development, then switch to specific ops
  static tflite::MicroMutableOpResolver<5> resolver;

  // Add only the operations your model needs
  // Check model architecture to determine required ops
  resolver.AddConv2D();           // For CNNs
  resolver.AddFullyConnected();   // Dense layers
  resolver.AddSoftmax();          // Classification output
  resolver.AddReshape();          // Shape manipulation
  resolver.AddQuantize();         // Quantization ops

  // For development/testing, you can use:
  // static tflite::AllOpsResolver resolver;
  // Then optimize later by switching to specific ops

  Serial.println("✓ Ops resolver configured");

  // Step 3: Create interpreter
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  // Step 4: Allocate tensors (CRITICAL - must call before inference!)
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    Serial.println("✗ AllocateTensors() failed!");
    Serial.println("  Possible causes:");
    Serial.println("  - Tensor arena too small");
    Serial.println("  - Out of SRAM");
    Serial.println("  - Missing required operations");
    while (1);  // Halt
  }
  Serial.println("✓ Tensors allocated");

  // Step 5: Get input and output tensor pointers
  input_tensor = interpreter->input(0);
  output_tensor = interpreter->output(0);

  // Step 6: Print tensor information for debugging
  Serial.println("\nModel I/O Information:");
  Serial.print("  Input shape: [");
  for (int i = 0; i < input_tensor->dims->size; i++) {
    Serial.print(input_tensor->dims->data[i]);
    if (i < input_tensor->dims->size - 1) Serial.print(", ");
  }
  Serial.println("]");

  Serial.print("  Output shape: [");
  for (int i = 0; i < output_tensor->dims->size; i++) {
    Serial.print(output_tensor->dims->data[i]);
    if (i < output_tensor->dims->size - 1) Serial.print(", ");
  }
  Serial.println("]");

  // Step 7: Report memory usage
  Serial.println("\nMemory usage:");
  Serial.print("  Arena size: ");
  Serial.print(kTensorArenaSize / 1024);
  Serial.println(" KB");
  Serial.print("  Arena used: ");
  Serial.print(interpreter->arena_used_bytes() / 1024);
  Serial.println(" KB");
  Serial.print("  Headroom: ");
  Serial.print((kTensorArenaSize - interpreter->arena_used_bytes()) / 1024);
  Serial.println(" KB");

  Serial.println("\n=== Setup Complete ===\n");
}

// ========================================
// 3. INFERENCE LOOP
// ========================================

void loop() {
  // Step 1: Acquire input data from sensors
  // This is application-specific - examples:
  // - Read accelerometer for gesture recognition
  // - Capture audio for keyword spotting
  // - Get camera frame for image classification

  float sensor_data[INPUT_SIZE];  // Replace INPUT_SIZE with actual size
  acquireSensorData(sensor_data);  // Your sensor reading function

  // Step 2: Preprocess and fill input tensor
  // CRITICAL: Preprocessing must match training exactly!

  if (input_tensor->type == kTfLiteFloat32) {
    // Float32 input (rare for MCU deployment)
    float* input_data = input_tensor->data.f;
    for (int i = 0; i < INPUT_SIZE; i++) {
      input_data[i] = sensor_data[i];  // Apply normalization if needed
    }
  } else if (input_tensor->type == kTfLiteUInt8) {
    // Uint8 quantized input (typical for MCU)
    // Quantize: q = (value / scale) + zero_point, using the model's own
    // quantization parameters instead of hardcoded constants
    float scale = input_tensor->params.scale;
    int zero_point = input_tensor->params.zero_point;
    uint8_t* input_data = input_tensor->data.uint8;
    for (int i = 0; i < INPUT_SIZE; i++) {
      int quantized = (int)(sensor_data[i] / scale) + zero_point;
      input_data[i] = constrain(quantized, 0, 255);
    }
  } else if (input_tensor->type == kTfLiteInt8) {
    // Int8 quantized input
    float scale = input_tensor->params.scale;
    int zero_point = input_tensor->params.zero_point;
    int8_t* input_data = input_tensor->data.int8;
    for (int i = 0; i < INPUT_SIZE; i++) {
      int quantized = (int)(sensor_data[i] / scale) + zero_point;
      input_data[i] = constrain(quantized, -128, 127);
    }
  }

  // Step 3: Run inference
  unsigned long start_time = micros();
  TfLiteStatus invoke_status = interpreter->Invoke();
  unsigned long inference_time = micros() - start_time;

  if (invoke_status != kTfLiteOk) {
    Serial.println("Invoke() failed!");
    return;
  }

  // Step 4: Read and interpret output
  // Declare results before the branches so Step 6 can use them afterwards
  int predicted_class = 0;
  float max_prob = 0.0f;

  if (output_tensor->type == kTfLiteFloat32) {
    // Float32 output
    float* output_data = output_tensor->data.f;
    int num_classes = output_tensor->dims->data[1];  // Assuming shape [1, num_classes]

    // Find class with highest probability
    max_prob = output_data[0];
    for (int i = 1; i < num_classes; i++) {
      if (output_data[i] > max_prob) {
        max_prob = output_data[i];
        predicted_class = i;
      }
    }
  } else if (output_tensor->type == kTfLiteUInt8 ||
             output_tensor->type == kTfLiteInt8) {
    // Quantized output - dequantize: value = scale * (q - zero_point),
    // reading scale/zero-point from the tensor's own parameters
    int num_classes = output_tensor->dims->data[1];
    float scale = output_tensor->params.scale;
    int zero_point = output_tensor->params.zero_point;

    for (int i = 0; i < num_classes; i++) {
      int q = (output_tensor->type == kTfLiteInt8)
                  ? (int)output_tensor->data.int8[i]
                  : (int)output_tensor->data.uint8[i];
      float prob = scale * (q - zero_point);
      if (prob > max_prob) {
        max_prob = prob;
        predicted_class = i;
      }
    }
  }

  Serial.print("Prediction: Class ");
  Serial.print(predicted_class);
  Serial.print(" (confidence: ");
  Serial.print(max_prob * 100, 1);
  Serial.println("%)");

  // Step 5: Print inference latency
  Serial.print("Inference time: ");
  Serial.print(inference_time / 1000.0, 2);
  Serial.println(" ms");

  // Step 6: Take action based on prediction
  // Examples:
  // - Turn on LED for specific class
  // - Send BLE notification
  // - Trigger motor/actuator
  // - Log to SD card

  if (predicted_class == 1 && max_prob > 0.7) {
    digitalWrite(LED_BUILTIN, HIGH);  // Activate on confident detection
  } else {
    digitalWrite(LED_BUILTIN, LOW);
  }

  delay(100);  // Adjust based on your application needs
}

// ========================================
// 4. HELPER FUNCTIONS
// ========================================

void acquireSensorData(float* buffer) {
  // Example: Read accelerometer data
  // Replace with your actual sensor code

  // For accelerometer (3 axes):
  // buffer[0] = readAccelX();
  // buffer[1] = readAccelY();
  // buffer[2] = readAccelZ();

  // For audio (window of samples):
  // for (int i = 0; i < AUDIO_SAMPLES; i++) {
  //   buffer[i] = readMicrophone();
  // }

  // For images (pixels):
  // readCameraFrame(buffer);

  // Dummy data for template:
  for (int i = 0; i < INPUT_SIZE; i++) {
    buffer[i] = random(0, 256) / 255.0;  // Normalized 0-1 (upper bound is exclusive)
  }
}
Key Points for TFLM Deployment
  1. AllocateTensors() is mandatory - Call exactly once in setup()
  2. Match quantization parameters - Use exact scale/zero-point from model
  3. Static allocation only - No malloc(), declare arrays as static
  4. Ops resolver optimization - Start with AllOpsResolver, then reduce to specific ops
  5. Profile memory usage - Check arena_used_bytes() and adjust kTensorArenaSize
  6. Preprocessing consistency - Must match training exactly (normalization, scaling)
Deployment Workflow
  1. Train & quantize model in notebook (LAB03, LAB04)
  2. Calculate arena size using calculator above
  3. Inspect model to get input/output details and quantization parameters
  4. Create memory budget to verify SRAM availability
  5. Convert .tflite to C array: xxd -i model.tflite > model_data.h
  6. Adapt this template with your model’s specifics
  7. Test on simulator (Level 2) before hardware
  8. Deploy to device and profile actual performance

Interactive Learning Tools

Test Before Hardware

Validate your firmware logic before dealing with hardware:

  • Wokwi Arduino Simulator - Test TFLM code in browser with virtual sensors. No Arduino required!

  • Our Memory Calculator - Calculate if your model will fit. Input model size, arena size, buffers - get pass/fail

  • PlatformIO - Better debugging than the Arduino IDE. Set breakpoints, inspect memory, profile execution time

Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

Question 1: What is the difference between Flash and SRAM, and why does the distinction matter for ML deployment?

Answer: Flash (ROM) is non-volatile storage that persists when power is off - it stores your compiled firmware code and model weights (typically 1-4 MB). SRAM (RAM) is volatile working memory that is lost when power is off - it’s used for runtime computations including the tensor arena, input buffers, stack, and variables (typically 256-520 KB). Critical insight: model weights in Flash don’t consume SRAM, but inference operations require SRAM for intermediate activations. A 50KB model file (Flash) might need 150KB SRAM to run.

Question 2: arena_used_bytes() reports 47,832 bytes after AllocateTensors(). What tensor arena size should you configure?

Answer: Minimum safe tensor arena = arena_used_bytes() × 1.2 = 47,832 × 1.2 ≈ 57,398 bytes ≈ 57 KB. The 20% safety margin (1.2×) accounts for runtime variability, alignment requirements, and potential increases from different input patterns. In code, you’d set: constexpr int kTensorArenaSize = 57 * 1024; Always start larger during development, profile with arena_used_bytes(), then reduce to this calculated minimum.

Question 3: AllocateTensors() fails on your device with no error message and a 20KB arena. What is the most likely cause, and how do you debug it?

Answer: Tensor arena too small. This is the most common and frustrating deployment bug. The arena must hold all intermediate activations during inference, which can be much larger than the model size. Solutions: (1) Increase kTensorArenaSize from 20KB to 50KB or 100KB initially, (2) Check if total SRAM usage exceeds the device limit (256KB for Nano 33 BLE), (3) Use AllOpsResolver temporarily to rule out missing operations, (4) Add debug prints before/after AllocateTensors() to confirm where failure occurs, (5) Reduce model complexity if arena requirements exceed available SRAM.

Question 4: What happens if deployment preprocessing differs from training preprocessing?

Answer: Any preprocessing mismatch causes silent accuracy degradation or complete failure. If training used pixel values 0-1 (divided by 255) but deployment uses 0-255, the model receives inputs 255× larger than expected, producing garbage outputs. Common mismatches: normalization (0-1 vs 0-255), mean subtraction (ImageNet mean vs none), data types (float32 vs float64), color channel order (RGB vs BGR), and feature extraction parameters. Document all preprocessing steps and implement them identically in C++. Even small differences accumulate to destroy accuracy.

Question 5: Can an Arduino Nano 33 BLE (256 KB SRAM) run an application that needs an 80 KB tensor arena, a 32 KB audio buffer, and 20 KB of stack?

Answer: Only with a thin margin. Total required = 80 + 32 + 20 = 132KB. However, the Arduino core and BLE stack use 20-40KB of SRAM before your code runs. Available SRAM ≈ 256KB - 30KB (system) = 226KB in theory, but in practice often only 180-200KB is safely usable. Your 132KB requirement can work, but you have little margin for error. Safer options: reduce the tensor arena (smaller model), use a smaller audio buffer (500ms instead of 1s), or choose an ESP32 with 520KB SRAM.
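The arithmetic behind this answer, as a quick Python check (the 30 KB system-overhead figure is the assumed Arduino core + BLE value used throughout this lab):

```python
# SRAM budget check for the checkpoint scenario on an Arduino Nano 33 BLE.
components_kb = {'tensor_arena': 80, 'audio_buffer': 32, 'stack': 20}
total_used_kb = sum(components_kb.values())
usable_kb = 256 - 30   # theoretical; in practice often only 180-200 KB
margin_kb = usable_kb - total_used_kb
print(total_used_kb, usable_kb, margin_kb)  # 132 226 94
```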

Hands-On Exercises

Apply the code examples above to practice deployment planning.

Exercise 1: Calculate Tensor Arena Size

Using the tensor arena calculator function from the code examples:

  1. Download or create a quantized .tflite model from LAB03 or LAB04
  2. Run calculate_tensor_arena_size('your_model.tflite')
  3. Record the recommended arena size
  4. Compare with the estimated 2-3× rule of thumb from model size
  5. Add this arena size to your C++ code template

Expected outcome: You should get a specific KB value to use for kTensorArenaSize. If the calculator shows 47KB, use 50KB in your code.

Exercise 2: Inspect Your Model

Using the model inspection function:

  1. Run inspect_tflite_model('your_model.tflite')
  2. Verify the model is fully quantized (input/output dtype should be uint8 or int8)
  3. Record the quantization scale and zero-point values
  4. Document the input and output shapes for your deployment code
  5. Check if model size fits in your target device’s Flash memory

Expected outcome: A deployment checklist showing whether your model is ready for MCU deployment. If it shows “float32”, go back to LAB03 and apply full integer quantization.
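A minimal inspection helper might look like the sketch below. The function names and the 1024 KB Flash budget are assumptions for illustration; the lab's own `inspect_tflite_model` may differ in detail:

```python
def is_mcu_ready(input_dtype, model_size_kb, flash_budget_kb=1024):
    """Pure check: model is fully quantized and fits the Flash budget."""
    return input_dtype in ('int8', 'uint8') and model_size_kb <= flash_budget_kb

def inspect_tflite_model(model_path, flash_budget_kb=1024):
    """Print the deployment-relevant facts and return a readiness verdict."""
    import os
    import numpy as np
    import tensorflow as tf  # on a Pi, tflite_runtime.interpreter works too
    interp = tf.lite.Interpreter(model_path=model_path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    dtype = np.dtype(inp['dtype']).name
    size_kb = os.path.getsize(model_path) / 1024
    print(f"input dtype: {dtype}, quantization (scale, zero_point): {inp['quantization']}")
    print(f"input shape: {inp['shape'].tolist()}, model size: {size_kb:.1f} KB")
    return is_mcu_ready(dtype, size_kb, flash_budget_kb)
```

The pure check is separated out so the pass/fail rule (quantized dtype plus Flash fit) is testable without loading a model.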

Exercise 3: Create a Memory Budget

Plan your memory allocation for a keyword spotting application on Arduino Nano 33 BLE:

  1. Use the create_memory_budget() function with these components:
    • System overhead: 30 KB
    • Tensor arena: (use value from Exercise 1)
    • Audio buffer: 32 KB (1 second at 16kHz)
    • Feature buffer: 8 KB (MFCC features)
    • Stack: 20 KB
  2. Check if total fits in 256 KB SRAM
  3. If budget is exceeded, identify which component to reduce
  4. Calculate the utilization percentage

Expected outcome: A budget table showing each component’s size and whether deployment is feasible. If over budget, try reducing audio buffer to 16KB (500ms) or using a smaller model.
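The budget logic itself is simple enough to sketch inline (the signature below is an assumption; the lab's `create_memory_budget` may take different arguments). The component values match the exercise above, with a placeholder 50 KB arena standing in for your Exercise 1 result:

```python
def create_memory_budget(components_kb, total_sram_kb=256):
    """Sum the components and report fit against total SRAM (sketch)."""
    used = sum(components_kb.values())
    return {
        'used_kb': used,
        'available_kb': total_sram_kb - used,
        'utilization_pct': 100 * used / total_sram_kb,
        'fits': used <= total_sram_kb,
    }

budget = create_memory_budget({
    'System overhead': 30,
    'Tensor arena': 50,    # substitute your Exercise 1 value here
    'Audio buffer': 32,
    'Feature buffer': 8,
    'Stack': 20,
})
print(budget)  # 140 KB used, 116 KB available, ~54.7% utilization
```

With these example numbers the budget fits comfortably; swapping in a larger arena or audio buffer shows immediately which component pushes you over the limit.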

Exercise 4: Adapt the TFLM Template

Customize the C++ inference template for your application:

  1. Replace model_data.h include with your converted model array
  2. Update kTensorArenaSize with value from Exercise 1
  3. Add the specific operations your model uses to the ops resolver
  4. Replace the acquireSensorData() function with your sensor reading code
  5. Update the quantization scale and zero-point values from Exercise 2
  6. Modify the output handling for your specific application

Expected outcome: A complete, compilable Arduino sketch ready for deployment. Test compilation even if you don’t have hardware yet.

Integration Challenge

Combine all four exercises into a complete deployment workflow:

  1. Start with a trained model from LAB04 (keyword spotting)
  2. Calculate tensor arena → 50 KB
  3. Inspect model → confirms int8 quantization
  4. Create budget → shows 140 KB total (fits in 256 KB with 116 KB margin)
  5. Adapt template → compiles successfully
  6. (Optional) Deploy to hardware or Wokwi simulator

This workflow simulates the real process of taking a model from training to deployment.

Try It Yourself: Executable Python Examples

The following code blocks are fully executable examples that demonstrate key deployment concepts. Run them directly in this Quarto document or in a Jupyter notebook.

1. TFLite Model Conversion Demo

Convert a simple Keras model to TFLite format and compare sizes:

Code
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Create a simple neural network
print("=== Creating Sample Model ===")
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
print(f"Model created: {model.count_params()} parameters")

# Generate dummy training data
X_train = np.random.randn(100, 10).astype(np.float32)
y_train = np.random.randint(0, 3, 100)
model.fit(X_train, y_train, epochs=3, verbose=0)

# Save as full Keras model
model.save('/tmp/model_full.h5')

# Convert to TFLite (float32)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model_float = converter.convert()
with open('/tmp/model_float.tflite', 'wb') as f:
    f.write(tflite_model_float)

# Convert to TFLite with int8 quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
def representative_dataset():
    for i in range(100):
        yield [X_train[i:i+1]]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()
with open('/tmp/model_quantized.tflite', 'wb') as f:
    f.write(tflite_model_quant)

# Compare sizes
import os
size_keras = os.path.getsize('/tmp/model_full.h5') / 1024
size_float = len(tflite_model_float) / 1024
size_quant = len(tflite_model_quant) / 1024

print(f"\n=== Model Size Comparison ===")
print(f"Keras (.h5):          {size_keras:.2f} KB")
print(f"TFLite (float32):     {size_float:.2f} KB")
print(f"TFLite (int8):        {size_quant:.2f} KB")
print(f"Size reduction:       {(1 - size_quant/size_keras)*100:.1f}%")

# Visualize size comparison
sizes = [size_keras, size_float, size_quant]
labels = ['Keras\n(full)', 'TFLite\n(float32)', 'TFLite\n(int8)']
colors = ['#ff6b6b', '#4ecdc4', '#45b7d1']

plt.figure(figsize=(8, 5))
bars = plt.bar(labels, sizes, color=colors, edgecolor='black', linewidth=1.5)
plt.ylabel('Size (KB)', fontsize=12)
plt.title('Model Size Comparison: Keras vs TFLite', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, size in zip(bars, sizes):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{size:.1f} KB',
             ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nInsight: for larger models, int8 quantization typically cuts size by ~75%; this tiny model is dominated by flatbuffer overhead, so the float32 and int8 files are nearly identical in size.")
=== Creating Sample Model ===
Model created: 339 parameters

=== Model Size Comparison ===
Keras (.h5):          29.94 KB
TFLite (float32):     3.36 KB
TFLite (int8):        3.36 KB
Size reduction:       88.8%

Insight: for larger models, int8 quantization typically cuts size by ~75%; this tiny model is dominated by flatbuffer overhead, so the float32 and int8 files are nearly identical in size.

2. Model Size Analysis and Comparison

Analyze model architecture and estimate memory requirements:

Code
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def analyze_tflite_model(model_path):
    """Detailed analysis of TFLite model memory requirements"""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    tensor_details = interpreter.get_tensor_details()

    # Calculate sizes
    results = {
        'input_shape': input_details[0]['shape'].tolist(),
        'output_shape': output_details[0]['shape'].tolist(),
        'input_dtype': str(input_details[0]['dtype']),
        'output_dtype': str(output_details[0]['dtype']),
        'num_tensors': len(tensor_details),
    }

    # Count elements across all tensors (includes I/O and intermediate
    # tensors, so this slightly exceeds the trainable parameter count)
    total_params = 0
    for tensor in tensor_details:
        if len(tensor['shape']) > 0:
            params = np.prod(tensor['shape'])
            total_params += params

    results['total_parameters'] = int(total_params)

    import os
    model_size = os.path.getsize(model_path)
    results['model_size_bytes'] = model_size
    results['model_size_kb'] = model_size / 1024

    # Estimate tensor arena (2-3x model size is typical)
    estimated_arena = model_size * 2.5
    results['estimated_arena_kb'] = estimated_arena / 1024

    return results

# Analyze both models
print("=== TFLite Model Analysis ===\n")

float_analysis = analyze_tflite_model('/tmp/model_float.tflite')
quant_analysis = analyze_tflite_model('/tmp/model_quantized.tflite')

# Create comparison dataframe
comparison_data = {
    'Metric': [
        'Model Size (KB)',
        'Input Shape',
        'Output Shape',
        'Data Type',
        'Total Tensors',
        'Parameters',
        'Est. Arena (KB)'
    ],
    'Float32 Model': [
        f"{float_analysis['model_size_kb']:.2f}",
        str(float_analysis['input_shape']),
        str(float_analysis['output_shape']),
        float_analysis['input_dtype'],
        float_analysis['num_tensors'],
        f"{float_analysis['total_parameters']:,}",
        f"{float_analysis['estimated_arena_kb']:.1f}"
    ],
    'Int8 Model': [
        f"{quant_analysis['model_size_kb']:.2f}",
        str(quant_analysis['input_shape']),
        str(quant_analysis['output_shape']),
        quant_analysis['input_dtype'],
        quant_analysis['num_tensors'],
        f"{quant_analysis['total_parameters']:,}",
        f"{quant_analysis['estimated_arena_kb']:.1f}"
    ]
}

df = pd.DataFrame(comparison_data)
print(df.to_string(index=False))

# Visualize memory budget for Arduino Nano 33 BLE (256 KB SRAM)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Float32 memory budget
float_arena = float_analysis['estimated_arena_kb']
components_float = {
    'Tensor Arena': float_arena,
    'Input Buffer': 2,
    'Output Buffer': 1,
    'Stack': 20,
    'System Overhead': 30,
    'Available': max(0, 256 - float_arena - 2 - 1 - 20 - 30)
}

ax1.pie(components_float.values(), labels=components_float.keys(), autopct='%1.1f%%',
        startangle=90, colors=['#ff6b6b', '#feca57', '#48dbfb', '#ff9ff3', '#54a0ff', '#1dd1a1'])
ax1.set_title(f'Float32 Model Memory Budget\n(Total: 256 KB SRAM)', fontweight='bold')

# Int8 memory budget
quant_arena = quant_analysis['estimated_arena_kb']
components_quant = {
    'Tensor Arena': quant_arena,
    'Input Buffer': 2,
    'Output Buffer': 1,
    'Stack': 20,
    'System Overhead': 30,
    'Available': max(0, 256 - quant_arena - 2 - 1 - 20 - 30)
}

ax2.pie(components_quant.values(), labels=components_quant.keys(), autopct='%1.1f%%',
        startangle=90, colors=['#45b7d1', '#feca57', '#48dbfb', '#ff9ff3', '#54a0ff', '#1dd1a1'])
ax2.set_title(f'Int8 Model Memory Budget\n(Total: 256 KB SRAM)', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nInsight: The int8 model requires {quant_arena:.1f} KB arena vs {float_arena:.1f} KB for float32,")
print(f"leaving {components_quant['Available']:.1f} KB available (vs {components_float['Available']:.1f} KB for float32).")
=== TFLite Model Analysis ===

         Metric           Float32 Model           Int8 Model
Model Size (KB)                    3.36                 3.36
    Input Shape                 [1, 10]              [1, 10]
   Output Shape                  [1, 3]               [1, 3]
      Data Type <class 'numpy.float32'> <class 'numpy.int8'>
  Total Tensors                      11                   11
     Parameters                     379                  379
Est. Arena (KB)                     8.4                  8.4

Insight: The int8 model requires 8.4 KB arena vs 8.4 KB for float32,
leaving 194.6 KB available (vs 194.6 KB for float32).

3. Inference Time Benchmarking

Benchmark inference latency on different model formats:

Code
import tensorflow as tf
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt

def benchmark_inference(model_path, num_runs=100):
    """Benchmark TFLite inference speed"""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Generate random input
    input_shape = input_details[0]['shape']
    input_dtype = input_details[0]['dtype']

    if input_dtype == np.int8:
        test_input = np.random.randint(-128, 127, input_shape, dtype=np.int8)
    else:
        test_input = np.random.randn(*input_shape).astype(np.float32)

    # Warm-up runs
    for _ in range(10):
        interpreter.set_tensor(input_details[0]['index'], test_input)
        interpreter.invoke()

    # Benchmark
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]['index'], test_input)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        end = time.perf_counter()
        times.append((end - start) * 1000)  # Convert to ms

    return {
        'mean_ms': np.mean(times),
        'std_ms': np.std(times),
        'min_ms': np.min(times),
        'max_ms': np.max(times),
        'p50_ms': np.percentile(times, 50),
        'p95_ms': np.percentile(times, 95),
        'p99_ms': np.percentile(times, 99),
        'all_times': times
    }

print("=== Inference Latency Benchmarking ===\n")
print("Running 100 inference iterations for each model...\n")

# Benchmark both models
float_bench = benchmark_inference('/tmp/model_float.tflite')
quant_bench = benchmark_inference('/tmp/model_quantized.tflite')

# Display results
results_data = {
    'Metric': ['Mean (ms)', 'Std Dev (ms)', 'Min (ms)', 'Max (ms)',
               'P50 (ms)', 'P95 (ms)', 'P99 (ms)'],
    'Float32': [
        f"{float_bench['mean_ms']:.3f}",
        f"{float_bench['std_ms']:.3f}",
        f"{float_bench['min_ms']:.3f}",
        f"{float_bench['max_ms']:.3f}",
        f"{float_bench['p50_ms']:.3f}",
        f"{float_bench['p95_ms']:.3f}",
        f"{float_bench['p99_ms']:.3f}"
    ],
    'Int8': [
        f"{quant_bench['mean_ms']:.3f}",
        f"{quant_bench['std_ms']:.3f}",
        f"{quant_bench['min_ms']:.3f}",
        f"{quant_bench['max_ms']:.3f}",
        f"{quant_bench['p50_ms']:.3f}",
        f"{quant_bench['p95_ms']:.3f}",
        f"{quant_bench['p99_ms']:.3f}"
    ]
}

df_bench = pd.DataFrame(results_data)
print(df_bench.to_string(index=False))

# Visualize latency distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram comparison
axes[0].hist(float_bench['all_times'], bins=30, alpha=0.6, label='Float32', color='#ff6b6b', edgecolor='black')
axes[0].hist(quant_bench['all_times'], bins=30, alpha=0.6, label='Int8', color='#45b7d1', edgecolor='black')
axes[0].axvline(float_bench['mean_ms'], color='#ff6b6b', linestyle='--', linewidth=2, label=f'Float32 Mean: {float_bench["mean_ms"]:.2f}ms')
axes[0].axvline(quant_bench['mean_ms'], color='#45b7d1', linestyle='--', linewidth=2, label=f'Int8 Mean: {quant_bench["mean_ms"]:.2f}ms')
axes[0].set_xlabel('Latency (ms)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Inference Latency Distribution', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot comparison
box_data = [float_bench['all_times'], quant_bench['all_times']]
bp = axes[1].boxplot(box_data, labels=['Float32', 'Int8'], patch_artist=True,
                      medianprops=dict(color='red', linewidth=2),
                      boxprops=dict(facecolor='lightblue', edgecolor='black', linewidth=1.5))
bp['boxes'][0].set_facecolor('#ff6b6b')
bp['boxes'][1].set_facecolor('#45b7d1')
axes[1].set_ylabel('Latency (ms)', fontsize=11)
axes[1].set_title('Inference Latency Box Plot', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

speedup = float_bench['mean_ms'] / quant_bench['mean_ms']
print(f"\nInsight: Int8 quantized model is {speedup:.2f}x {'faster' if speedup > 1 else 'slower'} than float32.")
print(f"For real-time applications requiring <100ms latency, both models are suitable on desktop CPUs.")
print(f"On microcontrollers (Cortex-M4 @ 64 MHz), expect 10-50x slower performance.")
=== Inference Latency Benchmarking ===

Running 100 inference iterations for each model...

      Metric Float32  Int8
   Mean (ms)   0.005 0.002
Std Dev (ms)   0.022 0.000
    Min (ms)   0.002 0.002
    Max (ms)   0.221 0.003
    P50 (ms)   0.002 0.002
    P95 (ms)   0.007 0.002
    P99 (ms)   0.013 0.002

Insight: Int8 quantized model is 2.28x faster than float32.
For real-time applications requiring <100ms latency, both models are suitable on desktop CPUs.
On microcontrollers (Cortex-M4 @ 64 MHz), expect 10-50x slower performance.

4. Memory Requirement Calculator

Interactive calculator for MCU memory budgeting:

Code
import matplotlib.pyplot as plt
import numpy as np

def calculate_mcu_budget(device_name, total_sram_kb, tensor_arena_kb,
                         buffer_sizes_kb, system_overhead_kb=30):
    """
    Calculate complete memory budget for MCU deployment.

    Args:
        device_name: Name of target MCU
        total_sram_kb: Total SRAM available
        tensor_arena_kb: Tensor arena size
        buffer_sizes_kb: Dict of buffer names and sizes
        system_overhead_kb: System/library overhead
    """
    components = {
        'Tensor Arena': tensor_arena_kb,
        'System Overhead': system_overhead_kb,
    }
    components.update(buffer_sizes_kb)

    total_used = sum(components.values())
    available = total_sram_kb - total_used
    utilization_pct = (total_used / total_sram_kb) * 100

    # Print budget table
    print(f"=== Memory Budget: {device_name} ===")
    print(f"Total SRAM: {total_sram_kb} KB\n")

    print(f"{'Component':<25} {'Size (KB)':<12} {'Percentage':<12}")
    print("-" * 50)
    for component, size in components.items():
        pct = (size / total_sram_kb) * 100
        print(f"{component:<25} {size:<12.1f} {pct:<12.1f}%")
    print("-" * 50)
    print(f"{'TOTAL USED':<25} {total_used:<12.1f} {utilization_pct:<12.1f}%")
    print(f"{'AVAILABLE':<25} {available:<12.1f} {(available/total_sram_kb)*100:<12.1f}%")
    print()

    # Status assessment
    if available < 0:
        status = "EXCEEDED"
        print(f"❌ BUDGET EXCEEDED by {-available:.1f} KB!")
        print(f"   Recommendations:")
        print(f"   - Reduce tensor arena size (use smaller model)")
        print(f"   - Decrease buffer sizes")
        print(f"   - Choose MCU with more SRAM (ESP32: 520 KB)")
    elif available < 20:
        status = "TIGHT"
        print(f"⚠️  WARNING: Only {available:.1f} KB available margin.")
        print(f"   Consider reducing component sizes for safety.")
    else:
        status = "OK"
        print(f"✅ Budget OK with {available:.1f} KB safety margin.")

    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Pie chart
    plot_components = {**components, 'Available': max(0, available)}
    colors = ['#ff6b6b', '#feca57', '#48dbfb', '#ff9ff3', '#54a0ff', '#1dd1a1']
    ax1.pie(plot_components.values(), labels=plot_components.keys(), autopct='%1.1f%%',
            startangle=90, colors=colors[:len(plot_components)])
    ax1.set_title(f'{device_name} Memory Allocation\n({total_sram_kb} KB Total SRAM)',
                  fontweight='bold', fontsize=12)

    # Bar chart
    sorted_components = sorted(components.items(), key=lambda x: x[1], reverse=True)
    names, values = zip(*sorted_components)
    names = list(names) + ['Available']
    values = list(values) + [max(0, available)]

    bar_colors = ['#ff6b6b' if v > 50 else '#4ecdc4' if v > 20 else '#95afc0'
                  for v in values[:-1]]
    bar_colors.append('#1dd1a1' if available > 20 else '#ffa502' if available > 0 else '#ff4757')

    bars = ax2.barh(names, values, color=bar_colors, edgecolor='black', linewidth=1.2)
    ax2.set_xlabel('Size (KB)', fontsize=11)
    ax2.set_title(f'Memory Budget Breakdown', fontweight='bold', fontsize=12)
    ax2.grid(axis='x', alpha=0.3)

    # Add value labels
    for bar, value in zip(bars, values):
        width = bar.get_width()
        ax2.text(width, bar.get_y() + bar.get_height()/2,
                f' {value:.1f} KB', ha='left', va='center', fontweight='bold')

    plt.tight_layout()
    plt.show()

    return {
        'device': device_name,
        'total_sram_kb': total_sram_kb,
        'total_used_kb': total_used,
        'available_kb': available,
        'utilization_pct': utilization_pct,
        'status': status
    }

# Example: Keyword spotting on Arduino Nano 33 BLE
print("### Example 1: Keyword Spotting on Arduino Nano 33 BLE ###\n")
kws_budget = calculate_mcu_budget(
    device_name="Arduino Nano 33 BLE",
    total_sram_kb=256,
    tensor_arena_kb=quant_analysis['estimated_arena_kb'],
    buffer_sizes_kb={
        'Audio Buffer (1s)': 32,
        'MFCC Features': 8,
        'Output Buffer': 1,
        'Stack': 20
    }
)

print("\n" + "="*60 + "\n")

# Example 2: Image classification on ESP32
print("### Example 2: Image Classification on ESP32 ###\n")
esp32_budget = calculate_mcu_budget(
    device_name="ESP32",
    total_sram_kb=520,
    tensor_arena_kb=80,
    buffer_sizes_kb={
        'Image Buffer (96x96 RGB)': 27,
        'Preprocessed Input': 10,
        'Output Buffer': 1,
        'Stack': 25,
        'WiFi Stack': 40
    },
    system_overhead_kb=40
)

print("\nInsight: ESP32's larger SRAM (520 KB) enables more complex models and WiFi connectivity,")
print("while Arduino Nano 33 BLE (256 KB) is suitable for simpler models with BLE communication.")
### Example 1: Keyword Spotting on Arduino Nano 33 BLE ###

=== Memory Budget: Arduino Nano 33 BLE ===
Total SRAM: 256 KB

Component                 Size (KB)    Percentage  
--------------------------------------------------
Tensor Arena              8.4          3.3         %
System Overhead           30.0         11.7        %
Audio Buffer (1s)         32.0         12.5        %
MFCC Features             8.0          3.1         %
Output Buffer             1.0          0.4         %
Stack                     20.0         7.8         %
--------------------------------------------------
TOTAL USED                99.4         38.8        %
AVAILABLE                 156.6        61.2        %

✅ Budget OK with 156.6 KB safety margin.

============================================================

### Example 2: Image Classification on ESP32 ###

=== Memory Budget: ESP32 ===
Total SRAM: 520 KB

Component                 Size (KB)    Percentage  
--------------------------------------------------
Tensor Arena              80.0         15.4        %
System Overhead           40.0         7.7         %
Image Buffer (96x96 RGB)  27.0         5.2         %
Preprocessed Input        10.0         1.9         %
Output Buffer             1.0          0.2         %
Stack                     25.0         4.8         %
WiFi Stack                40.0         7.7         %
--------------------------------------------------
TOTAL USED                223.0        42.9        %
AVAILABLE                 297.0        57.1        %

✅ Budget OK with 297.0 KB safety margin.

Insight: ESP32's larger SRAM (520 KB) enables more complex models and WiFi connectivity,
while Arduino Nano 33 BLE (256 KB) is suitable for simpler models with BLE communication.

Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

LAB 05: Deploying ML Systems on Edge

Open In Colab View on GitHub

Book Reference: Chapter 5 - Deploying ML Systems on Edge

Learning Objectives

  • Understand the differences between Python and C++ for embedded systems
  • Set up the Arduino development environment for TinyML
  • Deploy TensorFlow Lite Micro models to microcontrollers
  • Manage memory constraints on edge devices (SRAM, Flash)
  • Profile and optimize inference latency on real hardware

Execution Tier Selection

Select your execution tier based on available hardware:

Tier      Environment         Hardware Required
Level 1   Jupyter Notebook    Laptop/PC only
Level 2   TFLite Runtime      Raspberry Pi or host simulation
Level 3   Device Deployment   Arduino Nano 33 BLE / ESP32

Part 1: Understanding Edge Deployment

Edge deployment involves converting ML models to run on resource-constrained devices. The key challenges are:

  1. Memory Constraints: MCUs have KB of RAM vs GB on desktops
  2. No Operating System: Bare-metal execution
  3. Fixed-Point Arithmetic: Integer-only operations for efficiency
  4. Power Budget: Battery-powered devices need efficiency

Part 2: Model Conversion Pipeline

The deployment pipeline transforms a trained model into edge-ready format:

TensorFlow Model → TFLite Converter → Quantization → C Array → MCU Firmware

Part 3: Converting to C Array for MCU

For MCU deployment, the TFLite model must be converted to a C array that can be compiled into firmware.
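The usual tool for this is `xxd -i model.tflite > model_data.h`, but the same conversion is easy to do in Python when `xxd` is unavailable. The sketch below emits an `xxd`-style header; the `alignas(16)` qualifier is an addition (not produced by `xxd`) because TFLite Micro expects the model data to be suitably aligned:

```python
def tflite_to_c_array(model_bytes, var_name="model_data"):
    """Emit a C header body equivalent to `xxd -i`, with alignment (sketch)."""
    lines = [f"alignas(16) const unsigned char {var_name}[] = {{"]
    for i in range(0, len(model_bytes), 12):
        # format 12 bytes per line as hex literals
        chunk = ", ".join(f"0x{b:02x}" for b in model_bytes[i:i + 12])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(model_bytes)};")
    return "\n".join(lines)

# Usage (paths are illustrative):
# with open('model_quantized.tflite', 'rb') as f:
#     header = tflite_to_c_array(f.read())
# with open('model_data.h', 'w') as f:
#     f.write(header)
```

The resulting `model_data.h` is what the Arduino sketch later includes, and the variable name here matches the `model_data` symbol used in that sketch.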

Part 4: Memory Planning for MCU

Understanding memory layout is critical for successful MCU deployment.
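The key distinction is that the model array occupies Flash while the tensor arena occupies SRAM, so the two budgets must be checked separately. A minimal fit check might look like this (the default firmware and system overhead figures are assumed placeholders, not measured values):

```python
def check_memory_fit(model_kb, arena_kb, flash_kb=1024, sram_kb=256,
                     firmware_kb=150, system_sram_kb=30):
    """Sketch: weights live in Flash, the arena lives in SRAM."""
    flash_used = firmware_kb + model_kb    # compiled code + const model array
    sram_used = system_sram_kb + arena_kb  # runtime state + inference scratch
    return {
        'flash_ok': flash_used <= flash_kb,
        'sram_ok': sram_used <= sram_kb,
        'flash_free_kb': flash_kb - flash_used,
        'sram_free_kb': sram_kb - sram_used,
    }

print(check_memory_fit(model_kb=20, arena_kb=50))
```

Note that a model can pass the Flash check while failing the SRAM check (or vice versa); a 200 KB model with a small arena and a 20 KB model with a huge arena fail in opposite ways.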

Part 5: Arduino Sketch Structure

Here’s the typical structure of a TFLite Micro Arduino sketch:

#include <TensorFlowLite.h>
#include "model_data.h"  // Your converted model

// Tensor arena (memory for inference)
constexpr int kTensorArenaSize = 50 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

// TFLite objects
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;
TfLiteTensor* output;

void setup() {
    // Load model and check schema compatibility
    const tflite::Model* model = tflite::GetModel(model_data);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        while (true) {}  // Model built with an incompatible converter version
    }
    
    // Register only the operations the model uses
    static tflite::MicroMutableOpResolver<4> resolver;
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    
    // Static allocation avoids heap use on the MCU
    static tflite::MicroInterpreter static_interpreter(
        model, resolver, tensor_arena, kTensorArenaSize);
    interpreter = &static_interpreter;
    
    if (interpreter->AllocateTensors() != kTfLiteOk) {
        while (true) {}  // Arena too small or op missing from resolver
    }
    
    input = interpreter->input(0);
    output = interpreter->output(0);
}

void loop() {
    // Read sensor data into input tensor
    // ...
    
    // Run inference and check for failure
    if (interpreter->Invoke() != kTfLiteOk) {
        return;  // Skip this cycle on inference error
    }
    
    // Process output
    // ...
}

Part 6: Tier-Specific Execution

Tier 1: Simulation in Python

Tier 2: Running on Raspberry Pi

For Tier 2 execution on Raspberry Pi:

# Install TFLite runtime on Pi
pip install tflite-runtime

# Run the same Python code, replacing:
# import tensorflow as tf
# with:
# import tflite_runtime.interpreter as tflite

Tier 3: MCU Deployment Checklist

For Tier 3 deployment to Arduino/ESP32, follow the step-by-step checklist in the Three-Tier Activities section below.

Summary

In this lab, you learned:

  1. Edge Constraints: Memory (RAM/Flash) and compute limitations on MCUs
  2. Model Conversion: TensorFlow → TFLite → C Array pipeline
  3. Quantization: Reducing model size with minimal accuracy loss
  4. Memory Planning: Understanding tensor arena and memory layout
  5. Arduino Integration: Structuring TFLite Micro sketches

Next Steps

  • Proceed to LAB 08 for sensor integration
  • Try deploying a keyword spotting model (LAB 4-5)
  • Explore computer vision models (LAB 16)

Three-Tier Activities

For Level 1, run the embedded notebook above. Key exercises:

  1. Follow along with the code cells
  2. Modify parameters and observe results
  3. Complete the checkpoint questions

Use Level 2 to validate your firmware without needing a physical board:

  • Wokwi Arduino simulator
    • Open the Arduino Multi-Sensor Simulator
    • Replace the starter sketch with your TFLite Micro inference code from this lab
    • Use the virtual sensors to exercise your model logic
  • (Optional) TFLite Micro host build
    • Build a small C++ binary on your laptop that links against TFLM
    • Feed recorded sensor data and check that inference matches the Python/TFLite results

For Level 3 deployment, use an Arduino Nano 33 BLE Sense (or similar Cortex-M MCU) and follow these steps:

  1. Take the quantized .tflite model from LAB03 or LAB04.
  2. Convert it to a C array (e.g. using xxd) and include it as model_data.h.
  3. Configure the tensor arena size and ops resolver in your sketch.
  4. Flash the firmware using the Arduino IDE or PlatformIO.
  5. Verify correct behavior using serial logs and, where appropriate, LED indicators.

Record basic metrics (flash use, RAM estimate, and approximate inference latency) as part of your lab notes.

Visual Troubleshooting

Model Loading Issues

flowchart TD
    A[Model won't load] --> B{File exists?}
    B -->|No| C[Check file path<br/>Verify upload to device]
    B -->|Yes| D{File size normal?}
    D -->|0 bytes| E[Re-convert model<br/>Check TFLite conversion]
    D -->|Normal| F{Enough Flash?}
    F -->|No| G[Reduce model complexity<br/>More aggressive quantization]
    F -->|Yes| H{GetModel returns null?}
    H -->|Yes| I[Schema incompatible<br/>Check flatbuffer version]
    H -->|No| J[See Memory Allocation flowchart]

    style A fill:#ff6b6b
    style C fill:#4ecdc4
    style E fill:#4ecdc4
    style G fill:#4ecdc4
    style I fill:#4ecdc4
    style J fill:#ffe66d

Memory Allocation Errors

flowchart TD
    A[AllocateTensors fails] --> B{Error type?}
    B -->|Arena too small| C[Check arena_used_bytes]
    C --> D[Set arena = used × 1.2<br/>Add 20% safety margin]
    B -->|Segmentation fault| E{Static allocation?}
    E -->|No| F[Declare static:<br/>static uint8_t tensor_arena<br/>alignas 16]
    E -->|Yes| G[Check array bounds<br/>Enable debugger]
    B -->|Out of SRAM| H[Reduce memory:<br/>Smaller buffers<br/>Reduce model size]

    style A fill:#ff6b6b
    style D fill:#4ecdc4
    style F fill:#4ecdc4
    style G fill:#ffe66d
    style H fill:#4ecdc4

For complete troubleshooting flowcharts, see: