Visual Troubleshooting Guides
Flowchart-Driven Diagnosis for Edge ML Issues
Quick Diagnosis Flowcharts
Use these visual guides to diagnose common edge ML issues systematically. Each flowchart provides a step-by-step decision tree to identify and resolve problems.
- Start at the top node describing your symptom
- Follow the decision paths based on your observations
- Apply the suggested solution at terminal nodes
- If problem persists, check the text-based Troubleshooting Guide
General Edge ML Issues
1. Model Loading Failures
When your model fails to load on the device, this flowchart helps identify whether it’s a file issue, memory problem, or compatibility issue.
flowchart TD
A[Start: Model won't load] --> B{Does model file exist?}
B -->|No| C[Check file path is correct<br/>Verify file was uploaded to device<br/>Check SD card if used]
B -->|Yes| D{Check file size}
D -->|0 bytes or corrupted| E[Re-download or re-convert model<br/>Verify TFLite conversion succeeded<br/>Check disk space during save]
D -->|Normal size| F{Enough Flash memory?}
F -->|No| G[Model too large for device<br/>- Reduce model complexity<br/>- Use more aggressive quantization<br/>- Remove unused layers]
F -->|Yes| H{Check model format}
H -->|Not .tflite| I[Convert to TFLite format:<br/>converter = TFLiteConverter.from_keras_model model<br/>tflite_model = converter.convert]
H -->|Valid .tflite| J{TFLite version compatible?}
J -->|Version mismatch| K[Update TFLite Micro to match<br/>or re-convert model with<br/>compatible TF version]
J -->|Compatible| L{GetModel returns null?}
L -->|Yes| M[Model schema incompatible<br/>- Rebuild with correct flatbuffer<br/>- Check for custom ops<br/>- Verify model_data array]
L -->|No| N[Check AllocateTensors result<br/>See Memory Allocation flowchart]
style A fill:#ff6b6b
style C fill:#4ecdc4
style E fill:#4ecdc4
style G fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style N fill:#ffe66d
2. Inference Accuracy Problems
When your deployed model gives poor results despite good training accuracy, use this diagnostic path.
flowchart TD
A[Start: Poor inference accuracy] --> B{Works in notebook/Python?}
B -->|No| C[Problem is with model itself<br/>- Retrain with more data<br/>- Check for overfitting<br/>- Validate training pipeline]
B -->|Yes| D{Works with TFLite interpreter?}
D -->|No| E[Quantization error<br/>- Use representative dataset<br/>- Try quantization-aware training<br/>- Check for extreme activations]
D -->|Yes| F{Check preprocessing}
F -->|Different from training| G[Match preprocessing exactly:<br/>- Same normalization 0-1 or -1 to 1<br/>- Same scaling factors<br/>- Same color space RGB/BGR]
F -->|Matches training| H{Check input data type}
H -->|Type mismatch| I[Fix input tensor type:<br/>- float32 vs uint8<br/>- Signed vs unsigned<br/>- Check input_details]
H -->|Correct type| J{Check output interpretation}
J -->|Wrong postprocessing| K[Fix output handling:<br/>- Apply softmax if needed<br/>- Correct argmax usage<br/>- Check dequantization]
J -->|Correct| L{Sensor data quality}
L -->|Noisy/inconsistent| M[Improve sensor pipeline:<br/>- Add filtering moving avg<br/>- Increase sampling rate<br/>- Calibrate sensors per user]
L -->|Good quality| N[Model may not generalize<br/>- Collect more diverse data<br/>- Test edge cases<br/>- Consider retraining]
style A fill:#ff6b6b
style C fill:#4ecdc4
style E fill:#4ecdc4
style G fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style N fill:#4ecdc4
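The preprocessing-mismatch branch is worth a concrete illustration. A minimal sketch (function and mode names are my own): the two common image normalizations differ only by a scale and offset, but feeding one where the other was used during training silently wrecks accuracy.

```python
def normalize_pixels(values, mode):
    """Normalize raw 0-255 pixel values the same way the training pipeline did.

    mode="zero_one" -> [0, 1]   (x / 255.0)
    mode="sym_one"  -> [-1, 1]  (x / 127.5 - 1.0)
    A mismatch here is one of the most common causes of on-device accuracy loss.
    """
    if mode == "zero_one":
        return [v / 255.0 for v in values]
    if mode == "sym_one":
        return [v / 127.5 - 1.0 for v in values]
    raise ValueError("unknown normalization mode: " + mode)
```

Compare a few raw samples through both your training pipeline and your device pipeline; the outputs should match element for element.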
3. Memory Allocation Errors
The dreaded “tensor arena too small” and related memory issues.
flowchart TD
A[Start: Memory allocation fails] --> B{Error message type?}
B -->|Tensor arena too small| C[Check arena_used_bytes]
C --> D{Have you profiled it?}
D -->|No| E[Add this code:<br/>Serial.print arena_used_bytes<br/>Start with large arena 100KB]
D -->|Yes| F[Set arena = used_bytes × 1.2<br/>20% safety margin]
B -->|AllocateTensors fails| G{Enough total SRAM?}
G -->|No| H[Reduce memory usage:<br/>- Smaller input buffers<br/>- Reduce model size<br/>- Remove debug code]
G -->|Yes| I{Check for memory leaks}
I -->|Calling AllocateTensors<br/>repeatedly| J[Only call once in setup<br/>Not in loop]
I -->|Static allocation OK| K[Check ops resolver:<br/>Missing required ops?]
B -->|Segmentation fault| L{Using static allocation?}
L -->|No malloc on MCU| M[Declare tensors static:<br/>static uint8_t tensor_arena<br/>alignas 16]
L -->|Already static| N{Check array bounds}
N -->|Buffer overflow| O[Validate input sizes:<br/>- Clip values to expected range<br/>- Check buffer indexing<br/>- Add bounds checking]
N -->|Bounds OK| P[Enable debug symbols<br/>Use GDB or PlatformIO debugger]
B -->|Stack overflow| Q[Increase stack size or<br/>reduce local variables<br/>Move large arrays to static]
style A fill:#ff6b6b
style E fill:#ffe66d
style F fill:#4ecdc4
style H fill:#4ecdc4
style J fill:#4ecdc4
style K fill:#ffe66d
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#ffe66d
style Q fill:#4ecdc4
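The arena-sizing rule from the flowchart (measured usage plus a 20% margin) can be captured as a small helper. A sketch, with the alignment value as an assumption (TFLite Micro commonly wants a 16-byte-aligned arena buffer):

```python
import math

def recommended_arena_size(arena_used_bytes, margin=0.2, alignment=16):
    """Turn the value reported by arena_used_bytes after a successful
    AllocateTensors into a compile-time arena size: add a safety margin
    (20% by default) and round up to the buffer alignment."""
    padded = int(math.ceil(arena_used_bytes * (1.0 + margin)))
    return ((padded + alignment - 1) // alignment) * alignment
```

Profile once with a deliberately oversized arena, read back the used bytes, then hard-code the recommended size for production builds.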
Hardware-Specific Issues
4. Arduino Upload Failures
When you can’t upload your sketch to Arduino.
flowchart TD
A[Start: Upload fails] --> B{Arduino detected?}
B -->|Port not shown| C[Check USB connection:<br/>- Try different cable<br/>- Try different USB port<br/>- Restart Arduino IDE]
B -->|Port shows up| D{Correct board selected?}
D -->|Wrong board| E[Tools → Board → select correct:<br/>Arduino Nano 33 BLE<br/>ESP32 Dev Module etc]
D -->|Correct board| F{Upload error message?}
F -->|avrdude timeout| G[Press reset button twice<br/>quickly to enter bootloader<br/>Upload within 8 seconds]
F -->|Port in use| H[Close Serial Monitor<br/>Close other IDE instances<br/>Restart IDE]
F -->|Sketch too big| I[Reduce program size:<br/>- Use MicroMutableOpResolver<br/>- Remove debug prints<br/>- Smaller model]
F -->|Compilation error| J{TensorFlow Lite library?}
J -->|Not installed| K[Install via Library Manager:<br/>Arduino_TensorFlowLite<br/>or TensorFlowLite_ESP32]
J -->|Version conflict| L[Update all libraries<br/>Check TF version compatibility<br/>Use 2.4.0-ALPHA for MCU]
J -->|Code syntax error| M[Check error message:<br/>- Missing semicolons<br/>- Undeclared variables<br/>- Type mismatches]
style A fill:#ff6b6b
style C fill:#4ecdc4
style E fill:#4ecdc4
style G fill:#4ecdc4
style H fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style L fill:#4ecdc4
style M fill:#ffe66d
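The "sketch too big" branch often comes down to flash budget arithmetic. A rough sketch (all numbers illustrative, not from any specific board): the model array, the TFLite Micro runtime, and your application code must fit together in flash.

```python
def fits_in_flash(model_bytes, runtime_bytes, app_bytes, flash_bytes, reserve=0.1):
    """Rough flash budget check: total firmware footprint must fit within
    flash minus a reserve (10% here) for the bootloader and future growth.
    Returns (fits, free_bytes)."""
    budget = int(flash_bytes * (1.0 - reserve))
    total = model_bytes + runtime_bytes + app_bytes
    return total <= budget, budget - total
```

If it does not fit, the flowchart's remedies apply in order of impact: a smaller or more aggressively quantized model first, then a leaner ops resolver (MicroMutableOpResolver with only the ops you use), then stripping debug code.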
5. ESP32 WiFi Connection Issues
Debugging wireless connectivity on ESP32.
flowchart TD
A[Start: WiFi won't connect] --> B{WiFi.status returns?}
B -->|WL_NO_SSID_AVAIL| C[Network not found:<br/>- Check SSID spelling<br/>- Ensure 2.4GHz not 5GHz<br/>- Move closer to router]
B -->|WL_CONNECT_FAILED| D[Authentication failed:<br/>- Verify password correct<br/>- Check security type WPA2<br/>- Router MAC filter?]
B -->|WL_DISCONNECTED| E{Connecting then drops?}
E -->|Yes| F[Weak signal or interference:<br/>- Move closer to AP<br/>- Add external antenna<br/>- Reduce TX power if overheating]
E -->|Never connects| G{Delay after WiFi.begin?}
G -->|No| H[Add connection timeout:<br/>while WiFi.status != WL_CONNECTED<br/> delay 500<br/> retry up to 20 times]
G -->|Yes timeout| I{Check router settings}
I -->|AP isolation enabled| J[Disable AP isolation<br/>on router to allow<br/>device-to-device comm]
I -->|DHCP full| K[Assign static IP or<br/>increase DHCP pool size]
B -->|WL_IDLE_STATUS| L[WiFi not initialized:<br/>WiFi.mode WIFI_STA<br/>before WiFi.begin]
B -->|WL_CONNECTED but<br/>no internet| M{Can ping gateway?}
M -->|No| N[Local network issue:<br/>Check subnet mask<br/>Check gateway IP]
M -->|Yes| O[DNS or internet issue:<br/>Try 8.8.8.8 for DNS<br/>Check if router has internet]
style A fill:#ff6b6b
style C fill:#4ecdc4
style D fill:#4ecdc4
style F fill:#4ecdc4
style H fill:#4ecdc4
style J fill:#4ecdc4
style K fill:#4ecdc4
style L fill:#4ecdc4
style N fill:#4ecdc4
style O fill:#4ecdc4
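The bounded-retry pattern from the "never connects" branch generalizes beyond the ESP32. A sketch of the logic in Python (the actual device code would poll WiFi.status() in C++; names here are illustrative):

```python
import time

def wait_for_connection(status_fn, connected_value, max_attempts=20, delay_s=0.5):
    """Bounded connection wait, mirroring the flowchart's 'retry up to 20
    times with a 500 ms delay'. status_fn is polled until it returns
    connected_value or the attempt budget runs out. Returns (ok, attempts)."""
    for attempt in range(1, max_attempts + 1):
        if status_fn() == connected_value:
            return True, attempt
        time.sleep(delay_s)
    return False, max_attempts
```

The key point is the bound: an unbounded `while` on the status can hang the device forever when the access point is simply out of range.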
6. Sensor Reading Problems
When sensor data looks wrong or inconsistent.
flowchart TD
A[Start: Bad sensor readings] --> B{Sensor type?}
B -->|Analog ADC| C{Reading always 0 or 1023?}
C -->|Always max/min| D[Check wiring:<br/>- Need voltage divider?<br/>- Correct pin analog capable?<br/>- Ground connection OK?]
C -->|Values present but wrong| E{Calibrated?}
E -->|No| F[Implement calibration:<br/>- Record min/max values<br/>- Map to expected range<br/>- Account for offset/drift]
E -->|Yes but noisy| G[Add filtering:<br/>- Moving average window 5-10<br/>- Median filter for spikes<br/>- Low-pass RC filter hardware]
B -->|Digital I2C/SPI| H{Communication working?}
H -->|No response| I[Check I2C address:<br/>- Scan for devices<br/>- Check A0 A1 jumpers<br/>- Verify pull-up resistors]
H -->|Returns 0xFF or error| J[Check timing:<br/>- Clock speed too fast?<br/>- Adequate delays?<br/>- Power supply stable?]
H -->|Intermittent| K[Check connections:<br/>- Loose wires?<br/>- Cable length too long?<br/>- Electromagnetic interference?]
B -->|Timing-sensitive| L{Consistent sample rate?}
L -->|Irregular timing| M[Use millis for timing:<br/>unsigned long last = 0<br/>if millis - last >= interval<br/> read sensor]
L -->|Regular but wrong values| N{Check sensor datasheet}
N -->|Wrong voltage| O[Voltage level issue:<br/>- 3.3V vs 5V logic<br/>- Use level shifter<br/>- Check sensor V rating]
N -->|Conversion needed| P[Apply formula from datasheet:<br/>- Temperature coefficients<br/>- Resistance to value<br/>- Raw to engineering units]
style A fill:#ff6b6b
style D fill:#4ecdc4
style F fill:#4ecdc4
style G fill:#4ecdc4
style I fill:#4ecdc4
style J fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#4ecdc4
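The two filters the flowchart recommends for noisy analog readings are easy to sketch. Illustrative Python (on a microcontroller you would implement the same ring-buffer logic in C++): a median filter rejects single-sample spikes, and a moving average smooths the residual noise.

```python
from collections import deque
from statistics import median

def moving_average(values, window=5):
    """Simple moving average; smooths ADC noise at the cost of some lag."""
    out, buf = [], deque(maxlen=window)
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

def median_filter(values, window=3):
    """Median filter; removes single-sample spikes without smearing edges."""
    out, buf = [], deque(maxlen=window)
    for v in values:
        buf.append(v)
        out.append(median(buf))
    return out
```

Running the median filter first and the moving average second usually gives the cleanest signal for downstream inference.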
Training Issues
7. Training Loss Not Decreasing
When your model isn’t learning during training.
flowchart TD
A[Start: Loss not decreasing] --> B{Loss value?}
B -->|Loss is NaN| C[Gradient explosion:<br/>- Reduce learning rate 10x<br/>- Add gradient clipping<br/>- Check for bad data inf/NaN]
B -->|Loss constant high| D{Check learning rate}
D -->|LR too small<br/>1e-6 or less| E[Increase learning rate:<br/>Try 1e-3 for Adam<br/>Try 0.01 for SGD]
D -->|LR reasonable| F{Data preprocessed?}
F -->|No normalization| G[Normalize inputs:<br/>x = x / 255.0 for images<br/>StandardScaler for tabular<br/>Mean 0 std 1 generally]
F -->|Already normalized| H{Check labels}
H -->|All same label<br/>or wrong format| I[Fix label issues:<br/>- One-hot encode categorical<br/>- Balance class distribution<br/>- Verify ground truth correct]
H -->|Labels OK| J{Model capacity?}
J -->|Too small| K[Increase model size:<br/>- Add layers<br/>- Increase units per layer<br/>- But avoid overfitting]
J -->|Reasonable size| L{Activation functions?}
L -->|All linear or wrong| M[Use proper activations:<br/>- ReLU for hidden layers<br/>- Softmax for classification<br/>- Sigmoid for binary]
L -->|Activations OK| N{Loss function correct?}
N -->|Mismatch with task| O[Match loss to task:<br/>- Categorical crossentropy multi-class<br/>- Binary crossentropy binary<br/>- MSE for regression]
N -->|Correct loss| P[Try different optimizer:<br/>- Adam usually works<br/>- SGD with momentum<br/>- Reduce batch size if large]
style A fill:#ff6b6b
style C fill:#4ecdc4
style E fill:#4ecdc4
style G fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#4ecdc4
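Gradient clipping, the flowchart's remedy for NaN losses, is worth seeing in the concrete. A pure-Python sketch of clipping by global norm (frameworks provide this built in, e.g. as a `clipnorm` option on Keras optimizers; this just shows the arithmetic):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient values so their global L2 norm does not
    exceed max_norm - the standard remedy when loss explodes to NaN.
    Gradients below the threshold pass through unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Note the clip preserves gradient direction: all components are scaled by the same factor, so only the step size shrinks.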
8. Overfitting Detection and Solutions
When training accuracy is high but validation accuracy is low.
flowchart TD
A[Start: Train acc high<br/>Val acc low] --> B{Gap between accuracies?}
B -->|>20% difference| C[Severe overfitting detected]
B -->|10-20% difference| D[Moderate overfitting]
B -->|<10% difference| E[Mild - may be acceptable<br/>for edge ML trade-off]
C --> F{Dataset size?}
F -->|<100 samples/class| G[Collect more data:<br/>- Aim for 500+ per class<br/>- Critical for deep learning<br/>- Data beats algorithms]
F -->|Adequate data| H{Using augmentation?}
H -->|No| I[Add data augmentation:<br/>- Random flips rotations<br/>- Noise injection<br/>- Time warping for audio/sensors]
H -->|Yes| J{Regularization applied?}
J -->|No regularization| K[Add regularization:<br/>- L2 weight decay 1e-4<br/>- Dropout 0.3-0.5 after FC layers<br/>- Batch normalization]
J -->|Already using| L{Model complexity?}
L -->|Very deep/wide| M[Reduce model size:<br/>- Fewer layers<br/>- Smaller hidden dims<br/>- Early stopping on val loss]
L -->|Simple model| N{Check validation set}
N -->|Different distribution| O[Fix data split:<br/>- Stratified split by class<br/>- Shuffle before split<br/>- Ensure representative sample]
N -->|Distribution matches| P[Consider ensemble:<br/>- Multiple models vote<br/>- Or accept slight overfit<br/>- Edge models trade accuracy]
D --> J
E --> Q[Monitor in production<br/>May need retraining with<br/>real-world data later]
style A fill:#ff6b6b
style G fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#4ecdc4
style Q fill:#ffe66d
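Early stopping on validation loss, recommended in the model-size branch, comes down to a patience counter. A sketch of the logic (Keras offers this as the EarlyStopping callback; this shows what it does):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop: when the best
    validation loss has not improved for `patience` consecutive epochs.
    Returns the last epoch if training runs to completion."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1
```

In practice you would also restore the weights from the best epoch, not the stopping epoch, since the last few epochs were by definition not improving.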
9. Quantization Accuracy Drop
When converting from float32 to int8 causes significant accuracy loss.
flowchart TD
A[Start: Accuracy drops<br/>after quantization] --> B{Accuracy drop amount?}
B -->|>10% drop| C[Severe degradation]
B -->|5-10% drop| D[Moderate - may be fixable]
B -->|<5% drop| E[Acceptable for edge<br/>deployment trade-off]
C --> F{Using representative dataset?}
F -->|No rep_dataset| G[Critical - add representative data:<br/>def rep_data_gen:<br/> for sample in dataset<br/> yield sample<br/>converter.representative_dataset = rep_data_gen]
F -->|Have rep_dataset| H{Dataset diverse enough?}
H -->|<100 samples or<br/>single scenario| I[Expand representative dataset:<br/>- Include all input variations<br/>- Cover activation ranges<br/>- Multiple scenarios/users]
H -->|Dataset adequate| J{Try quantization-aware training}
J -->|Not using QAT| K[Implement QAT:<br/>model = tfmot.quantization.keras<br/> .quantize_model model<br/>Train with quantization simulation]
J -->|Already using QAT| L{Check activation ranges}
L -->|Extreme outliers| M[Fix extreme activations:<br/>- Clip outliers in preprocessing<br/>- Use batch normalization<br/>- Scale input range properly]
L -->|Ranges normal| N{Specific layer causing issue?}
N -->|One layer drops accuracy| O[Keep that layer in float:<br/>- Use selective quantization<br/>- Annotate layers via QAT<br/>- Or try float16 quantization]
N -->|Whole model degrades| P[Model may not be quantizable:<br/>- Try different architecture<br/>- Use larger model to compensate<br/>- Consider dynamic range quant]
D --> F
E --> Q[Deploy and monitor<br/>Acceptable for most<br/>edge ML applications]
style A fill:#ff6b6b
style G fill:#4ecdc4
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#ffe66d
style Q fill:#95e1d3
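Why the representative dataset matters becomes clear from the quantization arithmetic itself. A sketch of TFLite-style affine int8 quantization (scale and zero-point values here are illustrative): in-range values round-trip with error at most half a step, but values outside the calibrated range clamp hard.

```python
def quantize(x, scale, zero_point):
    """Affine int8 quantization: q = round(x / scale) + zero_point,
    clamped to the int8 range [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Inverse mapping back to float."""
    return (q - zero_point) * scale
```

If the representative dataset never exercised large activations, the calibrated scale is too small and real inputs saturate at 127, which is exactly the "extreme outliers" failure mode in the flowchart.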
Deployment Issues
10. TFLite Conversion Errors
When converting your Keras model to TFLite format fails.
flowchart TD
A[Start: TFLite conversion fails] --> B{Error message type?}
B -->|Unsupported operation| C{Which operation?}
C -->|Custom layer| D[Replace or reimplement:<br/>- Use built-in equivalent<br/>- Implement as TFLite custom op<br/>- Redesign model architecture]
C -->|Standard op but flagged| E[Enable TF op fallback:<br/>supported_ops = <br/> TFLITE_BUILTINS<br/> SELECT_TF_OPS]
B -->|Dynamic tensor shape| F{Where are dynamic shapes?}
F -->|Input layer| G[Set explicit input shape:<br/>Input shape=fixed_shape<br/>Avoid None dimensions]
F -->|Internal layers| H[Redesign model:<br/>- Remove dynamic reshaping<br/>- Use fixed size tensors<br/>- Pad to max size if needed]
B -->|Graph optimization failed| I{Model complexity?}
I -->|Very complex graph| J[Simplify model:<br/>- Remove unnecessary ops<br/>- Fuse batch norm into conv<br/>- Remove training-only ops]
I -->|Simple model| K{TensorFlow version?}
K -->|TF 2.0-2.3 old| L[Update TensorFlow:<br/>pip install tensorflow==2.10<br/>Rebuild model with new version]
K -->|Version OK| M[Try different converter:<br/>from_keras_model vs<br/>from_saved_model vs<br/>from_concrete_functions]
B -->|Quantization error| N[See Quantization flowchart<br/>Check representative dataset<br/>Try post-training quant only]
B -->|Model is None| O{Model saved correctly?}
O -->|Not saved| P[Save model first:<br/>model.save model.h5<br/>or tf.saved_model.save]
O -->|Saved but corrupt| Q[Re-train and save:<br/>Check disk space<br/>Verify file integrity]
style A fill:#ff6b6b
style D fill:#4ecdc4
style E fill:#4ecdc4
style G fill:#4ecdc4
style H fill:#4ecdc4
style J fill:#4ecdc4
style L fill:#4ecdc4
style M fill:#4ecdc4
style N fill:#ffe66d
style P fill:#4ecdc4
style Q fill:#4ecdc4
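For the dynamic-shape branch, the fix is making every dimension static before conversion. A sketch of the idea (the default fill size of 96 is an arbitrary illustration; use your model's real input size):

```python
def fix_input_shape(shape, batch=1, default=96):
    """Replace None (dynamic) dimensions with fixed sizes: TFLite Micro
    requires fully static shapes. The first dim is treated as batch;
    any other unknown dim gets a chosen fixed size."""
    fixed = []
    for i, dim in enumerate(shape):
        if dim is None:
            fixed.append(batch if i == 0 else default)
        else:
            fixed.append(dim)
    return tuple(fixed)
```

In Keras this corresponds to declaring `Input(shape=...)` with concrete integers rather than leaving dimensions as None.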
11. Real-Time Performance Issues
When inference is too slow for real-time operation.
flowchart TD
A[Start: Inference too slow] --> B{Measure current latency}
B --> C{Latency vs requirement?}
C -->|2-5x too slow| D[Significant optimization needed]
C -->|1.5-2x too slow| E[Minor optimization may suffice]
C -->|Just barely slow| F[Fine-tune existing setup]
D --> G{Platform?}
G -->|Microcontroller| H{Model size?}
H -->|Large model| I[Reduce model complexity:<br/>- Fewer layers depth<br/>- Smaller kernels 3x3 not 5x5<br/>- Reduce channels width<br/>- Use depthwise separable conv]
H -->|Already minimal| J{Optimize ops}
J --> K[Profile which ops slow:<br/>- Use CMSIS-NN optimizations<br/>- Enable hardware acceleration<br/>- Check if ops are optimized<br/>- Consider assembly for critical ops]
G -->|Raspberry Pi| L{Using TFLite?}
L -->|Using full TensorFlow| M[Switch to TFLite:<br/>interpreter = tf.lite.Interpreter<br/>4-10x faster than full TF]
L -->|Already TFLite| N{Threading enabled?}
N -->|Single thread| O[Enable multi-threading:<br/>Interpreter num_threads=4<br/>Use all CPU cores]
N -->|Multi-threaded| P{Hardware acceleration?}
P -->|No accelerator| Q[Use available hardware:<br/>- Coral USB Accelerator Edge TPU<br/>- Intel Neural Compute Stick 2<br/>- GPU if available]
P -->|Using accelerator| R[Optimize model for accelerator:<br/>- INT8 for Edge TPU<br/>- Check supported ops<br/>- Profile bottlenecks]
E --> S{Reduce input size?}
S -->|Can downsample| T[Reduce input dimensions:<br/>- 96x96 instead of 224x224<br/>- Lower audio sample rate<br/>- Skip frames temporal stride]
S -->|Input size fixed| U[Batch processing if applicable<br/>or reduce inference frequency]
F --> V[Fine-tune parameters:<br/>- Compiler optimizations -O3<br/>- Reduce logging overhead<br/>- Check sensor read time]
style A fill:#ff6b6b
style I fill:#4ecdc4
style K fill:#4ecdc4
style M fill:#4ecdc4
style O fill:#4ecdc4
style Q fill:#4ecdc4
style R fill:#4ecdc4
style T fill:#4ecdc4
style U fill:#4ecdc4
style V fill:#4ecdc4
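The flowchart starts with "measure current latency", and how you measure matters. A sketch of a latency profiler (wrap your real inference call in `infer_fn`): discard warmup runs so lazy initialization and cache effects don't skew the numbers, and track the worst case, since real-time deadlines are violated by the tail, not the mean.

```python
import time

def profile_latency(infer_fn, warmup=3, runs=20):
    """Measure inference latency in milliseconds. Returns (mean_ms, worst_ms).
    Warmup runs are discarded; worst case matters most for real-time deadlines."""
    for _ in range(warmup):
        infer_fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return sum(times) / len(times), max(times)
```

Compare the worst-case figure, not the mean, against your frame budget when deciding how much optimization the flowchart's branches require.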
12. Power Consumption Too High
When your battery-powered device drains too quickly.
flowchart TD
A[Start: Battery drains too fast] --> B{Measure current draw}
B --> C{When is power high?}
C -->|Always high even idle| D[Idle power issue]
C -->|High during inference| E[Inference power issue]
C -->|High during wireless| F[Radio power issue]
D --> G{Sleep modes enabled?}
G -->|No sleep| H[Implement sleep modes:<br/>- Deep sleep between samples<br/>- Light sleep during idle<br/>- Wake on interrupt not polling]
G -->|Sleep enabled| I{Peripherals powered down?}
I -->|Always on| J[Disable unused peripherals:<br/>- Turn off LEDs<br/>- Power down sensors when idle<br/>- Disable USB if not needed]
I -->|Optimized| K[Check for current leaks:<br/>- Pull-up/down resistors<br/>- Floating pins<br/>- LDO efficiency]
E --> L{Inference frequency?}
L -->|Very frequent| M[Reduce inference rate:<br/>- 1 Hz instead of 10 Hz<br/>- On-demand vs continuous<br/>- Motion trigger activation]
L -->|Already low| N{Model efficiency?}
N -->|Large complex model| O[Optimize model:<br/>- Smaller architecture<br/>- INT8 quantization<br/>- Prune unnecessary weights<br/>- Knowledge distillation]
N -->|Efficient model| P[Hardware acceleration:<br/>- Dedicated ML accelerator<br/>- Lower voltage operation<br/>- Better power profile MCU]
F --> Q{WiFi always on?}
Q -->|Yes| R[Optimize radio usage:<br/>- Connect only when needed<br/>- Reduce TX power<br/>- Batch transmissions<br/>- Use BLE instead of WiFi]
Q -->|Optimized usage| S{Connection parameters?}
S -->|Frequent reconnects<br/>or poor signal| T[Improve connectivity:<br/>- Keep-alive intervals<br/>- Better antenna position<br/>- Closer to AP<br/>- Lower data rate trade quality]
S -->|Parameters good| U[Consider different protocol:<br/>- LoRa for long range low power<br/>- BLE for short range<br/>- Zigbee for mesh networks]
style A fill:#ff6b6b
style H fill:#4ecdc4
style J fill:#4ecdc4
style K fill:#ffe66d
style M fill:#4ecdc4
style O fill:#4ecdc4
style P fill:#4ecdc4
style R fill:#4ecdc4
style T fill:#4ecdc4
style U fill:#4ecdc4
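The payoff of the deep-sleep branch is easiest to see with duty-cycle arithmetic. A sketch (all current figures are illustrative, not from any specific MCU's datasheet): average current over one wake/sleep period determines battery life.

```python
def battery_life_hours(capacity_mah, active_ma, active_ms, sleep_ua, period_ms):
    """Duty-cycle power estimate: average current over one wake/sleep
    period, then hours of battery life at that average draw."""
    sleep_ms = period_ms - active_ms
    avg_ma = (active_ma * active_ms + (sleep_ua / 1000.0) * sleep_ms) / period_ms
    return capacity_mah / avg_ma
```

For example, a device drawing 100 mA for 100 ms of every 10 s and sleeping at 10 µA otherwise averages about 1 mA, turning a 10-hour always-on runtime into weeks. This is why sleep-mode fixes come first in the flowchart.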