Datasets

Data Requirements Summary

Good news! Most notebooks in this lab book work without pre-downloading datasets:

Data Type	Notebooks	Action Required
Built-in (Keras/sklearn)	LAB02, 03, 06, 07, 17, 18	None - bundled with sklearn or auto-downloaded by Keras
Synthetic/Simulated	LAB01, 05, 08-15	None - generated in notebook
TensorFlow Datasets	LAB04	Auto-downloads ~2 GB
Ultralytics Models	LAB16	Auto-downloads ~14 MB

Recommended: Use Google Colab

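Open in Colab: https://colab.research.google.com/github/ngcharithperera/edge-analytics-lab-book/blob/main/notebooks/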

All datasets download automatically in Colab. No setup required!

Datasets by Lab

LAB02: ML Foundations

sklearn built-in datasets - No download needed

from sklearn.datasets import load_iris, load_wine
iris = load_iris()
wine = load_wine()

LAB03, LAB06, LAB17, LAB18: MNIST

Keras built-in - Auto-downloads ~11 MB once

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

60,000 training / 10,000 test images
28x28 grayscale
10 classes (0-9)

Original Source: http://yann.lecun.com/exdb/mnist/
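
Image models usually expect floating-point inputs with an explicit channel axis, which the raw `(N, 28, 28)` uint8 arrays lack. The sketch below shows one common preparation step on a random stand-in array with MNIST's shape and dtype, so it runs without a download. `prepare_images` is a hypothetical helper, not a function from the lab notebooks.

```python
import numpy as np

def prepare_images(images: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1] and append a channel axis (hypothetical helper)."""
    scaled = images.astype("float32") / 255.0
    return scaled[..., np.newaxis]  # (N, 28, 28) -> (N, 28, 28, 1)

# Stand-in for x_train from mnist.load_data(): uint8 values, 28x28 grayscale.
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

batch = prepare_images(fake_batch)
print(batch.shape, batch.dtype)  # (8, 28, 28, 1) float32
```

The same function can be applied to `x_train` and `x_test` once they have been loaded.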

LAB04: Keyword Spotting

Google Speech Commands v0.02 - ~2 GB

Downloads automatically via TensorFlow Datasets:

import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train')

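105,829 audio files in the full v0.02 release
35 spoken words in the release; the TFDS `speech_commands` build labels 10 of them as keywords and adds silence and unknown classes (12 labels in total)
Clips of up to 1 second at 16 kHz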

Dataset Info: https://www.tensorflow.org/datasets/catalog/speech_commands | Paper: https://arxiv.org/abs/1804.03209
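
Some clips may be shorter than a full second, while batching needs fixed-length inputs. The sketch below pads or truncates a waveform to 16,000 samples and rescales it to floats. It assumes samples in the int16 range; `fix_length` and `to_float` are hypothetical helper names, and the clips are synthetic stand-ins.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, as stated for the Speech Commands clips

def fix_length(waveform: np.ndarray, target_len: int = SAMPLE_RATE) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform to exactly target_len samples."""
    waveform = waveform[:target_len]
    pad = target_len - len(waveform)
    return np.pad(waveform, (0, pad))

def to_float(waveform: np.ndarray) -> np.ndarray:
    """Map int16-range samples to float32 in [-1, 1) (assumes int16-range input)."""
    return waveform.astype("float32") / 32768.0

short_clip = np.ones(12_000, dtype=np.int16)  # 0.75 s stand-in clip
long_clip = np.ones(20_000, dtype=np.int16)   # 1.25 s stand-in clip
print(fix_length(short_clip).shape, fix_length(long_clip).shape)  # (16000,) (16000,)
```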

LAB07: CNNs & Computer Vision

Fashion MNIST + CIFAR-10 - ~190 MB combined (CIFAR-10 accounts for most of it)

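from tensorflow.keras.datasets import fashion_mnist, cifar10

# Fashion MNIST: 10 clothing categories, 28x28 grayscale
(fm_x_train, fm_y_train), (fm_x_test, fm_y_test) = fashion_mnist.load_data()

# CIFAR-10: 10 object categories, 32x32 RGB
# (separate variable names so the second load does not overwrite the first)
(c10_x_train, c10_y_train), (c10_x_test, c10_y_test) = cifar10.load_data()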

CIFAR-10 Classes (label order 0-9): airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
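
Keras returns CIFAR-10 labels as integers with shape `(N, 1)`, so a lookup is needed to show readable names. The sketch below uses the dataset's standard class order, in which the car class is named "automobile". `label_names` is a hypothetical helper and the labels are a hand-written stand-in for `y_train[:3]`.

```python
import numpy as np

# Standard CIFAR-10 class order; the list index equals the integer label.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

def label_names(labels: np.ndarray) -> list[str]:
    """Convert integer labels of shape (N,) or (N, 1) to class names."""
    return [CIFAR10_CLASSES[int(i)] for i in np.asarray(labels).ravel()]

# Stand-in for y_train[:3]; real labels have the same (N, 1) shape.
sample_labels = np.array([[6], [9], [3]], dtype=np.uint8)
print(label_names(sample_labels))  # ['frog', 'truck', 'cat']
```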

LAB08-LAB15: Synthetic Data

These labs generate all data in the notebook - no downloads required!

Lab	Data Type	How Generated
LAB08	Sensor readings	`SimulatedSensor` class with numpy
LAB09	IoT network data	Simulated ESP32 nodes
LAB10	EMG signals	`generate_synthetic_emg()` with scipy
LAB11	Profiling metrics	Model benchmarks
LAB12	Streaming data	`DataStream` class
LAB13	Distributed readings	numpy with time patterns
LAB14	Anomaly data	Injected anomalies in sensor stream
LAB15	Energy metrics	Analytical calculations
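
To give a feel for what "generated in notebook" can look like, here is a minimal sketch of a simulated temperature stream with injected anomalies. It is not the `SimulatedSensor` or `DataStream` code from the labs; the function name, the 5-minute sampling step, and all numeric constants are assumptions chosen for illustration.

```python
import numpy as np

def simulate_temperature(n_samples=288, anomaly_count=3, seed=0):
    """Return (readings, anomaly_indices) for one simulated day at 5-minute steps."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples)
    # One daily cycle around 22 degrees C plus Gaussian sensor noise (assumed values).
    daily_cycle = 22.0 + 4.0 * np.sin(2 * np.pi * t / n_samples)
    readings = daily_cycle + rng.normal(0.0, 0.3, n_samples)
    # Inject spikes at distinct random positions to act as labeled anomalies.
    anomaly_idx = np.sort(rng.choice(n_samples, size=anomaly_count, replace=False))
    readings[anomaly_idx] += 15.0
    return readings, anomaly_idx

readings, anomaly_idx = simulate_temperature()
print(readings.shape, anomaly_idx)
```

Fixing the seed keeps the stream reproducible between runs, and returning the anomaly positions lets a detector be scored against known ground truth.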

LAB16: YOLO Object Detection

YOLOv5 Models - Auto-downloads from Ultralytics

from ultralytics import YOLO

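# Downloads pre-trained weights on first use (~14 MB).
# Note: recent ultralytics releases may redirect this name to the
# 'yolov5su.pt' variant and log a notice; check the console output.
model = YOLO('yolov5s.pt')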

# Run inference
results = model('image.jpg')

Ultralytics YOLOv5: https://github.com/ultralytics/yolov5

Sample Sensor Data

For quick testing, sample CSV files are provided in the repository:

File	Description	Labs
`temperature_humidity.csv`	DHT11 sensor readings	LAB08, LAB12
`emg_signal.csv`	EMG muscle activation	LAB10
`network_traffic.csv`	Network traffic features	LAB14
`energy_usage.csv`	Power profiling data	LAB11, LAB15

import pandas as pd

# Load sample data
df = pd.read_csv('data/sample/sensors/temperature_humidity.csv')
print(df.head())
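
The relative path above only resolves when the code runs from the repository root, and the column layout of the sample files is not described here. The sketch below uses only the standard library to fail early with a clear message in both cases. `read_sample_csv` is a hypothetical helper, and any column names passed to it are assumptions to check against the actual file header.

```python
import csv
from pathlib import Path

def read_sample_csv(path, required_columns=()):
    """Read a CSV into a list of dicts, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Sample file not found: {path} (is the working directory the repository root?)"
        )
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        # Compare the header row against the columns the caller expects.
        missing = set(required_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path.name} is missing columns: {sorted(missing)}")
        return list(reader)
```

A call such as `read_sample_csv('data/sample/sensors/temperature_humidity.csv')` returns one dict per row, with values kept as strings.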

Pre-Download for Offline Use

If you need to work offline, pre-download all Keras datasets:

import tensorflow as tf

# Download all datasets once (~200 MB total)
print("Downloading MNIST...")
tf.keras.datasets.mnist.load_data()

print("Downloading Fashion MNIST...")
tf.keras.datasets.fashion_mnist.load_data()

print("Downloading CIFAR-10...")
tf.keras.datasets.cifar10.load_data()

print("All datasets cached in ~/.keras/datasets/")
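
Before going offline, it can help to confirm what actually landed in the cache. The sketch below lists every entry in a cache directory with its size. The default path is the `~/.keras/datasets/` location mentioned above; `list_cached` is a hypothetical helper, and the entry names it prints depend on the Keras version, so none are assumed here.

```python
from pathlib import Path

def list_cached(cache_dir=None):
    """Return {entry name: size in MB} for everything in the cache directory."""
    if cache_dir is None:
        cache_dir = Path.home() / ".keras" / "datasets"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}  # nothing cached yet
    sizes = {}
    for entry in sorted(cache_dir.iterdir()):
        # A dataset may be cached as a single file or as an extracted directory.
        files = [entry] if entry.is_file() else [f for f in entry.rglob("*") if f.is_file()]
        sizes[entry.name] = round(sum(f.stat().st_size for f in files) / 1e6, 1)
    return sizes

for name, size_mb in list_cached().items():
    print(f"{name}: {size_mb} MB")
```

An empty result means nothing has been cached in that directory yet.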

Dataset Loading Utilities

import tensorflow as tf

# Map each supported dataset name to its Keras loader function
_LOADERS = {
    'mnist': tf.keras.datasets.mnist.load_data,
    'cifar10': tf.keras.datasets.cifar10.load_data,
}

def load_and_preprocess(dataset_name):
    """Load a Keras dataset and scale pixel values to [0, 1]."""
    if dataset_name not in _LOADERS:
        # Fail loudly instead of silently returning None for unknown names
        raise ValueError(
            f"Unknown dataset {dataset_name!r}; expected one of {sorted(_LOADERS)}"
        )
    (x_train, y_train), (x_test, y_test) = _LOADERS[dataset_name]()

    # uint8 pixels in [0, 255] -> float32 in [0, 1]
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    return (x_train, y_train), (x_test, y_test)

Data Augmentation

For Images

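from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Note: the TensorFlow docs mark ImageDataGenerator as deprecated in favor of
# Keras preprocessing layers (e.g. RandomFlip, RandomRotation).
datagen = ImageDataGenerator(
    rotation_range=15,        # random rotation of up to 15 degrees
    width_shift_range=0.1,    # horizontal shift of up to 10% of the width
    height_shift_range=0.1,   # vertical shift of up to 10% of the height
    horizontal_flip=True      # random left-right mirroring
)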
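
To make the shift and flip parameters concrete, here is a NumPy-only sketch of the two simplest transforms. It illustrates the idea and is not a reimplementation of `ImageDataGenerator` (for instance, it fills exposed pixels with zeros). `flip_and_shift` and `max_shift_frac` are hypothetical names.

```python
import numpy as np

def flip_and_shift(image, rng, max_shift_frac=0.1):
    """Randomly mirror an (H, W, C) image and shift it horizontally with zero fill."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    # A fraction of 0.1 on a 32-pixel-wide image allows shifts of up to 3 pixels.
    max_shift = int(image.shape[1] * max_shift_frac)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(image)
    if shift > 0:
        shifted[:, shift:, :] = image[:, :-shift, :]   # move content right
    elif shift < 0:
        shifted[:, :shift, :] = image[:, -shift:, :]   # move content left
    else:
        shifted[...] = image
    return shifted

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3), dtype=np.float32)  # CIFAR-10-sized stand-in
augmented = flip_and_shift(image, rng)
print(augmented.shape)  # (32, 32, 3)
```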

For Audio

import numpy as np

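def augment_audio(waveform):
    # Time shift by up to 1600 samples (100 ms at 16 kHz) in either direction.
    # np.roll wraps around: samples pushed off one end reappear at the other.
    shift = np.random.randint(-1600, 1600)
    waveform = np.roll(waveform, shift)

    # Add low-amplitude Gaussian noise (standard deviation 0.005);
    # this assumes a float waveform scaled to roughly [-1, 1]
    noise = np.random.randn(len(waveform)) * 0.005
    return waveform + noise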
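
Because `np.roll` wraps around, the end of a clip can reappear at its start after a shift. Where that matters, one alternative is to shift and fill the gap with silence. The sketch below shows this variant on a synthetic tone; `shift_with_padding` is a hypothetical helper and is not part of the lab code.

```python
import numpy as np

def shift_with_padding(waveform, shift):
    """Shift a 1-D waveform by `shift` samples, filling the gap with zeros."""
    shifted = np.zeros_like(waveform)
    if shift > 0:
        shifted[shift:] = waveform[:-shift]   # delay: silence at the start
    elif shift < 0:
        shifted[:shift] = waveform[-shift:]   # advance: silence at the end
    else:
        shifted[:] = waveform
    return shifted

# 1-second, 440 Hz stand-in tone sampled at 16 kHz.
t = np.arange(16_000) / 16_000
tone = np.sin(2 * np.pi * 440 * t).astype("float32")

delayed = shift_with_padding(tone, 1600)  # 100 ms delay
print(delayed.shape, float(np.abs(delayed[:1600]).max()))  # (16000,) 0.0
```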

Dataset Licenses

Dataset	License	Commercial Use
MNIST	CC BY-SA 3.0	Yes (with attribution, share-alike)
Fashion MNIST	MIT License	Yes
CIFAR-10	No explicit license published	Check terms
Speech Commands	CC BY 4.0	Yes (with attribution)
YOLOv5 Models	AGPL-3.0	Check terms

Always verify license terms for your specific use case.
