Datasets

Data Requirements Summary

Good news! Most notebooks in this lab book work without pre-downloading datasets:

| Data Type | Notebooks | Action Required |
|---|---|---|
| Built-in (Keras/sklearn) | LAB02, LAB03, LAB06, LAB07, LAB17, LAB18 | None - auto-downloads |
| Synthetic/Simulated | LAB01, LAB05, LAB08-LAB15 | None - generated in notebook |
| TensorFlow Datasets | LAB04 | Auto-downloads ~2 GB |
| Ultralytics Models | LAB16 | Auto-downloads ~14 MB |

Recommended: Use Google Colab

All datasets download automatically in Colab. No setup required!


Datasets by Lab

LAB02: ML Foundations

sklearn built-in datasets - No download needed

from sklearn.datasets import load_iris, load_wine
iris = load_iris()
wine = load_wine()

LAB03, LAB06, LAB17, LAB18: MNIST

Keras built-in - Auto-downloads ~11 MB once

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
  • 60,000 training / 10,000 test images
  • 28x28 grayscale
  • 10 classes (0-9)

Original Source


LAB04: Keyword Spotting

Google Speech Commands v0.02 - ~2 GB

Downloads automatically via TensorFlow Datasets:

import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train')
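  • 105,829 audio files in the full dataset
  • 35 spoken words in the raw release; the TFDS version groups them into 12 labels (10 keywords plus silence and unknown)
  • Clips of up to 1 second at 16 kHz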

Dataset Info | Paper
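
Because clips can be shorter than one second, a model with a fixed input size needs every waveform brought to the same length. The following is a minimal sketch, assuming each clip is a 1-D NumPy array sampled at 16 kHz; the helper name `fix_length` is hypothetical and not part of the lab notebooks.

```python
import numpy as np

SAMPLE_RATE = 16_000  # 16 kHz, so one second is 16,000 samples

def fix_length(waveform, target_len=SAMPLE_RATE):
    """Zero-pad or truncate a 1-D waveform to exactly target_len samples."""
    waveform = np.asarray(waveform)
    if len(waveform) >= target_len:
        return waveform[:target_len]
    padding = target_len - len(waveform)
    # Append zeros (silence) at the end of the clip
    return np.pad(waveform, (0, padding), mode="constant")

short_clip = np.ones(12_000, dtype=np.int16)  # stand-in for a 0.75 s clip
print(fix_length(short_clip).shape)
```

Padding at the end is only one option; centering the clip or padding at a random offset are equally reasonable choices.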


LAB07: CNNs & Computer Vision

Fashion MNIST + CIFAR-10 - ~200 MB combined (CIFAR-10 accounts for most of it)

from tensorflow.keras.datasets import fashion_mnist, cifar10

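# Fashion MNIST: 10 clothing categories, 28x28 grayscale
(fashion_x_train, fashion_y_train), (fashion_x_test, fashion_y_test) = fashion_mnist.load_data()

# CIFAR-10: 10 object categories, 32x32 RGB
# (separate variable names so the second load does not overwrite the first)
(cifar_x_train, cifar_y_train), (cifar_x_test, cifar_y_test) = cifar10.load_data()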

CIFAR-10 Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck


LAB08-LAB15: Synthetic Data

These labs generate all data in the notebook - no downloads required!

| Lab | Data Type | How Generated |
|---|---|---|
| LAB08 | Sensor readings | SimulatedSensor class with numpy |
| LAB09 | IoT network data | Simulated ESP32 nodes |
| LAB10 | EMG signals | generate_synthetic_emg() with scipy |
| LAB11 | Profiling metrics | Model benchmarks |
| LAB12 | Streaming data | DataStream class |
| LAB13 | Distributed readings | numpy with time patterns |
| LAB14 | Anomaly data | Injected anomalies in sensor stream |
| LAB15 | Energy metrics | Analytical calculations |
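
To give a sense of what in-notebook generation can look like, here is a minimal sketch of a simulated temperature signal built from a daily sinusoidal cycle plus Gaussian noise. It is an illustration only: the function name `simulate_temperature` and all parameter names and defaults are assumptions, and the labs' own classes (for example `SimulatedSensor`) may work differently.

```python
import numpy as np

def simulate_temperature(n_samples, base=22.0, amplitude=3.0,
                         noise_std=0.2, period=1440, seed=0):
    """Return n_samples of a sinusoidal cycle plus Gaussian noise.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)  # seeded for reproducible notebooks
    t = np.arange(n_samples)
    cycle = amplitude * np.sin(2 * np.pi * t / period)
    noise = rng.normal(0.0, noise_std, size=n_samples)
    return base + cycle + noise

# One reading per minute for a full day (1440 minutes)
readings = simulate_temperature(1440)
print(readings[:5])
```

Fixing the seed keeps results repeatable between runs, which makes it easier to compare outputs across notebook sessions.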

LAB16: YOLO Object Detection

YOLOv5 Models - Auto-downloads from Ultralytics

from ultralytics import YOLO

# Downloads pre-trained model (~14 MB)
model = YOLO('yolov5s.pt')

# Run inference
results = model('image.jpg')

Ultralytics YOLOv5


Sample Sensor Data

For quick testing, sample CSV files are provided in the repository:

| File | Description | Labs |
|---|---|---|
| temperature_humidity.csv | DHT11 sensor readings | LAB08, LAB12 |
| emg_signal.csv | EMG muscle activation | LAB10 |
| network_traffic.csv | Network traffic features | LAB14 |
| energy_usage.csv | Power profiling data | LAB11, LAB15 |

import pandas as pd

# Load sample data
df = pd.read_csv('data/sample/sensors/temperature_humidity.csv')
print(df.head())

Pre-Download for Offline Use

If you need to work offline, run the following once while you are still online to pre-download the Keras datasets:

import tensorflow as tf

# Download all datasets once (~200 MB total)
print("Downloading MNIST...")
tf.keras.datasets.mnist.load_data()

print("Downloading Fashion MNIST...")
tf.keras.datasets.fashion_mnist.load_data()

print("Downloading CIFAR-10...")
tf.keras.datasets.cifar10.load_data()

print("All datasets cached in ~/.keras/datasets/")
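
Before going offline, it can help to confirm that the cache actually contains the files. The sketch below lists whatever is present in the cache directory along with its size; it assumes the default `~/.keras/datasets/` location mentioned above, and the helper name `list_cached_datasets` is hypothetical. Exact file and folder names vary between Keras versions, so the code does not check for specific names.

```python
from pathlib import Path

def list_cached_datasets(cache_dir=None):
    """Map each top-level entry in the cache directory to its size in bytes."""
    if cache_dir is None:
        # Assumed default location; adjust if your setup differs
        cache_dir = Path.home() / ".keras" / "datasets"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}
    sizes = {}
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
        else:
            # Sum the sizes of all files below an extracted dataset folder
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes

for name, size in list_cached_datasets().items():
    print(f"{name}: {size / 1e6:.1f} MB")
```

An empty result means nothing has been cached yet in that directory.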

Dataset Loading Utilities

import tensorflow as tf

def load_and_preprocess(dataset_name):
    """Load 'mnist' or 'cifar10' and scale pixel values to [0, 1]."""
    if dataset_name == 'mnist':
        dataset = tf.keras.datasets.mnist
    elif dataset_name == 'cifar10':
        dataset = tf.keras.datasets.cifar10
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unsupported dataset: {dataset_name!r}")

    (x_train, y_train), (x_test, y_test) = dataset.load_data()
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    return (x_train, y_train), (x_test, y_test)
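
MNIST arrays come back with shape (N, 28, 28), while CIFAR-10 arrays already carry a channel axis, (N, 32, 32, 3). Convolutional layers generally expect the four-dimensional form, so a small shape-normalizing step is often useful after loading. This is a sketch under that assumption; `ensure_channel_axis` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def ensure_channel_axis(images):
    """Return images as (N, H, W, C), adding C=1 for grayscale input."""
    images = np.asarray(images)
    if images.ndim == 3:   # e.g. MNIST: (N, 28, 28)
        return images[..., np.newaxis]
    if images.ndim == 4:   # e.g. CIFAR-10: (N, 32, 32, 3)
        return images
    raise ValueError(f"Expected 3 or 4 dimensions, got {images.ndim}")

# Stand-in arrays with the same shapes as the real datasets
print(ensure_channel_axis(np.zeros((4, 28, 28))).shape)
print(ensure_channel_axis(np.zeros((4, 32, 32, 3))).shape)
```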

Data Augmentation

For Images

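# Note: ImageDataGenerator is deprecated in recent TensorFlow releases
# in favor of Keras preprocessing layers.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # rotate by up to +/-15 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
    horizontal_flip=True      # suits CIFAR-10; disable for digits (MNIST)
)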

For Audio

import numpy as np

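def augment_audio(waveform):
    """Randomly shift and add noise to a 1-D float waveform.

    Assumes 16 kHz audio scaled to [-1, 1].
    """
    # Time shift by up to about 100 ms (1600 samples at 16 kHz);
    # np.roll wraps samples around the end of the clip
    shift = np.random.randint(-1600, 1600)
    waveform = np.roll(waveform, shift)

    # Add low-amplitude Gaussian noise
    noise = np.random.randn(len(waveform)) * 0.005
    return waveform + noise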
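
Note that `np.roll` wraps samples that fall off one end of the clip back onto the other end. If that wrap-around is unwanted, one alternative is to shift and fill the gap with silence. The following is a sketch of that variant; `shift_with_padding` is a hypothetical helper, not a function from the labs.

```python
import numpy as np

def shift_with_padding(waveform, shift):
    """Shift a 1-D waveform by `shift` samples, filling the gap with zeros."""
    waveform = np.asarray(waveform)
    shifted = np.zeros_like(waveform)
    if shift > 0:
        # Delay the signal: leading samples become silence
        shifted[shift:] = waveform[:-shift]
    elif shift < 0:
        # Advance the signal: trailing samples become silence
        shifted[:shift] = waveform[-shift:]
    else:
        shifted[:] = waveform
    return shifted

samples = np.array([1, 2, 3, 4, 5])
print(shift_with_padding(samples, 2))   # [0 0 1 2 3]
print(shift_with_padding(samples, -2))  # [3 4 5 0 0]
```

For clips that already begin and end in near-silence, the two approaches give very similar results; the difference matters mainly when speech runs up to the clip boundary.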

Dataset Licenses

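| Dataset | License | Commercial Use |
|---|---|---|
| MNIST | CC BY-SA 3.0 (per the Keras documentation) | Yes (with attribution, share-alike) |
| Fashion MNIST | MIT License | Yes |
| CIFAR-10 | No explicit license stated by the authors | Unclear - check terms |
| Speech Commands | CC BY 4.0 | Yes (with attribution) |
| YOLOv5 Models | AGPL-3.0 | Check terms |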

Always verify license terms for your specific use case.