# Datasets

## Data Requirements Summary
Good news! Most notebooks in this lab book work without pre-downloading datasets:
| Data Type | Notebooks | Action Required |
|---|---|---|
| Built-in (Keras/sklearn) | LAB02, 03, 06, 07, 17, 18 | None - auto-downloads |
| Synthetic/Simulated | LAB01, 05, 08-15 | None - generated in notebook |
| TensorFlow Datasets | LAB04 | Auto-downloads ~2GB |
| Ultralytics Models | LAB16 | Auto-downloads ~14MB |
**Recommended:** Use Google Colab, where all of these downloads work out of the box.
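If you are unsure which environment a notebook is running in, a quick check like this can help (a minimal sketch; it relies on the fact that Colab preloads the `google.colab` module):

```python
import sys

def in_colab() -> bool:
    """Return True when running inside Google Colab."""
    # Colab injects the google.colab module into sys.modules at startup
    return 'google.colab' in sys.modules

print("Running in Colab:", in_colab())
```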
## Datasets by Lab
### LAB02: ML Foundations

sklearn built-in datasets - no download needed:

```python
from sklearn.datasets import load_iris, load_wine

iris = load_iris()
wine = load_wine()
```

### LAB03, LAB06, LAB17, LAB18: MNIST
Keras built-in - auto-downloads ~11 MB once:

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

- 60,000 training / 10,000 test images
- 28x28 grayscale
- 10 classes (0-9)
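The standard preprocessing in these labs is scaling pixel values to [0, 1]. Sketched here on random arrays with MNIST's shapes and dtype so it runs without the download:

```python
import numpy as np

# Stand-in arrays with MNIST's shapes and dtype (uint8, values 0-255)
x_train = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
x_test = np.random.randint(0, 256, size=(10000, 28, 28), dtype=np.uint8)

# Scale to [0, 1] floats, as done throughout the labs
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

print(x_train.shape, x_train.min(), x_train.max())
```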
### LAB04: Keyword Spotting

Google Speech Commands v0.02 - ~2 GB, downloads automatically via TensorFlow Datasets:

```python
import tensorflow_datasets as tfds

ds = tfds.load('speech_commands', split='train')
```

- 105,829 audio files
- 35 keywords + silence + unknown
- 1-second clips at 16 kHz
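Some recorded clips are shorter than one second, so they are typically padded or trimmed to a fixed 16,000-sample length before batching. A minimal numpy sketch (the helper name is ours, not part of the lab code):

```python
import numpy as np

SAMPLE_RATE = 16000  # Speech Commands clips are 1 s at 16 kHz

def to_fixed_length(waveform, length=SAMPLE_RATE):
    """Zero-pad or trim so every clip is exactly `length` samples."""
    if len(waveform) >= length:
        return waveform[:length]
    return np.pad(waveform, (0, length - len(waveform)))

short = np.ones(12000, dtype=np.float32)   # a 0.75 s clip
print(to_fixed_length(short).shape)        # (16000,)
```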
### LAB07: CNNs & Computer Vision

Fashion MNIST + CIFAR-10 - ~150 MB combined:

```python
from tensorflow.keras.datasets import fashion_mnist, cifar10

# Fashion MNIST: 10 clothing categories
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# CIFAR-10: 10 object categories
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
```

CIFAR-10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
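CIFAR-10 labels come back as integer indices, so a lookup table gives readable predictions (a small sketch; `class_names` is our own list in the official index order, and the labels here are stand-ins instead of the real download):

```python
# CIFAR-10 label indices 0-9 map to these names (official order)
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# cifar10.load_data() returns labels with shape (N, 1), so index twice
y_sample = [[3], [8], [0]]  # stand-in labels for illustration
print([class_names[y[0]] for y in y_sample])  # ['cat', 'ship', 'airplane']
```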
### LAB08-LAB15: Synthetic Data
These labs generate all data in the notebook - no downloads required!
| Lab | Data Type | How Generated |
|---|---|---|
| LAB08 | Sensor readings | SimulatedSensor class with numpy |
| LAB09 | IoT network data | Simulated ESP32 nodes |
| LAB10 | EMG signals | generate_synthetic_emg() with scipy |
| LAB11 | Profiling metrics | Model benchmarks |
| LAB12 | Streaming data | DataStream class |
| LAB13 | Distributed readings | numpy with time patterns |
| LAB14 | Anomaly data | Injected anomalies in sensor stream |
| LAB15 | Energy metrics | Analytical calculations |
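The table above references a `SimulatedSensor` class; its exact interface lives in the LAB08 notebook, but the idea can be sketched as follows (class name aside, the parameters and signal model here are illustrative assumptions):

```python
import numpy as np

class SimulatedSensor:
    """Illustrative stand-in: noisy sinusoidal temperature readings."""

    def __init__(self, base=25.0, amplitude=3.0, noise_std=0.5, seed=42):
        self.base = base
        self.amplitude = amplitude
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)
        self.t = 0  # sample counter, advances with each read

    def read(self, n=1):
        """Return the next n readings as a float array."""
        steps = np.arange(self.t, self.t + n)
        self.t += n
        # Slow sinusoidal drift (period of 100 samples) plus Gaussian noise
        signal = self.base + self.amplitude * np.sin(2 * np.pi * steps / 100)
        return signal + self.rng.normal(0, self.noise_std, size=n)

sensor = SimulatedSensor()
readings = sensor.read(200)
print(readings.shape, round(float(readings.mean()), 1))
```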
### LAB16: YOLO Object Detection

YOLOv5 models - auto-download from Ultralytics:

```python
from ultralytics import YOLO

# Downloads the pre-trained model (~14 MB) on first use
model = YOLO('yolov5s.pt')

# Run inference
results = model('image.jpg')
```

## Sample Sensor Data
For quick testing, sample CSV files are provided in the repository:

| File | Description | Labs |
|---|---|---|
| temperature_humidity.csv | DHT11 sensor readings | LAB08, LAB12 |
| emg_signal.csv | EMG muscle activation | LAB10 |
| network_traffic.csv | Network traffic features | LAB14 |
| energy_usage.csv | Power profiling data | LAB11, LAB15 |
```python
import pandas as pd

# Load sample data
df = pd.read_csv('data/sample/sensors/temperature_humidity.csv')
print(df.head())
```

## Pre-Download for Offline Use
If you need to work offline, pre-download all Keras datasets:

```python
import tensorflow as tf

# Download all datasets once (~200 MB total)
print("Downloading MNIST...")
tf.keras.datasets.mnist.load_data()

print("Downloading Fashion MNIST...")
tf.keras.datasets.fashion_mnist.load_data()

print("Downloading CIFAR-10...")
tf.keras.datasets.cifar10.load_data()

print("All datasets cached in ~/.keras/datasets/")
```

## Dataset Loading Utilities
```python
import tensorflow as tf

def load_and_preprocess(dataset_name):
    """Universal dataset loader with normalization."""
    if dataset_name == 'mnist':
        (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    elif dataset_name == 'cifar10':
        (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    else:
        raise ValueError(f"Unknown dataset: {dataset_name}")
    # Scale pixel values to [0, 1]
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    return (x_train, y_train), (x_test, y_test)
```

## Data Augmentation
### For Images

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)
```

### For Audio
```python
import numpy as np

def augment_audio(waveform):
    # Time shift: roll by up to +/-1600 samples (0.1 s at 16 kHz)
    shift = np.random.randint(-1600, 1600)
    waveform = np.roll(waveform, shift)

    # Add low-amplitude Gaussian noise
    noise = np.random.randn(len(waveform)) * 0.005
    return waveform + noise
```

## Dataset Licenses
| Dataset | License | Commercial Use |
|---|---|---|
| MNIST | Public Domain | Yes |
| Fashion MNIST | MIT License | Yes |
| CIFAR-10 | Public Domain | Yes |
| Speech Commands | CC BY 4.0 | Yes (with attribution) |
| YOLOv5 Models | AGPL-3.0 | Check terms |
Always verify license terms for your specific use case.