LAB17: Federated Learning

Distributed Training with Flower

PDF Textbook Reference

For detailed theoretical foundations, mathematical proofs, and algorithm derivations, see Chapter 17: Federated Learning with the Flower Framework in the PDF textbook.

The PDF chapter includes:

  • Complete mathematical foundations of Federated Averaging (FedAvg)
  • Detailed convergence analysis and communication complexity
  • In-depth coverage of non-IID data distribution challenges
  • Comprehensive differential privacy and secure aggregation theory
  • Theoretical analysis of communication-efficient federated learning

Open In Colab

Download Notebook

Learning Objectives

By the end of this lab you should be able to:

  • Explain the core ideas of federated learning and when it is preferable to centralised training
  • Implement the FedAvg algorithm using the Flower framework on a simple dataset
  • Explore the impact of IID vs non-IID client data on convergence and accuracy
  • Design and run small-scale FL experiments on laptops/Pis that respect edge constraints (bandwidth, memory, energy)

Theory Summary

Why Federated Learning?

Traditional machine learning centralizes all training data in one location—a data center or cloud server. Federated Learning (FL) flips this model: the model travels to the data, not vice versa.

This solves critical problems for edge AI:

  1. Privacy: Raw data never leaves the device (healthcare records, personal photos, typing patterns)
  2. Bandwidth: Sending model updates (KB-MB) is cheaper than sending raw data (GB-TB)
  3. Compliance: GDPR/HIPAA regulations often prohibit centralizing sensitive data
  4. Latency: Local inference with periodically improved global models

Example: Google’s Gboard keyboard learns from your typing patterns via FL—your messages never leave your phone, yet the global model improves from millions of users.

Federated Averaging (FedAvg)

FedAvg is the foundational FL algorithm. Each training round:

  1. Server broadcasts current global model weights \(w_t\) to selected clients
  2. Clients train locally for \(E\) epochs on their private data
  3. Clients send back updated weights \(w_i^{t+1}\)
  4. Server aggregates via weighted average:

\[w_{t+1} = \sum_{i=1}^{K} \frac{n_i}{n} w_i^{t+1}\]

where \(n_i\) is the number of training samples on client \(i\), and \(n = \sum n_i\) is the total.

Key insight: Weighting by sample count ensures clients with more data have proportionally more influence, preventing bias toward smaller datasets.
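A quick numerical check of the weighted average (the two-client values below are illustrative):

```python
import numpy as np

# Two clients with toy scalar "models"
client_weights = np.array([0.25, 0.75])   # local parameters w_i
client_samples = np.array([50, 150])      # n_i: samples per client

# FedAvg: w_new = sum_i (n_i / n) * w_i
n = client_samples.sum()
w_global = np.sum(client_samples / n * client_weights)
print(w_global)  # 0.625: the larger client (150 samples) pulls the average toward 0.75
```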

The Non-IID Challenge

In real deployments, data is non-IID (non-independently and identically distributed). Examples:

  • Hospital A specializes in cardiology (90% heart patients), Hospital B in pediatrics (90% children)
  • Smart home users in Alaska vs Florida have vastly different temperature patterns
  • Mobile keyboards learn from different languages per user

Non-IID data causes:

  • Slower convergence: the global model oscillates between different local optima
  • Client drift: local models diverge, making aggregation less effective
  • Accuracy degradation: the global model may perform poorly on minority data distributions

Mitigations: FedProx (adds a regularization term to keep local models close to global), client sampling strategies, or federated data augmentation.
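As a rough sketch of the FedProx idea: the local loss gains a proximal term that penalizes drift from the global weights. The helper function and the coefficient `mu` below are illustrative, not part of the Flower API:

```python
import numpy as np

def fedprox_loss(local_loss, w_local, w_global, mu=0.01):
    """Local loss plus a proximal term penalizing drift from the global model.

    mu controls how strongly local weights are pulled toward the global ones.
    """
    prox = 0.5 * mu * sum(np.sum((wl - wg) ** 2) for wl, wg in zip(w_local, w_global))
    return local_loss + prox

# Identical weights add no penalty; drifted weights are penalized
w_g = [np.zeros(3)]
print(fedprox_loss(1.0, [np.zeros(3)], w_g))          # 1.0 (no drift)
print(fedprox_loss(1.0, [np.ones(3)], w_g, mu=0.1))   # ≈ 1.15 (0.5 * 0.1 * 3 added)
```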

Key Concepts at a Glance

Core Concepts
  • Decentralized Training: Models train where data lives; only updates travel over the network
  • FedAvg Formula: \(w_{new} = \sum_i \frac{n_i}{n} w_i\) (weighted average by sample count)
  • Client Fraction: Percentage of clients selected per round (e.g., \(C = 0.1\) = 10%)
  • Local Epochs: Number of epochs each client trains before sending updates (\(E = 1-5\) typical)
  • Model Consistency: All clients must use identical architectures—server aggregates by weight position
  • IID vs Non-IID: IID = each client has similar data distribution; Non-IID = skewed/heterogeneous data
  • Privacy vs Accuracy: FL trades some accuracy (vs centralized) for privacy and bandwidth savings

Common Pitfalls

Mistakes to Avoid
Model Architecture Mismatch Between Clients
The most cryptic FL error. If Client 1 has a 128-neuron layer where Client 2 has 256, aggregation silently produces garbage. Prevention: Define the model in one shared file that all clients import. Print model.summary() on each client and verify they match exactly.
Training Too Many Local Epochs
Setting \(E = 50\) local epochs causes “client drift”—each client’s model wanders far from the global optimum. Start with \(E = 1-5\) and increase only if communication is extremely expensive.
Not Weighting by Sample Count
If you average models without weighting (\(w_{new} = \frac{1}{K} \sum w_i\)), a client with 10 samples has the same influence as one with 10,000. Always use weighted averaging: return weights, len(x_train), {} in Flower’s fit().
Forgetting min_available_clients
If your server waits for 10 clients but only 3 connect, training stalls forever. Set min_available_clients to the number you actually have for testing, or use fraction_fit to select a subset.
Using Different Random Seeds Across Clients
If clients use different seeds for data shuffling or dropout, models diverge unnecessarily. For reproducibility, set np.random.seed() and tf.random.set_seed() consistently.
Ignoring Network Failures
In production FL, clients disconnect mid-round. Flower handles this with timeouts and minimum client requirements, but always test with simulated failures (client.stop() or network drops).

Quick Reference

Flower Server Setup

import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,          # Use all available clients per round
    fraction_evaluate=1.0,     # Evaluate on all clients
    min_fit_clients=3,         # Min clients needed to start training
    min_evaluate_clients=3,    # Min clients for evaluation
    min_available_clients=3,   # Wait for this many to connect
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)

Flower Client Implementation

import flwr as fl
import tensorflow as tf

class MNISTClient(fl.client.NumPyClient):
    def __init__(self, model, x_train, y_train, x_test, y_test):
        self.model = model
        self.x_train, self.y_train = x_train, y_train
        self.x_test, self.y_test = x_test, y_test

    def get_parameters(self, config):
        """Return current model weights"""
        return self.model.get_weights()

    def fit(self, parameters, config):
        """Train on local data"""
        self.model.set_weights(parameters)  # Apply global weights
        self.model.fit(self.x_train, self.y_train, epochs=1, batch_size=32, verbose=0)
        return self.model.get_weights(), len(self.x_train), {}  # weights, count, metrics

    def evaluate(self, parameters, config):
        """Evaluate on local test data"""
        self.model.set_weights(parameters)
        loss, accuracy = self.model.evaluate(self.x_test, self.y_test, verbose=0)
        return loss, len(self.x_test), {"accuracy": accuracy}

# Connect to server
fl.client.start_numpy_client(
    server_address="192.168.1.100:8080",
    client=MNISTClient(model, x_train, y_train, x_test, y_test)
)

Data Partitioning Strategies

IID Partitioning (random split):

def partition_iid(x_data, y_data, num_clients):
    """Random uniform partition"""
    indices = np.random.permutation(len(x_data))
    partition_size = len(x_data) // num_clients

    partitions = []
    for i in range(num_clients):
        start = i * partition_size
        end = start + partition_size
        client_indices = indices[start:end]
        partitions.append((x_data[client_indices], y_data[client_indices]))
    return partitions

Non-IID Partitioning (label skew):

def partition_non_iid(x_data, y_data, num_clients, classes_per_client=2):
    """Each client gets samples from only a subset of classes (label skew)"""
    num_classes = len(np.unique(y_data))
    partitions = [[] for _ in range(num_clients)]

    for client_id in range(num_clients):
        # Assign specific classes to this client
        client_classes = np.random.choice(num_classes, classes_per_client, replace=False)

        for cls in client_classes:
            class_indices = np.where(y_data == cls)[0]
            # Sample without replacement to avoid duplicate examples on a client
            samples = np.random.choice(class_indices, len(class_indices) // num_clients,
                                       replace=False)
            partitions[client_id].extend(samples)

    return [(x_data[np.array(indices)], y_data[np.array(indices)]) for indices in partitions]

FedAvg Hyperparameters

| Parameter | Typical Value | Effect | When to Adjust |
|---|---|---|---|
| Rounds | 10-100 | More rounds = better convergence | Increase for complex tasks |
| Local Epochs (E) | 1-5 | More epochs = less communication | Increase if bandwidth is expensive |
| Client Fraction (C) | 0.1-1.0 | Lower = fewer clients per round | Lower for large deployments (1000+ clients) |
| Learning Rate | 0.001-0.01 | Lower for FL than centralized | Start 10× lower than centralized training |
| Batch Size | 32-64 | Larger = faster but more memory | Reduce for edge devices with limited RAM |

Communication Cost Analysis

For a model with \(M\) parameters (FP32), each round requires:

  • Upload per client: \(4M\) bytes (weights)
  • Download per client: \(4M\) bytes (global model)
  • Total per client: \(8M\) bytes/round

Example: MobileNetV2 (3.5M params) = 28 MB/client/round (14 MB down + 14 MB up). With 100 clients and 50 rounds, that is 140 GB of total network traffic.

Compare to centralized training: uploading raw MNIST dataset (60k images × 784 pixels × 1 byte) = 47 MB per client. FL is more efficient when datasets are large relative to model size.
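The arithmetic above can be scripted for quick what-if checks (the helper function is an illustrative sketch):

```python
def fl_traffic_gb(params, clients, rounds, bytes_per_param=4):
    """Total FL network traffic in GB.

    Each client both downloads and uploads the full model every round,
    i.e. 2 * 4M bytes per client per round for FP32 weights.
    """
    per_client_round = 2 * params * bytes_per_param
    return clients * rounds * per_client_round / 1e9

# MobileNetV2-sized model: 3.5M parameters, 100 clients, 50 rounds
print(f"{fl_traffic_gb(3.5e6, 100, 50):.0f} GB")  # 140 GB
```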


Related Concepts in PDF Chapter 17
  • Section 17.2: Federated Learning vs traditional centralized training comparison
  • Section 17.3: FedAvg algorithm mathematical formulation and convergence properties
  • Section 17.4: Flower framework architecture (server, client, gRPC communication)
  • Section 17.5: Handling non-IID data with FedProx and client sampling strategies
  • Section 17.6: Raspberry Pi deployment with multiple networked devices
  • Section 17.7: Privacy-preserving techniques (secure aggregation, differential privacy)

Self-Assessment Checkpoints

Test your understanding before proceeding to the exercises.

Question 1: Three clients hold 100, 200, and 150 samples and report (scalar) weights 0.8, 0.7, and 0.9. What is the FedAvg global weight, and which client has the most influence?

Answer: Using FedAvg weighted average: w_global = Σ(n_i / n_total) × w_i. Total samples n = 100 + 200 + 150 = 450. w_global = (100/450)×0.8 + (200/450)×0.7 + (150/450)×0.9 = 0.222×0.8 + 0.444×0.7 + 0.333×0.9 = 0.178 + 0.311 + 0.300 = 0.789. Client 2 has the most influence (200 samples, 44.4% weight) despite having the lowest individual weight (0.7). This weighting ensures larger datasets don’t get drowned out by many small clients. Without weighting (simple average = 0.8), a client with 10 samples would have equal influence to one with 10,000.

Question 2: How does FL training behave on IID vs non-IID client data, and what mitigations exist for the non-IID case?

Answer: IID data: Each client has similar data distribution (e.g., all clients see all digit classes 0-9 equally). Local training moves in consistent directions toward the global optimum. Aggregation produces smooth, steady improvement. Non-IID data: Client A has mostly 0s and 1s, Client B has mostly 8s and 9s. Client A’s local training optimizes for 0/1 classification while destroying performance on 8/9. Client B does the opposite. Aggregation averages these conflicting updates, causing the global model to oscillate and converge slowly or get stuck in poor local minima. Mitigations: (1) FedProx: Adds penalty term keeping local models close to global, (2) Client sampling: Select diverse clients each round, (3) More communication rounds: Compensate for conflicting updates with more averaging.

Question 3: What happens if the server is configured to wait for 10 clients but only 3 ever connect, and how do you fix it?

Answer: Training stalls forever. The server waits indefinitely for 10 clients but only 3 are available. This is a common deployment issue during development/testing. Fixes: (1) Set min_available_clients=3 to match actual device count, (2) Use min_fit_clients=2 (minimum to start a round) separately from min_available_clients (wait threshold), (3) Set timeout in ServerConfig to start with available clients after waiting, (4) Use fraction_fit=0.5 to sample 50% of available clients instead of waiting for fixed count. For production: always plan for clients dropping offline—use min_fit_clients = 50-70% of expected to handle network failures gracefully.

Question 4: What is client drift, and why does setting E = 50 local epochs hurt FL performance?

Answer: Client drift: When clients train too many epochs locally, their models wander far from the global model into client-specific local optima. With E=50 local epochs on non-IID data, Client A (heart disease data) optimizes heavily for cardiology features while Client B (pediatrics) optimizes for child-specific patterns. After 50 epochs, their models are so different that aggregation produces an incoherent “average” that performs poorly on both. Result: global model accuracy degrades instead of improving. Solution: Keep E=1-5 epochs. The key insight: FL works through frequent communication and averaging, not local perfection. More rounds with less local training (R=100, E=1) beats fewer rounds with heavy training (R=10, E=10) for non-IID data.

Question 5: Why do mobile keyboards like Google’s Gboard use federated learning rather than centralized training?

Answer: It’s about privacy, not bandwidth. Keyboard learning needs to adapt to your typing patterns, autocorrect preferences, and frequently used words/phrases. Centralizing this data reveals: personal messages, passwords typed, search queries, private conversations, health information, financial data. Even anonymized, typing patterns can identify individuals. FL solution: Your phone trains a local model on your typing data. Only model weight updates (KB) are sent to the server—never your actual keystrokes. The global model improves from millions of users while your private data never leaves your device. This is why Google’s Gboard, Apple’s QuickType, and similar apps use FL: user trust requires privacy guarantees that centralized training cannot provide, even with encryption.

Interactive Notebook

The notebook below contains runnable code for all Level 1 activities.

LAB17: Federated Learning with Flower

Open In Colab View on GitHub

Learning Objectives:

  • Understand federated learning principles (decentralized training)
  • Implement the FedAvg algorithm for distributed model aggregation
  • Set up the Flower server and client architecture
  • Handle non-IID data distributions across clients
  • Deploy federated learning on edge devices

Three-Tier Approach:

  • Level 1 (This Notebook): Simulate FL with multiple clients on one machine
  • Level 2 (Simulator): Run server and clients in separate processes/containers
  • Level 3 (Device): Deploy clients on Raspberry Pi devices

📚 Theory: Federated Learning Fundamentals

The Distributed Learning Paradigm

Definition: Federated Learning (FL) enables training ML models across decentralized data sources without centralizing the data.

┌─────────────────────────────────────────────────────────────────────────┐
│                    CENTRALIZED vs FEDERATED LEARNING                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   CENTRALIZED:                                                          │
│   ┌──────┐  ┌──────┐  ┌──────┐      ┌─────────────┐      ┌───────┐     │
│   │Device│  │Device│  │Device│ ───► │ Central     │ ───► │ Model │     │
│   │ Data │  │ Data │  │ Data │      │ Database    │      │       │     │
│   └──────┘  └──────┘  └──────┘      └─────────────┘      └───────┘     │
│                                            ↑                            │
│                                   Privacy Risk!                         │
│                                   Bandwidth Cost!                       │
│                                                                         │
│   FEDERATED:                                                            │
│   ┌──────────────────────────────────────────────────────────────────┐ │
│   │                        Server                                    │ │
│   │                    ┌───────────┐                                 │ │
│   │                    │  Global   │                                 │ │
│   │                    │  Model    │                                 │ │
│   │                    └─────┬─────┘                                 │ │
│   └──────────────────────────┼───────────────────────────────────────┘ │
│          ┌───────────────────┼───────────────────┐                     │
│          ▼                   ▼                   ▼                     │
│     ┌─────────┐         ┌─────────┐         ┌─────────┐               │
│     │Client 1 │         │Client 2 │         │Client K │               │
│     │─────────│         │─────────│         │─────────│               │
│     │ Local   │         │ Local   │         │ Local   │               │
│     │ Data    │         │ Data    │         │ Data    │               │
│     │(private)│         │(private)│         │(private)│               │
│     └─────────┘         └─────────┘         └─────────┘               │
│                                                                         │
│   ✓ Data stays on device                                               │
│   ✓ Only model updates transmitted                                     │
│   ✓ Privacy preserved                                                  │
└─────────────────────────────────────────────────────────────────────────┘

Mathematical Formulation

The federated learning objective is to minimize the global loss:

\(\min_w F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)\)

where:

  • \(K\) = number of clients
  • \(n_k\) = number of samples on client \(k\)
  • \(n = \sum_k n_k\) = total samples
  • \(F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{D}_k} \ell(w; x_i, y_i)\) = local objective on client \(k\)
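Numerically, the global objective is just the sample-weighted mean of local objectives, and it matches the objective computed on the pooled data. A toy check with squared-error losses (all values illustrative):

```python
import numpy as np

# Per-client datasets: scalar targets; the "model" is a single scalar w
client_targets = [np.array([1.0, 2.0]), np.array([4.0, 5.0, 6.0])]

def local_objective(w, targets):
    return np.mean((w - targets) ** 2)   # F_k(w)

def global_objective(w):
    n = sum(len(t) for t in client_targets)
    return sum(len(t) / n * local_objective(w, t) for t in client_targets)

# Weighted sum of local objectives == objective on the pooled dataset
pooled = np.concatenate(client_targets)
w = 3.0
print(np.isclose(global_objective(w), np.mean((w - pooled) ** 2)))  # True
```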

Key Differences from Distributed Learning

| Aspect | Distributed ML | Federated Learning |
|---|---|---|
| Data location | Centralized, partitioned | Decentralized, local |
| Data access | Full access | No direct access |
| Communication | High bandwidth | Low, intermittent |
| Data distribution | Usually IID | Often non-IID |
| Privacy | Not a concern | Primary motivation |
| Clients | Homogeneous servers | Heterogeneous devices |

FL System Characteristics

Statistical Heterogeneity: non-IID data across clients

  • Different users have different patterns
  • Class imbalance varies per client
  • Local distributions don’t match the global distribution

Systems Heterogeneity: varying device capabilities

  • Different compute power (RPi vs smartphone vs laptop)
  • Different network conditions (WiFi, 4G, offline)
  • Different availability (battery, usage patterns)

Communication Constraints:

  • Bandwidth: 1 Mbps vs 100 Mbps
  • Latency: 10 ms vs 1000 ms
  • Cost: metered connections

1. Setup

2. Why Federated Learning?

Traditional ML vs Federated Learning

Traditional ML:

Devices → Upload Data → Central Server → Train Model → Deploy

Federated Learning:

Server sends model → Devices train locally → Upload model updates → Server aggregates

Benefits

  • Privacy: Data never leaves the device
  • Bandwidth: Only model updates transmitted (not raw data)
  • Personalization: Models can adapt to local patterns

3. Create Federated Dataset

We’ll simulate 5 clients with different data distributions (non-IID scenario).

4. FedAvg Algorithm (Manual Implementation)

Before using Flower, let’s understand FedAvg:

  1. Server initializes global model
  2. For each round:
    • Server sends model to clients
    • Each client trains on local data
    • Clients send updated weights to server
    • Server averages weights (weighted by sample count)

📚 Theory: FedAvg Algorithm

Federated Averaging (FedAvg) is the foundational FL algorithm:

┌────────────────────────────────────────────────────────────────────────┐
│                         FedAvg ALGORITHM                               │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Round t:                                                              │
│  ═══════                                                               │
│                                                                        │
│  ┌──────────┐   Broadcast w^t    ┌──────────┐                         │
│  │  Server  │ ─────────────────► │ Clients  │                         │
│  │          │                    │ {1..K}   │                         │
│  └──────────┘                    └────┬─────┘                         │
│       ▲                               │                               │
│       │                               ▼                               │
│       │                     ┌─────────────────┐                       │
│       │                     │  Local Training │                       │
│       │                     │  E epochs on    │                       │
│       │                     │  local data D_k │                       │
│       │                     └────────┬────────┘                       │
│       │                              │                                │
│       │      Send w_k^{t+1}          ▼                                │
│       └────────────────────── ┌────────────┐                          │
│                               │ Updated    │                          │
│                               │ weights    │                          │
│                               └────────────┘                          │
│                                                                        │
│  Aggregation:                                                          │
│  ────────────                                                          │
│                    K                                                   │
│  w^{t+1} = Σ  (n_k / n) · w_k^{t+1}                                   │
│            k=1                                                         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Algorithm Pseudocode

FEDAVG ALGORITHM:
─────────────────
Input: K clients, T rounds, E local epochs, η learning rate
Output: Global model w

1. Server initializes w⁰
2. for t = 0, 1, ..., T-1 do:
3.     Select subset S_t of clients (or all K)
4.     Broadcast w^t to selected clients
5.     for each client k ∈ S_t in parallel:
6.         w_k ← w^t
7.         for epoch e = 1 to E:
8.             for batch (x,y) in D_k:
9.                 w_k ← w_k - η∇ℓ(w_k; x, y)
10.        Send w_k to server
11.    w^{t+1} ← Σ_k (n_k/n) · w_k    // Weighted average
12. return w^T
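The pseudocode above condenses into a short NumPy sketch on a linear least-squares problem (the dataset, hyperparameters, and client sizes are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# K=3 clients with different amounts of local data: y = x @ true_w + noise
clients = []
for n_k in (50, 100, 150):
    x = rng.normal(size=(n_k, 2))
    y = x @ true_w + 0.05 * rng.normal(size=n_k)
    clients.append((x, y))

w = np.zeros(2)                 # w^0: server-initialized global model
eta, E, T = 0.05, 3, 30         # learning rate, local epochs, rounds
n = sum(len(y) for _, y in clients)

for t in range(T):
    updates = []
    for x, y in clients:        # each client starts from the current global model
        w_k = w.copy()
        for _ in range(E):      # E full-batch gradient steps on local data
            grad = 2 * x.T @ (x @ w_k - y) / len(y)
            w_k -= eta * grad
        updates.append((len(y), w_k))
    # Weighted average: w^{t+1} = sum_k (n_k / n) * w_k
    w = sum(n_k / n * w_k for n_k, w_k in updates)

print(np.round(w, 2))  # close to [2, -1]
```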

Convergence Analysis

Under certain assumptions (convexity, bounded gradients), FedAvg converges:

\(\mathbb{E}[F(w^T)] - F(w^*) \leq \mathcal{O}\left(\frac{1}{\sqrt{TKE}}\right) + \text{(non-IID error)}\)

The non-IID error term increases with:

  • Local epochs \(E\): more local updates mean more drift from the optimum
  • Data heterogeneity: larger differences between local distributions

Why Weighted Average?

Using \(\frac{n_k}{n}\) weights ensures:

  • Clients with more data contribute proportionally more
  • Equivalence to training on the pooled dataset (in the IID case)
  • An unbiased estimate of the full-batch gradient:

\(\sum_{k=1}^{K} \frac{n_k}{n} \nabla F_k(w) = \nabla F(w)\)
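The identity is easy to verify numerically with toy quadratic losses (values illustrative):

```python
import numpy as np

# F_k(w) = mean over client samples of (w - t)^2, so grad F_k = mean of 2*(w - t)
client_targets = [np.array([1.0, 3.0]), np.array([5.0])]
n = sum(len(t) for t in client_targets)
w = 0.5

local_grads = [np.mean(2 * (w - t)) for t in client_targets]
weighted = sum(len(t) / n * g for t, g in zip(client_targets, local_grads))

# Gradient of the global objective = gradient on the pooled data
pooled = np.concatenate(client_targets)
global_grad = np.mean(2 * (w - pooled))
print(np.isclose(weighted, global_grad))  # True
```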

5. Using Flower Framework

Flower provides production-ready FL infrastructure. Let’s implement the same example using Flower.

6. IID vs Non-IID Comparison

Let’s compare convergence with IID and non-IID data.


📚 Theory: The Non-IID Challenge

Non-IID (Non-Independent and Identically Distributed) data is the primary challenge in federated learning.

┌─────────────────────────────────────────────────────────────────────┐
│                     IID vs NON-IID DATA                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  IID (Ideal):                Non-IID (Reality):                     │
│  ═══════════                 ════════════════                       │
│                                                                     │
│  Client 1: [0,1,2,3,4,5]     Client 1: [0,0,0,1,1]  (mostly 0,1)   │
│  Client 2: [0,1,2,3,4,5]     Client 2: [2,3,3,3,3]  (mostly 2,3)   │
│  Client 3: [0,1,2,3,4,5]     Client 3: [5,5,4,5,4]  (mostly 4,5)   │
│                                                                     │
│  Same distribution           Different distributions                │
│  → Easy convergence          → Harder convergence                   │
│  → Simple averaging          → Client drift problem                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Types of Non-IID Data

| Type | Description | Example |
|---|---|---|
| Label skew | Uneven class distribution | User A: mostly cat photos; User B: mostly dogs |
| Feature skew | Same labels, different features | Night photos vs day photos |
| Quantity skew | Different dataset sizes | Power user: 10K samples vs casual: 100 |
| Temporal skew | Distribution shift over time | Seasonal patterns in activity data |

Client Drift Problem

           Global Optimum
                 ★
                /│\
               / │ \
              /  │  \
             ◄───┼───►
            /    │    \
           /     │     \
       w₁*      w*      w₂*
     Client 1  (True)  Client 2
     Optimum          Optimum
     
When clients train locally, they move toward their
LOCAL optimum, not the GLOBAL optimum.
After averaging, the result may be suboptimal.

Mathematical Impact

With non-IID data, the local gradient differs from global:

\(\nabla F_k(w) \neq \nabla F(w)\)

This introduces gradient divergence: \(\Gamma = \frac{1}{K}\sum_{k=1}^{K} \|\nabla F_k(w^*) - \nabla F(w^*)\|^2\)

Higher \(\Gamma\) → slower convergence, worse final accuracy.

Mitigation Strategies

| Strategy | Description | Trade-off |
|---|---|---|
| FedProx | Add proximal term to keep local ≈ global | Slower local training |
| SCAFFOLD | Variance reduction with control variates | 2× communication |
| FedNova | Normalize by local update steps | Minor overhead |
| Data sharing | Share small public dataset | Privacy compromise |
| Personalization | Fine-tune local model per client | Storage overhead |

7. Communication Efficiency

In real FL deployments, communication is often the bottleneck.


📚 Theory: Communication in Federated Learning

Communication Cost Analysis

┌─────────────────────────────────────────────────────────────────────┐
│                    COMMUNICATION BREAKDOWN                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Per Round:                                                         │
│  ══════════                                                         │
│                                                                     │
│  Server → Clients:  1 × |w| bytes      (broadcast)                 │
│  Clients → Server:  K × |w| bytes      (aggregation)               │
│                     ─────────────                                   │
│  Total per round:   (K + 1) × |w| bytes                            │
│                                                                     │
│  For T rounds: T × (K + 1) × |w| bytes                             │
│                                                                     │
│  Example (MobileNet, 3.4M params, 10 clients, 100 rounds):         │
│  100 × (10 + 1) × 3.4M × 4 bytes = 14.96 GB                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Communication Reduction Techniques

| Technique | Method | Reduction | Trade-off |
|---|---|---|---|
| Quantization | Reduce precision (FP32→INT8) | 4× | Accuracy loss |
| Sparsification | Send only top-k gradients | 10-100× | Convergence delay |
| Compression | LZ4, zstd on updates | 2-5× | CPU overhead |
| Partial updates | Send changed layers only | Variable | Staleness issues |

Gradient Compression Example

Top-k Sparsification: \(\text{Sparse}(g) = \begin{cases} g_i & \text{if } |g_i| \in \text{Top-k}(|g|) \\ 0 & \text{otherwise} \end{cases}\)

With k = 1% of parameters: 100× bandwidth reduction.
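A minimal top-k sparsifier over a flat gradient vector, thresholding by absolute value. This is an illustrative sketch (not a Flower feature); the 1% fraction mirrors the example above:

```python
import numpy as np

def topk_sparsify(grad, k_frac=0.01):
    """Zero out all but the largest-magnitude k_frac fraction of gradient entries."""
    k = max(1, int(k_frac * grad.size))
    magnitudes = np.abs(grad).ravel()
    threshold = np.partition(magnitudes, -k)[-k]   # k-th largest magnitude
    return np.where(np.abs(grad) >= threshold, grad, 0.0)

rng = np.random.default_rng(1)
g = rng.normal(size=1000)
s = topk_sparsify(g, k_frac=0.01)
print(np.count_nonzero(s))  # 10 nonzeros kept out of 1000
```

In practice the nonzero values are sent along with their indices, so the actual bandwidth saving is somewhat below the raw 100× sparsity ratio.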

Privacy Considerations

┌─────────────────────────────────────────────────────────────────────┐
│                     PRIVACY IN FEDERATED LEARNING                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  FL provides SOME privacy, but model updates can leak information: │
│                                                                     │
│  Attack Types:                                                      │
│  ─────────────                                                      │
│  • Gradient inversion: Reconstruct training data from gradients    │
│  • Membership inference: Detect if sample was in training set      │
│  • Model inversion: Infer sensitive attributes from model          │
│                                                                     │
│  Defenses:                                                          │
│  ─────────                                                          │
│  ┌───────────────────┐                                              │
│  │ Differential      │  Add noise: w̃ = w + N(0, σ²)                │
│  │ Privacy (DP)      │  Provides mathematical privacy guarantee    │
│  └───────────────────┘                                              │
│                                                                     │
│  ┌───────────────────┐                                              │
│  │ Secure            │  Cryptographic protocols ensure server      │
│  │ Aggregation       │  only sees aggregated result, not           │
│  └───────────────────┘  individual updates                         │
│                                                                     │
│  ε-Differential Privacy:                                            │
│  P(output | D) ≤ e^ε × P(output | D')                              │
│  where D, D' differ by one record                                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Edge Deployment Considerations

| Factor | Challenge | Solution |
|---|---|---|
| Battery | Training drains power | Schedule during charging |
| Network | Intermittent connectivity | Robust aggregation protocols |
| Storage | Limited space for data | Streaming data, no storage |
| Compute | Slow training on MCU | Model compression, TFLite |
| Latency | Round-trip delays | Async aggregation |

8. Checkpoint Questions

  1. Why does non-IID data make FL more challenging?

  2. What is the purpose of weighting by sample count in FedAvg?

  3. When would you choose FL over centralized training?

    • Consider: privacy, bandwidth, latency, data location
  4. How would you handle a client that goes offline mid-round?

9. Next Steps

Level 2: Multi-Process Simulation

Run server and clients in separate terminal windows:

# Terminal 1: Start server
python server.py

# Terminal 2-4: Start clients
python client.py --cid 0
python client.py --cid 1
python client.py --cid 2

Level 3: Raspberry Pi Deployment

See textbook Chapter 17 for:

  • Running Flower clients on Pi devices
  • Handling network unreliability
  • Training on real sensor data

Three-Tier Activities

Level 1 (Notebook). Environment: local Jupyter or Colab, no real network required.

Suggested workflow:

  1. Use the notebook to run Flower-based FL simulations on a single machine:
    • multiple logical clients (processes) training a shared model (e.g., MNIST).
  2. Implement at least two partitioning strategies:
    • IID partitions (each client has a representative slice),
    • non-IID partitions (each client sees only a subset of labels).
  3. Record convergence behaviours:
    • accuracy vs round for IID vs non-IID,
    • effect of changing local epochs, client fraction, and learning rate.
  4. Compare final FL performance with a centralised training baseline.

Level 2: LAN / Multi-VM Cluster

Here you move beyond single-machine simulation to a “small cluster” on your LAN (or multiple VMs on one host).

  • Start the Flower server on one machine (or VM).
  • Run 2–3 clients on other machines/VMs, each with its own data partition.
  • Use:
  • Observe:
    • how network latency and client dropouts affect round time,
    • how different choices of client fraction and number of rounds affect convergence.

Level 3: Raspberry Pi Cluster

Deploy an FL experiment to a small Raspberry Pi cluster.

  1. Choose a simple task (e.g., digit recognition, small sensor-based classifier) and port your client code to Pis.
  2. Run the Flower server on a laptop/desktop; run clients on 2–3 Pis with local datasets (e.g., different sensors/locations).
  3. Monitor:
    • per-round duration and CPU/memory usage on the Pis,
    • network throughput (roughly how many bytes per round),
    • convergence behaviour compared with your Level 1/2 experiments.
  4. Reflect on:
    • when FL is preferable to centralised training (privacy, bandwidth, regulation),
    • how FL interacts with LAB18’s on-device learning (per-device adaptation) and LAB15’s energy budget constraints.
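For the monitoring step above, a rough lower bound on per-round network traffic can be computed from the parameter arrays a client exchanges with the server. The layer shapes below are hypothetical, chosen only for illustration; real traffic adds serialization and transport overhead on top of this.

```python
import numpy as np

# Hypothetical model parameters, as the list of NumPy arrays a
# Flower-style client would serialize and send each round.
params = [
    np.zeros((32, 3, 3, 3), dtype=np.float32),  # conv kernel
    np.zeros(32, dtype=np.float32),             # conv bias
    np.zeros((10, 128), dtype=np.float32),      # dense weights
    np.zeros(10, dtype=np.float32),             # dense bias
]

payload_bytes = sum(p.nbytes for p in params)
per_client_round = 2 * payload_bytes  # download global model + upload update

print(f"Model payload: {payload_bytes / 1024:.1f} KiB")
print(f"Per client per round (down + up): {per_client_round / 1024:.1f} KiB")
```

Comparing this estimate against bytes actually observed on the Pi's network interface shows how much overhead the framework and transport add.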

Try It Yourself: Executable Python Examples

The following code blocks are fully executable and demonstrate key federated learning concepts. Each example is self-contained and can be run directly in this Quarto document.

Example 1: FedAvg Weighted Averaging Simulation

This example demonstrates how FedAvg aggregates model weights from multiple clients using weighted averaging based on dataset sizes.

Code
import numpy as np
import matplotlib.pyplot as plt

# Simulate client model weights (3 clients, 5 parameters each)
client_weights = [
    np.array([0.8, 0.5, 0.3, 0.9, 0.4]),  # Client 1
    np.array([0.7, 0.6, 0.4, 0.8, 0.5]),  # Client 2
    np.array([0.9, 0.4, 0.5, 0.7, 0.6])   # Client 3
]

# Dataset sizes for each client
dataset_sizes = np.array([100, 200, 150])  # Total: 450 samples

# Simple averaging (incorrect - treats all clients equally)
simple_avg = np.mean(client_weights, axis=0)

# FedAvg weighted averaging (correct - weights by dataset size)
total_samples = np.sum(dataset_sizes)
weighted_avg = np.zeros(5)

for i, weights in enumerate(client_weights):
    weight_factor = dataset_sizes[i] / total_samples
    weighted_avg += weight_factor * weights
    print(f"Client {i+1}: {dataset_sizes[i]} samples ({weight_factor*100:.1f}% weight)")

print(f"\nSimple Average: {simple_avg}")
print(f"Weighted Average (FedAvg): {weighted_avg}")

# Visualize the difference
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(5)
width = 0.25

ax.bar(x - width, client_weights[0], width, label='Client 1 (100 samples)', alpha=0.8)
ax.bar(x, client_weights[1], width, label='Client 2 (200 samples)', alpha=0.8)
ax.bar(x + width, client_weights[2], width, label='Client 3 (150 samples)', alpha=0.8)
ax.plot(x, simple_avg, 'r--', marker='o', label='Simple Average', linewidth=2)
ax.plot(x, weighted_avg, 'g-', marker='s', label='FedAvg (Weighted)', linewidth=2)

ax.set_xlabel('Parameter Index')
ax.set_ylabel('Parameter Value')
ax.set_title('FedAvg Weighted Averaging vs Simple Averaging')
ax.set_xticks(x)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insight: Client 2 has the most influence (200/450 = 44.4%) because it has")
print("the most training data. This prevents bias toward clients with less representative data.")
Client 1: 100 samples (22.2% weight)
Client 2: 200 samples (44.4% weight)
Client 3: 150 samples (33.3% weight)

Simple Average: [0.8 0.5 0.4 0.8 0.5]
Weighted Average (FedAvg): [0.78888889 0.51111111 0.41111111 0.78888889 0.51111111]

Key Insight: Client 2 has the most influence (200/450 = 44.4%) because it has
the most training data. This prevents bias toward clients with less representative data.

Example 2: IID vs Non-IID Data Partitioning

This example shows how different data partitioning strategies affect federated learning by creating IID and Non-IID distributions.

Code
import numpy as np
import matplotlib.pyplot as plt

# Create synthetic dataset (1000 samples, 10 classes)
np.random.seed(42)
num_samples = 1000
num_classes = 10
num_clients = 5

# Generate labels
labels = np.random.randint(0, num_classes, num_samples)

# IID Partitioning: Random uniform split
def partition_iid(labels, num_clients):
    """Each client gets random subset with similar distribution"""
    indices = np.random.permutation(len(labels))
    partition_size = len(labels) // num_clients
    partitions = []

    for i in range(num_clients):
        start = i * partition_size
        end = start + partition_size if i < num_clients - 1 else len(labels)
        client_indices = indices[start:end]
        partitions.append(labels[client_indices])

    return partitions

# Non-IID Partitioning: Label skew (each client gets only 2 classes)
def partition_non_iid(labels, num_clients, classes_per_client=2):
    """Each client gets only a subset of classes"""
    partitions = [[] for _ in range(num_clients)]

    for client_id in range(num_clients):
        # Assign specific classes to this client (rotating)
        start_class = (client_id * classes_per_client) % num_classes
        client_classes = [(start_class + i) % num_classes for i in range(classes_per_client)]

        for cls in client_classes:
            class_indices = np.where(labels == cls)[0]
            # Split class samples among clients that have this class
            samples_per_client = len(class_indices) // (num_clients // (num_classes // classes_per_client))
            start_idx = (client_id % (num_clients // (num_classes // classes_per_client))) * samples_per_client
            end_idx = start_idx + samples_per_client
            partitions[client_id].extend(labels[class_indices[start_idx:end_idx]])

    return [np.array(p) for p in partitions]

# Create both partitions
iid_parts = partition_iid(labels, num_clients)
non_iid_parts = partition_non_iid(labels, num_clients, classes_per_client=2)

# Visualize distributions
fig, axes = plt.subplots(2, num_clients, figsize=(15, 6))

for i in range(num_clients):
    # IID distribution
    iid_dist = np.bincount(iid_parts[i], minlength=num_classes)
    axes[0, i].bar(range(num_classes), iid_dist, color='skyblue', alpha=0.8)
    axes[0, i].set_title(f'Client {i+1}\n({len(iid_parts[i])} samples)')
    axes[0, i].set_ylim(0, max([max(np.bincount(p, minlength=num_classes)) for p in iid_parts]) * 1.1)
    if i == 0:
        axes[0, i].set_ylabel('IID\nSample Count')

    # Non-IID distribution
    non_iid_dist = np.bincount(non_iid_parts[i], minlength=num_classes)
    axes[1, i].bar(range(num_classes), non_iid_dist, color='coral', alpha=0.8)
    axes[1, i].set_xlabel('Class')
    if i == 0:
        axes[1, i].set_ylabel('Non-IID\nSample Count')

plt.suptitle('Data Distribution: IID vs Non-IID Partitioning', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Calculate entropy (measure of distribution uniformity)
def calculate_entropy(partition, num_classes):
    dist = np.bincount(partition, minlength=num_classes)
    probs = dist / np.sum(dist)
    entropy = -np.sum([p * np.log(p + 1e-10) for p in probs if p > 0])
    return entropy

print("Entropy Analysis (higher = more uniform distribution):")
print(f"Maximum possible entropy: {np.log(num_classes):.3f}")
print("\nIID Partitions:")
for i, part in enumerate(iid_parts):
    ent = calculate_entropy(part, num_classes)
    print(f"  Client {i+1}: {ent:.3f} ({ent/np.log(num_classes)*100:.1f}% of max)")

print("\nNon-IID Partitions:")
for i, part in enumerate(non_iid_parts):
    ent = calculate_entropy(part, num_classes)
    print(f"  Client {i+1}: {ent:.3f} ({ent/np.log(num_classes)*100:.1f}% of max)")

Entropy Analysis (higher = more uniform distribution):
Maximum possible entropy: 2.303

IID Partitions:
  Client 1: 2.279 (99.0% of max)
  Client 2: 2.293 (99.6% of max)
  Client 3: 2.276 (98.8% of max)
  Client 4: 2.284 (99.2% of max)
  Client 5: 2.282 (99.1% of max)

Non-IID Partitions:
  Client 1: 0.678 (29.4% of max)
  Client 2: 0.690 (30.0% of max)
  Client 3: 0.692 (30.0% of max)
  Client 4: 0.693 (30.1% of max)
  Client 5: 0.690 (30.0% of max)

Example 3: Convergence Comparison Visualization

This example simulates and compares the convergence behavior of federated learning with IID vs Non-IID data.

Code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulation parameters
num_rounds = 20
num_clients = 5

# Simulate convergence for IID data
def simulate_convergence_iid(num_rounds, base_acc=0.95, noise_level=0.02):
    """Simulate smooth convergence with IID data"""
    rounds = np.arange(1, num_rounds + 1)
    # Fast exponential convergence
    accuracy = base_acc * (1 - np.exp(-rounds / 4))
    # Add small random noise
    accuracy += np.random.normal(0, noise_level, num_rounds)
    accuracy = np.clip(accuracy, 0, 1)
    return rounds, accuracy

# Simulate convergence for Non-IID data
def simulate_convergence_non_iid(num_rounds, base_acc=0.95, degradation=0.15, noise_level=0.03):
    """Simulate slower, oscillating convergence with Non-IID data"""
    rounds = np.arange(1, num_rounds + 1)
    # Slower convergence with accuracy penalty
    accuracy = (base_acc - degradation) * (1 - np.exp(-rounds / 6))
    # Add oscillation due to client drift
    accuracy += 0.05 * np.sin(rounds / 2)
    # Add larger noise
    accuracy += np.random.normal(0, noise_level, num_rounds)
    accuracy = np.clip(accuracy, 0, 1)
    return rounds, accuracy

# Generate convergence curves
rounds_iid, acc_iid = simulate_convergence_iid(num_rounds)
rounds_non_iid, acc_non_iid = simulate_convergence_non_iid(num_rounds)

# Calculate loss (inverse of accuracy for visualization)
loss_iid = 1 - acc_iid
loss_non_iid = 1 - acc_non_iid

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
axes[0].plot(rounds_iid, acc_iid, 'g-', marker='o', label='IID Data', linewidth=2, markersize=6)
axes[0].plot(rounds_non_iid, acc_non_iid, 'r--', marker='s', label='Non-IID Data', linewidth=2, markersize=6)
axes[0].axhline(y=0.9, color='gray', linestyle=':', alpha=0.5, label='90% Target')
axes[0].set_xlabel('Communication Round', fontsize=12)
axes[0].set_ylabel('Global Model Accuracy', fontsize=12)
axes[0].set_title('Convergence: IID vs Non-IID Data', fontsize=13, fontweight='bold')
axes[0].legend(loc='lower right')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(0, 1)

# Loss comparison
axes[1].plot(rounds_iid, loss_iid, 'g-', marker='o', label='IID Data', linewidth=2, markersize=6)
axes[1].plot(rounds_non_iid, loss_non_iid, 'r--', marker='s', label='Non-IID Data', linewidth=2, markersize=6)
axes[1].set_xlabel('Communication Round', fontsize=12)
axes[1].set_ylabel('Global Model Loss', fontsize=12)
axes[1].set_title('Loss Curves', fontsize=13, fontweight='bold')
axes[1].legend(loc='upper right')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0, max(loss_iid.max(), loss_non_iid.max()) * 1.1)

plt.tight_layout()
plt.show()

# Performance metrics
print("Performance Comparison:")
print(f"\nIID Data:")
print(f"  Final Accuracy: {acc_iid[-1]:.2%}")
print(f"  Rounds to 90%: {np.argmax(acc_iid >= 0.9) + 1 if any(acc_iid >= 0.9) else 'Not reached'}")
print(f"  Convergence Rate: Fast (smooth exponential)")

print(f"\nNon-IID Data:")
print(f"  Final Accuracy: {acc_non_iid[-1]:.2%}")
print(f"  Rounds to 90%: {np.argmax(acc_non_iid >= 0.9) + 1 if any(acc_non_iid >= 0.9) else 'Not reached'}")
print(f"  Convergence Rate: Slow (oscillating, client drift)")

degradation = (acc_iid[-1] - acc_non_iid[-1]) / acc_iid[-1] * 100
print(f"\nAccuracy Degradation: {degradation:.1f}%")
print(f"Additional rounds needed: ~{int((num_rounds * degradation) / 100)}")

Performance Comparison:

IID Data:
  Final Accuracy: 91.54%
  Rounds to 90%: 13
  Convergence Rate: Fast (smooth exponential)

Non-IID Data:
  Final Accuracy: 75.02%
  Rounds to 90%: Not reached
  Convergence Rate: Slow (oscillating, client drift)

Accuracy Degradation: 18.0%
Additional rounds needed: ~3

Example 4: Privacy Budget Demonstration

This example demonstrates the privacy-utility tradeoff in differential privacy for federated learning.

Code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Privacy parameters
epsilon_values = [0.1, 0.5, 1.0, 5.0, 10.0, 100.0]  # Privacy budgets
delta = 1e-5
sensitivity = 1.0  # Maximum L2 norm of gradients

def calculate_noise_scale(epsilon, delta, sensitivity):
    """Calculate Gaussian noise scale for (Δ, Ύ)-differential privacy"""
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

def simulate_dp_training(epsilon, base_accuracy=0.95):
    """Simulate how DP noise affects model accuracy"""
    # Privacy-utility tradeoff: lower epsilon = more noise = lower accuracy
    noise_scale = calculate_noise_scale(epsilon, delta, sensitivity)

    # Model accuracy decreases with more noise (lower epsilon)
    privacy_penalty = 1.0 / (1.0 + epsilon)
    final_accuracy = base_accuracy * (1 - 0.3 * privacy_penalty)

    return final_accuracy, noise_scale

# Calculate accuracy for different privacy budgets
results = []
for eps in epsilon_values:
    acc, noise = simulate_dp_training(eps)
    privacy_level = 'High' if eps < 1.0 else ('Medium' if eps < 5.0 else 'Low')
    results.append({
        'epsilon': eps,
        'accuracy': acc,
        'noise_scale': noise,
        'privacy_level': privacy_level
    })

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Privacy-Utility Tradeoff
epsilons = [r['epsilon'] for r in results]
accuracies = [r['accuracy'] for r in results]

axes[0].semilogx(epsilons, accuracies, 'bo-', linewidth=2, markersize=8)
axes[0].axhline(y=0.95, color='green', linestyle='--', alpha=0.5, label='No Privacy (95%)')
axes[0].axvline(x=1.0, color='orange', linestyle=':', alpha=0.5, label='ε=1 (Strong Privacy)')
axes[0].set_xlabel('Privacy Budget (ε)', fontsize=12)
axes[0].set_ylabel('Model Accuracy', fontsize=12)
axes[0].set_title('Privacy-Utility Tradeoff', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend()

# Add privacy zone shading
axes[0].axvspan(0, 1, alpha=0.1, color='green')  # high-privacy zone (legend drawn above)
axes[0].axvspan(1, 5, alpha=0.1, color='yellow')
axes[0].axvspan(5, 100, alpha=0.1, color='red')

# Noise scale vs epsilon
noise_scales = [r['noise_scale'] for r in results]

axes[1].loglog(epsilons, noise_scales, 'rs-', linewidth=2, markersize=8)
axes[1].set_xlabel('Privacy Budget (ε)', fontsize=12)
axes[1].set_ylabel('Noise Scale (σ)', fontsize=12)
axes[1].set_title('Gaussian Noise Scale vs Privacy Budget', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, which='both')

plt.tight_layout()
plt.show()

# Summary table
print("\nPrivacy-Utility Analysis")
print("=" * 80)
print(f"{'Epsilon':<10} {'Privacy':<15} {'Accuracy':<12} {'Noise Scale':<15} {'Use Case'}")
print("-" * 80)

for r in results:
    if r['epsilon'] <= 1.0:
        use_case = "Medical/Financial"
    elif r['epsilon'] <= 5.0:
        use_case = "General Apps"
    else:
        use_case = "Low-risk Apps"

    print(f"{r['epsilon']:<10.1f} {r['privacy_level']:<15} {r['accuracy']:<12.1%} "
          f"{r['noise_scale']:<15.4f} {use_case}")

print("\nKey Insights:")
print(f"‱ Δ < 1: Strong privacy guarantee, but accuracy drops by ~{(0.95-results[0]['accuracy'])*100:.1f}%")
print(f"‱ Δ = 1: Good balance - commonly used for sensitive applications")
print(f"‱ Δ > 10: Weak privacy, minimal accuracy impact")
print(f"‱ Lower Δ → Higher noise → More privacy → Lower utility")


Privacy-Utility Analysis
================================================================================
Epsilon    Privacy         Accuracy     Noise Scale     Use Case
--------------------------------------------------------------------------------
0.1        High            69.1%        48.4481         Medical/Financial
0.5        High            76.0%        9.6896          Medical/Financial
1.0        Medium          80.8%        4.8448          Medical/Financial
5.0        Low             90.2%        0.9690          General Apps
10.0       Low             92.4%        0.4845          Low-risk Apps
100.0      Low             94.7%        0.0484          Low-risk Apps

Key Insights:
• ε < 1: Strong privacy guarantee, but accuracy drops by ~25.9%
• ε = 1: Good balance - commonly used for sensitive applications
• ε > 10: Weak privacy, minimal accuracy impact
• Lower ε → Higher noise → More privacy → Lower utility