# Tutorial 3: Datasets

In this note, we present how to use the out-of-the-box datasets to simulate different federated learning (FL) scenarios. We also introduce how to use customized datasets in EasyFL.

We currently provide four out-of-the-box datasets: FEMNIST, Shakespeare, CIFAR-10, and CIFAR-100. FEMNIST and Shakespeare are adopted from the [LEAF benchmark](https://leaf.cmu.edu/). We plan to integrate and provide more out-of-the-box datasets in the future.

## Out-of-the-box Datasets

The simulation of different FL scenarios is controlled through the configurations. You can refer to the other [tutorial](config.md) to learn more about how to modify configs. In this note, we focus on how to configure the datasets for different simulations.

The following are the dataset configurations.

```yaml
data:
  # The root directory where datasets are stored.
  root: "./data/"
  # The name of the dataset, support: femnist, shakespeare, cifar10, and cifar100.
  dataset: femnist
  # The data distribution of each client, support: iid, niid (for femnist and shakespeare), and dir and class (for cifar datasets).
  # `iid` means independent and identically distributed data.
  # `niid` means non-independent and identically distributed data for FEMNIST and Shakespeare.
  # `dir` means using Dirichlet process to simulate non-iid data, for CIFAR-10 and CIFAR-100 datasets.
  # `class` means partitioning the dataset by label classes, for datasets like CIFAR-10, CIFAR-100.
  split_type: "iid"
  # The minimal number of samples in each client. It is applicable for LEAF datasets and dir simulation of CIFAR-10 and CIFAR-100.
  min_size: 10
  # The fraction of data sampled for LEAF datasets. e.g., 10% means that only 10% of the total dataset size is used.
  data_amount: 0.05
  # The fraction of the number of clients used when the split_type is 'iid'.
  iid_fraction: 0.1
  # Whether to partition users of the dataset into train-test groups. Only applicable to femnist and shakespeare datasets.
  # True means partitioning users of the dataset into train-test groups.
  # False means partitioning each user's samples into train-test groups.
  user: False
  # The fraction of data for training; the rest are for testing.
  train_test_split: 0.9
  # The number of classes in each client. Only applicable when the split_type is 'class'.
  class_per_client: 1
  # The targeted number of clients to construct.
  # For non-LEAF datasets, it is the number of clients the data is split into.
  # For LEAF datasets, it is only used when the split_type is 'class'.
  num_of_clients: 100
  # The parameter for Dirichlet distribution simulation, applicable only when split_type is `dir` for CIFAR datasets.
  alpha: 0.5
  # The targeted distribution of quantities to simulate data quantity heterogeneity.
  # The values should sum up to 1. e.g., [0.1, 0.2, 0.7].
  # The `num_of_clients` should be divisible by `len(weights)`.
  # None means clients are simulated with the same data quantity.
  weights: NULL
```

Among them, `root` is applicable to all datasets. It specifies the directory to store datasets. EasyFL automatically downloads a dataset if it does not exist in the root directory.

Next, we introduce the simulation and configuration for specific datasets.

### FEMNIST and Shakespeare Datasets

The following are basic stats of these two datasets.

FEMNIST

* Overview: Image Dataset
* Details: 3500 users, 62 different classes (10 digits, 26 lowercase, 26 uppercase), images are 28 by 28 pixels (with option to make them all 128 by 128 pixels)
* Task: Image Classification

Shakespeare

* Overview: Text Dataset of Shakespeare Dialogues
* Details: 1129 users (reduced to 660 with our choice of sequence length)
* Task: Next-Character Prediction

Both datasets are non-IID (not independent and identically distributed) in nature.

`split_type`: There are two options for these two datasets: `iid` and `niid`, representing IID data simulation and non-IID data simulation.

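The `split_type` values accepted by each dataset, as listed in the configuration comments above, can be checked before a run starts. The following is a minimal sketch of such a check. `SUPPORTED_SPLITS` and `make_data_config` are hypothetical names introduced here for illustration and are not part of EasyFL; the sketch also assumes that a Python dictionary mirroring the YAML layout (a top-level `data` key) is an acceptable configuration.

```python
# Hypothetical helper, not part of EasyFL: builds the `data` section of a
# configuration and rejects split types the chosen dataset does not support.
SUPPORTED_SPLITS = {
    "femnist": ("iid", "niid"),
    "shakespeare": ("iid", "niid"),
    "cifar10": ("iid", "dir", "class"),
    "cifar100": ("iid", "dir", "class"),
}


def make_data_config(dataset, split_type, **overrides):
    """Return {"data": {...}} after validating dataset and split_type."""
    if dataset not in SUPPORTED_SPLITS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if split_type not in SUPPORTED_SPLITS[dataset]:
        raise ValueError(
            f"split_type {split_type!r} is not supported for {dataset!r}; "
            f"choose one of {SUPPORTED_SPLITS[dataset]}"
        )
    data = {"dataset": dataset, "split_type": split_type}
    data.update(overrides)  # e.g. data_amount, min_size, user
    return {"data": data}


# Non-IID FEMNIST using 5% of the data, split within each user's samples.
femnist_niid = make_data_config(
    "femnist", "niid", data_amount=0.05, min_size=10,
    user=False, train_test_split=0.9,
)
```

Under the assumption stated above, the resulting dictionary could be passed to `easyfl.init(...)` in place of a hand-written configuration.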
Five hyper-parameters determine the simulated dataset: `min_size`, `data_amount`, `iid_fraction`, `train_test_split`, and `user`.

`user` is a boolean that determines whether the dataset is partitioned into train and test groups by user or by sample.
`user: True` means partitioning users of the dataset into train-test groups, i.e. some users are for training and some users are for testing.
`user: False` means partitioning each user's samples into train-test groups, i.e. the data in each client is partitioned into a training set and a testing set.

Note: we normally use `test_mode: test_in_clients` for these two datasets.

#### IID Simulation

In IID simulation, data are randomly partitioned into multiple clients.

The number of clients is determined by `data_amount` and `iid_fraction`.

#### Non-IID Simulation

Since FEMNIST and Shakespeare are non-IID in nature, each user of the dataset is regarded as a client.

`data_amount` determines the number of clients that participate in training.

### CIFAR-10 and CIFAR-100 Datasets

> The **CIFAR-10** dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

> The **CIFAR-100** dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 50000 training images and 10000 test images.

`split_type`: There are three options for CIFAR datasets: `iid`, `dir`, and `class`.

Three hyper-parameters determine the simulated dataset: `num_of_clients`, `class_per_client`, and `alpha`.

#### IID Simulation

In IID simulation, the training images of the datasets are randomly partitioned into `num_of_clients` clients.

#### Non-IID Simulation

We can simulate non-IID CIFAR datasets by Dirichlet process (`dir`) or by label class (`class`).

`alpha` controls the level of heterogeneity for `dir` simulation.

`class_per_client` determines the number of classes in each client.

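To give an intuition for how `alpha` shapes the `dir` simulation, the following is a minimal sketch of one common way to do Dirichlet-based partitioning: for every class, draw per-client proportions from a symmetric Dirichlet distribution and split that class's samples accordingly. It is not EasyFL's actual implementation; the function names are hypothetical, only the standard library is used, and `min_size` is ignored for brevity.

```python
import random


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # extremely small alpha can underflow; fall back to uniform
        return [1.0 / k] * k
    return [d / total for d in draws]


def dirichlet_partition(labels, num_of_clients, alpha, seed=0):
    """Assign sample indices to clients, class by class.

    Returns a list with one list of sample indices per client.
    Smaller alpha gives more skewed label distributions per client.
    """
    rng = random.Random(seed)

    # Group sample indices by their label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_of_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = dirichlet_sample(alpha, num_of_clients, rng)

        # Turn the proportions into cut points over this class's indices.
        n = len(indices)
        start, cumulative = 0, 0.0
        for cid, p in enumerate(proportions):
            cumulative += p
            end = n if cid == num_of_clients - 1 else round(cumulative * n)
            end = min(max(end, start), n)
            clients[cid].extend(indices[start:end])
            start = end
    return clients


# Toy labels standing in for a 10-class dataset such as CIFAR-10.
toy_labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(toy_labels, num_of_clients=5, alpha=0.5)
```

Every sample index ends up in exactly one client. Re-running with a larger `alpha` (for example `100`) should give clients whose class proportions are close to uniform, while a small `alpha` concentrates each class in a few clients.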
## Customize Datasets

EasyFL also supports integrating customized datasets to simulate federated learning.

You can use the following classes to integrate a customized dataset: [FederatedImageDataset](../api.html#easyfl.datasets.FederatedImageDataset), [FederatedTensorDataset](../api.html#easyfl.datasets.FederatedTensorDataset), and [FederatedTorchDataset](../api.html#easyfl.datasets.FederatedTorchDataset).

The following is an example that integrates [nine person re-identification datasets](https://arxiv.org/abs/2008.11560), where each client contains one dataset.

```python
import easyfl
import os
from torchvision import transforms
from easyfl.datasets import FederatedImageDataset

# Augmentations applied to training images.
TRANSFORM_TRAIN_LIST = transforms.Compose([
    transforms.Resize((256, 128), interpolation=3),
    transforms.Pad(10),
    transforms.RandomCrop((256, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
# Deterministic preprocessing applied to testing images.
TRANSFORM_VAL_LIST = transforms.Compose([
    transforms.Resize(size=(256, 128), interpolation=3),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

DATASETS = ["MSMT17", "Duke", "Market", "cuhk03", "prid", "cuhk01", "viper", "3dpes", "ilids"]


# Prepare customized training data: one client per dataset.
def prepare_train_data(data_dir):
    client_ids = []
    roots = []
    for db in DATASETS:
        client_ids.append(db)
        data_path = os.path.join(data_dir, db, "pytorch")
        roots.append(os.path.join(data_path, "train_all"))
    data = FederatedImageDataset(root=roots,
                                 simulated=True,
                                 do_simulate=False,
                                 transform=TRANSFORM_TRAIN_LIST,
                                 client_ids=client_ids)
    return data


# Prepare customized testing data: a gallery and a query split per dataset.
def prepare_test_data(data_dir):
    roots = []
    client_ids = []
    for db in DATASETS:
        test_gallery = os.path.join(data_dir, db, 'pytorch', 'gallery')
        test_query = os.path.join(data_dir, db, 'pytorch', 'query')
        roots.extend([test_gallery, test_query])
        client_ids.extend([f"{db}_gallery", f"{db}_query"])
    data = FederatedImageDataset(root=roots,
                                 simulated=True,
                                 do_simulate=False,
                                 transform=TRANSFORM_VAL_LIST,
                                 client_ids=client_ids)
    return data


if __name__ == '__main__':
    config = {...}
    data_dir = "datasets/"
    train_data, test_data = prepare_train_data(data_dir), prepare_test_data(data_dir)
    # Register the customized datasets before initializing EasyFL.
    easyfl.register_dataset(train_data, test_data)
    easyfl.init(config)
    easyfl.run()
```

The folder structure of these datasets is as follows:

```
|-- MSMT17
|   |-- pytorch
|   |   |-- gallery
|   |   |-- query
|   |   |-- train
|   |   |-- train_all
|   |   `-- val
|-- cuhk01
|   |-- pytorch
|   |   |-- gallery
|   |   |-- query
|   |   |-- train
|   |   |-- train_all
|   ...
```

If you want to access these datasets, please [email us](mailto:weiming001@e.ntu.edu.sg) with:

1. A short self-introduction.
2. The purposes of using these datasets.

*⚠️ Further distribution of the datasets is prohibited.*

### Create Your Own Federated Dataset

If the provided federated dataset classes are not enough, you can implement your own federated dataset by inheriting from [FederatedDataset](../api.html#easyfl.datasets.FederatedDataset) and implementing its methods.

You can refer to [FederatedImageDataset](../api.html#easyfl.datasets.FederatedImageDataset), [FederatedTensorDataset](../api.html#easyfl.datasets.FederatedTensorDataset), and [FederatedTorchDataset](../api.html#easyfl.datasets.FederatedTorchDataset) for examples of how to implement it.