# Homo-NN: Customize your Dataset

The FATE system primarily supports tabular data as its standard data format. Non-tabular data, such as images, text, mixed data, or relational data, can still be used in neural networks through the Dataset feature of the NN module, which lets you customize datasets for more complex data scenarios. This tutorial covers the Dataset feature in the Homo-NN module and shows how to customize a dataset, using the MNIST handwriting recognition task as an example.

## Prepare MNIST Data

Please download the MNIST dataset from the link below and place it in the project examples/data folder:
https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/examples/data/mnist.zip

This is a simplified version of the MNIST dataset with ten categories. The images are sorted by label into ten folders named 0 to 9, and the dataset has been subsampled to reduce the number of samples.

The original MNIST dataset is available at:
http://yann.lecun.com/exdb/mnist/

In [5]:
! ls ../../../../examples/data/mnist

0  1  2  3  4  5  6  7	8  9


## Dataset

FATE 1.10 introduced a new base class for datasets called [Dataset](../../../../python/federatedml/nn/dataset/base.py), which is based on PyTorch's Dataset class. It allows users to create custom datasets according to their needs. Usage is similar to PyTorch's Dataset class, except that FATE-NN requires two additional interfaces for data reading and training: load() and get_sample_ids().

To create a custom dataset in Homo-NN, users need to:

- Develop a new dataset class that inherits from the Dataset class
- Implement the \_\_len\_\_() and \_\_getitem\_\_() methods, which are consistent with PyTorch's Dataset usage. The \_\_len\_\_() method should return the length of the dataset, while the \_\_getitem\_\_() method should return the corresponding data at the specified index
- Implement the load() and get_sample_ids() methods
  
For those unfamiliar with PyTorch's Dataset class, more information can be found in the PyTorch documentation: [PyTorch Dataset Documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
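
The checklist above can be sketched with a toy CSV-backed dataset. This is a minimal illustration, not FATE's implementation: `Dataset` here is a local stand-in so the snippet runs without FATE installed (in a real project you would inherit from `federatedml.nn.dataset.base.Dataset`), and `CSVDataset`, the column layout, and the temporary file are all hypothetical.

```python
import csv
import os
import tempfile


class Dataset:
    """Local stand-in for federatedml.nn.dataset.base.Dataset (assumption)."""

    def load(self, path):
        raise NotImplementedError

    def get_sample_ids(self):
        raise NotImplementedError


class CSVDataset(Dataset):
    """Hypothetical dataset: each CSV row is `id,label,feature_1,...`."""

    def __init__(self):
        super().__init__()
        self.ids, self.labels, self.features = [], [], []

    def load(self, path):
        # read every row from the given file path
        with open(path, newline="") as f:
            for row in csv.reader(f):
                self.ids.append(row[0])
                self.labels.append(int(row[1]))
                self.features.append([float(v) for v in row[2:]])
        return self

    def get_sample_ids(self):
        return self.ids

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# write a tiny CSV file, load it back, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows([["a", 0, 0.1, 0.2], ["b", 1, 0.3, 0.4]])
    tmp_path = tmp.name

toy_ds = CSVDataset().load(tmp_path)
os.remove(tmp_path)
print(len(toy_ds), toy_ds[1], toy_ds.get_sample_ids())
```

In a real job, `load()` would receive the path bound through the reader component rather than a temporary file.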

### load()

The first additional interface required is load(). This interface receives a file path and allows users to read data directly from the local file system. When submitting a task, the data path can be specified through the reader component. Homo-NN will use the user-specified Dataset class, utilizing the load() interface to read data from the specified path and complete the loading of the dataset for training. For more information, please refer to the source code in /federatedml/nn/dataset/base.py.

### get_sample_ids()

The second additional interface is get_sample_ids(). This interface should return a list of sample IDs, which can be either integers or strings, and the list should have the same length as the dataset. In practice, you can skip implementing this interface when using Homo-NN, because the Homo-NN component automatically generates IDs for the samples.
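
As an illustration of the contract described above (IDs are integers or strings, one per sample), the hypothetical helper `check_sample_ids` below validates a dataset before a job is submitted. It is not part of FATE, and the uniqueness check is an extra assumption of this sketch rather than something the interface is documented to require.

```python
def check_sample_ids(dataset):
    """Hypothetical helper: validate what get_sample_ids() returns."""
    ids = dataset.get_sample_ids()
    if len(ids) != len(dataset):
        raise ValueError(f"expected {len(dataset)} ids, got {len(ids)}")
    if not all(isinstance(i, (int, str)) for i in ids):
        raise TypeError("sample ids must be integers or strings")
    if len(set(ids)) != len(ids):  # extra assumption: ids should be unique
        raise ValueError("sample ids are not unique")
    return True


class TinyDataset:
    """Stand-in object with just the methods the helper needs."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def get_sample_ids(self):
        return [f"img_{i}" for i in range(len(self.samples))]


ids_ok = check_sample_ids(TinyDataset([10, 20, 30]))
print(ids_ok)
```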

## Example: Implement a simple image dataset

To better understand how to customize a Dataset, we implement a simple image dataset that reads MNIST images, and then complete a federated image classification task in a horizontal scenario.
For convenience, we use the Jupyter interface save_to_fate to save the code to federatedml.nn.dataset as mnist_dataset.py. You can also copy the code file to that directory manually.

### jupyter: save_to_fate()

In [1]:
from pipeline.component.nn import save_to_fate

### The MNIST Dataset

Here we implement the Dataset, and save it using save_to_fate().

In [8]:
%%save_to_fate dataset mnist_dataset.py
import numpy as np
from federatedml.nn.dataset.base import Dataset
from torchvision.datasets import ImageFolder
from torchvision import transforms


class MNISTDataset(Dataset):
    
    def __init__(self, flatten_feature=False): # flatten feature or not 
        super(MNISTDataset, self).__init__()
        self.image_folder = None
        self.ids = None
        self.flatten_feature = flatten_feature
        
    def load(self, path):  # read data from path, and set sample ids
        # read using ImageFolder
        self.image_folder = ImageFolder(root=path, transform=transforms.Compose([transforms.ToTensor()]))
        # filename as the image id
        ids = []
        for image_name in self.image_folder.imgs:
            ids.append(image_name[0].split('/')[-1].replace('.jpg', ''))
        self.ids = ids
        return self

    def get_sample_ids(self):  # implement the get sample id interface, simply return ids
        return self.ids
    
    def __len__(self):  # return the length of the dataset
        return len(self.image_folder)

    def __getitem__(self, idx):  # return (image tensor, label) at idx
        ret = self.image_folder[idx]
        if self.flatten_feature:
            img = ret[0][0].flatten()  # first channel, flattened to a 784-dim tensor
            return img, ret[1]  # return flattened tensor and label
        else:
            return ret
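
The ID extraction in `load()` assumes `/` path separators and a `.jpg` extension. A minimal, more portable sketch uses `pathlib`. Here `path_to_sample_id` is a hypothetical helper, and the sample paths are made up for illustration, following the `(path, class_index)` tuples that `ImageFolder.imgs` holds.

```python
from pathlib import Path


def path_to_sample_id(image_path):
    """Hypothetical helper: file name without directory or extension."""
    return Path(image_path).stem


# made-up entries in the (path, class_index) form of ImageFolder.imgs
imgs = [("mnist/0/img_1.jpg", 0), ("mnist/4/img_3.png", 4)]
sample_ids = [path_to_sample_id(p) for p, _ in imgs]
print(sample_ids)
```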

After we implement the dataset, we can test it locally:

In [9]:
from federatedml.nn.dataset.mnist_dataset import MNISTDataset

ds = MNISTDataset(flatten_feature=True)


In [10]:
# load MNIST data and check 
ds.load('../../../../examples/data/mnist/')
print(len(ds))
print(ds[0])
print(ds.get_sample_ids()[0])

1309
(tensor([0.0118, 0.0000, 0.0000, 0.0118, 0.0275, 0.0118, 0.0000, 0.0118, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0118, 0.0000, 0.0000, 0.0118, 0.0314, 0.0000,
        0.0000, 0.0118, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0196, 0.0000, 0.0471, 0.0000, 0.0627, 0.0000, 0.0000, 0.0157, 0.0000,
        0.0078, 0.0314, 0.0118, 0.0000, 0.0157, 0.0314, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0078, 0.0039, 0.0471, 0.0000, 0.0314, 0.0000, 0.0000, 0.0235, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0235, 0.0275, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0118, 0.0000, 0.0000, 0.0078,
        0.0118, 0.0000, 0.0000, 0.0000, 0.0471, 0.0000, 0.0000, 0.0902, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0431, 0.0118, 0.0000, 0.0000, 0.0157, 0.0000,
        0.0000, 0.0000, 0.0000, 0.

## Test Your Dataset

Before submitting a task, it is possible to test locally. As we mentioned in [1.1 Homo-NN Quick Start: A Binary Classification Task](Homo-NN-Quick-Start.ipynb), in Homo-NN, FATE uses the fedavg_trainer by default. Custom datasets, models, and trainers can be used for local debugging to test if the program runs correctly. **Note that during local testing, all federation processes will be skipped and the model will not perform federated averaging.**

In [11]:
from federatedml.nn.homo.trainer.fedavg_trainer import FedAVGTrainer
trainer = FedAVGTrainer(epochs=3, batch_size=256, shuffle=True, data_loader_worker=8, pin_memory=False) # set parameter

In [12]:
trainer.local_mode() # !! Be sure to enable local_mode to skip the federation process !!

In [15]:
import torch as t
from pipeline import fate_torch_hook
fate_torch_hook(t)
# our simple classification model:
model = t.nn.Sequential(
    t.nn.Linear(784, 32),
    t.nn.ReLU(),
    t.nn.Linear(32, 10),
    t.nn.Softmax(dim=1)
)

trainer.set_model(model) # set model

In [16]:
optimizer = t.optim.Adam(model.parameters(), lr=0.01)  # optimizer
loss = t.nn.CrossEntropyLoss()  # loss function
trainer.train(train_set=ds, optimizer=optimizer, loss=loss)  # use dataset we just developed

epoch is 0
100%|██████████| 6/6 [00:00<00:00, 11.28it/s]
epoch loss is 2.5860611556412336
epoch is 1
100%|██████████| 6/6 [00:00<00:00, 12.40it/s]
epoch loss is 2.2709667185411098
epoch is 2
100%|██████████| 6/6 [00:00<00:00, 11.20it/s]
epoch loss is 2.0878872911469277


In the train() function of the Trainer, your dataset is iterated using a PyTorch DataLoader.
The program runs correctly, so we can now submit a federated task.
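
The progress bars above show 6 iterations per epoch. Assuming the trainer's DataLoader keeps the final partial batch (PyTorch's default, `drop_last=False`), that count follows from the dataset size and `batch_size`:

```python
import math

n_samples = 1309   # len(ds) printed earlier
batch_size = 256   # value passed to FedAVGTrainer

# number of batches per epoch when the last partial batch is kept
n_batches = math.ceil(n_samples / batch_size)
# size of that last, smaller batch
last_batch = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch)
```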

## Submit a task with your dataset

### Import Components

In [42]:
import torch as t
from torch import nn
from pipeline import fate_torch_hook
from pipeline.component import HomoNN
from pipeline.backend.pipeline import PipeLine
from pipeline.component import Reader, Evaluation, DataTransform
from pipeline.interface import Data, Model

t = fate_torch_hook(t)


### Bind data path to name & namespace

Here, we use the pipeline to bind a path to a name and namespace. We can then use the reader component to pass this path to the 'load' interface of the dataset.
The trainer receives this dataset in train() and iterates over it with a PyTorch DataLoader. **Please note that this tutorial uses the standalone version. If you are using a cluster version, you need to bind the data to the corresponding name and namespace on each machine.**
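
Before binding a path, it can help to confirm that the folder has one subfolder per class, which is the layout `ImageFolder` expects. The sketch below uses a hypothetical helper, `list_class_folders`, and a throwaway directory that mimics the MNIST layout; for real data you would point it at the actual data folder instead.

```python
import os
import tempfile


def list_class_folders(root):
    """Hypothetical helper: sorted names of the sub-directories of root."""
    return sorted(e.name for e in os.scandir(root) if e.is_dir())


# build a throwaway folder that mimics the MNIST layout (0 ... 9)
with tempfile.TemporaryDirectory() as root:
    for label in range(10):
        os.mkdir(os.path.join(root, str(label)))
    class_folders = list_class_folders(root)

print(class_folders)
```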

In [34]:
import os
# bind data path to name & namespace
fate_project_path = os.path.abspath('../../../../')
host = 10000
guest = 9999
arbiter = 10000
pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host,
                                                                            arbiter=arbiter)

data_0 = {"name": "mnist_guest", "namespace": "experiment"}
data_1 = {"name": "mnist_host", "namespace": "experiment"}

data_path_0 = fate_project_path + '/examples/data/mnist'
data_path_1 = fate_project_path + '/examples/data/mnist'
pipeline.bind_table(name=data_0['name'], namespace=data_0['namespace'], path=data_path_0)
pipeline.bind_table(name=data_1['name'], namespace=data_1['namespace'], path=data_path_1)

{'namespace': 'experiment', 'table_name': 'mnist_host'}

In [35]:
reader_0 = Reader(name="reader_0")
reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=data_0)
reader_0.get_party_instance(role='host', party_id=host).component_param(table=data_1)

### DatasetParam

Use dataset_name to specify the module name of your dataset, followed by its parameters; these parameters are passed to the \_\_init\_\_ interface of your dataset. **Please note that your Dataset parameters must be JSON-serializable; otherwise the pipeline cannot parse them.**
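
One way to check the JSON requirement before building a DatasetParam is to try serializing the keyword arguments with the standard `json` module. `is_json_serializable` is a hypothetical helper written for this sketch, not a FATE function, and the `transform` parameter is an invented example of a value that would fail.

```python
import json


def is_json_serializable(params):
    """Hypothetical helper: True if json.dumps accepts the parameters."""
    try:
        json.dumps(params)
        return True
    except (TypeError, ValueError):
        return False


good_params = {"flatten_feature": True}   # plain bool: serializable
bad_params = {"transform": lambda x: x}   # a function: not serializable

print(is_json_serializable(good_params), is_json_serializable(bad_params))
```

Values such as functions or transform objects are better constructed inside the dataset's \_\_init\_\_ from simple flags, as `flatten_feature` does in this tutorial.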

In [36]:
from pipeline.component.nn import DatasetParam

dataset_param = DatasetParam(dataset_name='mnist_dataset', flatten_feature=True)  # specify dataset, and its init parameters

In [37]:
from pipeline.component.homo_nn import TrainerParam  # Interface

# our simple classification model:
model = t.nn.Sequential(
    t.nn.Linear(784, 32),
    t.nn.ReLU(),
    t.nn.Linear(32, 10),
    t.nn.Softmax(dim=1)
)

nn_component = HomoNN(name='nn_0',
                      model=model, # model
                      loss=t.nn.CrossEntropyLoss(),  # loss
                      optimizer=t.optim.Adam(model.parameters(), lr=0.01), # optimizer
                      dataset=dataset_param,  # dataset
                      trainer=TrainerParam(trainer_name='fedavg_trainer', epochs=2, batch_size=1024, validation_freqs=1),
                      torch_seed=100 # random seed
                      )

In [38]:
pipeline.add_component(reader_0)
pipeline.add_component(nn_component, data=Data(train_data=reader_0.output.data))
pipeline.add_component(Evaluation(name='eval_0', eval_type='multi'), data=Data(data=nn_component.output.data))

<pipeline.backend.pipeline.PipeLine at 0x7fbffe5ff700>

In [39]:
pipeline.compile()
pipeline.fit()

2022-12-19 16:26:20.771 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202212191626190908350
2022-12-19 16:26:20.805 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:00
2022-12-19 16:26:21.840 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:01
2022-12-19 16:26:22.865 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:02
2022-12-19 16:26:23.899 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:03

In [40]:
pipeline.get_component('nn_0').get_output_data()

| | id | label | predict_result | predict_score | predict_detail | type |
|---|---|---|---|---|---|---|
| 0 | img_1 | 0 | 0 | 0.9070178270339966 | {'0': 0.9070178270339966, '1': 0.0023874549660... | train |
| 1 | img_3 | 4 | 6 | 0.19601570069789886 | {'0': 0.19484134018421173, '1': 0.044997252523... | train |
| 2 | img_4 | 0 | 0 | 0.9618675112724304 | {'0': 0.9618675112724304, '1': 0.0010393995326... | train |
| 3 | img_5 | 0 | 0 | 0.33044907450675964 | {'0': 0.33044907450675964, '1': 0.033256266266... | train |
| 4 | img_6 | 7 | 7 | 0.3145765960216522 | {'0': 0.05851678550243378, '1': 0.075524508953... | train |
| ... | ... | ... | ... | ... | ... | ... |
| 1304 | img_32537 | 1 | 8 | 0.20599651336669922 | {'0': 0.080563984811306, '1': 0.12380836158990... | train |
| 1305 | img_32558 | 1 | 8 | 0.20311488211154938 | {'0': 0.07224143296480179, '1': 0.130610913038... | train |
| 1306 | img_32563 | 1 | 8 | 0.2071550488471985 | {'0': 0.06843454390764236, '1': 0.129064396023... | train |
| 1307 | img_32565 | 1 | 5 | 0.29367145895957947 | {'0': 0.05658009275794029, '1': 0.086584843695... | train |


In [41]:
pipeline.get_component('nn_0').get_summary()

{'best_epoch': 1,
 'loss_history': [3.58235876026547, 3.4448592824914055],
 'metrics_summary': {'train': {'accuracy': [0.25668449197860965,
    0.4950343773873186],
   'precision': [0.3708616690797323, 0.5928620913124757],
   'recall': [0.21817632850241547, 0.4855654369784805]}},
 'need_stop': False}