# Hetero-NN Customize your Dataset

The FATE system primarily supports tabular data as its standard data format. However, neural networks can also use non-tabular data, such as images, text, mixed data, or relational data, through the Dataset feature of the NN module. This feature lets users customize datasets for more complex data scenarios. This tutorial covers the use of the Dataset feature in Hetero-NN. For ease of demonstration, we will use the MNIST handwritten digit dataset to simulate a hetero federated task.

## Prepare MNIST Data

Please download the guest and host MNIST datasets from the links below and place them in the project's examples/data folder:

- guest data: https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/examples/data/mnist_guest.zip

- host data: https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/examples/data/mnist_host.zip
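
The download-and-unpack step can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names `download` and `extract_zip` are hypothetical (they are not part of FATE), and the internal layout of the archives is not stated here, so check the names that `extract_zip` returns before relying on them.

```python
import urllib.request
import zipfile
from pathlib import Path

GUEST_URL = "https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/examples/data/mnist_guest.zip"
HOST_URL = "https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/examples/data/mnist_host.zip"


def download(url, zip_path):
    """Download url to zip_path (hypothetical helper, not part of FATE)."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract_zip(zip_path, dest_dir):
    """Unpack zip_path into dest_dir and return the top-level names in the archive."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        top_level = sorted({name.split("/")[0] for name in zf.namelist() if name})
    return top_level
```

From the project root, a call such as `extract_zip(download(GUEST_URL, "examples/data/mnist_guest.zip"), "examples/data")` would fetch and unpack the guest archive; the host archive works the same way with `HOST_URL`.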

mnist_guest is a simplified version of the MNIST dataset with ten categories; the images are sorted into ten folders, named 0 to 9, according to their labels. mnist_host contains the same images as mnist_guest, but without labels.

In [3]:
! ls ../../../../examples/data/mnist_guest

0  1  2  3  4  5  6  7	8  9


In [4]:
! ls ../../../../examples/data/mnist_host

not_labeled
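
To double-check the layout before training, one could count the images in each sub-folder. This is a small sketch using only the standard library; `count_images_per_folder` is a hypothetical helper, and the `.jpg` extension is an assumption taken from the dataset code later in this tutorial.

```python
from pathlib import Path


def count_images_per_folder(root, suffix=".jpg"):
    """Return {sub-folder name: number of files ending in suffix} for root."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1 for f in sub.iterdir()
                if f.is_file() and f.suffix.lower() == suffix
            )
    return counts
```

For mnist_guest the keys should be the ten folder names '0' to '9'; for mnist_host there should be a single key, 'not_labeled'.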


## Dataset

FATE 1.10 introduces a new base class for datasets called Dataset, which is based on PyTorch's Dataset class. This class allows users to create custom datasets according to their specific needs. The usage is similar to that of PyTorch's Dataset class, with the added requirement of implementing three additional interfaces when using FATE-NN for data reading and training: load(), get_sample_ids(), and get_classes().

To create a custom dataset in Hetero-NN, users need to:

- Develop a new dataset class that inherits from the Dataset class
- Implement the \_\_len\_\_() and \_\_getitem\_\_() methods, consistent with PyTorch's Dataset usage. \_\_len\_\_() should return the length of the dataset, and \_\_getitem\_\_() should return the data at the specified index. **Note, however, that \_\_getitem\_\_() behaves differently depending on the party. In the guest party (the party with labels), \_\_getitem\_\_() returns features and labels, while in the host parties (parties without labels), \_\_getitem\_\_() returns features only.**
- Implement the load(), get_sample_ids(), and get_classes() methods
  
For those unfamiliar with PyTorch's Dataset class, more information can be found in the PyTorch documentation: [PyTorch Dataset Documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

### load()

The first additional interface required is load(). This interface receives a file path and allows users to read data directly from the local file system. When submitting a task, the data path can be specified through the reader component. Hetero-NN will use the user-specified Dataset class, utilizing the load() interface to read data from the specified path and complete the loading of the dataset for training. For more information, please refer to the source code in [/federatedml/nn/dataset/base.py](../../../../python/federatedml/nn/dataset/base.py).

### get_sample_ids()

The second additional interface is get_sample_ids(). This interface should return a list of sample IDs, which can be either integers or strings and should have the same length as the dataset. This function is important in Hetero-NN; keep the following points in mind:

- **When using a customized dataset in Hetero-NN, it is important to ensure that your sample IDs are aligned with those of other parties. You can do this by using the intersection component and extracting the results, or by agreeing on the sample IDs to be used with the other parties.**
- **You don't have to put your IDs in order; the Hetero-NN component will sort them.**
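
As an illustration of what "aligned" means here, the sketch below compares two ID lists in plain Python. `align_sample_ids` is a hypothetical helper, not a FATE function, and it compares IDs in the clear, so it only shows the goal; the intersection component is the appropriate tool when parties must not reveal their IDs to each other.

```python
def align_sample_ids(guest_ids, host_ids):
    """Return (common, guest_only, host_only), each as a sorted list of IDs."""
    guest_ids, host_ids = list(guest_ids), list(host_ids)
    guest_set, host_set = set(guest_ids), set(host_ids)
    if len(guest_set) != len(guest_ids) or len(host_set) != len(host_ids):
        raise ValueError("sample IDs must be unique within each party")
    common = sorted(guest_set & host_set)
    return common, sorted(guest_set - host_set), sorted(host_set - guest_set)


common, guest_only, host_only = align_sample_ids(
    ["img_3", "img_1", "img_4"],  # guest IDs, in arbitrary order
    ["img_1", "img_5", "img_3"],  # host IDs, in arbitrary order
)
print(common)      # ['img_1', 'img_3']
print(guest_only)  # ['img_4']
print(host_only)   # ['img_5']
```

If `guest_only` or `host_only` is non-empty, each party would need to drop those samples from its dataset before training.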

### get_classes()

The third interface, get_classes(), returns a list of all unique labels. It is called in the guest party. If the task is not a classification task, just return an empty list.
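
The five methods can be seen together in a framework-free sketch. `ToyDataset` below is hypothetical and deliberately does not inherit from FATE's Dataset class, so that it runs without FATE or PyTorch installed; a real implementation would subclass `federatedml.nn.dataset.base.Dataset`. It reads a CSV file, and the column names `id` and `label` are assumptions of this sketch only.

```python
import csv


class ToyDataset:
    """Framework-free stand-in that mirrors the interface described above."""

    def __init__(self, return_label=True):
        self.return_label = return_label  # True on the guest, False on hosts
        self.ids, self.features, self.labels = [], [], []

    def load(self, path):
        """Read a CSV with an 'id' column and, on the guest, a 'label' column."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                self.ids.append(row.pop("id"))
                if self.return_label:
                    self.labels.append(int(row.pop("label")))
                # every remaining column is treated as a numeric feature
                self.features.append([float(v) for v in row.values()])
        return self

    def get_sample_ids(self):
        return self.ids

    def get_classes(self):
        # unique labels on the guest; an empty list when no labels were loaded
        return sorted(set(self.labels))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        if self.return_label:
            return self.features[idx], self.labels[idx]  # guest: features and label
        return self.features[idx]  # host: features only
```

The same class serves both roles: the guest constructs it with `return_label=True` and gets `(features, label)` pairs, while a host constructs it with `return_label=False` and gets features only.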

## Example: Implement a simple image dataset

To better understand Dataset customization, we implement a simple image dataset that reads MNIST images, and then complete a federated image classification task in a vertical (hetero) setting.
For convenience, we use the Jupyter magic save_to_fate to save the code to federatedml.nn.dataset as mnist_dataset.py. You can also copy the code file to that directory manually.

- This dataset has a parameter 'return_label'. When the guest party (the party with labels) uses it, we set return_label=True; otherwise, we set return_label=False.
- It is built on ImageFolder, and we take the image name as the sample ID.
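
The dataset code in this section derives the ID by splitting the file path on '/' and removing '.jpg'. As a sketch of a slightly more portable alternative (not what the tutorial code uses), the hypothetical helper below relies on `os.path`, so it handles the platform's path separator and any file extension.

```python
import os


def sample_id_from_path(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


print(sample_id_from_path("examples/data/mnist_guest/0/img_1.jpg"))  # img_1
```

Whichever approach is used, both parties must derive IDs the same way so that the same image gets the same ID on the guest and on the host.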

In [6]:
from pipeline.component.nn import save_to_fate

In [7]:
%%save_to_fate dataset mnist_dataset.py
import numpy as np
from federatedml.nn.dataset.base import Dataset
from torchvision.datasets import ImageFolder
from torchvision import transforms

class MNISTDataset(Dataset):

    def __init__(self, return_label=True):
        super(MNISTDataset, self).__init__()
        self.return_label = return_label  # True for guest (has labels), False for host
        self.image_folder = None
        self.ids = None

    def load(self, path):
        # ImageFolder treats each sub-folder of path as one class
        self.image_folder = ImageFolder(root=path, transform=transforms.Compose([transforms.ToTensor()]))
        # use the image file name (without '.jpg') as the sample id
        ids = []
        for image_name in self.image_folder.imgs:
            ids.append(image_name[0].split('/')[-1].replace('.jpg', ''))
        self.ids = ids

        return self

    def get_sample_ids(self):
        return self.ids

    def get_classes(self):
        return np.unique(self.image_folder.targets).tolist()

    def __len__(self):
        return len(self.image_folder)

    def __getitem__(self, idx):
        ret = self.image_folder[idx]  # (image tensor, label)
        img = ret[0][0].flatten()  # take the first channel and flatten it to 784 dims
        if self.return_label:
            return img, ret[1]  # img & label
        else:
            return img  # no label, for host

Now we test our dataset class:

In [10]:
# guest party
! ls ../../../../examples/data/mnist_guest
ds = MNISTDataset().load('../../../../examples/data/mnist_guest/')
print(len(ds))
print(ds[0][0]) 
print(ds.get_classes())
print(ds.get_sample_ids()[0: 10])

0  1  2  3  4  5  6  7	8  9
1309
tensor([0.0118, 0.0000, 0.0000, 0.0118, 0.0275, 0.0118, 0.0000, 0.0118, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0118, 0.0000, 0.0000, 0.0118, 0.0314, 0.0000,
        0.0000, 0.0118, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0196, 0.0000, 0.0471, 0.0000, 0.0627, 0.0000, 0.0000, 0.0157, 0.0000,
        0.0078, 0.0314, 0.0118, 0.0000, 0.0157, 0.0314, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0078, 0.0039, 0.0471, 0.0000, 0.0314, 0.0000, 0.0000, 0.0235, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0235, 0.0275, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0118, 0.0000, 0.0000, 0.0078,
        0.0118, 0.0000, 0.0000, 0.0000, 0.0471, 0.0000, 0.0000, 0.0902, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0431, 0.0118, 0.0000, 0.0000, 0.0157, 0.0000,
       

In [11]:
# host party
! ls ../../../../examples/data/mnist_host  # no label
ds = MNISTDataset(return_label=False).load('../../../../examples/data/mnist_host')
print(len(ds))
print(ds[0]) # no label

not_labeled
1309
tensor([0.0118, 0.0000, 0.0000, 0.0118, 0.0275, 0.0118, 0.0000, 0.0118, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0118, 0.0000, 0.0000, 0.0118, 0.0314, 0.0000,
        0.0000, 0.0118, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0196, 0.0000, 0.0471, 0.0000, 0.0627, 0.0000, 0.0000, 0.0157, 0.0000,
        0.0078, 0.0314, 0.0118, 0.0000, 0.0157, 0.0314, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0078, 0.0000, 0.0000, 0.0000, 0.0039,
        0.0078, 0.0039, 0.0471, 0.0000, 0.0314, 0.0000, 0.0000, 0.0235, 0.0000,
        0.0431, 0.0000, 0.0000, 0.0235, 0.0275, 0.0078, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0118, 0.0000, 0.0000, 0.0078,
        0.0118, 0.0000, 0.0000, 0.0000, 0.0471, 0.0000, 0.0000, 0.0902, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0431, 0.0118, 0.0000, 0.0000, 0.0157, 0.0000,
        0.0000, 0.0000,

Good! The dataset is ready to use. Next, we use it to run a Hetero-NN model: the two parties use the mnist_guest and mnist_host datasets, aligned by ID, to conduct hetero federated training.

As in Homo-NN (see [Customize your Dataset](Homo-NN-Customize-your-Dataset.ipynb)), we do not follow the usual FATE component workflow here. Instead, we bind the data path to a FATE name and namespace and pass it to the Hetero-NN component through the reader. Hetero-NN imports your custom dataset class according to the DatasetParam you set and then reads the data from the path.

## Pipeline Initialization

Here we define the pipeline that runs the hetero task.

In [21]:
import os
import torch as t
from torch import nn
from pipeline import fate_torch_hook
from pipeline.component import HeteroNN
from pipeline.component.hetero_nn import DatasetParam
from pipeline.backend.pipeline import PipeLine
from pipeline.component import Reader, Evaluation, DataTransform
from pipeline.interface import Data, Model
from pipeline.component.nn import save_to_fate

fate_torch_hook(t)

# bind path to fate name&namespace
fate_project_path = os.path.abspath('../../../../')
guest = 10000
host = 9999

pipeline_img = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

guest_data = {"name": "mnist_guest", "namespace": "experiment"}
host_data = {"name": "mnist_host", "namespace": "experiment"}

guest_data_path = fate_project_path + '/examples/data/mnist_guest/'
host_data_path = fate_project_path + '/examples/data/mnist_host/'
pipeline_img.bind_table(name='mnist_guest', namespace='experiment', path=guest_data_path)
pipeline_img.bind_table(name='mnist_host', namespace='experiment', path=host_data_path)

{'namespace': 'experiment', 'table_name': 'mnist_host'}

In [22]:
guest_data = {"name": "mnist_guest", "namespace": "experiment"}
host_data = {"name": "mnist_host", "namespace": "experiment"}
reader_0 = Reader(name="reader_0")
reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_data)
reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_data)

In [23]:
hetero_nn_0 = HeteroNN(name="hetero_nn_0", epochs=3,
                       interactive_layer_lr=0.01, batch_size=512, task_type='classification', seed=100
                       )
guest_nn_0 = hetero_nn_0.get_party_instance(role='guest', party_id=guest)
host_nn_0 = hetero_nn_0.get_party_instance(role='host', party_id=host)

# define model
# image features 784, guest bottom model
# our simple classification model:
guest_bottom = t.nn.Sequential(
    t.nn.Linear(784, 8),
    t.nn.ReLU()
)

# image features 784, host bottom model
host_bottom = t.nn.Sequential(
    t.nn.Linear(784, 8),
    t.nn.ReLU()
)

# Top Model, a classifier
guest_top = t.nn.Sequential(
    nn.Linear(8, 10),
    nn.Softmax(dim=1)
)

# interactive layer define
interactive_layer = t.nn.InteractiveLayer(out_dim=8, guest_dim=8, host_dim=8)

# add models
guest_nn_0.add_top_model(guest_top)
guest_nn_0.add_bottom_model(guest_bottom)
host_nn_0.add_bottom_model(host_bottom)

# opt, loss
optimizer = t.optim.Adam(lr=0.01) 
loss = t.nn.CrossEntropyLoss()

# use DatasetParam to specify dataset and pass parameters
guest_nn_0.add_dataset(DatasetParam(dataset_name='mnist_dataset', return_label=True))
host_nn_0.add_dataset(DatasetParam(dataset_name='mnist_dataset', return_label=False))

hetero_nn_0.set_interactive_layer(interactive_layer)
hetero_nn_0.compile(optimizer=optimizer, loss=loss)
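
The layer widths in the cell above have to fit together. The sketch below is a hypothetical plain-Python check, not a FATE API. It assumes, as the numbers in the cell suggest, that `guest_dim` and `host_dim` are meant to equal the output widths of the two bottom models and that `out_dim` is meant to equal the input width of the top model.

```python
def check_hetero_nn_dims(guest_bottom_out, host_bottom_out,
                         guest_dim, host_dim, out_dim, top_in):
    """Return a list of mismatch messages; an empty list means all widths agree."""
    problems = []
    if guest_bottom_out != guest_dim:
        problems.append(
            f"guest bottom outputs {guest_bottom_out} but guest_dim is {guest_dim}")
    if host_bottom_out != host_dim:
        problems.append(
            f"host bottom outputs {host_bottom_out} but host_dim is {host_dim}")
    if out_dim != top_in:
        problems.append(
            f"interactive layer outputs {out_dim} but top model expects {top_in}")
    return problems


# Widths used in the cell above: Linear(784, 8) bottoms, Linear(8, 10) top.
print(check_hetero_nn_dims(8, 8, guest_dim=8, host_dim=8, out_dim=8, top_in=8))  # []
```

Changing one bottom model to, say, 16 output units without updating the interactive layer would show up as a single message in the returned list.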

In [24]:
pipeline_img.add_component(reader_0)
pipeline_img.add_component(hetero_nn_0, data=Data(train_data=reader_0.output.data))
pipeline_img.add_component(Evaluation(name='eval_0', eval_type='multi'), data=Data(data=hetero_nn_0.output.data))
pipeline_img.compile()

<pipeline.backend.pipeline.PipeLine at 0x7fae4e81f460>

In [25]:
pipeline_img.fit()

2022-12-24 17:37:26.846 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202212241737239009640
2022-12-24 17:37:26.858 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:00
2022-12-24 17:37:27.906 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:125 -
2022-12-24 17:37:27.908 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component reader_0, time elapse: 0:00:01
2022-12-24 17:37:28.931 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component reader_0, time elapse: 0:00:02
2022-12-24 17:37:29.954 | I

In [26]:
pipeline_img.get_component('hetero_nn_0').get_output_data()  # get result

|  | id | label | predict_result | predict_score | predict_detail | type |
|---|---|---|---|---|---|---|
| 0 | img_1 | 0 | 8 | 0.2643284201622009 | {'0': 0.1221427395939827, '1': 0.0131008885800... | train |
| 1 | img_3 | 4 | 7 | 0.28708773851394653 | {'0': 0.020947180688381195, '1': 0.10722759366... | train |
| 2 | img_4 | 0 | 8 | 0.2315242737531662 | {'0': 0.1916944831609726, '1': 0.0062320181168... | train |
| 3 | img_5 | 0 | 8 | 0.2495078295469284 | {'0': 0.04027773439884186, '1': 0.039246417582... | train |
| 4 | img_6 | 7 | 7 | 0.5058534145355225 | {'0': 0.011092708446085453, '1': 0.06033106520... | train |
| ... | ... | ... | ... | ... | ... | ... |
| 1304 | img_32537 | 1 | 7 | 0.228757843375206 | {'0': 0.02710532397031784, '1': 0.178657308220... | train |
| 1305 | img_32558 | 1 | 7 | 0.22664928436279297 | {'0': 0.03342469036579132, '1': 0.192812800407... | train |
| 1306 | img_32563 | 1 | 7 | 0.2404891550540924 | {'0': 0.02606056071817875, '1': 0.177977174520... | train |
| 1307 | img_32565 | 1 | 7 | 0.44030535221099854 | {'0': 0.014257671311497688, '1': 0.11480070650... | train |
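
Each row of the output pairs the true `label` with `predict_result`, and `predict_detail` holds per-class scores. As a sketch of how these columns can be read, the hypothetical helpers below compute accuracy and pick the top-scoring class from a `predict_detail` dictionary. The label and prediction lists are copied from the first five rows shown above; the `predict_detail` dictionary is a made-up illustration (the real ones are truncated in the output), and the idea that `predict_result` is the highest-scoring class in `predict_detail` is an assumption inferred from the column names.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


def top_class(predict_detail):
    """Return (class, score) for the highest-scoring entry of predict_detail."""
    best = max(predict_detail, key=predict_detail.get)
    return best, predict_detail[best]


# label and predict_result of the first five rows shown above
labels = [0, 4, 0, 0, 7]
predictions = [8, 7, 8, 8, 7]
print(accuracy(labels, predictions))  # 0.2

# made-up scores, for illustration only
print(top_class({'0': 0.1, '1': 0.7, '2': 0.2}))  # ('1', 0.7)
```

Five rows are far too few to judge the model; the same function could be applied to the full `label` and `predict_result` columns of the returned table.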
