# Using FATE Built-In Dataset

FATE-1.10 provides three built-in datasets, table, nlp_tokenizer and image, which cover the basic needs of tabular, text and image data.

## TableDataset

TableDataset is provided in [table.py](../../../../python/federatedml/nn/dataset/table.py). It processes data in csv format and automatically parses the sample ids and labels from it. The excerpt below shows the class docstring and constructor parameters:

In [2]:
class TableDataset(Dataset):

    """
    A Table Dataset, load data from a given csv path, or transform FATE DTable

    Parameters
    ----------
    label_col: str, name of label column in csv, if None, will automatically take 'y' or 'label' or 'target' as label
    feature_dtype: dtype of feature, supports int, long, float, double
    label_dtype: dtype of label, supports int, long, float, double
    label_shape: list or tuple, the shape of label
    flatten_label: bool, flatten extracted label column or not, default is False
    """

    def __init__(
            self,
            label_col=None,
            feature_dtype='float',
            label_dtype='float',
            label_shape=None,
            flatten_label=False):
        ...  # body omitted in this excerpt
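
To make the label handling described in the docstring more concrete, here is a minimal, standard-library sketch of how a csv could be split into ids, features and labels. It is not FATE's implementation: the function name `split_table` and the `id_col="id"` default are assumptions made for illustration, and only the candidate label names come from the docstring.

```python
import csv
import io

# Candidate label names taken from the TableDataset docstring.
LABEL_CANDIDATES = ("y", "label", "target")


def split_table(csv_text, label_col=None, id_col="id"):
    """Split CSV text into ids, feature rows and labels.

    `id_col="id"` is an assumption made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if label_col is None:
        # Fall back to the first candidate name present in the header.
        label_col = next((c for c in LABEL_CANDIDATES if c in columns), None)
    feature_cols = [c for c in columns if c not in (id_col, label_col)]

    ids, features, labels = [], [], []
    for row in reader:
        ids.append(row[id_col])
        features.append([float(row[c]) for c in feature_cols])
        if label_col is not None:
            labels.append(float(row[label_col]))
    return ids, features, labels


sample = "id,y,x0,x1\na,1,0.5,2.0\nb,0,1.5,3.0\n"
ids, features, labels = split_table(sample)
print(ids, features, labels)
```

Passing `label_col` explicitly skips the candidate search, which mirrors the role of the `label_col` parameter above.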

## TokenizerDataset

TokenizerDataset is provided in [nlp_tokenizer.py](../../../../python/federatedml/nn/dataset/nlp_tokenizer.py). It is built on the BertTokenizer from the transformers library: it reads strings from a csv file, automatically tokenizes the text and converts it into word ids.

In [None]:
class TokenizerDataset(Dataset):
    """
    A Dataset for some basic NLP Tasks, this dataset will automatically transform raw text into word indices
    using BertTokenizer from transformers library,
    see https://huggingface.co/docs/transformers/model_doc/bert?highlight=berttokenizer for details of BertTokenizer

    Parameters
    ----------
    truncation: bool, truncate word sequence to 'text_max_length'
    text_max_length: int, max length of word sequences
    tokenizer_name_or_path: str, name of bert tokenizer (see transformers official for details) or path to local
                            transformer tokenizer folder
    return_label: bool, return label or not, this option is for host dataset, when running hetero-NN
    """

    def __init__(self, truncation=True, text_max_length=128,
                 tokenizer_name_or_path="bert-base-uncased",
                 return_label=True):
        ...  # body omitted in this excerpt
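
The effect of `truncation` and `text_max_length` can be illustrated with a toy example. The sketch below is only a stand-in: it splits on whitespace and looks words up in a hypothetical vocabulary, whereas BertTokenizer uses WordPiece and adds special tokens. The `encode` helper, the `unk_id`/`pad_id` values and the padding step are assumptions for illustration, not the behaviour of TokenizerDataset.

```python
def encode(text, vocab, text_max_length=128, truncation=True,
           unk_id=1, pad_id=0):
    """Map whitespace-separated words to ids, then truncate and pad.

    Toy stand-in only: no WordPiece, no special tokens.
    """
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    if truncation:
        # Keep at most text_max_length ids.
        ids = ids[:text_max_length]
    # Pad short sequences so every sample has the same length.
    ids += [pad_id] * (text_max_length - len(ids))
    return ids


# Hypothetical vocabulary; a real tokenizer ships its own.
toy_vocab = {"federated": 2, "learning": 3, "is": 4, "fun": 5}
word_ids = encode("Federated learning is really fun", toy_vocab, text_max_length=4)
print(word_ids)  # [2, 3, 4, 1]
```

The word "really" is not in the toy vocabulary, so it maps to `unk_id`, and the trailing "fun" is cut off by truncation.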

## ImageDataset

ImageDataset is provided in [image.py](../../../../python/federatedml/nn/dataset/image.py) and offers simple processing of image data. It is built on torchvision's ImageFolder. Its parameters are listed below:

In [None]:
class ImageDataset(Dataset):

    """
    A basic Image Dataset built on pytorch ImageFolder, supports simple image transform
    Given a folder path, ImageDataset will load images from this folder, images in this
    folder need to be organized in a Torch-ImageFolder format, see
    https://pytorch.org/vision/main/generated/torchvision.datasets.ImageFolder.html for details.

    Image name will be automatically taken as the sample id.

    Parameters
    ----------
    center_crop: bool, use center crop transformer
    center_crop_shape: tuple or list
    generate_id_from_file_name: bool, whether to take image name as sample id
    file_suffix: str, default is '.jpg', if generate_id_from_file_name is True, will remove this suffix from file name,
                 result will be the sample id
    return_label: bool, return label or not, this option is for host dataset, when running hetero-NN
    float64: bool, returned image tensors will be transformed to double precision
    label_dtype: str, long, float, or double, the dtype of return label
    """

    def __init__(self, center_crop=False, center_crop_shape=None,
                 generate_id_from_file_name=True, file_suffix='.jpg',
                 return_label=True, float64=False, label_dtype='long'):
        ...  # body omitted in this excerpt
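
The ImageFolder layout and the role of `file_suffix` may be easier to see in a small example. The sketch below uses only the standard library and empty placeholder files; `list_samples` is a hypothetical helper, not part of FATE or torchvision. It relies on one known ImageFolder convention: class folder names are sorted and numbered from 0.

```python
import tempfile
from pathlib import Path


def list_samples(root, file_suffix=".jpg"):
    """Return (sample_id, label_index) pairs for an ImageFolder-style tree."""
    root = Path(root)
    # ImageFolder sorts class folder names and numbers them from 0.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label_index, class_name in enumerate(classes):
        for image_path in sorted((root / class_name).glob("*" + file_suffix)):
            # Drop the suffix from the file name to obtain the sample id.
            sample_id = image_path.name[: -len(file_suffix)]
            samples.append((sample_id, label_index))
    return samples


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one sub-folder per class, empty files as images.
    for class_name, file_name in [("0", "img_1.jpg"), ("7", "img_2.jpg")]:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        (class_dir / file_name).touch()
    samples = list_samples(tmp)

print(samples)  # [('img_1', 0), ('img_2', 1)]
```

Note that the folder named `7` receives label index 1 here because indices follow the sorted order of the folders that exist, not the folder names themselves.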

## Using a Built-In Dataset

Using a built-in FATE dataset works exactly the same way as using a user-customized dataset. As an example, we use the image dataset and a new model with conv layers to solve the MNIST handwritten digit recognition task again.

If you don't have the MNIST dataset, you can refer to the previous tutorial to download it:
 - [Customize your Dataset](Homo-NN-Customize-your-Dataset.ipynb)

In [1]:
from federatedml.nn.dataset.image import ImageDataset

In [2]:
! ls ../../../../examples/data/mnist/

In [5]:
dataset = ImageDataset()
dataset.load('../../../../examples/data/mnist/') 

In [6]:
len(dataset)

1309

In [7]:
dataset[400] 

(tensor([[[0.0000, 0.0275, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0118, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 ...,
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
 
 [[0.0000, 0.0275, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0118, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 ...,
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
 
 [[0.0000, 0.0275, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0118, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 ...,
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 [0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
 ... (output truncated)

In [35]:
import torch as t
from torch import nn
from pipeline.component.nn.backend.torch.operation import Flatten

# a new model with conv layers that works with our ImageDataset
model = t.nn.Sequential(
 nn.Conv2d(in_channels=3, out_channels=12, kernel_size=5),
 nn.MaxPool2d(kernel_size=3),
 nn.Conv2d(in_channels=12, out_channels=12, kernel_size=3),
 nn.AvgPool2d(kernel_size=3),
 Flatten(start_dim=1),
 nn.Linear(48, 32),
 nn.ReLU(),
 nn.Linear(32, 10),
 nn.Softmax(dim=1)
 )
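
The input size of the first linear layer, 48, follows from the layer shapes. The arithmetic below is a quick check that assumes 28x28 input images (the usual MNIST size, not stated in this tutorial) and PyTorch's defaults of stride 1 for convolutions and stride equal to the kernel size for pooling.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Output length of one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                       # assumed MNIST image side length
size = out_size(size, kernel_size=5)            # Conv2d(3, 12, 5)   -> 24
size = out_size(size, kernel_size=3, stride=3)  # MaxPool2d(3)       -> 8
size = out_size(size, kernel_size=3)            # Conv2d(12, 12, 3)  -> 6
size = out_size(size, kernel_size=3, stride=3)  # AvgPool2d(3)       -> 2

flattened = 12 * size * size  # 12 output channels after the last conv
print(flattened)  # 48
```

If your images have a different size, the same helper gives the value to use in place of 48.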


## Local Test

**In local testing, all federation processes are skipped and the model does not perform federated averaging.**
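
For context, federated averaging (the step that local mode skips) combines the parameters trained by each party into one set of weights. The sketch below shows the standard sample-weighted form on plain Python lists; `fed_avg` and the numbers are hypothetical, and FATE's own aggregator may differ in its details.

```python
def fed_avg(client_weights, client_sample_counts):
    """Sample-weighted average of per-client parameter vectors."""
    total = sum(client_sample_counts)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sample_counts)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients with a 3-parameter model each.
guest_weights = [1.0, 2.0, 3.0]
host_weights = [3.0, 4.0, 5.0]
averaged = fed_avg([guest_weights, host_weights], [100, 300])
print(averaged)  # [2.5, 3.5, 4.5]
```

The host holds three times as many samples in this example, so the average sits closer to its weights.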

In [36]:
from federatedml.nn.homo.trainer.fedavg_trainer import FedAVGTrainer
trainer = FedAVGTrainer(epochs=5, batch_size=256, shuffle=True, data_loader_worker=8, pin_memory=False) # trainer parameters
trainer.set_model(model)

In [37]:
trainer.local_mode() 

In [38]:
optimizer = t.optim.Adam(model.parameters(), lr=0.01)
loss = t.nn.CrossEntropyLoss()
trainer.train(train_set=dataset, optimizer=optimizer, loss=loss)

epoch is 0
100%|██████████| 6/6 [00:00<00:00, 7.49it/s]
epoch loss is 2.6923995983336515
epoch is 1
100%|██████████| 6/6 [00:00<00:00, 7.78it/s]
epoch loss is 2.636708398735915
epoch is 2
100%|██████████| 6/6 [00:00<00:00, 7.75it/s]
epoch loss is 2.4953262410699364
epoch is 3
100%|██████████| 6/6 [00:00<00:00, 7.79it/s]
epoch loss is 2.3616474521715647
epoch is 4
100%|██████████| 6/6 [00:00<00:00, 8.26it/s]
epoch loss is 2.2441106669496635


The model trains locally, so we are ready to move on to the federated task!

## A Homo-NN Task with Built-in Dataset

In [27]:
import torch as t
from torch import nn
from pipeline import fate_torch_hook
from pipeline.component import HomoNN
from pipeline.backend.pipeline import PipeLine
from pipeline.component import Reader, Evaluation, DataTransform
from pipeline.interface import Data, Model

t = fate_torch_hook(t)


In [28]:
import os
# bind data path to name & namespace
fate_project_path = os.path.abspath('../../../../')
host = 10000
guest = 9999
arbiter = 10000
pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host,
 arbiter=arbiter)

data_0 = {"name": "mnist_guest", "namespace": "experiment"}
data_1 = {"name": "mnist_host", "namespace": "experiment"}

data_path_0 = fate_project_path + '/examples/data/mnist'
data_path_1 = fate_project_path + '/examples/data/mnist'
pipeline.bind_table(name=data_0['name'], namespace=data_0['namespace'], path=data_path_0)
pipeline.bind_table(name=data_1['name'], namespace=data_1['namespace'], path=data_path_1)

{'namespace': 'experiment', 'table_name': 'mnist_host'}

In [29]:
# define the reader
reader_0 = Reader(name="reader_0")
reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=data_0)
reader_0.get_party_instance(role='host', party_id=host).component_param(table=data_1)

In [39]:
from pipeline.component.homo_nn import DatasetParam, TrainerParam 

# the same conv model as in the local test, redefined after fate_torch_hook
model = t.nn.Sequential(
 nn.Conv2d(in_channels=3, out_channels=12, kernel_size=5),
 nn.MaxPool2d(kernel_size=3),
 nn.Conv2d(in_channels=12, out_channels=12, kernel_size=3),
 nn.AvgPool2d(kernel_size=3),
 Flatten(start_dim=1),
 nn.Linear(48, 32),
 nn.ReLU(),
 nn.Linear(32, 10),
 nn.Softmax(dim=1)
 )

nn_component = HomoNN(name='nn_0',
 model=model, # model
 loss=t.nn.CrossEntropyLoss(), # loss
 optimizer=t.optim.Adam(model.parameters(), lr=0.01), # optimizer
 dataset=DatasetParam(dataset_name='image', label_dtype='long'), # dataset
 trainer=TrainerParam(trainer_name='fedavg_trainer', epochs=2, batch_size=1024, validation_freqs=1),
 torch_seed=100 # random seed
 )

In [40]:
pipeline.add_component(reader_0)
pipeline.add_component(nn_component, data=Data(train_data=reader_0.output.data))
pipeline.add_component(Evaluation(name='eval_0', eval_type='multi'), data=Data(data=nn_component.output.data))



In [46]:
pipeline.compile()
pipeline.fit()

2022-12-19 17:31:15.709 | INFO | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202212191731149354320
2022-12-19 17:31:15.732 | INFO | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:00
2022-12-19 17:31:16.813 | INFO | pipeline.utils.invoker.job_submitter:monitor_job_status:125 -
2022-12-19 17:31:16.815 | INFO | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component reader_0, time elapse: 0:00:01
2022-12-19 17:31:17.847 | INFO | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component reader_0, time elapse: 0:00:02
[32m2022-12-19 17:31:18.874[0m | [1mINFO [0m | [36