Each Component wraps a FederatedML Module. Modules implement machine learning algorithms for federated learning, while Components provide a convenient interface for easy model building.
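As a quick illustration, here is a minimal sketch that instantiates a Component wrapping the hetero logistic regression Module (the parameter value is illustrative):
from pipeline.component import HeteroLR
# each component instance has a unique name, which identifies it within a job
hetero_lr_0 = HeteroLR(name="hetero_lr_0", max_iter=20)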
Input encapsulates all upstream input to a component in a job workflow. There are three classes of input: data, cache, and model. Not all components have all three classes of input, and a component may accept only some types within a class. Note that only Intersection may have cache input. For information on each component's input, check the list.
Here is an example of accessing a component's input:
from pipeline.component import DataTransform
data_transform_0 = DataTransform(name="data_transform_0")
input_all = data_transform_0.input
input_data = data_transform_0.input.data
input_model = data_transform_0.input.model
Same as Input, Output encapsulates the output data, cache, and model of a component in a FATE job. Not all components have all classes of output. Note that only Intersection may have cache output. For information on each component's output, check the list.
Here is an example of accessing a component's output:
from pipeline.component import DataTransform
data_transform_0 = DataTransform(name="data_transform_0")
output_all = data_transform_0.output
output_data = data_transform_0.output.data
output_model = data_transform_0.output.model
To download a component's output table or model, use the task info interface described at the end of this section.
In most cases, data sets are wrapped into data when being passed between modules. For instance, in the mini demo, the data output of data_transform_0 is set as the data input to intersection_0.
pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
For data sets used in different modeling stages (e.g., train & validate) of the same component, the additional keywords train_data and validate_data are used to distinguish them. Also from the mini demo, results from intersection_0 and intersection_1 are set as the train and validate data of hetero logistic regression, respectively.
pipeline.add_component(hetero_lr_0, data=Data(train_data=intersection_0.output.data,
validate_data=intersection_1.output.data))
Another case of using the keywords train_data and validate_data is to use the data output from the DataSplit module, which always has three data outputs: train_data, validate_data, and test_data.
pipeline.add_component(hetero_lr_0,
data=Data(train_data=hetero_data_split_0.output.data.train_data))
A special data type is predict_input. predict_input is only used for specifying data input when running a prediction task.
Here is an example of running prediction with an upstream model within the same pipeline:
pipeline.add_component(hetero_lr_1,
data=Data(predict_input=hetero_data_split_0.output.data.test_data),
model=Model(model=hetero_lr_0))
To run prediction with new data, the data source needs to be updated in the prediction job. Below is an example from the mini demo, where the data input of the original data_transform_0 component is set to the data output from reader_2.
reader_2 = Reader(name="reader_2")
reader_2.get_party_instance(role="guest", party_id=guest).component_param(table=guest_eval_data)
reader_2.get_party_instance(role="host", party_id=host).component_param(table=host_eval_data)
# add data reader onto predict pipeline
predict_pipeline.add_component(reader_2)
predict_pipeline.add_component(pipeline,
data=Data(predict_input={pipeline.data_transform_0.input.data: reader_2.output.data}))
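For context, predict_pipeline above is typically built around a fitted training pipeline; here is a minimal sketch of the surrounding steps, assuming pipeline has been fitted (the deployed component list is illustrative):
# select trained components to reuse at prediction time
pipeline.deploy_component([pipeline.data_transform_0, pipeline.intersection_0, pipeline.hetero_lr_0])
predict_pipeline = PipeLine()
# reader_2 and the deployed pipeline are then added as shown above
predict_pipeline.predict()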
The table below lists all five types of data and whether Input and Output include them.
Data Name | Input | Output | Use Case |
---|---|---|---|
data | Yes | Yes | single data input or output |
train_data | Yes | Yes | model training; output of DataSplit component |
validate_data | Yes | Yes | model training with validate data; output of DataSplit component |
test_data | No | Yes | output of DataSplit component |
predict_input | Yes | No | model prediction |
Data
All input and output data of components need to be wrapped into Data
objects when being passed between components. For information on valid
data types of each component, check the list.
Here is an example of chaining components with different types of data input and output:
from pipeline.backend.pipeline import PipeLine
from pipeline.component import Reader, DataTransform, Intersection, HeteroDataSplit, HeteroLR
from pipeline.interface import Data
# initialize a pipeline
pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest)
# define all components
reader_0 = Reader(name="reader_0")
data_transform_0 = DataTransform(name="data_transform_0")
intersection_0 = Intersection(name="intersection_0")
hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0")
hetero_lr_0 = HeteroLR(name="hetero_lr_0", max_iter=20)
# chain together all components
pipeline.add_component(reader_0)
pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
pipeline.add_component(hetero_lr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                              validate_data=hetero_data_split_0.output.data.test_data))
There are two types of Model: model and isometric_model. When the current component is of the same class as the previous component, receiving model means the current component replicates all model parameters from the previous component. When a model from a previous component is used as input but the current component is of a different class than the previous component, isometric_model is used.
Check below for a case from the mini demo, where model from data_transform_0 is passed to data_transform_1.
pipeline.add_component(data_transform_1,
data=Data(data=reader_1.output.data),
model=Model(data_transform_0.output.model))
Here is a case of using isometric_model: HeteroFeatureSelection uses isometric_model from HeteroFeatureBinning to select the most important features.
pipeline.add_component(hetero_feature_selection_0,
data=Data(data=intersection_0.output.data),
isometric_model=Model(hetero_feature_binning_0.output.model))
Please note that when using the stepwise or cross validation method, components do not have model output. For information on valid model types of each component, check the list.
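For example, here is a minimal sketch of a cross-validation-only run, in which the component produces no model output (parameter values are illustrative):
# run 3-fold cross validation only; no model output is produced in this mode
hetero_lr_cv_0 = HeteroLR(name="hetero_lr_cv_0", max_iter=20,
                          cv_param={"n_splits": 3, "shuffle": False, "need_cv": True})
pipeline.add_component(hetero_lr_cv_0, data=Data(train_data=intersection_0.output.data))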
Cache is only available for the Intersection component. Please refer here for an example of using cache with intersection. The code below sets the cache output from intersection_0 as the cache input of intersection_1.
pipeline.add_component(intersection_1, data=Data(data=data_transform_0.output.data), cache=Cache(intersection_0.output.cache))
To load cache from another job, use the CacheLoader component. In this demo, the result from a previous job is loaded into intersection_0 as cache input.
pipeline.add_component(cache_loader_0)
pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data), cache=Cache(cache_loader_0.output.cache))
Parameters of the underlying module can be set for all job participants or per individual participant.
from pipeline.component import DataTransform
data_transform_0 = DataTransform(name="data_transform_0", input_format="dense", output_format="dense",
outlier_replace=False)
# set guest data_transform_0 component parameters
guest_data_transform_0 = data_transform_0.get_party_instance(role='guest', party_id=9999)
guest_data_transform_0.component_param(with_label=True)
# set host data_transform_0 component parameters
data_transform_0.get_party_instance(role='host', party_id=10000).component_param(with_label=False)
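When several parties share the same role, get_party_instance may also accept a list of party ids, so one call configures multiple participants. A minimal sketch (party ids are illustrative):
# set parameters for two host parties at once
data_transform_0.get_party_instance(role='host', party_id=[10000, 10001]).component_param(with_label=False)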
Output data and model information of Components can be retrieved with the Pipeline task info API. Currently, Pipeline supports these four types of query on components:

- get_output_data: returns output data; the parameter limits caps the number of returned lines
- get_output_data_table: returns the name and namespace of the output data table
- get_model_param: returns fitted model parameters
- get_summary: returns the model summary
To obtain the output of a component, the component first needs to be extracted from the pipeline:
print(pipeline.get_component("data_transform_0").get_output_data(limits=10))
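The other task info queries follow the same pattern; here is a short sketch, assuming the job has completed (component name as defined earlier):
lr_component = pipeline.get_component("hetero_lr_0")
print(lr_component.get_output_data_table())  # name and namespace of the output data table
print(lr_component.get_model_param())        # fitted model parameters
print(lr_component.get_summary())            # model summary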