Resources refer to the basic engine resources, mainly the CPU and memory resources of the computing engine and the CPU and network resources of the transport engine. Currently, only management of the computing engine's CPU resources is supported.
The resource configuration is located in `$FATE_PROJECT_BASE/conf/service_conf.yaml`, that is, the size of the resources that the current engine allocates to the FATE cluster. FATE Flow Server reads all the base engine information from this configuration file at startup and registers it in the database table `t_engine_registry`. The resource configuration can still be modified after FATE Flow Server has started; apply the changes by restarting FATE Flow Server or by reloading the configuration with the command `flow server reload`.
`total_cores` = `nodes` * `cores_per_node`
Example
- `fate_on_standalone`: for running a standalone engine on the same machine as FATE Flow Server, generally used for quick experiments; `nodes` is generally set to 1, and `cores_per_node` is generally the number of CPU cores of the machine (it can also be moderately over-provisioned)
```yaml
fate_on_standalone:
  standalone:
    cores_per_node: 20
    nodes: 1
```
- `fate_on_eggroll`: configured according to the actual deployment of the EggRoll cluster; `nodes` denotes the number of `node manager` machines, and `cores_per_node` denotes the average number of CPU cores per `node manager` machine
```yaml
fate_on_eggroll:
  clustermanager:
    cores_per_node: 16
    nodes: 1
  rollsite:
    host: 127.0.0.1
    port: 9370
```
- `fate_on_spark`: configured according to the resources allocated to the FATE cluster within the Spark cluster; `nodes` indicates the number of Spark nodes, and `cores_per_node` indicates the average number of CPU cores per node allocated to the FATE cluster
```yaml
fate_on_spark:
  spark:
    # default use SPARK_HOME environment variable
    home:
    cores_per_node: 20
    nodes: 2
```
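As a quick cross-check of the `total_cores` formula, here is a minimal Python sketch (not FATE Flow's actual code; plain dict literals stand in for the parsed `service_conf.yaml`, with the values copied from the examples above):

```python
# Engine resource sections as they would look after parsing service_conf.yaml.
# The rollsite sub-section carries no cores, so it is omitted here.
engine_conf = {
    "fate_on_standalone": {"standalone": {"cores_per_node": 20, "nodes": 1}},
    "fate_on_eggroll": {"clustermanager": {"cores_per_node": 16, "nodes": 1}},
    "fate_on_spark": {"spark": {"cores_per_node": 20, "nodes": 2}},
}

def total_cores(engine: dict) -> int:
    """total_cores = nodes * cores_per_node for one engine section."""
    (section,) = [v for v in engine.values() if "cores_per_node" in v]
    return section["nodes"] * section["cores_per_node"]

for name, engine in engine_conf.items():
    print(name, total_cores(engine))
# fate_on_standalone 20
# fate_on_eggroll 16
# fate_on_spark 40
```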
Note: Please make sure that the Spark cluster actually allocates the corresponding amount of resources to the FATE cluster. If the Spark cluster allocates fewer resources than are configured here, FATE jobs can still be submitted, but when FATE Flow submits the tasks to the Spark cluster, they will not actually execute because the Spark cluster has insufficient resources.
We generally use `task_cores` and `task_parallelism` to configure the resources requested by a job, for example:
```json
{
  "job_parameters": {
    "common": {
      "job_type": "train",
      "task_cores": 6,
      "task_parallelism": 2,
      "computing_partitions": 8,
      "timeout": 36000
    }
  }
}
```
The total resources requested by the job are `task_cores` * `task_parallelism`. When creating the job, FATE Flow distributes it to each party and, based on the above configuration, the running role, and the engine used by the party (via `$FATE_PROJECT_BASE/conf/service_conf.yaml#default_engines`), calculates the actual parameters as follows:
- Calculate `request_task_cores`:
    - guest, host: `request_task_cores` = `task_cores`
    - arbiter (which consumes very few resources in actual operation): `request_task_cores` = 1
- Further calculate `task_cores_per_node`: `task_cores_per_node` = max(1, `request_task_cores` / `task_nodes`)
If `eggroll_run` or `spark_run` is used to configure resources in the above `job_parameters`, then the `task_cores` configuration is ignored and `task_cores_per_node` is taken directly from it:

- `task_cores_per_node` = `eggroll_run["eggroll.session.processors.per.node"]`
- `task_cores_per_node` = `spark_run["executor-cores"]`
These are converted to the parameters of the corresponding engine (presented to the computing engine for recognition when the task runs):

- fate_on_standalone / fate_on_eggroll: `eggroll.session.processors.per.node` = `task_cores_per_node`
- fate_on_spark: `num-executors` = `task_nodes`, `executor-cores` = `task_cores_per_node`
The final calculation can be seen in the job's `job_runtime_on_party_conf.json`, typically located at `$FATE_PROJECT_BASE/jobs/$job_id/$role/$party_id/job_runtime_on_party_conf.json`.
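The calculation steps above can be sketched in Python. This is a simplified illustration, not FATE Flow's actual implementation: it assumes the `request_task_cores = 1` branch applies to the arbiter role, and uses integer division combined with a floor of 1 to mirror the `max(1, ...)` rule:

```python
def compute_task_cores_per_node(task_cores, task_nodes, role="guest",
                                eggroll_run=None, spark_run=None):
    """Sketch of the per-node core calculation described above."""
    # eggroll_run / spark_run overrides take precedence over task_cores.
    if eggroll_run and "eggroll.session.processors.per.node" in eggroll_run:
        return int(eggroll_run["eggroll.session.processors.per.node"])
    if spark_run and "executor-cores" in spark_run:
        return int(spark_run["executor-cores"])
    # Assumption: the arbiter consumes very few resources and requests 1 core.
    request_task_cores = 1 if role == "arbiter" else task_cores
    return max(1, request_task_cores // task_nodes)

print(compute_task_cores_per_node(6, 2))                  # guest: 6 // 2 = 3
print(compute_task_cores_per_node(6, 2, role="arbiter"))  # max(1, 1 // 2) = 1
print(compute_task_cores_per_node(6, 2, spark_run={"executor-cores": 4}))  # 4
```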
- `total_cores`: see the total resource allocation section above
- `apply_cores`: see the job request resource configuration section above; `apply_cores` = `task_nodes` * `task_cores_per_node` * `task_parallelism`
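Putting the formulas together, a quick arithmetic check using the `task_cores: 6`, `task_parallelism: 2` example above on a hypothetical engine registered with 2 nodes:

```python
task_cores = 6
task_parallelism = 2
task_nodes = 2  # hypothetical: engine registered with nodes = 2

# task_cores_per_node = max(1, request_task_cores / task_nodes)
task_cores_per_node = max(1, task_cores // task_nodes)  # 3

# apply_cores = task_nodes * task_cores_per_node * task_parallelism
apply_cores = task_nodes * task_cores_per_node * task_parallelism
print(apply_cores)  # 12
```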
{{snippet('cli/resource.md', header=False)}}