Directory used to save model parameters, graph, etc. It can also be used to load
checkpoints for a previously saved model. If null, a temporary directory will be used.
Configuration to use for the created sessions.
Configuration specifying when to save checkpoints.
Random seed value to be used by the TensorFlow initializers. Setting this value allows consistency between re-runs.
Configuration for models in the learn API, to be used by estimators.

If clusterConfig is not provided, then all distributed training related properties are set based on the TF_CONFIG environment variable, if the pertinent information is present. The TF_CONFIG environment variable is a JSON object with attributes: cluster and task. cluster is a JSON serialized version of ClusterConfig, mapping task types (usually one of the instances of TaskType) to a list of task addresses. task has two attributes: type and index, where type can be any of the task types in cluster. When TF_CONFIG contains said information, the following properties are set on this class:

- clusterConfig is parsed from TF_CONFIG['cluster']. Defaults to None. If present, it must have one and only one node for the chief job (i.e., CHIEF task type).
- taskType is set to TF_CONFIG['task']['type']. Must be set if clusterConfig is present; must be worker (the default value), if it is not.
- taskIndex is set to TF_CONFIG['task']['index']. Must be set if clusterConfig is present; must be 0 (the default value), if it is not.
- master is determined by looking up taskType and taskIndex in the clusterConfig. Defaults to "".
- numParameterServers is set by counting the number of nodes listed in the ps job (i.e., PARAMETER_SERVER task type) of clusterConfig. Defaults to 0.
- numWorkers is set by counting the number of nodes listed in the worker and chief jobs (i.e., WORKER and CHIEF task types) of clusterConfig. Defaults to 1.
- isChief is determined based on taskType and TF_CONFIG['cluster'].

There is a special node with taskType set as EVALUATOR, which is not part of the (training) clusterConfig. It handles the distributed evaluation job.

Example for a non-chief node:
Example for a chief node:
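As a rough illustration in Python (standard library only), the sketch below builds a TF_CONFIG value for the chief and checks the "one and only one chief node" constraint. The host names and ports are hypothetical, and the lookup logic is an assumption that follows the property descriptions rather than the library's own code.

```python
import json

# Hypothetical cluster; host names and ports are placeholders.
cluster = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222", "host2:2222"],
    "worker": ["host3:2222", "host4:2222", "host5:2222"],
}

# TF_CONFIG value for the chief node.
tf_config = json.dumps({"cluster": cluster, "task": {"type": "chief", "index": 0}})

parsed = json.loads(tf_config)
task = parsed["task"]

# The cluster must contain exactly one node for the chief job.
chief_count = len(parsed["cluster"]["chief"])
if chief_count != 1:
    raise ValueError("expected exactly one chief node, found %d" % chief_count)

is_chief = task["type"] == "chief"
master = parsed["cluster"][task["type"]][task["index"]]
print(is_chief, master)  # True host0:2222
```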
Example for an evaluator node (an evaluator is not part of the training cluster):
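The following Python sketch (standard library only) shows a TF_CONFIG value for an evaluator and confirms that its task type does not appear in the training cluster. The host names are hypothetical. Falling back to the default empty master when no cluster entry exists is an assumption based on the documented default, not confirmed library behavior.

```python
import json

# Hypothetical training cluster; the evaluator is deliberately not listed in it.
cluster = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222", "host2:2222"],
    "worker": ["host3:2222", "host4:2222", "host5:2222"],
}

# TF_CONFIG value for the evaluator node.
tf_config = json.dumps(
    {"cluster": cluster, "task": {"type": "evaluator", "index": 0}}
)

parsed = json.loads(tf_config)
task_type = parsed["task"]["type"]
task_index = parsed["task"]["index"]

# The evaluator task type has no entry in the training cluster.
in_training_cluster = task_type in parsed["cluster"]

# Assumption: with no address to look up, master keeps its default value "".
addresses = parsed["cluster"].get(task_type, [])
master = addresses[task_index] if task_index < len(addresses) else ""

is_chief = task_type == "chief"
print(in_training_cluster, repr(master), is_chief)  # False '' False
```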
NOTE: If a checkpointConfig is set, maxCheckpointsToKeep might need to be adjusted accordingly, especially in distributed training. For example, using TimeBasedCheckpoints(60) without adjusting maxCheckpointsToKeep (which defaults to 5) means that checkpoints are garbage collected after 5 minutes. In distributed training, the evaluation job starts asynchronously and might fail to load or find the checkpoints due to a race condition.
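The arithmetic behind the checkpoint note can be sketched in Python as follows. The 60-second interval and the default of 5 retained checkpoints come from the note. The evaluation delay and the helper name are hypothetical, and the model is deliberately rough: it treats a checkpoint as gone once its age reaches interval times count.

```python
# Values from the note: a checkpoint every 60 seconds, 5 checkpoints kept.
save_interval_seconds = 60
max_checkpoints_to_keep = 5

# Approximate lifetime of a checkpoint before it is garbage collected.
retention_seconds = save_interval_seconds * max_checkpoints_to_keep


def checkpoint_still_available(age_seconds):
    """Rough check: is a checkpoint of this age still on disk?"""
    return age_seconds < retention_seconds


# Hypothetical delay before an asynchronous evaluation job loads a checkpoint.
evaluation_delay_seconds = 6 * 60

print(retention_seconds)  # 300
print(checkpoint_still_available(evaluation_delay_seconds))  # False
```

If the evaluation job may lag by more than this window, raising maxCheckpointsToKeep or lengthening the checkpoint interval widens it.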