README.md
# Cluster Resolvers

Cluster Resolvers are a new way of specifying cluster information for distributed execution. Built on top of the existing `ClusterSpec` framework, Cluster Resolvers allow users to simply specify a configuration and a cluster management service, and a `ClusterResolver` will automatically fetch the relevant information from that service and populate `ClusterSpec`s.

`ClusterResolver`s are designed to work well with `MonitoredTrainingSession` and `ClusterSpec` propagation, so that distributed training sessions remain robust in the face of node and network failures.
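As an illustration of the idea (a minimal sketch, not taken from the original documentation), an existing `ClusterSpec` can be wrapped in a resolver and merged with others; this assumes the `SimpleClusterResolver` and `UnionClusterResolver` classes exposed under `tf.contrib.cluster_resolver`, and the host addresses are hypothetical:

```
import tensorflow as tf

# Wrap pre-built ClusterSpecs in resolvers.
ps_resolver = tf.contrib.cluster_resolver.SimpleClusterResolver(
    tf.train.ClusterSpec({'ps': ['ps0.example.com:2222']}))
worker_resolver = tf.contrib.cluster_resolver.SimpleClusterResolver(
    tf.train.ClusterSpec({'worker': ['worker0.example.com:2222',
                                     'worker1.example.com:2222']}))

# Merge the individual specs into a single cluster.
union_resolver = tf.contrib.cluster_resolver.UnionClusterResolver(
    ps_resolver, worker_resolver)

cluster_spec = union_resolver.cluster_spec()  # a tf.train.ClusterSpec
```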
README.slurm
# Slurm Cluster Resolver

The Slurm Cluster Resolver resolves the cluster specification for distributed TensorFlow work launched on HPC systems running Slurm. This implementation can handle homogeneous task allocation on compute nodes with the default task distribution plane. The resolution is done by determining the job configuration from a number of Slurm output environment variables and user input. The resolver requires the total number of tasks launched, the process ID/rank of the running process, the number of tasks launched per node, the number of GPUs present on each node, and the number of GPUs to allocate to each task.

The process ID/rank is extracted from the environment variable ```SLURM_PROCID``` and the total number of tasks launched is extracted from ```SLURM_NTASKS```. The number of tasks per node is extracted from ```SLURM_NTASKS_PER_NODE```, unless a value is specified by the user. The number of GPUs present on each node and the number of GPUs for each task have to be specified by the user. A base port can be specified by the user; if more than one task is launched per node, the port number is incremented for each additional task on that node. Hostnames are resolved by running ```scontrol show hostname``` in a subprocess, which returns the list of allocated hostnames; by default, ranks/process IDs are assigned in that order. By default, the allocated GPUs are automatically exposed to each process according to this specification by setting ```CUDA_VISIBLE_DEVICES```.
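The sketch below is an illustrative approximation of this resolution logic, for explanation only; it is not the actual resolver implementation and assumes one GPU per task.

```
import os
import subprocess

rank = int(os.environ['SLURM_PROCID'])        # rank of this process
num_tasks = int(os.environ['SLURM_NTASKS'])   # total number of tasks
tasks_per_node = int(os.environ['SLURM_NTASKS_PER_NODE'])

# Hostnames of the allocated nodes, in the order Slurm assigns ranks.
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostname']).decode().splitlines()

# One "host:port" entry per task; the base port is incremented for each
# additional task placed on the same node.
port_base = 8888
task_addresses = []
for task_id in range(num_tasks):
    host = hostnames[task_id // tasks_per_node]
    port = port_base + task_id % tasks_per_node
    task_addresses.append('%s:%d' % (host, port))

# Expose only the GPU(s) assigned to this task (here: one GPU per task).
os.environ['CUDA_VISIBLE_DEVICES'] = str(rank % tasks_per_node)
```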
## Example
- Slurm allocation in shell: ```salloc --nodes=2 -t 01:30:00 -A <project ID> --ntasks-per-node=2 --gres=gpu:k80:2 --exclusive```
- Creating the cluster in Python:
```
import tensorflow as tf

cluster_resolver = tf.contrib.cluster_resolver.SlurmClusterResolver(
    {'ps': 1, 'worker': 3},
    port_base=8888,
    tasks_per_node=2,
    gpus_per_node=2,
    gpus_per_task=1,
    auto_set_gpu=True)

cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()
```
The above example resolves the cluster specification for a Slurm job allocation with two compute nodes, each with two GPUs, where two tasks are launched on each node. The jobs are specified as a dictionary whose keys are job names (strings) and whose values are the number of tasks (integers) in each job. ```cluster_resolver.cluster_spec()``` returns a cluster specification object; rendered as a protobuf, the cluster specification looks as follows.

```
job {
  name: "ps"
  tasks {
    value: "t02n13:8888"
  }
}
job {
  name: "worker"
  tasks {
    value: "t02n13:8889"
  }
  tasks {
    key: 1
    value: "t02n41:8888"
  }
  tasks {
    key: 2
    value: "t02n41:8889"
  }
}
```
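Since the protobuf text format omits the default task index 0, the same cluster can also be inspected as a Python dictionary, e.g. via the standard ```ClusterSpec.as_dict()``` method:

```
cluster.as_dict()
# {'ps': ['t02n13:8888'],
#  'worker': ['t02n13:8889', 't02n41:8888', 't02n41:8889']}
```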
After calling ```cluster_resolver.cluster_spec()```, the internal data structures of the resolver are populated. By comparing its process ID/rank with the cluster specification, each process can determine which task it corresponds to. This information is retrieved by calling ```cluster_resolver.get_task_info()```, which returns a string specifying the job name and an integer specifying the task index.
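For example (a minimal sketch of the intended usage), the resolved identity can be used to start the TensorFlow server for this process:

```
# Start this process's server using the resolved cluster and task identity.
cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()

server = tf.train.Server(cluster,
                         job_name=job_name,
                         task_index=task_index)
```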
GPUs are automatically allocated to the processes. In the example above, the task at ```t02n41:8888``` will see GPU 0 and the task at ```t02n41:8889``` will see GPU 1.
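Concretely, each process can verify the devices it was given (a hypothetical check; the values shown correspond to the example allocation above with one GPU per task):

```
import os

print(os.environ.get('CUDA_VISIBLE_DEVICES'))
# '0' for the task at t02n41:8888, '1' for the task at t02n41:8889
```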