Commit 1892895

Merge pull request #977 from wwarriner/feat-ntasks-per-socket
Add information on --ntasks-per-socket for multiple GPU jobs.
2 parents 8cee742 + 7f28cc2 commit 1892895

12 files changed: +271 -28 lines changed

docs/cheaha/hardware.md

Lines changed: 2 additions & 0 deletions

@@ -27,6 +27,8 @@ Examples of how to make use of the table:

 The full table can be downloaded [here](./res/hardware_summary_cheaha.csv).

+Information about GPU efficiency can be found at [Making the Most of GPUs](./slurm/gpu.md#making-the-most-of-gpus).
+
 ### Details

 Detailed hardware information, including processor and GPU makes and models, core clock frequencies, and other information for current hardware are in the table below.

docs/cheaha/job_efficiency.md

Lines changed: 3 additions & 1 deletion

@@ -41,7 +41,9 @@ Questions to ask yourself before requesting resources:

 1. How is the software I'm using programmed?

-    - Can it use a GPU? Request one.
+    - Can it use a GPU? Request one. Don't forget to consider...
+        - [Local Scratch](../data_management/cheaha_storage_gpfs/index.md#local-scratch) for [IO performance](../cheaha/slurm/gpu.md#ensuring-io-performance-with-a100-gpus).
+        - `--ntasks-per-socket` when using [Multiple GPUs](../cheaha/slurm/gpu.md#using-multiple-gpus).
     - Can it use multiple cores? Request more than one core.
     - Is it single-threaded? Request only one core.
     - Does it use MPI? Request multiple nodes.

docs/cheaha/open_ondemand/ood_jupyter.md

Lines changed: 8 additions & 0 deletions

@@ -32,6 +32,14 @@ For information on partition and GPU selection, please review our [hardware info
 The latest CUDA and cuDNN are now available from [Conda](../slurm/gpu.md#cuda-and-cudnn-modules).
 <!-- markdownlint-enable MD046 -->

+For more information on GPU efficiency please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+<!-- markdownlint-disable MD046 -->
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean selecting multiple GPUs results in some GPUs not being used.
+<!-- markdownlint-enable MD046 -->
+
 ## Extra Jupyter Arguments

 The `Extra Jupyter Arguments` field allows you to pass additional arguments to the Jupyter Server as it is being started. It can be helpful to point the server to the folder containing your notebook. To do this, assuming your notebooks are stored in `/data/user/$USER`, also known as `$USER_DATA`, put `--notebook-dir=$USER_DATA` in this field. You will be able to navigate to the notebook if it is in a subdirectory of `notebook-dir`, but you won't be able to navigate to any other directories. An example is shown below.

docs/cheaha/open_ondemand/ood_layout.md

Lines changed: 8 additions & 0 deletions

@@ -94,6 +94,14 @@ The interactive apps have the following fields to customize the resources for yo

 Every interactive app has resources only allocated on a single node, and resources are shared among all processes running in the app. Make sure the amount of memory you request is less than or equal to the max amount per node for the partition you choose. We have a table with [memory available per node](../hardware.md#cheaha-hpc-cluster) for each partition.

+For more information on GPU efficiency please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+<!-- markdownlint-disable MD046 -->
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean selecting multiple GPUs results in some GPUs not being used.
+<!-- markdownlint-enable MD046 -->
+
 #### Environment Setup Window

 In addition to requesting general resources, for some apps you will have the option to add commands to be run during job startup in an Environment Setup Window. See below for an example showing how to load CUDA into a Jupyter job so it can use a GPU.

docs/cheaha/open_ondemand/ood_matlab.md

Lines changed: 8 additions & 0 deletions

@@ -35,6 +35,14 @@ You may optionally verify that Python works correctly by entering `py.list(["hel

 Please see the [MATLAB Section on our GPU Page](../slurm/gpu.md#matlab).

+For more information on GPU efficiency please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+<!-- markdownlint-disable MD046 -->
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean selecting multiple GPUs results in some GPUs not being used.
+<!-- markdownlint-enable MD046 -->
+
 ## Known Issues

 There is a known issue with `parpool` and other related multi-core parallel features such as `parfor` affecting R2022a and earlier. See our [Modules Known Issues section](../software/modules.md#matlab-issues) for more information.

docs/cheaha/res/job_submit_flags.csv

Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ Flag,Short,Environment Variable,Description,sbatch,srun
 `--time`,`-t`,`SBATCH_TIMELIMIT`,Maximum allowed runtime of job. Allowed formats below.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_time),[srun](https://slurm.schedmd.com/srun.html#OPT_time)
 `--nodes`,`-N`,,Number of nodes needed. Set to `1` if your software does not use MPI or if unsure.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_nodes),[srun](https://slurm.schedmd.com/srun.html#OPT_nodes)
 `--ntasks`,`-n`,`SLURM_NTASKS`,Number of tasks planned per node. Mostly used for bookkeeping and calculating total cpus per node. If unsure set to `1`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks)
+`--ntasks-per-socket`,,,"Number of tasks per socket. Required for multiple GPU jobs, [details here](../slurm/gpu.md#using-multiple-gpus)",[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks-per-socket),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks-per-socket)
 `--cpus-per-task`,`-c`,`SLURM_CPUS_PER_TASK`,Number of needed cores per task. Cores per node equals `-n` times `-c`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_cpus-per-task),[srun](https://slurm.schedmd.com/srun.html#OPT_cpus-per-task)
 ,,`SLURM_CPUS_ON_NODE`,Number of cpus available on this node.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CPUS_ON_NODE),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
 `--mem`,,`SLURM_MEM_PER_NODE`,Amount of RAM needed per node in MB. Can specify 16 GB using 16384 or 16G.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_mem),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)

docs/cheaha/slurm/gpu.md

Lines changed: 66 additions & 1 deletion

@@ -21,12 +21,77 @@ When requesting a job using `sbatch`, you will need to include the Slurm flag `-

 <!-- markdownlint-enable MD046 -->

-## Ensuring IO Performance With A100 GPUs
+### Making the Most of GPUs
+
+#### Ensuring IO Performance With A100 GPUs

 If you are using `amperenodes` and the A100 GPUs, then it is highly recommended to move your input files to the [local scratch](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) at `/local/$USER/$SLURM_JOB_ID` prior to running your workflow, to ensure adequate GPU performance. Network file mounts, such as `$USER_SCRATCH`, `/scratch/`, `/data/user/` and `/data/project/`, do not have sufficient bandwidth to keep the GPU busy. So, your processing pipeline will slow down to network speeds, instead of GPU speeds.

 Please see our [Local Scratch Storage section](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) for more details and an example script.
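As a rough illustration of the staging pattern described above (an illustrative sketch, not part of this commit; the linked Local Scratch section has the canonical example script), a job script might copy inputs to local scratch, run against the local copy, and copy outputs back before exiting. The `my_inputs` folder, `train.py` script, and `results` folder are hypothetical placeholders:

```bash
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# Node-local scratch directory for this job (path described above).
LOCAL_DIR="/local/$USER/$SLURM_JOB_ID"
mkdir -p "$LOCAL_DIR"

# Stage inputs from network storage to local scratch before computing.
cp -r "$USER_DATA/my_inputs" "$LOCAL_DIR/"

# Run the workload against the local copy (hypothetical command).
cd "$LOCAL_DIR"
python train.py --data "$LOCAL_DIR/my_inputs" --out "$LOCAL_DIR/results"

# Copy results back to network storage before the job ends.
cp -r "$LOCAL_DIR/results" "$USER_DATA/"
```

Copying results back at the end matters because node-local scratch is not shared across nodes and is typically cleaned up after the job finishes.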

+#### Using Multiple GPUs
+
+<!-- markdownlint-disable MD046 -->
+!!! note
+
+    To effectively use multiple GPUs per node, you'll need to get in the mindset of doing some light unit canceling, multiplication, and division. Please be mindful.
+<!-- markdownlint-enable MD046 -->
+
+When using multiple GPUs on the `amperenodes*` or `pascalnodes*` partitions, an additional Slurm directive is required to ensure the GPUs can all be put to use by your research software: `--ntasks-per-socket`. You will need to explicitly set the `--ntasks` directive to an integer multiple of the number of GPUs in `--gres=gpu`, then set `--ntasks-per-socket` to the multiplier.
+
+Most researchers, in most scenarios, should find the following examples to be sufficient. It is very important to note that `--ntasks-per-socket` times `--gres=gpu` equals `--ntasks` (in the Pascalnodes example below, 1 times 4 equals 4). You will need to supply other directives, as usual, remembering that the total number of CPUs equals `--cpus-per-task` times `--ntasks`, and that the total number of CPUs per node cannot exceed the actual number of physical cores on the node, and cannot exceed any quotas for the partition. See [Hardware](../hardware.md#cheaha-hpc-cluster) for more information about hardware and quota limits on Cheaha.
+
+Pascalnodes:
+
+```bash
+#SBATCH --partition=pascalnodes # up to 28 cpus per node
+#SBATCH --ntasks-per-socket=1
+#SBATCH --gres=gpu:4
+#SBATCH --ntasks=4
+#SBATCH --cpus-per-task=7 # 7 cpus-per-task times 4 tasks = 28 cpus
+```
+
+Amperenodes:
+
+```bash
+#SBATCH --partition=amperenodes # up to 64 cpus per job
+#SBATCH --ntasks-per-socket=1
+#SBATCH --gres=gpu:2
+#SBATCH --ntasks=2
+#SBATCH --cpus-per-task=32 # 32 cpus-per-task times 2 tasks = 64 cpus
+```
+
+If `--ntasks-per-socket` is not used, or used incorrectly, it is possible that some of the GPUs requested may go unused, reducing performance and increasing job runtimes. For more information, please read the [GPU-Core Affinity Details](#gpu-core-affinity-details) below.
+
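One quick way to sanity-check a multi-GPU allocation (an illustrative sketch, not part of this commit) is to print the GPUs visible to the job from inside the job script, then check utilization while the workload runs. If fewer devices are listed than requested, or some stay at 0% utilization, the task-to-socket layout is the likely culprit:

```bash
# List the GPU devices visible to this job allocation.
# Slurm typically restricts visibility via CUDA_VISIBLE_DEVICES for --gres=gpu jobs.
nvidia-smi -L
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# While the workload is running, confirm every requested GPU is busy.
nvidia-smi --query-gpu=index,name,utilization.gpu --format=csv
```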
+##### GPU-Core Affinity Details
+
+Important terminology:
+
+- **[node](https://en.wikipedia.org/wiki/Node_(networking)#Distributed_systems)**: A single computer in a cluster.
+- **[mainboard](https://en.wikipedia.org/wiki/Motherboard)**: The central circuit board of a computer, where the components of the node all integrate.
+- **[CPU](https://en.wikipedia.org/wiki/Central_processing_unit)**: Central Processing Unit, where general calculations are performed during operation of the node. Often contains multiple cores. Sometimes conflated with "core". We use the term "CPU die" in this section to avoid ambiguity.
+- **[socket](https://en.wikipedia.org/wiki/CPU_socket)**: A connector on the mainboard for electrical connection to a CPU die. Some mainboards have a single socket, others have multiple sockets.
+- **[core](https://en.wikipedia.org/wiki/Processor_core)**: A single physical processor of computer instructions. One core can carry out one computation at a time. Part of a CPU. Also called "processor core". Sometimes conflated with "CPU".
+- **[GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit)**: Graphics Processing Unit, trades off generalized computing for faster computation with a limited set of operations. Often used for AI processing. Contains many specialized cores. Increasingly called "accelerator" in the context of clusters and high-performance computing (HPC).
+
+Nodes in both the `amperenodes*` and `pascalnodes*` partitions are configured as follows:
+
+- Each node has a single mainboard.
+- Each mainboard has two sockets.
+- Each socket has a single CPU die.
+- Each CPU die has multiple cores:
+    - `amperenodes*`: 128 cores per CPU die
+    - `pascalnodes*`: 28 cores per CPU die
+- Each socket is connected with a subset of the GPUs:
+    - `amperenodes*`: 1 GPU per socket (2 per mainboard)
+    - `pascalnodes*`: 2 GPUs per socket (4 per mainboard)
+
+Communication between each socket and its connected GPUs is relatively fast. Communication between GPUs connected to different sockets is much slower, so we want to make sure that Slurm knows which cores in each socket are associated with each GPU, to allow for optimal performance of applications. The association between cores and GPUs is called "GPU-core affinity". Slurm is made explicitly aware of GPU-core affinity in the file located at `/etc/slurm/gres.conf`.
+
+When a researcher submits an sbatch script, the use of `--ntasks-per-socket` informs Slurm that tasks should be distributed across sockets, rather than following the default "first available" behavior. Often, the default behavior results in all cores being allocated from a single socket, leaving some of the GPUs unavailable to your software, or with lower than expected performance.
+
+To ensure optimal performance, always use the `--ntasks-per-socket` directive when requesting multiple GPUs.
+
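To see this affinity in practice (an illustrative aside, not part of this commit), the GPU/CPU topology can be printed from inside a job on the target partition; the `CPU Affinity` column of the matrix shows which core ranges share a socket with each GPU. The node name passed to `scontrol` below is a hypothetical example:

```bash
# Print the GPU/CPU topology matrix for the node the job is running on.
# The "CPU Affinity" column lists the cores local to each GPU's socket.
nvidia-smi topo -m

# Show Slurm's view of a node's sockets and GPU resources (node name is an example).
scontrol show node c0201 | grep -i -E "gres|socket"
```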
 ### Open OnDemand

 When requesting an interactive job through `Open OnDemand`, selecting the `pascalnodes` partitions will automatically request access to one GPU as well. There is currently no way to change the number of GPUs for OOD interactive jobs.
