Add information on --ntasks-per-socket for multiple GPU jobs. #977

Merged 9 commits on May 2, 2025
2 changes: 2 additions & 0 deletions docs/cheaha/hardware.md
@@ -27,6 +27,8 @@ Examples of how to make use of the table:

The full table can be downloaded [here](./res/hardware_summary_cheaha.csv).

Information about GPU efficiency can be found at [Making the Most of GPUs](./slurm/gpu.md#making-the-most-of-gpus).

### Details

Detailed hardware information, including processor and GPU makes and models, core clock frequencies, and other information for current hardware are in the table below.
4 changes: 3 additions & 1 deletion docs/cheaha/job_efficiency.md
@@ -41,7 +41,9 @@ Questions to ask yourself before requesting resources:

1. How is the software I'm using programmed?

- Can it use a GPU? Request one.
- Can it use a GPU? Request one. Don't forget to consider...
- [Local Scratch](../data_management/cheaha_storage_gpfs/index.md#local-scratch) for [IO performance](../cheaha/slurm/gpu.md#ensuring-io-performance-with-a100-gpus).
- `--ntasks-per-socket` when using [Multiple GPUs](../cheaha/slurm/gpu.md#using-multiple-gpus).
- Can it use multiple cores? Request more than one core.
- Is it single-threaded? Request only one core.
- Does it use MPI? Request multiple nodes.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_jupyter.md
@@ -32,6 +32,14 @@ For information on partition and GPU selection, please review our [hardware info
The latest CUDA and cuDNN are now available from [Conda](../slurm/gpu.md#cuda-and-cudnn-modules).
<!-- markdownlint-enable MD046 -->

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

## Extra Jupyter Arguments

The `Extra Jupyter Arguments` field allows you to pass additional arguments to the Jupyter Server as it is being started. It can be helpful to point the server to the folder containing your notebook. To do this, assuming your notebooks are stored in `/data/user/$USER`, also known as `$USER_DATA`, put `--notebook-dir=$USER_DATA` in this field. You will be able to navigate to the notebook if it is in a subdirectory of `notebook-dir`, but you won't be able to navigate to any other directories. An example is shown below.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_layout.md
@@ -94,6 +94,14 @@ The interactive apps have the following fields to customize the resources for yo

Every interactive app has resources only allocated on a single node, and resources are shared among all processes running in the app. Make sure the amount of memory you request is less than or equal to the max amount per node for the partition you choose. We have a table with [memory available per node](../hardware.md#cheaha-hpc-cluster) for each partition.

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

#### Environment Setup Window

In addition to requesting general resources, for some apps you will have the option to add commands to be run during job startup in an Environment Setup Window. See below for an example showing how to load CUDA into a Jupyter job so it can use a GPU.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_matlab.md
@@ -35,6 +35,14 @@ You may optionally verify that Python works correctly by entering `py.list(["hel

Please see the [MATLAB Section on our GPU Page](../slurm/gpu.md#matlab).

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

## Known Issues

There is a known issue with `parpool` and other related multi-core parallel features such as `parfor` affecting R2022a and earlier. See our [Modules Known Issues section](../software/modules.md#matlab-issues) for more information.
1 change: 1 addition & 0 deletions docs/cheaha/res/job_submit_flags.csv
@@ -7,6 +7,7 @@ Flag,Short,Environment Variable,Description,sbatch,srun
`--time`,`-t`,`SBATCH_TIMELIMIT`,Maximum allowed runtime of job. Allowed formats below.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_time),[srun](https://slurm.schedmd.com/srun.html#OPT_time)
`--nodes`,`-N`,,Number of nodes needed. Set to `1` if your software does not use MPI or if unsure.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_nodes),[srun](https://slurm.schedmd.com/srun.html#OPT_nodes)
`--ntasks`,`-n`,`SLURM_NTASKS`,Number of tasks planned per node. Mostly used for bookkeeping and calculating total cpus per node. If unsure set to `1`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks)
`--ntasks-per-socket`,,,"Number of tasks per socket. Required for multiple GPU jobs; see [Using Multiple GPUs](../slurm/gpu.md#using-multiple-gpus) for details.",[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks-per-socket),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks-per-socket)
`--cpus-per-task`,`-c`,`SLURM_CPUS_PER_TASK`,Number of needed cores per task. Cores per node equals `-n` times `-c`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_cpus-per-task),[srun](https://slurm.schedmd.com/srun.html#OPT_cpus-per-task)
,,`SLURM_CPUS_ON_NODE`,Number of cpus available on this node.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CPUS_ON_NODE),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
`--mem`,,`SLURM_MEM_PER_NODE`,Amount of RAM needed per node in MB. Can specify 16 GB using 16384 or 16G.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_mem),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
67 changes: 66 additions & 1 deletion docs/cheaha/slurm/gpu.md
@@ -21,12 +21,77 @@ When requesting a job using `sbatch`, you will need to include the Slurm flag `-

<!-- markdownlint-enable MD046 -->

## Ensuring IO Performance With A100 GPUs
### Making the Most of GPUs

#### Ensuring IO Performance With A100 GPUs

If you are using `amperenodes` and the A100 GPUs, it is highly recommended to move your input files to the [local scratch](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) at `/local/$USER/$SLURM_JOB_ID` prior to running your workflow, to ensure adequate GPU performance. Network file mounts, such as `$USER_SCRATCH`, `/scratch/`, `/data/user/`, and `/data/project/`, do not have sufficient bandwidth to keep the GPU busy, so your processing pipeline will slow down to network speeds instead of GPU speeds.

Please see our [Local Scratch Storage section](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) for more details and an example script.
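
Below is a minimal sketch of that staging pattern, assuming your inputs live in a hypothetical directory `/data/user/$USER/my_project/inputs` and your workflow script (`my_workflow.sh`, a placeholder) accepts input and output paths as arguments; adapt the paths and resource requests to your own job.

```bash
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=12:00:00

# Stage input data onto node-local scratch so the GPU is not starved by network IO.
# INPUT_DIR is a hypothetical path; replace it with your own data location.
INPUT_DIR="/data/user/$USER/my_project/inputs"
LOCAL_DIR="/local/$USER/$SLURM_JOB_ID"

mkdir -p "$LOCAL_DIR"
cp -r "$INPUT_DIR" "$LOCAL_DIR/"

# Run the workflow against the local copy (my_workflow.sh is a placeholder).
./my_workflow.sh --input "$LOCAL_DIR/inputs" --output "$LOCAL_DIR/results"

# Copy results back to network storage before the job ends, then clean up.
cp -r "$LOCAL_DIR/results" "/data/user/$USER/my_project/"
rm -rf "$LOCAL_DIR"
```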

#### Using Multiple GPUs

<!-- markdownlint-disable MD046 -->
!!! note

To effectively use multiple GPUs per node, you'll need to get into the mindset of doing some light unit canceling, multiplication, and division. Please be mindful.
<!-- markdownlint-enable MD046 -->

When using multiple GPUs on the `amperenodes*` or `pascalnodes*` partitions, an additional Slurm directive is required to ensure the GPUs can all be put to use by your research software: `--ntasks-per-socket`. You will need to explicitly set the `--ntasks` directive to an integer multiple of the number of GPUs in `--gres=gpu`, then set `--ntasks-per-socket` to the multiplier.

Most researchers, in most scenarios, should find the following examples sufficient. It is very important to note that `--ntasks-per-socket` times `--gres=gpu` equals `--ntasks` (for example, 1 times 4 equals 4 in the `pascalnodes` example below). You will need to supply other directives as usual, remembering that the total number of CPUs equals `--cpus-per-task` times `--ntasks`, and that the total number of CPUs per node cannot exceed the number of physical cores on the node, nor any quotas for the partition. See [Hardware](../hardware.md#cheaha-hpc-cluster) for more information about hardware and quota limits on Cheaha.

Pascalnodes:

```bash
#SBATCH --partition=pascalnodes # up to 28 cpus per node
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=7 # 7 cpus-per-task times 4 tasks = 28 cpus
```

Amperenodes:

```bash
#SBATCH --partition=amperenodes # up to 64 cpus per job
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=32 # 32 cpus-per-task times 2 tasks = 64 cpus
```

If `--ntasks-per-socket` is not used, or used incorrectly, it is possible that some of the GPUs requested may go unused, reducing performance and increasing job runtimes. For more information, please read the [GPU-Core Affinity Details](#gpu-core-affinity-details) below.
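
As a quick sanity check, you can list the GPUs that Slurm has actually bound to your job from inside the job script. This is only a sketch and assumes `nvidia-smi` is available on the GPU node, which is normally the case on `amperenodes*` and `pascalnodes*`.

```bash
# Print the GPU devices Slurm exposed to the job as a whole...
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# ...and the GPUs visible to each task. With correct --ntasks-per-socket
# usage, every requested GPU should appear in the combined output.
srun --ntasks="$SLURM_NTASKS" nvidia-smi -L
```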

##### GPU-Core Affinity Details

Important terminology:

- **[node](https://en.wikipedia.org/wiki/Node_(networking)#Distributed_systems)**: A single computer in a cluster.
- **[mainboard](https://en.wikipedia.org/wiki/Motherboard)**: The central circuit board of a computer, where the components of the node all integrate.
- **[CPU](https://en.wikipedia.org/wiki/Central_processing_unit)**: Central Processing Unit, where general calculations are performed during operation of the node. Often contains multiple cores. Sometimes conflated with "core". We use the term "CPU die" in this section to avoid ambiguity.
- **[socket](https://en.wikipedia.org/wiki/CPU_socket)**: A connector on the mainboard for electrical connection to a CPU die. Some mainboards have a single socket, others have multiple sockets.
- **[core](https://en.wikipedia.org/wiki/Processor_core)**: A single physical processor of computer instructions. One core can carry out one computation at a time. Part of a CPU. Also called "processor core". Sometimes conflated with "CPU".
- **[GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit)**: Graphics Processing Unit, trades off generalized computing for faster computation with a limited set of operations. Often used for AI processing. Contains many specialized cores. Increasingly called "accelerator" in the context of clusters and high-performance computing (HPC).

Nodes in both the `amperenodes*` and `pascalnodes*` partitions are configured as follows:

- Each node has a single mainboard.
- Each mainboard has two sockets.
- Each socket has a single CPU die.
- Each CPU die has multiple cores:
- `amperenodes*`: 128 cores per CPU die
- `pascalnodes*`: 28 cores per CPU die
- Each socket is connected with a subset of the GPUs:
- `amperenodes*`: 1 GPU per socket (2 per mainboard)
- `pascalnodes*`: 2 GPUs per socket (4 per mainboard)

Communication between each socket and its connected GPUs is relatively fast. Communication between GPUs connected to different sockets is much slower, so we want to make sure that Slurm knows which cores in each socket are associated with each GPU, allowing applications to perform optimally. The association between cores and GPUs is called "GPU-core affinity". Slurm is made explicitly aware of GPU-core affinity in the file located at `/etc/slurm/gres.conf`.

When a researcher submits an sbatch script, the use of `--ntasks-per-socket` informs Slurm that tasks should be distributed across sockets, rather than the default behavior of "first available". Often, the default behavior results in all cores being allocated from a single socket, leaving some of the GPUs unavailable to your software, or with lower than expected performance.

To ensure optimal performance, always use `--ntasks-per-socket` when requesting multiple GPUs.
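
If you would like to inspect GPU-core affinity yourself, `nvidia-smi` can print the node's topology matrix, including the CPU cores each GPU is attached to. This is only a diagnostic sketch; run it from inside a job on a GPU node, and note that the exact columns vary with driver version.

```bash
# Show the GPU/CPU topology of the node, including the CPU affinity of each GPU.
nvidia-smi topo -m
```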

### Open OnDemand

When requesting an interactive job through `Open OnDemand`, selecting the `pascalnodes` partitions will automatically request access to one GPU as well. There is currently no way to change the number of GPUs for OOD interactive jobs.