Add information on --ntasks-per-socket for multiple GPU jobs. #977

Merged 9 commits on May 2, 2025
2 changes: 2 additions & 0 deletions docs/cheaha/hardware.md
@@ -27,6 +27,8 @@ Examples of how to make use of the table:

The full table can be downloaded [here](./res/hardware_summary_cheaha.csv).

Information about GPU efficiency can be found at [Making the Most of GPUs](./slurm/gpu.md#making-the-most-of-gpus).

### Details

Detailed hardware information, including processor and GPU makes and models, core clock frequencies, and other information for current hardware are in the table below.
4 changes: 3 additions & 1 deletion docs/cheaha/job_efficiency.md
@@ -41,7 +41,9 @@ Questions to ask yourself before requesting resources:

1. How is the software I'm using programmed?

- Can it use a GPU? Request one.
- Can it use a GPU? Request one. Don't forget to consider...
- [Local Scratch](../data_management/cheaha_storage_gpfs/index.md#local-scratch) for [IO performance](../cheaha/slurm/gpu.md#ensuring-io-performance-with-a100-gpus).
- `--ntasks-per-socket` when using [Multiple GPUs](../cheaha/slurm/gpu.md#using-multiple-gpus).
- Can it use multiple cores? Request more than one core.
- Is it single-threaded? Request only one core.
- Does it use MPI? Request multiple nodes.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_jupyter.md
@@ -32,6 +32,14 @@ For information on partition and GPU selection, please review our [hardware info
The latest CUDA and cuDNN are now available from [Conda](../slurm/gpu.md#cuda-and-cudnn-modules).
<!-- markdownlint-enable MD046 -->

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

## Extra Jupyter Arguments

The `Extra Jupyter Arguments` field allows you to pass additional arguments to the Jupyter Server as it is being started. It can be helpful to point the server to the folder containing your notebook. To do this, assuming your notebooks are stored in `/data/user/$USER`, also known as `$USER_DATA`, put `--notebook-dir=$USER_DATA` in this field. You will be able to navigate to the notebook if it is in a subdirectory of `notebook-dir`, but you won't be able to navigate to any other directories. An example is shown below.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_layout.md
@@ -94,6 +94,14 @@ The interactive apps have the following fields to customize the resources for yo

Every interactive app has resources only allocated on a single node, and resources are shared among all processes running in the app. Make sure the amount of memory you request is less than or equal to the max amount per node for the partition you choose. We have a table with [memory available per node](../hardware.md#cheaha-hpc-cluster) for each partition.

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

#### Environment Setup Window

In addition to requesting general resources, for some apps you will have the option to add commands to be run during job startup in an Environment Setup Window. See below for an example showing how to load CUDA into a Jupyter job so it can use a GPU.
8 changes: 8 additions & 0 deletions docs/cheaha/open_ondemand/ood_matlab.md
@@ -35,6 +35,14 @@ You may optionally verify that Python works correctly by entering `py.list(["hel

Please see the [MATLAB Section on our GPU Page](../slurm/gpu.md#matlab).

For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).

<!-- markdownlint-disable MD046 -->
!!! important

April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This may mean that selecting multiple GPUs results in some GPUs not being used.
<!-- markdownlint-enable MD046 -->

## Known Issues

There is a known issue with `parpool` and other related multi-core parallel features such as `parfor` affecting R2022a and earlier. See our [Modules Known Issues section](../software/modules.md#matlab-issues) for more information.
1 change: 1 addition & 0 deletions docs/cheaha/res/job_submit_flags.csv
@@ -7,6 +7,7 @@ Flag,Short,Environment Variable,Description,sbatch,srun
`--time`,`-t`,`SBATCH_TIMELIMIT`,Maximum allowed runtime of job. Allowed formats below.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_time),[srun](https://slurm.schedmd.com/srun.html#OPT_time)
`--nodes`,`-N`,,Number of nodes needed. Set to `1` if your software does not use MPI or if unsure.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_nodes),[srun](https://slurm.schedmd.com/srun.html#OPT_nodes)
`--ntasks`,`-n`,`SLURM_NTASKS`,Number of tasks planned per node. Mostly used for bookkeeping and calculating total cpus per node. If unsure set to `1`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks)
`--ntasks-per-socket`,,,"Number of tasks per socket. Required for multiple GPU jobs; see [Using Multiple GPUs](../slurm/gpu.md#using-multiple-gpus) for details.",[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks-per-socket),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks-per-socket)
`--cpus-per-task`,`-c`,`SLURM_CPUS_PER_TASK`,Number of needed cores per task. Cores per node equals `-n` times `-c`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_cpus-per-task),[srun](https://slurm.schedmd.com/srun.html#OPT_cpus-per-task)
,,`SLURM_CPUS_ON_NODE`,Number of cpus available on this node.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CPUS_ON_NODE),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
`--mem`,,`SLURM_MEM_PER_NODE`,Amount of RAM needed per node in MB. Can specify 16 GB using 16384 or 16G.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_mem),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
67 changes: 66 additions & 1 deletion docs/cheaha/slurm/gpu.md
@@ -21,12 +21,77 @@ When requesting a job using `sbatch`, you will need to include the Slurm flag `-

<!-- markdownlint-enable MD046 -->

## Ensuring IO Performance With A100 GPUs
### Making the Most of GPUs

#### Ensuring IO Performance With A100 GPUs

If you are using `amperenodes` and the A100 GPUs, it is highly recommended to move your input files to the [local scratch](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) at `/local/$USER/$SLURM_JOB_ID` prior to running your workflow, to ensure adequate GPU performance. Network file mounts, such as `$USER_SCRATCH`, `/scratch/`, `/data/user/`, and `/data/project/`, do not have sufficient bandwidth to keep the GPU busy, so your processing pipeline will slow down to network speeds instead of GPU speeds.

Please see our [Local Scratch Storage section](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) for more details and an example script.
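
Below is a minimal sketch of that staging pattern, assuming your inputs live in a hypothetical directory `/data/user/$USER/my_project/inputs` and your workflow script (`my_workflow.sh`, a placeholder) accepts input and output paths as arguments; adapt the paths and resource requests to your own job.

```bash
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=12:00:00

# Stage input data onto node-local scratch so the GPU is not starved by network IO.
# INPUT_DIR is a hypothetical path; replace it with your own data location.
INPUT_DIR="/data/user/$USER/my_project/inputs"
LOCAL_DIR="/local/$USER/$SLURM_JOB_ID"

mkdir -p "$LOCAL_DIR"
cp -r "$INPUT_DIR" "$LOCAL_DIR/"

# Run the workflow against the local copy (my_workflow.sh is a placeholder).
./my_workflow.sh --input "$LOCAL_DIR/inputs" --output "$LOCAL_DIR/results"

# Copy results back to network storage before the job ends, then clean up.
cp -r "$LOCAL_DIR/results" "/data/user/$USER/my_project/"
rm -rf "$LOCAL_DIR"
```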

#### Using Multiple GPUs

<!-- markdownlint-disable MD046 -->
!!! note

To effectively use multiple GPUs per node, you'll need to get into the mindset of doing some light unit canceling, multiplication, and division. Please be mindful.
<!-- markdownlint-enable MD046 -->

When using multiple GPUs on the `amperenodes*` or `pascalnodes*` partitions, an additional Slurm directive is required to ensure the GPUs can all be put to use by your research software: `--ntasks-per-socket`. You will need to explicitly set the `--ntasks` directive to an integer multiple of the number of GPUs in `--gres=gpu`, then set `--ntasks-per-socket` to the multiplier.

Most researchers, in most scenarios, should find the following examples sufficient. It is very important to note that `--ntasks-per-socket` times `--gres=gpu` equals `--ntasks` (for example, 1 times 4 equals 4 in the `pascalnodes` example below). You will need to supply other directives as usual, remembering that the total number of CPUs equals `--cpus-per-task` times `--ntasks`, and that the total number of CPUs per node cannot exceed the number of physical cores on the node, nor any quotas for the partition. See [Hardware](../hardware.md#cheaha-hpc-cluster) for more information about hardware and quota limits on Cheaha.

Pascalnodes:

```bash
#SBATCH --partition=pascalnodes # up to 28 cpus per node
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=7 # 7 cpus-per-task times 4 tasks = 28 cpus
```

Amperenodes:

```bash
#SBATCH --partition=amperenodes # up to 64 cpus per job
#SBATCH --ntasks-per-socket=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=32 # 32 cpus-per-task times 2 tasks = 64 cpus
```

If `--ntasks-per-socket` is not used, or used incorrectly, it is possible that some of the GPUs requested may go unused, reducing performance and increasing job runtimes. For more information, please read the [GPU-Core Affinity Details](#gpu-core-affinity-details) below.
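
As a quick sanity check, you can list the GPUs that Slurm has actually bound to your job from inside the job script. This is only a sketch and assumes `nvidia-smi` is available on the GPU node, which is normally the case on `amperenodes*` and `pascalnodes*`.

```bash
# Print the GPU devices Slurm exposed to the job as a whole...
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# ...and the GPUs visible to each task. With correct --ntasks-per-socket
# usage, every requested GPU should appear in the combined output.
srun --ntasks="$SLURM_NTASKS" nvidia-smi -L
```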

##### GPU-Core Affinity Details

Important terminology:

- **[node](https://en.wikipedia.org/wiki/Node_(networking)#Distributed_systems)**: A single computer in a cluster.
- **[mainboard](https://en.wikipedia.org/wiki/Motherboard)**: The central circuit board of a computer, where the components of the node all integrate.
- **[CPU](https://en.wikipedia.org/wiki/Central_processing_unit)**: Central Processing Unit, where general calculations are performed during operation of the node. Often contains multiple cores. Sometimes conflated with "core". We use the term "CPU die" in this section to avoid ambiguity.
- **[socket](https://en.wikipedia.org/wiki/CPU_socket)**: A connector on the mainboard for electrical connection to a CPU die. Some mainboards have a single socket, others have multiple sockets.
- **[core](https://en.wikipedia.org/wiki/Processor_core)**: A single physical processor of computer instructions. One core can carry out one computation at a time. Part of a CPU. Also called "processor core". Sometimes conflated with "CPU".
- **[GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit)**: Graphics Processing Unit, trades off generalized computing for faster computation with a limited set of operations. Often used for AI processing. Contains many specialized cores. Increasingly called "accelerator" in the context of clusters and high-performance computing (HPC).

Nodes in both the `amperenodes*` and `pascalnodes*` partitions are configured as follows:

- Each node has a single mainboard.
- Each mainboard has two sockets.
- Each socket has a single CPU die.
- Each CPU die has multiple cores:
- `amperenodes*`: 128 cores per CPU die
- `pascalnodes*`: 28 cores per CPU die
- Each socket is connected with a subset of the GPUs:
- `amperenodes*`: 1 GPU per socket (2 per mainboard)
- `pascalnodes*`: 2 GPUs per socket (4 per mainboard)

Communication between each socket and its connected GPUs is relatively fast. Communication between GPUs connected to different sockets is much slower, so we want to make sure that Slurm knows which cores in each socket are associated with each GPU, allowing applications to perform optimally. The association between cores and GPUs is called "GPU-core affinity". Slurm is made explicitly aware of GPU-core affinity in the file located at `/etc/slurm/gres.conf`.

When a researcher submits an sbatch script, the use of `--ntasks-per-socket` informs Slurm that tasks should be distributed across sockets, rather than the default behavior of "first available". Often, the default behavior results in all cores being allocated from a single socket, leaving some of the GPUs unavailable to your software, or with lower than expected performance.

To ensure optimal performance, always use `--ntasks-per-socket` when requesting multiple GPUs.
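
If you would like to inspect GPU-core affinity yourself, `nvidia-smi` can print the node's topology matrix, including the CPU cores each GPU is attached to. This is only a diagnostic sketch; run it from inside a job on a GPU node, and note that the exact columns vary with driver version.

```bash
# Show the GPU/CPU topology of the node, including the CPU affinity of each GPU.
nvidia-smi topo -m
```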

### Open OnDemand

When requesting an interactive job through `Open OnDemand`, selecting the `pascalnodes` partitions will automatically request access to one GPU as well. There is currently no way to change the number of GPUs for OOD interactive jobs.