diff --git a/docs/cheaha/hardware.md b/docs/cheaha/hardware.md
index 3df21f8b5..a8bc0b4fd 100644
--- a/docs/cheaha/hardware.md
+++ b/docs/cheaha/hardware.md
@@ -27,6 +27,8 @@ Examples of how to make use of the table:

The full table can be downloaded [here](./res/hardware_summary_cheaha.csv).

+Information about GPU efficiency can be found at [Making the Most of GPUs](./slurm/gpu.md#making-the-most-of-gpus).
+
### Details

Detailed hardware information, including processor and GPU makes and models, core clock frequencies, and other information for current hardware are in the table below.
diff --git a/docs/cheaha/job_efficiency.md b/docs/cheaha/job_efficiency.md
index 9f22a3c8c..ffd53808c 100644
--- a/docs/cheaha/job_efficiency.md
+++ b/docs/cheaha/job_efficiency.md
@@ -41,7 +41,9 @@ Questions to ask yourself before requesting resources:

1. How is the software I'm using programmed?

-    - Can it use a GPU? Request one.
+    - Can it use a GPU? Request one, and don't forget to consider the following (a combined sketch follows this list).
+        - [Local Scratch](../data_management/cheaha_storage_gpfs/index.md#local-scratch) for [IO performance](../cheaha/slurm/gpu.md#ensuring-io-performance-with-a100-gpus).
+        - `--ntasks-per-socket` when using [Multiple GPUs](../cheaha/slurm/gpu.md#using-multiple-gpus).
    - Can it use multiple cores? Request more than one core.
    - Is it single-threaded? Request only one core.
    - Does it use MPI? Request multiple nodes.
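+
+A minimal sketch combining these considerations is below, assuming a two-GPU `amperenodes` job. The resource amounts and the `$USER_DATA/my_inputs` input path are illustrative placeholders, not a tested configuration; adjust them for your own work.
+
+```bash
+#!/bin/bash
+#SBATCH --partition=amperenodes   # GPU partition (placeholder choice)
+#SBATCH --gres=gpu:2              # two GPUs
+#SBATCH --ntasks=2                # one task per GPU
+#SBATCH --ntasks-per-socket=1     # distribute tasks across sockets
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=16G
+#SBATCH --time=01:00:00
+
+# Stage inputs on local scratch for IO performance (placeholder input path)
+mkdir -p "/local/$USER/$SLURM_JOB_ID"
+cp -r "$USER_DATA/my_inputs" "/local/$USER/$SLURM_JOB_ID/"
+```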
diff --git a/docs/cheaha/open_ondemand/ood_jupyter.md b/docs/cheaha/open_ondemand/ood_jupyter.md
index 129478227..3c64ca1b6 100644
--- a/docs/cheaha/open_ondemand/ood_jupyter.md
+++ b/docs/cheaha/open_ondemand/ood_jupyter.md
@@ -32,6 +32,14 @@ For information on partition and GPU selection, please review our [hardware info

The latest CUDA and cuDNN are now available from [Conda](../slurm/gpu.md#cuda-and-cudnn-modules).

+For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This means that selecting multiple GPUs may result in some GPUs going unused.
+
## Extra Jupyter Arguments

The `Extra Jupyter Arguments` field allows you to pass additional arguments to the Jupyter Server as it is being started. It can be helpful to point the server to the folder containing your notebook. To do this, assuming your notebooks are stored in `/data/user/$USER`, also known as `$USER_DATA`, put `--notebook-dir=$USER_DATA` in this field. You will be able to navigate to the notebook if it is in a subdirectory of `notebook-dir`, but you won't be able to navigate to any other directories. An example is shown below.
diff --git a/docs/cheaha/open_ondemand/ood_layout.md b/docs/cheaha/open_ondemand/ood_layout.md
index f46159a92..cf8cfdd8b 100644
--- a/docs/cheaha/open_ondemand/ood_layout.md
+++ b/docs/cheaha/open_ondemand/ood_layout.md
@@ -94,6 +94,14 @@ The interactive apps have the following fields to customize the resources for yo

Every interactive app has resources only allocated on a single node, and resources are shared among all processes running in the app. Make sure the amount of memory you request is less than or equal to the max amount per node for the partition you choose. We have a table with [memory available per node](../hardware.md#cheaha-hpc-cluster) for each partition.

+For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This means that selecting multiple GPUs may result in some GPUs going unused.
+
#### Environment Setup Window

In addition to requesting general resources, for some apps you will have the option to add commands to be run during job startup in an Environment Setup Window. See below for an example showing how to load CUDA into a Jupyter job so it can use a GPU.
diff --git a/docs/cheaha/open_ondemand/ood_matlab.md b/docs/cheaha/open_ondemand/ood_matlab.md
index 21c8bff1d..31f5f6073 100644
--- a/docs/cheaha/open_ondemand/ood_matlab.md
+++ b/docs/cheaha/open_ondemand/ood_matlab.md
@@ -35,6 +35,14 @@ You may optionally verify that Python works correctly by entering `py.list(["hel

Please see the [MATLAB Section on our GPU Page](../slurm/gpu.md#matlab).

+For more information on GPU efficiency, please see [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus).
+
+!!! important
+
+    April 21, 2025: Currently, GPU-core affinity is not considered for GPU jobs on interactive apps. This means that selecting multiple GPUs may result in some GPUs going unused.
+
## Known Issues

There is a known issue with `parpool` and other related multi-core parallel features such as `parfor` affecting R2022a and earlier. See our [Modules Known Issues section](../software/modules.md#matlab-issues) for more information.
diff --git a/docs/cheaha/res/job_submit_flags.csv b/docs/cheaha/res/job_submit_flags.csv
index 42d017691..fb1d5d141 100644
--- a/docs/cheaha/res/job_submit_flags.csv
+++ b/docs/cheaha/res/job_submit_flags.csv
@@ -7,6 +7,7 @@ Flag,Short,Environment Variable,Description,sbatch,srun
`--time`,`-t`,`SBATCH_TIMELIMIT`,Maximum allowed runtime of job. Allowed formats below.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_time),[srun](https://slurm.schedmd.com/srun.html#OPT_time)
`--nodes`,`-N`,,Number of nodes needed. Set to `1` if your software does not use MPI or if unsure.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_nodes),[srun](https://slurm.schedmd.com/srun.html#OPT_nodes)
`--ntasks`,`-n`,`SLURM_NTASKS`,Number of tasks planned per node. Mostly used for bookkeeping and calculating total cpus per node. If unsure set to `1`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks)
+`--ntasks-per-socket`,,,"Number of tasks per socket. Required for multiple GPU jobs, [details here](../slurm/gpu.md#using-multiple-gpus).",[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_ntasks-per-socket),[srun](https://slurm.schedmd.com/srun.html#OPT_ntasks-per-socket)
`--cpus-per-task`,`-c`,`SLURM_CPUS_PER_TASK`,Number of needed cores per task. Cores per node equals `-n` times `-c`.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_cpus-per-task),[srun](https://slurm.schedmd.com/srun.html#OPT_cpus-per-task)
,,`SLURM_CPUS_ON_NODE`,Number of cpus available on this node.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CPUS_ON_NODE),[srun](https://slurm.schedmd.com/srun.html#OPT_SLURM_CPUS_ON_NODE)
`--mem`,,`SLURM_MEM_PER_NODE`,Amount of RAM needed per node in MB. Can specify 16 GB using 16384 or 16G.,[sbatch](https://slurm.schedmd.com/sbatch.html#OPT_mem),[srun](https://slurm.schedmd.com/srun.html#OPT_mem)
diff --git a/docs/cheaha/slurm/gpu.md b/docs/cheaha/slurm/gpu.md
index a34b69b14..ca98aba4c 100644
--- a/docs/cheaha/slurm/gpu.md
+++ b/docs/cheaha/slurm/gpu.md
@@ -21,12 +21,77 @@ When requesting a job using `sbatch`, you will need to include the Slurm flag `-

-## Ensuring IO Performance With A100 GPUs
+### Making the Most of GPUs
+
+#### Ensuring IO Performance With A100 GPUs

If you are using `amperenodes` and the A100 GPUs, then it is highly recommended to move your input files to the [local scratch](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) at `/local/$USER/$SLURM_JOB_ID` prior to running your workflow, to ensure adequate GPU performance. Network file mounts, such as `$USER_SCRATCH`, `/scratch/`, `/data/user/` and `/data/project/`, do not have sufficient bandwidth to keep the GPU busy. So, your processing pipeline will slow down to network speeds, instead of GPU speeds.

Please see our [Local Scratch Storage section](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) for more details and an example script.

+#### Using Multiple GPUs
+
+!!! note
+
+    To effectively use multiple GPUs per node, you'll need to do some light arithmetic to keep tasks, sockets, and GPUs consistent. Please be mindful.
+
+When using multiple GPUs on the `amperenodes*` or `pascalnodes*` partitions, an additional Slurm directive is required to ensure the GPUs can all be put to use by your research software: `--ntasks-per-socket`. You will need to explicitly set the `--ntasks` directive to an integer multiple of the number of GPUs in `--gres=gpu`, then set `--ntasks-per-socket` to the multiplier.
+
+Most researchers, in most scenarios, should find the following examples to be sufficient. It is very important to note that `--ntasks-per-socket` times `--gres=gpu` equals `--ntasks` (in the examples below, 1 times 4 equals 4, and 1 times 2 equals 2). You will need to supply other directives, as usual, remembering that the total number of CPUs equals `--cpus-per-task` times `--ntasks`, and that the total number of CPUs per node cannot exceed the actual number of physical cores on the node, and cannot exceed any quotas for the partition. See [Hardware](../hardware.md#cheaha-hpc-cluster) for more information about hardware and quota limits on Cheaha.
+
+Pascalnodes:
+
+```bash
+#SBATCH --partition=pascalnodes # up to 28 cpus per node
+#SBATCH --ntasks-per-socket=1
+#SBATCH --gres=gpu:4
+#SBATCH --ntasks=4
+#SBATCH --cpus-per-task=7 # 7 cpus-per-task times 4 tasks = 28 cpus
+```
+
+Amperenodes:
+
+```bash
+#SBATCH --partition=amperenodes # up to 64 cpus per job
+#SBATCH --ntasks-per-socket=1
+#SBATCH --gres=gpu:2
+#SBATCH --ntasks=2
+#SBATCH --cpus-per-task=32 # 32 cpus-per-task times 2 tasks = 64 cpus
+```
+
+If `--ntasks-per-socket` is not used, or used incorrectly, it is possible that some of the GPUs requested may go unused, reducing performance and increasing job runtimes. For more information, please read the [GPU-Core Affinity Details](#gpu-core-affinity-details) below.
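+
+To verify that the requested GPUs are actually visible to each task, you can optionally add a quick check to your job script. The following is a minimal sketch using the standard `nvidia-smi -L` listing and the `SLURM_PROCID` task variable; the exact output depends on your allocation.
+
+```bash
+# Run once per task; each task reports the GPUs it can see
+srun bash -c 'echo "task $SLURM_PROCID on $(hostname):"; nvidia-smi -L'
+```
+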
+##### GPU-Core Affinity Details
+
+Important terminology:
+
+- **[node](https://en.wikipedia.org/wiki/Node_(networking)#Distributed_systems)**: A single computer in a cluster.
+- **[mainboard](https://en.wikipedia.org/wiki/Motherboard)**: The central circuit board of a computer, where the components of the node all integrate.
+- **[CPU](https://en.wikipedia.org/wiki/Central_processing_unit)**: Central Processing Unit, where general calculations are performed during operation of the node. Often contains multiple cores. Sometimes conflated with "core". We use the term "CPU die" in this section to avoid ambiguity.
+- **[socket](https://en.wikipedia.org/wiki/CPU_socket)**: A connector on the mainboard for electrical connection to a CPU die. Some mainboards have a single socket, others have multiple sockets.
+- **[core](https://en.wikipedia.org/wiki/Processor_core)**: A single physical processor of computer instructions. One core can carry out one computation at a time. Part of a CPU. Also called "processor core". Sometimes conflated with "CPU".
+- **[GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit)**: Graphics Processing Unit, trades off generalized computing for faster computation with a limited set of operations. Often used for AI processing. Contains many specialized cores. Increasingly called "accelerator" in the context of clusters and high-performance computing (HPC).
+
+Nodes in both the `amperenodes*` and `pascalnodes*` partitions are configured as follows:
+
+- Each node has a single mainboard.
+- Each mainboard has two sockets.
+- Each socket has a single CPU die.
+- Each CPU die has multiple cores:
+    - `amperenodes*`: 64 cores per CPU die (128 per node)
+    - `pascalnodes*`: 14 cores per CPU die (28 per node)
+- Each socket is connected with a subset of the GPUs:
+    - `amperenodes*`: 1 GPU per socket (2 per mainboard)
+    - `pascalnodes*`: 2 GPUs per socket (4 per mainboard)
+
+Communication between each socket and its connected GPUs is relatively fast. Communication between GPUs connected to different sockets is much slower, so we want to make sure that Slurm knows which cores in each socket are associated with each GPU, to allow for optimal performance of applications. The association between cores and GPUs is called "GPU-core affinity". Slurm is made explicitly aware of GPU-core affinity in the file located at `/etc/slurm/gres.conf`.
+
+When a researcher submits an sbatch script, the use of `--ntasks-per-socket` informs Slurm that tasks should be distributed across sockets, rather than the default behavior of "first available". Often, the default behavior results in all cores being allocated from a single socket, leaving some of the GPUs unavailable to your software, or with lower than expected performance.
+
+To ensure optimal performance, always use the `--ntasks-per-socket` directive when requesting multiple GPUs.
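+
+If you are curious about the affinity on a particular node, you can inspect it from within a job running there. The standard `nvidia-smi topo -m` command prints the GPU/CPU topology matrix, including the CPU cores associated with each GPU.
+
+```bash
+# On a GPU node: show GPU/CPU topology and per-GPU CPU affinity
+nvidia-smi topo -m
+```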
+
### Open OnDemand

When requesting an interactive job through `Open OnDemand`, selecting the `pascalnodes` partitions will automatically request access to one GPU as well. There is currently no way to change the number of GPUs for OOD interactive jobs.
diff --git a/docs/cheaha/slurm/slurm_tutorial.md b/docs/cheaha/slurm/slurm_tutorial.md
index b6412b203..9734fec75 100644
--- a/docs/cheaha/slurm/slurm_tutorial.md
+++ b/docs/cheaha/slurm/slurm_tutorial.md
@@ -34,6 +34,12 @@ If you're new to using Unix/Linux commands and bash scripting, we suggest going

## Slurm Batch Job User Guide

+!!! important
+
+    All parts of the tutorials here should be run in a job context, instead of on the login node. If you are new to Cheaha, the simplest way to get started is to use an [Open OnDemand HPC Desktop Job](../open_ondemand/hpc_desktop.md).
+
This user guide provides comprehensive insight into different types of batch jobs, helping you identify the most suitable job type for your specific tasks. With clear explanations and practical examples, you will gain a deeper understanding of sequential, parallel, array, multicore, GPU, and multi-node jobs, helping you make informed decisions when submitting jobs on the Cheaha system.

1. [A Simple Slurm Batch Job](#example-1-a-simple-slurm-batch-job) is ideal for Cheaha users who are just starting with Slurm batch job submission. It uses a simple example to introduce new users to requesting resources with `sbatch`, printing the `hostname`, and monitoring batch job submission.
@@ -46,7 +52,7 @@ This user guide provides comprehensive insight into different types of batch job

1. [Multithreaded or Multicore Job](#example-5-multithreaded-or-multicore-job) is used when software inherently supports multithreaded parallelism, i.e., running independent tasks simultaneously on multicore processors. For instance, numerous software packages such as [MATLAB](https://www.mathworks.com/help/matlab/ref/parfor.html), [FEBio](https://help.febio.org/FebioUser/FEBio_um_3-4-Section-2.6.html), and [Xplor-NIH](https://nmr.cit.nih.gov/xplor-nih/doc/current/helperPrograms/options.html) support running multiple tasks at the same time on multicore processors. Users or programmers do not need to modify the code; you can simply enable multithreaded parallelism by configuring the appropriate options.

-1. [GPU Job](#example-6-gpu-job) utilizes the parallel GPUs, which contain numerous cores designed to perform the same mathematical operations simultaneously. GPU job is appropriate for pipelines and software that are designed to run on GPU-based systems and efficiently distribute tasks across cores to process large datasets in parallel. Example includes [Tensorflow](https://www.tensorflow.org/guide/gpu), [Parabricks](../../education/case_studies.md), [PyTorch](https://pytorch.org/tutorials/prototype/ios_gpu_workflow.html#prototype-use-ios-gpu-in-pytorch), etc.
+1. [GPU Jobs](#example-6-gpu-jobs) utilize GPUs, which contain numerous cores designed to perform the same mathematical operations simultaneously. GPU jobs are appropriate for pipelines and software designed to run on GPU-based systems, efficiently distributing tasks across cores to process large datasets in parallel. Examples include [Tensorflow](https://www.tensorflow.org/guide/gpu), [Parabricks](../../education/case_studies.md), [PyTorch](https://pytorch.org/tutorials/prototype/ios_gpu_workflow.html#prototype-use-ios-gpu-in-pytorch), etc.

1. [Multinode Job](#example-7-multinode-job) is for pipelines/software that can be distributed and run across multiple nodes. For example, MPI-based applications/tools such as [Quantum Espresso](https://www.quantum-espresso.org/Doc/user_guide/node20.html), [Amber](https://usc-rc.github.io/tutorials/amber), [LAMMPS](https://docs.lammps.org/Run_basics.html), etc.

@@ -408,36 +414,38 @@ $ sacct -j 27105035

27105035.ex+     extern       USER          4  COMPLETED      0:0
```

-### Example 6: GPU Job
+### Example 6: GPU Jobs

-This slurm script shows the execution of Tensorflow job using GPU resources. Let us save this script as `gpu.job`. The Slurm parameter `--gres=gpu:2` in line 6, requests for 2 GPUs. In line 8, note that in order to run GPU-based jobs, either the `amperenodes` or `pascalnodes` partition must be used (please refer to our [GPU page](../slurm/gpu.md) for more information). Lines 14-15 loads the necessary CUDA modules, while lines 18-19 load the Anaconda module and activate a `conda` environment called `tensorflow`. Refer to [Tensorflow official page](https://www.tensorflow.org/) for installation. The last line executes a python script that utilizes Tensorflow library to perform matrix multiplication across multiple GPUs.
+GPUs are a resource for speeding up computation in many scientific domains, so understanding how to use them effectively is important for accelerating scientific discovery. Always make sure you know your software's capabilities. Not all software can take advantage of GPUs, or multiple GPUs. Even if it can, be sure you understand what information or parameters you will need to supply to your software.

-```bash linenums="1"
-#!/bin/bash
-#SBATCH --job-name=gpu ### Name of the job
-#SBATCH --nodes=1 ### Number of Nodes
-#SBATCH --ntasks=1 ### Number of Tasks
-#SBATCH --cpus-per-task=1 ### Number of Tasks per CPU
-#SBATCH --gres=gpu:2 ### Number of GPUs, 2 GPUs
-#SBATCH --mem=16G ### Memory required, 16 gigabyte
-#SBATCH --partition=amperenodes ### Cheaha Partition
-#SBATCH --time=01:00:00 ### Estimated Time of Completion, 1 hour
-#SBATCH --output=%x_%j.out ### Slurm Output file, %x is job name, %j is job id
-#SBATCH --error=%x_%j.err ### Slurm Error file, %x is job name, %j is job id

+In this section there are two tutorials that show how to use (a) a single GPU, and (b) multiple GPUs. Before we get started with the specifics, we need a working directory and software to work with. Our software will be a short script performing some low-level tensor operations with Tensorflow. It is programmed to take advantage of multiple GPUs automatically, to put the focus on the job scripts and the GPUs, rather than on the software used.

-### Loading the required CUDA and cuDNN modules
-module load CUDA/12.2.0
-module load cuDNN/8.9.2.26-CUDA-12.2.0
+
+!!! note
-### Loading the Anaconda module and activating the `tensorflow` environment
-module load Anaconda3
-conda activate tensorflow
+
+    For real applications, especially AI and other large-data applications, we recommend pre-loading data onto [Local Scratch](../../data_management/cheaha_storage_gpfs/index.md#local-scratch) to [ensure good performance](../slurm/gpu.md#ensuring-io-performance-with-a100-gpus). Don't worry about doing this for the current tutorial, but do make a note of it for your own scientific work. The difference in performance is huge.
+
-### Executing the python script
-python matmul_tensorflow.py
+#### Initial Setup
+
+Let's create a working directory using [shell commands](../../workflow_solutions/shell.md).
+
+```bash
+mkdir -p ~/slurm_tutorials/example_6
```
-Let us now create a file named `matmul_tensorflow.py` and copy the following script into it. This python script demonstrates the utilization of Tensorflow library to distribute computational tasks among multiple GPUs, in order to perform matrix multiplication in parallel (Lines 11-19). Lines 8-9 retrieve the logical GPUs and enable device placement logging, which helps to analyze which device is used for each operation. The final results are aggregated and the sum is computed on the CPU device (lines 22-23).
+Navigate to the working directory to prepare for the following steps. All of the following steps will take place in this directory.
+
+```bash
+cd ~/slurm_tutorials/example_6
+```
+
+Let us create a file named `matmul_tensorflow.py` and copy the script below into it to prepare for the tutorials. You are welcome to use your favorite text editor. On Cheaha, there are two built-in options, listed just before the script below.
+
+- At any terminal on Cheaha, use `nano`. Type [`nano matmul_tensorflow.py` at the terminal](../../workflow_solutions/shell.md#edit-plain-text-files-nano) to create and start editing the file.
+- In an HPC Desktop job terminal, type `gedit matmul_tensorflow.py` to create the file and open a graphical editor.
+
+Below is the script to copy into the new file.

```bash linenums="1"
import tensorflow as tf
@@ -468,7 +476,130 @@ if gpus:
    print(matmul_sum)
```

-The results indicate that the Tensorflow version utilized is 2.15. The segments `/device:GPU:0` and `/device:GPU:1` specify that the computations were executed on two GPUs. The final results is a 4x4 matrix obtained by summing the matrix multiplication results. In the `sacct` report, the column `AllocGRES` shows that 2 GPUs are allocated for this job.
+We will also need to set up a [Conda environment](../software/software.md#anaconda-on-cheaha) suitable for executing this Tensorflow-based code. Please do not try to install Pip packages outside of a Conda environment, as it can result in [hard-to-diagnose errors](../../workflow_solutions/using_anaconda.md). Copy the following into a file `environment.yml`.
+
+```yaml
+name: tensorflow
+dependencies:
+  - conda-forge::pip==25.0.1
+  - conda-forge::python==3.11.0
+  - pip:
+      - tensorflow==2.15.0
+```
+
+To create the environment, run the following commands. This is a one-time setup for this tutorial. Please see our [Module page](../software/modules.md) and our [Conda page](../software/software.md#anaconda-on-cheaha) for more information about each.
+
+```bash
+module load Anaconda3
+conda env create --file environment.yml
+```
+
+Each time you start a new session and want to use the environment, you'll need to use the following commands to activate it. This should be done before moving on to the two GPU tutorials below.
+
+```bash
+module load Anaconda3 # unless it is already loaded in this session
+conda activate tensorflow
+```
+
+#### Example 6a: Single GPU Job
+
+The following Slurm script can be used to run our script with a single GPU. The Slurm parameter `--gres=gpu:1` in line 6 requests the GPU. In line 8, note that in order to run GPU-based jobs, either the `amperenodes` or `pascalnodes` partition must be used (please refer to our [GPU page](../slurm/gpu.md) for more information). Lines 16-17 load the required CUDA and cuDNN modules, while lines 20-21 load the Anaconda module and activate a Conda environment called `tensorflow`. The last line executes the python script from the introduction.
+
+As before, copy this script to a new file `gpu-single.job`.
+
+```bash linenums="1"
+#!/bin/bash
+#SBATCH --job-name=gpu ### Name of the job
+#SBATCH --nodes=1 ### Number of Nodes
+#SBATCH --ntasks=1 ### Number of Tasks
+#SBATCH --cpus-per-task=1 ### Number of CPUs per Task
+#SBATCH --gres=gpu:1 ### Number of GPUs
+#SBATCH --mem=16G ### Memory required, 16 gigabyte
+#SBATCH --partition=amperenodes ### Cheaha Partition
+#SBATCH --time=01:00:00 ### Requested Time, 1 hour
+
+# Slurm Output and Error files, %x is job name, %j is job id
+#SBATCH --output=%x_%j.out
+#SBATCH --error=%x_%j.err
+
+### Loading the required CUDA and cuDNN modules
+module load CUDA/12.2.0
+module load cuDNN/8.9.2.26-CUDA-12.2.0
+
+### Loading the Anaconda module and activating the `tensorflow` environment
+module load Anaconda3
+conda activate tensorflow
+
+### Executing the python script
+python matmul_tensorflow.py
+```
+
+To submit the job, use the following command from within your working directory.
+
+```bash
+sbatch gpu-single.job
+```
+
+When the job has completed, check the results using `cat` to read the Slurm output log. The results indicate that the Tensorflow version used is 2.15. The segment `/device:GPU:0` specifies which GPU the computation was executed on. The final result is a 4x4 matrix obtained by summing the matrix multiplication results. Note that the name of your output file will have a different job ID number.
+
+```bash
+$ cat gpu_27107693.out
+
+TensorFlow version: 2.15.0
+Num GPUs Available: 1
+Computation on GPU: /device:GPU:0
+tf.Tensor(
+[[0.7417870 0.436646 0.0565315 0.5258054]
+ [0.7313270 0.8445346 0.885784 0.0902905]
+ [1.176963 0.9857005 1.9687731 0.6279962]
+ [1.2957641 0.9410924 0.4280013 0.2470699]], shape=(4, 4), dtype=float32)
+```
+
+#### Example 6b: Multiple GPU Job
+
+Using multiple GPUs is very similar to the single GPU job, with a couple of small, but important, changes. You must also be sure that your software is able to take advantage of multiple GPUs. Some software is designed for single-GPU usage only and, in that case, requesting more GPUs wastes resources. In this tutorial we've already designed our software to take advantage of multiple GPUs automatically.
+
+First, we need to request two GPUs with `--gres=gpu:2`. We also need to instruct Slurm how to use the CPU cores that are assigned to each GPU, with `--ntasks-per-socket=1`. Finally, we need to instruct Slurm that we have two tasks, one for each socket, by using `--ntasks=2` instead of `1`. Much more detail is available in our [Using Multiple GPUs](../slurm/gpu.md#using-multiple-gpus) section.
+
+All of the other parts of our script can remain the same, because we programmed it with multiple GPU use in mind. That may not be the case for all software, so be sure to check its documentation.
+
+Let us save this script as `gpu-multiple.job`.
+
+```bash linenums="1"
+#!/bin/bash
+#SBATCH --job-name=gpu ### Name of the job
+#SBATCH --nodes=1 ### Number of Nodes
+#SBATCH --ntasks=2 ### DIFF Number of Tasks, one for each socket
+#SBATCH --ntasks-per-socket=1 ### NEW Number of Tasks per Socket
+#SBATCH --cpus-per-task=1 ### Number of CPUs per Task
+#SBATCH --gres=gpu:2 ### DIFF Number of GPUs, 2 GPUs
+#SBATCH --mem=16G ### Memory required, 16 gigabyte
+#SBATCH --partition=amperenodes ### Cheaha Partition
+#SBATCH --time=01:00:00 ### Requested Time, 1 hour
+
+# Slurm Output and Error files, %x is job name, %j is job id
+#SBATCH --output=%x_%j.out
+#SBATCH --error=%x_%j.err
+
+### Loading the required CUDA and cuDNN modules
+module load CUDA/12.2.0
+module load cuDNN/8.9.2.26-CUDA-12.2.0
+
+### Loading the Anaconda module and activating the `tensorflow` environment
+module load Anaconda3
+conda activate tensorflow
+
+### Executing the python script
+python matmul_tensorflow.py
+```
+
+We will use the same `matmul_tensorflow.py`, since we programmed it to take advantage of multiple GPUs. To submit the job, use the following command.
+
+```bash
+sbatch gpu-multiple.job
+```
+
+As before, the results indicate that the Tensorflow version used is 2.15. The segments `/device:GPU:0` and `/device:GPU:1` specify that the computations were executed on two GPUs. The final result is a 4x4 matrix obtained by summing the matrix multiplication results. In the `sacct` report, the column `AllocGRES` shows that 2 GPUs are allocated for this job.
+```bash
+$ cat gpu_27107694.out
diff --git a/docs/cheaha/slurm/submitting_jobs.md b/docs/cheaha/slurm/submitting_jobs.md
index a3931c9b0..14279a379 100644
--- a/docs/cheaha/slurm/submitting_jobs.md
+++ b/docs/cheaha/slurm/submitting_jobs.md
@@ -34,7 +34,9 @@ Please see [Cheaha Hardware](../hardware.md#summary) for more information. Remem

### Requesting GPUs

-Please see the [GPUs page](gpu.md) for more information.
+Please see the [GPUs page](gpu.md) for more information. Note that GPU jobs need special care when submitted, to maximize performance. See our [Making the Most of GPUs section](./gpu.md#making-the-most-of-gpus).
+
+See our [GPU Jobs Tutorial](./slurm_tutorial.md#example-6-gpu-jobs) for an introduction.

### Dynamic `--output` and `--error` File Names
diff --git a/docs/cheaha/tutorial/pytorch_tensorflow.md b/docs/cheaha/tutorial/pytorch_tensorflow.md
index f4b1df71d..28df5309c 100644
--- a/docs/cheaha/tutorial/pytorch_tensorflow.md
+++ b/docs/cheaha/tutorial/pytorch_tensorflow.md
@@ -8,6 +8,12 @@ The below tutorial would show you steps on how to create an Anaconda environmen

CUDA modules are used in this tutorial. Please note that the latest CUDA and cuDNN are now available from [Conda](../slurm/gpu.md#cuda-and-cudnn-modules). The tutorial provides good practices, but ages over time. You may need to modify the scripts to be suitable for your work.

+!!! note
+
+    Be mindful that there are special considerations when submitting GPU jobs to maximize performance. See [Making the Most of GPUs](../slurm/gpu.md#making-the-most-of-gpus) for more information. This is not necessary for the tutorial on this page, but may benefit your research computation.
+
## Installing Anaconda Environments Using the Terminal

To access the terminal (shell), please do the following.
diff --git a/docs/education/case_studies.md b/docs/education/case_studies.md
index f82821213..78ff6f954 100644
--- a/docs/education/case_studies.md
+++ b/docs/education/case_studies.md
@@ -12,6 +12,12 @@ For more information on Cheaha GPUs, please see our [GPU Page](../cheaha/slurm/g

CUDA modules are used in this case study. Please note that the latest CUDA and cuDNN are now available from [Conda](../cheaha/slurm/gpu.md#cuda-and-cudnn-modules).

+!!! note
+
+    Be mindful that there are special considerations when submitting GPU jobs to maximize performance. See [Making the Most of GPUs](../cheaha/slurm/gpu.md#making-the-most-of-gpus) for more information. This case study predates these recommendations and does not make use of them.
+
### Licensing Policy

A license is no longer required to use Clara Parabricks 4.x and later versions, and is free for the following groups,
diff --git a/docs/workflow_solutions/using_anaconda.md b/docs/workflow_solutions/using_anaconda.md
index c91e62991..cd7885429 100644
--- a/docs/workflow_solutions/using_anaconda.md
+++ b/docs/workflow_solutions/using_anaconda.md
@@ -125,6 +125,10 @@ For more information about using Anaconda with Jupyter, see the section [Working

For more information about finding CUDA and cuDNN packages for use with GPUs, see the section [CUDA and cuDNN Modules](../cheaha/slurm/gpu.md#cuda-and-cudnn-modules).

+#### Performance Considerations for GPUs
+
+See our [Making the Most of GPUs](../cheaha/slurm/gpu.md#making-the-most-of-gpus) section for more information about maximizing the performance of GPUs on Cheaha.
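+
+As a starting point, the sketch below shows the general shape of an environment file that pulls GPU libraries from Conda instead of modules. The package choices and unpinned versions are illustrative assumptions, not a tested configuration; check the [CUDA and cuDNN Modules](../cheaha/slurm/gpu.md#cuda-and-cudnn-modules) section for current guidance first.
+
+```yaml
+name: gpu-env
+dependencies:
+  - conda-forge::python=3.11
+  # GPU libraries from conda-forge; pin versions to match your framework
+  - conda-forge::cudatoolkit
+  - conda-forge::cudnn
+```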
+
### Update packages in an environment

To ensure packages and their dependencies are all up to date, it is a best practice to regularly update installed packages and libraries in your activated environment.
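+
+For example, with an environment activated, the following minimal sketch updates all of its Conda packages. The environment name `tensorflow` is a placeholder; substitute your own.
+
+```bash
+module load Anaconda3
+conda activate tensorflow # placeholder environment name
+conda update --all
+```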