diff --git a/micro-benchmarks/nvshmem/README.md b/micro-benchmarks/nvshmem/README.md
new file mode 100644
index 00000000..649bfe84
--- /dev/null
+++ b/micro-benchmarks/nvshmem/README.md
@@ -0,0 +1,159 @@
+# NVSHMEM
+
+NVSHMEM is NVIDIA's implementation of the OpenSHMEM [PGAS](https://en.wikipedia.org/wiki/Partitioned_global_address_space) model for GPU clusters. It provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. NVSHMEM can significantly reduce communication and coordination overheads by allowing these operations to be initiated from within CUDA kernels and on CUDA streams.
+
+One common use of NVSHMEM is to implement high-throughput, low-latency MoE dispatch and combine GPU kernels. [DeepEP](https://github.com/deepseek-ai/DeepEP) and [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) are examples of such implementations.
+
+This document is a guide to building NVSHMEM with NCCL and AWS EFA support and to running its performance tests. It reuses the NCCL Tests Docker image as a base image and adds NVSHMEM on top, since NVSHMEM is built against NCCL.
+
+### Building NCCL Tests Docker image
+
+For more details on how to build the NCCL Tests Docker image, please refer to the [NCCL Tests README](../nccl-tests/README.md).
+
+```bash
+GDRCOPY_VERSION=v2.4.4
+EFA_INSTALLER_VERSION=1.38.1
+AWS_OFI_NCCL_VERSION=v1.14.0
+NCCL_VERSION=v2.26.2-1
+NCCL_TESTS_VERSION=v2.14.1
+TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}"
+NCCL_CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}"
+```
+
+```bash
+docker build --progress=plain -f ../nccl-tests/nccl-tests.Dockerfile \
+    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
+    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
+    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
+    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
+    -t ${NCCL_CONTAINER_IMAGE_NAME_TAG} \
+    .
+```
+
+### Building NVSHMEM Docker image on top of NCCL Tests Docker base image
+
+```bash
+NVSHMEM_VERSION=3.2.5-1
+TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}-nvshmem${NVSHMEM_VERSION}"
+NVSHMEM_CONTAINER_IMAGE_NAME_TAG="nvshmem:${TAG}"
+```
+
+```bash
+docker build --progress=plain -f nvshmem.Dockerfile \
+    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
+    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
+    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
+    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
+    --build-arg="NVSHMEM_VERSION=${NVSHMEM_VERSION}" \
+    -t ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
+    .
+```
+
+### Slurm
+
+To run the NVSHMEM performance tests on Slurm, you will need to convert the container image into a squash file using Enroot. If you have the built image locally, use the following command:
+
+```bash
+enroot import -o ./nvshmem.sqsh dockerd://${NVSHMEM_CONTAINER_IMAGE_NAME_TAG}
+```
+
+# Perf Test
+
+NVSHMEM provides a rich set of performance tests for different operations, launched from both the device and the host.
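+
+For example, assuming the squash file built in the Slurm section above, a sweep of the `shmem_put_bw` test between two PEs could be launched with the common arguments listed below (a sketch only; exact flag support may vary between NVSHMEM versions):
+
+```bash
+# Sweep message sizes from 1 KiB to 64 MiB, doubling each step,
+# with 20 warmup iterations and 50 measured iterations per size.
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
+    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw -b 1024 -e 67108864 -f 2 -w 20 -n 50
+```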
+
+Common arguments:
+
+* `-b, --min_size` - Minimum message size in bytes
+* `-e, --max_size` - Maximum message size in bytes
+* `-f, --step` - Step factor for message sizes
+* `-n, --iters` - Number of iterations
+* `-w, --warmup_iters` - Number of warmup iterations
+* `-c, --ctas` - Number of CTAs to launch (used in some device pt-to-pt tests)
+* `-t, --threads_per_cta` - Number of threads per block (used in some device pt-to-pt tests)
+* `-d, --datatype` - Data type: int, int32_t, uint32_t, int64_t, uint64_t, long, longlong, ulonglong, size, ptrdiff, float, double, fp16, bf16
+* `-o, --reduce_op` - Reduction operation: min, max, sum, prod, and, or, xor
+* `-s, --scope` - Thread group scope: thread, warp, block, all
+* `-i, --stride` - Stride between elements
+* `-a, --atomic_op` - Atomic operation: inc, add, and, or, xor, set, swap, fetch_inc, fetch_add, fetch_and, fetch_or, fetch_xor, compare_swap
+* `--bidir` - Run bidirectional test
+* `--msgrate` - Report message rate (MMPS, millions of messages per second)
+* `--dir` - Direction (read/write) for put/get operations
+* `--issue` - Issue mode (on_stream/host) for some host pt-to-pt tests
+
+## Device
+
+### Collective
+
+Device collective tests are located in `/opt/nvshmem/bin/perftest/device/collective/`:
+
+- alltoall_latency
+- barrier_latency
+- bcast_latency
+- fcollect_latency
+- redmaxloc_latency
+- reducescatter_latency
+- reduction_latency
+- sync_latency
+
+### Point-to-Point
+
+Device point-to-point tests are located in `/opt/nvshmem/bin/perftest/device/pt-to-pt/`:
+
+- shmem_atomic_bw
+- shmem_atomic_latency
+- shmem_atomic_ping_pong_latency
+- shmem_g_bw
+- shmem_g_latency
+- shmem_get_bw
+- shmem_get_latency
+- shmem_p_bw
+- shmem_p_latency
+- shmem_p_ping_pong_latency
+- shmem_put_atomic_ping_pong_latency
+- shmem_put_bw
+- shmem_put_latency
+- shmem_put_ping_pong_latency
+- shmem_put_signal_ping_pong_latency
+- shmem_signal_ping_pong_latency
+- shmem_st_bw
+
+## Host
+
+### Collective
+
+Host collective tests are located in `/opt/nvshmem/bin/perftest/host/collective/`:
+
+- alltoall_on_stream
+- barrier_all_on_stream
+- barrier_on_stream
+- broadcast_on_stream
+- fcollect_on_stream
+- reducescatter_on_stream
+- reduction_on_stream
+- sync_all_on_stream
+
+### Point-to-Point
+
+Host point-to-point tests are located in `/opt/nvshmem/bin/perftest/host/pt-to-pt/`:
+
+- bw
+- latency
+- stream_latency
+
+### Example of running the shmem_put_bw benchmark on 2 GPUs on a single node and on 2 GPUs across two nodes
+
+The shmem_put_bw benchmark requires exactly 2 processing elements (PEs), so there are two options.
+
+Benchmark 2 GPUs on a single node over NVLink:
+
+```bash
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
+```
+
+Benchmark 2 GPUs on two different nodes over AWS EFA:
+
+```bash
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
+```
diff --git a/micro-benchmarks/nvshmem/nvshmem.Dockerfile b/micro-benchmarks/nvshmem/nvshmem.Dockerfile
new file mode 100644
index 00000000..ff7d26e6
--- /dev/null
+++ b/micro-benchmarks/nvshmem/nvshmem.Dockerfile
@@ -0,0 +1,61 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
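+#
+# Builds NVSHMEM from source on top of the NCCL Tests base image, enabling GDRCopy,
+# NCCL, libfabric (EFA), MPI, and PMIx support so that the bundled performance tests
+# can run over AWS EFA under Slurm (PMIx) or mpirun.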
+ARG GDRCOPY_VERSION=v2.4.1
+ARG EFA_INSTALLER_VERSION=1.37.0
+ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
+ARG NCCL_VERSION=v2.23.4-1
+ARG NCCL_TESTS_VERSION=v2.13.10
+
+FROM nccl-tests:efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}
+
+ARG NVSHMEM_VERSION=3.2.5-1
+
+ENV NVSHMEM_DIR=/opt/nvshmem
+ENV NVSHMEM_HOME=/opt/nvshmem
+
+RUN curl -L https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_${NVSHMEM_VERSION}.txz -o /nvshmem_src_${NVSHMEM_VERSION}.txz \
+    && tar -xf /nvshmem_src_${NVSHMEM_VERSION}.txz -C / \
+    && cd /nvshmem_src \
+    && mkdir -p build \
+    && cd build \
+    && cmake \
+        -DNVSHMEM_PREFIX=/opt/nvshmem \
+        -DCMAKE_INSTALL_PREFIX=/opt/nvshmem \
+        \
+        -DCUDA_HOME=/usr/local/cuda \
+        -DCMAKE_CUDA_ARCHITECTURES=90a \
+        \
+        -DNVSHMEM_USE_GDRCOPY=1 \
+        -DGDRCOPY_HOME=/opt/gdrcopy \
+        \
+        -DNVSHMEM_USE_NCCL=1 \
+        -DNCCL_HOME=/opt/nccl/build \
+        -DNCCL_INCLUDE=/opt/nccl/build/include \
+        \
+        -DNVSHMEM_LIBFABRIC_SUPPORT=1 \
+        -DLIBFABRIC_HOME=/opt/amazon/efa \
+        \
+        -DNVSHMEM_MPI_SUPPORT=1 \
+        -DMPI_HOME=/opt/amazon/openmpi \
+        \
+        -DNVSHMEM_PMIX_SUPPORT=1 \
+        -DPMIX_HOME=/opt/amazon/pmix \
+        -DNVSHMEM_DEFAULT_PMIX=1 \
+        \
+        -DNVSHMEM_BUILD_TESTS=1 \
+        -DNVSHMEM_BUILD_EXAMPLES=1 \
+        -DNVSHMEM_BUILD_HYDRA_LAUNCHER=1 \
+        -DNVSHMEM_BUILD_TXZ_PACKAGE=1 \
+        \
+        -DNVSHMEM_IBRC_SUPPORT=1 \
+        -DNVSHMEM_IBGDA_SUPPORT=1 \
+        \
+        -DNVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
+        \
+        -DNVSHMEM_DEBUG=WARN \
+        -DNVSHMEM_TRACE=1 \
+        .. \
+    && make -j$(nproc) \
+    && make install
+
+ENV PATH=/opt/nvshmem/bin:$PATH LD_LIBRARY_PATH=/opt/amazon/pmix/lib:/opt/nvshmem/lib:$LD_LIBRARY_PATH NVSHMEM_REMOTE_TRANSPORT=libfabric NVSHMEM_LIBFABRIC_PROVIDER=efa
diff --git a/micro-benchmarks/nvshmem/slurm/alltoall_latency.sbatch b/micro-benchmarks/nvshmem/slurm/alltoall_latency.sbatch
new file mode 100644
index 00000000..8b534274
--- /dev/null
+++ b/micro-benchmarks/nvshmem/slurm/alltoall_latency.sbatch
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+#SBATCH --job-name=all2all # name of your job
+#SBATCH --nodes=1 # number of nodes to use
+#SBATCH --ntasks-per-node 8 # number of PEs (one per GPU) per node
+###SBATCH --gpus-per-node=8 # number of GPUs we reserve. Uncomment for AWS ParallelCluster
+#SBATCH --output %x_%j.out
+#SBATCH --error %x_%j.err
+#SBATCH --exclusive
+#SBATCH --wait-all-nodes=1
+
+### Disable hyperthreading by setting the tasks per core to 1
+#SBATCH --ntasks-per-core=1
+
+###########################
+###### User Variables #####
+###########################
+
+
+# default variables for Enroot
+: "${APPS_PATH:=/fsx}"
+: "${NCCL_TESTS_PATH:=/opt/nccl-tests/build}"
+: "${IMAGE:=$APPS_PATH/nccl-tests.sqsh}"
+
+## Set libfabric flags to use EFA
+export FI_PROVIDER=efa
+export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
+export FI_EFA_FORK_SAFE=1
+
+## Set this flag for debugging EFA
+#export FI_LOG_LEVEL=warn
+
+## NCCL Environment variables
+export NCCL_DEBUG=INFO
+
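+## NVSHMEM environment variables
+## NVSHMEM_REMOTE_TRANSPORT=libfabric and NVSHMEM_LIBFABRIC_PROVIDER=efa are already set
+## in the container image (see nvshmem.Dockerfile); uncomment below to override them.
+#export NVSHMEM_REMOTE_TRANSPORT=libfabric
+#export NVSHMEM_LIBFABRIC_PROVIDER=efa
+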
+### Increase the send queue depth, which can make NCCL communications non-blocking.
+### https://www.usenix.org/system/files/atc23-choi.pdf
+export NCCL_BUFFSIZE=8388608
+### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
+### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
+export NCCL_P2P_NET_CHUNKSIZE=524288
+
+### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
+### message size and number of ranks.
+### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
+export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
+
+# Get Hostname and Instance IDs
+mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
+
+# Run alltoall_latency benchmark
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh /opt/nvshmem/bin/perftest/device/coll/alltoall_latency
diff --git a/micro-benchmarks/nvshmem/slurm/shmem_put_bw_internode.sbatch b/micro-benchmarks/nvshmem/slurm/shmem_put_bw_internode.sbatch
new file mode 100644
index 00000000..0bb488c1
--- /dev/null
+++ b/micro-benchmarks/nvshmem/slurm/shmem_put_bw_internode.sbatch
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+#SBATCH --job-name=put_bw # name of your job
+#SBATCH --nodes=2 # number of nodes to use
+#SBATCH --ntasks-per-node 1 # number of PEs (one per GPU) per node
+###SBATCH --gpus-per-node=8 # number of GPUs we reserve. Uncomment for AWS ParallelCluster
+#SBATCH --output %x_%j.out
+#SBATCH --error %x_%j.err
+#SBATCH --exclusive
+#SBATCH --wait-all-nodes=1
+
+### Disable hyperthreading by setting the tasks per core to 1
+#SBATCH --ntasks-per-core=1
+
+###########################
+###### User Variables #####
+###########################
+
+
+# default variables for Enroot
+: "${APPS_PATH:=/fsx}"
+: "${NCCL_TESTS_PATH:=/opt/nccl-tests/build}"
+: "${IMAGE:=$APPS_PATH/nccl-tests.sqsh}"
+
+## Set libfabric flags to use EFA
+export FI_PROVIDER=efa
+export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
+export FI_EFA_FORK_SAFE=1
+
+## Set this flag for debugging EFA
+#export FI_LOG_LEVEL=warn
+
+## NCCL Environment variables
+export NCCL_DEBUG=INFO
+
+### Increase the send queue depth, which can make NCCL communications non-blocking.
+### https://www.usenix.org/system/files/atc23-choi.pdf
+export NCCL_BUFFSIZE=8388608
+### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
+### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
+export NCCL_P2P_NET_CHUNKSIZE=524288
+
+### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
+### message size and number of ranks.
+### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
+export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
+
+# Get Hostname and Instance IDs
+mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
+
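+# With 1 PE per node on 2 nodes, shmem_put_bw traffic goes inter-node over AWS EFA
+# (NVSHMEM libfabric transport, as configured in the container image).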
+# Run shmem_put_bw benchmark
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
diff --git a/micro-benchmarks/nvshmem/slurm/shmem_put_bw_intranode.sbatch b/micro-benchmarks/nvshmem/slurm/shmem_put_bw_intranode.sbatch
new file mode 100644
index 00000000..c94104e6
--- /dev/null
+++ b/micro-benchmarks/nvshmem/slurm/shmem_put_bw_intranode.sbatch
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+#SBATCH --job-name=put_bw # name of your job
+#SBATCH --nodes=1 # number of nodes to use
+#SBATCH --ntasks-per-node 2 # number of PEs (one per GPU) per node
+###SBATCH --gpus-per-node=8 # number of GPUs we reserve. Uncomment for AWS ParallelCluster
+#SBATCH --output %x_%j.out
+#SBATCH --error %x_%j.err
+#SBATCH --exclusive
+#SBATCH --wait-all-nodes=1
+
+### Disable hyperthreading by setting the tasks per core to 1
+#SBATCH --ntasks-per-core=1
+
+###########################
+###### User Variables #####
+###########################
+
+
+# default variables for Enroot
+: "${APPS_PATH:=/fsx}"
+: "${NCCL_TESTS_PATH:=/opt/nccl-tests/build}"
+: "${IMAGE:=$APPS_PATH/nccl-tests.sqsh}"
+
+## Set libfabric flags to use EFA
+export FI_PROVIDER=efa
+export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
+export FI_EFA_FORK_SAFE=1
+
+## Set this flag for debugging EFA
+#export FI_LOG_LEVEL=warn
+
+## NCCL Environment variables
+export NCCL_DEBUG=INFO
+
+### Increase the send queue depth, which can make NCCL communications non-blocking.
+### https://www.usenix.org/system/files/atc23-choi.pdf
+export NCCL_BUFFSIZE=8388608
+### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
+### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
+export NCCL_P2P_NET_CHUNKSIZE=524288
+
+### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
+### message size and number of ranks.
+### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
+export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
+
+# Get Hostname and Instance IDs
+mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
+
+# Run shmem_put_bw benchmark
+srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
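+
+# With 2 PEs on a single node, shmem_put_bw traffic stays intra-node (over NVLink),
+# so this script does not exercise the EFA network path.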