# NVSHMEM

NVSHMEM is NVIDIA's implementation of the OpenSHMEM [PGAS](https://en.wikipedia.org/wiki/Partitioned_global_address_space) model for GPU clusters. It provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. NVSHMEM can significantly reduce communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.

One option for using NVSHMEM is to implement high-throughput, low-latency MoE dispatch and combine GPU kernels; [DeepEP](https://github.com/deepseek-ai/DeepEP) and [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) are examples of such implementations.

This document is a guide on how to build NVSHMEM with NCCL and AWS EFA support and how to run its performance tests. It reuses the NCCL Tests Docker image as a base image and adds NVSHMEM on top, because NVSHMEM is built with NCCL support.
| 8 | + |
| 9 | +### Building NCCL Tests Docker image |
| 10 | + |
| 11 | +For more details on how to build the NCCL Tests Docker image, please refer to the [NCCL Tests README](../nccl-tests/README.md). |
| 12 | + |
| 13 | +```bash |
| 14 | +GDRCOPY_VERSION=v2.4.4 |
| 15 | +EFA_INSTALLER_VERSION=1.38.1 |
| 16 | +AWS_OFI_NCCL_VERSION=v1.14.0 |
| 17 | +NCCL_VERSION=v2.26.2-1 |
| 18 | +NCCL_TESTS_VERSION=v2.14.1 |
| 19 | +TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}" |
| 20 | +NCCL_CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}" |
| 21 | +``` |
| 22 | + |
```bash
docker build --progress=plain -f ../nccl-tests/nccl-tests.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    -t ${NCCL_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Building NVSHMEM Docker image on top of NCCL Tests Docker base image

```bash
NVSHMEM_VERSION=3.2.5-1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}-nvshmem${NVSHMEM_VERSION}"
NVSHMEM_CONTAINER_IMAGE_NAME_TAG="nvshmem:${TAG}"
```

```bash
docker build --progress=plain -f nvshmem.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    --build-arg="NVSHMEM_VERSION=${NVSHMEM_VERSION}" \
    -t ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Slurm

To run the NVSHMEM performance tests on Slurm, you will need to convert the container image into a squash file using Enroot. If you have built the image locally, use the following command:

```bash
enroot import -o ./nvshmem.sqsh dockerd://${NVSHMEM_CONTAINER_IMAGE_NAME_TAG}
```
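
If the image is not available in the local Docker daemon but has been pushed to a container registry, Enroot can import it directly from the registry instead. The registry path below is a placeholder for illustration:

```bash
# Hypothetical registry path — replace <registry>/nvshmem with your repository.
enroot import -o ./nvshmem.sqsh docker://<registry>/nvshmem:${TAG}
```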

# Perf Test

NVSHMEM provides a rich set of performance tests for different operations, launched from both the device and the host.

Common arguments (an example invocation combining several of these flags follows the list):

* `-b, --min_size <minbytes>` - Minimum message size in bytes
* `-e, --max_size <maxbytes>` - Maximum message size in bytes
* `-f, --step <step factor>` - Step factor for message sizes
* `-n, --iters <number>` - Number of iterations
* `-w, --warmup_iters <number>` - Number of warmup iterations
* `-c, --ctas <number>` - Number of CTAs to launch (used in some device pt-to-pt tests)
* `-t, --threads_per_cta <number>` - Number of threads per block (used in some device pt-to-pt tests)
* `-d, --datatype <type>` - Data type: int, int32_t, uint32_t, int64_t, uint64_t, long, longlong, ulonglong, size, ptrdiff, float, double, fp16, bf16
* `-o, --reduce_op <op>` - Reduction operation: min, max, sum, prod, and, or, xor
* `-s, --scope <scope>` - Thread group scope: thread, warp, block, all
* `-i, --stride <number>` - Stride between elements
* `-a, --atomic_op <op>` - Atomic operation: inc, add, and, or, xor, set, swap, fetch_inc, fetch_add, fetch_and, fetch_or, fetch_xor, compare_swap
* `--bidir` - Run bidirectional test
* `--msgrate` - Report message rate (million messages per second)
* `--dir <direction>` - Direction (read/write) for put/get operations
* `--issue <mode>` - Issue mode (on_stream/host) for some host pt-to-pt tests

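For example, an illustrative message-size sweep of the device `shmem_put_latency` test between two nodes might look like this (the flag values are arbitrary, and per-test flag support may vary):

```bash
# Sweep message sizes from 1 KiB to 1 MiB, doubling each step,
# with 10 warmup iterations and 100 measured iterations per size.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_latency -b 1024 -e 1048576 -f 2 -w 10 -n 100
```
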
## Device

### Collective

Device collective tests are located in `/opt/nvshmem/bin/perftest/device/collective/`:

- alltoall_latency
- barrier_latency
- bcast_latency
- fcollect_latency
- redmaxloc_latency
- reducescatter_latency
- reduction_latency
- sync_latency

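For example, a single-node run of `reduction_latency` across all 8 GPUs of a node could be launched as follows (an illustrative sketch assuming one PE per GPU and that the `-d`/`-o` flags listed above are supported by this binary):

```bash
# Sum-reduce float data across 8 PEs on one node.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/device/collective/reduction_latency -d float -o sum
```
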
### Point-to-Point

Device point-to-point tests are located in `/opt/nvshmem/bin/perftest/device/pt-to-pt/`:

- shmem_atomic_bw
- shmem_atomic_latency
- shmem_atomic_ping_pong_latency
- shmem_g_bw
- shmem_g_latency
- shmem_get_bw
- shmem_get_latency
- shmem_p_bw
- shmem_p_latency
- shmem_p_ping_pong_latency
- shmem_put_atomic_ping_pong_latency
- shmem_put_bw
- shmem_put_latency
- shmem_put_ping_pong_latency
- shmem_put_signal_ping_pong_latency
- shmem_signal_ping_pong_latency
- shmem_st_bw

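Some of the device bandwidth tests accept the CTA-related flags from the common arguments list. As an illustrative sketch (flag support varies per test), this runs `shmem_put_bw` between two nodes with 4 CTAs of 256 threads each:

```bash
# Two PEs on two nodes over EFA; issue puts from 4 CTAs with 256 threads per CTA.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw -c 4 -t 256
```
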
## Host

### Collectives

Host collective tests are located in `/opt/nvshmem/bin/perftest/host/collective/`:

- alltoall_on_stream
- barrier_all_on_stream
- barrier_on_stream
- broadcast_on_stream
- fcollect_on_stream
- reducescatter_on_stream
- reduction_on_stream
- sync_all_on_stream

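These tests exercise the host API variants that enqueue collectives on a CUDA stream. As an illustrative example, `alltoall_on_stream` across two 8-GPU nodes (16 PEs) might be launched like this:

```bash
# 16 PEs spread over two nodes; all-to-all operations are enqueued on CUDA streams.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/host/collective/alltoall_on_stream
```
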
### Point-to-Point

Host point-to-point tests are located in `/opt/nvshmem/bin/perftest/host/pt-to-pt/`:

- bw
- latency
- stream_latency

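The `--dir` and `--issue` options from the common arguments list apply to these host-side tests. A hypothetical two-node run of the host `bw` test, issuing writes on a CUDA stream, could look like this (flag support may vary by NVSHMEM version):

```bash
# Host-initiated put bandwidth between two nodes, issued on a CUDA stream.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/host/pt-to-pt/bw --dir write --issue on_stream
```
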
### Example of running the shmem_put_bw benchmark on 2 GPUs on a single node and on 2 GPUs across two nodes

The NVSHMEM shmem_put_bw benchmark requires 2 processing elements (PEs), so there are two options.

To benchmark 2 GPUs on a single node over NVLink:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```

To benchmark 2 GPUs on two different nodes over AWS EFA:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```
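
If you prefer a batch submission over an interactive `srun`, the two-node case can be wrapped in a minimal sbatch script along these lines (a sketch; the job name and directives are placeholders to adjust for your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=nvshmem-put-bw
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

# Launch one PE per node inside the container imported with Enroot.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```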