nvshmem #599

Merged (2 commits) on Apr 28, 2025
159 changes: 159 additions & 0 deletions micro-benchmarks/nvshmem/README.md
# NVSHMEM

NVSHMEM is NVIDIA's implementation of the OpenSHMEM [PGAS](https://en.wikipedia.org/wiki/Partitioned_global_address_space) model for GPU clusters. It provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. NVSHMEM can significantly reduce communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.

One use case for NVSHMEM is implementing high-throughput, low-latency MoE dispatch and combine GPU kernels; [DeepEP](https://github.com/deepseek-ai/DeepEP) and [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) are examples of such implementations.

This document is a guide to building NVSHMEM with NCCL and AWS EFA support and running its performance tests. It reuses the NCCL Tests Docker image as a base and adds NVSHMEM on top, since NVSHMEM is built against NCCL.

### Building NCCL Tests Docker image

For more details on how to build the NCCL Tests Docker image, please refer to the [NCCL Tests README](../nccl-tests/README.md).

```bash
GDRCOPY_VERSION=v2.4.4
EFA_INSTALLER_VERSION=1.38.1
AWS_OFI_NCCL_VERSION=v1.14.0
NCCL_VERSION=v2.26.2-1
NCCL_TESTS_VERSION=v2.14.1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}"
NCCL_CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}"
```

```bash
docker build --progress=plain -f ../nccl-tests/nccl-tests.Dockerfile \
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
--build-arg="NCCL_VERSION=${NCCL_VERSION}" \
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
-t ${NCCL_CONTAINER_IMAGE_NAME_TAG} \
.
```
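
Optionally, confirm the base image is available locally before building on top of it (a quick sanity check, not required):

```bash
# List the locally built nccl-tests image and its tags
docker images nccl-tests
```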

### Building NVSHMEM Docker image on top of NCCL Tests Docker base image

```bash
NVSHMEM_VERSION=3.2.5-1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}-nvshmem${NVSHMEM_VERSION}"
NVSHMEM_CONTAINER_IMAGE_NAME_TAG="nvshmem:${TAG}"
```

```bash
docker build --progress=plain -f nvshmem.Dockerfile \
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
--build-arg="NCCL_VERSION=${NCCL_VERSION}" \
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
--build-arg="NVSHMEM_VERSION=${NVSHMEM_VERSION}" \
-t ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
.
```
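
As an optional sanity check (a sketch; it assumes the nvshmem.Dockerfile in this directory, which installs the perf tests under `/opt/nvshmem/bin/perftest`), you can list the test binaries baked into the image:

```bash
# List the device point-to-point perf tests inside the freshly built image
docker run --rm ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
    ls /opt/nvshmem/bin/perftest/device/pt-to-pt
```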

### Slurm

To run the NVSHMEM performance tests on Slurm, you will need to convert the container image into a squash file using Enroot. If you have the built image locally, use the following command:

```bash
enroot import -o ./nvshmem.sqsh dockerd://${NVSHMEM_CONTAINER_IMAGE_NAME_TAG}
```
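
If the image is stored in a registry rather than the local Docker daemon, Enroot can import it directly; the registry path below is a placeholder:

```bash
# Import from a registry instead of the local daemon (replace the URI with your own)
enroot import -o ./nvshmem.sqsh docker://<registry>/<repository>:<tag>
```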

# Perf Test

NVSHMEM provides a rich set of performance tests for different operations, launched from both the device and the host.

Common arguments (an example invocation follows this list):

* `-b, --min_size <minbytes>` - Minimum message size in bytes
* `-e, --max_size <maxbytes>` - Maximum message size in bytes
* `-f, --step <step factor>` - Step factor for message sizes
* `-n, --iters <number>` - Number of iterations
* `-w, --warmup_iters <number>` - Number of warmup iterations
* `-c, --ctas <number>` - Number of CTAs to launch (used in some device pt-to-pt tests)
* `-t, --threads_per_cta <number>` - Number of threads per block (used in some device pt-to-pt tests)
* `-d, --datatype <type>` - Data type: int, int32_t, uint32_t, int64_t, uint64_t, long, longlong, ulonglong, size, ptrdiff, float, double, fp16, bf16
* `-o, --reduce_op <op>` - Reduction operation: min, max, sum, prod, and, or, xor
* `-s, --scope <scope>` - Thread group scope: thread, warp, block, all
* `-i, --stride <number>` - Stride between elements
* `-a, --atomic_op <op>` - Atomic operation: inc, add, and, or, xor, set, swap, fetch_inc, fetch_add, fetch_and, fetch_or, fetch_xor, compare_swap
* `--bidir` - Run bidirectional test
* `--msgrate` - Report message rate (million messages per second)
* `--dir <direction>` - Direction (read/write) for put/get operations
* `--issue <mode>` - Issue mode (on_stream/host) for some host pt-to-pt tests
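
As an illustration (a sketch only; the flag values are arbitrary, not tuned recommendations), the common arguments can be combined with any of the tests listed below, for example:

```bash
# Sweep message sizes from 8 B to 16 MiB, doubling each step, with 10 warmup and 100 timed iterations
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw \
    -b 8 -e 16777216 -f 2 -w 10 -n 100
```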

## Device

### Collective

Device collective tests are located in `/opt/nvshmem/bin/perftest/device/collective/`; an example invocation follows the list:

- alltoall_latency
- barrier_latency
- bcast_latency
- fcollect_latency
- redmaxloc_latency
- reducescatter_latency
- reduction_latency
- sync_latency
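
For example, a minimal sketch (assuming the install path quoted above and the squash file built earlier) that runs the all-to-all latency test across 8 GPUs of a single node, one PE per GPU:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=1 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/device/collective/alltoall_latency
```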

### Point-to-Point

Device point-to-point tests are located in `/opt/nvshmem/bin/perftest/device/pt-to-pt/`:

- shmem_atomic_bw
- shmem_atomic_latency
- shmem_atomic_ping_pong_latency
- shmem_g_bw
- shmem_g_latency
- shmem_get_bw
- shmem_get_latency
- shmem_p_bw
- shmem_p_latency
- shmem_p_ping_pong_latency
- shmem_put_atomic_ping_pong_latency
- shmem_put_bw
- shmem_put_latency
- shmem_put_ping_pong_latency
- shmem_put_signal_ping_pong_latency
- shmem_signal_ping_pong_latency
- shmem_st_bw

## Host

### Collectives

Host collective tests are located in `/opt/nvshmem/bin/perftest/host/collective/`; an example invocation follows the list:

- alltoall_on_stream
- barrier_all_on_stream
- barrier_on_stream
- broadcast_on_stream
- fcollect_on_stream
- reducescatter_on_stream
- reduction_on_stream
- sync_all_on_stream
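
For example, a sketch (same assumptions as above) that runs the on-stream reduction test across two nodes with 8 PEs each:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=2 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/host/collective/reduction_on_stream
```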

### Point-to-Point

Host point-to-point tests are located in `/opt/nvshmem/bin/perftest/host/pt-to-pt/`; an example invocation follows the list:

- bw
- latency
- stream_latency
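
For example, a sketch that runs the host bandwidth test between two nodes and requests stream-ordered issue; note that `--issue` applies only to some of the host point-to-point tests, so treat the flag as illustrative:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/host/pt-to-pt/bw --issue on_stream
```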

### Example: running the shmem_put_bw benchmark on 2 GPUs, on a single node and across two nodes

The NVSHMEM shmem_put_bw benchmark requires exactly 2 processing elements (PEs), so there are two options.

To benchmark 2 GPUs on a single node over NVLink:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```

To benchmark 2 GPUs on two different nodes over AWS EFA:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```
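
The image already sets `NVSHMEM_REMOTE_TRANSPORT=libfabric` and `NVSHMEM_LIBFABRIC_PROVIDER=efa` (see the nvshmem.Dockerfile below), so inter-node runs use EFA by default. If you want to experiment with different settings, here is a sketch of overriding them at launch time (the values shown are simply the image defaults):

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=2 --ntasks-per-node=1 \
    --export=ALL,NVSHMEM_REMOTE_TRANSPORT=libfabric,NVSHMEM_LIBFABRIC_PROVIDER=efa \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```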
61 changes: 61 additions & 0 deletions micro-benchmarks/nvshmem/nvshmem.Dockerfile
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
ARG GDRCOPY_VERSION=v2.4.1
ARG EFA_INSTALLER_VERSION=1.37.0
ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
ARG NCCL_VERSION=v2.23.4-1
ARG NCCL_TESTS_VERSION=v2.13.10

FROM nccl-tests:efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}

ARG NVSHMEM_VERSION=3.2.5-1

ENV NVSHMEM_DIR=/opt/nvshmem
ENV NVSHMEM_HOME=/opt/nvshmem

RUN curl -L https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_${NVSHMEM_VERSION}.txz -o /nvshmem_src_${NVSHMEM_VERSION}.txz \
&& tar -xf /nvshmem_src_${NVSHMEM_VERSION}.txz -C / \
&& cd /nvshmem_src \
&& mkdir -p build \
&& cd build \
&& cmake \
-DNVSHMEM_PREFIX=/opt/nvshmem \
-DCMAKE_INSTALL_PREFIX=/opt/nvshmem \
\
-DCUDA_HOME=/usr/local/cuda \
-DCMAKE_CUDA_ARCHITECTURES=90a \
\
-DNVSHMEM_USE_GDRCOPY=1 \
-DGDRCOPY_HOME=/opt/gdrcopy \
\
-DNVSHMEM_USE_NCCL=1 \
-DNCCL_HOME=/opt/nccl/build \
-DNCCL_INCLUDE=/opt/nccl/build/include \
\
-DNVSHMEM_LIBFABRIC_SUPPORT=1 \
-DLIBFABRIC_HOME=/opt/amazon/efa \
\
-DNVSHMEM_MPI_SUPPORT=1 \
-DMPI_HOME=/opt/amazon/openmpi \
\
-DNVSHMEM_PMIX_SUPPORT=1 \
-DPMIX_HOME=/opt/amazon/pmix \
-DNVSHMEM_DEFAULT_PMIX=1 \
\
-DNVSHMEM_BUILD_TESTS=1 \
-DNVSHMEM_BUILD_EXAMPLES=1 \
-DNVSHMEM_BUILD_HYDRA_LAUNCHER=1 \
-DNVSHMEM_BUILD_TXZ_PACKAGE=1 \
\
-DNVSHMEM_IBRC_SUPPORT=1 \
-DNVSHMEM_IBGDA_SUPPORT=1 \
\
-DNVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
\
-DNVSHMEM_DEBUG=WARN \
-DNVSHMEM_TRACE=1 \
.. \
&& make -j$(nproc) \
&& make install

ENV PATH=/opt/nvshmem/bin:$PATH LD_LIBRARY_PATH=/opt/amazon/pmix/lib:/opt/nvshmem/lib:$LD_LIBRARY_PATH NVSHMEM_REMOTE_TRANSPORT=libfabric NVSHMEM_LIBFABRIC_PROVIDER=efa
55 changes: 55 additions & 0 deletions micro-benchmarks/nvshmem/slurm/alltoall_latency.sbatch
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --job-name=all2all # name of your job
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --ntasks-per-node 8 # number of GPUs per node (one task per GPU)
###SBATCH --gpus-per-node=8 # number of GPU we reserve. Uncomment for AWS ParallelCluster
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
#SBATCH --exclusive
#SBATCH --wait-all-nodes=1

### Disable hyperthreading by setting the tasks per core to 1
#SBATCH --ntasks-per-core=1

###########################
###### User Variables #####
###########################


# default variables for Enroot; override IMAGE to point elsewhere if needed
: "${IMAGE:=./nvshmem.sqsh}"

## Set libfabric flags to use EFA
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1

## Set this flag for debugging EFA
#export FI_LOG_LEVEL=warn

## NCCL Environment variables
export NCCL_DEBUG=INFO

### Increase the send queue depth, which can turn NCCL communications into non-blocking.
### https://www.usenix.org/system/files/atc23-choi.pdf
export NCCL_BUFFSIZE=8388608
### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
export NCCL_P2P_NET_CHUNKSIZE=524288

### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
### message size and number of ranks.
### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

#Get Hostname and Instance IDs
mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'

# Run alltoall_latency benchmark
srun --mpi=pmix --cpu-bind=none --container-image ${IMAGE} /opt/nvshmem/bin/perftest/device/coll/alltoall_latency
55 changes: 55 additions & 0 deletions micro-benchmarks/nvshmem/slurm/shmem_put_bw_internode.sbatch
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --job-name=put_bw # name of your job
#SBATCH --nodes=2 # number of nodes to use
#SBATCH --ntasks-per-node 1 # one task (one GPU) per node
###SBATCH --gpus-per-node=8 # number of GPU we reserve. Uncomment for AWS ParallelCluster
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
#SBATCH --exclusive
#SBATCH --wait-all-nodes=1

### Disable hyperthreading by setting the tasks per core to 1
#SBATCH --ntasks-per-core=1

###########################
###### User Variables #####
###########################


# default variables for Enroot; override IMAGE to point elsewhere if needed
: "${IMAGE:=./nvshmem.sqsh}"

## Set libfabric flags to use EFA
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1

## Set this flag for debugging EFA
#export FI_LOG_LEVEL=warn

## NCCL Environment variables
export NCCL_DEBUG=INFO

### Increase the send queue depth, which can turn NCCL communications into non-blocking.
### https://www.usenix.org/system/files/atc23-choi.pdf
export NCCL_BUFFSIZE=8388608
### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
export NCCL_P2P_NET_CHUNKSIZE=524288

### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
### message size and number of ranks.
### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

#Get Hostname and Instance IDs
mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'

# Run shmem_put_bw benchmark
srun --mpi=pmix --cpu-bind=none --container-image ${IMAGE} /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
55 changes: 55 additions & 0 deletions micro-benchmarks/nvshmem/slurm/shmem_put_bw_intranode.sbatch
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --job-name=put_bw # name of your job
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --ntasks-per-node 2 # two tasks (two GPUs) on the node
###SBATCH --gpus-per-node=8 # number of GPU we reserve. Uncomment for AWS ParallelCluster
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
#SBATCH --exclusive
#SBATCH --wait-all-nodes=1

### Disable hyperthreading by setting the tasks per core to 1
#SBATCH --ntasks-per-core=1

###########################
###### User Variables #####
###########################


# default variables for Enroot; override IMAGE to point elsewhere if needed
: "${IMAGE:=./nvshmem.sqsh}"

## Set libfabric flags to use EFA
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1

## Set this flag for debugging EFA
#export FI_LOG_LEVEL=warn

## NCCL Environment variables
export NCCL_DEBUG=INFO

### Increase the send queue depth, which can turn NCCL communications into non-blocking.
### https://www.usenix.org/system/files/atc23-choi.pdf
export NCCL_BUFFSIZE=8388608
### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications
### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
export NCCL_P2P_NET_CHUNKSIZE=524288

### Improve performance for AllReduce by selecting specific protocol and algorithm for specific
### message size and number of ranks.
### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

#Get Hostname and Instance IDs
mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'

# Run shmem_put_bw benchmark
srun --mpi=pmix --cpu-bind=none --container-image ${IMAGE} /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw