Commit 143b5cc: nvshmem
1 parent eb32c00

3 files changed, +280 -0 lines changed
micro-benchmarks/nvshmem/README.md (+159)

@@ -0,0 +1,159 @@
# NVSHMEM

NVIDIA NVSHMEM is NVIDIA’s implementation of the OpenSHMEM [PGAS](https://en.wikipedia.org/wiki/Partitioned_global_address_space) model for GPU clusters. It provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. NVSHMEM can significantly reduce communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.

One use case for NVSHMEM is implementing high-throughput, low-latency MoE dispatch and combine GPU kernels. [DeepEP](https://github.com/deepseek-ai/DeepEP) and [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) are examples of such implementations.

This document is a guide to building NVSHMEM with NCCL and AWS EFA support and to running its performance tests. It reuses the NCCL Tests Docker image as the base image and adds NVSHMEM on top, because NVSHMEM is built against NCCL.
### Building NCCL Tests Docker image

For more details on how to build the NCCL Tests Docker image, please refer to the [NCCL Tests README](../nccl-tests/README.md).

```bash
GDRCOPY_VERSION=v2.4.4
EFA_INSTALLER_VERSION=1.38.1
AWS_OFI_NCCL_VERSION=v1.14.0
NCCL_VERSION=v2.26.2-1
NCCL_TESTS_VERSION=v2.14.1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}"
NCCL_CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}"
```

```bash
docker build --progress=plain -f ../nccl-tests/nccl-tests.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    -t ${NCCL_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Building NVSHMEM Docker image on top of NCCL Tests Docker base image

```bash
NVSHMEM_VERSION=3.2.5-1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}-nvshmem${NVSHMEM_VERSION}"
NVSHMEM_CONTAINER_IMAGE_NAME_TAG="nvshmem:${TAG}"
```

```bash
docker build --progress=plain -f nvshmem.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    --build-arg="NVSHMEM_VERSION=${NVSHMEM_VERSION}" \
    -t ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Slurm
53+
54+
To run the NCCL tests on Slurm, you will need to convert the container into a Squash file using Enroot.
55+
56+
Convert the container image to a squash file via Enroot. If you have the built image locally use the following command:
57+
58+
```bash
59+
enroot import -o ./nvshmem.sqsh dockerd://${NVSHMEM_CONTAINER_IMAGE_NAME_TAG}
60+
```
61+
62+
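
To verify the import before submitting jobs, you can start the container with Enroot and list the installed perftest binaries. A minimal sketch; the container name `nvshmem-check` is arbitrary:

```bash
# Optional sanity check: create a container from the squash file and list the device pt-to-pt tests
enroot create --name nvshmem-check ./nvshmem.sqsh
enroot start nvshmem-check ls /opt/nvshmem/bin/perftest/device/pt-to-pt
enroot remove -f nvshmem-check
```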
# Perf Test

NVSHMEM provides a rich set of performance tests for different operations launched on both device and host.

Common arguments (an example invocation follows the list):

* `-b, --min_size <minbytes>` - Minimum message size in bytes
* `-e, --max_size <maxbytes>` - Maximum message size in bytes
* `-f, --step <step factor>` - Step factor for message sizes
* `-n, --iters <number>` - Number of iterations
* `-w, --warmup_iters <number>` - Number of warmup iterations
* `-c, --ctas <number>` - Number of CTAs to launch (used in some device pt-to-pt tests)
* `-t, --threads_per_cta <number>` - Number of threads per block (used in some device pt-to-pt tests)
* `-d, --datatype <type>` - Data type: int, int32_t, uint32_t, int64_t, uint64_t, long, longlong, ulonglong, size, ptrdiff, float, double, fp16, bf16
* `-o, --reduce_op <op>` - Reduction operation: min, max, sum, prod, and, or, xor
* `-s, --scope <scope>` - Thread group scope: thread, warp, block, all
* `-i, --stride <number>` - Stride between elements
* `-a, --atomic_op <op>` - Atomic operation: inc, add, and, or, xor, set, swap, fetch_inc, fetch_add, fetch_and, fetch_or, fetch_xor, compare_swap
* `--bidir` - Run bidirectional test
* `--msgrate` - Report message rate (millions of messages per second)
* `--dir <direction>` - Direction (read/write) for put/get operations
* `--issue <mode>` - Issue mode (on_stream/host) for some host pt-to-pt tests
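
These flags can be appended to any of the test binaries listed in the sections below. A sketch, assuming the squash file from the previous section and a 2-GPU allocation, that sweeps message sizes for the device put bandwidth test:

```bash
# Sweep shmem_put_bw from 1 KiB to 8 MiB, doubling the size at each step,
# with 20 warmup iterations and 100 timed iterations per size
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=1 --ntasks-per-node=2 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw \
    -b 1024 -e 8388608 -f 2 -w 20 -n 100
```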
## Device

### Collective

Device collective tests are located in `/opt/nvshmem/bin/perftest/device/collective/` (a launch example follows the list):

- alltoall_latency
- barrier_latency
- bcast_latency
- fcollect_latency
- redmaxloc_latency
- reducescatter_latency
- reduction_latency
- sync_latency
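
For instance, the device-initiated all-to-all latency test can be launched across the 8 GPUs of a single node. A sketch, assuming the squash file built above and an 8-GPU instance:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=1 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/device/collective/alltoall_latency
```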
### Point-to-Point

Device point-to-point tests are located in `/opt/nvshmem/bin/perftest/device/pt-to-pt/`:

- shmem_atomic_bw
- shmem_atomic_latency
- shmem_atomic_ping_pong_latency
- shmem_g_bw
- shmem_g_latency
- shmem_get_bw
- shmem_get_latency
- shmem_p_bw
- shmem_p_latency
- shmem_p_ping_pong_latency
- shmem_put_atomic_ping_pong_latency
- shmem_put_bw
- shmem_put_latency
- shmem_put_ping_pong_latency
- shmem_put_signal_ping_pong_latency
- shmem_signal_ping_pong_latency
- shmem_st_bw
## Host

### Collectives

Host collective tests are located in `/opt/nvshmem/bin/perftest/host/collective/`:

- alltoall_on_stream
- barrier_all_on_stream
- barrier_on_stream
- broadcast_on_stream
- fcollect_on_stream
- reducescatter_on_stream
- reduction_on_stream
- sync_all_on_stream
### Point-to-Point

Host point-to-point tests are located in `/opt/nvshmem/bin/perftest/host/pt-to-pt/` (an example follows the list):

- bw
- latency
- stream_latency
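
The `--issue` mode listed above applies to some of these host-side tests. A sketch, assuming the host `bw` test supports stream-issued transfers between two nodes:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/host/pt-to-pt/bw --issue on_stream
```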
### Example of running the shmem_put_bw benchmark on 2 GPUs within a single node and on 2 GPUs across two nodes

The NVSHMEM shmem_put_bw benchmark requires 2 processing elements (PEs), so there are two options.

Benchmark 2 GPUs on a single node over NVLink:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```

Benchmark 2 GPUs on two different nodes over AWS EFA:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```
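
The two-node run can also be submitted as a batch job. A minimal sbatch sketch, assuming the squash file sits in the submission directory and that the cluster's Slurm and Pyxis setup matches the interactive commands above:

```bash
#!/bin/bash
#SBATCH --job-name=nvshmem-shmem_put_bw
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x_%j.out
#SBATCH --exclusive

# One PE per node, communicating over AWS EFA through the libfabric remote transport
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```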
micro-benchmarks/nvshmem/nvshmem.Dockerfile (+61)

@@ -0,0 +1,61 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
ARG GDRCOPY_VERSION=v2.4.1
ARG EFA_INSTALLER_VERSION=1.37.0
ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
ARG NCCL_VERSION=v2.23.4-1
ARG NCCL_TESTS_VERSION=v2.13.10

FROM nccl-tests:efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}

ARG NVSHMEM_VERSION=3.2.5-1

ENV NVSHMEM_DIR=/opt/nvshmem
ENV NVSHMEM_HOME=/opt/nvshmem

RUN curl -L https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_${NVSHMEM_VERSION}.txz -o /nvshmem_src_${NVSHMEM_VERSION}.txz \
    && tar -xf /nvshmem_src_${NVSHMEM_VERSION}.txz -C / \
    && cd /nvshmem_src \
    && mkdir -p build \
    && cd build \
    && cmake \
        -DNVSHMEM_PREFIX=/opt/nvshmem \
        -DCMAKE_INSTALL_PREFIX=/opt/nvshmem \
        \
        -DCUDA_HOME=/usr/local/cuda \
        -DCMAKE_CUDA_ARCHITECTURES=90a \
        \
        -DNVSHMEM_USE_GDRCOPY=1 \
        -DGDRCOPY_HOME=/opt/gdrcopy \
        \
        -DNVSHMEM_USE_NCCL=1 \
        -DNCCL_HOME=/opt/nccl/build \
        -DNCCL_INCLUDE=/opt/nccl/build/include \
        \
        -DNVSHMEM_LIBFABRIC_SUPPORT=1 \
        -DLIBFABRIC_HOME=/opt/amazon/efa \
        \
        -DNVSHMEM_MPI_SUPPORT=1 \
        -DMPI_HOME=/opt/amazon/openmpi \
        \
        -DNVSHMEM_PMIX_SUPPORT=1 \
        -DPMIX_HOME=/opt/amazon/pmix \
        -DNVSHMEM_DEFAULT_PMIX=1 \
        \
        -DNVSHMEM_BUILD_TESTS=1 \
        -DNVSHMEM_BUILD_EXAMPLES=1 \
        -DNVSHMEM_BUILD_HYDRA_LAUNCHER=1 \
        -DNVSHMEM_BUILD_TXZ_PACKAGE=1 \
        \
        -DNVSHMEM_IBRC_SUPPORT=1 \
        -DNVSHMEM_IBGDA_SUPPORT=1 \
        \
        -DNVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
        \
        -DNVSHMEM_DEBUG=WARN \
        -DNVSHMEM_TRACE=1 \
        .. \
    && make -j$(nproc) \
    && make install

ENV PATH=/opt/nvshmem/bin:$PATH LD_LIBRARY_PATH=/opt/amazon/pmix/lib:/opt/nvshmem/lib:$LD_LIBRARY_PATH NVSHMEM_REMOTE_TRANSPORT=libfabric NVSHMEM_LIBFABRIC_PROVIDER=efa

@@ -0,0 +1,60 @@
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --job-name=nccl-all_reduce_perf # name of your job
#SBATCH --nodes=2 # number of nodes to use
#SBATCH --ntasks-per-node 8 # number of GPUs per node
###SBATCH --gpus-per-node=8 # number of GPUs we reserve. Uncomment for AWS ParallelCluster
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
#SBATCH --exclusive
#SBATCH --wait-all-nodes=1

### Disable hyperthreading by setting the tasks per core to 1
#SBATCH --ntasks-per-core=1

###########################
###### User Variables #####
###########################


# default variables for Enroot
: "${APPS_PATH:=/fsx}"
: "${NCCL_TESTS_PATH:=/opt/nccl-tests/build}"
: "${IMAGE:=$APPS_PATH/nccl-tests.sqsh}"

## Set libfabric flags to use EFA
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1

## Set this flag for debugging EFA
#export FI_LOG_LEVEL=warn

## NCCL Environment variables
export NCCL_DEBUG=INFO

### Increase the send queue depth, which can make NCCL communications non-blocking.
### https://www.usenix.org/system/files/atc23-choi.pdf
export NCCL_BUFFSIZE=8388608
### Improve performance by increasing the buffer size for Send/Recv, Gather, Scatter and Alltoall communications
### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html
export NCCL_P2P_NET_CHUNKSIZE=524288

### Improve performance for AllReduce by selecting a specific protocol and algorithm for specific
### message sizes and numbers of ranks.
### More information: https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS.
export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so


declare -a ARGS=(
    --container-image $IMAGE
)

# Get hostname and instance IDs
mpirun -N 1 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'

# Run NCCL test
srun "${ARGS[@]}" --mpi=pmix --cpu-bind=none $NCCL_TESTS_PATH/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100
