# NVSHMEM

NVSHMEM is NVIDIA's implementation of the OpenSHMEM [PGAS](https://en.wikipedia.org/wiki/Partitioned_global_address_space) model for GPU clusters. It provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. NVSHMEM can significantly reduce communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.

One option for using NVSHMEM is to implement high-throughput, low-latency MoE dispatch and combine GPU kernels; [DeepEP](https://github.com/deepseek-ai/DeepEP) and [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) are examples of such implementations.

This document is a guide on how to build NVSHMEM with NCCL and AWS EFA support and how to run its performance tests. It reuses the NCCL Tests Docker image as a base image and adds NVSHMEM on top, because NVSHMEM is built with NCCL support.
| 8 | + |
| 9 | +### Building NCCL Tests Docker image |
| 10 | + |
| 11 | +For more details on how to build the NCCL Tests Docker image, please refer to the [NCCL Tests README](../nccl-tests/README.md). |
| 12 | + |
| 13 | +```bash |
| 14 | +GDRCOPY_VERSION=v2.4.4 |
| 15 | +EFA_INSTALLER_VERSION=1.38.1 |
| 16 | +AWS_OFI_NCCL_VERSION=v1.14.0 |
| 17 | +NCCL_VERSION=v2.26.2-1 |
| 18 | +NCCL_TESTS_VERSION=v2.14.1 |
| 19 | +TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}" |
| 20 | +NCCL_CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}" |
| 21 | +``` |
| 22 | + |
```bash
docker build --progress=plain -f ../nccl-tests/nccl-tests.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    -t ${NCCL_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Building NVSHMEM Docker image on top of NCCL Tests Docker base image

```bash
NVSHMEM_VERSION=3.2.5-1
TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}-nvshmem${NVSHMEM_VERSION}"
NVSHMEM_CONTAINER_IMAGE_NAME_TAG="nvshmem:${TAG}"
```

```bash
docker build --progress=plain -f nvshmem.Dockerfile \
    --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
    --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
    --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
    --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
    --build-arg="NVSHMEM_VERSION=${NVSHMEM_VERSION}" \
    -t ${NVSHMEM_CONTAINER_IMAGE_NAME_TAG} \
    .
```

### Slurm

To run the NVSHMEM performance tests on Slurm, you will need to convert the container image into a squash file using Enroot. If you have built the image locally, use the following command:

```bash
enroot import -o ./nvshmem.sqsh dockerd://${NVSHMEM_CONTAINER_IMAGE_NAME_TAG}
```
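
If the image is not available in the local Docker daemon but has been pushed to a container registry, Enroot can import it directly from the registry instead. The registry path below is a placeholder for illustration:

```bash
# Hypothetical registry path — replace <registry>/nvshmem with your repository.
enroot import -o ./nvshmem.sqsh docker://<registry>/nvshmem:${TAG}
```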

# Perf Test

NVSHMEM provides a rich set of performance tests for different operations, launched from both the device and the host.

Common arguments (an example invocation combining several of these flags follows the list):

* `-b, --min_size <minbytes>` - Minimum message size in bytes
* `-e, --max_size <maxbytes>` - Maximum message size in bytes
* `-f, --step <step factor>` - Step factor for message sizes
* `-n, --iters <number>` - Number of iterations
* `-w, --warmup_iters <number>` - Number of warmup iterations
* `-c, --ctas <number>` - Number of CTAs to launch (used in some device pt-to-pt tests)
* `-t, --threads_per_cta <number>` - Number of threads per block (used in some device pt-to-pt tests)
* `-d, --datatype <type>` - Data type: int, int32_t, uint32_t, int64_t, uint64_t, long, longlong, ulonglong, size, ptrdiff, float, double, fp16, bf16
* `-o, --reduce_op <op>` - Reduction operation: min, max, sum, prod, and, or, xor
* `-s, --scope <scope>` - Thread group scope: thread, warp, block, all
* `-i, --stride <number>` - Stride between elements
* `-a, --atomic_op <op>` - Atomic operation: inc, add, and, or, xor, set, swap, fetch_inc, fetch_add, fetch_and, fetch_or, fetch_xor, compare_swap
* `--bidir` - Run bidirectional test
* `--msgrate` - Report message rate (million messages per second)
* `--dir <direction>` - Direction (read/write) for put/get operations
* `--issue <mode>` - Issue mode (on_stream/host) for some host pt-to-pt tests

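For example, an illustrative message-size sweep of the device `shmem_put_latency` test between two nodes might look like this (the flag values are arbitrary, and per-test flag support may vary):

```bash
# Sweep message sizes from 1 KiB to 1 MiB, doubling each step,
# with 10 warmup iterations and 100 measured iterations per size.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_latency -b 1024 -e 1048576 -f 2 -w 10 -n 100
```
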
## Device

### Collective

Device collective tests are located in `/opt/nvshmem/bin/perftest/device/collective/`:

- alltoall_latency
- barrier_latency
- bcast_latency
- fcollect_latency
- redmaxloc_latency
- reducescatter_latency
- reduction_latency
- sync_latency

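For example, a single-node run of `reduction_latency` across all 8 GPUs of a node could be launched as follows (an illustrative sketch assuming one PE per GPU and that the `-d`/`-o` flags listed above are supported by this binary):

```bash
# Sum-reduce float data across 8 PEs on one node.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/device/collective/reduction_latency -d float -o sum
```
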
### Point-to-Point

Device point-to-point tests are located in `/opt/nvshmem/bin/perftest/device/pt-to-pt/`:

- shmem_atomic_bw
- shmem_atomic_latency
- shmem_atomic_ping_pong_latency
- shmem_g_bw
- shmem_g_latency
- shmem_get_bw
- shmem_get_latency
- shmem_p_bw
- shmem_p_latency
- shmem_p_ping_pong_latency
- shmem_put_atomic_ping_pong_latency
- shmem_put_bw
- shmem_put_latency
- shmem_put_ping_pong_latency
- shmem_put_signal_ping_pong_latency
- shmem_signal_ping_pong_latency
- shmem_st_bw

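Some of the device bandwidth tests accept the CTA-related flags from the common arguments list. As an illustrative sketch (flag support varies per test), this runs `shmem_put_bw` between two nodes with 4 CTAs of 256 threads each:

```bash
# Two PEs on two nodes over EFA; issue puts from 4 CTAs with 256 threads per CTA.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw -c 4 -t 256
```
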
## Host

### Collectives

Host collective tests are located in `/opt/nvshmem/bin/perftest/host/collective/`:

- alltoall_on_stream
- barrier_all_on_stream
- barrier_on_stream
- broadcast_on_stream
- fcollect_on_stream
- reducescatter_on_stream
- reduction_on_stream
- sync_all_on_stream

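These tests exercise the host API variants that enqueue collectives on a CUDA stream. As an illustrative example, `alltoall_on_stream` across two 8-GPU nodes (16 PEs) might be launched like this:

```bash
# 16 PEs spread over two nodes; all-to-all operations are enqueued on CUDA streams.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=8 \
    /opt/nvshmem/bin/perftest/host/collective/alltoall_on_stream
```
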
### Point-to-Point

Host point-to-point tests are located in `/opt/nvshmem/bin/perftest/host/pt-to-pt/`:

- bw
- latency
- stream_latency

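The `--dir` and `--issue` options from the common arguments list apply to these host-side tests. A hypothetical two-node run of the host `bw` test, issuing writes on a CUDA stream, could look like this (flag support may vary by NVSHMEM version):

```bash
# Host-initiated put bandwidth between two nodes, issued on a CUDA stream.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 \
    /opt/nvshmem/bin/perftest/host/pt-to-pt/bw --dir write --issue on_stream
```
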
### Example of running the shmem_put_bw benchmark on 2 GPUs on a single node and on 2 GPUs across two nodes

The NVSHMEM shmem_put_bw benchmark requires 2 processing elements (PEs), so there are two options.

To benchmark 2 GPUs on a single node over NVLink:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=1 --ntasks-per-node=2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```

To benchmark 2 GPUs on two different nodes over AWS EFA:

```bash
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```
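
If you prefer a batch submission over an interactive `srun`, the two-node case can be wrapped in a minimal sbatch script along these lines (a sketch; the job name and directives are placeholders to adjust for your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=nvshmem-put-bw
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

# Launch one PE per node inside the container imported with Enroot.
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh \
    /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw
```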