nvshmem #599

Merged
nghtm merged 2 commits into main from nvshmem
Apr 28, 2025

pbelevich
Collaborator

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@nghtm
Collaborator

nghtm commented Apr 15, 2025

Thanks @pbelevich - is this PR ready for review?

@pbelevich pbelevich marked this pull request as ready for review April 22, 2025 21:49
@pbelevich
Collaborator Author

Observation: setting the EFA environment variables does not affect the performance of NVSHMEM (compiled with NCCL):

srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 bash -c "/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw"
  DEVICE CUDA API              12020
#shmem_put_bw_uni
size(B)     scope     BW (GB/sec)     
4           None      0.000342        
8           None      0.000706        
16          None      0.001419        
32          None      0.002860        
64          None      0.005706        
128         None      0.011534        
256         None      0.008055        
512         None      0.016731        
1024        None      0.032343        
2048        None      0.066438        
4096        None      0.121858        
8192        None      0.259136        
16384       None      0.535117        
32768       None      1.061690        
65536       None      2.096427        
131072      None      4.082528        
262144      None      8.061405        
524288      None      9.839648        
1048576     None      11.298922       
2097152     None      11.798084       
4194304     None      12.004139       
srun --mpi=pmix --cpu-bind=none --container-image ./nvshmem.sqsh --nodes=2 --ntasks-per-node=1 bash -c "FI_PROVIDER=efa FI_EFA_USE_DEVICE_RDMA=1 FI_EFA_FORK_SAFE=1 NCCL_BUFFSIZE=8388608 NCCL_P2P_NET_CHUNKSIZE=524288 NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw"
  DEVICE CUDA API              12020
#shmem_put_bw_uni
size(B)     scope     BW (GB/sec)     
4           None      0.000347        
8           None      0.000707        
16          None      0.001442        
32          None      0.002881        
64          None      0.005777        
128         None      0.011567        
256         None      0.008049        
512         None      0.016305        
1024        None      0.032577        
2048        None      0.065776        
4096        None      0.133139        
8192        None      0.246795        
16384       None      0.514366        
32768       None      1.086703        
65536       None      2.134000        
131072      None      4.344045        
262144      None      8.356626        
524288      None      10.074402       
1048576     None      11.425782       
2097152     None      11.801483       
4194304     None      12.004469       
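
To put a number on the observation above, here is a minimal sketch (not part of the PR) that computes the percent change between the two runs at a few message sizes. The bandwidth values are copied from the output above; the helper name `relative_change_pct` and the choice of sizes are illustrative assumptions, and Python is used only for convenience.

```python
# Bandwidth (GB/sec) copied from the two shmem_put_bw runs above,
# keyed by message size in bytes. Only a subset of sizes is included.
baseline = {4096: 0.121858, 65536: 2.096427, 1048576: 11.298922, 4194304: 12.004139}
with_efa_env = {4096: 0.133139, 65536: 2.134000, 1048576: 11.425782, 4194304: 12.004469}


def relative_change_pct(before, after):
    """Percent change of `after` relative to `before` for each shared size.

    Hypothetical helper for illustration; not part of NVSHMEM or the PR.
    """
    return {
        size: (after[size] - before[size]) / before[size] * 100.0
        for size in sorted(before.keys() & after.keys())
    }


changes = relative_change_pct(baseline, with_efa_env)
for size, pct in changes.items():
    print(f"{size:>8} B: {pct:+.2f}%")
```

At 4194304 bytes the two runs differ by well under 0.1%. Smaller sizes show larger differences, but each configuration was run only once here, so that may simply be run-to-run variation.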

@pbelevich
Collaborator Author

@nghtm yes, please review

@nghtm
Collaborator

nghtm commented Apr 22, 2025

I am tight on bandwidth to review this PR this week. Requesting @amanshanbhag to take a look.

@KeitaW
Collaborator

KeitaW commented Apr 23, 2025

Thanks @pbelevich! I will also take a look at it on Friday (right now I'm up to my ears preparing for the upcoming TFC summit session).

Collaborator

@KeitaW KeitaW left a comment

Overall LGTM (thank you so much!), but we might want to add a description for the sbatch files (see #654, and feel free to merge it into this branch before merging this to main if that makes sense).

@nghtm nghtm merged commit e97db8e into main Apr 28, 2025
@nghtm nghtm deleted the nvshmem branch April 28, 2025 23:10