programmah/WIP-Distributed-Model-Training

Work in Progress (WIP)

The Distributed Model Training Bootcamp takes a real-world perspective on how to efficiently utilize GPUs when training models in a distributed manner. Attendees walk through the system topology to learn the dynamics of multi-GPU and multi-node connections and architecture. Using the PyTorch framework, they will also learn state-of-the-art training strategies, including distributed data parallel (DDP), fully sharded data parallel (FSDP), model parallelism, pipeline parallelism, and tensor parallelism. Finally, attendees will learn to profile code and analyze performance using NVIDIA® Nsight™ Systems, a tool that helps identify optimization opportunities and improve the performance of applications running on systems with multiple CPUs and GPUs.
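
As a taste of the first of these strategies, below is a minimal sketch of data-parallel training with PyTorch DDP. It is illustrative rather than part of the bootcamp material: the toy model, tensor sizes, and script name `train_ddp.py` are assumptions, and the script expects to be launched with `torchrun`, which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each process.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE; init_process_group
    # reads them via the default env:// rendezvous.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model standing in for a real network. DDP wraps it so that
    # gradients are all-reduced across ranks during backward().
    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for step in range(10):
        # Random data in place of a real DistributedSampler-backed loader.
        inputs = torch.randn(32, 1024, device=local_rank)
        targets = torch.randn(32, 1024, device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched as, for example, `torchrun --nproc_per_node=4 train_ddp.py`, each process drives one GPU and holds a full model replica; DDP synchronizes gradients during `backward()`, which is exactly the memory pressure that the FSDP and model/pipeline/tensor-parallel sessions address.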
