The Distributed Model Training Bootcamp takes a practical, real-world approach to using GPUs efficiently for distributed model training. Attendees walk through the system topology to understand the dynamics of multi-GPU and multi-node connections and architecture. Using the PyTorch framework, they learn state-of-the-art training strategies, including Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), model parallelism, pipeline parallelism, and tensor parallelism. Attendees also learn to profile code and analyze performance with NVIDIA® Nsight™ Systems, a tool that helps identify optimization opportunities and improve the performance of applications running on systems with multiple CPUs and GPUs. A rough illustration of one of these strategies, DDP, is sketched below.
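
To give a sense of what a DDP workflow looks like, here is a minimal sketch of a PyTorch DDP training loop. It is an illustration only, not bootcamp material: the toy linear model, the synthetic dataset, and the hyperparameters are placeholders, and the script assumes a single node with one process per GPU launched via `torchrun`.

```python
# Minimal DDP sketch: each process holds a model replica on its own GPU,
# DistributedSampler shards the data across ranks, and gradients are
# all-reduced automatically during backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for the real workload.
    model = nn.Linear(128, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would be launched with one process per GPU, for example `torchrun --nproc_per_node=4 train_ddp.py`, and the same command can be wrapped with `nsys profile` to capture a timeline for analysis in Nsight Systems.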