Replies: 1 comment
Closing since this was discussed directly.
Hi Alby, Anders, and co.!
I've been playing around with Allegro and LAMMPS. I can get away with a relatively small simulation (~500 atoms), but I'd like to run it over a long timescale, so I want to optimize the performance per timestep. I'm using a fairly small Allegro model (1 layer, SO(3) symmetry, a small number of tensor features, etc.).

Looking at CPU and GPU utilization with 1 MPI rank and 1 V100 GPU, I see 100% CPU usage and 75% GPU usage. Moving to 2 MPI ranks and 2 GPUs (1 node), I see 100% CPU usage per rank and 66% GPU usage per GPU. It seems the run is currently bottlenecked by something on the CPU.
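For context, my launch setup is roughly along these lines (the input file name, model file, and element types below are placeholders, and the Kokkos package options are the ones I believe the pair_allegro docs suggest, so treat this as a sketch rather than my exact input):

```sh
# Rough sketch of the single-rank / single-GPU run (file names are placeholders):
mpirun -np 1 lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.allegro

# Relevant pair setup in the input script:
#   pair_style  allegro
#   pair_coeff  * * deployed_allegro.pth Si O
```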
I did a bit of profiling, and it looks like a significant chunk of total runtime (~30%) is spent in `LAMMPS_NS::CommKokkos::borders()` here, with about 20% of total runtime spent in `LAMMPS_NS::CommKokkos::borders_device<Kokkos::Cuda>()`. It looks like this function transfers neighbor data between procs, but the odd thing is, I'm running this with only 1 MPI rank. Maybe this is just sending the neighbor list to the GPU? I can provide the `gprof` output if you'd like a closer look.

Have you encountered something similar before?
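In case it's useful, this is roughly how I could cross-check the `gprof` numbers with an Nsight Systems timeline, to see whether that time is host-side packing or host-to-device copies (the binary name, input file, and report name are placeholders):

```sh
# Example Nsight Systems capture for the 1-rank case; names are placeholders.
# --trace=cuda,osrt separates CUDA API/memcpy time from host OS-runtime time.
nsys profile --trace=cuda,osrt -o borders_check \
    lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.allegro

# Summarize kernel and memcpy time from the captured report:
nsys stats borders_check.nsys-rep
```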