Commit fa9ea74

BenWibking and pgrete authored
New CUDA CI+development Docker container (#1162)
* update CI Docker container
  Update to CUDA 12.6, Ubuntu 24.04, and Clang 19.
* update OpenMPI
* revert to OpenMPI 4.1.4
* add ADIOS2+openPMD
* add c-blosc ubuntu package
* fix Dockerfile.nvcc
* fix dockerfile
* install python headers
* install cmake from apt (required for aarch64)
* remove duplicate cmake dep
* disable ascent build
* add newer ascent version
* disable ascent build; fix openpmd build
* downgrade to CUDA 12.0
* fix ascent build path
* fix bug in build_ascent.sh
* remove unneeded patches
* ascent complains if MFEM is not built
* control cuda support for ascent with env var
* add MAKEOPTS=--output-sync=target
* add comment to Dockerfile
* Downgrade numpy
* Fix ADIOS2 and OpenPMD versions
* Directly use Ascent script with small patch
* Use Cuda12.1 container and drop to local user
* add emacs and vi
* set build_jobs=`nproc` to avoid OOM kill
* add developer tools for Codespaces/VSCode
* add devcontainer.json
* update to CUDA 12.8 and VTK-m 2.3
* update image ref
* fetch BLT
* extract BLT into correct dir
* avoid uid 1000
* build and publish Docker image based on Dockerfiles in repo
* Update CI image to be used
* Fix python version used for linting
* Bump opmd to stable release
* Add changelog
* Use updated clang for compiler check
* remove docker-publish action
* Fix C++20 build
* Attempt to fix Parthenon Ascent dep
* Try Ben's BLT build fix
* Fix Ascent build
* Include opmd in rocm image
* Use actual image

---------

Co-authored-by: Philipp Grete <[email protected]>
1 parent 1671c57 commit fa9ea74

9 files changed

+173
-471
lines changed

.devcontainer/devcontainer.json

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+// devcontainer.json
+{
+  "name": "parthenon-dev",
+  "image": "ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent",
+  "hostRequirements": {
+    "cpus": 4
+  },
+  "customizations": {
+    "vscode": {
+      "settings": {},
+      "extensions": [
+        "-ms-vscode.cpptools",
+        "llvm-vs-code-extensions.vscode-clangd",
+        "github.vscode-pull-request-github",
+        "ms-python.python",
+        "ms-toolsai.jupyter",
+        "ms-vscode.live-server",
+        "ms-azuretools.vscode-docker",
+        "swyddfa.esbonio",
+        "tomoki1207.pdf",
+        "ms-vscode.cmake-tools",
+        "ms-vsliveshare.vsliveshare"
+      ]
+    }
+  },
+  "remoteEnv": {
+    "PATH": "${containerEnv:PATH}:/usr/local/hdf5/parallel/bin",
+    "OMPI_MCA_opal_warn_on_missing_libcuda": "0"
+  },
+  //"remoteUser": "ubuntu",
+  // we need to manually checkout the submodules,
+  // but VSCode may try to configure CMake before they are fully checked-out.
+  // workaround TBD
+  "postCreateCommand": "git submodule update --init"
+}
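
The comment in devcontainer.json notes that VSCode may try to configure CMake before the submodules are checked out, with the workaround still to be decided. One possible check, offered only as a sketch and not as part of this commit, relies on `git submodule status` prefixing each uninitialized submodule with `-`. The helper name `has_uninitialized_submodules` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): reads `git submodule status`
# output on stdin and succeeds if at least one submodule is uninitialized.
# Git prefixes an uninitialized submodule with '-'.
has_uninitialized_submodules() {
  grep -q '^-'
}

# Example use inside a checkout:
#   if git submodule status | has_uninitialized_submodules; then
#     git submodule update --init
#   fi
```

A wrapper like this could gate the CMake configure step, but whether VSCode honors such a gate depends on how the CMake Tools extension is configured.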

.github/workflows/check-compilers.yml

Lines changed: 2 additions & 9 deletions
@@ -21,20 +21,13 @@ jobs:
     continue-on-error: true
     strategy:
       matrix:
-        cxx: ['g++', 'clang++-15']
+        cxx: ['g++', 'clang++-20']
         cmake_build_type: ['Release', 'DbgNoSym']
         device: ['cuda', 'host']
         parallel: ['serial', 'mpi']
-        exclude:
-          # Debug cuda clang build fail for the unit test.
-          # Exclude for now until we figure out what's going on.
-          # https://github.com/lanl/parthenon/issues/630
-          - cxx: clang++-15
-            device: cuda
-            cmake_build_type: DbgNoSym
     runs-on: ubuntu-latest
     container:
-      image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
+      image: ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent
     env:
       CMAKE_GENERATOR: Ninja
     steps:
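
With the `exclude` block removed, every combination of the four matrix axes is built: 2 x 2 x 2 x 2 = 16 jobs, up from 14 when the two `clang++-15`/`cuda`/`DbgNoSym` entries (serial and mpi) were skipped. The following sketch enumerates the new matrix; the function name `list_matrix` is made up for illustration, and GitHub Actions performs this expansion itself.

```shell
#!/usr/bin/env bash
# Illustration only: expand the check-compilers.yml matrix by hand.
# Each output line is one job: compiler, build type, device, parallel mode.
list_matrix() {
  local cxx build device parallel
  for cxx in g++ clang++-20; do
    for build in Release DbgNoSym; do
      for device in cuda host; do
        for parallel in serial mpi; do
          echo "${cxx} ${build} ${device} ${parallel}"
        done
      done
    done
  done
}

list_matrix
```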

.github/workflows/ci-extended.yml

Lines changed: 4 additions & 3 deletions
@@ -34,7 +34,7 @@ jobs:
         parallel: ['serial', 'mpi']
     runs-on: [self-hosted, A100]
     container:
-      image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
+      image: ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent
       # map to local user id on CI machine to allow writing to build cache
       options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
     steps:
@@ -100,7 +100,8 @@ jobs:
             -DCMAKE_BUILD_TYPE=Release \
             -DMACHINE_VARIANT=${{ matrix.device }}-${{ matrix.parallel }} \
             -DPARTHENON_ENABLE_ASCENT=ON \
-            -DAscent_DIR=/usr/local/ascent-develop/lib/cmake/ascent
+            -DCMAKE_CUDA_HOST_COMPILER=g++ \
+            -DAscent_DIR=/usr/local/ascent-checkout/lib/cmake/ascent
           cmake --build build-ascent
           cd example/advection/
           # Pick GPU with most available memory
@@ -131,7 +132,7 @@ jobs:
         parallel: ['serial', 'mpi']
     runs-on: [self-hosted, navi1030]
     container:
-      image: ghcr.io/parthenon-hpc-lab/rocm5.4.3-mpi-hdf5
+      image: ghcr.io/parthenon-hpc-lab/rocm6.2-mpi-hdf5
       # Map to local user id on CI machine to allow writing to build cache and
       # forward device handles to access AMD GPU within container
       options: --user 1000 -w /home/ci --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined
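
The second hunk ends on the comment "Pick GPU with most available memory", but the command that follows it lies outside the diff context. A common way to do this is sketched below as an assumption about the approach, not a copy of the workflow's actual step. It expects lines of the form `<free MiB>, <index>`, which is what `nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader` prints. The function name `pick_gpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (assumed approach): read "<free MiB>, <index>" lines on stdin
# and print the index of the GPU with the most free memory.
pick_gpu() {
  sort -nr | head -n 1 | awk '{ print $NF }'
}

# Usage on a machine with NVIDIA GPUs:
#   export CUDA_VISIBLE_DEVICES=$(nvidia-smi \
#     --query-gpu=memory.free,index --format=csv,nounits,noheader | pick_gpu)
```

Reading from stdin keeps the selection logic testable on machines without a GPU.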

.github/workflows/ci-short.yml

Lines changed: 5 additions & 5 deletions
@@ -22,15 +22,15 @@ jobs:
   style:
     runs-on: [self-hosted, A100]
     container:
-      image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
+      image: ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent
       # map to local user id on CI machine to allow writing to build cache
       options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
     steps:
       - uses: actions/checkout@v3
         with:
           submodules: 'true'
       - name: cpplint
-        run: python ./tst/style/cpplint.py --counting=detailed --recursive src example tst
+        run: python3 ./tst/style/cpplint.py --counting=detailed --recursive src example tst
       - name: copyright
         run: |
           cmake -DCMAKE_CXX_FLAGS=-Werror -Bbuild-copyright-check
@@ -47,7 +47,7 @@ jobs:
         device: ['cuda', 'host']
     runs-on: [self-hosted, A100]
     container:
-      image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
+      image: ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent
       # map to local user id on CI machine to allow writing to build cache
       options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
     steps:
@@ -79,7 +79,7 @@ jobs:
         device: ['cuda', 'host']
     runs-on: [self-hosted, A100]
     container:
-      image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
+      image: ghcr.io/parthenon-hpc-lab/cuda12.8-mpi-hdf5-ascent
       # map to local user id on CI machine to allow writing to build cache
       options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
     steps:
@@ -137,7 +137,7 @@ jobs:
   integration-amdgpu:
     runs-on: [self-hosted, navi1030]
     container:
-      image: ghcr.io/parthenon-hpc-lab/rocm5.4.3-mpi-hdf5
+      image: ghcr.io/parthenon-hpc-lab/rocm6.2-mpi-hdf5
       # Map to local user id on CI machine to allow writing to build cache and
       # forward device handles to access AMD GPU within container
       options: --user 1000 -w /home/ci --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 ## Current develop
 
 ### Added (new features/APIs/variables/...)
+- [[PR 1162]](https://github.com/parthenon-hpc-lab/parthenon/pull/1162) Add dev container (e.g., GitHub Codepsacer or VSCode)
 
 
 ### Changed (changing behavior/API/variables/...)
@@ -13,6 +14,7 @@
 
 
 ### Infrastructure (changes irrelevant to downstream codes)
+- [[PR 1162]](https://github.com/parthenon-hpc-lab/parthenon/pull/1162) Update CI container to Cuda 12.8
 
 
 ### Removed (removing behavior/API/varaibles/...)

scripts/docker/Dockerfile.hip-rocm

Lines changed: 15 additions & 1 deletion
@@ -1,7 +1,7 @@
 FROM rocm/dev-ubuntu-24.04:6.2
 
 RUN apt-get clean && apt-get update -y && \
-    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends git python3-minimal libpython3-stdlib bc hwloc wget openssh-client python3-numpy python3-h5py python3-matplotlib lcov curl cmake ninja-build openmpi-bin libopenmpi-dev && \
+    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends git python3-minimal libpython3-stdlib bc hwloc wget openssh-client python3-numpy python3-h5py python3-matplotlib lcov curl cmake ninja-build openmpi-bin libopenmpi-dev adios2-mpi-bin adios2-serial-bin libadios2-mpi-c++11-dev libadios2-mpi-core-dev libadios2-serial-core-dev libadios2-serial-c++11-dev && \
     apt-get clean && rm -rf /var/lib/apt/lists/*
 
 RUN cd /tmp && \
@@ -16,6 +16,20 @@ RUN cd /tmp && \
 
 RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 10
 
+# commit version is ver 0.16.1
+RUN mkdir /tmp/build-openpmd && cd /tmp/build-openpmd && \
+    wget https://github.com/openPMD/openPMD-api/archive/3a60e77.tar.gz && \
+    tar xzf 3a60e77.tar.gz && \
+    mkdir openPMD-api-build && cd openPMD-api-build && \
+    cmake ../openPMD-api-3a60e7714f6143c8fc7bf89809f2167d058359ee -DopenPMD_USE_PYTHON=ON -DPython_EXECUTABLE=$(which python3) -DopenPMD_USE_ADIOS2=ON && \
+    cmake --build . -j 16 && \
+    cmake --build . --target install && \
+    cd / && \
+    rm -rf /tmp/build-openpmd
+
+# Technically not necessary (as we installed the api above) but makes it easier for package discovery
+RUN env openPMD_USE_MPI=ON python3 -m pip install openpmd-api --no-binary openpmd-api --break-system-packages
+
 # Latest image has default user with uid 1000 (which maps to the one running the container on the CI host
 # Need to add user to the group that can access the GPU
 RUN usermod -a -G render ubuntu

scripts/docker/Dockerfile.nvcc

Lines changed: 60 additions & 15 deletions
@@ -1,23 +1,36 @@
-FROM nvidia/cuda:11.6.1-devel-ubuntu20.04
+FROM nvidia/cuda:12.8.0-devel-ubuntu24.04
 
 RUN apt-get clean && apt-get update -y && \
-    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends git python3-minimal libpython3-stdlib bc hwloc wget openssh-client python3-numpy python3-h5py python3-matplotlib python3-scipy python3-pip lcov curl cuda-nsight-systems-11-6 cmake ninja-build
+    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends git python3-minimal libpython3-stdlib bc hwloc wget openssh-client python3-numpy python3-h5py python3-matplotlib python3-scipy python3-pip lcov curl cuda-nsight-systems-12-6 cmake ninja-build libpython3-dev gcc-11 g++-11 emacs nvi sphinx-doc python3-sphinx-rtd-theme python3-sphinxcontrib.bibtex python3-sphinx-copybutton && \
+    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 10 && \
+    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 10
 
-RUN pip3 install unyt
+RUN g++ --version
+
+RUN pip3 install unyt --break-system-packages
+
+RUN pip3 install blosc2 --break-system-packages
+
+# for Codespaces/VSCode Sphinx support
+RUN pip3 install esbonio --break-system-packages
+
+# h5py from the repo is incompatible with the default numpy 2.1.0
+# Downgrading is not the cleanest solution, but it works...
+# see https://stackoverflow.com/questions/78634235/numpy-dtype-size-changed-may-indicate-binary-incompatibility-expected-96-from
+RUN pip3 install numpy==1.26.4 --break-system-packages
 
 RUN wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key| apt-key add - && \
-    echo "deb http://apt.llvm.org/focal/ llvm-toolchain-focal-15 main" > /etc/apt/sources.list.d/llvm.list
+    echo "deb http://apt.llvm.org/noble/ llvm-toolchain-noble-20 main" > /etc/apt/sources.list.d/llvm.list
 
 RUN apt-get clean && apt-get update -y && \
-    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends clang-15 llvm-15 libomp-15-dev && \
+    DEBIAN_FRONTEND="noninteractive" TZ=America/New_York apt-get install -y --no-install-recommends clang-20 llvm-20 libomp-20-dev clangd-20 libstdc++-14-dev && \
     rm -rf /var/lib/apt/lists/*
 
-
 RUN cd /tmp && \
     wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.bz2 && \
     tar xjf openmpi-4.1.4.tar.bz2 && \
     cd openmpi-4.1.4 && \
-    ./configure --prefix=/opt/openmpi --enable-mpi-cxx --with-cuda && \
+    ./configure --prefix=/opt/openmpi --disable-mpi-fortran --disable-oshmem --with-cuda && \
     make -j16 && \
     make install && \
     cd / && \
@@ -36,19 +49,51 @@ RUN cd /tmp && \
     cd / && \
     rm -rf /tmp/hdf5-1.12.2*
 
-RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 10
+RUN mkdir /tmp/build-adios2 && cd /tmp/build-adios2 && \
+    wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.10.1.tar.gz && \
+    tar xzf v2.10.1.tar.gz && \
+    mkdir adios2-build && cd adios2-build && \
+    cmake ../ADIOS2-2.10.1 -DADIOS2_USE_Blosc2=ON -DADIOS2_USE_Fortran=OFF && \
+    make -j 16 && make install && \
+    cd / && \
+    rm -rf /tmp/build-adios2
 
-RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.23.2/cmake-3.23.2-linux-x86_64.tar.gz -o cmake-3.23.2-linux-x86_64.tar.gz && \
-    tar -xzf cmake-3.23.2-linux-x86_64.tar.gz -C /opt
+# commit version is ver 0.16.1
+RUN mkdir /tmp/build-openpmd && cd /tmp/build-openpmd && \
+    wget https://github.com/openPMD/openPMD-api/archive/3a60e77.tar.gz && \
+    tar xzf 3a60e77.tar.gz && \
+    mkdir openPMD-api-build && cd openPMD-api-build && \
+    cmake ../openPMD-api-3a60e7714f6143c8fc7bf89809f2167d058359ee -DopenPMD_USE_PYTHON=ON -DPython_EXECUTABLE=$(which python3) -DopenPMD_USE_ADIOS2=ON && \
+    cmake --build . -j 16 && \
+    cmake --build . --target install && \
+    cd / && \
+    rm -rf /tmp/build-openpmd
+
+RUN mkdir /tmp/build-ascent
 
-ENV PATH=/opt/cmake-3.23.2-linux-x86_64/bin:$PATH
+COPY ascent_build.patch /tmp/build-ascent
 
-COPY build_ascent_cuda.sh /tmp/build-ascent/build_ascent_cuda.sh
+## NOTE: with enable_cuda=ON, you need a Docker VM with a LARGE amount of RAM (at least 15 GB RAM, 4 GB swap)
 
+# commit version is dev branch on 2025-04-10
 RUN cd /tmp/build-ascent && \
-    bash build_ascent_cuda.sh && \
+    wget https://github.com/Alpine-DAV/ascent/archive/4da1379.tar.gz && \
+    tar xzf 4da1379.tar.gz -C . --strip-components=1 && \
+    wget https://github.com/LLNL/blt/archive/refs/tags/v0.6.2.tar.gz && \
+    tar xzf v0.6.2.tar.gz -C ./src/blt --strip-components=1 && \
+    cd ./scripts/build_ascent && \
+    patch -p1 build_ascent.sh /tmp/build-ascent/ascent_build.patch && \
+    env enable_cuda=ON enable_mpi=ON build_hdf5=false build_silo=false bash build_ascent.sh && \
     cd / && \
     rm -rf /tmp/build-ascent
 
-# manually downgrade numpy as deprecated `typeDict` is still used by h5py
-RUN pip install numpy==1.21
+# Technically not necessary (as we installed the api above) but makes it easier for package discovery
+RUN env openPMD_USE_MPI=ON python3 -m pip install openpmd-api --no-binary openpmd-api --break-system-packages
+
+# create new user
+RUN groupadd -g 109 render
+RUN useradd --create-home --shell /bin/bash -G render,sudo ci
+
+USER ci
+
+WORKDIR /home/ci

scripts/docker/ascent_build.patch

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
+--- build_ascent.sh 2024-08-29 21:00:24.000000000 +0000
++++ build_ascent_parthenon.sh 2024-08-30 09:55:58.976365723 +0000
+@@ -21,6 +21,8 @@
+ # Build Options
+ ##############################################################################
+ 
++export MAKEFLAGS="--output-sync=target"
++
+ # shared options
+ enable_cuda="${enable_cuda:=OFF}"
+ enable_hip="${enable_hip:=OFF}"
+ 
+@@ -126,8 +128,8 @@
+ root_dir=$(ospath ${root_dir})
+ root_dir=$(abs_path ${root_dir})
+ script_dir=$(abs_path "$(dirname "${BASH_SOURCE[0]}")")
+-build_dir=$(ospath ${root_dir}/build)
+-source_dir=$(ospath ${root_dir}/source)
++build_dir=$(ospath build)
++source_dir=$(ospath source)
+ 
+ 
+ # root_dir is where we will build and install
+@@ -140,7 +142,7 @@
+ 
+ # install_dir is where we will install
+ # override with `prefix` env var
+-install_dir="${install_dir:=$root_dir/install}"
++install_dir=/usr/local
+ 
+ echo "*** prefix: ${root_dir}"
+ echo "*** build root: ${build_dir}"
+@@ -231,7 +233,7 @@
+ hdf5_short_version=1.14
+ hdf5_src_dir=$(ospath ${source_dir}/hdf5-${hdf5_version})
+ hdf5_build_dir=$(ospath ${build_dir}/hdf5-${hdf5_version}/)
+-hdf5_install_dir=$(ospath ${install_dir}/hdf5-${hdf5_version}/)
++hdf5_install_dir=/usr/local/hdf5/parallel
+ hdf5_tarball=$(ospath ${source_dir}/hdf5-${hdf5_version}.tar.gz)
+ 
+ # build only if install doesn't exist
+@@ -650,7 +650,7 @@ fi # if enable_hip || enable_sycl
+ ################
+ # VTK-m
+ ################
+-vtkm_version=v2.2.0
++vtkm_version=v2.3.0
+ vtkm_src_dir=$(ospath ${source_dir}/vtk-m-${vtkm_version})
+ vtkm_build_dir=$(ospath ${build_dir}/vtk-m-${vtkm_version})
+ vtkm_install_dir=$(ospath ${install_dir}/vtk-m-${vtkm_version}/)
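
Dockerfile.nvcc drives the patched script with `env enable_cuda=ON enable_mpi=ON build_hdf5=false build_silo=false bash build_ascent.sh`. That works because the script reads its options with the `${var:=default}` expansion, which keeps a value already present in the environment and otherwise assigns the default. The sketch below isolates that pattern; the function name `show_cuda_setting` is hypothetical and exists only for illustration.

```shell
#!/usr/bin/env bash
# Illustration of the option pattern used in build_ascent.sh:
# a value from the environment wins, otherwise the default (OFF) is assigned.
show_cuda_setting() {
  enable_cuda="${enable_cuda:=OFF}"
  echo "enable_cuda=${enable_cuda}"
}

# Example:
#   show_cuda_setting                  # uses the default when the variable is unset
#   enable_cuda=ON show_cuda_setting   # the caller's value is kept
```

Note that the patch replaces `install_dir="${install_dir:=$root_dir/install}"` with a fixed `install_dir=/usr/local`, so the install prefix can no longer be overridden from the environment.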
