Notes on installing GROMACS 2025.2 on CentOS 7 - Laboratory of Chemical Biology, Faculty of Agriculture, Kagawa University

Contents

1 Installing GROMACS 2025.2 on CentOS 7 (CUDA 12.3; TITAN RTX)
2 Installing and testing GROMACS 2025.3 with support for Neural Network Potentials on AlmaLinux 8 (CUDA 12.8; TITAN RTX)
3 Performance comparison
- 3.1 GROMACS 2025.2
  - 3.1.0.1 CPU only (8 cores)
  - 3.1.0.2 1 GPU + 8 CPU

Installing GROMACS 2025.2 on CentOS 7 (CUDA 12.3; TITAN RTX)

$ wget https://ftp.gromacs.org/gromacs/gromacs-2025.2.tar.gz
$ tar xvzf gromacs-2025.2.tar.gz
$ cd gromacs-2025.2
$ mkdir build
$ cd build
$ scl enable devtoolset-11 bash
$ sudo rm /usr/local/cuda
$ sudo ln -s /usr/local/cuda-12.3 /usr/local/cuda
$ ccmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_API=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/2025.2_cuda -DGMX_GPU=CUDA -DCMAKE_CUDA_ARCHITECTURES=native
$ make -j 18
$ make check -j 18
$ sudo make install
$ source /usr/local/gromacs/2025.2_cuda/bin/GMXRC
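After sourcing GMXRC, one quick way to confirm that the installed binary was actually built with CUDA is to check the "GPU support" line in the output of `gmx --version`. The Python sketch below scripts that check. The helper names `parse_gmx_version` and `gpu_backend` are hypothetical, and the sketch assumes `gmx` is on PATH. These notes do not verify whether GROMACS prints this information to stdout or stderr, so the sketch parses both.

```python
import subprocess


def parse_gmx_version(text: str) -> dict:
    """Turn the 'Key:   value' lines of `gmx --version` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip banner lines such as ':-) GROMACS ... (-:' that have no key
        if sep and key.strip() and value.strip():
            info[key.strip()] = value.strip()
    return info


def gpu_backend(gmx: str = "gmx") -> str:
    """Run `gmx --version` and return the reported GPU support (e.g. 'CUDA')."""
    out = subprocess.run([gmx, "--version"], capture_output=True, text=True, check=True)
    return parse_gmx_version(out.stdout + out.stderr).get("GPU support", "unknown")
```

With the 2025.2_cuda build, `gpu_backend()` would be expected to return `"CUDA"`.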

Installing and testing GROMACS 2025.3 with support for Neural Network Potentials on AlmaLinux 8 (CUDA 12.8; TITAN RTX)

Reference: Gromacs 2025.2 with CUDA support – Research Center for Computational Science

ANI is an abbreviation of ANAKIN-ME (Accurate NeurAl networK engINe for Molecular Energies). It is a neural network potential framework designed to give quantum-chemical accuracy (close to DFT) for molecular energies and forces, at a computational cost far below that of DFT, though still above that of classical force fields. ANI-2x can be used to simulate organic molecules containing seven elements (H, C, N, O, F, Cl, and S).
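ANI-2x only covers these seven elements, so it is worth checking the NN region before building a model. The minimal Python sketch below does this. The function names are illustrative and not part of TorchANI. It uses the atomic numbers of N-acetyl-L-alanine methylamide from the export script in the section on exporting the pretrained models.

```python
from collections import Counter

# Atomic numbers of the seven elements covered by ANI-2x
ANI2X_ELEMENTS = {1: "H", 6: "C", 7: "N", 8: "O", 9: "F", 16: "S", 17: "Cl"}


def unsupported_elements(atomic_numbers):
    """Atomic numbers present in the NN region that ANI-2x does not cover."""
    return sorted(set(atomic_numbers) - set(ANI2X_ELEMENTS))


def formula(atomic_numbers):
    """Formula string with C and H first, then alphabetical; assumes ANI-2x elements only."""
    counts = Counter(ANI2X_ELEMENTS[z] for z in atomic_numbers)
    order = [s for s in ("C", "H") if s in counts]
    order += sorted(s for s in counts if s not in ("C", "H"))
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


# N-acetyl-L-alanine methylamide (22 atoms), same order as in the export script
ace_ala_nme = [1, 6, 1, 1, 6, 8, 7, 1, 6, 1, 6, 1, 1, 1, 6, 8, 7, 1, 6, 1, 1, 1]
print(unsupported_elements(ace_ala_nme))
print(formula(ace_ala_nme))
```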

Test results

System: N-Acetyl-L-alanine methylamide (22 atoms) in water (6765 atoms)

dt = 0.001 ; ps (1 fs)

tcoupl = V-rescale

pcoupl = C-rescale

GPU, NVIDIA TITAN-RTX

command: gmx mdrun -deffnm md -ntmpi 1 -ntomp 1 -resethway -noconfout

Conditions	Performance (ns/day)	Speedup (relative to NNP/MM on 1 CPU)
NNP/MM (TorchANI) with GMX_NN_DEVICE=cpu (1 CPU)	2.482	x 1
NNP/MM (TorchANI) with GMX_NN_DEVICE=cuda	3.901	x 1.6
NNP/MM (TorchANI/NNPOps)	57.793	x 23
MM with 1 CPU	16.515	(x 6.7)
MM with 1 CPU and 1 GPU	635.914	(x 256)
MM with 2 CPU and 1 GPU	706.947	(x 285)
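Each value in the speedup column is that row's ns/day divided by the first row (NNP/MM on 1 CPU, 2.482 ns/day). A short Python check of that arithmetic, using the values from the table:

```python
# ns/day values from the table above
perf = {
    "NNP/MM (TorchANI), GMX_NN_DEVICE=cpu": 2.482,
    "NNP/MM (TorchANI), GMX_NN_DEVICE=cuda": 3.901,
    "NNP/MM (TorchANI/NNPOps)": 57.793,
    "MM, 1 CPU": 16.515,
    "MM, 1 CPU + 1 GPU": 635.914,
    "MM, 2 CPU + 1 GPU": 706.947,
}

# Every row is compared against the NNP/MM run on a single CPU core
baseline = perf["NNP/MM (TorchANI), GMX_NN_DEVICE=cpu"]
speedup = {name: value / baseline for name, value in perf.items()}

for name, s in speedup.items():
    print(f"{name:40s} x {s:.1f}")
```

The same numbers also show that even the NNPOps-accelerated NNP/MM run is about 11 times slower than plain MM on the GPU (635.914 / 57.793).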

Installing libtorch

$ sudo dnf install gcc-toolset-13 ninja-build cudnn-cuda-12 cudss-cuda-12 cusparselt-cuda-12 libnccl libnccl-devel openblas
$ conda create -n libtorch271 python=3.13
$ conda activate libtorch271
(libtorch271) $ conda install numpy pyyaml typing_extensions
(libtorch271) $ scl enable gcc-toolset-13 bash
(libtorch271) $ wget https://github.com/pytorch/pytorch/releases/download/v2.7.1/pytorch-v2.7.1.tar.gz
(libtorch271) $ tar xvzf pytorch-v2.7.1.tar.gz
(libtorch271) $ cd pytorch-v2.7.1
(libtorch271) $ cd third_party
(libtorch271) $ git clone https://github.com/NVIDIA/nccl.git
(libtorch271) $ cd nccl
(libtorch271) $ git checkout v2.21.5-1 
(libtorch271) $ cd ../../
(libtorch271) $ mkdir build && cd build
(libtorch271) $ cmake .. -GNinja -DBLAS=OpenBLAS -DBUILD_FUNCTORCH=OFF -DBUILD_PYTHON=False\
 -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/opt/libtorch/2.7.1/cu128\
 -DPython_EXECUTABLE=/home/user/miniforge3/envs/libtorch271/bin/python -DTORCH_BUILD_VERSION=2.7.1\
 -DCMAKE_PREFIX_PATH=/usr/local/cuda-12.8 -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_CUDSS=ON -DUSE_CUSPARSELT=ON\
 -DUSE_NUMPY=True -DCUDA_NVRTC_SHORTHASH=c3430e8b
(libtorch271) $ ninja -j2
(libtorch271) $ sudo ninja install
(libtorch271) $ cd ../../
(libtorch271) $ conda deactivate
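Before configuring GROMACS, it can help to confirm that `ninja install` actually placed the CMake package files for libtorch under the install prefix. CMake's `find_package(Torch)` locates libtorch through `TorchConfig.cmake`. The Python sketch below assumes the prefix `/opt/libtorch/2.7.1/cu128` used above and the usual `share/cmake/Torch` layout. It falls back to a recursive search if that layout differs. The helper name `find_torch_config` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_torch_config(prefix: str) -> Optional[Path]:
    """Directory containing TorchConfig.cmake below an install prefix, or None."""
    root = Path(prefix)
    if not root.is_dir():
        return None
    # Assumed usual layout of a libtorch install tree
    expected = root / "share" / "cmake" / "Torch" / "TorchConfig.cmake"
    if expected.is_file():
        return expected.parent
    # Fall back to a recursive search in case the layout differs
    return next((p.parent for p in root.rglob("TorchConfig.cmake")), None)
```

For example, `find_torch_config("/opt/libtorch/2.7.1/cu128")` should return a directory rather than `None` if the install succeeded.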

Installing GROMACS 2025.3

$ export BASEDIR=$PWD
$ wget https://ftp.gromacs.org/gromacs/gromacs-2025.3.tar.gz
$ wget https://www.fftw.org/fftw-3.3.10.tar.gz
$ tar xvzf gromacs-2025.3.tar.gz
$ cd gromacs-2025.3
$ mkdir build && cd build
$ export FFTW_PATH=${BASEDIR}/fftw-3.3.10.tar.gz
$ export Torch_DIR=/opt/libtorch/2.7.1/cu128
$ /usr/local/cmake/4.1.0/bin/cmake .. -DCMAKE_PREFIX_PATH="${Torch_DIR};/usr/local/cuda-12.8"\
 -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/2025.3_cuda_torchcu128 -DGMX_GPU=CUDA -DGMX_USE_CUFFTMP=OFF\
 -DGMX_NNPOT=TORCH -DCAFFE2_USE_CUDNN=ON -DCAFFE2_USE_CUSPARSELT=ON -DUSE_CUDSS=ON\
 -DPython_EXECUTABLE=/home/user/miniforge3/envs/libtorch271/bin/python -DGMX_BUILD_OWN_FFTW=ON\
 -DGMX_BUILD_OWN_FFTW_URL=${FFTW_PATH} -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc\
 -DCMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES=/usr/local/cuda-12.8/targets/x86_64-linux/lib\
 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.8 -DGMX_INSTALL_NBLIB_API=OFF\
 -DCUDA_NVRTC_SHORTHASH=c3430e8b -DREGRESSIONTEST_DOWNLOAD=ON\
 -Dnvtx3_dir=/usr/local/cuda-12.8/targets/x86_64-linux/include/nvtx3
$ make -j37
$ make -j37 check
$ sudo make install
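Environment variable names are case-sensitive on Linux, so `Torch_DIR` and `TORCH_DIR` are two different variables. In addition, the `;` in `CMAKE_PREFIX_PATH` is a shell command separator and must stay quoted. If the configure step is driven from a script, passing arguments as a list avoids the quoting issue entirely. The Python sketch below illustrates this for a subset of the flags above only. The function name `nnpot_cmake_args` is hypothetical.

```python
import os
from typing import List


def nnpot_cmake_args(cuda_root: str = "/usr/local/cuda-12.8") -> List[str]:
    """A subset of the NNP-related configure flags, built from the environment."""
    # Names are case-sensitive: TORCH_DIR would be a different (unset) variable
    torch_dir = os.environ.get("Torch_DIR")
    if not torch_dir:
        raise RuntimeError("Torch_DIR is not set; export it before configuring GROMACS")
    return [
        # As a list element passed without a shell, the ';' needs no quoting
        f"-DCMAKE_PREFIX_PATH={torch_dir};{cuda_root}",
        "-DGMX_GPU=CUDA",
        "-DGMX_NNPOT=TORCH",
    ]
```

A call such as `subprocess.run(["cmake", "..", *nnpot_cmake_args()], check=True)` would then pass the prefix list to CMake unchanged.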

Exporting the pretrained models

Reference: Neural Network Potentials – GROMACS Reference Manual

$ conda activate libtorch271
(libtorch271) $ pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
(libtorch271) $ conda install jupyterlab ipywidgets
(libtorch271) $ conda install cuda-version==12.8
(libtorch271) $ conda install -c conda-forge torchani nnpops
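The models are exported with the pip-installed PyTorch, but GROMACS loads them through the self-built libtorch 2.7.1. Keeping the two releases identical seems prudent. That is an assumption of this sketch, not a requirement quoted from the GROMACS documentation. A minimal check in Python follows; the function names are hypothetical.

```python
def base_version(version: str) -> str:
    """Drop the local build tag, e.g. '2.7.1+cu128' -> '2.7.1'."""
    return version.split("+", 1)[0]


def matches_libtorch(torch_version: str, libtorch_version: str = "2.7.1") -> bool:
    """True if the Python-side PyTorch has the same release as the libtorch GROMACS links."""
    return base_version(torch_version) == libtorch_version
```

Inside the `libtorch271` environment, `import torch; print(matches_libtorch(torch.__version__), torch.version.cuda)` would be expected to print `True 12.8` for the wheel installed above.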

Code used: lmuellender/gmx-nnpot-wrapper – GitHub

(libtorch271) $ cd gmx-nnpot-wrapper-main

Create the model files with the following Python script:

import os

import torch
import torchani
from models import GmxANIModel  # from gmx-nnpot-wrapper

# Device setting
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# TorchANI site-packages directory
ta_dir = os.path.dirname(torchani.__file__)

# Look for the cuAEV .so file
def find_cuaev_so(base):
    for fname in os.listdir(base):
        if "cuaev" in fname and fname.endswith(".so"):
            return os.path.join(base, fname)
    return None

cuaev_so = find_cuaev_so(ta_dir)
assert cuaev_so, f"cuAEV .so not found in: {ta_dir}"

# 1) Plain ANI-2x model
model = GmxANIModel(version=2, device=device)

# Embed the extension library path when saving as TorchScript
save_path = 'models/ani2x.pt'
extra_files = {"extension_libs": cuaev_so}
torch.jit.script(model).save(save_path, _extra_files=extra_files)
print(f"Saved model to {save_path} with cuAEV library: {cuaev_so}")

# 2) ANI-2x model using the cuAEV kernel
model = GmxANIModel(use_opt='cuaev', version=2, device=device)

# Embed the extension library path when saving as TorchScript
save_path = 'models/ani2x_cuaev.pt'
extra_files = {"extension_libs": cuaev_so}
torch.jit.script(model).save(save_path, _extra_files=extra_files)
print(f"Saved model to {save_path} with cuAEV library: {cuaev_so}")

# 3) ANI-2x model using NNPOps (needs the atomic numbers of the NN region)
# Example atomic number tensor for N-acetyl-L-alanine methylamide
atomic_numbers = torch.tensor([1,6,1,1,6,8,7,1,6,1,6,1,1,1,6,8,7,1,6,1,1,1], device=device)
model = GmxANIModel(use_opt='nnpops', atomic_numbers=atomic_numbers, version=2, device=device)

# NNPOps can be found by checking for loaded torch extension libraries
ext_lib = [lib for lib in torch.ops.loaded_libraries if lib]
# If multiple extensions are found, they are separated by ':'
ext_lib = ":".join(ext_lib)
print("loaded extension libraries: ", ext_lib)

save_path = 'models/ani2x_nnpops.pt'
extra_files = {"extension_libs": ext_lib}
torch.jit.script(model).save(save_path, _extra_files=extra_files)
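The export script stores absolute paths to the extension libraries under the `extension_libs` key. A saved model can therefore break if the conda environment is later moved or removed. The stored value can be read back with `torch.jit.load(path, _extra_files={"extension_libs": ""})`, which fills the dict in place, possibly as bytes. The hypothetical Python helper below then lists any entries that no longer exist on disk.

```python
import os
from typing import List, Union


def missing_extension_libs(extension_libs: Union[str, bytes]) -> List[str]:
    """Entries of a ':'-joined extension_libs value that are not files on disk."""
    if isinstance(extension_libs, bytes):
        extension_libs = extension_libs.decode()
    return [p for p in extension_libs.split(":") if p and not os.path.isfile(p)]
```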

(libtorch271) $ cd examples
(libtorch271) $ source /usr/local/gromacs/2025.3_cuda_torchcu128/bin/GMXRC
(libtorch271) $ export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
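The `export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"` command above has one subtlety. If `LD_LIBRARY_PATH` was previously empty, the result ends in `:`, and glibc's loader treats that empty entry as the current directory. If the environment is instead prepared from a Python driver script (a hypothetical setup, not part of these notes), a helper like the one below avoids the problem. Changing `os.environ` only affects processes started afterwards, such as `gmx` launched via `subprocess`.

```python
import os
from typing import Optional


def prepend_path(directory: str, current: Optional[str]) -> str:
    """Prepend a directory to a ':'-separated search path, dropping empty and duplicate entries."""
    rest = [p for p in (current or "").split(":") if p and p != directory]
    return ":".join([directory] + rest)
```

Usage would be `os.environ["LD_LIBRARY_PATH"] = prepend_path(os.path.join(os.environ["CONDA_PREFIX"], "lib"), os.environ.get("LD_LIBRARY_PATH"))`.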

Testing a neural network potential (NNP)/MM simulation (N-acetyl-L-alanine methylamide in water)

Using ani2x.pt with CPU

(libtorch271) $ export GMX_NN_DEVICE=cpu
(libtorch271) $ gmx grompp -f md.mdp -c conf.gro -p topol.top -o md.tpr
Neural network potential Interface is active, topology was modified!
Number of NN input atoms: 22
Number of regular atoms: 6765
Bonds removed: 62
Angles removed: 36
Dihedrals removed: 42
Connection-only (type 5) bonds added: 21
(libtorch271) $ gmx mdrun -deffnm md -ntmpi 1 -ntomp 8
                      :-) GROMACS - gmx mdrun, 2025.3 (-:

Reading file md.tpr, VERSION 2025.3 (single precision)
Changing nstlist from 15 to 100, rlist from 1.2 to 1.302

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 1 OpenMP threads 
    
               Core t (s)   Wall t (s)        (%)
       Time:      174.068      174.069      100.0
                 (ns/day)    (hour/ns)
Performance:        2.482        9.669

Using ani2x.pt with CUDA

(libtorch271) $ export GMX_NN_DEVICE=cuda
(libtorch271) $ gmx mdrun -deffnm md -ntmpi 1 -ntomp 1

               Core t (s)   Wall t (s)        (%)
       Time:      110.756      110.756      100.0
                 (ns/day)    (hour/ns)
Performance:        3.901        6.152

GROMACS reminds you: "Go back to the rock from under which you came" (Fiona Apple)

Segmentation fault (core dumped)
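Moving the plain ani2x.pt model to the GPU cuts the wall time from 174 s to 111 s, only about 1.6 times faster. The two performance columns printed by GROMACS are related by hour/ns = 24 / (ns/day). The short Python check below reproduces this for both runs, up to rounding of the printed ns/day values.

```python
def hours_per_ns(ns_per_day: float) -> float:
    """The 'hour/ns' column is 24 hours divided by 'ns/day'."""
    return 24.0 / ns_per_day


cpu, cuda = 2.482, 3.901  # ns/day for ani2x.pt with GMX_NN_DEVICE=cpu / cuda
print(f"{hours_per_ns(cpu):.2f} h/ns vs {hours_per_ns(cuda):.2f} h/ns, x {cuda / cpu:.2f}")
```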

Using ani2x_cuaev.pt

(libtorch271) $ sed 's/ani2x.pt/ani2x_cuaev.pt/g' md.mdp > md_cuaev.mdp
(libtorch271) $ gmx grompp -f md_cuaev.mdp -c conf.gro -p topol.top -o md_cuaev.tpr
RuntimeError: AssertionError: cuaev currently does not support PBC

Using ani2x_nnpops.pt

(libtorch271) $ sed 's/ani2x.pt/ani2x_nnpops.pt/g' md.mdp > md_nnpops.mdp
(libtorch271) $ gmx grompp -f md_nnpops.mdp -c conf.gro -p topol.top -o md_nnpops.tpr -maxwarn 1
(libtorch271) $ gmx mdrun -v -deffnm md_nnpops -ntmpi 1 -ntomp 1
      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Neighbor search           1    1         51       0.149          0.447   2.0
 Launch PP GPU ops.        1    1       9951       0.206          0.618   2.8
 Force                     1    1       5001       0.019          0.058   0.3
 NN potential              1    1       5001       6.717         20.153  89.8
 PME GPU mesh              1    1       5001       0.144          0.432   1.9
 Wait GPU NB local         1    1       5001       0.001          0.004   0.0
 Wait GPU state copy       1    1      10256       0.012          0.036   0.2
 NB X/F buffer ops.        1    1         51       0.001          0.003   0.0
 Write traj.               1    1         51       0.063          0.190   0.8
 Kinetic energy            1    1        101       0.004          0.013   0.1
 Rest                                              0.159          0.477   2.1
--------------------------------------------------------------------------------
 Total                                             7.476         22.430 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 Wait PME GPU gather       1    1       5001       0.001          0.003   0.0
 Reduce GPU PME F          1    1       5001       0.001          0.002   0.0
 Launch PME GPU ops.       1    1      45009       0.133          0.400   1.8
--------------------------------------------------------------------------------
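The accounting table shows that the NN potential alone takes roughly 90 % of the wall time, even with NNPOps. The Python sketch below reproduces that share and gives a rough cross-check of the 57.793 ns/day figure in the results table. The 5000-step count is an assumption inferred from the 5001 Force calls. The result is close to, but not exactly, the reported value.

```python
# Wall times (s) of the main activities, copied from the accounting table above
wall = {
    "Neighbor search": 0.149,
    "Launch PP GPU ops.": 0.206,
    "Force": 0.019,
    "NN potential": 6.717,
    "PME GPU mesh": 0.144,
    "Wait GPU NB local": 0.001,
    "Wait GPU state copy": 0.012,
    "NB X/F buffer ops.": 0.001,
    "Write traj.": 0.063,
    "Kinetic energy": 0.004,
    "Rest": 0.159,
}
total = 7.476  # "Total" row

share = {name: 100.0 * t / total for name, t in wall.items()}
print(f"NN potential: {share['NN potential']:.1f} % of wall time")

# Assumed: nsteps = 5000 with dt = 0.001 ps, i.e. 5 ps = 0.005 ns simulated
ns_simulated = 5000 * 0.001 / 1000.0
print(f"about {ns_simulated / total * 86400:.1f} ns/day")
```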

Performance comparison

GROMACS 2025.2

The ADH cubic test case (134,177 atoms) was used as input. CPU: Intel Core i9-9980XE 3.00 GHz; GPU: NVIDIA TITAN RTX.

CPU only (8 cores)

$ gmx mdrun -deffnm adh_cubic_test2025.2 -pin on -resethway -noconfout -ntmpi 1 -ntomp 8 -bonded cpu -update cpu -nb cpu -pme cpu

starting mdrun 'NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water'
10000 steps,     20.0 ps.

step 5000: resetting all time and cycle counters

               Core t (s)   Wall t (s)        (%)
       Time:      446.183       55.773      800.0
                 (ns/day)    (hour/ns)
Performance:       15.494        1.549

1 GPU + 8 CPU

$ gmx mdrun -deffnm adh_cubic_test2025.2 -pin on -resethway -noconfout -ntmpi 1 -ntomp 8 -bonded gpu -update gpu -nb gpu -pme gpu -nsteps 20000

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 8 OpenMP threads 

starting mdrun 'NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water'
20000 steps,     40.0 ps.

step 10000: resetting all time and cycle counters

               Core t (s)   Wall t (s)        (%)
       Time:       71.585        8.950      799.8
                 (ns/day)    (hour/ns)
Performance:      193.089        0.124
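With `-resethway`, the counters are reset halfway through the run (step 5000 and step 10000 in the two logs above). The reported ns/day is therefore based on the second half of each run only. The Python sketch below reproduces both figures from the log values and gives the GPU speedup on this system, about 12.5 times. The function name `ns_per_day` is illustrative.

```python
SECONDS_PER_DAY = 86400.0


def ns_per_day(nsteps: int, dt_ps: float, wall_s: float, resethway: bool = False) -> float:
    """Simulated ns per wall-clock day; with -resethway only the second half of the steps is timed."""
    timed_steps = nsteps // 2 if resethway else nsteps
    return timed_steps * dt_ps / 1000.0 / wall_s * SECONDS_PER_DAY


# dt = 20 ps / 10000 steps = 0.002 ps; wall times from the two logs above
cpu = ns_per_day(10000, 0.002, 55.773, resethway=True)
gpu = ns_per_day(20000, 0.002, 8.950, resethway=True)
print(f"CPU only: {cpu:.1f} ns/day, 1 GPU + 8 CPU: {gpu:.1f} ns/day, speedup x {gpu / cpu:.1f}")
```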