I have a requirement for a Docker container to utilise the NVidia CUDA system.
Currently I use an Ubuntu Server VM in ESXi 6.7u2 with the NVidia GFX card passed through exclusively. I want to move to Photon OS due to its lower system footprint, and to consolidate OS types.
I found the following answer on this community, but following those steps results in errors.
Below are the steps followed, which combine the instructions from the VMware Communities post and the NVidia Installation Guide for Docker.
Any help or advice is welcome.
VM Creation
Create new VM in ESXi
Add PCI Device and select GP107GL [Quadro P620]
20GB disk - thin prov
8GB RAM - All reserved
Mount disk ISO of photon-minimal-4.0-rev2-c001795b8.iso
Advanced VM setting:
hypervisor.cpuid.v0 = FALSE
Photon Install
>> Start VM
>> Select "VMware kernel (not generic linux)"
# System update
tdnf -y update
tdnf -y upgrade
# Configure SSH
systemctl start sshd
systemctl enable sshd
vim /etc/ssh/sshd_config
PermitRootLogin yes
systemctl restart sshd
# Docker start
systemctl start docker
systemctl enable docker
Install NVidia drivers
# Get sources
tdnf install -y linux-esx-devel
reboot
# install kernel api headers and devel
tdnf install -y build-essential wget tar
# Resize tmp
umount /tmp
mount -t tmpfs -o size=2G tmpfs /tmp
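(An aside: this tmpfs resize doesn't survive a reboot. If the driver build needs re-running later, an /etc/fstab entry could make it persistent; a sketch, with the size simply mirroring the 2G used above:)

```shell
# Optional: persist the enlarged /tmp across reboots via /etc/fstab
# (sketch; 2G matches the size used for the driver build above)
echo "tmpfs /tmp tmpfs size=2G 0 0" >> /etc/fstab
```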
# NVidia drivers from here: https://www.nvidia.com/en-us/drivers/unix/
wget https://uk.download.nvidia.com/XFree86/Linux-x86_64/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
chmod a+x ./NVIDIA-Linux-x86_64-525.105.17.run
./NVIDIA-Linux-x86_64-525.105.17.run
reboot
# check nvidia device is found
nvidia-smi
Thu Apr 27 07:19:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:0B:00.0 Off |                  N/A |
| 40%   47C    P0    N/A /  40W |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Drivers installed ok.
Install NVidia Container Toolkit
# Setup the package repository and the GPG key:
tdnf install -y gpg
cd /etc/pki/rpm-gpg/
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /etc/pki/rpm-gpg/nvidia-container-toolkit-keyring.gpg
cat << EOF >>/etc/yum.repos.d/nvidia-container-toolkit.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
gpgcheck=0
enabled=1
EOF
# Install the toolkit
tdnf makecache
tdnf install -y nvidia-container-toolkit
# Register the runtime with docker
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
rm /etc/yum.repos.d/nvidia-container-toolkit.repo
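For reference, when the registration step succeeds, nvidia-ctk should leave an entry along these lines in /etc/docker/daemon.json (a sketch from memory; exact formatting can vary by toolkit version):

```shell
# Inspect what nvidia-ctk runtime configure wrote; expected shape (sketch):
cat /etc/docker/daemon.json
#   {
#       "runtimes": {
#           "nvidia": {
#               "path": "nvidia-container-runtime",
#               "args": []
#           }
#       }
#   }
```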
Test with a base CUDA container
According to the installation guide, the output of the following should match the nvidia-smi table above:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
but I get:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.
dmesg:
[36185.054996] audit: type=1006 audit(1682579973.442:412): pid=20385 uid=0 subj=unconfined old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=3 res=1
[36823.023793] docker0: port 2(veth5e0f5e7) entered blocking state
[36823.023796] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.023843] device veth5e0f5e7 entered promiscuous mode
[36823.023864] audit: type=1700 audit(1682580611.410:413): dev=veth5e0f5e7 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
[36823.109463] nvc:[driver][20748]: segfault at 30 ip 00007f50a8466866 sp 00007fff51909d30 error 4 in libnvidia-container.so.1.13.1[7f50a8444000+39000]
[36823.109468] Code: 00 e8 fe 4a 00 00 39 c5 7c 12 45 85 e4 0f 85 f9 00 00 00 5b 5d 41 5c c3 0f 1f 40 00 48 8b 05 21 af 21 00 48 63 fd 48 8d 04 f8 <48> 39 18 75 db 81 fd ff 03 00 00 48 c7 00 00 00 00 00 7f 7e e8 e1
[36823.109496] audit: type=1701 audit(1682580611.498:414): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=unconfined pid=20748 comm="nvc:[driver]" exe="/usr/bin/nvidia-container-cli" sig=11 res=1
[36823.262165] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.262536] device veth5e0f5e7 left promiscuous mode
[36823.262549] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.262576] audit: type=1700 audit(1682580611.650:415): dev=veth5e0f5e7 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
Searching the internet reveals people on various platforms hitting the same error, but no general resolution.
As I said, this works fine with Ubuntu, but I would like to consolidate my VMs onto Photon.
Any help or advice is welcome.
---------
Photon OS 5_RC
Trying with photon-minimal-5.0_RC-4d5974638.x86_64 doesn't work:
* Installing the drivers works ok
* Installing the NVidia-container-toolkit works ok
Registering the toolkit with docker
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
Results in the error
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
ERRO[0000] unable to flush config: unable to open /etc/docker/daemon.json for writing: open /etc/docker/daemon.json: no such file or directory
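The "no such file or directory" on a path it just claimed to be creating suggests the parent directory /etc/docker may not exist on the Photon 5 RC image. As a workaround sketch (an untested assumption on my part), the directory and file could be seeded by hand with the same content nvidia-ctk generates:

```shell
# Seed the config directory, then either re-run nvidia-ctk or write the
# runtime entry directly (same content nvidia-ctk would generate)
mkdir -p /etc/docker
cat > /etc/docker/daemon.json <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "args": []
        }
    }
}
EOF
systemctl restart docker
```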
Trying to manually register the runtime with a systemd drop-in works:
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo systemctl daemon-reload \
&& sudo systemctl restart docker
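To confirm the drop-in actually registered the runtime, docker's own view can be checked (sketch; the Runtimes field lists the registered OCI runtimes):

```shell
# The nvidia runtime should appear among docker's registered runtimes
docker info --format '{{json .Runtimes}}' | grep -q '"nvidia"' && echo "nvidia runtime registered"
```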
Running the CUDA test container still results in the same error message, but dmesg is slightly different:
[ 2057.563936] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
[ 2057.563941] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
[ 2057.563946] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
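Those __vm_enough_memory lines look like the kernel's overcommit accounting refusing the allocation, so Photon 5's default vm.overcommit_memory may be stricter than Ubuntu's. As a diagnostic only (an assumption, not a confirmed fix), relaxing it and retrying might narrow this down:

```shell
# Check the current overcommit policy (0 = heuristic, 1 = always allow,
# 2 = strict accounting), relax it temporarily, and retry the CUDA test
sysctl vm.overcommit_memory
sysctl -w vm.overcommit_memory=1
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
```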