Photon OS

  • 1.  Photon OS v4 or v5 with NVidia CUDA

    Posted Apr 27, 2023 09:40 AM

    I have a requirement for a docker container to utilise the NVidia CUDA system.

Currently I use an Ubuntu Server VM in ESXi 6.7u2 with the NVidia GFX card passed through exclusively. I want to move to Photon OS due to its lower system footprint, and to consolidate OS types!

    I found the following answer on this community but following those steps results in errors.

    Below are the steps followed, which combine the instructions from the VMware Communities post, and the NVidia Installation Guide for Docker.

Any help or advice is welcome.

    VM Creation

     

Create new VM in ESXi
Add PCI Device and select GP107GL [Quadro P620]
20 GB disk, thin provisioned
8 GB RAM, all reserved
Mount ISO of photon-minimal-4.0-rev2-c001795b8.iso
Advanced VM setting (sketched below):
Hypervisor.CPUID.v0 FALSE
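
For reference, that setting ends up as a single line in the VM's .vmx file. A minimal sketch of the entry (it hides the hypervisor from the guest; NVidia drivers have historically refused to initialise passed-through cards when they detect one):

    hypervisor.cpuid.v0 = "FALSE"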

     

    Photon Install

     

    >> Start VM
>> Select "VMware kernel" (not generic Linux)
    
    # System update
    tdnf -y update
    tdnf -y upgrade
    
    # Configure SSH
    systemctl start sshd
    systemctl enable sshd
    vim /etc/ssh/sshd_config
PermitRootLogin yes    # note: the sshd_config directive is PermitRootLogin, not AllowRootLogin
    systemctl restart sshd
    
    # Docker start
    systemctl start docker
    systemctl enable docker
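
Optional sanity check that the daemon is actually up before going further:

    docker info --format '{{.ServerVersion}}'   # prints the daemon version when it is running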

     

    Install NVidia drivers

     

# Kernel headers/devel for the running linux-esx kernel (reboot so the running kernel matches)
    tdnf install -y linux-esx-devel
    reboot
    
# Install build tools needed by the NVidia installer
    tdnf install -y build-essential wget tar
    
# Resize /tmp (the NVidia .run installer needs more temp space than the default tmpfs provides)
    umount /tmp
    mount -t tmpfs -o size=2G tmpfs /tmp
    
    # NVidia drivers from here: https://www.nvidia.com/en-us/drivers/unix/
    wget https://uk.download.nvidia.com/XFree86/Linux-x86_64/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
    chmod a+x ./NVIDIA-Linux-x86_64-525.105.17.run
    ./NVIDIA-Linux-x86_64-525.105.17.run
    reboot
    
    # check nvidia device is found
    nvidia-smi
    
    Thu Apr 27 07:19:37 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P620         Off  | 00000000:0B:00.0 Off |                  N/A |
    | 40%   47C    P0    N/A /  40W |      0MiB /  2048MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

     

    Drivers installed ok.

    Install NVidia Container Toolkit

     

# Set up the package repository and the GPG key:
    tdnf install -y gpg
    cd /etc/pki/rpm-gpg/
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /etc/pki/rpm-gpg/nvidia-container-toolkit-keyring.gpg
    
    cat << EOF >>/etc/yum.repos.d/nvidia-container-toolkit.repo
    [libnvidia-container]
    name=libnvidia-container
    baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
    gpgcheck=0
    enabled=1
    EOF
    
    # Install the toolkit
    tdnf makecache
    tdnf install nvidia-container-toolkit
    
    # Register the runtime with docker
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker
    
    rm /etc/yum.repos.d/nvidia-container-toolkit.repo
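
Side note on the repo file above: as written, the downloaded GPG key is never used, because gpgcheck is disabled. If you want signature checking enforced, the repo file can reference the key instead; a sketch, untested on tdnf:

    [libnvidia-container]
    name=libnvidia-container
    baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
    gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
    gpgcheck=1
    enabled=1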

     

    Test with a base CUDA container

According to the installation guide, the output of the following should be the same nvidia-smi table as above:

     

    docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi

     

    but I get:

     

    docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

     

    dmesg:

     

    [36185.054996] audit: type=1006 audit(1682579973.442:412): pid=20385 uid=0 subj=unconfined old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=3 res=1
    [36823.023793] docker0: port 2(veth5e0f5e7) entered blocking state
    [36823.023796] docker0: port 2(veth5e0f5e7) entered disabled state
    [36823.023843] device veth5e0f5e7 entered promiscuous mode
    [36823.023864] audit: type=1700 audit(1682580611.410:413): dev=veth5e0f5e7 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
    [36823.109463] nvc:[driver][20748]: segfault at 30 ip 00007f50a8466866 sp 00007fff51909d30 error 4 in libnvidia-container.so.1.13.1[7f50a8444000+39000]
    [36823.109468] Code: 00 e8 fe 4a 00 00 39 c5 7c 12 45 85 e4 0f 85 f9 00 00 00 5b 5d 41 5c c3 0f 1f 40 00 48 8b 05 21 af 21 00 48 63 fd 48 8d 04 f8 <48> 39 18 75 db 81 fd ff 03 00 00 48 c7 00 00 00 00 00 7f 7e e8 e1
    [36823.109496] audit: type=1701 audit(1682580611.498:414): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=unconfined pid=20748 comm="nvc:[driver]" exe="/usr/bin/nvidia-container-cli" sig=11 res=1
    [36823.262165] docker0: port 2(veth5e0f5e7) entered disabled state
    [36823.262536] device veth5e0f5e7 left promiscuous mode
    [36823.262549] docker0: port 2(veth5e0f5e7) entered disabled state
    [36823.262576] audit: type=1700 audit(1682580611.650:415): dev=veth5e0f5e7 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295

     

Searching the internet reveals people on various platforms hitting the same error, but no general resolution.

As I said, this is working fine with Ubuntu, but I would like to consolidate my VMs to use Photon.

Any help or advice is welcome.

    ---------

    Photon OS 5_RC

Trying with photon-minimal-5.0_RC-4d5974638.x86_64 doesn't work either:

    * Installing the drivers works ok
    * Installing the NVidia-container-toolkit works ok

    Registering the toolkit with docker

     

    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

     

This results in the error:

     

    INFO[0000] Loading docker config from /etc/docker/daemon.json
    INFO[0000] Config file does not exist, creating new one
    ERRO[0000] unable to flush config: unable to open /etc/docker/daemon.json for writing: open /etc/docker/daemon.json: no such file or directory
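
One guess at the cause (an assumption on my part): on a minimal install the /etc/docker directory may not exist yet, so nvidia-ctk cannot create daemon.json inside it. Worth trying before resorting to the drop-in below:

    # assumption: the parent directory is simply missing
    mkdir -p /etc/docker
    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker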

     

Trying to manually register the runtime with a systemd drop-in works:

     

    sudo mkdir -p /etc/systemd/system/docker.service.d
    
    sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
    EOF
    
    sudo systemctl daemon-reload \
    && sudo systemctl restart docker

     

Running the CUDA test container still results in the same error message, but dmesg is slightly different:

     

    [ 2057.563936] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
    [ 2057.563941] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
    [ 2057.563946] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation

     

     



  • 2.  RE: Photon OS v4 or v5 with NVidia CUDA

    Posted Apr 27, 2023 11:22 PM

Hi,

A very comprehensive step-by-step guide! Congrats on the research and findings.

     

The NVidia driver page shows a hint about a workaround for a runc issue in the second-latest driver version they published, which is the version you used.

     

[screenshot of the NVidia driver release notes hinting at the runc workaround]

In the NVidia docs, the Step 3 tests explicitly run with sudo. Did that not work?

     

According to https://github.com/opencontainers/runc/issues/3708 the issue has been resolved recently.
The runc spec file for Photon OS 5.0 currently contains 1.1.4, while the latest runc release is 1.1.7; see its release notes about systemd v240+ and DeviceAllow.
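
To check what a given installation actually runs:

    rpm -q runc        # installed package version
    runc --version     # what the binary reports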

     

If it's not possible to go with the latest combo of ESXi/Photon/NVidia driver/CUDA/container toolkit, try an older combo.



Assuming the latest runc version fixes the issue completely, building Photon OS with the latest runc could be a possibility too, although unknown dependencies may be difficult to resolve. The Photon OS team can help: you could ask on https://github.com/vmware/photon/issues to prioritize the newer runc release.
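
A rough sketch of that route; the make target and spec path here are from memory, so treat them as assumptions to verify against the Photon build documentation:

    git clone https://github.com/vmware/photon.git && cd photon
    # bump the Version: tag in SPECS/runc/runc.spec to 1.1.7, then rebuild just that package
    sudo make package PKG_NAME=runc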

     

About the script: EOF is a heredoc delimiter, so when a script contains several such blocks it is clearer to give each one a unique name. You could enumerate each block:

     

    cat << EOF1 >>/etc/yum.repos.d/nvidia-container-toolkit.repo
    [libnvidia-container]
    name=libnvidia-container
    baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
    gpgcheck=0
    enabled=1
    EOF1

     

     

     

     



  • 3.  RE: Photon OS v4 or v5 with NVidia CUDA

    Posted Apr 28, 2023 08:45 AM

    Thanks for your research!

    I've been running all my commands while logged in as root, so haven't been using `sudo`.

    Trying those steps in Photon OS 5 gives the same result:

     

    root@Photon5 [ /etc/pki/rpm-gpg ]# cat /etc/photon-release
    VMware Photon OS 5.0
    PHOTON_BUILD_NUMBER=4d5974638
    root@Photon5 [ /etc/pki/rpm-gpg ]# nvidia-smi
    Fri Apr 28 08:42:09 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P620         Off  | 00000000:0B:00.0 Off |                  N/A |
    | 38%   46C    P0    N/A /  40W |      0MiB /  2048MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    root@Photon5 [ /etc/pki/rpm-gpg ]# sudo ctr run --rm -t \
        --runc-binary=/usr/bin/nvidia-container-runtime \
        --env NVIDIA_VISIBLE_DEVICES=all \
        docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
        cuda-11.6.2-base-ubuntu20.04 nvidia-smi
    ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown

     

     

    [  780.689512] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation
    [  780.689529] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation
    [  780.689535] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation

     

I've raised an issue on the Photon GitHub to move to runc 1.1.7.

     

In the meantime I will try some older drivers to see if they work.

     



  • 4.  RE: Photon OS v4 or v5 with NVidia CUDA

    Posted Apr 28, 2023 11:44 AM

Hi,

    a remark about

    # System update
    tdnf -y update
    tdnf -y upgrade

Running a blanket system update/upgrade always puts me in "testing" mode rather than moving toward "resilience".

For example, on Photon OS 4.0 rev2, runc was updated to 1.1.1 on May 13th 2022 and to 1.1.4 on October 18th 2022, and the current 1.1.4-X release is from March 23rd 2023. According to the runc 1.1.7 release notes, the issue described began with 1.1.3.

Hence, to improve reproducibility, it is better to pin the sequence of updates to specific package releases and, where reasonable, to double-check the behavior with root privileges as well. This is a learning note for myself, too.
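
For example, pinning instead of blanket-updating; the version string below is illustrative, so check what your repos actually offer first:

    tdnf list runc                   # enumerate the runc builds on offer
    tdnf install runc-1.1.1-1.ph4    # hypothetical name-version-release; pick a known-good one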

Best of luck with your project.
    Daniel