OOD Template: vnc_container won't run properly; “failed to connect to dbus”

Hello, we have an app defined to use template: vnc_container. We have found that it won't launch unless we first log in to the node as the user and restart the user dbus service.

When it fails, in the output.log it shows:

Starting instance…

WARNING: group: unknown groupid XXX

INFO: Terminating squashfuse after timeout

INFO: Timeouts can be caused by a running background process

ERROR: container cleanup failed: no instance found with name XXXXXXXXXX

FATAL: container creation failed: while applying cgroups config: while creating cgroup manager: failed to connect to dbus (hint: for rootless containers, maybe you need to install dbus-user-session package, see https://github.com/opencontainers/runc/blob/main/docs/cgroup-v2.md): could not execute `busctl --user --no-pager status` (output: "Failed to connect to bus: No such file or directory\n"): exit status 1

FATAL: while executing starter: failed to start instance: while running /usr/local/libexec/apptainer/bin/starter: exit status 255

If we ssh into the node as the user and run

systemctl --user restart dbus.service

then launch the app, it works normally. Oddly, the user dbus.service already shows as running before we restart it.

We tried adding:

before_script: |
  systemctl --user start dbus.service

but the template tries (and fails) to start the VNC apptainer instance before it ever runs the before_script.

We also tried installing the ubuntu-desktop package group and setting the graphical target as the default, but that didn't help either.

Does anyone have any other suggestions?

Hi and welcome!

Not sure what could cause this. Can you share the bind mounts you're using? I've found that /var pretty much always needs to be mounted, but I've never seen this error.

Hello,

In our submit.yml.erb we have:

container_path: "<%= (cpath == 'custom') ? container_image_custom : cpath %>"
container_bindpath: "/var,/usr"
container_module: "apptainer"
container_command: "apptainer"
container_start_args:
websockify_cmd: "/usr/local/bin/websockify"
before_script: |
  systemctl --user start dbus.service

We solved our problem. We figured out that it was failing in the template before it even got to our app.

It was failing at this point in the template, where the OnDemand container that handles VNC is started (the before_script actually runs after this):

ondemand/4.0.8-1/gems/ood_core-0.29.0/lib/ood_core/batch_connect/templates/vnc_container.rb

echo "Starting instance…"
#{container_command} instance start #{container_start_args} #{container_path} #{@instance_name}

It failed whenever the user was not already ssh'd into the node, because in that case there was no /run/user/ session directory. Why we ran into this when apparently no one else does, we're not sure. We realized that to fix it, we would need to establish a session for the user on the node before the job runs, and it had to be done through the scheduler, since the scheduler (not ssh) is what actually runs the job.

To fix it we created a prolog script for slurm and told it to run loginctl enable-linger for the user. We also created an epilog script to disable the linger after the job finishes.

slurm.conf
Prolog=/sharedfilesystempath/prolog-script/prolog.sh
Epilog=/sharedfilesystempath/prolog-script/epilog.sh

prolog.sh
#!/bin/bash
loginctl enable-linger "$SLURM_JOB_USER"
systemctl --user -M "$SLURM_JOB_USER@.host" start dbus.service

epilog.sh
#!/bin/bash
loginctl disable-linger "$SLURM_JOB_USER"

Then we also added /run to container_bindpath so that the /run/user/ session directory would be available inside the container image our job starts, which is separate from the instance started by the template above.
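For reference, the bind path line in submit.yml.erb then becomes (a sketch based on the config we posted above):

```yaml
# /run added so the container can see the /run/user/<uid> session directory
container_bindpath: "/var,/usr,/run"
```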