Running Interactive Desktop card disappears

We’re running OOD 3.0.1 and we have plenty of users who’re using OOD just fine. One user, however, is having an issue where she will submit a new interactive desktop job, the queued card appears, then as soon as the job starts, the card disappears. You can see the job running, but the green Running card is missing. Once the job completes, the card reappears and shows Completed like normal.

I’ve crawled around these forums and some things I found were suggestions about cleaning up .bashrc and .bash_profile, which I’ve done. I’ve also tried deleting her ~/ondemand directory.

What might the next troubleshooting steps look like?

Thank you!

Hi and welcome!

Here is the documentation page on where to look, specifically the log directory of that particular job.

https://osc.github.io/ood-documentation/latest/how-tos/debug/debug-interactive-apps.html

It sounds like the issue stems from something in that job's session (given it only affects a single user), but the output.log of the failing job(s) will give a much better indication of why it's failing.

Thanks for the info. I’ve compared the affected user’s output.log and my own (which works fine) and attached only the differences at the bottom.

The only real difference I’m seeing is that mine claims to be setting the VNC password and generating the connection YAML file, but I do see a connection.yml in her output directory as well, and it seems to have valid information in it (passwords are set, etc.).

broken:

(nm-applet:988950): Gtk-WARNING **: 10:11:56.971: gtk_widget_size_allocate(): attempt to allocate widget with width -1 and height 1

(mate-settings-daemon:988917): dbind-WARNING **: 10:12:00.727: AT-SPI: Error in GetItems, sender=(null), error=Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.606: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.607: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.607: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.607: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.607: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

(mate-settings-daemon:988917): GLib-GObject-CRITICAL **: 10:25:46.607: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

works:

(mate-settings-daemon:989429): dbind-WARNING **: 10:13:17.071: AT-SPI: Error in GetItems, sender=org.freedesktop.DBus, error=Message recipient disconnected from message bus without replying
Setting VNC password...
Generating connection YAML file...
mate-session[989373]: CRITICAL: gsm_systemd_set_session_idle: assertion 'session_path != NULL' failed

Yeah, it’s hard to say what’s relevant there and what isn’t. I can’t discern anything really relevant, but again, the ~/.bashrc or similar is where I would look. I’d also check whether bash is even their SHELL (you could be looking for some other rc file).

Typically, user-specific conda environments can throw this off. That’s what comes to mind, anyhow.

Yeah, unfortunately I’ve already looked in her .bashrc and .bash_profile and they’re just the basic system defaults from RHEL (and she’s using bash, not zsh or another shell).

I wonder what causes the visibility of the card in the interactive sessions view. Like, OOD has to determine what to show, I wonder if I can trace down what’s causing it to not find that job in a list somewhere.

It’s the connection.yml + the state of the job. The job states from the scheduler correspond to what we show, with the exception of the starting state.

Starting: no connection.yml but the job is in a Running state
Running: a valid connection.yml and the job is in a Running state
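
The two rules above can be sketched roughly as follows. This is an illustration only, not OOD's actual implementation: the function name `card_state` is hypothetical, the scheduler state is assumed to arrive as a plain string, and "valid" is approximated here as "the file exists and is not empty".

```python
from pathlib import Path


def card_state(scheduler_state: str, connection_file: Path) -> str:
    """Return the state a session card would show (illustrative sketch only)."""
    state = scheduler_state.lower()

    # Every state other than running maps straight through from the scheduler.
    if state != "running":
        return state

    # The job is running: show "running" only once connection.yml is in place.
    # Assumption: a non-empty file stands in for a "valid" connection.yml.
    if connection_file.is_file() and connection_file.stat().st_size > 0:
        return "running"

    # Running on the scheduler but no usable connection.yml yet.
    return "starting"
```

When troubleshooting, the same two inputs are worth checking by hand: what the scheduler reports for the job, and whether connection.yml exists in the session's output directory while the job is running.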

If the job gets completed on the OOD side, then the job is actually completed on the scheduler side. Otherwise we’d sit in the starting state for the entire job’s duration, waiting for the connection.yml.

You can likely confirm this in sacct (Slurm’s job accounting history) or similar to see whether the job only ran for a minute or so.
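
As a rough sketch of that check, assuming Slurm with job accounting enabled, the snippet below asks Slurm's accounting command, sacct, for a job's final state and elapsed time. The helper names `parse_sacct` and `job_history` are hypothetical; the sacct options used (`-j`, `-X`, `--noheader`, `--parsable2`, `--format`) are standard ones.

```python
import subprocess


def parse_sacct(output: str) -> list:
    """Parse pipe-delimited sacct output (JobID|State|Elapsed) into dicts."""
    rows = []
    for line in output.strip().splitlines():
        job_id, state, elapsed = line.split("|")
        rows.append({"job_id": job_id, "state": state, "elapsed": elapsed})
    return rows


def job_history(job_id: str) -> list:
    """Query sacct for one job; -X limits output to the allocation itself."""
    cmd = [
        "sacct", "-j", job_id, "-X",
        "--noheader", "--parsable2",
        "--format=JobID,State,Elapsed",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct(result.stdout)
```

Calling `job_history` with the job ID shown on the card should return one row; an elapsed time of only a minute or so would suggest the job really did end early on the scheduler side.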

If it only impacts one user, that’s the only clue we have to go on. What scheduler do you use? Could there be something in ~/.slurm_defaults or similar? If it’s not their shell environment… Then it must be something else specific to that user.

Thanks for the detailed information. Somehow, the user reports the issue is magically fixed, and I didn’t do anything to fix it…

Either way, this info will help me troubleshoot if it comes back, or someone else experiences this issue.

Thanks!