Interactive app card completed but UGE job is still running

Within the last few months we have upgraded from 1.8 to 2.0.32. Since the upgrade, a few of our researchers and admin staff have reported OOD sessions being shown as completed immediately, even though the underlying Univa Grid Engine (UGE) job is still running.

Taking job number 2895517 as an example, here is what I can see in the /var/log/ondemand-nginx/user1/error.log file:

```
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qsub\", \"-wd\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7\", \"-N\", \"ood_rstudio\", \"-o\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7/output.log\", \"-pe\", \"smp\", \"1\", \"-l\", \"ood_cores=1\", \"-l\", \"h_vmem=64G\", \"-l\", \"h_rt=240:0:0\", \"-m\", \"bea\"]"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=289.08 view=0.00 location=https://ONDEMAND_HOST/pun/sys/dashboard/batch_connect/sessions"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qstat\", \"-r\", \"-xml\", \"-j\", \"2895517\"]"
```

There is nothing out of the ordinary in the scheduler logs or in the job output files.

Running the same qstat command manually (/opt/sge/bin/lx-amd64/qstat -r -xml -j 2895517) shows the job details, so could it be that OOD tried to run qstat before qsub had completed?
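
In case it helps anyone check that hypothesis directly, here is a rough Python sketch that submits a trivial job and immediately queries it with the same qstat flags OOD uses. The qsub/qstat paths are taken from the log above; the -b y sleep payload, the job name, and the "Your job <id>" output parsing are assumptions for illustration only.

```python
# Hypothetical race check: submit a trivial job, then immediately ask qstat
# about it, to see whether a job id can ever be briefly unknown to qstat
# right after qsub returns.
import re
import subprocess

QSUB = "/opt/sge/bin/lx-amd64/qsub"
QSTAT = "/opt/sge/bin/lx-amd64/qstat"

# Submit a short binary job; qsub normally prints
# 'Your job <id> ("<name>") has been submitted'.
out = subprocess.run([QSUB, "-b", "y", "-N", "race_check", "sleep", "30"],
                     capture_output=True, text=True, check=True).stdout
job_id = re.search(r"Your job(?:-array)? (\d+)", out).group(1)

# Query the job with the same flags OOD uses; a non-zero exit or an
# "unknown job" style message immediately after submission would point
# at a submit/poll race rather than an adapter parsing problem.
result = subprocess.run([QSTAT, "-r", "-xml", "-j", job_id],
                        capture_output=True, text=True)
print(f"job {job_id}: qstat exit code {result.returncode}, "
      f"{len(result.stdout)} bytes of XML")
```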

This issue looks very similar to "Interactive App completed, slurm job remains active", in that the UUID-named file in ~/ondemand/data/sys/dashboard/batch_connect/db shows the cache_completed variable as true. However, unlike that issue, there is only one nginx master process.

There doesn’t seem to be an obvious pattern: the issue affects a seemingly random subset of our OnDemand user base at random times of the day, so any help would be greatly appreciated.

Thanks,
Tom

Hello and welcome!

Sorry for the trouble. I think you are on to something with your suggestion that OOD may be running qstat before qsub has finished.

That, combined with the seemingly random occurrence of the issue, has me wondering whether the jobs that hit this error have large task lists and whether that is what is causing the problem. Which is a long way of saying this could be a bug in ood_core’s SGE adapter.

Unfortunately I don’t have UGE to test against. Could you submit some jobs with large task lists and one with a small task list and see whether you can get this error to show up? That’s my best guess, as you suggested, of what is going on, and we may need to make some changes to the adapter if so.
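
Since I can’t verify this against UGE myself, here is only a rough sketch of one way to probe it: submit one single job and one large array job, then compare how much XML `qstat -r -xml -j` returns for each. The binary paths come from your log; the `-t` array syntax, the sleep payload, and the job-id parsing are assumptions for illustration.

```python
# Rough probe of the "large task list" hypothesis: does qstat -r -xml -j
# return dramatically more XML for a big array job than for a single job?
import re
import subprocess

QSUB = "/opt/sge/bin/lx-amd64/qsub"
QSTAT = "/opt/sge/bin/lx-amd64/qstat"

def submit(extra_args):
    """Submit a short sleep job and return its numeric job id."""
    out = subprocess.run([QSUB, *extra_args, "-b", "y", "sleep", "60"],
                         capture_output=True, text=True, check=True).stdout
    return re.search(r"Your job(?:-array)? (\d+)", out).group(1)

small = submit(["-N", "small_test"])                   # single task
large = submit(["-N", "large_test", "-t", "1-75000"])  # 75000-task array

for label, job_id in [("small", small), ("large", large)]:
    xml = subprocess.run([QSTAT, "-r", "-xml", "-j", job_id],
                         capture_output=True, text=True).stdout
    print(f"{label} job {job_id}: {len(xml)} bytes of qstat XML")
```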

Hi Travis,

Thank you for your prompt reply!

I have submitted roughly 100 UGE jobs with a variety of resource requests (from a single core up to 75000-task array jobs), each running a simple sleep for a few seconds. In my testing I saw the issue a handful of times (maybe 3 or 4), but certainly not on every submission; at one point I had about 25 jobs submitted at once without observing the issue.

Nevertheless, changing the cache_completed variable from true to null in the UUID file for the affected jobs did revive those sessions in OOD.
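
For reference, this is roughly the edit I made for each affected session. It is only a sketch: it assumes the per-session db file is plain JSON and that the UUID below (taken from the log earlier) matches the file name of the affected session.

```python
# Minimal sketch of the manual workaround described above: flip
# cache_completed back to null in a session's db file so the dashboard
# polls the scheduler again. Assumes the file is JSON and that you already
# know which UUID-named session file is affected.
import json
from pathlib import Path

session_id = "5a0bdeb5-e1ae-4472-b000-fca10d6486d7"  # example UUID from the log above
db_file = Path.home() / "ondemand/data/sys/dashboard/batch_connect/db" / session_id

data = json.loads(db_file.read_text())
if data.get("cache_completed"):
    data["cache_completed"] = None          # serialises as null
    db_file.write_text(json.dumps(data))
    print(f"reset cache_completed for session {session_id}")
```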

We have an alternative OOD server running as a hot spare that I can try to replicate the issue on; both servers submit to the same UGE cluster and share the same filesystem. If there is an intermittent communication issue between OOD and the scheduler, I would hope to be able to reproduce it on that server as well.

p.s. apologies if we’re not on the same page here :slight_smile:

Thanks,
Tom

A quick update:

We have replicated the same issue on the alternative OOD server, also running 2.0.32. At some point next week, I’ll try upgrading our dev instance to 3.0 to see if the issue persists.