Within the last few months, we upgraded from 1.8 to 2.0.32. Since the upgrade, a few of our researchers and admin staff have reported OOD sessions being shown as completed immediately, even though the Univa Grid Engine (UGE) job is still running.
Taking job number 2895517 as an example, here is what I can see in the /var/log/ondemand-nginx/user1/error.log file:
App 38789 output: [2023-04-13 10:12:22 +0100 ] INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qsub\", \"-wd\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7\", \"-N\", \"ood_rstudio\", \"-o\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7/output.log\", \"-pe\", \"smp\", \"1\", \"-l\", \"ood_cores=1\", \"-l\", \"h_vmem=64G\", \"-l\", \"h_rt=240:0:0\", \"-m\", \"bea\"]"
App 38789 output: [2023-04-13 10:12:22 +0100 ] INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=289.08 view=0.00 location=https://ONDEMAND_HOST/pun/sys/dashboard/batch_connect/sessions"
App 38789 output: [2023-04-13 10:12:22 +0100 ] INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qstat\", \"-r\", \"-xml\", \"-j\", \"2895517\"]"
There is nothing out of the ordinary in the scheduler logs or in the job output files.
Running the same qstat command (/opt/sge/bin/lx-amd64/qstat -r -xml -j 2895517) by hand shows the job details. Could it be that OOD tried to run qstat before qsub had completed?
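A rough way to test that theory would be to time how long a freshly submitted job takes to become visible to qstat. The sketch below is only illustrative: the function names are made up for this post, and it assumes that `qstat -j <job_id>` exits with a non-zero status while the scheduler does not yet know the job, which would need confirming on our UGE version.

```python
import subprocess
import time

QSTAT = "/opt/sge/bin/lx-amd64/qstat"  # path taken from the log lines above


def qstat_knows_job(job_id):
    """Return True if `qstat -j <job_id>` exits with status 0.

    Assumption: qstat exits non-zero when the scheduler does not know the job.
    """
    result = subprocess.run(
        [QSTAT, "-j", str(job_id)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def seconds_until_visible(job_id, probe=qstat_knows_job, timeout=10.0,
                          interval=0.2, clock=time.monotonic, sleep=time.sleep):
    """Poll probe(job_id) until it returns True.

    Returns the seconds elapsed at the first successful probe, or None if
    the job did not become visible within `timeout` seconds.
    """
    start = clock()
    while True:
        if probe(job_id):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Calling `seconds_until_visible(job_id)` straight after a qsub returns would give a rough measure of any delay; a result consistently close to zero would make the race explanation less likely.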
This issue looks very similar to "Interactive App completed, slurm job remains active" in that the UUID file in ~/ondemand/data/sys/dashboard/batch_connect/db is showing the cache_completed variable as true. However, unlike that issue, there is only one nginx master process.
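To see how many sessions are affected, the db directory could be scanned for files that have cache_completed set to true while the scheduler still knows the job. This is a minimal sketch under two assumptions that are not confirmed above: that each file in the db directory is a JSON object, and that the scheduler job number is stored under a `job_id` key. The function names are hypothetical, and the check again relies on `qstat -j` exiting non-zero for unknown jobs.

```python
import json
import subprocess
from pathlib import Path

QSTAT = "/opt/sge/bin/lx-amd64/qstat"  # path taken from the log lines above


def job_still_known(job_id):
    """Return True if `qstat -j <job_id>` exits with status 0.

    Assumption: a non-zero exit status means the scheduler no longer
    (or does not yet) know the job.
    """
    result = subprocess.run(
        [QSTAT, "-j", str(job_id)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_mismatched_sessions(db_dir, is_running=job_still_known):
    """List (file name, job id) pairs marked completed but still running.

    Assumptions: each session file is a JSON object, and the job number
    is stored under the hypothetical key "job_id".
    """
    mismatched = []
    for path in sorted(Path(db_dir).expanduser().iterdir()):
        if not path.is_file():
            continue
        try:
            session = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # skip unreadable or non-JSON files
        if not isinstance(session, dict):
            continue
        job_id = session.get("job_id")
        if session.get("cache_completed") is True and job_id and is_running(job_id):
            mismatched.append((path.name, str(job_id)))
    return mismatched
```

Running `find_mismatched_sessions("~/ondemand/data/sys/dashboard/batch_connect/db")` as an affected user should list the sessions that OOD considers finished while UGE still reports the job.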
There doesn’t seem to be a pattern here: the issue affects a seemingly random subset of our OnDemand users, at random times of the day, so any help would be greatly appreciated.
Thanks,
Tom