Interactive app card completed but UGE job is still running

Within the last few months we have upgraded from OnDemand 1.8 to 2.0.32. Since the upgrade, a few of our researchers and admin staff have reported OOD sessions being shown as completed immediately, even though the Univa Grid Engine (UGE) job is still running.

Taking job number 2895517 as an example, here is what I can see in the /var/log/ondemand-nginx/user1/error.log file:

App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qsub\", \"-wd\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7\", \"-N\", \
"ood_rstudio\", \"-o\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7/output.log\", \"-pe\", \"smp\", \"1\", \"-l\", \"ood_cores=1\", \"-l\", \"h_vmem=64G\", \"-l\", \
"h_rt=240:0:0\", \"-m\", \"bea\"]"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=289.08 vi
ew=0.00 location=https://ONDEMAND_HOST/pun/sys/dashboard/batch_connect/sessions"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qstat\", \"-r\", \"-xml\", \"-j\", \"2895517\"]"

There is also nothing out of the ordinary in the scheduler logs or in the job output files.

Running the same qstat command (/opt/sge/bin/lx-amd64/qstat -r -xml -j 2895517) shows the job details - could it be that OOD tried to run qstat before qsub had completed?

This issue looks very similar to "Interactive App completed, slurm job remains active", in that the UUID file in ~/ondemand/data/sys/dashboard/batch_connect/db shows the cache_completed variable as true. However, unlike that issue, there is only one nginx master process.

There doesn’t seem to be a pattern here: the issue affects a seemingly random subset of our OnDemand users at random times of day, so any help would be greatly appreciated.

Thanks,
Tom

Hello and welcome!

Sorry for the trouble. I think you are on to something with your suggestion that OOD may be running qstat before qsub has finished registering the job.

That, combined with the seemingly random occurrence of the issue, has me wondering whether the jobs that hit this error have large task lists, and whether that is what is causing the problem. Which is a long way of saying this could be a bug in ood_core’s SGE adapter.

I don’t have UGE to test against, unfortunately. Could you try submitting some jobs with large task lists, and one with a small task list, and see if you can get this error to show up? That’s my guess as to what is going on, as you said, and we may need to make some changes to the adapter if so.

Hi Travis,

Thank you for your prompt reply!

I have submitted roughly 100 UGE jobs with a variety of resource requests (from single-core jobs through to 75,000-task array jobs), each just running a simple sleep for a few seconds. In my testing I saw the issue a handful of times (maybe 3 or 4), but certainly not on every submission; at one point I had about 25 jobs submitted at once without observing the issue.

Nevertheless, changing the cache_completed variable from true to null in the UUID file for affected jobs did revive the jobs in OOD.
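
For anyone else hitting this, the fix is just flipping that one flag in the session's JSON file. A JSON-aware way to do it would be something like the following (a sketch only; <session-uuid> is a placeholder for the affected session's file name):

require 'json'

# Placeholder: replace <session-uuid> with the file name of the affected session
# under ~/ondemand/data/sys/dashboard/batch_connect/db
db_file = File.expand_path('~/ondemand/data/sys/dashboard/batch_connect/db/<session-uuid>')

session = JSON.parse(File.read(db_file))
session['cache_completed'] = nil          # was true; null lets OOD start polling the job again
File.write(db_file, JSON.generate(session))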

We have an alternative OOD server running as a hot spare which I can try to replicate the issue on - both servers submit to the same UGE cluster and share the same filesystem. If there is an intermittent communication issue between OOD and the scheduler, I would hope to be able to replicate the issue on the alternative OOD server.

p.s. apologies if we’re not on the same page here :slight_smile:

Thanks,
Tom

A quick update:

We have replicated the same issue on the alternative OOD server, also running 2.0.32. At some point next week, I’ll try upgrading our dev instance to 3.0 to see if the issue persists.

We’ve upgraded our OnDemand instance to version 3.0.1, but unfortunately the issue persists.

There could be some incompatibility bug here. Can you provide the output of /opt/sge/bin/lx-amd64/qstat -r -xml -j <someJobId> (you can obfuscate anything you like)?

I can throw it against our test suite and see if there’s a bug.

ood3_ghost_job_bug-2627.txt (22.9 KB)

Please find attached the qstat XML output of a recent job that suffered what we are calling “the ghost bug”.

The affected user has also shared their qstat output, and a screenshot of what they see in their OnDemand instance:

$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------
   3071606 15.00000 ood_rstudi user1       r     08/23/2023 10:15:56 all.q@node060 

The session file for this job under /data/home/user1/ondemand/data/sys/dashboard/batch_connect/db/ did contain "cache_completed":true. We only started noticing this with OnDemand 2.x (and now 3.x); it didn’t appear when we were using 1.8, if that helps.

Please let me know if you require any further information from me, and thanks for looking into this!

To add some more information, when the “ghost bug” occurs, the “Active Jobs” page still shows the job as running (similar to qstat).

Hey, I’m starting to circle back to this issue. You shared log lines from /var/log/ondemand-nginx/$USER earlier - is there any error just below those lines in the log?

Unfortunately, the logs were rotated at the start of this month, so we are unable to check the logs from back in April or August.

We are currently running version 3.0.3 and are still experiencing this bug - when we notice this again, I’ll share the full nginx log from moments just before the bug.

Thank you so much; let’s hope we actually capture an error after the qstat query. I’m thinking something erroneous happens there and we don’t catch the error correctly, instead just marking the job as completed. Let’s hope there are log messages to that effect in any case :crossed_fingers:. I know for sure we don’t have retry logic, so I’m thinking that if we get an error we should put the job into an undetermined state instead of completed. But I’d need to know what error is actually being propagated.
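
In other words, something roughly like this (a sketch only, not the actual adapter code; parse_state is a hypothetical stand-in for the adapter's normal XML handling):

require 'open3'

# Sketch of the behaviour described above: if the qstat call itself errors out,
# report :undetermined rather than assuming the job has completed.
def job_state(job_id)
  xml, status = Open3.capture2('/opt/sge/bin/lx-amd64/qstat', '-r', '-xml', '-j', job_id.to_s)
  raise "qstat exited #{status.exitstatus}" unless status.success?
  parse_state(xml)   # hypothetical stand-in for the adapter's normal XML-to-state mapping
rescue StandardError
  :undetermined      # a polling failure tells us nothing about the job, so never mark it completed
end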

Hi Jeff,

I’ve just come across this again today; please find attached the nginx access and error logs for my username:

access_log_3163769.txt (4.4 KB)
error_log_3163769.txt (1.4 KB)

Running the referenced qstat command directly produces the following:

$ qstat -r -xml -j 3163769 | head
<?xml version='1.0'?>
<detailed_job_info  xmlns:xsd="http://www.univa.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="file:///.../qstat/qstat.xsd file:///.../util/resources/xsd/qstat/message.xsd file:///.../util/resources/xsd/qstat/detailed_job_info.xsd">
  <djob_info>
    <element>
      <JB_job_number>3163769</JB_job_number>
      <JB_job_name>ood_matlab</JB_job_name>
      <JB_version>0</JB_version>
      <JB_project>qmul</JB_project>
      <JB_department>qmul</JB_department>
      <JB_exec_file>job_scripts/3163769</JB_exec_file>

Here is the job card:

And finally, when I run:

$ sed -i 's|true|null|' ~/ondemand/data/sys/dashboard/batch_connect/db/8c0b1c3b-9575-491f-a16e-08192e5a241f 

the job comes back to life:

Could there be a race condition where OOD tries to poll the job before the scheduler has actually registered it? If so, I wonder if we could add a sleep statement (or similar) somewhere between submission and the qstat call within OOD.

P.S. I’ll be at the OOD user group meeting at SC23 if it helps to demo this in person!

Many thanks

Yes I’m thinking this is exactly what’s happening.

If you wanted to add a sleep, you’d do so in the dashboard code on the line right after @session.save and before format.html.

Your logs don’t show any errors, which means OOD was able to call qstat successfully and parse its output.

I wonder: what are the exit code and output from issuing qstat for a job that’s known not to exist? That is, how does qstat behave when it can’t find the job?

/opt/sge/bin/lx-amd64/qstat -r -xml -j 111111111111111111

Thanks Jeff, I’ll try adding in a sleep statement on our dev instance tomorrow - stay tuned for more!

Here is the qstat output of a non-existing job:

$ qstat -r -xml -j 111111111111111111
<?xml version='1.0'?>
<unknown_jobs  xmlns:xsd="http://www.univa.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="file:///.../util/resources/xsd/qstat/qstat.xsd file:///.../util/resources/xsd/qstat/message.xsd file:///.../util/resources/xsd/qstat/detailed_job_info.xsd">
  <element>
    <ST_name>111111111111111111</ST_name>
    <ST_pos>0</ST_pos>
    <ST_access_specifier>0</ST_access_specifier>
  </element>
</unknown_jobs>

The exit code is 0.
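
So, since qstat exits 0 either way, I guess the adapter has to go on the shape of the XML rather than the exit code. Roughly this distinction (a rough sketch using Ruby's bundled REXML; I don't know which parser ood_core actually uses):

require 'open3'
require 'rexml/document'

# Illustrative only: a known job comes back under <detailed_job_info>,
# an unknown job under <unknown_jobs> - both with exit status 0.
xml, _status = Open3.capture2('/opt/sge/bin/lx-amd64/qstat', '-r', '-xml', '-j', '3163769')
root = REXML::Document.new(xml).root
puts(root && root.name == 'detailed_job_info' ? 'job found' : 'job not found (unknown_jobs)')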

A quick update:

We have not noticed the issue on our dev instance after adding a 2s sleep just after the cache_file.write(@session_context.to_json) line in the file you have referenced.
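
For reference, the change amounts to a single added line; the existing cache_file.write line is quoted from the file you pointed at, and the rest of the file is unchanged:

cache_file.write(@session_context.to_json)
sleep 2   # added: give UGE a couple of seconds to register the job before OOD's first qstat poll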

We are now rolling out the change to our production server and will continue monitoring to see if the issue re-occurs.

Many thanks