Interactive app card completed but UGE job is still running

Within the last few months we have upgraded from OnDemand 1.8 to 2.0.32. Since the upgrade, a few of our researchers and admin staff have reported OOD sessions being shown as completed immediately, even though the Univa Grid Engine (UGE) job is still running.

Taking job number 2895517 as an example, here is what I can see in the /var/log/ondemand-nginx/user1/error.log file:

App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qsub\", \"-wd\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7\", \"-N\", \
"ood_rstudio\", \"-o\", \"/data/home/user1/ondemand/data/sys/dashboard/batch_connect/sys/rstudio/output/5a0bdeb5-e1ae-4472-b000-fca10d6486d7/output.log\", \"-pe\", \"smp\", \"1\", \"-l\", \"ood_cores=1\", \"-l\", \"h_vmem=64G\", \"-l\", \
"h_rt=240:0:0\", \"-m\", \"bea\"]"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=289.08 vi
ew=0.00 location=https://ONDEMAND_HOST/pun/sys/dashboard/batch_connect/sessions"
App 38789 output: [2023-04-13 10:12:22 +0100 ]  INFO "execve = [{}, \"/opt/sge/bin/lx-amd64/qstat\", \"-r\", \"-xml\", \"-j\", \"2895517\"]"

There is also nothing out of the ordinary in the scheduler logs or in the job output files.

Running the same qstat command (/opt/sge/bin/lx-amd64/qstat -r -xml -j 2895517) shows the job details - could it be that OOD tried to run qstat before qsub had completed?

This issue looks very similar to "Interactive App completed, slurm job remains active", in that the UUID file in ~/ondemand/data/sys/dashboard/batch_connect/db shows the cache_completed variable as true. However, unlike that issue, there is only one nginx master process.

There doesn’t seem to be a pattern here: the issue affects a seemingly random subset of our OnDemand users at random times of day, so any help would be greatly appreciated.

Thanks,
Tom

Hello and welcome!

Sorry for the trouble. I think you are on to something with your suggestion that OOD may be running qstat before qsub has finished registering the job.

That, combined with the seemingly random occurrence of the issue, has me wondering whether the jobs that hit this error have large task lists, and whether that is what is causing the problem. Which is a long way of saying this could be a bug in ood_core’s SGE adapter.

I don’t have UGE to test against, unfortunately. Could you try submitting some jobs with large task lists, and one with a small task list, and see if you can get this error to show up? That’s my guess as to what is going on, as you said, and we may need to make some changes to the adapter if so.

Hi Travis,

Thank you for your prompt reply!

I have submitted roughly 100 UGE jobs with a variety of resource requests (from single-core jobs through to 75,000-task array jobs), each just running a simple sleep for a few seconds. In my testing I saw the issue a handful of times (maybe 3 or 4), but certainly not on every submission; at one point I had about 25 jobs submitted at once without observing the issue.

Nevertheless, changing the cache_completed variable from true to null in the UUID file for affected jobs did revive the jobs in OOD.
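
For anyone else hitting this, the fix is just flipping that one flag in the session's JSON file. A JSON-aware way to do it would be something like the following (a sketch only; <session-uuid> is a placeholder for the affected session's file name):

require 'json'

# Placeholder: replace <session-uuid> with the file name of the affected session
# under ~/ondemand/data/sys/dashboard/batch_connect/db
db_file = File.expand_path('~/ondemand/data/sys/dashboard/batch_connect/db/<session-uuid>')

session = JSON.parse(File.read(db_file))
session['cache_completed'] = nil          # was true; null lets OOD start polling the job again
File.write(db_file, JSON.generate(session))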

We have an alternative OOD server running as a hot spare which I can try to replicate the issue on - both servers submit to the same UGE cluster and share the same filesystem. If there is an intermittent communication issue between OOD and the scheduler, I would hope to be able to replicate the issue on the alternative OOD server.

p.s. apologies if we’re not on the same page here :slight_smile:

Thanks,
Tom

A quick update:

We have replicated the same issue on the alternative OOD server, also running 2.0.32. At some point next week, I’ll try upgrading our dev instance to 3.0 to see if the issue persists.

We’ve upgraded our OnDemand instance to version 3.0.1, but unfortunately the issue persists.

There could be some incompatibility bug here. Can you provide the output of /opt/sge/bin/lx-amd64/qstat -r -xml -j <someJobId> (you can obfuscate anything you like)?

I can throw it against our test suite and see if there’s a bug.

ood3_ghost_job_bug-2627.txt (22.9 KB)

Please find attached the qstat XML output of a recent job that suffered what we are calling “the ghost bug”.

The affected user has also shared their qstat output, and a screenshot of what they see in their OnDemand instance:

$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------
   3071606 15.00000 ood_rstudi user1       r     08/23/2023 10:15:56 all.q@node060 

The session file for this job under /data/home/user1/ondemand/data/sys/dashboard/batch_connect/db/ did contain "cache_completed":true. We only started noticing this with OnDemand 2.x (and now 3.x); it didn’t appear when we were using 1.8, if that helps.

Please let me know if you require any further information from me, and thanks for looking into this!

To add some more information, when the “ghost bug” occurs, the “Active Jobs” page still shows the job as running (similar to qstat).

Hey, I’m starting to circle back to this issue. You shared log lines from /var/log/ondemand-nginx/$USER earlier - is there any error just below those lines in the log?

Unfortunately, the logs were rotated at the start of this month, so we are unable to check the logs from back in April or August.

We are currently running version 3.0.3 and are still experiencing this bug - when we notice this again, I’ll share the full nginx log from moments just before the bug.

Thank you so much; let’s hope we actually capture an error after the qstat query. I’m thinking something erroneous happens there and we don’t catch the error correctly, instead just marking the job as completed. Let’s hope there are log messages to that effect in any case :crossed_fingers:. I know for sure we don’t have retry logic, so I’m thinking that if we get an error we should put the job into an undetermined state instead of completed. But I’d need to know what error is actually being propagated.
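
In other words, something roughly like this (a sketch only, not the actual adapter code; parse_state is a hypothetical stand-in for the adapter's normal XML handling):

require 'open3'

# Sketch of the behaviour described above: if the qstat call itself errors out,
# report :undetermined rather than assuming the job has completed.
def job_state(job_id)
  xml, status = Open3.capture2('/opt/sge/bin/lx-amd64/qstat', '-r', '-xml', '-j', job_id.to_s)
  raise "qstat exited #{status.exitstatus}" unless status.success?
  parse_state(xml)   # hypothetical stand-in for the adapter's normal XML-to-state mapping
rescue StandardError
  :undetermined      # a polling failure tells us nothing about the job, so never mark it completed
end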

Hi Jeff,

I’ve just come across this again today; please find attached the nginx access and error logs for my username:

access_log_3163769.txt (4.4 KB)
error_log_3163769.txt (1.4 KB)

Running the referenced qstat command directly produces the following:

$ qstat -r -xml -j 3163769 | head
<?xml version='1.0'?>
<detailed_job_info  xmlns:xsd="http://www.univa.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="file:///.../qstat/qstat.xsd file:///.../util/resources/xsd/qstat/message.xsd file:///.../util/resources/xsd/qstat/detailed_job_info.xsd">
  <djob_info>
    <element>
      <JB_job_number>3163769</JB_job_number>
      <JB_job_name>ood_matlab</JB_job_name>
      <JB_version>0</JB_version>
      <JB_project>qmul</JB_project>
      <JB_department>qmul</JB_department>
      <JB_exec_file>job_scripts/3163769</JB_exec_file>

Here is the job card:

And finally, when I run:

$ sed -i 's|true|null|' ~/ondemand/data/sys/dashboard/batch_connect/db/8c0b1c3b-9575-491f-a16e-08192e5a241f 

the job comes back to life:

Could there be a race condition where OOD tries to poll the job before the scheduler has actually registered it? If so, I wonder if we could add a sleep statement (or similar) somewhere between submission and the qstat call within OOD.

P.S. I’ll be at the OOD user group meeting at SC23 if it helps to demo this in person!

Many thanks

Yes I’m thinking this is exactly what’s happening.

If you wanted to add a sleep, you’d do so in the dashboard code on the line right after @session.save and before format.html.

Your logs don’t show any errors, which means OOD was able to call qstat successfully and parse its output.

I wonder: what are the exit code and output from issuing qstat for a job that’s known not to exist? That is, how does qstat behave when it can’t find the job?

/opt/sge/bin/lx-amd64/qstat -r -xml -j 111111111111111111

Thanks Jeff, I’ll try adding in a sleep statement on our dev instance tomorrow - stay tuned for more!

Here is the qstat output of a non-existing job:

$ qstat -r -xml -j 111111111111111111
<?xml version='1.0'?>
<unknown_jobs  xmlns:xsd="http://www.univa.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="file:///.../util/resources/xsd/qstat/qstat.xsd file:///.../util/resources/xsd/qstat/message.xsd file:///.../util/resources/xsd/qstat/detailed_job_info.xsd">
  <element>
    <ST_name>111111111111111111</ST_name>
    <ST_pos>0</ST_pos>
    <ST_access_specifier>0</ST_access_specifier>
  </element>
</unknown_jobs>

The exit code is 0.
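
So, since qstat exits 0 either way, I guess the adapter has to go on the shape of the XML rather than the exit code. Roughly this distinction (a rough sketch using Ruby's bundled REXML; I don't know which parser ood_core actually uses):

require 'open3'
require 'rexml/document'

# Illustrative only: a known job comes back under <detailed_job_info>,
# an unknown job under <unknown_jobs> - both with exit status 0.
xml, _status = Open3.capture2('/opt/sge/bin/lx-amd64/qstat', '-r', '-xml', '-j', '3163769')
root = REXML::Document.new(xml).root
puts(root && root.name == 'detailed_job_info' ? 'job found' : 'job not found (unknown_jobs)')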

A quick update:

We have not noticed the issue on our dev instance after adding a 2s sleep just after the cache_file.write(@session_context.to_json) line in the file you have referenced.
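
For reference, the change amounts to a single added line; the existing cache_file.write line is quoted from the file you pointed at, and the rest of the file is unchanged:

cache_file.write(@session_context.to_json)
sleep 2   # added: give UGE a couple of seconds to register the job before OOD's first qstat poll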

We are now rolling out the change to our production server and will continue monitoring to see if the issue re-occurs.

Many thanks