Interactive app status change to completed

Hi,

We have just installed OnDemand 4.0.7. The system is running, but whenever we submit an interactive app, its status immediately changes to “Completed,” even though the job is still running when I check via active tabs.

Could you please advise on how to resolve this? We are using the LSF scheduler.

Thank you,
Sumit Saluja

This can happen when you submit a job through bsub but then can't read that same job back through bjobs. When OnDemand issues bjobs and can't find the job (or bjobs fails for that job id), it assumes the job has finished and marks it as completed in the UI.
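
For example, if you query a job id the cluster doesn't know about, bjobs typically prints something like the following (exact wording can vary by LSF version), and OnDemand takes that as the job being gone:

$ bjobs -a -w -W 123456789
Job <123456789> is not found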

This can happen when you have two separate portals running, you're actively interacting with both instances of OnDemand, and the clusters are somehow set up differently on the two instances. For example, the two instances point to two different clusters, but both clusters have the same name in OnDemand, i.e., one is a test cluster and the other is your production cluster.

Basically, you’ve submitted the job to your test cluster, but the other instance of OnDemand is querying the production cluster for that job id, doesn’t find it and completes it.

I have two separate instances of OnDemand, one running version 3 and one running version 4. Both point to the same single cluster. Version 3 runs fine and everything works as expected, but version 4 has this issue.

This is what the logs show when I submit a job:
App 92674 output: [2025-09-22 10:56:24 -0400 ] INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub", "-cwd", "/hpc/users/salujs01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/chimera/output/5ed180f3-f747-429a-a79d-7e2501cbe71e", "-J", "sys/dashboard/sys/bc_desktop/chimera", "-W", "60", "-L", "/bin/bash", "-o", "/hpc/users/salujs01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/chimera/output/5ed180f3-f747-429a-a79d-7e2501cbe71e/output.log", "-B", "-P", "acc_hpcstaff", "-q", "ondemand", "-n", "2", "-R", "rusage[mem=3000]", "-R", "span[hosts=1]"]"
App 92674 output: [2025-09-22 10:56:24 -0400 ] INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "201941624"]"

OK, what happens when you issue this same command from the CLI on the same machine that’s running OOD 4.0?

Or conversely, is that the same command issued on the 3.1 instance? In either case, what does that command return when run on that instance?

Somehow one or both of these instances isn't able to find this job when querying the scheduler. I'd also double-check to be sure that they're pointing to the same cluster. We do the same thing at OSC (run multiple instances at different versions), but in our case we always point to the same cluster, so it's safe for us. I've only ever seen this when two instances point to different clusters that are named the same.
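
One quick sanity check, assuming the LSF client tools are on the PATH on both portal hosts, is to run lsid on each of them; it prints the LSF version, the cluster name, and the master host that server is actually talking to, so any mismatch shows up right away:

[user@ondemand01 ~]$ lsid
[user@ondemand ~]$ lsid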

This is from the server which is running OOD 4:
[root@ondemand01 ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2

This is from the server which is running OOD 3:

[root@ondemand ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2
[root@ondemand ~]#

Don’t run these commands as root. Run them as your regular user with the environment variables shown above. Remember, the idea here is to replicate what OnDemand is seeing in the shell. To do this, you need to issue the exact same command under the same circumstances.
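
For example, something along these lines (untested, with <jobid> as a placeholder) reproduces the bjobs call from the execve log above with only those four LSF variables set; OnDemand may merge them into a larger environment rather than replacing it, so treat this as an approximation:

env -i LSF_BINDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin \
  LSF_LIBDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib \
  LSF_ENVDIR=/hpc/lsf/conf \
  LSF_SERVERDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc \
  /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -a -w -W <jobid>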

Here it is as a regular user:
(base) [salujs01@ondemand01 ~]$ bjobs -a -w -W 202015200
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
202015200 salujs01 RUN ondemand ondemand01 2*lh06c25 sys/dashboard/sys/bc_desktop/chimera 09/23-09:56:20 acc_hpcstaff 000:00:00.00 4 0 1912734,1912889,1912890 09/23-09:56:25 - 2
(base) [salujs01@ondemand01 ~]$

(base) [salujs01@ondemand ~]$ bjobs -a -w -W 202015200
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
202015200 salujs01 RUN ondemand ondemand01 2*lh06c25 sys/dashboard/sys/bc_desktop/chimera 09/23-09:56:20 acc_hpcstaff 000:00:01.00 6 0 1912734,1912889,1912890 09/23-09:56:25 - 2
(base) [salujs01@ondemand ~]$

What about the environment variables? Also, when it goes into the completed state, is there anything in the logs after the bjobs log line?

These are the environment variables:
(base) [salujs01@ondemand01 ~]$ env|grep lsf
MANPATH=/hpc/lsf/10.1/man:
LSF_SERVERDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc
LSF_LIBDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
LD_LIBRARY_PATH=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
LSF_BINDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin
PATH=/hpc/packages/minerva-rocky9/anaconda3/2024.06/bin:/hpc/packages/minerva-rocky9/anaconda3/2024.06/condabin:/hpc/users/salujs01/.local/bin:/hpc/users/salujs01/bin:/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc:/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
LSF_ENVDIR=/hpc/lsf/conf

This is after the bjobs log line:
INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "202015200"]"
App 19861 output: [2025-09-23 09:56:21 -0400 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 allocations=240636 duration=390.43 view=210.40"
App 19861 output: [2025-09-23 09:56:21 -0400 ] WARN "Announcement file not found: /etc/ood/config/announcement.md"
App 19861 output: [2025-09-23 09:56:21 -0400 ] WARN "Announcement file not found: /etc/ood/config/announcement.yml"
App 19861 output: [2025-09-23 09:56:21 -0400 ] INFO "method=GET path=/pun/sys/dashboard/apps/icon/bc_mssm_matlab/sys/sys format=html controller=AppsController action=icon status=200 allocations=44140 duration=123.19 view=0.00"

Do you have the 3.0 window open? Can you try this while only logged into the 4.0 instance? Again, this is an issue of conflicts between these two portals. If you’re not accessing the 3.0 portal, then it won’t conflict with the 4.0 portal.

Also, please confirm that the cluster.d files are exactly the same. Again, we do this at OSC (interact with the same cluster on multiple portals) and it works fine. This only happens when there’s some mismatch between the two portals, i.e., they’re actually configured differently or interact with different schedulers that are named the same.
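
For example, something like this run from the OOD 4 host (guessing the other server's hostname from your prompts above, so adjust as needed) will show any differences between the two portals' cluster definitions:

OOD3_HOST=ondemand   # placeholder: the host running the 3.x portal
for f in /etc/ood/config/clusters.d/*.yml; do
  diff "$f" <(ssh "$OOD3_HOST" cat "$f")
done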

Hi Jeff,

I have confirmed that my 3.0 window is not open and that I am only logged into the 4.0 portal. I also confirmed that we have the same cluster.d files on both servers.

I feel like we’re missing something obvious. Are there any logs on the LSF side that may help us diagnose what’s going on here?

You’re submitting to a queue named ondemand. Do you need to specify that queue when you issue bjobs?
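
If it turns out the queue matters, the manual test would look something like this (assuming your bjobs accepts a -q filter, with <jobid> as a placeholder):

bjobs -a -w -W -q ondemand <jobid>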

Also, I’d just confirm that when you issue bjobs manually, you’re in fact using /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs.
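
A quick way to check which binary the shell actually resolves, run on both servers:

type -a bjobs
command -v bjobs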

Are both servers using LSF 10.1?

Hi Jeff
We have one cluster, and /hpc is NFS-mounted on all servers, the OnDemand servers included, so everywhere we have /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs.
I noticed that bjobs reports the job a little late, around 5-6 seconds after submission. How can I add a delay in the dashboard so that it queries a bit later?

You’d have to add it before or after the call to info in the source code. I’m not sure a delay is going to help anything.

Though I can look into a way to print out what it’s seeing before it marks the job completed.

Hi Jeff
I just created a bjobs wrapper, added a 3-second sleep, and included that wrapper in the bin overrides. It works, and I can now see jobs on my interactive list. Do you have a better approach I could use instead of adding a sleep?
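
Roughly like this (the wrapper path is illustrative rather than my exact setup), with the cluster.d bin_overrides entry for bjobs pointed at the wrapper instead of the real binary:

#!/bin/bash
# bjobs wrapper: give LSF a few seconds to register the new job before
# querying, then hand all arguments through to the real bjobs
sleep 3
exec /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs "$@"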

Sumit Saluja

See, I knew it was something simple that we were just missing!

I’d try to lower it to maybe 0.5 seconds, or at least keep lowering it as long as it still responds reliably.

I’m not sure if there’s any other mechanism we can toggle here.
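
One idea you could experiment with instead of an unconditional sleep is a wrapper that only waits while bjobs is failing, something along these lines (untested sketch; it assumes bjobs exits non-zero while the job id isn't known yet, which is worth verifying by hand on your LSF version):

#!/bin/bash
# Hypothetical bjobs wrapper: retry a couple of times only while bjobs fails,
# rather than sleeping on every call.
REAL_BJOBS=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs
for attempt in 1 2 3; do
  if "$REAL_BJOBS" "$@" 2>/dev/null; then
    exit 0          # got a real answer, pass it straight through
  fi
  sleep 1           # give the scheduler a moment to register the job
done
exec "$REAL_BJOBS" "$@"   # final attempt, errors included, goes back to OnDemand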

I tried 2 seconds, but then I still see the same issue. With the 3-second sleep it works, though things now take a little longer since every call has to wait for the sleep.