Interactive app status change to completed

Hi,

We have just installed OnDemand 4.0.7. The system is running, but whenever we submit an interactive app, its status immediately changes to “Completed,” even though the job is still running when I check via active tabs.

Could you please advise on how to resolve this? We are using the LSF scheduler.

Thank you,
Sumit Saluja

This can happen when you submit a job through bsub but then can't read that same job back through bjobs. When OnDemand issues bjobs and can't find the job (or bjobs fails for that job id), it assumes the job has finished and marks it as completed in the UI.
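
For example, if you query a job id the cluster doesn't know about, bjobs typically prints something like the following (exact wording can vary by LSF version), and OnDemand takes that as the job being gone:

$ bjobs -a -w -W 123456789
Job <123456789> is not found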

This can happen when you have two separate portals running, you're actively interacting with both instances of OnDemand, and the clusters are somehow set up differently on the two instances. For example, the two instances point to two different clusters, but both clusters have the same name in OnDemand, i.e., one is a test cluster and the other is your production cluster.

Basically, you’ve submitted the job to your test cluster, but the other instance of OnDemand is querying the production cluster for that job id, doesn’t find it and completes it.

I have two separate instances of OnDemand, one running version 3 and one running version 4. Both point to the same single cluster. Version 3 runs fine and everything works as expected, but version 4 has this issue.

This is what the logs show when I submit a job:
App 92674 output: [2025-09-22 10:56:24 -0400 ] INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub", "-cwd", "/hpc/users/salujs01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/chimera/output/5ed180f3-f747-429a-a79d-7e2501cbe71e", "-J", "sys/dashboard/sys/bc_desktop/chimera", "-W", "60", "-L", "/bin/bash", "-o", "/hpc/users/salujs01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/chimera/output/5ed180f3-f747-429a-a79d-7e2501cbe71e/output.log", "-B", "-P", "acc_hpcstaff", "-q", "ondemand", "-n", "2", "-R", "rusage[mem=3000]", "-R", "span[hosts=1]"]"
App 92674 output: [2025-09-22 10:56:24 -0400 ] INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "201941624"]"

OK, what happens when you issue this same command from the CLI on the same machine that’s running OOD 4.0?

Or conversely, is that the same command issued on the 3.1 instance? In either case, what does that command return when run on that instance?

Somehow one or both of these instances isn't able to find this job when querying the scheduler. I'd also double-check to be sure that they're pointing to the same cluster. We do the same thing at OSC (run multiple instances at different versions), but in our case we always point to the same cluster, so it's safe for us. I've only ever seen this when two instances point to different clusters that are named the same.
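
One quick sanity check, assuming the LSF client tools are on the PATH on both portal hosts, is to run lsid on each of them; it prints the LSF version, the cluster name, and the master host that server is actually talking to, so any mismatch shows up right away:

[user@ondemand01 ~]$ lsid
[user@ondemand ~]$ lsid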

This is from the server which is running OOD 4:
[root@ondemand01 ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2

This is from the server which is running OOD 3:

[root@ondemand ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2
[root@ondemand ~]#

Don’t run these commands as root. Run them as your regular user with the environment variables shown above. Remember, the idea here is to replicate what OnDemand is seeing in the shell. To do this, you need to issue the exact same command under the same circumstances.
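
For example, something along these lines (untested, with <jobid> as a placeholder) reproduces the bjobs call from the execve log above with only those four LSF variables set; OnDemand may merge them into a larger environment rather than replacing it, so treat this as an approximation:

env -i LSF_BINDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin \
  LSF_LIBDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib \
  LSF_ENVDIR=/hpc/lsf/conf \
  LSF_SERVERDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc \
  /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -a -w -W <jobid>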

Here it is as a regular user:
(base) [salujs01@ondemand01 ~]$ bjobs -a -w -W 202015200
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
202015200 salujs01 RUN ondemand ondemand01 2*lh06c25 sys/dashboard/sys/bc_desktop/chimera 09/23-09:56:20 acc_hpcstaff 000:00:00.00 4 0 1912734,1912889,1912890 09/23-09:56:25 - 2
(base) [salujs01@ondemand01 ~]$

(base) [salujs01@ondemand ~]$ bjobs -a -w -W 202015200
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
202015200 salujs01 RUN ondemand ondemand01 2*lh06c25 sys/dashboard/sys/bc_desktop/chimera 09/23-09:56:20 acc_hpcstaff 000:00:01.00 6 0 1912734,1912889,1912890 09/23-09:56:25 - 2
(base) [salujs01@ondemand ~]$

What about the environment variables? Also, when it goes into the completed state, is there anything in the logs after the bjobs log line?

These are the environment variables:
(base) [salujs01@ondemand01 ~]$ env|grep lsf
MANPATH=/hpc/lsf/10.1/man:
LSF_SERVERDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc
LSF_LIBDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
LD_LIBRARY_PATH=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
LSF_BINDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin
PATH=/hpc/packages/minerva-rocky9/anaconda3/2024.06/bin:/hpc/packages/minerva-rocky9/anaconda3/2024.06/condabin:/hpc/users/salujs01/.local/bin:/hpc/users/salujs01/bin:/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc:/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
LSF_ENVDIR=/hpc/lsf/conf

This is after the bjobs log line:
INFO "execve = [{"LSF_BINDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/hpc/lsf/conf", "LSF_SERVERDIR"=>"/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"}, "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "202015200"]"
App 19861 output: [2025-09-23 09:56:21 -0400 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 allocations=240636 duration=390.43 view=210.40"
App 19861 output: [2025-09-23 09:56:21 -0400 ] WARN "Announcement file not found: /etc/ood/config/announcement.md"
App 19861 output: [2025-09-23 09:56:21 -0400 ] WARN "Announcement file not found: /etc/ood/config/announcement.yml"
App 19861 output: [2025-09-23 09:56:21 -0400 ] INFO "method=GET path=/pun/sys/dashboard/apps/icon/bc_mssm_matlab/sys/sys format=html controller=AppsController action=icon status=200 allocations=44140 duration=123.19 view=0.00"

Do you have the 3.0 window open? Can you try this while only logged into the 4.0 instance? Again, this is an issue of conflicts between these two portals. If you’re not accessing the 3.0 portal, then it won’t conflict with the 4.0 portal.

Also, please confirm that the cluster.d files are exactly the same. Again, we do this at OSC (interact with the same cluster on multiple portals) and it works fine. This only happens when there’s some mismatch between the two portals, i.e., they’re actually configured differently or interact with different schedulers that are named the same.
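
For example, something like this run from the OOD 4 host (guessing the other server's hostname from your prompts above, so adjust as needed) will show any differences between the two portals' cluster definitions:

OOD3_HOST=ondemand   # placeholder: the host running the 3.x portal
for f in /etc/ood/config/clusters.d/*.yml; do
  diff "$f" <(ssh "$OOD3_HOST" cat "$f")
done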

Hi Jeff,

I have confirmed that my 3.0 window is not open and that I am only logged into the 4.0 portal. I also confirmed that we have the same cluster.d files on both servers.

I feel like we’re missing something obvious. Are there any logs on the LSF side that may help us diagnose what’s going on here?

You’re submitting to a queue named ondemand. Do you need to specify that queue when you issue bjobs?
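
If it turns out the queue matters, the manual test would look something like this (assuming your bjobs accepts a -q filter, with <jobid> as a placeholder):

bjobs -a -w -W -q ondemand <jobid>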

Also, I’d just confirm that when you issue bjobs manually, you’re in fact using /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs.
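
A quick way to check which binary the shell actually resolves, run on both servers:

type -a bjobs
command -v bjobs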

Are both servers using LSF 10.1?

Hi Jeff
We have one cluster, and /hpc is NFS-mounted on all servers, the OnDemand servers included, so everywhere we have /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs.
I noticed that bjobs reports the job a little late, around 5-6 seconds after submission. How can I add a delay in the dashboard so that it queries a bit later?

You’d have to add it before or after the call to info in the source code. I’m not sure a delay is going to help anything.

Though I can look into a way to print out what it’s seeing before it marks the job completed.

Hi Jeff
I just created a bjobs wrapper, added a 3-second sleep, and included that wrapper in the bin overrides. It works, and I can now see jobs on my interactive list. Do you have a better approach I could use instead of adding a sleep?
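
Roughly like this (the wrapper path is illustrative rather than my exact setup), with the cluster.d bin_overrides entry for bjobs pointed at the wrapper instead of the real binary:

#!/bin/bash
# bjobs wrapper: give LSF a few seconds to register the new job before
# querying, then hand all arguments through to the real bjobs
sleep 3
exec /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs "$@"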

Sumit Saluja

See, I knew it was something simple that we were just missing!

I’d try to lower it to maybe 0.5 seconds, or at least keep lowering it as long as it still responds reliably.

I’m not sure if there’s any other mechanism we can toggle here.
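
One idea you could experiment with instead of an unconditional sleep is a wrapper that only waits while bjobs is failing, something along these lines (untested sketch; it assumes bjobs exits non-zero while the job id isn't known yet, which is worth verifying by hand on your LSF version):

#!/bin/bash
# Hypothetical bjobs wrapper: retry a couple of times only while bjobs fails,
# rather than sleeping on every call.
REAL_BJOBS=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs
for attempt in 1 2 3; do
  if "$REAL_BJOBS" "$@" 2>/dev/null; then
    exit 0          # got a real answer, pass it straight through
  fi
  sleep 1           # give the scheduler a moment to register the job
done
exec "$REAL_BJOBS" "$@"   # final attempt, errors included, goes back to OnDemand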

I tried 2 seconds, but then I still see the same issue. With the 3-second sleep it works, though things now take a little longer since every call has to wait for the sleep.