We have just installed OnDemand 4.0.7. The system is running, but whenever we submit an interactive app, its status immediately changes to “Completed,” even though the job is still running when I check via active tabs.
Could you please advise on how to resolve this? We are using the LSF scheduler.
This can happen when you submit a job through bsub but then cannot read that same job back through bjobs. When OnDemand issues bjobs and can’t find the job (that is, bjobs fails for that job ID), it assumes the job has completed and marks it as completed in the UI.
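To make that behaviour concrete, here is a rough shell sketch of the decision. It is not OnDemand's actual code; job_state and the BJOBS variable are hypothetical names used only for illustration, and the bjobs flags are the ones that appear in the commands pasted in this thread.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics "lookup failed => treat the job as completed".
# BJOBS is a hypothetical override so a specific bjobs binary can be tested.
BJOBS="${BJOBS:-bjobs}"

job_state() {
  local job_id="$1" stat
  # Column 3 of the wide bjobs output is STAT (JOBID USER STAT ...).
  stat="$("$BJOBS" -a -w -W "$job_id" 2>/dev/null |
    awk -v id="$job_id" '$1 == id { print $3; exit }')" || true
  if [ -z "$stat" ]; then
    # No row came back for this ID, so the job is assumed to be finished.
    echo "completed"
  else
    echo "$stat"
  fi
}
```

With this logic, any reason bjobs cannot see the job (wrong cluster, wrong environment, or a scheduler that has not registered the job yet) looks the same as a finished job.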
This can happen when you have two separate portals running, you’re actively interacting with both instances of OnDemand, and your clusters are somehow set up differently on the two instances. For example, you have two instances of OnDemand pointing to two different clusters, but both clusters have the same name in OnDemand, i.e., one is a test cluster and the other is your production cluster.
Basically, you’ve submitted the job to your test cluster, but the other instance of OnDemand is querying the production cluster for that job ID, doesn’t find it, and marks it as completed.
I have two separate instances of OnDemand: one is version 3 and one is version 4. Both point to the same single cluster. Version 3 runs fine and everything works as expected, but version 4 has this issue.
OK, what happens when you issue this same command from the CLI on the same machine that’s running OOD 4.0?
Or conversely, is that the same command issued on the 3.1 instance? And in either case, what does that command return when run on that instance?
Somehow one or both of these instances isn’t able to find this job when querying the scheduler. I’d also double check to be sure that they’re pointing to the same cluster. We do the same thing at OSC, run multiple instances at different versions, but in our case we always point to the same cluster so it’s safe for us. I’ve only ever seen this when two instances point to different clusters that are named the same.
This is from the server running OOD 4:
[root@ondemand01 ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2
This is from the server running OOD 3:
[root@ondemand ~]# bjobs -a -w -W 201986510
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
201986510 salujs01 RUN ondemand ondemand01 2*lh06c11 sys/dashboard/sys/bc_desktop/chimera 09/22-16:42:08 acc_hpcstaff 000:00:01.00 19 0 465717 09/22-16:42:10 - 2
[root@ondemand ~]#
Don’t run these commands as root. Run them as your regular user with the environment variables as shown above. Remember, the idea here is to replicate in the shell what OnDemand is seeing. To do this, you need to issue the exact same command under the same circumstances.
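As a rough sketch of what that could look like in practice, the snippet below runs bjobs with an explicit environment. The only value filled in is LSF_BINDIR, which is an assumption derived from the bjobs path mentioned in this thread; the remaining variables should be copied from the command OnDemand actually logged. portal_bjobs, BJOBS and PORTAL_ENV are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: issue bjobs from the shell under the environment the portal used.
# Assumption: LSF_BINDIR is inferred from the bjobs path in this thread.
BJOBS="${BJOBS:-/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs}"
PORTAL_ENV=(
  "LSF_BINDIR=/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin"
  # Add the other variables from the logged command here.
)

portal_bjobs() {
  if [ "$(id -u)" -eq 0 ]; then
    # Root may see things the portal user cannot, which hides the problem.
    echo "warning: running as root; switch to the affected user first" >&2
  fi
  env "${PORTAL_ENV[@]}" "$BJOBS" -a -w -W "$1"
}
```

Running portal_bjobs 201986510 as the user who launched the session, on each OnDemand host, then shows whether both hosts return the same row for the job.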
Do you have the 3.0 window open? Can you try this while only logged into the 4.0 instance? Again, this is an issue of conflicts between these 2 portals. If you’re not accessing the 3.0 portal - then it won’t conflict with the 4.0 portal.
Also please confirm that the cluster.d files are the exact same. Again, we do this at OSC - interact with the same cluster on multiple portals and it works fine. This only happens when there’s some mismatch between the two portals. I.e., they’re actually configured differently or interact with different schedulers that are named the same.
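One way to check that the two cluster configs really match is a byte-for-byte comparison. This is only a sketch: compare_cluster_configs is a hypothetical helper, CLUSTER.yml stands in for whatever your cluster file is called, and the path assumes the default /etc/ood/config/clusters.d location.

```shell
#!/usr/bin/env bash
# Sketch: report whether two copies of a cluster config are identical.
compare_cluster_configs() {
  local a="$1" b="$2"
  if cmp -s "$a" "$b"; then
    echo "identical"
  else
    echo "different"
    # Show exactly which lines differ between the two portals.
    diff -u "$a" "$b"
    return 1
  fi
}

# Example (placeholder file name; host name taken from the prompts above):
#   scp ondemand:/etc/ood/config/clusters.d/CLUSTER.yml /tmp/ood3.yml
#   compare_cluster_configs /tmp/ood3.yml /etc/ood/config/clusters.d/CLUSTER.yml
```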
We have one cluster. /hpc is NFS-mounted on all servers, including the OnDemand servers, so bjobs is at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs everywhere.
I noticed that bjobs reports the job a little late, by around 5-6 seconds. How can I add a delay in the dashboard so that its status check is delayed?
Hi Jeff
I just created a bjobs wrapper, added a 3-second sleep, and included that wrapper in the bin overrides. It works, and I can now see jobs on my interactive list. Do you have a better approach I could use instead of adding a sleep?
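For reference, one possible variation on that wrapper is to retry instead of always sleeping, so a delay is only paid when bjobs comes back empty or fails. This is a sketch under assumptions, not an official recommendation: bjobs_with_retry, REAL_BJOBS, MAX_TRIES and RETRY_DELAY are hypothetical names, and the default path is the bjobs location mentioned earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch of a bjobs wrapper that retries rather than sleeping unconditionally.
REAL_BJOBS="${REAL_BJOBS:-/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs}"
MAX_TRIES="${MAX_TRIES:-4}"      # total attempts before giving up
RETRY_DELAY="${RETRY_DELAY:-2}"  # seconds to wait between attempts

bjobs_with_retry() {
  local try out rc
  for ((try = 1; try <= MAX_TRIES; try++)); do
    if [ "$try" -lt "$MAX_TRIES" ]; then
      # Hide errors on early attempts; the job may not be registered yet.
      out="$("$REAL_BJOBS" "$@" 2>/dev/null)" && rc=0 || rc=$?
    else
      # Last attempt: let any error message through to the caller.
      out="$("$REAL_BJOBS" "$@")" && rc=0 || rc=$?
    fi
    if [ "$rc" -eq 0 ] && [ -n "$out" ]; then break; fi
    if [ "$try" -lt "$MAX_TRIES" ]; then sleep "$RETRY_DELAY"; fi
  done
  if [ -n "$out" ]; then printf '%s\n' "$out"; fi
  return "$rc"
}

# As the last line of the wrapper script: bjobs_with_retry "$@"
```

The trade-off is that a query which legitimately returns nothing takes the full retry time, so the attempt count and delay would need tuning against the 5-6 second lag observed here.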