We are testing running OOD on a VM separate from the HPC cluster, and I have a couple of questions. Some initial info: OOD version → 3.0.1, cluster → LSF
The first issue we had when starting an app was an error on the “template” copy. It seems the way we have configured the NFS mount doesn’t allow a “chgrp” command (even though it is in your home directory). Is there a specific reason why you use “-a” for the rsync: https://github.com/OSC/ondemand/blob/v3.0.1/apps/dashboard/app/models/batch_connect/session.rb#L263 ? I changed it to “-rlt” and it seems to proceed, but I’m not sure whether that has other implications.
After that I tried to run the app with the login node of our cluster configured as the “submit_host” (and added strict_host_checking: false).
Now the job gets submitted and the folder in the home is created, and the file in the “db” folder gets created too, but there’s no active session even though the job runs. The OOD VM node has no scheduler binaries available, but I thought it would run them using ssh through the “submit_host”. I have no idea if that’s actually happening or how I can test it.
Can you help me with it?
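One way to check by hand whether scheduler commands can be reached through the submit host is to build the same kind of ssh call yourself and run it from the VM as your own user. This is only a sketch: the helper name, the host name, and the choice of bjobs are assumptions for illustration, and the ssh options shown are standard OpenSSH options, not necessarily the exact set OOD passes.

```ruby
# Hypothetical helper, not part of OOD: builds an argv for running a
# scheduler command on the submit host over ssh.
def remote_scheduler_argv(submit_host, command, strict_host_checking: true)
  strict = strict_host_checking ? 'yes' : 'no'
  [
    'ssh',
    '-o', 'BatchMode=yes',                    # fail instead of prompting for a password
    '-o', "StrictHostKeyChecking=#{strict}",
    submit_host,
    *command
  ]
end

# 'login.example.org' is a placeholder for the configured submit_host.
argv = remote_scheduler_argv('login.example.org', %w[bjobs -u all],
                             strict_host_checking: false)
puts argv.join(' ')

# To actually run it from the VM:
#   system(*argv)
```

If the printed command works in a terminal on the VM without asking for a password, the ssh path itself is probably fine and the problem is more likely in what gets passed through it.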
You do in fact need that binary on the ood node. While the scheduler command doesn’t run there, it is used to construct the query you wish to submit before being sent off to the compute node. That might not be the total issue here, but let’s start there and see.
My previous post is incorrect, as you are using the submit_host to handle the submission. I forgot that this is a second way to do it that doesn’t involve the binary.
I’m being told: “Their issue is that they need a shared NFS storage. i.e., the same $HOME that is on the VM and the compute node.”
Thanks for the quick answer. I’m going to try, but I still don’t get why you need it. Aren’t the binary names included in the scheduler adapter? And the full path is part of the cluster configuration. So you can’t really test whether the command works, but you can construct the query anyway.
Ok. Was answering at the same time.
Anyway, as I wrote, we do have the home mounted over NFS, and in fact all the folders are created and visible from the OOD VM. I’m not sure how OOD checks that there’s an active session running. There’s probably something wrong somewhere else.
I also had a question regarding the “rsync” option, which causes an issue with our NFS. I’m not sure whether our change is fine or whether it can be a source of issues too.
The -a option is shorthand for -rlptgoD: recursive copying plus preserving symlinks, permissions, times, group, owner, and device and special files. The reason for using -a is to ensure that the copied files maintain their original properties.
You changed it to -rlt, which retains the recursive copying, symbolic links, and modification times but drops the preservation of permissions, group, and owner information. This might be a workaround if your NFS setup is restrictive about these aspects. Note that not preserving permissions and ownership could lead to issues, but I don’t think that is the issue here (though I need to check this).
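For reference, here is a small sketch of how -a breaks down and what is left once the ownership-related options are dropped. The helper name is made up for illustration; the flag letters themselves come from the rsync man page, where -D stands for --devices --specials.

```ruby
# -a is shorthand for these single-letter options (per the rsync man page).
ARCHIVE_FLAGS = %w[r l p t g o D].freeze

# Hypothetical helper: returns the flag string left after dropping the
# options a restrictive NFS mount rejects.
def rsync_flags_without(*dropped)
  kept = ARCHIVE_FLAGS - dropped
  kept.empty? ? '' : "-#{kept.join}"
end

puts rsync_flags_without('g', 'o')       # prints -rlptD
puts rsync_flags_without('g', 'o', 'D')  # prints -rlpt
```

Dropping only g and o keeps permissions and times intact while avoiding the chgrp and chown calls, which is the part the NFS mount seems to object to.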
The first logs I’d be curious about are in /var/log/httpd/<hostname>_error.log, to see whether any configuration errors show up there.
I’m wondering if there is anything in the logs to give us something because it is a bit odd this isn’t working for you and I’m not sure what it could be atm.
Also, when you run these jobs, it sounds like you are doing interactive apps. Do the cards show as completed? I know you said you can see the job run, and the directory and files created, but if it is going right to completed I’d wonder whether something is not right with the cache_completed and LSF here.
So that rsync copies to the root_stage in the home, so I hope the original group and owners are not really needed (since I’m not sharing the folder in the home with the same group that has permissions on the app). But I think I missed the “-p” for the permissions and the “--specials” that comes with “-D”, so I’m going to update the change, thanks.
At the moment we still have OOD running on the login node as well. I just tested that after running the app from that instance, the session shows up in the VM too. So the app “start” from the VM is somehow not working.
In the log I do get errors, but I don’t think they are useful. They look like this:
Anyway, if I run it from the VM, the session is not shown at all, not even in the OOD instance on the login node, while the job is actually running. So something in the “creation” of the session is broken.
Sorry, I can’t quite follow this. I’m also confused as to the behavior still of what you see in OOD.
Are the cards showing up when you launch at all?
I can’t really read the log output there, sorry. Quite a bit is cut off. But you should not discount the information the logs provide, as it’s going to be our only guide given the non-standard change to the software in the session.rb file.
Hmm, ok so are you saying:
no session card even shows up, ever
you can launch this from a login node and it works fine
the entire file structure of the app is created, or part of the structure?
Apologies if I was not clear enough. We have OOD 3.0.1 in prod; it is installed on the login node of our cluster and works as expected.
Now we want to test a different configuration and move the OOD instance to a VM outside of the cluster, adding a “submit_host” to the cluster definition in OOD and mounting the home directory of our cluster over NFS.
Both the production instance running in the login node and the VM are connected to the same cluster so I can start an interactive application from any of the 2 instances. Let’s call the instance in the login node (and fully working) OOD_prod, the other I will call OOD_VM.
If I use the OOD_prod to start an interactive app, everything works as expected and the session is shown also in the OOD_VM. So I suppose this means the OOD_VM instance can correctly read the home directory (mounted as NFS) and use the scheduler binary in the cluster through the submit_host (if that’s the way it checks the job is still running).
Instead, if I start the interactive application using the OOD_VM instance, nothing shows up in the OOD UI: no running application, no stopped application, no card at all. But the job was submitted and keeps running, and I can access the interactive application if I manually enter the URL with the port. In fact, everything in output.log looks fine too. But in this second case the card for this app is not shown even on the OOD_prod instance.
I was now collecting more info to share but that’s when I figured out the session “descriptor” under the “db” folder is different. For the OOD_VM, the “job_id” is null. That’s probably why the card in the UI is not shown at all. If I manually modify the job_id field for the session with the id of the job running then the card shows in the UI.
So it seems the OOD_VM using submit_host can’t get the job_id after it submits the job. How can I check why that’s happening?
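A quick way to spot sessions in this state is to scan the descriptors for a missing job_id. The sketch below assumes each descriptor is a JSON file with a top-level “job_id” field, as described above; the helper name and the default directory are assumptions, so pass the real path to your “db” folder as the first argument.

```ruby
require 'json'

# Hypothetical helper: true when a session descriptor has no usable job_id.
def missing_job_id?(descriptor_path)
  session = JSON.parse(File.read(descriptor_path))
  session['job_id'].to_s.strip.empty?
end

# Assumed default location; override it with the first command-line argument.
db_dir = ARGV[0] ||
         File.join(Dir.home, 'ondemand/data/sys/dashboard/batch_connect/db')

Dir.glob(File.join(db_dir, '*')).each do |path|
  next unless File.file?(path)

  begin
    puts "#{path}: job_id is missing" if missing_job_id?(path)
  rescue JSON::ParserError
    puts "#{path}: not valid JSON, skipped"
  end
end
```

Any descriptor it reports was written without a job id coming back from the submission, which points at the submit step rather than at the NFS mount.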
ok. I found the issue. It is some code in my submit.yaml. I have some complex quotes escaped which when wrapped in a “ssh” command are not working anymore. I need to work a bit on it. The rest looks fine.
Many thanks again for your answers!
If I had to guess, I would guess that the > ... 2>&1 is somehow throwing the ssh command off.
When you submit the job from the OOD_VM what do you see in <%= staged_root %>/output.log? I’d bet the output stream redirection isn’t working properly if you have to additionally ssh on top of the qsub or similar.
I see here: https://github.com/OSC/ood_core/blob/master/lib/ood_core/job/adapters/helper.rb#L34 the “ssh_wrap” helper, which looks like it returns “ssh” plus the rest of the array. So it seems everything after the ssh command is treated as its arguments. That’s probably the issue.
I have no idea if it works but in theory the behaviour should be: return 'ssh', args + [[cmd] + cmd_args]
instead of: return 'ssh', args + [cmd] + cmd_args
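To illustrate why quoting can get lost here: ssh joins all remote arguments into one space-separated string, and the shell on the remote side splits that string again. The sketch below simulates that round trip with Ruby’s Shellwords, which only approximates POSIX shell word splitting. The helper name and the bsub arguments are made up for illustration, and this is not how ood_core handles it.

```ruby
require 'shellwords'

# Simulates what the remote side sees: ssh joins the remote arguments with
# spaces into one string, and the remote shell splits that string again.
def seen_by_remote_shell(remote_args)
  Shellwords.split(remote_args.join(' '))
end

# Hypothetical scheduler call with one argument that contains spaces.
cmd_args = ['bsub', '-J', 'my interactive job']

p seen_by_remote_shell(cmd_args)
# => ["bsub", "-J", "my", "interactive", "job"]

# Escaping each argument before it goes through ssh keeps it in one piece.
escaped = cmd_args.map { |arg| Shellwords.escape(arg) }
p seen_by_remote_shell(escaped)
# => ["bsub", "-J", "my interactive job"]
```

So an argument that was a single array element locally can arrive as several words remotely unless it is escaped once more for the extra shell.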
at the beginning of post.sh and it seems to work. Thanks for your suggestion.
Now, I was still wondering about the “-a” option of the rsync; changing it to “-rltp” seems to work fine. Is there any specific reason why you want to keep the owner and the group of the “template” when copying it to the home folder of a user? Usually a user would not set the ownership of folders in their home to values they did not even choose (since those come from the “admin” who created the app template).
We likely need to change the source code. I can only speculate as to why the -a flag was used (or dig up the commit and hope there’s a note about it), but in any case, if it doesn’t work for you then we should likely change it anyhow.