I am trying to set up the Linux host adapter to our dedicated interactive nodes.
We set up the host based authentication and I can ssh from the ondemand server to the interactive node fine w/o password, e.g. with the command that OOD uses:
ssh -t -o "BatchMode=yes" -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" u0101881@frisco1.chpc.utah.edu
I also created the cluster.d config file and the bc_desktop config file, and I think they are in reasonable shape (I'm not quite sure about the clusters.d file, where I also added the batch_connect section so that I can add the script_wrapper pieces, e.g. the PATH to TurboVNC, WEBSOCKIFY_CMD, etc., but I think that should be correct, comparing to the scheduler-based batch setup).
Now, when I push the button to start the interactive desktop, I get, in the OOD window:
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'frisco1.chpc.utah.edu,155.101.26.201' (ECDSA) to the list of known hosts.
Illegal variable name.
Badly placed ()'s.
Unmatched ".
I presume the first two lines are just a warning, but the other three suggest that something is wrong with some shell script. I went as far as injecting #!/bin/bash into the job_script_content.sh with no effect, and there's no output.log, so I suspect that this error is coming from somewhere before any of the job scripts get executed. My default shell is tcsh, so I suspect somewhere there is a "#!/usr/bin/env bash" missing.
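For what it's worth, "Illegal variable name." is exactly what tcsh prints when it hits bash-style $() command substitution, which supports that theory. A quick way to reproduce it (just a hypothetical one-liner, not one of the OOD scripts):

tcsh -c 'echo $(hostname)'
Illegal variable name.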
I would say turn on the debug flag (in the configuration); it will drop 2 files in your home directory that you can debug.
One is the outer wrapper (ending in _tmux), and this is fed to the ssh command on stdin. The other is what gets executed within the Singularity container (ending in _sing).
My guess is you’re failing on the outer script. The shebang is #!/bin/bash so I guess we could be getting into trouble there (instead of using env).
But that's where I would start; that way you can reproduce the error in an interactive terminal, modifying the file to see what's wrong.
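For reference, the flag goes under the adapter's job section of the cluster.d file; a minimal sketch (using the submit_host from your setup):

v2:
  job:
    adapter: "linux_host"
    submit_host: "frisco1.chpc.utah.edu"
    debug: true  # keeps the generated *_tmux and *_sing wrapper scripts around for inspection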
I have debug: true set, and I am not seeing any _tmux file in the temp directory. This is what I have:
[05064afe-db1d-4d5b-844b-0951775c0c9e]$ ls -l
total 24
-rwxr-xr-x 1 u0101881 chpc 100 May 1 22:52 before.sh
drwxr-xr-x 2 u0101881 chpc 68 Apr 24 22:35 desktops
-rw-r--r-- 1 u0101881 chpc 7289 May 1 22:52 job_script_content.sh
-rw-r--r-- 1 u0101881 chpc 488 May 1 22:52 job_script_options.json
-rwxr-xr-x 1 u0101881 chpc 658 May 1 22:52 script.sh
-rw-r--r-- 1 u0101881 chpc 55 May 1 22:52 user_defined_context.json
My colleague Brett, who has bash as his default shell, got further: his output.log got generated but is empty. His error in the browser just said that the job entered a bad state.
I think we may have something wrong with our configuration. I was not sure about a few things in your docs; I'll detail that in the next post.
v2:
  metadata:
    title: "Frisco"
    url: "https://www.chpc.utah.edu/documentation/guides/frisco-nodes.php"
    hidden: false
  login:
    host: "frisco1.chpc.utah.edu"
  job:
    adapter: "linux_host"
    submit_host: "frisco1.chpc.utah.edu" # This is the head for a login round robin
    ssh_hosts: # These are the actual login nodes
      - frisco1.chpc.utah.edu
      - frisco2.chpc.utah.edu
      - frisco3.chpc.utah.edu
      - frisco4.chpc.utah.edu
      - frisco5.chpc.utah.edu
      - frisco6.chpc.utah.edu
      - frisco7.chpc.utah.edu
      - frisco8.chpc.utah.edu
    site_timeout: 7200
    debug: true
    singularity_bin: /uufs/chpc.utah.edu/sys/installdir/singularity3/std/bin/singularity
    singularity_bindpath: /etc,/mnt,/media,/opt,/run,/srv,/usr,/var,/uufs,/scratch
    singularity_image: /opt/ood/linuxhost_adapter/centos7_lmod.sif
    # Enabling strict host checking may cause the adapter to fail if the user's known_hosts does not have all the roundrobin hosts
    strict_host_checking: false
    tmux_bin: /usr/bin/tmux
  batch_connect:
    basic:
      script_wrapper: |
        #!/bin/bash
        set -x
        if [ -z "$LMOD_VERSION" ]; then
          source /etc/profile.d/chpc.sh
        fi
        export XDG_RUNTIME_DIR=$(mktemp -d)
        %s
      set_host: "host=$(hostname -A | awk '{print $2}')"
    vnc:
      script_wrapper: |
        #!/bin/bash
        set -x
        export PATH="/uufs/chpc.utah.edu/sys/installdir/turbovnc/std/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/uufs/chpc.utah.edu/sys/installdir/websockify/0.8.0/bin/websockify"
        export XDG_RUNTIME_DIR=$(mktemp -d)
        %s
      set_host: "host=$(hostname -A | awk '{print $2}')"
I am setting the num_cores since our default bc_desktop/submit.yml.erb has it as:
batch_connect:
  template: vnc
script:
  native:
  <%- if num_cores != "none" -%>
    - "-n <%= num_cores %>"
  <%- end -%>
Please let me know how these config files look to you. I have a feeling that I am missing something, since I don't see any singularity call in any of the job scripts. And I can ssh to the container fine (and thanks to all the bind mounts it can see all our software stack - that's a great idea that I'll keep in mind for future similar projects).
Also, any way to get more default debug info? I just get the script files as listed and the Apache logs that I pasted earlier.
OK, it seems that, unfortunately, it writes out those debug files on the login host, after it ssh's into it.
This is the very beginning of what it's trying to do (the full file being here), and it's failing even trying to set up these variables (that's the "Illegal variable name" coming from the $(); it looks like tcsh only accepts backticks for command substitution).
Even though the script has a shebang header, that isn't honored when it is fed in on stdin.
We’re effectively doing something like this: cat test.sh | ssh user@host when it appears we should be doing cat test.sh | ssh user@host /bin/bash to force bash execution (or something similar).
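As a rough illustration (hypothetical script and host names), with a bash-syntax script and a remote account whose login shell is tcsh:

# test.sh uses bash-style command substitution
printf 'host=$(hostname)\necho "$host"\n' > test.sh

# piped as-is, the script is read by the remote login shell (tcsh), which rejects $()
cat test.sh | ssh user@host
# -> Illegal variable name.

# naming the interpreter explicitly bypasses the login shell
cat test.sh | ssh user@host /bin/bash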
I don't think you'll make it very far in tcsh at the moment, because it looks like the initial script we're trying to execute over ssh isn't tcsh-compliant and, as you indicate, it can't even get through that initial step of writing out the 2 files (through cat heredocs) and then executing them.
That said, I think your config looks OK and your colleague with bash may have luck. I’m looking into tcsh compliance now.
Just for historical context, the LHA didn't work with some shells like tcsh. The issue was ultimately fixed. Thanks @mcuma for testing and bringing this bug to our attention!