I’m now trying to retrofit our HPC cluster, which runs CentOS 6 / SSGE, with OOD to replace a homemade HPC web portal.
For the demo, I started with the latest OOD update (1.6.20) on CentOS 7.7, and I get an error when submitting an interactive desktop: "qsub: argument to -N option must not contain /"
Warning: node-001.cluster.org:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server node-001.cluster.org:1
Killing Xvnc process ID 51890
Xvnc process ID 51890 already killed
cmdTrace.c(713):ERROR:104: ‘restore’ is an unrecognized subcommand
cmdModule.c(411):ERROR:104: ‘restore’ is an unrecognized subcommand
WebSocket server settings:
Listen on :59373
No SSL/TLS support (no cert file)
Backgrounding (daemon)
/opt/sge/default/spool/node-001/job_scripts/51: line 156: syntax error near unexpected token `<'
/opt/sge/default/spool/node-001/job_scripts/51: line 156: `done < <(tail -f --pid={SCRIPT_PID} "vnc.log") &'
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
[jms@masterc77 d6c1c0b7-61ee-43e6-856a-7c1151518f01]
It looks like you may have to specify the job name directly. It seems you’ve read the man pages and that’s why you added the -N option in native. I’d suggest the config below, which sets the name directly in the submit file. Otherwise your wrapper may have to suffice until we patch it. We run qsub too at OSC, but it’s version 6.1.2. You must have something newer?
batch_connect:
  template: vnc
  extra_args: "-listen tcp -vgl -geometry 1240x1024"
script:
  # try to give it a static name
  job_name: 'some-static-name-like-interactive'
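If a static name is too restrictive, another option is to sanitize whatever name you generate before it reaches `qsub -N`. The helper below is only a sketch: `sanitize_job_name` is a hypothetical function (not part of OOD or SGE), and the sample name is made up for illustration. It simply replaces every "/" with "-", which is the character the error message complains about.

```shell
#!/bin/sh
# Hypothetical helper (not part of OOD or SGE): make a string acceptable
# to `qsub -N` by replacing every "/" with "-".
sanitize_job_name() {
  printf '%s\n' "$1" | tr '/' '-'
}

# Illustrative input only; the real job name depends on your app.
sanitize_job_name "sys/dashboard/desktop"
```

The sanitized value could then be used in a wrapper, e.g. `qsub -N "$(sanitize_job_name "$name")" ...`, assuming your wrapper has the name available in a variable.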
As for the problem you’re having with VNC, I can’t really tell why it’s being killed. Below is a good log from a very similar session. It appears that you’re unable to start a VNC server instance, or that it kills itself for some reason right after starting.
Warning: <somehost>:7 is taken because of /tmp/.X7-lock
Remove this file if there is no X server <somehost>:7
Desktop 'TurboVNC: <somehost>:8 (johrstrom)' started on display <somehost>:8
Log file is vnc.log
Successfully started VNC server on <somehost>:5908...
Script starting...
Starting websocket server...
Restoring modules from user's default, for system: "owens"
WebSocket server settings:
- Listen on :64333
- Flash security policy server
- No SSL/TLS support (no cert file)
- Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Launching desktop 'xfce'...
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
There’s also a vnc.log file in your directory there. It may say something more informative about the problem. If you have ssh access to the instance you’re trying to boot on, I’d suggest shelling into it and attempting to run that job_script_content.sh script.
At line 107 you can add -log "vnc.log:debug". Note that this line is inside the loop that tries to start VNC, so you can also add extra debug echo statements there if you need to.
Thanks for your answer. I have put the “job_name:” directive in my submit file and stopped using the wrapper, and the error has gone away. But the job is still killed, and I don’t know why.
Could you confirm that the script to launch, either with qsub (8.1.9) or over ssh, is “job_script_content.sh”?
Yes. You can see even in your wrapper that job_script_content.sh is the file that’s submitted. It triggers script.sh as a child process, waits for it, and then cleans up after it. So you can think of job_script_content.sh as the initializer and wrapper, and script.sh as the actual thing you want to run.
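To make that relationship concrete, here is a minimal sketch of the wrapper/child pattern. It is not the real job_script_content.sh, only an illustration of "start the child, wait for it, clean up"; `run_wrapped` and `clean_up` are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Sketch of a wrapper/child pattern (NOT the actual job_script_content.sh).

# Placeholder cleanup step; a real wrapper would stop servers, remove files, etc.
clean_up() {
  echo "Cleaning up..."
}

# Run the given command as a background child, wait for it, then clean up.
# Returns the child's exit status.
run_wrapped() {
  "$@" &
  child_pid=$!
  child_status=0
  wait "$child_pid" || child_status=$?
  clean_up
  return "$child_status"
}

# Usage (assuming a script.sh exists in the current directory):
#   run_wrapped ./script.sh
```

Because the wrapper owns the cleanup, anything that makes the wrapper itself fail early (for example a syntax error in it) can take the child's session down with it.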
Again, it sounds like you’re having VNC issues. If you have development enabled, it may be easier to modify the actual files that you submit instead of hacking previously submitted files. In either case, we need to take a look at the vnc.log and maybe even turn VNC debug logging on if possible.
Your VNC Server is unable to boot in these statements. The vnc.log may indicate why.
I have been working on this problem of “job_script_content.sh” killing the VNC session.
A config that works: CentOS 7.6 + LDAP authentication + OOD 1.6.20 + SLURM v17
My target config, which does not work: CentOS 7.6 + NIS authentication + OOD 1.6.20 + SSGE 8.1.9
In the second case, the culprit is the loop in “job_script_content.sh” that waits for the “Full-control authentication enabled for” pattern:
echo "Scanning VNC log file for user authentications..."
while read -r line; do
  if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
    change_passwd
    create_yml
  fi
done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &
More precisely, the problem is the last line, with the redirection on the “done” statement.
If I change that loop to:
create_yml
echo "Scanning VNC log file for user authentications..."
while read -r line; do
  if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
    change_passwd
    create_yml
  fi
#done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &
done < $(tail -f --pid=${SCRIPT_PID} "vnc.log" &)
then it works when submitted with qsub from a shell:
qsub -N BASIC -q inter -cwd -o output2.log job_script_content.sh
qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
37      0.55500  BASIC  jms   r      10/28/2019 08:42:00  inter@node-001.cluster.org  1
When the job fails, what does the vnc.log show? Again, turn debug log on if you need to.
Also, I would look into ENV variables, namely logging them on working jobs and failing jobs and seeing if there’s a difference. When you’re logged in ssh on a machine and submit a job to SSGE, does it pass your ENV variables? (like XDG_RUNTIME_DIR for example). This is where my thinking is.
Also when you compare these 2 setups, Slurm and SSGE, are you scheduling these jobs on the same destination node (node-001.cluster.org)? That would rule out any library problem, at least on the host.
Similar to ENV variables, do both schedulers load the same modules? Maybe SSGE isn’t loading the required GL or X11 libraries? Again, the vnc.log should give us some indication of why VNC is killing itself.
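For the environment comparison suggested above, something along these lines may help. This is a sketch only: `dump_env` and `compare_env` are hypothetical helper names, and the file paths in the usage comments are assumptions you would adapt to your site.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the environment of two jobs.

# Write the current environment, sorted, to the file given as $1.
dump_env() {
  env | sort > "$1"
}

# Show the lines that differ between two environment dumps.
# Like diff, this returns non-zero when the dumps differ.
compare_env() {
  diff "$1" "$2"
}

# Usage (paths are examples only):
#   in the working job:   dump_env "$HOME/env.working.txt"
#   in the failing job:   dump_env "$HOME/env.failing.txt"
#   afterwards:           compare_env "$HOME/env.working.txt" "$HOME/env.failing.txt"
```

Calling `dump_env` near the top of the submitted script captures what the scheduler actually passes to the job, which is not necessarily what you see in your ssh session.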
In fact, it was due to the queue shell, which defaults to sh in SSGE. I changed it to bash (qconf -mq inter), restored the vnc.rb script to its original state, and desktop submission now works with SSGE.
However, launching the noVNC client still fails. I’m going to continue my analysis and will come back to you afterwards.
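For anyone hitting the same syntax error: the `done < <(...)` construct is bash process substitution, which a plain POSIX sh does not accept. A quick way to check whether a given shell handles it is sketched below; `supports_procsub` is a hypothetical helper name, and the result for `sh` depends on what `sh` is on your nodes.

```shell
#!/bin/sh
# Hypothetical probe: return 0 if the shell named in $1 can parse and run
# a bash-style process substitution, non-zero otherwise.
supports_procsub() {
  "$1" -c 'while read -r line; do :; done < <(printf "x\n")' >/dev/null 2>&1
}

# Report the result for the shells of interest.
for candidate in sh bash; do
  if supports_procsub "$candidate"; then
    echo "$candidate: process substitution OK"
  else
    echo "$candidate: no process substitution"
  fi
done
```

On the SGE side, `qconf -sq inter` prints the queue configuration (here for the `inter` queue used in this thread), including its `shell` and `shell_start_mode` attributes, so you can confirm which shell the queue will use before submitting.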