Error with job_name & SSGE 8.1.9

Hi, everybody

I’m trying to retrofit our HPC cluster, which runs CentOS 6 / SSGE, with OOD to replace a homemade HPC web portal.

For the demo, I started with the latest OOD update (1.6.20) under CentOS 7.7, and I get an error when submitting an interactive desktop: "qsub: argument to -N option must not contain /"

Here is my config:

/etc/ood/config/cluster.d/p6444.yml


v2:
  metadata:
    title: "SGE_HPC"
  login:
    host: "masterc77"
  job:
    adapter: "sge"
    cluster: "p6444"
    bin: "/opt/sge/bin/lx-amd64/"
    conf: "/opt/sge/default/common"
    sge_root: "/opt/sge"
    libdrmaa_path: "/opt/sge/lib/lx-amd64/libdrmaa.so"
    #bin_overrides:
    #  qsub: "/opt/qsub.sh"
    #  qstat: ""
    #  qhold: ""
    #  qrls: ""
    #  qdel: ""
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -a)"
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s
      set_host: "host=$(hostname -a)"

my bc-desktop form.yml


title: "interactive"
cluster: "p6444"
submit: "submit/interactive_submit.yml.erb"

attributes:
  desktop: "interactive"
  bc_queue: inter
  bc_account: null
  bc_num_hours: 8
  bc_num_slots: 1

my submit file :


batch_connect:
  template: vnc
  extra_args: "-listen tcp -vgl -geometry 1240x1024"

RESULT :
error message : "qsub: argument to -N option must not contain / "

the generated “job_script_options.json” :

{
  "job_name": "sys/dashboard/sys/bc_desktop/interactive",
  "workdir": "/home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/cfe12f9e-84f9-42d0-b8c8-dd1af8afefaa",
  "output_path": "/home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/cfe12f9e-84f9-42d0-b8c8-dd1af8afefaa/output.log",
  "shell_path": "/bin/bash"
}
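
The slash-separated `job_name` shown above is the value that `qsub -N` rejects. As a minimal sketch of how such a name could be made acceptable before it reaches `qsub` (the function name `sanitize_job_name` is hypothetical and is not part of OOD or SGE), the slashes can simply be replaced:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of OOD or SGE:
# replace every "/" in a job name with "-" so that `qsub -N` accepts it.
sanitize_job_name() {
  local name=$1
  printf '%s\n' "${name//\//-}"
}

sanitize_job_name "sys/dashboard/sys/bc_desktop/interactive"
# prints: sys-dashboard-sys-bc_desktop-interactive
```

A static name set in the submit file avoids the need for any such rewriting; this only illustrates what the restriction on `-N` requires.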

To go further, I tried to implement a qsub wrapper :slight_smile:

My config became:

/etc/ood/config/cluster.d/p6444.yml


v2:
  metadata:
    title: "SGE_HPC"
  login:
    host: "masterc77"
  job:
    adapter: "sge"
    cluster: "p6444"
    bin: "/opt/sge/bin/lx-amd64/"
    conf: "/opt/sge/default/common"
    sge_root: "/opt/sge"
    libdrmaa_path: "/opt/sge/lib/lx-amd64/libdrmaa.so"
    bin_overrides:
      qsub: "/opt/qsub.sh"
      # qstat: ""
      # qhold: ""
      # qrls: ""
      # qdel: ""
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -a)"
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s
      set_host: "host=$(hostname -a)"

my submit file :

batch_connect:
  template: vnc
  extra_args: "-listen tcp -vgl -geometry 1240x1024"
script:
  native:
    - "-N INTERACTIVE"

my qsub wrapper :

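#!/usr/bin/env bash
# qsub wrapper: drop the auto-generated "-N sys/dashboard/..." option, which
# SGE rejects because of the slashes, and pass every other option through.
. /etc/profile.d/sge.sh

echo "RECEIVED : $0 $*" > /tmp/qsub.log
command=(qsub -v PATH)

# Assumes every option is followed by exactly one argument.
while [ $# -gt 0 ] ; do
  if [ "$1" == "-wd" ] ; then
    # Remember the working directory; the job script lives there.
    Workdir=$2
    command+=("$1" "$2")
  elif [ "$1" == "-N" ] && [[ $2 =~ "sys/dashboard" ]] ; then
    # Log the rejected job name instead of forwarding it.
    echo "$2" >> /tmp/qsub.log
  else
    command+=("$1" "$2")
  fi
  shift
  shift
done
# Submit the generated job script as a file argument.
command+=("$Workdir/job_script_content.sh")
chmod +x "$Workdir/job_script_content.sh"

echo "SUBMITTED : ${command[*]}" >> /tmp/qsub.log
exec "${command[@]}"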

my qsub wrapper log :
RECEIVED : /opt/qsub.sh -wd /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01 -N sys/dashboard/sys/bc_desktop/interactive -o /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01/output.log -q inter -l h_rt=08:00:00 -N INTERACTIVE
sys/dashboard/sys/bc_desktop/interactive
SUBMITTED : qsub -v PATH -wd /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01 -o /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01/output.log -q inter -l h_rt=08:00:00 -N INTERACTIVE /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01/job_script_content.sh

RESULT: the job starts, then fails.

Question: what am I doing wrong?

Direct submission within a shell :

qsub -v PATH -wd /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01 -o /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01/output.log -q inter -l h_rt=08:00:00 -N INTERACTIVE /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/interactive/output/d6c1c0b7-61ee-43e6-856a-7c1151518f01/job_script_content.sh

RESULT : the job starts, then fails.

DETAILS :

[jms@masterc77 d6c1c0b7-61ee-43e6-856a-7c1151518f01]$ ls -ltr

total 44
drwxr-xr-x 2 jms my_users 90 18 oct. 03:06 desktops
-rw-r--r-- 1 jms my_users 32 20 oct. 14:05 user_defined_context.json
-rwxr-xr-x 1 jms my_users 100 20 oct. 14:05 before.sh
-rwxr-xr-x 1 jms my_users 558 20 oct. 14:05 script.sh
-rwxr-xr-x 1 jms my_users 5795 20 oct. 14:05 job_script_content.sh
-rw-r--r-- 1 jms my_users 498 20 oct. 14:05 job_script_options.json
-rw-r--r-- 1 jms my_users 666 20 oct. 14:05 INTERACTIVE.e50
-rw------- 1 jms my_users 16 20 oct. 14:31 vnc.passwd
-rw-r--r-- 1 jms my_users 1392 20 oct. 14:31 vnc.log
-rw-r--r-- 1 jms my_users 1414 20 oct. 14:32 output.log
-rw-r--r-- 1 jms my_users 791 20 oct. 14:32 INTERACTIVE.e51
[jms@masterc77 d6c1c0b7-61ee-43e6-856a-7c1151518f01]$ more INTERACTIVE.e51

Warning: node-001.cluster.org:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server node-001.cluster.org:1
Killing Xvnc process ID 51890
Xvnc process ID 51890 already killed
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
WebSocket server settings:
  - Listen on :59373
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
/opt/sge/default/spool/node-001/job_scripts/51: line 156: syntax error near unexpected token `<'
/opt/sge/default/spool/node-001/job_scripts/51: line 156: `done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &'
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
[jms@masterc77 d6c1c0b7-61ee-43e6-856a-7c1151518f01]$

It looks like you may have to specify the job name directly. It seems you’ve seen the man pages, and that’s why you added the -N bit in native. I’d suggest the snippet below, which sets the name directly in the submit file. Otherwise your wrapper may have to suffice until we patch it. We run qsub at OSC too, but it’s version 6.1.2. You must have something higher?

batch_connect:
  template: vnc
  extra_args: "-listen tcp -vgl -geometry 1240x1024"
script:
  # try to give it a static name
  job_name: 'some-static-name-like-interactive'

Here’s the version we run.

[johrstrom@owens-login01 ~]$ qsub --version
Version: 6.1.2
Commit: 661e092552de43a785c15d39a3634a541d86898e

As for the problem you’re having with VNC, I can’t really tell why it’s being killed. Below is a good log from a very similar session. It would appear that you’re unable to start a VNC server instance, or that for some reason it kills itself right after starting.

Warning: <somehost>:7 is taken because of /tmp/.X7-lock
Remove this file if there is no X server <somehost>:7

Desktop 'TurboVNC: <somehost>:8 (johrstrom)' started on display <somehost>:8

Log file is vnc.log
Successfully started VNC server on <somehost>:5908...
Script starting...
Starting websocket server...
Restoring modules from user's default, for system: "owens"
WebSocket server settings:
  - Listen on :64333
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Launching desktop 'xfce'...
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall

There’s also a vnc.log file in your directory there. It may say something more informative about the problem. If you have ssh access to the instance you’re trying to boot on, I’d suggest shelling into it and attempting to run that job_script_content.sh script.

Here, on line 107, you can add -log "vnc.log:debug". Note that this line sits inside the loop that tries to start VNC, so you can add extra debug echo statements there if you need to.

VNC_OUT=$(vncserver -log "vnc.log:debug" -rfbauth "vnc.passwd" -nohttpd -noxstartup -geometry 1536x780 -idletimeout 0  2>&1)

Hi, Jeff

Thanks for your answer. I have put the “job_name:” directive in my submit file and stopped using the wrapper, and the error has gone away. But the job is still killed, and I don’t know why.

Could you confirm that the script to launch, either with qsub (8.1.9) or over ssh, is “job_script_content.sh”?

jean-marie

Yes. You can see even in your wrapper that job_script_content.sh is the file that’s submitted. It triggers script.sh as a child process, waits for it, then cleans up after it. So you can think of job_script_content.sh as the initializer and wrapper, and script.sh as the actual thing you want to do.
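
The "start a child, wait for it, then clean up" relationship described above can be sketched roughly as follows. This is only an illustration of the pattern, not the real content of job_script_content.sh; the function `run_wrapped` and the marker file are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: a wrapper that starts a child command,
# waits for it, then runs a cleanup step and returns the child's status.
run_wrapped() {
  local child_cmd=$1   # stands in for "script.sh"
  local marker=$2      # hypothetical file used to record that cleanup ran
  local status=0

  bash -c "$child_cmd" &          # start the real workload as a child process
  local child_pid=$!

  wait "$child_pid" || status=$?  # block until the child exits, keep its status

  echo "cleaned up" >> "$marker"  # cleanup step, runs whatever the status was
  return "$status"
}
```

For example, `run_wrapped 'sleep 1; exit 3' /tmp/marker.txt` returns 3 after the cleanup line has been written.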

Again, it sounds like you’re having VNC issues. If you have development enabled, it may be easier to modify the actual files that you submit instead of hacking previously submitted files. In either case, we need to take a look at the vnc.log and maybe even turn VNC debug logging on if possible.

Your VNC Server is unable to boot in these statements. The vnc.log may indicate why.

Hi, Jeff

I have been working on the problem of “job_script_content.sh” killing the VNC session.

  1. A config that works: CentOS 7.6 + LDAP authentication + OOD 1.6.20 + SLURM v17

  2. My target config, which does not work: CentOS 7.6 + NIS authentication + OOD 1.6.20 + SSGE 8.1.9

In the second case, the culprit is the loop in “job_script_content.sh” that waits for the “Full-control authentication enabled for” pattern:

echo "Scanning VNC log file for user authentications..."
while read -r line; do
  if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
    change_passwd
    create_yml
  fi
done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &

More precisely, it is the last line, the redirection after done.

If I change that loop to

create_yml
echo "Scanning VNC log file for user authentications..."
while read -r line; do
  if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
    change_passwd
    create_yml
  fi
#done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &
done < $(tail -f --pid=${SCRIPT_PID} "vnc.log" &)

it works with a qsub from a shell:

 qsub -N BASIC -q inter -cwd -o output2.log job_script_content.sh
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 

 37 0.55500 BASIC      jms          r     10/28/2019 08:42:00 inter@node-001.cluster.org         1        

On the node, I can find these processes:

root      30489  13439  0 08:48 pts/0    00:00:00          \_ grep --color=auto jms
jms       30054  30053  0 08:42 ?        00:00:00      \_ -sh /opt/sge/default/spool/node-001/job_scripts/37
jms       30190  30054  0 08:42 ?        00:00:00          \_ bash /home/jms/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/basic/output/1258436a-549e-48c6-987d-ae35b83f3358/script.sh
jms       30217  30190  0 08:42 ?        00:00:00              \_ xfwm4
jms       30177      1  0 08:42 ?        00:00:00 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: node-001.cluster.org:1 (jms) -auth /home/jms/.Xauthority -geometry 1240x1024 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/jms/.vnc/x509_cert.pem -x509key /home/jms/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0 -listen tcp
jms       30210      1  0 08:42 ?        00:00:00 vglclient -detach -l /home/jms/.vnc/virtualGL.log
jms       30219      1  0 08:42 ?        00:00:00 python /opt/websockify/run -D 31003 localhost:5901
jms       30222      1  0 08:42 ?        00:00:00 tail -f --pid=30190 vnc.log

When I try this modification within OOD, the job is submitted :smiley:, but the launch of noVNC fails :disappointed_relieved:

What should I do next?

Thanks

PS: with the original job_script_content.sh executed from a shell on the node, it works. That would suggest the problem is due to SSGE…

jean-marie

When the job fails, what does the vnc.log show? Again, turn debug log on if you need to.

Also, I would look into ENV variables, namely logging them on working jobs and failing jobs and seeing if there’s a difference. When you’re logged in ssh on a machine and submit a job to SSGE, does it pass your ENV variables? (like XDG_RUNTIME_DIR for example). This is where my thinking is.
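
One possible way to do that comparison, sketched under the assumption that each job dumps its environment to a text file (the helper name `missing_vars` and the file names are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: given two environment dumps (one NAME=value per line),
# print the variable names that exist in the first dump but not in the second.
# Values that span several lines would confuse this simple name extraction.
missing_vars() {
  local working=$1 failing=$2
  comm -23 <(cut -d= -f1 "$working" | sort -u) \
           <(cut -d= -f1 "$failing" | sort -u)
}

# In each job (working and failing), dump the environment first, for example:
#   env | sort > "env.$(hostname).$$.txt"
# then compare the two dumps:
#   missing_vars env.working.txt env.failing.txt
```

A variable such as XDG_RUNTIME_DIR showing up in the output would mean the working job had it and the failing one did not.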

Also when you compare these 2 setups, Slurm and SSGE, are you scheduling these jobs on the same destination node (node-001.cluster.org)? That would rule out any library problem, at least on the host.

Similar to ENV variables, do both schedulers load up the same modules? Maybe SSGE isn’t loading the required GL or X11 libraries? Again, the vnc.log should give us some indication of why VNC is killing itself.

Jeff, you’re right. It’s a SHELL problem.

In fact, it is due to the queue shell, which defaults to sh in SSGE. I changed it to bash (qconf -mq inter) and restored the vnc.rb script to its original state, and desktop submission now works with SSGE. :smiley:
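
For anyone checking the same thing: the `done < <(tail -f ...)` line uses process substitution, which is a bash feature that a plain sh rejects with a syntax error. As far as I understand the SGE queue configuration, with `shell_start_mode posix_compliant` the job script is started with the queue's `shell` rather than with its `#!` line. A small sketch for reading those two attributes from `qconf -sq` output follows; the helper name `queue_attr` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `qconf -sq <queue>` output on stdin and print the
# value of one attribute, for example "shell" or "shell_start_mode".
queue_attr() {
  local attr=$1
  # Exact match on the first column, so "shell" does not match "shell_start_mode".
  awk -v key="$attr" '$1 == key { print $2; exit }'
}

# Usage on a host with SGE installed (queue name "inter" is the one from this thread):
#   qconf -sq inter | queue_attr shell
#   qconf -sq inter | queue_attr shell_start_mode
```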

But the launch of the noVNC client still fails. I’m going to continue my analysis and will come back to you afterwards.

jean-marie

Jeff, some good news: it works with OOD 1.6.20 under CentOS 7.6 with PAM / NIS authentication and SSGE 8.1.9 :smiley:

To summarize, you have to:

  • for PAM authentication, follow the process described in the topic “Can OOD auth be handled by PAM?”

  • use “job_name” in the “bc_desktop” submit files to avoid the error with SSGE qsub


    batch_connect:
      template: vnc
      extra_args: "-listen tcp -vgl -geometry 1240x1024"
    script:
      job_name: "BASIC"

  • use bash as the queue shell to prevent SSGE jobs from being killed

    qconf -mq inter

    qname inter
    hostlist @inter
    seq_no 0
    load_thresholds np_load_avg=1.75
    suspend_thresholds NONE
    nsuspend 1
    suspend_interval 00:05:00
    priority 0
    min_cpu_interval 00:05:00
    processors UNDEFINED
    qtype BATCH INTERACTIVE
    ckpt_list NONE
    pe_list make mpi smp
    rerun FALSE
    slots 1,[node-001=4]
    tmpdir /tmp
    shell /bin/bash
    prolog NONE
    epilog NONE
    shell_start_mode posix_compliant
    starter_method NONE
    suspend_method NONE
    resume_method NONE

  • use host names that are consistent between the OOD regex rules and SSGE, so that the noVNC launch does not fail.
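
For the last point, a quick way to check whether the host name reported by a compute node would match a given pattern is sketched below. The helper `host_matches` and the example pattern are hypothetical; the real pattern is whatever the OOD portal configuration uses, and bash's extended regex syntax may differ from the regex flavour used there.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a host name matches an extended regex
# in full (the pattern is anchored at both ends).
host_matches() {
  local host=$1 regex=$2
  local pattern="^(${regex})$"
  [[ $host =~ $pattern ]]
}

# Example pattern, an assumption for illustration only:
#   host_matches "node-001.cluster.org" 'node-[0-9]+\.cluster\.org'   # succeeds
#   host_matches "node-001" 'node-[0-9]+\.cluster\.org'               # fails
# On the node itself, compare what each command reports:
#   host_matches "$(hostname -a)" 'node-[0-9]+\.cluster\.org' || echo "alias does not match"
#   host_matches "$(hostname -f)" 'node-[0-9]+\.cluster\.org' || echo "FQDN does not match"
```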

that’s all

thanks again

jean-marie
