Hi, @jeff.ohrstrom
Don’t worry about the delay. It’s nice to have your questions and hypotheses in return.
To sum up:
- I have an OOD v1.7.6 config on a test cluster running CentOS 7.7 with Son of SGE 8.1.9 + PAM auth.
- In parallel, I’m preparing OOD 1.7.6 on our real HPC cluster (CentOS 7.7, Son of SGE 8.1.6 + PAM auth). I hope you will deliver the final 1.7 version soon.
On both, I have “OOD_BC_SSH_TO_COMPUTE_NODE=0” set and the “fix_sge_procs.rb” initializer activated in the dashboard initializers.
So, first, the “Request for jobs failed due to body parsing error” issue is closed. It was caused by my test of the “linux_host” feature; removing the cluster.d YAML file stopped this message.
For the second point (the “ActiveJobs” output is not complete), there is no progress for the moment.
Here is the view with our ssge_8.1.6 config:
Here is the view with our ssge_8.1.9 config:
It’s the same thing on both.
Concerning the answer of qstat -r -xml -u $USER,
here is the 8.1.6 answer:
<?xml version='1.0'?>
<job_info xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
<queue_info>
<job_list state="running">
<JB_job_number>261077</JB_job_number>
<JAT_prio>1.54559</JAT_prio>
<JB_name>BASIC</JB_name>
<JB_owner>jms</JB_owner>
r
<JAT_start_time>2020-01-29T15:48:28</JAT_start_time>
<queue_name>int.q@node-036.cluster.org</queue_name>
1
<full_job_name>BASIC</full_job_name>
<hard_request name="h_rt" resource_contribution="0.000000">2592000</hard_request>
<hard_req_queue>int.q</hard_req_queue>
NONE
</job_list>
<job_list state="running">
<JB_job_number>261767</JB_job_number>
<JAT_prio>1.54601</JAT_prio>
<JB_name>FLUENT</JB_name>
<JB_owner>jms</JB_owner>
r
<JAT_start_time>2020-02-06T17:38:13</JAT_start_time>
<queue_name>all.q@node-026.cluster.org</queue_name>
32
<full_job_name>FLUENT</full_job_name>
<requested_pe name="openmpi_exclusif_32">32</requested_pe>
<granted_pe name="openmpi_exclusif_32">32</granted_pe>
<hard_request name="lic_flue_acfd_solver" resource_contribution="0.000000">1</hard_request>
<hard_request name="lic_flue_para_max" resource_contribution="0.000000">32</hard_request>
<hard_request name="mem_free" resource_contribution="0.000000">1G</hard_request>
<hard_request name="mem_dispo" resource_contribution="0.000000">1</hard_request>
<hard_request name="swap_used" resource_contribution="0.000000">1G</hard_request>
<hard_request name="vnode" resource_contribution="0.000000">0</hard_request>
<hard_request name="h_rt" resource_contribution="0.000000">39600</hard_request>
<hard_request name="short_node" resource_contribution="0.000000">0</hard_request>
<hard_request name="urgent" resource_contribution="0.000000">0</hard_request>
<hard_request name="frontal" resource_contribution="0.000000">0</hard_request>
<soft_request name="highspeeddisk">0</soft_request>
<soft_request name="memoire">63</soft_request>
<soft_request name="nv_type">K2200|K2000</soft_request>
<hard_req_queue>all.q@@DELL_32_7</hard_req_queue>
NONE
</job_list>
</queue_info>
<job_info>
</job_info>
</job_info>
And this is the 8.1.9 answer:
<?xml version='1.0'?>
<job_info xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
<queue_info>
<job_list state="running">
<JB_job_number>147</JB_job_number>
<JAT_prio>0.50500</JAT_prio>
<JB_name>BASIC</JB_name>
<JB_owner>jms</JB_owner>
r
<JAT_start_time>2020-02-07T17:16:26</JAT_start_time>
<queue_name>int.q@node-001</queue_name>
1
<full_job_name>BASIC</full_job_name>
<hard_request name="h_rt" resource_contribution="0.000000">2592000</hard_request>
<hard_req_queue>int.q</hard_req_queue>
NONE
</job_list>
<job_list state="running">
<JB_job_number>148</JB_job_number>
<JAT_prio>0.60500</JAT_prio>
<JB_name>PARAVIEW</JB_name>
<JB_owner>jms</JB_owner>
r
<JAT_start_time>2020-02-07T17:17:26</JAT_start_time>
<queue_name>all.q@node-002</queue_name>
8
<full_job_name>PARAVIEW</full_job_name>
<requested_pe name="mpi">8</requested_pe>
<granted_pe name="mpi">8</granted_pe>
<hard_request name="mem_free" resource_contribution="0.000000">1G</hard_request>
<hard_request name="swap_used" resource_contribution="0.000000">1G</hard_request>
<hard_req_queue>all.q@@VM_4</hard_req_queue>
NONE
</job_list>
</queue_info>
<job_info>
</job_info>
</job_info>
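To compare the two answers field by field, a minimal Python sketch like the one below could be used. It is not the parser that OOD itself uses; the function name `summarize_jobs` and the trimmed `SAMPLE` (taken from the 8.1.9 answer, with straight quotes) are only there for illustration, and it reads just the elements that are visible in the output above.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the 8.1.9 answer above (illustration only).
SAMPLE = """<?xml version='1.0'?>
<job_info xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
  <queue_info>
    <job_list state="running">
      <JB_job_number>148</JB_job_number>
      <JB_name>PARAVIEW</JB_name>
      <JB_owner>jms</JB_owner>
      <JAT_start_time>2020-02-07T17:17:26</JAT_start_time>
      <queue_name>all.q@node-002</queue_name>
      <requested_pe name="mpi">8</requested_pe>
      <hard_request name="mem_free" resource_contribution="0.000000">1G</hard_request>
      <hard_request name="swap_used" resource_contribution="0.000000">1G</hard_request>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
"""


def summarize_jobs(xml_text):
    """Return one dict per <job_list> element found in qstat XML output."""
    root = ET.fromstring(xml_text)
    jobs = []
    for job in root.iter("job_list"):
        jobs.append({
            "id": job.findtext("JB_job_number"),
            "name": job.findtext("JB_name"),
            "owner": job.findtext("JB_owner"),
            "state": job.get("state"),
            "queue": job.findtext("queue_name"),
            "start": job.findtext("JAT_start_time"),
            # Map each hard resource request name to its requested value.
            "hard_requests": {
                req.get("name"): req.text for req in job.findall("hard_request")
            },
        })
    return jobs


if __name__ == "__main__":
    for job in summarize_jobs(SAMPLE):
        print(job)
```

Running it on the saved output of each cluster and comparing the printed dictionaries should show whether any field is present on one version and missing on the other.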
And finally, after adding the “e.backtrace” to the sge.rb file:
With libdrmaa on ssge 8.1.9, the extract from ondemand-nginx/jms/error.log:
App 30930 output: [2020-02-07 17:42:58 +0100 ] INFO "method=GET path=/pun/sys/dashboard/apps/show/activejobs format=html controller=AppsController action=show status=302 duration=3428.26 view=0.00 location=https://caravanshow.ddns.net/pun/sys/activejobs"
App 33512 output: Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/activejobs/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/activejobs/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
App 33512 output: [2020-02-07 17:43:00 +0100 ] INFO "method=GET path=/pun/sys/activejobs/ format=html controller=JobsController action=index status=200 duration=63.87 view=63.01"
App 33512 output: [2020-02-07 17:43:01 +0100 ] INFO "method=GET path=/pun/sys/activejobs/jobs.json format=json controller=JobsController action=index status=200 duration=177.09 view=0.00"
Without libdrmaa on ssge 8.1.9, the extract from ondemand-nginx/jms/error.log:
App 27700 output: [2020-02-07 17:37:51 +0100 ] INFO "method=GET path=/pun/sys/dashboard/apps/show/activejobs format=html controller=AppsController action=show status=302 duration=3279.10 view=0.00 location=https://caravanshow.ddns.net/pun/sys/activejobs"
App 30263 output: Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/activejobs/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/activejobs/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
App 30263 output: [2020-02-07 17:37:53 +0100 ] INFO "method=GET path=/pun/sys/activejobs/ format=html controller=JobsController action=index status=200 duration=40.29 view=39.55"
App 30263 output: [2020-02-07 17:37:54 +0100 ] INFO "method=GET path=/pun/sys/activejobs/jobs.json format=json controller=JobsController action=index status=200 duration=107.63 view=0.00"
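To check quickly what each request returned in these extracts, the key=value part of a line can be pulled apart with a small Python sketch such as the one below. The function name `parse_request_line` is made up for this example, and it assumes the request summary is the last double-quoted string on the line (with straight quotes), as in the lines above.

```python
import re

# One request line copied from the extract above (straight quotes).
LINE = (
    'App 30263 output: [2020-02-07 17:37:54 +0100 ] INFO '
    '"method=GET path=/pun/sys/activejobs/jobs.json format=json '
    'controller=JobsController action=index status=200 duration=107.63 view=0.00"'
)


def parse_request_line(line):
    """Return the key=value fields of a request log line, or None if there are none."""
    match = re.search(r'"([^"]*)"\s*$', line)
    if match is None:
        return None
    # Split on whitespace, then on the first "=" only, so values may contain "=".
    return dict(
        pair.split("=", 1) for pair in match.group(1).split() if "=" in pair
    )


if __name__ == "__main__":
    fields = parse_request_line(LINE)
    print(fields["path"], fields["status"], fields["duration"])
```

Lines without a quoted summary, such as the “Rails Error: Unable to access log file” message, return None and can simply be skipped.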
That’s all from me for the moment. I hope it’s helpful for you.
jean-marie