Using PBS, can get shell, can't get jobs

I’ve just installed v1.7.14 on CentOS 8.2.2004. I can log in using SSSD/LDAP, open a shell in the browser, and run an interactive job in that shell. Very pleased.

For whatever reason, though, I can’t get the Active Jobs page to work.

To be fair, the documentation for Add Cluster Config does suggest that "In production you will also want to add a resource manager."

Because of the way our HPC (PBSPro 19.1.3) is set up, regular users don’t have login access to the Resource Manager, only to the login nodes, from which they can submit jobs. But setting job: host: and login: host: (in /etc/ood/config/clusters.d/server.yml) to the same host isn’t working. Users can successfully run qstat and qselect on the login nodes.
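For reference, the relevant portion of /etc/ood/config/clusters.d/server.yml looks roughly like this (the title and exec path are illustrative; the point is that job: host: and login: host: are identical):

```yml
---
v2:
  metadata:
    title: "Server"        # illustrative
  login:
    host: "server.gen"     # the login node users can reach
  job:
    adapter: "pbspro"
    host: "server.gen"     # identical to login: host:
    exec: "/opt/pbs"       # illustrative PBS install prefix
```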

The error I’m seeing in the UI is "Server: Connection refused qselect: cannot connect to server server.gen (errno=111)".

In /var/log/httpd/error.log I’m seeing a lot of this:

[Tue Jun 30 02:16:48.436015 2020] [lua:warn] [pid 2064:tid 139921695598336] AH01471: Lua error: /opt/ood/mod_ood_proxy/lib/logger.lua:22: bad argument #2 to 'date' (number has no integer representation)

And the error I’m seeing in /var/log/ondemand-nginx/user/error.log looks like this:

App 2958 output: [2020-06-30 02:18:42 -0400 ] ERROR "OodCore::JobAdapterError: Connection refused\nqstat: cannot connect to server server.gen (errno=111)\n\n/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.7.14/gems/ood_core-0.11.4/lib/ood_core/job/adapters/pbspro.rb:290:in `rescue in info_all'\n/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.7.14/gems/ood_core-0.11.4/lib/ood_core/job/adapters/pbspro.rb:285:in `info_all'\n/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.7.14/gems/ood_core-0.11.4/lib/ood_core/job/adapter.rb:84:in `info_all_each'\n/var/www/ood/apps/sys/activejobs/app/models/jobs_json_request_handler.rb:46:in `each'\n/var/www/ood/apps/sys/activejobs/app/models/jobs_json_request_handler.rb:44:in `render'\n/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:9:in `index'\n..." (remaining Rails/ActionPack framework frames trimmed)
App 2958 output: [2020-06-30 02:18:42 -0400 ]  INFO "method=GET path=/pun/sys/activejobs/jobs.json format=json controller=JobsController action=index status=200 duration=8.51 view=0.00"

I’m still in the testing phase, building a proof of concept for the team, so I don’t yet have an FQDN nor proper SSL set up.

Any tips would be appreciated.

EDIT: fixed typo, it’s clusters.d in /etc/

Hi and welcome!

You seem to have some issue with your PBS installation on that server. "Server: Connection refused qselect: cannot connect to server server.gen (errno=111)" is the correct error to be concerned with.

In a shell session on that web server, can you run qstat? We use the host: field in the cluster config to determine which host to submit to, so the command we run in turn ends up looking like PBS_DEFAULT=somehost qstat. I’m guessing that if you run the same command manually from a shell on that same server you’ll get the same error.
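As a rough sketch (not the literal adapter code) of what that invocation looks like, with the host name and flags as placeholders:

```shell
# Sketch of how the PBSPro adapter targets a server: it sets PBS_DEFAULT
# from the cluster config's host: field rather than passing a server
# argument. server.gen and the qstat flags below are placeholders.
ood_qstat() {
  local pbs_host=$1
  shift
  # QSTAT override exists only so the sketch can be dry-run with a stub
  PBS_DEFAULT=$pbs_host ${QSTAT:-qstat} "$@"
}

# e.g., from a shell on the web host:
#   ood_qstat server.gen -f -t
```

If that manual run fails with the same errno=111, the problem is between the web host and the PBS server, not OnDemand itself.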

Searching the internet, errno=111 (connection refused) means that qstat can’t connect to the Resource Manager node. My guess would be a port misconfiguration somewhere.

Either way, what you have to triage and get working is running qstat from that server, because that is all OOD does: if you can run it manually, then OOD can run it too. Alternatively, you could use bin_overrides to ssh somewhere else and execute the commands there, but for your users that would require trust between all of these nodes so they don’t have to generate and manage keys.
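For the record, a bin_overrides stanza in the cluster config would look something like this (the wrapper paths are hypothetical):

```yml
v2:
  job:
    adapter: "pbspro"
    host: "server.gen"
    bin_overrides:
      qsub: "/usr/local/sbin/qsub-wrapper"
      qstat: "/usr/local/sbin/qstat-wrapper"
      qselect: "/usr/local/sbin/qselect-wrapper"
```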

This topic on the PBS community seems to indicate that reverse DNS is important (you mention you don’t have an FQDN), though that thread doesn’t appear to have been solved.

Hope that helps!

The OnDemand host has to be able to execute qstat for the queue status page to work. This requires that the PBS client-side command line tools be available on the OnDemand node, and that the OnDemand node be permitted (through network and host firewalls, etc.) to reach the PBS server on TCP port 15001, so that qstat can get the information to display.
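A quick way to test that reachability from the OnDemand node, using only bash’s /dev/tcp (the host and port are whatever your site uses; 15001 is the usual pbs_server port):

```shell
# Returns 0 if a TCP connection to host:port succeeds, nonzero otherwise.
# Uses bash's built-in /dev/tcp pseudo-device, so no nc/nmap is required.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# e.g.: port_open server.gen 15001 && echo reachable || echo "refused/filtered"
```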

You MIGHT be able to kludge around this by using some variant of the following in the cluster config file:

  qdel: "/usr/local/sbin/qdel-hack"
  qstat: "/usr/local/sbin/qstat-hack"
  qsub: "/usr/local/sbin/qsub-hack"
  qrls: "/usr/local/sbin/qrls-hack"
  qhold: "/usr/local/sbin/qhold-hack"
  qselect: "/usr/local/sbin/qselect-hack"

where each of the "xxx-hack" scripts does an ssh to one of your login nodes and executes the required PBS command with the specified arguments. That’s only a "might", though. We added the permits and routes necessary for OOD to reach the PBSPro server node on port 15001.
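One of those "xxx-hack" wrappers might look like the sketch below. It’s written as a shell function for illustration; as an actual script the body would just be the ssh line with "$@". The login node name, the PBS path, and the assumption of key- or host-based ssh trust are all site-specific placeholders.

```shell
# Hypothetical /usr/local/sbin/qstat-hack: forward qstat to a login node.
# Assumes non-interactive ssh trust (BatchMode refuses password prompts).
# login01 and /opt/pbs/bin/qstat are placeholders; the SSH and LOGIN_NODE
# overrides exist only so the sketch can be dry-run with a stub.
qstat_hack() {
  ${SSH:-ssh} -o BatchMode=yes "${LOGIN_NODE:-login01}" /opt/pbs/bin/qstat "$@"
}
```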

The network ports were a good tip, but ultimately the answer lay in the (until this morning unanswered) question: will the OpenPBS 20.0 client work with a PBSPro 19.1.3 scheduler? The answer is no, for those wondering. I’m still getting errors, but I think they’re related to issues compiling PBSPro on CentOS 8.2. qstat now works fine from the web server to the scheduler. EDIT: yes, I’ve now got it working. Thanks.