SGE integration

Trying a test install of OnDemand 1.4 on VM CentOS 7 server. I have it running:

  1. Can log in
  2. Can view my home directory, edit files, create directories, and delete files
  3. Can start an ssh session into our HPC cluster head node
  4. Can submit a job to the cluster

But so far I cannot get OnDemand to list running jobs on the HPC cluster, nor show user-specific jobs on the cluster.
This is my cluster config file. I'm not finding anything in the logs to help. I can log in to the OnDemand server and run qstat from the CLI with no issues.

title: "CoS HPC"
host: ""
adapter: "sge"
cluster: "CoS Cluster"
bin: "/cm/shared/apps/sge/2011.11p1/bin/linux-x64"
conf: "/cm/shared/apps/sge/2011.11p1/"
sge_root: "/cm/shared/apps/sge/2011.11p1"
libdrmaa_path: "/cm/shared/apps/sge/2011.11p1/lib/linux-x64/"

@cj.keist after the release I realized that not all SGE installations treat qstat the same way. The system I originally developed for had a qstat that worked like qstat -u '*', others work more like qstat -u $USER. There’s a fix for that behavior in the pipeline. Does your OOD user have any jobs running?
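The difference can be sketched as follows; `qstat_args` is a hypothetical helper for illustration, not actual ood_core code:

```shell
# Hypothetical helper showing the two qstat invocations an adapter
# has to choose between. An empty/missing argument means "all users";
# SGE's qstat accepts the wildcard '*' for that, while a per-user
# listing passes a username.
qstat_args() {
  local owner="${1:-*}"
  printf 'qstat -u %s' "$owner"
}

qstat_args            # → qstat -u *
qstat_args "$USER"    # per-user listing for whoever is logged in
```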

Just to verify, do you know if the rpm you have installed is ondemand-1.4.10-1.el7.x86_64.rpm or ondemand-1.4.9-1.el7.x86_64.rpm? I do know that ondemand-1.4.10-1.el7.x86_64.rpm has a bugfix for SGE, though it isn’t the one that @rodgers.355 mentioned.

The latest released version of ood_core fixes a crash that occurs when libdrmaa is used and the job has left the queue. That mostly impacts the Job Composer; users may notice that jobs never “complete” even though they won’t appear in the queue anymore.

The qstat commands you listed do work on our SGE. Would you like output from them?

The OnDemand package is: ondemand-1.4.10-1.el7.x86_64

Just adding that the job list is showing jobs for the user logged in, just not showing all jobs running on the cluster right now.

FWIW, our setup does the same thing. You can view your own jobs, but not all jobs on the Grid. We are running the latest Son of Grid Engine.

@cj.keist and @deej that’s what I was talking about in terms of the bug / behavior that I changed for the unreleased version of the adapter. The new behavior always calls qstat -u; the only difference for Active Jobs will be whether it uses $USER or '*'.

If either of you is interested in testing the unreleased version, we can talk through a few options for doing that.

I’d be happy to help with testing.

@deej thanks for agreeing to help test.

The library that defines the SGE adapter is ood_core. In order to run the version of the library with the latest SGE fixes, you should add the following line to each Gemfile rooted in /var/www/ood/apps/sys/(myjobs|dashboard|file-editor|activejobs):

gem "ood_core", :git => "", :ref => "878153a"

After adding the line to each file, a sudoer will need to run the command RAILS_ENV=production scl enable rh-git29 rh-ruby24 rh-nodejs6 -- bin/setup, which will update the application.
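For reference, the per-app Gemfile edit could be scripted roughly as below. This is a sketch, not an official procedure: it runs against a throwaway directory rather than the live apps under /var/www/ood/apps/sys, and the :git URL is left blank as it is in this thread, so substitute the real repository URL before using it.

```shell
# Rough sketch of appending the ood_core pin to each app's Gemfile.
# Demonstrated against a temporary directory so it can be tried safely;
# point APPS_ROOT at /var/www/ood/apps/sys on a real install.
APPS_ROOT=$(mktemp -d)
for app in myjobs dashboard file-editor activejobs; do
  mkdir -p "$APPS_ROOT/$app"
  # The :git URL is elided here, as in the thread; fill in yours.
  echo 'gem "ood_core", :git => "", :ref => "878153a"' >> "$APPS_ROOT/$app/Gemfile"
done

# Each of the four Gemfiles now contains the pinned ref:
grep -c '878153a' "$APPS_ROOT/dashboard/Gemfile"   # → 1
```

On the real install you would then run the scl ... bin/setup command above in each app directory as a sudoer.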

This will pin each application's version of ood_core to commit 878153a, which is currently the HEAD of the job_array branch in the library’s history. When you are done testing or want to upgrade, you should remove that line from the Gemfile.

Then restart your PUN and you will be using the newer version of the SGE adapter.

Thank you for the patch. I added the gem line to each Gemfile in myjobs, dashboard, file-editor, and activejobs. I then ran the scl command in each folder (I had to comment out the existing ood_core line in the dashboard’s Gemfile) and then restarted my server. So far it looks to be working!!
Will do some more testing.

Thank you! I can also confirm that the patch works, and we can now switch the view between just the person’s jobs and all jobs, and all jobs are shown.

I do notice one slight oddity. On this Grid we only have one queue defined, “all.q”. Some of the jobs correctly show the “all.q” queue, while most simply show “null” as a value for the queue. It doesn’t seem to affect anything but I thought you might want to know about it.

I think I’ve seen that in testing as well. At a guess, what is happening is that only jobs that explicitly set a target queue report their queue, while the others show the default value, which is the not-very-useful null.
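That guess can be sketched like this; `display_queue` is hypothetical and not the adapter’s actual code:

```shell
# Hypothetical rendering helper: when a job never requested a queue
# (no -q at submit time), there is no value to display, and a naive
# fallback prints the literal string "null".
display_queue() {
  printf '%s' "${1:-null}"
}

display_queue             # → null
display_queue "all.q"     # → all.q
```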

That is possible based on what I’m seeing. I’ll do some additional testing to confirm.

That is exactly the case. Two jobs submitted as:
qsub testme
qsub -q all.q testme

show up as “null” and “all.q” respectively.