Remote desktop not reflecting correct number of cores

When launching a RemoteDesktop we provide an option to select the number of cores/slots, but the number selected is not reflected in the “cores” count presented on the “pun/sys/dashboard/batch_connect/sessions” page. Instead, the page always indicates only “1” core is queued/running.

I updated the “/etc/ood/config/apps/bc_desktop/submit/rd_submit.yml.erb” file to use native SGE slot assignment:

$ cat rd_submit.yml.erb

batch_connect:
  template: "vnc"
script:
  job_name: "RemoteDesktop"
  native: ["-pe", "slots", "<%= bc_num_slots %>"]

With the above change, the actual remote desktop session does get the requested number of slots assigned to it. So the job submission is working correctly; it’s just not reflected properly in the web interface, which is confusing our users.

$ qstat
job-ID  prior    name        user     state  submit/start at      queue      slots  ja-task-ID
942184  0.60500  RemoteDesk  smatott  r      12/16/2019 10:44:33  all.q@n40  16

Can anyone suggest a way to get the web interface to report the correct number of cores?

Thanks,

— Shawn

Sorry, the first answer I’d given was incorrect because I’d misread your question. I think @rodgers.355 has some ideas on how it could be done.

Hi there, original author of the Grid Engine adapter here. The short answer is that I don’t think you are doing anything wrong; you’ve just encountered a feature that the Grid Engine adapter doesn’t support.

OnDemand expects to get a list of the nodes allocated to a particular job. None of the qstat options I found gave me all the information I needed in a single command*, so I didn’t implement that feature, and the UI defaults to showing only a single allocated node.

As an alternative, we could parse out the number of slots and use it to create a list of nodes with simple numbered names (unknown-0, unknown-1, …), which would improve the experience somewhat; there’s a rough sketch of the idea after the footnote below. I’ll share a custom initializer that does this in this thread in a day or two. If the experience is enough of an improvement we can merge it into a future release of OnDemand.

* For the sake of performance we typically want to minimize the number of commands necessary to get information back from the scheduler. That said, I’m open to suggestions on better ways to do things with Grid Engine.
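
To make the idea concrete, here is a minimal sketch (not anything that ships in OnDemand today) of turning a parsed slot count into placeholder node entries the UI could count. The slots variable and the hash shape are assumptions for illustration only.

# Sketch only: fabricate placeholder "nodes" from a slot count parsed out of
# qstat, so the session card has something to count instead of defaulting to 1.
slots = 16  # would come from the parsed JB_pe_range value
allocated_nodes = (0...slots).map { |i| { name: "unknown-#{i}", procs: 1 } }
# => [{name: "unknown-0", procs: 1}, {name: "unknown-1", procs: 1}, ...]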

Here’s a raw paste of the contents of the file. I think the formatting is OK. As I mentioned, the “native” option does get passed along to the scheduler properly; it’s just not reflected in the OOD interface.

Thanks,

— Shawn

Thanks for the insight and I’d be happy to try out a patch if you can come up with one that works for SGE.

UI defaults to showing only a single allocated node. <<<

It’s a little confusing for the interface to read “cores” if it actually means “nodes”… Is there a calculation being done to convert from node count to core count? I’ve attached the specific screen I’m talking about, in case there’s some confusion there.

— Shawn

image001.jpg

When rendering the HTML template, if part of the context passed to the rendering method is null, then certain components will not be shown. In this case, because we don’t have a list of nodes, the node-count pill isn’t shown. Here’s a screenshot I took this morning at OSC (where we run Torque/Moab) where the node section of the UI is rendered:

Shawn,

The production Grid Engine cluster (IDRE at UCLA) I usually use for this work is in a downtime today. Could you please post the output of qstat -r -xml -j $JOB_ID for a job with 1 slot and another with multiple slots?

Thanks.

Here’s output for a job that has 16 slots:

<?xml version='1.0'?>

<detailed_job_info xmlns:xsd="http://gridscheduler.svn.sourceforge.net/viewvc/gridscheduler/trunk/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=11">

<djob_info>

<JB_job_number>942195</JB_job_number>

<JB_ar>0</JB_ar>

<JB_exec_file>job_scripts/942195</JB_exec_file>

<JB_submission_time>1576525304</JB_submission_time>

<JB_owner>smatott</JB_owner>

<JB_uid>1105</JB_uid>

<JB_group>packages</JB_group>

<JB_gid>3002</JB_gid>

<JB_account>sge</JB_account>

<JB_merge_stderr>false</JB_merge_stderr>

<JB_mail_list>

<MR_user>smatott</MR_user>

<MR_host>u1</MR_host>

</JB_mail_list>

<JB_notify>false</JB_notify>

<JB_job_name>RemoteDesktop</JB_job_name>

<JB_stdout_path_list>

<path_list>

<PN_path>/mnt/lustre/users/smatott/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/hpc/output/9beca517-07fa-4379-8eb7-934460846a19/output.log</PN_path>

<PN_host></PN_host>

<PN_file_host></PN_file_host>

<PN_file_staging>false</PN_file_staging>

</path_list>

</JB_stdout_path_list>

<JB_jobshare>0</JB_jobshare>

<JB_hard_resource_list>

<qstat_l_requests>

<CE_name>h_rt</CE_name>

<CE_valtype>3</CE_valtype>

<CE_stringval>518400</CE_stringval>

<CE_doubleval>518400.000000</CE_doubleval>

<CE_relop>0</CE_relop>

<CE_consumable>0</CE_consumable>

<CE_dominant>0</CE_dominant>

<CE_pj_doubleval>0.000000</CE_pj_doubleval>

<CE_pj_dominant>0</CE_pj_dominant>

<CE_requestable>0</CE_requestable>

<CE_tagged>0</CE_tagged>

</qstat_l_requests>

</JB_hard_resource_list>

<JB_hard_queue_list>

<destin_ident_list>

<QR_name>all.q</QR_name>

</destin_ident_list>

</JB_hard_queue_list>

<JB_env_list>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_HOME</VA_variable>

<VA_value>/mnt/lustre/users/smatott</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_LOGNAME</VA_variable>

<VA_value>root</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_PATH</VA_variable>

<VA_value>/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/bin:/opt/rh/rh-nodejs6/root/usr/bin:/opt/rh/rh-ruby24/root/usr/local/bin:/opt/rh/rh-ruby24/root/usr/bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/sbin:/usr/bin</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_SHELL</VA_variable>

<VA_value>/bin/bash</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_MAIL</VA_variable>

<VA_value>/var/mail/root</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_HOST</VA_variable>

<VA_value>u1</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_WORKDIR</VA_variable>

<VA_value>/var/www/ood/apps/sys/dashboard</VA_value>

</job_sublist>

</JB_env_list>

<JB_script_file>STDIN</JB_script_file>

<JB_ja_tasks>

<ulong_sublist>

<JAT_status>128</JAT_status>

<JAT_task_number>1</JAT_task_number>

<JAT_scaled_usage_list>

<UA_name>cpu</UA_name>

<UA_value>22.540000</UA_value>

<UA_name>mem</UA_name>

<UA_value>10.376430</UA_value>

<UA_name>io</UA_name>

<UA_value>0.017148</UA_value>

<UA_name>iow</UA_name>

<UA_value>0.000000</UA_value>

<UA_name>vmem</UA_name>

<UA_value>5044887552.000000</UA_value>

<UA_name>maxvmem</UA_name>

<UA_value>5298225152.000000</UA_value>

</JAT_scaled_usage_list>

</ulong_sublist>

</JB_ja_tasks>

<JB_cwd>/mnt/lustre/users/smatott/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/hpc/output/9beca517-07fa-4379-8eb7-934460846a19</JB_cwd>

<JB_deadline>0</JB_deadline>

<JB_execution_time>0</JB_execution_time>

<JB_checkpoint_attr>0</JB_checkpoint_attr>

<JB_checkpoint_interval>0</JB_checkpoint_interval>

<JB_reserve>false</JB_reserve>

<JB_mail_options>0</JB_mail_options>

<JB_priority>1024</JB_priority>

<JB_restart>0</JB_restart>

<JB_verify>0</JB_verify>

<JB_script_size>0</JB_script_size>

<JB_pe>slots</JB_pe>

<JB_pe_range>

<RN_min>16</RN_min>

<RN_max>16</RN_max>

<RN_step>1</RN_step>

</JB_pe_range>

<JB_verify_suitable_queues>0</JB_verify_suitable_queues>

<JB_soft_wallclock_gmt>0</JB_soft_wallclock_gmt>

<JB_hard_wallclock_gmt>0</JB_hard_wallclock_gmt>

<JB_override_tickets>0</JB_override_tickets>

<JB_version>0</JB_version>

<JB_ja_structure>

<task_id_range>

<RN_min>1</RN_min>

<RN_max>1</RN_max>

<RN_step>1</RN_step>

</task_id_range>

</JB_ja_structure>

<JB_type>0</JB_type>

</djob_info>

<SME_global_message_list>

<MES_message_number>83</MES_message_number>

<MES_message>(Collecting of scheduler job information is turned off)</MES_message>

</SME_global_message_list>

</detailed_job_info>

Here’s the output for a job that has just one slot:

<?xml version='1.0'?>

<detailed_job_info xmlns:xsd="http://gridscheduler.svn.sourceforge.net/viewvc/gridscheduler/trunk/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=11">

<djob_info>

<JB_job_number>942282</JB_job_number>

<JB_ar>0</JB_ar>

<JB_exec_file>job_scripts/942282</JB_exec_file>

<JB_submission_time>1576610203</JB_submission_time>

<JB_owner>smatott</JB_owner>

<JB_uid>1105</JB_uid>

<JB_group>packages</JB_group>

<JB_gid>3002</JB_gid>

<JB_account>sge</JB_account>

<JB_merge_stderr>false</JB_merge_stderr>

<JB_mail_list>

<MR_user>smatott</MR_user>

<MR_host>u1</MR_host>

</JB_mail_list>

<JB_notify>false</JB_notify>

<JB_job_name>RemoteDesktop</JB_job_name>

<JB_stdout_path_list>

<path_list>

<PN_path>/mnt/lustre/users/smatott/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/hpc/output/d0c0a8d5-8a63-48f6-97fe-f731c97d8f2b/output.log</PN_path>

<PN_host></PN_host>

<PN_file_host></PN_file_host>

<PN_file_staging>false</PN_file_staging>

</path_list>

</JB_stdout_path_list>

<JB_jobshare>0</JB_jobshare>

<JB_hard_resource_list>

<qstat_l_requests>

<CE_name>h_rt</CE_name>

<CE_valtype>3</CE_valtype>

<CE_stringval>518400</CE_stringval>

<CE_doubleval>518400.000000</CE_doubleval>

<CE_relop>0</CE_relop>

<CE_consumable>0</CE_consumable>

<CE_dominant>0</CE_dominant>

<CE_pj_doubleval>0.000000</CE_pj_doubleval>

<CE_pj_dominant>0</CE_pj_dominant>

<CE_requestable>0</CE_requestable>

<CE_tagged>0</CE_tagged>

</qstat_l_requests>

</JB_hard_resource_list>

<JB_hard_queue_list>

<destin_ident_list>

<QR_name>all.q</QR_name>

</destin_ident_list>

</JB_hard_queue_list>

<JB_env_list>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_HOME</VA_variable>

<VA_value>/mnt/lustre/users/smatott</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_LOGNAME</VA_variable>

<VA_value>root</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_PATH</VA_variable>

<VA_value>/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/bin:/opt/rh/rh-nodejs6/root/usr/bin:/opt/rh/rh-ruby24/root/usr/local/bin:/opt/rh/rh-ruby24/root/usr/bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/sbin:/usr/bin</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_SHELL</VA_variable>

<VA_value>/bin/bash</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_MAIL</VA_variable>

<VA_value>/var/mail/root</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_HOST</VA_variable>

<VA_value>u1</VA_value>

</job_sublist>

<job_sublist>

<VA_variable>__SGE_PREFIX__O_WORKDIR</VA_variable>

<VA_value>/var/www/ood/apps/sys/dashboard</VA_value>

</job_sublist>

</JB_env_list>

<JB_script_file>STDIN</JB_script_file>

<JB_ja_tasks>

<ulong_sublist>

<JAT_status>128</JAT_status>

<JAT_task_number>1</JAT_task_number>

<JAT_scaled_usage_list>

<UA_name>cpu</UA_name>

<UA_value>2.340000</UA_value>

<UA_name>mem</UA_name>

<UA_value>1.006822</UA_value>

<UA_name>io</UA_name>

<UA_value>0.014208</UA_value>

<UA_name>iow</UA_name>

<UA_value>0.000000</UA_value>

<UA_name>vmem</UA_name>

<UA_value>5176115200.000000</UA_value>

<UA_name>maxvmem</UA_name>

<UA_value>5176115200.000000</UA_value>

</JAT_scaled_usage_list>

</ulong_sublist>

</JB_ja_tasks>

<JB_cwd>/mnt/lustre/users/smatott/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/hpc/output/d0c0a8d5-8a63-48f6-97fe-f731c97d8f2b</JB_cwd>

<JB_deadline>0</JB_deadline>

<JB_execution_time>0</JB_execution_time>

<JB_checkpoint_attr>0</JB_checkpoint_attr>

<JB_checkpoint_interval>0</JB_checkpoint_interval>

<JB_reserve>false</JB_reserve>

<JB_mail_options>0</JB_mail_options>

<JB_priority>1024</JB_priority>

<JB_restart>0</JB_restart>

<JB_verify>0</JB_verify>

<JB_script_size>0</JB_script_size>

<JB_pe>slots</JB_pe>

<JB_pe_range>

<RN_min>1</RN_min>

<RN_max>1</RN_max>

<RN_step>1</RN_step>

</JB_pe_range>

<JB_verify_suitable_queues>0</JB_verify_suitable_queues>

<JB_soft_wallclock_gmt>0</JB_soft_wallclock_gmt>

<JB_hard_wallclock_gmt>0</JB_hard_wallclock_gmt>

<JB_override_tickets>0</JB_override_tickets>

<JB_version>0</JB_version>

<JB_ja_structure>

<task_id_range>

<RN_min>1</RN_min>

<RN_max>1</RN_max>

<RN_step>1</RN_step>

</task_id_range>

</JB_ja_structure>

<JB_type>0</JB_type>

</djob_info>

<SME_global_message_list>

<MES_message_number>83</MES_message_number>

<MES_message>(Collecting of scheduler job information is turned off)</MES_message>

</SME_global_message_list>

</detailed_job_info>

Thanks for the examples.

So JB_pe_range.RN_{min,max} lists the slots? If so, are min and max ever different for a running job?

If so, are min and max ever different for a running job? <<<

Not that I’m aware of.

Thanks again for the examples. When I was originally writing the adapter I missed the meaning of those XML nodes. If a sudoer creates the file /etc/ood/config/apps/dashboard/initializers/fix_sge_procs.rb with the following content, then the 1-core issue will be fixed:

require "ood_core"
require "ood_core/job/adapters/sge/qstat_xml_j_r_listener"

# Patches the QStat output parser to correctly detect the number of
# slots/cores
class QstatXmlJRListener
  def initialize
    @parsed_job = {
      :tasks => [],
      :status => :queued,
      :procs => 1,
      :native => {}  # TODO: improve native attribute reporting
    }
    @current_text = nil
    @current_request = nil

    @processing_job_array_spec = false
    @adding_slots = false

    @job_array_spec = {
      start: nil,
      stop: nil,
      step: 1,  # Step can have a default of 1
    }
    @running_tasks = []
  end

  def tag_start(name, attrs)
    case name
    when 'task_id_range'
      toggle_processing_array_spec
    when 'JB_pe_range'
      toggle_adding_slots
    end
  end

  def tag_end(name)
    case name
    when 'JB_ja_tasks'
      end_JB_ja_tasks
    when 'JB_job_number'
      end_JB_job_number
    when 'JB_job_name'
      end_JB_job_name
    when 'JB_owner'
      end_JB_owner
    when 'JB_project'
      end_JB_project
    when 'JB_submission_time'
      end_JB_submission_time
    when 'hard_request'
      end_hard_request
    when 'JAT_start_time'
      end_JAT_start_time
    when 'CE_name'
      end_CE_name
    when 'CE_stringval'
      end_CE_stringval
    when 'QR_name'
      end_QR_name
    when 'JAT_task_number'
      end_JAT_task_number
    when 'djob_info'
      finalize_parsed_job
    when 'RN_min'
      set_job_array_piece(:start) if @processing_job_array_spec
      set_slots if @adding_slots
    when 'RN_max'
      set_job_array_piece(:stop) if @processing_job_array_spec
    when 'RN_step'
      set_job_array_piece(:step) if @processing_job_array_spec
    when 'task_id_range'
      toggle_processing_array_spec
    when 'JB_pe_range'
      toggle_adding_slots
    end
  end

  def toggle_adding_slots
    @adding_slots = ! @adding_slots
  end

  # Within JB_pe_range, RN_min holds the requested slot count; report it as
  # the job's proc count so the dashboard shows the right number of cores.
  def set_slots
    @parsed_job[:procs] = @current_text.to_i
  end
end
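
If you want to sanity-check the patch outside of OnDemand, here is a rough standalone script. It’s only a sketch under a few assumptions: the initializer above is saved at the path mentioned earlier, the ood_core gem is resolvable when you run it (e.g. use the dashboard app’s Ruby/bundler environment), and the stock listener exposes parsed_job and a text callback as in the ood_core source. The file name check_procs.rb is just a placeholder.

# check_procs.rb -- hypothetical sanity check for the patched parser.
# Run it where the ood_core gem resolves (e.g. under the dashboard's bundle),
# and feed it the JB_pe_range fragment from the 16-slot job's qstat output.
require "rexml/parsers/streamparser"
require "/etc/ood/config/apps/dashboard/initializers/fix_sge_procs"

xml = "<JB_pe_range><RN_min>16</RN_min><RN_max>16</RN_max><RN_step>1</RN_step></JB_pe_range>"

listener = QstatXmlJRListener.new
REXML::Parsers::StreamParser.new(xml, listener).parse

puts listener.parsed_job[:procs]  # should print 16 with the patch applied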

I’ll also start the process of getting this included in the next release of OnDemand so the patch won’t be necessary in the future.

Hi Morgan,

Thanks for the fix — I can confirm that it works as advertised! See attached screen grab.

— Shawn

Hi everybody @OSC,

I wanted to say I was just working through this topic, and the fix works for my config: CentOS 7 + NIS + SSGE 8.1.9 + OOD 1.6.20.

Thanks a lot.

jean-marie