mnakao
(Masahiro Nakao)
June 24, 2022, 12:20pm
1
Hello all,
I’m trying to install OOD on the Japanese supercomputer Fugaku.
Fugaku uses a special resource manager developed by Fujitsu Limited.
Its syntax is similar to that of existing resource managers such as Slurm.
Could you please tell me how to add a new resource manager?
I think there are many people in the same situation, so it would be even better if there were documentation for this.
I believe the resource manager adapters are stored in the directory
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/ood_core-0.19.0/lib/ood_core/job/adapters/.
I think I need to create a new adapter file in that directory, but that does not seem to be enough on its own.
Thanks,
Very cool! I’ve heard a lot about that system.
Can you elaborate on this? I believe the only requirements are that the name you’re trying to use matches the name of the actual .rb file, and that the file uses the factory pattern, because that’s how the adapter gets instantiated.
Adapters are loaded dynamically, based on the name, through a factory:
path_to_adapter = "ood_core/job/adapters/#{adapter}"
begin
  require path_to_adapter
rescue Gem::LoadError => e
  raise Gem::LoadError, "Specified '#{adapter}' for job adapter, but the gem is not loaded."
rescue LoadError => e
  raise LoadError, "Could not load '#{adapter}'. Make sure that the job adapter in the configuration file is valid."
end
adapter_method = "build_#{adapter}"
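To make that concrete, a new adapter file needs a build_<name> factory method plus an Adapter subclass. Here is a minimal, hypothetical sketch (the fugaku name, the bin option and the class body are illustrative only; see an existing adapter such as slurm.rb for the methods a real adapter implements, e.g. submit, info, status and delete):

# lib/ood_core/job/adapters/fugaku.rb -- hypothetical skeleton, not a real adapter
module OodCore
  module Job
    class Factory
      # Must be named build_<adapter> so the dynamic "build_#{adapter}" call above finds it.
      def self.build_fugaku(config)
        c = config.to_h
        Adapters::Fugaku.new(bin: c[:bin] || c["bin"])
      end
    end

    module Adapters
      # Concrete adapters subclass OodCore::Job::Adapter and override methods
      # such as submit, info, status and delete to talk to the scheduler.
      class Fugaku < Adapter
        def initialize(bin: nil)
          @bin = bin
        end
      end
    end
  end
end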
Here’s a recent example of an additional adapter. Note that nothing else in the codebase references, that is, requires, these additional files; they are only loaded dynamically as described above.
OSC:master ← plazonic:master, opened 06 Feb 2022 (UTC)
This is a job adapter that uses systemd, in particular systemd-run and systemctl show. It is heavily based on the linux_host adapter with minimal changes.
Its configuration is even simpler than linux_host, e.g.:
v2:
  metadata:
    title: My Login node
  login:
    host: login1
  job:
    cluster: login1
    bin: "/usr/bin"
    adapter: linux_systemd
    submit_host: login1
    ssh_hosts:
      - login1
    site_timeout: 2678400
    strict_host_checking: false
As far as how it works: systemd-run is used to create a transient systemd user unit. This bit from linux_systemd/templates/script_wrapper.erb illustrates well what can be set:
systemd-run --user -r --no-block --unit=<%= session_name %> -p RuntimeMaxSec=<%= script_timeout %> \
-p ExecStartPre="$systemd_service_tmp_file_pre" -p ExecStartPost="$systemd_service_tmp_file_post" \
-p StandardOutput="file:<%= output_path %>" -p StandardError="file:<%= error_path %>" \
-p Description="<%= job_name %>" "$systemd_service_tmp_file"
This creates a user unit called session_name (starting with "ondemand-"). Note that systemd-run takes care of much of the required functionality - timeout, pre/post scripts (for emails), description ... Programs started under such units live in their own systemd slice/cgroup and therefore can be managed/stopped/limited.
To check on a "job" status, the output of "systemctl --user show -t service --state=running ondemand-*" is parsed.
One caution - some of the systemd features might require a new enough systemd version - in particular StandardOutput=file:/..... This was tested to work on RHEL8 with desktop app.
This code clearly requires more work - in particular it is missing testing/specs entirely (sorry, don't know how :( ) but maybe you and others could find it useful as is. More testing and maybe more scripts for reacting to job failures and sending emails are needed. Per job limits (memory/cpu) could be added fairly easily - we didn't need them for our desktop app - we are relying on systemd user cgroup based limits.
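The same mechanism can be exercised by hand outside the adapter; a minimal sketch (the unit name, timeout and command below are placeholders):

# Start a transient user unit, roughly what the adapter's script_wrapper.erb does.
systemd-run --user --remain-after-exit --no-block \
  --unit=ondemand-example \
  -p RuntimeMaxSec=3600 \
  -p Description="example job" \
  /bin/sh -c 'sleep 60'

# Check on running "jobs" the same way the adapter does.
systemctl --user show -t service --state=running 'ondemand-*'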
Also, as we’re finding out with Microsoft and an Azure-specific adapter, staying out of the main project may get tough for you. That is, you’ll always have to install and carry all sorts of patches.
We’re happy to pull a new adapter in, even if it’s only for a subset of users. Having it in the upstream project could make your life a lot easier, and that’s half of my job.
So, in sum: if you get this going and want to use it, we’re happy to include it in the distribution to make things easier for you.
mnakao
(Masahiro Nakao)
June 25, 2022, 6:13am
4
Thank you for your help and quick reply.
My apologies: the issue I’m having isn’t actually about adding a new resource manager adapter; it seems to be a cluster configuration file problem.
First, let me explain the Fugaku system. Fugaku consists of compute nodes and pre/post nodes. The compute nodes use a special resource manager, while the pre/post nodes use Slurm. I have confirmed that OOD works for the pre/post nodes.
When I copy the cluster configuration file for the pre/post nodes to one for the compute nodes (cd /etc/ood/config/clusters.d; cp pre-post.yml fugaku.yml) and launch the corresponding Interactive App, the following error occurs. Of course, I have not edited fugaku.yml at this point.
#<ArgumentError: missing keywords: :id, :status>
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/ood_core-0.19.0/lib/ood_core/job/info.rb:89:in `initialize'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/ood_core-0.19.0/lib/ood_core/job/adapters/slurm.rb:670:in `new'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/ood_core-0.19.0/lib/ood_core/job/adapters/slurm.rb:670:in `handle_job_array'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/ood_core-0.19.0/lib/ood_core/job/adapters/slurm.rb:481:in `info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:349:in `update_info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:340:in `info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:334:in `status'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:407:in `completed?'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:118:in `block in all'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:117:in `map'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:117:in `all'
/var/www/ood/apps/sys/dashboard/app/controllers/batch_connect/sessions_controller.rb:7:in `index'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/abstract_controller/base.rb:194:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/rendering.rb:30:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/abstract_controller/callbacks.rb:42:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/callbacks.rb:132:in `run_callbacks'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/abstract_controller/callbacks.rb:41:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/rescue.rb:22:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/notifications.rb:168:in `block in instrument'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/notifications/instrumenter.rb:23:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/notifications.rb:168:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/instrumentation.rb:32:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal/params_wrapper.rb:256:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/abstract_controller/base.rb:134:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionview-5.2.8/lib/action_view/rendering.rb:32:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal.rb:191:in `dispatch'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_controller/metal.rb:252:in `dispatch'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/routing/route_set.rb:52:in `dispatch'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/routing/route_set.rb:34:in `serve'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/journey/router.rb:52:in `block in serve'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/journey/router.rb:35:in `each'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/journey/router.rb:35:in `serve'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/routing/route_set.rb:840:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/tempfile_reaper.rb:15:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/etag.rb:27:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/conditional_get.rb:27:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/head.rb:12:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/http/content_security_policy.rb:18:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/session/abstract/id.rb:266:in `context'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/session/abstract/id.rb:260:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/cookies.rb:670:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/callbacks.rb:28:in `block in call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/callbacks.rb:98:in `run_callbacks'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/callbacks.rb:26:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/debug_exceptions.rb:61:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/lograge-0.12.0/lib/lograge/rails_ext/rack/logger.rb:18:in `call_app'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/railties-5.2.8/lib/rails/rack/logger.rb:26:in `block in call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/tagged_logging.rb:71:in `block in tagged'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/tagged_logging.rb:28:in `tagged'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/tagged_logging.rb:71:in `tagged'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/railties-5.2.8/lib/rails/rack/logger.rb:26:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/remote_ip.rb:81:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/request_store-1.5.1/lib/request_store/middleware.rb:19:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/request_id.rb:27:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/method_override.rb:24:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/runtime.rb:22:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/activesupport-5.2.8/lib/active_support/cache/strategy/local_cache_middleware.rb:29:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/actionpack-5.2.8/lib/action_dispatch/middleware/executor.rb:14:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/rack-2.2.3.1/lib/rack/sendfile.rb:110:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.26/gems/railties-5.2.8/lib/rails/engine.rb:524:in `call'
/opt/rh/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:107:in `process_request'
/opt/rh/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:149:in `accept_and_process_next_request'
/opt/rh/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in `main_loop'
/opt/rh/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler.rb:419:in `block (3 levels) in start_threads'
/opt/rh/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'
The contents of pre-post.yml are as follows.
---
v2:
  metadata:
    title: "Pre/Post"
  login:
    host: "ondemand-test.fugaku.r-ccs.riken.jp"
    default: true
  job:
    adapter: "slurm"
    bin: "/usr/bin/"
    conf: "/etc/slurm/slurm.conf"
Is the procedure correct when using multiple job schedulers?
Best,
You can search/grep /var/log/ondemand-nginx/$USER/error.log for execve to find the actual squeue command you’re issuing. I believe it’s very long, requesting a lot of fields and specifying the field separator. I don’t know what sort of compatibility there is between squeue and Fugaku’s scheduler, but you may have to play with it by hand to see what’s going on.
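For example, something like this pulls out the last few scheduler command lines OnDemand executed for your user:

# show the most recent scheduler commands logged for this user
grep execve "/var/log/ondemand-nginx/$USER/error.log" | tail -n 5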
Here’s an example of a command I just pulled from my system. Obviously we’re using a Unicode separator, not just any character.
App 46850 output: [2022-06-27 09:23:56 -0400 ] INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"11804162\", \"-M\", \"pitzer\"]"
We think of schedulers as independent systems, or at least as independent clusters. Meaning, from our perspective, they have no relationship with each other.
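In practice that just means one file per cluster under /etc/ood/config/clusters.d, each with its own adapter setting, roughly like this (the file names and the second adapter value are placeholders):

/etc/ood/config/clusters.d/
├── pre-post.yml   # job: adapter: "slurm"          (pre/post nodes)
└── fugaku.yml     # job: adapter: "<your_adapter>" (compute nodes)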
mnakao
(Masahiro Nakao)
August 4, 2022, 4:32am
6
I have created an adapter for the special resource manager, named fujitsu_tcs, and it has been merged into the OOD master repository.
To support the resource manager, I added one new file and modified two existing files.
Please refer to the pull request below for details.
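For reference, a cluster configuration for the compute nodes using the new adapter looks roughly like this (the bin path is a placeholder for the directory that holds the Fujitsu TCS commands):

---
v2:
  metadata:
    title: "Fugaku"
  login:
    host: "ondemand-test.fugaku.r-ccs.riken.jp"
  job:
    adapter: "fujitsu_tcs"
    bin: "/usr/bin/"   # placeholder path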
Thanks,
system
(system)
Closed
January 31, 2023, 4:33am
7
This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.