Not sure what the issue could be here. I guess I’d suggest restarting the web server in the help menu to be sure you picked up the new code base.
If it persists, show us an image of dev tools making the request and the response. Or maybe there’s an error in /var/log/ondemand-nginx/$USER/error.log that could tell us more.
Before pasting the whole stacktrace here: I noticed that if I remove the config of a cluster that uses the LinuxHost adapter from cluster.d, the system-status works again. Maybe the logic changed in 4.1.1 for the case where ssh to the login nodes is (temporarily) not working?
@statiksof I’ve got your issue on my radar and will be looking into it.
@emily.dragowsky can you provide some output from this command? This is the command we use to pull GPU info from a given system. From there we parse it so I’m guessing we’re not parsing your output correctly.
@emily.dragowsky I have a fix coming in our downstream libraries, but I don’t think we’ll be able to patch it any time soon.
In the interim you can apply this patch by dropping this file in the location specified in the comment to patch this particular class method.
# /etc/ood/config/apps/dashboard/initializers/gpu_fix.rb
Rails.application.config.after_initialize do
  require 'ood_core/job/adapters/slurm'

  class OodCore::Job::Adapters::Slurm < OodCore::Job::Adapter
    # patch gpus_from_gres to incorporate https://github.com/OSC/ood_core/pull/925
    def self.gpus_from_gres(gres)
      gres.to_s.scan(/gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/).flatten.map(&:to_i).sum
    end
  end
end
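If you want to check what that regex does with your own GRES strings before dropping the initializer in place, here is a minimal standalone sketch. It copies the pattern from the patch into a throwaway module (GresDemo is a made-up name, not part of ood_core), and the sample strings are illustrative guesses at common shapes, not output captured from a real cluster.

```ruby
# Standalone copy of the pattern from the patch, so it can be run with plain
# `ruby` and no ood_core installed. GresDemo is a hypothetical name.
module GresDemo
  GPU_PATTERN = /gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/

  # Sum every GPU count found in a GRES string; nil and non-GPU strings give 0.
  def self.gpus_from_gres(gres)
    gres.to_s.scan(GPU_PATTERN).flatten.map(&:to_i).sum
  end
end

# Illustrative inputs only (assumed shapes, not real scheduler output).
samples = [
  "gpu:2",
  "gpu:a100:2",
  "gpu:v100:4(S:0-1)",
  "gres/gpu=2",
  "gpu:2,gpu:a100:1",
  "(null)",
  nil
]

samples.each do |s|
  puts "#{s.inspect} => #{GresDemo.gpus_from_gres(s)}"
end
```

Swap in the strings your scheduler actually reports to see whether the count comes out as you expect.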
Just an update from my side: after investigating the error message further, it appears that /tmp/tmux-xxx/default was missing on one of the nodes. I manually started a tmux session there, and the system-status is now working again.
I realize this is not a proper fix, but it might help when updating the system-status code. Ideally, the code should not crash if the tmux directory is missing on some or all nodes.
By the way, where can I find the code for system-status?
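To illustrate the "should not crash" point, here is a rough sketch of the kind of per-cluster isolation I have in mind. This is not the dashboard's actual system-status code; collect_statuses, the cluster ids, and the error text are all made up for the example. The idea is only that a failure on one cluster is recorded and the rest are still reported.

```ruby
# Hypothetical helper: run a status lookup per cluster and keep going when one
# of them raises, instead of letting a single failure take down the whole page.
def collect_statuses(cluster_ids)
  cluster_ids.each_with_object({}) do |id, results|
    begin
      results[id] = { ok: true, info: yield(id) }
    rescue StandardError => e
      # Record the failure so the view can show "unavailable" for this cluster.
      results[id] = { ok: false, error: e.message }
    end
  end
end

# Made-up cluster ids and a simulated failure standing in for the missing
# tmux directory on a LinuxHost cluster.
statuses = collect_statuses(%w[slurm_cluster linuxhost_cluster]) do |id|
  raise "tmux socket missing (simulated)" if id == "linuxhost_cluster"
  "up"
end

statuses.each do |id, status|
  puts "#{id}: #{status[:ok] ? status[:info] : "unavailable (#{status[:error]})"}"
end
```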