System-status not working after upgrade to 4.1.1

Hi, I just upgraded Open OnDemand to 4.1.1 without errors, but system-status (Slurm cluster) is no longer working.

It is stuck on `loading`.

In the browser console log:

Failed to load resource: the server responded with a status of 500 (Internal Server Error)

I also see some errors in error.log:

Error during failsafe response: ActionController::UnknownFormat

Not sure what the issue could be here. I'd suggest restarting the web server from the Help menu to make sure you've picked up the new code base.

If it persists, show us a screenshot of the browser dev tools with the failing request and its response. There may also be an error in /var/log/ondemand-nginx/$USER/error.log that could tell us more.

Before pasting the whole stack trace here, I noticed that if I remove the config of a cluster that uses the LinuxHost adapter from clusters.d, system-status works again. Maybe the logic changed in 4.1.1 for the case where ssh to the login nodes is (temporarily) not working?

OK that’s a good clue. I’ll file a ticket upstream, take a look and let you know.


Just updated to 4.1.1

System Status so far loads stats – still fails to report GPU device usage, though.

@statiksof I’ve got your issue on my radar and will be looking into it.

@emily.dragowsky can you provide some output from this command? This is the command we use to pull GPU info from a given system. We parse its output from there, so I’m guessing we’re not parsing your output correctly.

```
sinfo -ahNO 'nodehost,gres:100,gresused:100,statelong'
```

If I strip out the repeated whitespace characters, here’s a sample:

```
gput072 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
gput073 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
gput074 gpu:4(S:0-3) gpu:(null):1(IDX:0) mixed
gput075 gpu:4(S:0-3) gpu:(null):4(IDX:0-3) mixed
```

We are running Slurm 25.05.3

Thanks, Jeff

@emily.dragowsky I have a fix coming in our downstream libraries, but I don’t think we’ll be able to patch it any time soon.

In the interim, you can apply the fix yourself by saving the file below at the path given in its first comment. It overrides just this one class method.

```ruby
# /etc/ood/config/apps/dashboard/initializers/gpu_fix.rb
Rails.application.config.after_initialize do

  require 'ood_core/job/adapters/slurm'

  class OodCore::Job::Adapters::Slurm < OodCore::Job::Adapter
    # patch gpus_from_gres to incorporate https://github.com/OSC/ood_core/pull/925
    def self.gpus_from_gres(gres)
      gres.to_s.scan(/gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/).flatten.map(&:to_i).sum
    end
  end
end
```
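
If you want to check the patched regex against your own `sinfo` output before dropping the initializer in place, something like the standalone sketch below should work. It copies the regex from the patch into a plain method so it runs without Rails or ood_core, and feeds it the sample lines posted earlier in this thread. The `gpu_usage` helper and the constant names are made up for illustration and are not part of ood_core; the real dashboard code may split the columns differently.

```ruby
# Standalone copy of the regex from the patch above, so this sketch runs
# without Rails or ood_core installed.
GPU_GRES_PATTERN = /gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/

def gpus_from_gres(gres)
  gres.to_s.scan(GPU_GRES_PATTERN).flatten.map(&:to_i).sum
end

# Sample lines from the sinfo output posted earlier in this thread
# (columns: nodehost, gres, gresused, statelong).
SAMPLE_SINFO = <<~OUT
  gput072 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
  gput073 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
  gput074 gpu:4(S:0-3) gpu:(null):1(IDX:0) mixed
  gput075 gpu:4(S:0-3) gpu:(null):4(IDX:0-3) mixed
OUT

# Hypothetical helper: split each line on whitespace and count GPUs in the
# gres (total) and gresused (in use) columns.
def gpu_usage(sinfo_output)
  rows = sinfo_output.each_line.map(&:split).reject(&:empty?)
  rows.map do |node, gres, gres_used, _state|
    { node: node, total: gpus_from_gres(gres), used: gpus_from_gres(gres_used) }
  end
end

gpu_usage(SAMPLE_SINFO).each do |row|
  puts "#{row[:node]}: #{row[:used]}/#{row[:total]} GPUs in use"
end
```

For the sample above this reports 2/2, 2/2, 1/4 and 4/4 GPUs in use, which matches what the `gres` and `gresused` columns show by eye.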

Jeff, you and the team rock!
Thanks so much ( :

Just an update from my side: after investigating the error message further, it appears that /tmp/tmux-xxx/default was missing on one of the nodes. I manually started a tmux session there, and the system-status is now working again.

I realize this is not a proper fix, but it might help when updating the system-status code. Ideally, the code should not crash if the tmux directory is missing on some or all nodes.
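
To illustrate the kind of behavior I mean, here is a rough sketch. I have not seen the actual system-status code, so every name in it (`FakeCluster`, `collect_statuses`, the hash keys) is hypothetical and only stands in for whatever the dashboard really does. The idea is to rescue the failure per cluster, so that one cluster with a missing tmux directory produces an error entry instead of a 500 for the whole page.

```ruby
# Hypothetical stand-in for a cluster whose status query may raise,
# e.g. a LinuxHost cluster whose login node has no tmux socket directory.
FakeCluster = Struct.new(:id, :failure) do
  def status
    raise failure if failure
    { nodes_active: 10 }
  end
end

# Collect the status of each cluster independently. An exception from one
# cluster becomes an error entry rather than aborting the whole request.
def collect_statuses(clusters)
  clusters.map do |cluster|
    begin
      { id: cluster.id, ok: true, data: cluster.status }
    rescue StandardError => e
      { id: cluster.id, ok: false, error: e.message }
    end
  end
end

clusters = [
  FakeCluster.new("slurm"),
  FakeCluster.new("linuxhost", "no tmux socket directory")
]

collect_statuses(clusters).each do |result|
  if result[:ok]
    puts "#{result[:id]}: #{result[:data]}"
  else
    puts "#{result[:id]}: unavailable (#{result[:error]})"
  end
end
```

With something along these lines, the Slurm cluster would still show its stats while the LinuxHost cluster is reported as unavailable.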

By the way, where can I find the code for system-status?