System-status not working after upgrade to 4.1.1

statiksof · February 12, 2026, 1:17pm

Hi, I just upgraded OnDemand to 4.1.1 without errors, but System Status (Slurm cluster) is not working anymore.

It is stuck on `loading`.

In the browser console:

Failed to load resource: the server responded with a status of 500 (Internal Server Error)

I also see some errors in error.log:

Error during failsafe response: ActionController::UnknownFormat

jeff.ohrstrom · February 12, 2026, 5:24pm

Not sure what the issue could be here. I'd suggest restarting the web server from the Help menu to be sure you picked up the new codebase.

If it persists, show us a screenshot of the browser dev tools with the request and the response. Or maybe there's an error in /var/log/ondemand-nginx/$USER/error.log that could tell us more.

statiksof · February 13, 2026, 8:17am

Before pasting the whole stack trace here: I noticed that if I remove the config of a cluster that uses the LinuxHost adapter from clusters.d, System Status works again. Maybe the logic changed in 4.1.1 for when SSH to the login nodes is (temporarily) not working?
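To find which files in `clusters.d` use the LinuxHost adapter before moving them aside, a small Ruby sketch like this could help. It assumes the default `/etc/ood/config/clusters.d` location and the `v2` → `job` → `adapter` key of the cluster config schema, and it only reads plain `.yml` files (ERB-templated configs are not handled). The `linux_host_configs` name is hypothetical.

```ruby
require 'yaml'

# Return the cluster config files whose job adapter is "linux_host".
# Assumes the OnDemand v2 schema: v2 -> job -> adapter.
def linux_host_configs(dir = '/etc/ood/config/clusters.d')
  Dir.glob(File.join(dir, '*.yml')).sort.select do |path|
    config = YAML.safe_load(File.read(path)) || {}
    config.is_a?(Hash) && config.dig('v2', 'job', 'adapter') == 'linux_host'
  rescue Psych::Exception
    false # skip files that are not valid (or not safely loadable) YAML
  end
end

puts linux_host_configs
```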

jeff.ohrstrom · February 13, 2026, 2:29pm

OK that’s a good clue. I’ll file a ticket upstream, take a look and let you know.

emily.dragowsky · February 13, 2026, 7:42pm

Just updated to 4.1.1.

So far System Status loads stats, but it still fails to report GPU device usage.

jeff.ohrstrom · February 19, 2026, 9:24pm

@statiksof I've got your issue on my radar and will be looking into it.

@emily.dragowsky can you provide some output from this command? This is the command we use to pull GPU info from a given system. We then parse that output, so I'm guessing we're not parsing yours correctly.

sinfo -ahNO 'nodehost,gres:100,gresused:100,statelong'
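In that command, `-O` selects the output fields (`nodehost`, `gres`, `gresused`, `statelong`), `:100` sets the width of the two GRES columns, `-N` prints one line per node, `-h` drops the header and `-a` includes all partitions. Since the fields normally contain no spaces, each line can be split on whitespace. The Ruby sketch below shows only that first step; the `NodeGres` struct and `parse_sinfo` helper are hypothetical and not ood_core's actual parser.

```ruby
# One row of `sinfo -ahNO 'nodehost,gres:100,gresused:100,statelong'`.
NodeGres = Struct.new(:host, :gres, :gres_used, :state)

# Split each non-empty line on runs of whitespace. This assumes no field
# contains embedded spaces, which holds for typical GRES strings.
def parse_sinfo(output)
  output.each_line.map(&:split).reject(&:empty?).map do |fields|
    NodeGres.new(*fields.first(4))
  end
end

sample = <<~OUT
  gput072   gpu:2(S:1)     gpu:(null):2(IDX:0-1)   mixed
  gput074   gpu:4(S:0-3)   gpu:(null):1(IDX:0)     mixed
OUT

parse_sinfo(sample).each { |n| puts "#{n.host} #{n.gres} #{n.gres_used} #{n.state}" }
```

On a real system the input could come from `Open3.capture2('sinfo', '-ahNO', 'nodehost,gres:100,gresused:100,statelong')`, which returns the output string and the exit status.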

emily.dragowsky · February 20, 2026, 12:31am

After collapsing the runs of whitespace, here's a sample:
gput072 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
gput073 gpu:2(S:1) gpu:(null):2(IDX:0-1) mixed
gput074 gpu:4(S:0-3) gpu:(null):1(IDX:0) mixed
gput075 gpu:4(S:0-3) gpu:(null):4(IDX:0-3) mixed

We are running Slurm 25.05.3.

Thanks Jeff

jeff.ohrstrom · February 23, 2026, 9:50pm

@emily.dragowsky I have a fix coming in our downstream libraries, but I don’t think we’ll be able to patch it any time soon.

In the interim, you can patch this particular class method by dropping the following file into the location given in its first comment.

# /etc/ood/config/apps/dashboard/initializers/gpu_fix.rb
Rails.application.config.after_initialize do

  require 'ood_core/job/adapters/slurm'

  class OodCore::Job::Adapters::Slurm < OodCore::Job::Adapter
    # patch gpus_from_gres to incorporate https://github.com/OSC/ood_core/pull/925
    def self.gpus_from_gres(gres)
      gres.to_s.scan(/gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/).flatten.map(&:to_i).sum
    end
  end
end
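To check what the patched `gpus_from_gres` regex returns for the sample output above, the same expression can be exercised outside Rails. This standalone sketch copies the regex from the patch into a hypothetical `gpu_count` helper, so it needs neither OnDemand nor ood_core installed.

```ruby
# Same regex as the patched OodCore::Job::Adapters::Slurm.gpus_from_gres,
# copied here so it can be tried without loading ood_core.
GPU_REGEX = /gpu[s:]*[\w()-]*[=:]?(\d+)(?:[(,]|$)/

# Sum every GPU count found in a GRES string (e.g. "gpu:a100:2,gpu:v100:1").
def gpu_count(gres)
  gres.to_s.scan(GPU_REGEX).flatten.map(&:to_i).sum
end

# Rows from the sinfo sample posted earlier in this thread: [host, gres, gresused].
rows = [
  ['gput072', 'gpu:2(S:1)',   'gpu:(null):2(IDX:0-1)'],
  ['gput074', 'gpu:4(S:0-3)', 'gpu:(null):1(IDX:0)'],
  ['gput075', 'gpu:4(S:0-3)', 'gpu:(null):4(IDX:0-3)']
]

rows.each do |host, gres, used|
  puts "#{host}: #{gpu_count(used)}/#{gpu_count(gres)} GPUs in use"
end
```

For these rows it reports 2/2, 1/4 and 4/4: the `(null)` type and the `(S:...)`/`(IDX:...)` suffixes are skipped, and only the count before the opening parenthesis is summed.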

emily.dragowsky · March 4, 2026, 10:27pm

Jeff, you and the team rock!
Thanks so much ( :

statiksof · March 10, 2026, 12:53pm

Just an update from my side: after investigating the error message further, it appears that /tmp/tmux-xxx/default was missing on one of the nodes. I manually started a tmux session there, and System Status is now working again.

I realize this is not a proper fix, but it might help when updating the System Status code. Ideally, the code should not crash if the tmux directory is missing on some or all nodes.

By the way, where can I find the code for system-status?
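As a sketch of the behaviour suggested above, a session lookup can treat a missing tmux socket as "no sessions" rather than as an error. The `tmux_sessions` and `default_tmux_socket` helpers below are hypothetical and are not the actual ood_core LinuxHost code; they run locally on a node and assume tmux's default socket location, `/tmp/tmux-<uid>/default` (tmux uses `$TMUX_TMPDIR` instead of `/tmp` when it is set).

```ruby
require 'open3'

# Default tmux socket path for the current user: $TMUX_TMPDIR (or /tmp)/tmux-<uid>/default.
def default_tmux_socket
  File.join(ENV.fetch('TMUX_TMPDIR', '/tmp'), "tmux-#{Process.uid}", 'default')
end

# List tmux session names, returning [] instead of raising when the socket
# (or its tmux-<uid> directory) is missing, when tmux exits non-zero,
# or when the tmux binary is not installed.
def tmux_sessions(socket = default_tmux_socket)
  return [] unless File.exist?(socket)

  # Single quotes keep '#{session_name}' literal for tmux's -F format.
  out, _err, status = Open3.capture3('tmux', '-S', socket, 'list-sessions', '-F', '#{session_name}')
  status.success? ? out.lines.map(&:chomp) : []
rescue Errno::ENOENT
  [] # tmux is not installed
end

p tmux_sessions
```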

karcaw · June 3, 2026, 10:26pm

This patch didn’t work for me. For now I have changed the file:

/var/www/ood/apps/sys/dashboard/app/views/system_status/index.turbo_stream.erb

from

<% Configuration.job_clusters.each_with_index do |c, cindex| %>

to

<% Configuration.job_clusters.reject(&:linux_host?).each_with_index do |c, cindex| %>

This works for me because I have two clusters that refer to the same physical cluster: one uses the Slurm adapter and one uses LinuxHost, so ignoring the LinuxHost one is fine.
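The effect of the `reject(&:linux_host?)` change can be seen with plain Ruby stand-ins for the cluster objects. `FakeCluster` and the cluster ids are hypothetical; the stand-in models only the `linux_host?` predicate the edited template relies on, whereas the real objects come from `Configuration.job_clusters`.

```ruby
# Hypothetical stand-in for a cluster object; only the predicate used
# by the edited template is modelled.
FakeCluster = Struct.new(:id, :adapter) do
  def linux_host?
    adapter == 'linux_host'
  end
end

clusters = [
  FakeCluster.new('cluster', 'slurm'),
  FakeCluster.new('cluster_login', 'linux_host')
]

# Same chain as the edited ERB: LinuxHost clusters are dropped before indexing,
# so the remaining clusters are numbered 0, 1, ... without gaps.
clusters.reject(&:linux_host?).each_with_index do |c, cindex|
  puts "#{cindex}: #{c.id}"
end
```

Filtering before `each_with_index`, rather than skipping inside the loop, keeps `cindex` contiguous for whatever the template does with it.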
