I was just introduced to the System Status app today; I was not aware it existed. It does seem to support SLURM, which is good for us.
However, I am not having any success setting it up. After installing it per instructions I get an empty browser window when I run the app.
Can someone give me some basic troubleshooting tips for this? E.g. some basic info on what’s going on under the hood when the app runs in order to fetch and display the system status, what to run in irb to check how the parts of the app run on the OOD server, etc.
The empty browser window probably means some sort of crash. The app as it exists today still has some OSC-specific stuff hard-coded into it. Seems like we should prioritize generalizing the code.
Thanks guys, I’ll reply in the issue.
I am trying to add a few customizations, and I am stuck because I don't know how values are passed from lib/slurm_squeue_client.rb to views/_node_status.erb.
I am adding a few more variables to the cluster (a.k.a. showqer) object in the slurm_squeue_client.rb, e.g.
self.nodes_other = cluster_info[:nodes_other]
And then adding nodes_other (the “offline” nodes) to _node_status.erb:
<%= showqer.nodes_used %> of <%= showqer.available_nodes %> Nodes Active (<%= showqer.available_nodes - showqer.nodes_used - showqer.nodes_other%> Nodes Free)
But when I run the app, I get an error like:
NoMethodError at /pun/dev/osc-systemstatus/clusters
undefined method `friendly_error_message' for nil:NilClass
even though friendly_error_message is defined in the cluster class.
Any thoughts on that? I was thinking perhaps there’s some formatting issue but looking at the slurm_squeue_client.rb I don’t see any.
The null object wasn’t added yet https://github.com/OSC/osc-systemstatus/blob/4ecde89261dca77b50473680174655a7b5677679/lib/slurm_squeue_client.rb#L186-L188 so when an exception is raised in the setup method call, the return value is nil. We then don’t compact the array to get rid of the nils before calling friendly_error_message on each object.
I would just remove the rescue => e line from the setup call and let the exception go unhandled. Then you should see the error in the browser for easier debugging.
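To illustrate the failure mode described above, here is a minimal sketch, not the app's actual code. FakeClient and its fail_setup flag are hypothetical stand-ins; the only thing taken from the thread is the pattern of a setup method whose rescue yields nil, followed by a call to friendly_error_message on every element of the resulting array.

```ruby
# Hypothetical stand-in for the cluster client; not the app's real class.
class FakeClient
  attr_reader :name, :friendly_error_message

  def initialize(name, fail_setup: false)
    @name = name
    @fail_setup = fail_setup
  end

  # When an exception is raised, the rescue branch makes setup return nil
  # instead of the client object.
  def setup
    raise "squeue failed for #{name}" if @fail_setup
    self
  rescue => e
    nil
  end
end

clients = [FakeClient.new("a"), FakeClient.new("b", fail_setup: true)]

# The second element is nil, so calling friendly_error_message on each
# element of this array would raise NoMethodError.
results = clients.map(&:setup)

# Dropping the nils first avoids the NoMethodError, at the cost of hiding
# which cluster failed.
safe = results.compact
messages = safe.map(&:friendly_error_message)
```

Removing the rescue, as suggested, trades this silent nil for a visible exception, which is usually more helpful while debugging.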
If you run into a wall, feel free to post an issue on https://github.com/OSC/osc-systemstatus with a feature request that if added would enable you to do what you are trying to accomplish.
I wrangled with this most of the day, but I have it more or less working the way I wanted. I pushed a repo to GitHub which has some notes:
The only thing I need help with, which I have sort of asked about before but found a workaround for, is custom ordering of the clusters. I can’t seem to figure out how that can be done, so if you have some pointers that’d be great, as the current layout is sub-optimal. We’d like to have the main clusters, notchpeak and kingspeak, first, followed by lonepeak, ash and tangent. OOD has a mind of its own in ordering the clusters.
Attached is the picture of how the System Status looks w/o the order (the rest of the data is correct - I did some minor modifications to the sinfo and squeue commands and added the fields for the general and owner nodes):
Opened https://github.com/CHPC-UofU/ood-systemstatus/issues/1 to show some options for sorting. cluster#metadata is an OpenStruct, so if you call a method on it that isn’t defined, it just returns nil. So you can add priority to metadata and call cluster.metadata.priority.to_i to get the priority: if not set it is 0, and you can then set 1, 2, 3. The trick is to sort in descending order, so you sort_by the negative value of the priority.
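A minimal sketch of that sorting idea follows. The Cluster struct and the priority values are made up for illustration; the parts taken from the description above are that metadata is an OpenStruct, that an unset priority becomes 0 via to_i, and that sort_by uses the negated priority.

```ruby
require "ostruct"

# Hypothetical stand-in for the cluster objects; only id and metadata matter here.
Cluster = Struct.new(:id, :metadata)

clusters = [
  Cluster.new("ash", OpenStruct.new),                     # no priority set
  Cluster.new("kingspeak", OpenStruct.new(priority: 2)),
  Cluster.new("notchpeak", OpenStruct.new(priority: 3)),
  Cluster.new("lonepeak", OpenStruct.new(priority: 1)),
]

# An unset priority is nil on an OpenStruct, and nil.to_i is 0.
# Negating the value makes sort_by put the highest priority first.
sorted = clusters.sort_by { |c| -c.metadata.priority.to_i }
sorted_ids = sorted.map(&:id)
```

With these example values, clusters without a priority end up last. Note that sort_by does not guarantee the relative order of clusters that share the same priority.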
Thanks Eric, the sorting worked well. I numbered the priorities from 1 (highest) to 5 (lowest), so I sort in ascending order, but it does not matter.
I have one more idea to implement, and would like to ask for opinion.
We have a lot of partitions (queues) and accounts associated with those partitions, so it can be confusing for users to know which account/partition to use when. I wrote a small Python program called “myallocation” which prints the account/partition combinations available to the user on each cluster, e.g.:
You have a general allocation on kingspeak. Account: chpc, Partition: kingspeak
You have a general allocation on kingspeak. Account: chpc, Partition: kingspeak-shared
You can use preemptable mode on kingspeak. Account: owner-guest, Partition: kingspeak-guest
You can use preemptable GPU mode on kingspeak. Account: owner-gpu-guest, Partition: kingspeak-gpu-guest
You have a GPU allocation on kingspeak. Account: kingspeak-gpu, Partition: kingspeak-gpu
I am debating how to present this information to the OOD users.
So, I am thinking about a workaround for now: putting that info at the bottom of each cluster’s status. That would make the page even busier than it is, so I am wondering if it would be possible to use a collapsible item, i.e. the account/partition list would only expand when clicked. I see you have some mentions of that in views/layout.erb, but I have never written HTML like this, so if what I am thinking of is really possible, do you have any reference on how that works?
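One possible way to get a click-to-expand list, sketched under assumptions: this uses the standard HTML details and summary elements, which browsers render collapsed by default without any JavaScript. It is not necessarily the mechanism that views/layout.erb refers to. The allocations array and the summary text are hypothetical; in the app the lines would presumably come from the myallocation output.

```ruby
require "erb"

# Hypothetical data standing in for the myallocation output.
allocations = [
  "Account: chpc, Partition: kingspeak",
  "Account: owner-guest, Partition: kingspeak-guest",
]

# The <details> element is collapsed until the user clicks the <summary> text.
template = ERB.new(<<~HTML, trim_mode: "-")
  <details>
    <summary>Accounts and partitions</summary>
    <ul>
    <%- allocations.each do |line| -%>
      <li><%= ERB::Util.html_escape(line) %></li>
    <%- end -%>
    </ul>
  </details>
HTML

html = template.result(binding)
puts html
```

In a view, the same details/summary markup could go directly into the .erb partial, below the existing node status line.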