Is there a preferred solution, or is it about offering flexibility? They do measure different things: Grafana pulls per-job data from Prometheus running on the nodes, while XDMoD does head node log analysis out of the box, with SUPReMM as a per-worker plugin…
We currently don’t have any per-node metrics, so now is the time to make decisions about Prometheus/SUPReMM on the nodes.
You may end up with both. I don’t think it’s an either/or; they’re each useful in different areas, as you’ve already indicated.
I think Prometheus is a great operational tool. Our ops team uses it for alerting, so it’s great in that regard. Plus it can monitor just about anything in your infrastructure (including OOD!).
The support we have for Grafana is the ability to look at a single job’s metrics. That’s great for that one job, but XDMoD can give you information about your last 100 jobs. It gives you context about your performance over time because it’s an application built around HPC jobs. So it’s a great tool for your users to diagnose their jobs’ performance over time.
Hope that helps. I get that it’s kind of a non-answer, but yes, from our side it’s about flexibility. Really, it comes down to what you need to provide to your staff and your users, and how much you want to invest in administering these tools.