We are exploring ways to integrate Dynatrace with our Open OnDemand setup for enhanced job monitoring. Currently, OOD already provides job status (queued, running, completed) and system status (nodes, CPUs, GPUs available) on the main dashboard.
What we’d like to achieve with Dynatrace is:
- Per-job and per-user resource utilization (CPU, memory, GPU, I/O) in real time.
- Exporting OOD/Slurm job metadata (job ID, user, queue, partition, start/end times, exit status) into Dynatrace.
- Building dashboards in Dynatrace that allow filtering by user/job to analyze efficiency, failures, and trends.
Has anyone implemented something similar, or is there a recommended approach for it?
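
To make the export idea concrete, here is a rough sketch of the kind of exporter we have in mind, not something we have running. It reads job accounting records from Slurm's `sacct` and turns them into lines for the Dynatrace Metrics API v2 ingest endpoint (`/api/v2/metrics/ingest`). The metric keys (`slurm.job.*`), the dimension names, and the helpers `read_sacct`, `to_metric_lines`, and `send` are all made up for illustration, and it assumes an API token with the `metrics.ingest` scope.

```python
import subprocess
import urllib.request

# sacct columns requested, in order (standard sacct field names).
SACCT_FIELDS = ["JobID", "User", "Partition", "State", "ExitCode",
                "ElapsedRaw", "CPUTimeRAW"]

# Hypothetical metric keys mapped to the sacct field that supplies the value.
METRICS = {
    "slurm.job.elapsed_seconds": "ElapsedRaw",
    "slurm.job.cpu_time_seconds": "CPUTimeRAW",
}

# Hypothetical dimension names mapped to sacct fields.
DIMENSIONS = [("job_id", "JobID"), ("user", "User"),
              ("partition", "Partition"), ("state", "State"),
              ("exit_code", "ExitCode")]


def read_sacct(since="now-1hours"):
    """Return pipe-delimited sacct rows, one per job allocation."""
    cmd = ["sacct", "--allusers", "--allocations", "--noheader",
           "--parsable2", "--starttime", since,
           "--format", ",".join(SACCT_FIELDS)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def quote(value):
    """Wrap a dimension value in double quotes, escaping \\ and \"."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_metric_lines(rows):
    """Convert sacct rows into metric ingest lines (one per metric per job)."""
    lines = []
    for row in rows:
        if not row.strip():
            continue
        rec = dict(zip(SACCT_FIELDS, row.split("|")))
        # sacct reports e.g. "CANCELLED by 1000"; keep only the state name.
        rec["State"] = rec["State"].split()[0] if rec["State"] else "UNKNOWN"
        # Skip empty values rather than sending empty dimensions.
        dims = ",".join(f"{name}={quote(rec[field])}"
                        for name, field in DIMENSIONS if rec.get(field))
        for key, field in METRICS.items():
            if rec.get(field, "").isdigit():
                lines.append(f"{key},{dims} {rec[field]}")
    return lines


def send(lines, base_url, api_token):
    """POST the lines to the metrics ingest endpoint; returns the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/metrics/ingest",
        data="\n".join(lines).encode("utf-8"),
        headers={"Authorization": f"Api-Token {api_token}",
                 "Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Run periodically (for example from cron), this would be called as `send(to_metric_lines(read_sacct()), base_url, api_token)`. Two things we are unsure about: using the job ID as a dimension probably creates very high cardinality, so per-job metadata may belong in log or event ingestion rather than metrics, and `sacct` mostly covers accounting data, so live usage of running jobs would likely need something like `sstat` or a node-level agent instead.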