I’m trying to use the systemd adapter to run applications on a Slurm cluster login node.
Below is my cluster definition for that node. I set a timeout of 180s for testing. The documentation says the attribute name is timeout, but the ood_core code seems to use a site_timeout attribute (ood_core/lib/ood_core/job/adapters/systemd/launcher.rb at master · OSC/ood_core).
I tried both, but my systemd job never gets cancelled after 3 minutes.
Can you please help me troubleshoot and fix this?
That adapter was a community contribution, so I can only guess. But if you have the debug flag on, the rendered script_wrapper.erb.sh should still exist on disk as a templated sh script. In that script you’ll see how this line was templated. Did it get the correct RuntimeMaxSec? Also, I think site_timeout is the absolute maximum you can request, whereas something like bc_num_hours is the timeout you request for a job, and the adapter will choose the smaller of the two.
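To make that guess concrete, here is a minimal Python sketch of the "smaller of the two" rule. It is not the adapter's actual code (the adapter is Ruby), the function name effective_timeout is hypothetical, and treating 0 or a missing value as "no limit" is an assumption on my part.

```python
from typing import Optional


def effective_timeout(site_timeout: Optional[int],
                      requested: Optional[int]) -> Optional[int]:
    """Return the smaller of the two limits in seconds.

    Assumption: None or 0 means "no limit set" for that side.
    Returns None when neither limit is set.
    """
    # Keep only limits that are actually set (drops None and 0).
    limits = [t for t in (site_timeout, requested) if t]
    return min(limits) if limits else None


if __name__ == "__main__":
    # site_timeout of 180s against a 2-hour request (e.g. bc_num_hours = 2)
    print(effective_timeout(180, 2 * 3600))
```

If the rule works this way, a site_timeout of 180 should yield RuntimeMaxSec=180 in the rendered script no matter how many hours the job requests.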
Yeah, again, I'm not super familiar with this adapter, but while the job is running you should at least be able to query systemd for information about the unit.
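Something like the following could do that query. It is only a sketch: it assumes the job runs as a user-level unit (hence --user, drop it otherwise), the unit name is a placeholder you would replace with the real one, and the helper names are made up. RuntimeMaxUSec is the property systemctl show reports for a RuntimeMaxSec setting.

```python
import subprocess
from typing import Dict


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def query_unit(unit: str) -> Dict[str, str]:
    """Ask systemd for the runtime limit and state of a unit.

    Assumption: the unit lives in the user manager, so --user is passed.
    """
    result = subprocess.run(
        ["systemctl", "--user", "show", unit,
         "--property=RuntimeMaxUSec,ActiveState"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


if __name__ == "__main__":
    # "my-ood-job.service" is a placeholder, not a name the adapter is known to use.
    print(query_unit("my-ood-job.service"))
```

If RuntimeMaxUSec comes back as infinity, the timeout never reached systemd, which points at the cluster definition or the templating rather than at systemd itself.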