I was wondering whether anyone has experienced difficulties getting MPI, specifically OpenMPI, to work inside an Open OnDemand Remote Desktop application that has multiple nodes assigned to it by Slurm.
I can get an MPI job to run on a single node. I have a job script that I use to run the MPI job with sbatch, and it runs successfully. When I request a Remote Desktop application with two nodes, two tasks per node, and one CPU per task, running the same job script from the Terminal inside the Remote Desktop session on the same nodes results in this error:
$ mpirun ./mpi_integration
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back to mpirun
due to a lack of common network interfaces and/or no route found
between them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
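For context, the job script itself is nothing fancy; it's roughly along these lines (the module name is a placeholder for however OpenMPI gets into the environment on our system, and the resource request mirrors what I ask for in the Remote Desktop form):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1

# Placeholder: load whatever provides mpirun on your system
module load openmpi

# OpenMPI picks up the node/task layout from the Slurm allocation
mpirun ./mpi_integration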
I’ve compared the Slurm environment variables, and as near as I can tell, the Slurm setup is the same for the two cases.
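For the comparison, I just dumped the SLURM_* variables in both places and diffed them, roughly like this (file names are arbitrary):

$ env | grep '^SLURM_' | sort > env_sbatch.txt       # from inside the sbatch job
$ env | grep '^SLURM_' | sort > env_desktop.txt      # from the Remote Desktop terminal
$ diff env_sbatch.txt env_desktop.txt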
If I compile and run the same code with Intel MPI, it seems to run fine inside and outside the Remote Desktop application.
Any thoughts on where we might look next to try to figure this out?
As a codicil, can anyone confirm that they have a Remote Desktop application on a Slurm cluster that correctly runs an OpenMPI job spanning at least two nodes?
If so, please let me know what version of TurboVNC you are using, and it would help if you could share the ./configure line for your OpenMPI.
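If it helps the comparison, the OpenMPI details can be pulled straight from the install, and the TurboVNC version from the package manager (assuming it was installed from the official RPM):

$ mpirun --version
$ ompi_info | grep -i 'configure command line'
$ rpm -q turbovnc    # assumes the official RPM; adjust for your packaging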
Hi! Sorry for the lack of a response. We don’t run SLURM (yet), so we’re not super familiar with sbatch and srun.
One thing I can think of offhand is that when we submit a SLURM job we use --export=NONE by default. That may have something to do with it (in your comparable terminal session the default is --export=ALL).
I seem to recall that if you then run srun within that job, the environment is significantly different. I’d imagine environment variables are the only way mpirun could figure out anything about the resources available, right? (That’s not rhetorical; I’m genuinely asking.)
To change this behavior you can set the copy_environment entry in the cluster YAML.
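Something along these lines in the cluster config under /etc/ood/config/clusters.d/ should do it; the file name is illustrative and the exact nesting here is from memory, so double-check it against the docs — the relevant bit is copy_environment: true:

# /etc/ood/config/clusters.d/my_cluster.yml   (illustrative name; nesting is approximate)
v2:
  job:
    adapter: "slurm"
  batch_connect:
    basic:
      script:
        copy_environment: true   # interactive apps (Jupyter, RStudio, etc.)
    vnc:
      script:
        copy_environment: true   # Remote Desktop: submit with --export=ALL instead of --export=NONE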
You could also try strace mpirun ./mpi_integration and maybe that’ll give us some insight into what’s going on.
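If you do go the strace route, following the forked children and narrowing it to network-related syscalls keeps the output manageable; something like this (the flags are just a starting point):

$ strace -f -e trace=network -o mpirun.strace mpirun ./mpi_integration
# then look through mpirun.strace for failed socket()/connect() calls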
It seems the OOD environment/job is the issue. If you can run it through a terminal, you’ve got it compiled correctly, and it seems we’d need to set up the OOD sbatch job’s environment correctly.
Actually, at this point I don’t think this is specific to OOD, but it is definitely tied to being inside a Slurm job. Checking whether --export=ALL matters is a good idea. I’ll get onto the system administrators to see if they’ll twiddle that for us.
In my testing with a job that uses a whole node: if I start the TurboVNC server from inside the job on that node (whether the job was started by OOD or not) and connect to it, mpirun fails; however, if I start a TurboVNC server on the same node outside the job while the job is running and connect to it, mpirun succeeds.
At this point we’re trying to figure out good ways to isolate the real problem, so we can try to solve it.
That’s one reason why I would be curious whether others are able to get a Remote Desktop inside a Slurm job that uses more than one node via OOD and have mpirun succeed. That would suggest to me that we need to look to Slurm and SchedMD.
OK, maybe strace is your best bet then. It could be that mpirun is relying on environment variables like XDG_RUNTIME_DIR, XDG_SESSION_ID, or DISPLAY that VNC is modifying?
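A quick check would be to run something like this in a terminal inside each of the two VNC sessions you described (the one started inside the job and the one started outside) and compare the output:

$ env | grep -E '^(XDG_|DISPLAY)' | sort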
Take care with modifying your production system with copy_environment: true, though, especially before you can pinpoint the cause. This setting is going to affect any job you run with that config (Jupyter, RStudio, and so on), so it may negatively impact your users before you can really sort out what effect it’ll have.