MPI: recovering from a node failure

Hi all,

I'm writing an MPI application in which I want to handle failures and recover from them. To do that, when a node fails, I would like to remove that node from the MPI_COMM_WORLD group and continue with the remaining nodes.

Does anybody know how I can do that?

I'm using MPICH-G2 by the way.

Thanks in advance.

Sadly, MPI isn't really built for this. It's one of the big drawbacks of MPI in general. I once asked an MPI guru about this and he said, "for performance reasons, MPI is designed for a static number of nodes at startup time". Apparently, collective operations such as gather/scatter won't work right (or efficiently) if you have dynamic node allocation.

But it's been 3 solid years since I heard this, so maybe OpenMPI has made an improvement on the state of affairs. However, at least with MPICH, once a node fails, the whole process tree is supposed to die. If it doesn't, it's because your cluster admin hasn't done things correctly.

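There is a technique called "checkpointing" (checkpoint/restart) meant to tackle such problems: the application's state is saved periodically, so that after a failure the job can be restarted from the last checkpoint instead of from scratch.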

Good starting point: https://ftg.lbl.gov/CheckpointRestart/Pubs/WTTC2008-BKK.pdf
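To make the checkpointing idea concrete, here is a minimal application-level sketch in plain C. It is not the system-level checkpoint/restart the linked paper covers, and nothing in it is an MPI or MPICH API: AppState, save_checkpoint, load_checkpoint and run are hypothetical names made up for illustration. The assumption is that the whole state needed to resume fits in one small struct and that each process can write its own file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: whatever is needed to resume the loop. */
typedef struct {
    int    iteration;   /* next iteration to execute */
    double partial_sum; /* result accumulated so far */
} AppState;

/* Save the state to `path`. Returns 0 on success, -1 on failure.
   The data is written to a temporary file and renamed afterwards, so on
   POSIX systems a crash in the middle of a write leaves the previous
   checkpoint intact. */
int save_checkpoint(const char *path, const AppState *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load the state from `path`. Returns 0 on success, -1 if there is no
   usable checkpoint; `*s` is only modified on success. */
int load_checkpoint(const char *path, AppState *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    AppState loaded;
    size_t got = fread(&loaded, sizeof loaded, 1, f);
    fclose(f);
    if (got != 1)
        return -1;
    *s = loaded;
    return 0;
}

/* Run iterations up to `total`, resuming from the checkpoint at `path`
   if one exists, and saving a new checkpoint every `every` iterations. */
double run(const char *path, int total, int every)
{
    assert(every > 0);
    AppState s = {0, 0.0};
    load_checkpoint(path, &s);              /* start from zero if it fails */
    while (s.iteration < total) {
        s.partial_sum += (double)s.iteration;   /* stand-in for real work */
        s.iteration++;
        if (s.iteration % every == 0)
            save_checkpoint(path, &s);          /* best effort */
    }
    return s.partial_sum;
}
```

In an MPI job you would presumably give each rank its own file (for example with the rank number in the name) and have the whole job relaunched after a failure. Each rank then calls run() again and continues from its last saved iteration, redoing only the work done since that checkpoint.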

It really doesn't make much sense to me. MPICH is supposed to run on many nodes, and there is a good chance that a node will fail during execution. It should at least continue processing with the remaining nodes.

Thanks for the answers though. I'll keep looking for the solution.

The MPI specification predates Beowulf clusters, my friend. Back then, you had computers with varying numbers of CPUs. Clusters of computers were anticipated, but nothing on today's scale. Besides, the guys who dreamt up MPI were computer scientists, i.e., not hardware guys or systems guys. MPI-2, which has the ability to spawn and connect to separate MPI instances, doesn't make this easy.

Search for MPI-2 libraries that support process/communication attachment/detachment. You might find something there. Please post back if you do.

UPDATE: See this PDF/slide presentation http://www.cs.utk.edu/~dongarra/WEB-PAGES/SPRING-2006/Lect03-mpi2-features.pdf

Search for "Process Management". You use "MPI_COMM_SPAWN" to create a new set of processes with the same arguments on the command line, but now you must use an "INTERcommunicator" (instead of INTRA). You can do MPI_SEND/MPI_RECV across it, but collective functions do not behave the way they do on an intracommunicator (in MPI-1 they were not defined on intercommunicators at all; MPI_INTERCOMM_MERGE can turn one into an ordinary intracommunicator). Still, if a node dies, this doesn't help!! You would basically need to create your own process and communication management on top of MPI. That's why I suggest you look for a library.
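As a rough idea of what the bookkeeping part of "your own process management" might look like, here is a small plain-C sketch. It assumes you already know, by some means of your own (heartbeats, timeouts, error codes), which ranks have died; build_survivor_map is a hypothetical helper, not an MPI call. The resulting array is the kind of rank list you would hand to MPI_Group_incl before calling MPI_Comm_create. Keep in mind that MPI_Comm_create is collective over the old communicator, so whether it can complete at all once a process is dead depends on the implementation.

```c
#include <assert.h>

/* Given alive[r] != 0 for every old rank r that is still running, fill
   survivors[] with the surviving old ranks in ascending order and
   new_rank[r] with the rank that r would get in the shrunken group
   (-1 for failed ranks). Both output arrays must have room for
   world_size entries. Returns the number of survivors. */
int build_survivor_map(const int *alive, int world_size,
                       int *survivors, int *new_rank)
{
    assert(alive && survivors && new_rank && world_size >= 0);
    int count = 0;
    for (int r = 0; r < world_size; r++) {
        if (alive[r]) {
            survivors[count] = r;   /* position in this list = new rank */
            new_rank[r] = count;
            count++;
        } else {
            new_rank[r] = -1;       /* failed rank: no place in new group */
        }
    }
    return count;
}
```

For example, with five processes and rank 2 dead, survivors comes out as {0, 1, 3, 4}, so old rank 3 becomes new rank 2. Every survivor has to compute the same list, since MPI_Group_incl numbers the new group in the order the ranks are passed in.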

The current MPI specification assumes that nodes will stay alive during execution. A guy who is interested in MPI implementations visited my institution 2 weeks ago and gave a presentation. I asked the same question and he said another specification (MPI-3) will be announced in the summer and this issue will be addressed. Right now all I can do is write my own process management on top of the MPI library I'm using (like otheus mentioned before).