Massively parallel on single core?

Hi all,

I am not sure how many people actually follow the HPC forum on unix.com, but you may be interested in discussing the following (academic) problem:

Assume you want to run a *very* large number (say 100,000) of very lightweight synchronous operations. As an example, say you want to run 100,000 instances of

sleep (3600); // that's a one hour sleep

The trivial (aka braindead) approach would be

for ( int i = 0; i < 100000; i++ )
{
  ::sleep (3600);
}

Takes about 11 years to finish :wink:

One could start 1000 threads, and run a sleep in each of them. That reduces the runtime to 100 hours - still 4 days, and the system is totally idle all the time.
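
To make that variant concrete, here is a minimal sketch (plain POSIX threads; the batching into 100 rounds of 1000 is purely illustrative):

#include <pthread.h>
#include <unistd.h>

#define NTHREADS 1000   // sleeps per batch
#define NBATCHES  100   // 100 batches x 1000 = 100,000 sleeps

void * sleeper (void * arg)
{
   sleep (3600);        // the one hour "job"
   return NULL;
}

int main ()
{
   pthread_t tids[NTHREADS];

   // each batch runs 1000 sleeps in parallel (~1 hour), so
   // 100 batches take ~100 hours of wall clock time
   for ( int b = 0; b < NBATCHES; b++ )
   {
      for ( int i = 0; i < NTHREADS; i++ )
         pthread_create (&tids[i], NULL, sleeper, NULL);

      for ( int i = 0; i < NTHREADS; i++ )
         pthread_join (tids[i], NULL);
   }

   return 0;
}

(Compile with -pthread.)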

So, using more threads? Won't work, as the max-threads-per-process limit will be hit at some point.

So, spawn 100 processes which spawn 1000 threads each?
The max-threads-per-process limit on Linux is close to the max-threads-per-system limit, so that won't work. On other Unixes that is different, but I don't think you get 100,000 threads on a normal single CPU system. Do you?
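
For reference, some of those ceilings can at least be queried from a program. A rough, Linux-leaning sketch (getrlimit/sysconf are standard calls, but which limit actually bites first varies per system):

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main ()
{
   struct rlimit rl;

   // per-user process limit - on Linux this effectively caps threads too
   if ( getrlimit (RLIMIT_NPROC, &rl) == 0 )
      printf ("RLIMIT_NPROC (soft): %lld\n", (long long) rl.rlim_cur);

   // POSIX per-process thread limit; -1 means "undetermined" (typical on Linux)
   printf ("_SC_THREAD_THREADS_MAX: %ld\n", sysconf (_SC_THREAD_THREADS_MAX));

   // the system wide ceiling lives in /proc/sys/kernel/threads-max on Linux
   return 0;
}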

So, what would your approach be?

I am not looking for a sleep replacement, so saying that I should set an alarm or something similar is not of much use. Sleep is obviously only an example here - replace it with an extremely lightweight job, like running a very time-consuming synchronous remote operation.

I am looking forward to the ideas you guys can come up with! :slight_smile:

Cheers, Andre.

Seems overly academic.....

In practice, most people who have a requirement to run 100,000 parallel applications would turn to some distributed processing package, for example cluster management software.

Hardware and existing distributed processing software are cheaper (and more practical) than attempting to design a single-core solution (the title of this thread).

In general, you should design your HPC application as a distributed architecture and make the centralized approach a special case of a distributed architecture.

Hi Neo,

thanks for your reply!

I agree with your remark about distributed architectures. This is my day job, and I like it a lot :slight_smile:

I did not make the problem clear enough, I think: the workload I am talking about consists of mostly idle jobs, so the CPU and memory load for each job is *very* low. Yes, I can beat the problem with more cores or nodes, but that seems very much like a waste, as those would all be idling most of the time.

Assume you plan for 1000 threads per core, and use quad core nodes - that would require 25 nodes which all idle all day long :frowning:

Some more detail, if that helps: the idle processes/threads are basically watchers, which represent a CPU/memory heavy remote job they spawned, and whose state they are watching. Only when that state changes do they become active and kick off data movements or spawn new jobs.

We can't control the design of the remote job startup API very well (third party, synchronous API only), thus our technical options for obtaining state information about those jobs are limited, and boil down to

#include <pthread.h>

void * run_job (void * data)
{
   // this call runs a remote job, and blocks for hours
   remote_api_call (data);
   store_output_data (data);
   return NULL;
}

#define NJOBS 100000

int main ()
{
  pthread_t threads[NJOBS];

  for ( int i = 0; i < NJOBS; i++ )
  {
     pthread_create (&threads[i], NULL, run_job, ...);  // '...' == per-job input data
  }

  for ( int i = 0; i < NJOBS; i++ )
  {
     pthread_join (threads[i], NULL);
  }

  return 0;
}

So, I can throw 25 nodes at that large for loop, and that is basically what we do - but what a waste...

The *real* workload is 100,000 CPU/memory heavy remote jobs, which have sufficient resources to run concurrently. I am talking about the management side (our workflow engine).

Thanks, Andre.