Why can't threads across processes scale in ping-pong?

The question concerns the scalability of ping-pong communication using threads vs. processes. Intuitively, context switching between threads belonging to different processes should be no worse than context switching among processes. However, our experiments show otherwise: ping-pong communication among threads belonging to two processes (CASE-1) offers 1/10 the throughput of ping-pong communication among processes (CASE-2), based on the same data structure.

data structures:
ping/pong: two atomic variables shared across processes via mmap, where ping++, pong++, ping--, and pong-- denote atomic FAA(ping, 1), FAA(pong, 1), FAA(ping, -1), and FAA(pong, -1), respectively. Both variables are initialized to 0.
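The post does not show the setup code. Below is a minimal sketch of how the shared pairs could be created, assuming Linux and that std::atomic<int> is lock-free (and therefore safe to use across processes through a MAP_SHARED mapping). The names Channel and make_channels are illustrative, not from the original.

#include <atomic>
#include <new>
#include <sys/mman.h>

struct Channel {
  std::atomic<int> ping{0};  // both counters start at 0
  std::atomic<int> pong{0};
};

Channel* make_channels(int n) {
  // MAP_SHARED | MAP_ANONYMOUS pages are inherited across fork(), so
  // parent and child operate on the same physical memory.
  void* mem = mmap(nullptr, n * sizeof(Channel),
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return nullptr;
  Channel* ch = static_cast<Channel*>(mem);
  for (int i = 0; i < n; i++) new (&ch[i]) Channel();  // construct in place
  return ch;
}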

workflow:

ping-side:

int count = 0;                   // local: pings sent but not yet matched
while (true) {
  ping++;                        // FAA(ping, 1)
  count++;
  while (count > 0) {
    if (!pong) sched_yield();    // spin-yield until the pong side replies
    else break;
  }
  if (pong) {
    pong--;                      // FAA(pong, -1): consume the reply
    count--;
  }
}

pong-side:

while (true) {
  while (!ping) {
    sched_yield();               // spin-yield until a ping arrives
  }
  pong++;                        // FAA(pong, 1): send the reply
  ping--;                        // FAA(ping, -1): consume the ping
}
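For concreteness, the two loops above can be written out in C++ against the (assumed) Channel layout from the earlier sketch; fetch_add/fetch_sub correspond to the post's FAA operations.

#include <sched.h>

// ping-side: announce a ping, then spin-yield until the reply arrives.
void ping_side(Channel& c) {
  int count = 0;                 // pings sent but not yet matched
  while (true) {
    c.ping.fetch_add(1);         // ping++
    count++;
    while (count > 0) {
      if (!c.pong.load()) sched_yield();
      else break;
    }
    if (c.pong.load()) {
      c.pong.fetch_sub(1);       // pong--: consume the reply
      count--;
    }
  }
}

// pong-side: spin-yield until a ping arrives, then reply.
void pong_side(Channel& c) {
  while (true) {
    while (!c.ping.load()) sched_yield();
    c.pong.fetch_add(1);         // pong++
    c.ping.fetch_sub(1);         // ping--
  }
}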

CASE-1

pid_t pid = fork();
if (pid > 0) {                   // parent: N ping threads
  for (int t = 0; t < N; t++) {
    thread([&](int tid) {
      // ping-side operating on ping[tid]/pong[tid]
    }, t);
  }
} else if (pid == 0) {           // child: N pong threads
  for (int t = 0; t < N; t++) {
    thread([&](int tid) {
      // pong-side operating on ping[tid]/pong[tid]
    }, t);
  }
}
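A runnable version of CASE-1 might look like the following, assuming make_channels/ping_side/pong_side from the sketches above; the threads are joined so that neither process exits while its workers run.

#include <thread>
#include <unistd.h>
#include <vector>

void case1(Channel* ch, int N) {
  pid_t pid = fork();
  std::vector<std::thread> workers;
  if (pid > 0) {                 // parent: N ping threads
    for (int t = 0; t < N; t++)
      workers.emplace_back([ch](int tid) { ping_side(ch[tid]); }, t);
  } else {                       // child: N pong threads
    for (int t = 0; t < N; t++)
      workers.emplace_back([ch](int tid) { pong_side(ch[tid]); }, t);
  }
  for (auto& w : workers) w.join();  // the loops are infinite: joins block
}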

CASE-2

for (int t = 0; t < 2 * N; t++) {
  pid_t pid = fork();
  if (pid == 0) {                // child: run exactly one side
    if (t < N) {
      // ping-side operating on ping[t % N]/pong[t % N]
    } else {
      // pong-side operating on ping[t % N]/pong[t % N]
    }
    break;                       // children must not fork further
  }
}
wait_all();                      // parent reaps all 2 * N children
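Similarly, a runnable version of CASE-2, where wait_all() from the pseudocode is assumed to mean that the parent reaps all children after the fork loop:

#include <sys/wait.h>
#include <unistd.h>

void case2(Channel* ch, int N) {
  for (int t = 0; t < 2 * N; t++) {
    pid_t pid = fork();
    if (pid == 0) {              // child: run exactly one side
      if (t < N) ping_side(ch[t % N]);
      else       pong_side(ch[t % N]);
      _exit(0);                  // unreachable: the loops never return
    }
  }
  while (wait(nullptr) > 0) {}   // wait_all(): reap all 2 * N children
}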

Single-threaded (N = 1):
With only two processes: in CASE-1, each process runs a single thread that ping-pongs with the thread of the other process, yielding 1.2 Mops; in CASE-2, we fork two processes that ping-pong directly, which also yields 1.2 Mops.

Multi-threaded (N = 96):
However, when we increase the thread/process count to 2 * N, CASE-1 provides only 5 Mops (with 96 threads in each of the two processes), whereas CASE-2 offers 60 Mops with 96 x 2 processes.
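The post does not show how the Mops figures were obtained. One plausible harness, assuming one op = one completed ping-pong round and a hypothetical per-pair round counter (rounds) kept in the same shared mapping and bumped by each ping thread:

#include <atomic>
#include <chrono>
#include <thread>

// Sample the shared round counters over a window and report Mops.
double measure_mops(const std::atomic<long>* rounds, int N, int seconds) {
  long before = 0, after = 0;
  for (int i = 0; i < N; i++) before += rounds[i].load();
  std::this_thread::sleep_for(std::chrono::seconds(seconds));
  for (int i = 0; i < N; i++) after += rounds[i].load();
  return (after - before) / (seconds * 1e6);  // million rounds per second
}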

While it might seem intuitive that context switching between threads in different processes should perform similarly to switching between processes, there are some important differences between thread and process context switching that impact performance.

Would you like to explore the engineering mechanics behind context switching and why switching between threads is generally more efficient than switching between processes?

Indeed, in the given example both cases incur a pair of context switches in each ping-pong round. In CASE-1, a ping-pong round triggers a switch between two threads that belong to the two different processes; similarly, in CASE-2, each round triggers a switch between a pair of processes. In this regard, we would expect no difference between the two cases.

Regarding thread switching vs. process switching: technically, threads share the heap/code segment with their parent process, so switching between two threads of the same process does not change the address space (no page-table switch, and correspondingly less state to save and restore). As a result, thread switching should be no slower than process switching.

In our example, we believe that thread switching in CASE-1 must save registers and switch the address space, just like process switching in CASE-2, because each pair of threads in CASE-1 belongs to different processes. In this regard, CASE-1 should be almost the same as CASE-2. What confuses us is why CASE-1 (inter-process thread communication) is so much worse than CASE-2 (inter-process communication).

Let's quickly review the mechanics behind context switching and why switching between threads is generally more efficient than switching between processes.

Thread vs. Process Context Switching: Key Differences

  1. Context Size and Overhead:

• Threads within the same process share the same memory space, which includes code, data, and resources like open file descriptors.

• When switching between threads of the same process, only CPU registers and stack pointers need to be saved and restored.

• Processes, however, have separate memory spaces and require saving more state, including memory maps, virtual memory contexts, and page tables.

Impact: Process switches are more expensive because of the larger state that needs to be stored and restored.

  2. Memory Management Overhead:

• Threads share memory, so no Translation Lookaside Buffer (TLB) flush is needed during a switch between threads of the same process.

• Process switches, on the other hand, often require a TLB flush, which invalidates cached mappings of virtual to physical memory, incurring a performance penalty.

  3. System Call Overhead:

• Synchronization between threads of the same process can often be handled with lightweight primitives (like mutexes) largely in user space, avoiding a full trip into the kernel.

• Process switching involves more kernel overhead, including scheduling decisions made by the kernel and higher system call costs.

  4. CPU Cache Efficiency:

• Thread switching tends to be more cache-friendly because threads share the same memory and address space, reducing cache invalidation.

• Process switching can invalidate portions of the CPU cache, causing performance hits when the next process reloads the cache.

  5. Scheduler Treatment:

• In modern operating systems, threads are treated as individual schedulable entities, meaning that switching between threads of different processes can involve almost as much overhead as process switching. However, switching between threads within the same process is still cheaper (a quick way to count such switches is sketched after this list).
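As a diagnostic of how often the kernel is actually switching, one can read the per-thread context-switch counters before and after a run. This is a measurement aid, not part of the original benchmark; getrusage() is standard, while RUSAGE_THREAD is Linux-specific.

#include <cstdio>
#include <sys/resource.h>  // RUSAGE_THREAD needs _GNU_SOURCE on glibc
                           // (g++ defines it by default)

// Print how many voluntary (e.g., sched_yield) and involuntary
// (preemption) context switches the calling thread has incurred.
void report_switches(const char* label) {
  struct rusage ru;
  getrusage(RUSAGE_THREAD, &ru);
  std::printf("%s: voluntary=%ld involuntary=%ld\n",
              label, ru.ru_nvcsw, ru.ru_nivcsw);
}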

Conclusion: Why Thread Switching is Typically Faster

Switching between threads is typically faster than process switching due to the reduced need for memory management, state saving, and cache invalidation. The overhead difference becomes more apparent when many short-lived tasks are involved. That said, switching between threads belonging to different processes approaches the cost of process switching, as the kernel treats them as separate schedulable units.

In practical scenarios, if performance is a priority, developers often use multi-threading within a single process to avoid the higher cost of context switching between processes. However, in high-security or isolation-critical environments, multi-process architectures are still preferred despite the cost.

Yep, you provided a systematic and detailed explanation. Still, it's quite strange that inter-process thread-level ping-pong cannot scale as well as direct inter-process ping-pong.

Yes, it is indeed strange.

Often, design trade-offs (and flaws) rooted in specific hardware and software limitations, which are usually invisible when writing simple programs, can constrain systems in unexpected ways.

I’m currently working on a parallel processing project, and the code is far more complex than basic ping-pong. Almost every day, I encounter something “strange” that requires debugging and optimization!

How deeply you want to explore these challenges depends on how much time you have and how skilled you are at analyzing the system. It's a process that involves understanding how both the hardware and software stacks impose constraints at each step.


I couldn't agree with you more. My students and I will follow up on this question in our OS courses. Also, thanks for your reminders and encouragement.