The question concerns the scalability of ping-pong communication between threads/processes. Intuitively, context switching between threads belonging to different processes should be no worse than context switching between processes. However, our experiments show otherwise: ping-pong communication between threads of two processes (CASE-1) achieves only 1/10 of the throughput of ping-pong communication between processes (CASE-2), using the same data structure.
data structures:
ping and pong are two atomic counters shared between processes via mmap, where ping++, pong++, ping--, pong-- denote atomic FAA(ping, 1), FAA(pong, 1), FAA(ping, -1), FAA(pong, -1) respectively. Both counters are initialized to 0.
workflow:
ping-side:
int count = 0;                     // local variable
while (true) {
    ping++;                        // FAA(ping, 1): post a message
    count++;
    while (count > 0) {            // spin until the peer answers
        if (!pong) sched_yield();
        else break;
    }
    if (pong) {
        pong--;                    // FAA(pong, -1): consume the answer
        count--;
    }
}
pong-side:
while (true) {
    while (!ping) {                // spin until a message arrives
        sched_yield();
    }
    pong++;                        // FAA(pong, 1): answer
    ping--;                        // FAA(ping, -1): consume the message
}
CASE-1
pid_t pid = fork();
if (pid > 0) {                     // parent: ping-side threads
    for (int t = 0; t < N; t++) {
        thread([&](int tid) {
            // ping-side operating on ping[tid]/pong[tid]
        }, t);
    }
} else if (pid == 0) {             // child: pong-side threads
    for (int t = 0; t < N; t++) {
        thread([&](int tid) {
            // pong-side operating on ping[tid]/pong[tid]
        }, t);
    }
}
CASE-2
for (int t = 0; t < 2 * N; t++) {
    pid_t pid = fork();
    if (pid == 0) {                // child
        if (t < N) {
            // ping-side operating on ping[t % N]/pong[t % N]
        } else {
            // pong-side operating on ping[t % N]/pong[t % N]
        }
        exit(0);                   // child does not continue the spawn loop
    }
}
wait_all();                        // parent reaps all 2 * N children
Single-threaded (N = 1):
Suppose we create only two processes. In CASE-1, each process runs a single thread that plays ping-pong with the thread of the other process, yielding 1.2 Mops (million operations per second). In CASE-2, we fork two processes that play ping-pong directly, which also yields 1.2 Mops.
Multi-threaded (N = 96):
However, when we scale up to 2 * N threads/processes, CASE-1 delivers only 5 Mops (96 threads in each of the two processes), whereas CASE-2 delivers 60 Mops with 96 x 2 processes.