Memory Barriers for (Ubuntu) Linux (i686)

Hi all,

(Hope this is the right forum for this question)

I have some multi-threaded C code (compiled with GCC 4.4.3) which accesses shared variables. Although I've marked those variables with volatile to guard against compiler re-ordering, I'm concerned that processor out-of-order execution may cause my code to fail, and I'm looking for a "low-cost" method of guaranteeing ordering is maintained in my code.

For example, I have something like...

memset(&task, 0, sizeof(task_t)); /* zero the memory */
task.id.prefix = prefix_id;
task.id.instance = instance_id;
/* write memory barrier required here */
task.state = task_ready;

Where I need to ensure that the task "state" is only set to "task_ready" after the previous writes have been committed. As the task is shared between threads, another thread seeing the state as "ready" may try to access its member variables, so it's vital that the task's "prefix" and "instance" have been updated first.

I know this is a common problem, and mutexes and semaphores provide built-in memory barriers that address it, but I'm trying to build a scalable application and I want to avoid their use if possible. I also know GCC provides built-in atomic operations, but I see they involve locking the data bus, and I've heard about kernel primitives like "smp_wmb()", but I'm not sure how to incorporate those into a user-space program as they are platform dependent.
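To make it concrete, here's a minimal sketch of the pattern I'm after using one of GCC's builtins (the struct definition is only an assumption for illustration, and I don't know whether __sync_synchronize() is the cheapest way to get the barrier):

    /* Minimal sketch only -- the struct layout is assumed for illustration. */
    #include <stdint.h>
    #include <string.h>

    typedef enum { task_idle = 0, task_ready = 1 } task_state_t;

    typedef struct {
        struct { uint32_t prefix; uint32_t instance; } id;
        volatile task_state_t state;
    } task_t;

    void publish_task(task_t *task, uint32_t prefix_id, uint32_t instance_id)
    {
        memset(task, 0, sizeof(*task));   /* zero the memory */
        task->id.prefix   = prefix_id;
        task->id.instance = instance_id;

        __sync_synchronize();             /* GCC's full memory barrier: the id
                                             fields become visible before the
                                             state flip */
        task->state = task_ready;         /* readers may now treat the task
                                             as ready */
    }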

Therefore, can anyone provide pointers or advice on how best (in terms of scalability and speed) to guarantee that ordering is maintained?

Thanks.

I know you want to avoid mutexes, but this is really what they're for. You could try using Linux futexes, which handle the uncontended case completely in userspace, but I'm not sure what that'd do for memory barriers (this may have something to do with a special mmap flag to create the futex pages, though I think futexes have changed several times since I began fiddling with them) -- and modern Linux NPTL pthreads is based on futexes anyway, so you'd just be reinventing pthreads.

Besides, what if you need to move it to MIPS or something?

[edit] One thought does occur to me. How large are these structures? Could you perhaps reorder the structure to put prefix, instance, and state adjacent in memory? You could assemble the data in an MMX or SSE register, then overwrite several structure members in one assembly op.

Hi Corona, (small world :))

I hadn't heard of futexes until you mentioned them, but I did some reading and it seems they still use atomic instructions to update shared variables. In that case I could just use one of GCC's built-in atomic operations like "__sync_fetch_and_add" or "__sync_bool_compare_and_swap" as described here...

Atomic Builtins - Using the GNU Compiler Collection (GCC)

The thing with these is that they use the asm op-code "lock", which asserts a hardware lock on the data bus, effectively locking every other processor out of memory for the duration. Because I'm writing an application that should scale to a system with many cores, I'm discouraged by this.
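For what it's worth, this is roughly how I imagine one of those builtins being used here -- just a sketch, reusing the names from my earlier snippet, and relying on GCC's documentation that these builtins act as full barriers:

    /* Sketch: publish the state with a compare-and-swap instead of a plain
       store.  GCC documents the __sync builtins as full barriers, so the
       earlier writes to the id fields are ordered before the state change.
       (task_t, task_idle and task_ready as in the sketch above.) */
    int mark_ready(task_t *task)
    {
        /* returns non-zero if the state changed from idle to ready */
        return __sync_bool_compare_and_swap(&task->state, task_idle, task_ready);
    }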

It could be a possibility; I believe they have made advances into highly parallel architectures recently, but the project is at a research stage right now, so if I can get it to work well on x86 that's good enough for now. I like the sound of this idea though...

This could be a good solution, but I'm not sure how to do it. Do you have any examples of similar code as a guide?

prefix and instance are both uint32_t, while state is an enum (I guess that means it's a uint32_t also?).
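Would something like this be what you mean? (Pure guesswork on my part -- the layout and names are made up, and I understand a 16-byte SSE store isn't guaranteed to be atomic by the architecture, so this only illustrates the "one instruction" idea.)

    #include <stdint.h>
    #include <emmintrin.h>                    /* SSE2 intrinsics */

    typedef struct {
        uint32_t prefix;
        uint32_t instance;
        uint32_t state;                       /* the enum, assumed 4 bytes */
        uint32_t pad;                         /* pad to 16 bytes for the store */
    } task_hdr_t;

    static void publish_hdr(task_hdr_t *hdr, uint32_t prefix, uint32_t instance,
                            uint32_t ready_state)
    {
        /* _mm_set_epi32 takes its arguments high-to-low, so prefix lands at
           the lowest address when stored */
        __m128i v = _mm_set_epi32(0, ready_state, instance, prefix);
        _mm_storeu_si128((__m128i *)hdr, v);  /* all four words in one store */
    }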

Small world, how so? :)

Well, yes. It has to synchronize somehow. One way or another you must interrupt other cores with this change in status, or they may never know.

Wow, those are nice.

I think you're overreacting... Any memory I/O monopolizes the bus*; LOCK just guarantees one instruction gets two bus ops in a row.

Also: the original 8088 had precisely one instruction's worth of cache, so locking the bus stalled it instantly... The huge caches, multiple independent memory buses, and cache-coherency systems in recent NUMA systems usually let cores keep going or find something else to do. I'm not sure LOCK XCHG even forces a real memory fetch anymore (might be simple to test; I'll try to get back to you on that).

Lastly, if you're doing no mutexing, what are you doing instead -- polling? That's not going to be more efficient; untold amounts of CPU will be expended on what amounts to a while(1) loop.

I really think pthreads is still what you're looking for. They've made it as fast as they know how, significantly changing the kernel to accommodate it.

  * Exceptions exist for very special-purpose memory chips like video RAM.

Noting your help on the Programming forum too!

Are you suggesting, then, that if I used such an instruction relatively frequently (say once in a loop of maybe 100 execution statements, per core), I shouldn't notice a significant drop in the throughput of the application?

You'd expect that each core accessing the XCHG'd variable would have to get the value from memory as soon as it accessed it, though -- otherwise what use would CMPXCHG be? Not sure about this area to be honest (but I read that these atomic operations do create a memory barrier, so a core cannot execute instructions either side of said barrier out of order).

What I'm building is a thread-pool with n pthreads, equal to the number of cores (so I am using pthreads). The pthreads continually execute a list of "lightweight tasks". The user can create tasks and send "messages" between them. (If you've ever used Erlang, it's similar to the abstraction provided there, but in my case using C.)

The pthreads check the "state" of a task when they reach it in the queue; if the state isn't "ready" they simply move on to the next task (hence the pthread has more work to do and isn't polling continuously). You see what this means: as long as a pthread "eventually" discovers a task is "ready", that's okay, even if it's not asap. A lock seems unnecessary here then, but a pthread mustn't see that the task state is "ready" before its other data members have been updated (hence the need for a memory barrier).
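To give a concrete (and simplified) picture of the reader side -- the names and the queue representation here are illustrative only, with task_t and task_ready as in my earlier sketch:

    /* Simplified sketch of the worker loop described above.  run_task() is a
       hypothetical hook for executing a task.  Because state is volatile the
       compiler re-reads it on every pass; the pairing with the writer's
       barrier is what stops a "ready" task being seen with stale id fields. */
    #include <stddef.h>

    void worker_loop(task_t **queue, size_t n)
    {
        size_t i;

        for (;;) {
            for (i = 0; i < n; i++) {
                task_t *t = queue[i];
                if (t == NULL || t->state != task_ready)
                    continue;                /* not ready yet: move on */
                run_task(t);                 /* hypothetical: execute the task */
            }
        }
    }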

If using these atomic operations isn't going to impact throughput, then great they solve the problem, but even that seems like overkill when I only need to ensure that a handful of statements are executed in a certain order.

I've accounted for that possibility in the elaborate design of the "task-queues", (trust me on this).

The "thread-pool" is a customised in design, dedicated for the application sitting atop of it. This application generates a high number of tasks, and possesses much implicit parallelism. I originally used it with existing thread-pools, but found I needed more control over the allocation of "tasks" to "cores", hence I'm making my own. When there is no work it means there is no work in the application, so that's okay.

After reading your post, I think I'll run with the atomics, see how that fares. Thanks for your help (again).

P.S. On a two-core single-CPU system, the overhead of XCHG vs LOCK XCHG with five separate processes:

$ ./a.out & ./a.out & ./a.out & ./a.out & ./a.out &
12225 !Lock     time = 0 M 8 S 657 ms 205 us = 0.116 Mops/s
12229 !Lock     time = 0 M 8 S 801 ms 676 us = 0.114 Mops/s
12227 !Lock     time = 0 M 8 S 896 ms 459 us = 0.112 Mops/s
12228 !Lock     time = 0 M 8 S 958 ms 739 us = 0.112 Mops/s
12226 !Lock     time = 0 M 9 S 157 ms 723 us = 0.109 Mops/s
12228 Lock      time = 0 M 8 S 610 ms 749 us = 0.116 Mops/s
12227 Lock      time = 0 M 8 S 719 ms 860 us = 0.115 Mops/s
12225 Lock      time = 0 M 9 S 49 ms 622 us = 0.111 Mops/s
12226 Lock      time = 0 M 8 S 608 ms 304 us = 0.116 Mops/s
12229 Lock      time = 0 M 9 S 48 ms 352 us = 0.111 Mops/s

The code is a million loops of this:

                        "LOOP1:                 \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       loop    LOOP1   \n"

except once with LOCK XCHG, once with just XCHG. No significant difference.

---------- Post updated at 03:25 PM ---------- Previous update was at 03:14 PM ----------

How so?

I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads because it uses atomic ops, yet you're prepared to use atomic ops instead? It's best to write portably if possible anyway.

Looks like you're right, judging by those results. But if I ran it on 4/8/16/32 etc. cores, would it still be the case? I have a 4-core at work; I'll try it tomorrow. Although if LOCK just causes a "re-ordering" of bus access, I suppose theoretically it shouldn't impact throughput.

I want to control which queue, and ultimately which pthread, runs which task, based on the fact that in the upper application some tasks communicate very frequently and some never. Also, there is much scope for assigning equal workloads across cores (think of an n-ary tree structure where the n paths are of equal length and communication is restricted to nodes of the same path). I looked at Threading Building Blocks "tasks" at first, but found it too blunt a tool for what I want.

Not really. I originally wanted to use pthreads, but they didn't offer the high number of threads and "lightweightness" I needed (in the order of 10s of 1000s, with many short-lived threads), while user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved. So what I've done, with some inspiration from "protothreads", is provide an abstraction on top of pthreads which provides what I need in the form of lightweight tasks.

I wanted to avoid the pthreads synchronisation structures like mutexes because I sought to avoid their overhead and keep it scalable. There are ways to distribute work such that mutexes aren't necessary, as long as an ordering of instructions can be guaranteed, so, following your advice, I'll try those atomic instructions from GCC.

thanks!

Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.

I was referring to the older GNU Portable Threads...

GNU Pth - The GNU Portable Threads

"Pth doesn't require any kernel support, but can NOT benefit from multiprocessor machines."

Hmm...

"In practice, this is no problem, because multiprocessor systems are rare, and portability is almost more important than highest concurrency."

(Written some time ago I presume)

You're referring to Next Generation POSIX Threads developed by IBM though. I'm not sure, but wasn't this abandoned in favour of simpler 1:1 threading in NPTL? Anyway, I needed something with more control than was on offer from anything featuring a standard POSIX API.

The old system (usually known as LinuxThreads) has been abandoned for many moons now. It was designed to operate without modifying the kernel, which made it a very strange beast -- it created pretend-thread processes with 100% shared memory, did all communication with signal traps, and had some very... unique bugs that turned out to be fundamental design flaws (zombie threads! Wow!)

So, no. It was not a high-performance threading model. There are only a very few things (like uClibc in embedded systems) that still stick with it these days.

With kernel 2.5 and later, they built enough things into the kernel to let them do threading properly. I've shown you bits of its code -- some fundamental things are almost down to the instruction level. NPTL is much, much better, and I've found it quite good.

Yes, and the overhead you were worried about was the same atomic operations you're hellbent on using now. I've looked in its code and shown you some of it; it's not bloated.

I remain stolidly unconvinced that spinlocking is more efficient than blocking. If your writer really can keep up with your readers, a proper queue might not block at all even in pthreads.

Are you sure of that? You only discovered thread-specific data last week.

NGPT was developed by a small team which included some people - Bill Abt, for example - from IBM. Work was funded by IBM so IBM got the kudos. Yes, NGPT was shelved at v2.2 in favour of NPTL. For a while we looked at scheduler activations (see New GNU thread library) but then moved away from that concept towards a 1:1 model as proposed by Ingo Molnar. NPTL has been proven in numerous benchmarks to scale satisfactorily to tens of thousands of threads of execution.

Frankly if you wish to scale to tens of thousands of threads you have lots more problems to worry about than what you have discussed so far. See The C10K problem for example.

BTW, an alternative approach to scaling your application might be to use something like CUDA and offload the application to a GPU. Huge performance gains can be achieved if you can solve the parallel programming issues. Another approach might be to use a multithreaded parallel programming language such as Intel's Cilk.

But with those atomic operations I can avoid sending threads to sleep and having to wake them up.

I would agree if no other progress were being made, but to reiterate, the thread-pool I'm building is for dedicated use with a custom application which features high numbers of tasks. 95% of the time I would expect my threads to have tasks to execute, plenty of them. The remaining 5% would be a situation where there's no work to do, so it's not a problem.

Pretty sure, I've already built prototypes of my system in Erlang and I found a need for such control.

---------- Post updated at 12:41 AM ---------- Previous update was at 12:30 AM ----------

Are you saying it's possible to create tens of thousands of pthreads on Linux? I tried this some time ago and couldn't generate anywhere near that amount; doesn't the kernel impose strict limits on the number of threads that can be created anyway?

Thanks, I have encountered similar problems. But to explain, I'm not working at the abstraction of pthreads in the "thread-pool", but rather with very lightweight tasks of execution that involve little more than a function call, some memory and no context switching -- something along the lines of protothreads. The pthreads underneath simply run these tasks in a continuous loop.

Thanks, I have examined CUDA, OpenMP, Threading Building Blocks etc., but the nature of what I'm doing involves an expansion of state, whereas these libraries typically feature a mapping of many parallel elements onto parallel resources, assigning a "thread" of execution to each iteration of a loop, for example.

As mentioned in the previous post I began my work with Erlang, which offers 10s of 1000s of lightweight processes, communicating via message passing. But I found I needed more control than was on offer.

Your threads will mostly sleep anyway. Let me illustrate something for you.

Imagine a massive 100-core machine running 100,000 threads. Only 100 of them can run at once; 99.9% will sleep. Assuming 1000 context switches a second (the standard for desktop Linux; servers are usually 100), this machine will pick up each thread once per second, for one millisecond each.

When your 99,999 reader threads are busy, your writer gets one millisecond, once a second, to add jobs. If your 99,999 threads are out of work, your writer still gets one millisecond, once a second, to fill the job queue. But it needed 50 milliseconds, so some starve. The more threads you have, the more jobs your writer needs to add, and the less time it has to do so!

Let's jack up its priority one increment; once per second is just dumb -- voila, no more starvation. The writer gets a full 50 milliseconds per second to fill the queue -- and 950 more to suck sludge with. Oops. A high-priority thread won't be switched out to run a low one unless it blocks, so the writer's wasting 950/1000 milliseconds of an entire CPU.

Okay, maybe this super-mega-ultra-high-performance nonblocking queue writer can stand a tiny bit of blocking when we know it's sucking sludge. The polite way to tell the OS is usleep(), so we add usleep(1000) to the bottom of the loop. That's a little better, wasting just 50% CPU -- since it's still high priority it gets shoved ahead in line whenever usleep's done, every other task. Jacking the value of usleep brings it lower until it's wasting almost nothing, riding the line. Perfect.

Then a disk interrupt happens. This throws off the timing, letting readers starve. Oops -- you can't trust usleep()'s accuracy in a multitasking system!

Okay, so priority wastes time, and Blocking Is Evil. What else could be tried? Well, to run the writer thread more often without changing its priority, have more writers... One writer gets 1/50 of the time it needs, so how about fifty of them? 50 threads is a drop in the bucket when we have 100,000 of them. But how to distribute jobs to the writers in a parallel way? Another, miniature super-mega-ultra-high-performance nonblocking queue system to feed the big one? It'd need many threads, or high system priority, to get jobs to the writers in time, and a really intelligent management system of some sort to keep the right numbers of threads around with the right delays to keep things flowing efficiently.

Assuming, of course, that this is remotely efficient when so much time is spent checking for readiness instead of doing actual work. If only there were a way to tell a thread to shut up and go away until there's something for it to do...

There was an old lady who swallowed a spider,
That wiggled and wiggled and tickled inside her.
She swallowed the spider to catch the fly.
But I dunno why she swallowed that fly -
Perhaps she'll die.

You've got a mental (b)lock about (b)locking. You've got to keep perspective:

  1. In a loaded system, almost everything's sleeping all the time no matter how nonblocking your code is.
  2. Every thread that goes to sleep lets another thread wake.
  3. If it has to wait for something anyway, it's wasting time by not blocking, time another thread could've done work in.
  4. If your threads won't politely block when convenient, the OS impolitely pre-empts them when inconvenient, because:
  5. The scheduler isn't psychic. If you won't give any hints by potentially blocking, it's left free to pick the stupidest possible order and priority.
  6. You can't assume no starvation with the stupidest possible order and priority.

To sum up, blocking is there for a reason. Why wouldn't readers starve when they're 99,999 times more likely to get execution time than the writer? You're trying to build an Erlang here. Don't expect the same behavior in C.

That's an impressive reply Corona, you made your case very persuasively, but it's not "my case". The problem is you're not aware of the overall system, its context, aims and design (which is not your fault of course). I'll try and clarify a few things...

In my system, if I had a massive 100-core machine, I would only create 100 threads; if I had 8 cores, 8 threads; 16 cores, 16 threads, etc. There is a 1:1 mapping of cores to threads where each thread runs on a designated core.

Yes, I am fully aware that in a multitasking OS there are still going to be more threads and processes going on, and my threads will be put to sleep and context-switched out occasionally, but why would I increase the frequency of this by adding more superfluous threads? Sure, I could split queues up between more threads but the "illusion" will only allow fairness, which I'm not really concerned about anyway.

(Recall, the thread pool allows the creation of light-weight tasks, not threads. Threads are oblivious to the higher layer.)

But this doesn't really reflect accurately what's happening in the system. "Reading" in this case is "executing tasks"; "writing" is when a "reader" "creates a new task" (on another core's task-queue). Readers are writers and vice versa. "Add jobs" and "filling a job queue" suggest a task's work involves creating multiple sub-tasks on the other queues in one go, which isn't a frequent scenario (although it may be that a task needs to broadcast a message to n other tasks, but even in that case "lock-stepping" through the recipients isn't required).

Talking about threads being "out of work" isn't a frequent scenario either. Approx 95% of the time there's an abundance of tasks (roughly) equally distributed; the remaining 5% indicates there's no work to do anyway.

No, this isn't an accurate description of the behaviour of the system. A task (writer) doesn't necessarily create more tasks (jobs) just because there are more tasks (threads) around. Like I said, imagine an n-ary tree with paths of equal length where communication only takes place between nodes residing on the same path. (In many cases nodes on the same path will be running on the same underlying thread, hence no synchronisation is necessary for those to communicate.)

Let me put it this way, I have a list of jobs to do, as I progress through the list I find some jobs aren't ready. I can either block/sleep etc and wait for the job to become ready, or I can move onto the next task and execute that one instead. If I build my system with locks, I'll adhere to the first approach, whereas what I'm trying to achieve (aided by atomic flags and memory barriers) is the second.

Surely you're not saying that the first one is an equivalent approach to exploiting scalable parallel execution just because pre-emption is something that occurs in the OS anyway?

In my system, if I had a massive 100-core machine, I would only create 100 threads; if I had 8 cores, 8 threads; 16 cores, 16 threads, etc. There is a 1:1 mapping of cores to threads where each thread runs on a designated core.

Just out of curiosity how are you going to ensure that each thread runs on a designated core? And have you thought through the overhead of attempting to ensure that each thread runs on a designated core?

At start-up I used the pthread_setaffinity_np function. Each pthread exists on its core for the lifetime of the application, i.e.:

    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);   /* 'core' stands for whichever index this thread is pinned to */

    ret_code = pthread_setaffinity_np(thread->tid, sizeof(cpu_set_t), &cpuset);
    if(ret_code != 0)
    {
        fprintf(stderr, "error assigning thread to core\n");
        exit(ret_code);
    }

Is there a problem with doing it this way? (When I tested the code it appeared to be allocating threads to cores, correctly).

...and when you don't assume best-case, you could be going through the next few thousand not-ready tasks. Meanwhile every idle worker's doing the same thing. I begin to understand why you're concerned about contention for memory.

But because it doesn't block, you'll be wasting time scanning the list anyway, time that could have been spent doing actual work. And since your system's as busy when idle as it is when actually busy, you'll have a difficult time guessing how much. If I read you correctly, the jobs are all tiny. How tiny? How much more work is it to do a job than to scan the list? (And don't assume it could never, ever become mostly empty; that's the goal, not the proof.) If they're even in the same ballpark, you're going to waste a worrying proportion of CPU time scanning your list.

Anyway, the queue needn't block the way you're describing. Put jobs in the queue when they become ready, don't just stick them there in advance; that way threads won't block when picking up jobs unless you're actually out of jobs -- in which case you want them to block. If the queue's big enough and jobs can be added fast enough, things can run smoothly. You can also do other things to streamline the queue -- hand out jobs 16 at a time instead of one at a time, switch between multiple queues, etc.
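Something along these lines, say -- a bare sketch with a fixed-size ring buffer, no overflow handling, no error checks and initialisation left out, using your task_t from before, just to show that readers only ever block when the queue really is empty:

    #include <pthread.h>
    #include <stddef.h>

    #define QCAP 1024                         /* illustrative capacity */

    typedef struct {
        task_t         *slots[QCAP];
        size_t          head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty;
    } job_queue_t;

    /* called only when t is actually ready to run */
    void queue_push(job_queue_t *q, task_t *t)
    {
        pthread_mutex_lock(&q->lock);
        q->slots[q->tail] = t;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* blocks only when there is genuinely nothing to do */
    task_t *queue_pop(job_queue_t *q)
    {
        task_t *t;

        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        t = q->slots[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return t;
    }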

Bear in mind though that the "best" case is also the typical one here.

Not a problem here, the idle workers are searching through separate queues.

Fair enough, in rare cases, it could behave like that. Perhaps the time gained by not blocking would be spent searching.

Varies; some tasks are tiny and short-lived, others persist and involve a greater amount of work... at a rough guess 50-50.

Okay, let's explore that for a minute. This queue would have to be able to grow and shrink dynamically, right? So not only would there be a lock for accessing the queue, there'd also be a lock for allocating the memory on the heap. In addition, the repeated following of a pointer into allocated memory would likely cause repeated cache misses, which would also degrade performance.

So let's say, instead of allocating one task at a time, I allocate a chunk of tasks at a time, say 100, using calloc for example. Now my tasks are contiguous in memory and I've reduced the heap contention to a 100th, while improving the cache hit rate because I can bring multiple tasks into the cache in one go. Great, but many tasks don't persist for long, so why not recycle the memory in the list when a task has completed, and then I don't have to allocate more chunks quite so often.

Now how do I achieve that? With a flag to say the task is ready, perhaps. It has to be atomic though, because it's shared between threads, and I have to ensure instructions are not reordered, by using memory barriers.
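Roughly what I have in mind -- purely a sketch, with the chunk size and names made up, and the state values taken from my earlier snippet:

    #include <stdlib.h>

    #define TASKS_PER_CHUNK 100               /* illustrative chunk size */

    typedef struct task_chunk {
        task_t             tasks[TASKS_PER_CHUNK];   /* contiguous tasks */
        struct task_chunk *next;                     /* next chunk, if any */
    } task_chunk_t;

    static task_chunk_t *alloc_chunk(void)
    {
        /* calloc zeroes the memory, so every slot starts out idle/free */
        return calloc(1, sizeof(task_chunk_t));
    }

    static void recycle_task(task_t *t)
    {
        /* mark the slot free for reuse; the CAS is documented by GCC as a
           full barrier, so the slot can't appear free before the task's
           other fields have settled */
        __sync_bool_compare_and_swap(&t->state, task_ready, task_idle);
    }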

But I'm not sure about memory barriers. I know, I'll ask the folks on the Unix forum... :)

You see where this is leading.