semaphore access speed

AND the -l flag.

Otheus -
utimes is the call made by

 time <command> 

It gives cumulative user and kernel times, as well as cumulative process wall time.
Call it for a starting set of values and an ending set of values, then subtract for a delta value.

Sorry - a little off track - I don't see where the OP has tried running ltrace with syscalls enabled - or strace.

The differences probably are due to OS implementation - maybe a lot of calls to brk in one OS and not the other. It's worth the 2 minutes it takes to execute Otheus' code under strace on each box. That would rule out an odd kernel setting or some implementation "feature" as the root cause of the differences. If you see odd behavior, like a lot of system calls on one box and not on the other, maybe someone can relate that to something useful.

I think you are confused. utime() and utimes() operate on an inode to change timestamps of files.

to Otheus:
gprof -l complains about missing call-graph data, which I don't quite understand, but that is beside the point in this thread.

to Jim:

On SCO
$ truss ./tstloop
semsys(1, 2819742, 2, 0, 0, 0) = 18827
semsys(0, 18827, 2, 6, 2147483024, 0) = 0
.
.
.
repeats on and on for 5,000,000 times, as coded.
I don't see any brk here.

On Linux
$ strace -Tc ./tstloop
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.85   62.145774          29   2134526           semctl
 49.15   60.059067          28   2134527           semget
  0.00    0.000088          18         5           old_mmap
  0.00    0.000054          18         3           mprotect
  0.00    0.000039          20         2           open
  0.00    0.000028          14         2           fstat64
  0.00    0.000025          13         2           close
  0.00    0.000020          20         1           munmap
  0.00    0.000018          18         1           read
  0.00    0.000018          18         1           uname
  0.00    0.000018          18         1           stat64
  0.00    0.000014          14         1         1 access
  0.00    0.000014          14         1           set_thread_area
  0.00    0.000012          12         1           time
  0.00    0.000008           8         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00  122.205197               4269075         1 total

when I try
$ strace ./tstloop
semget(1660977153, 2, IPC_CREAT|0777) = 32769 <0.000028>
semctl(32769, 2, IPC_64|GETALL, 0xfeead064) = 0 <0.000029>
.
.
.

Does this output clarify anything?

Just to be clear, did you try "gprof -p -l" ?

Also, according to this page, you can run "truss -c" on SCO to get similar summary results. If we're lucky, you get per-system-call times, as shown above by strace -Tc on Linux. Then you can get what we're really after -- how much time Linux spends in each system call versus SCO.

SCO has kernel parameter issues with semaphores. The SHMMAX parameter by default is small. This is a known issue. Here is the SCO docset for installing Postgres that discusses it.

Managing Kernel Resources

There are directions there for viewing the SHMMAX setting. If it is at the default, consider raising the value and see if that resolves the shared memory problems you are having.

Otheus -

you are correct - it's times(), which uses struct tms, not utimes(). That was my bad.

Jim, Are you thinking that the size of SHMxxx parameters influences the performance?

Migurus,

What do you have set for your Linux shared memory settings? "/sbin/sysctl -a |grep shm". (Ignore errors). Compare those to what is set for SCO (I don't know how to get those).

My version of truss does not have a -c flag, or any other flag similar to the ones I used with strace under Linux.

On Linux :
$ /sbin/sysctl -a |grep shm
... <snip error msgs>
vm.hugetlb_shm_group = 0
kernel.shmmni = 4096
kernel.shmall = 2097152
kernel.shmmax = 33554432

Q: Why would SHM parameters affect semaphore performance?

Also, just in case I ran this
$ /sbin/sysctl -a |grep sem
<snip error msgs>
kernel.sem = 250 32000 32 128

*Possibly* because of the number of page tables and structures required to make use of all that memory. It might result in every op requiring two to three cache misses per call. If there are no cache misses, because shmem is smaller, then maybe it runs 3 times as fast. Jim??

I'd like to ask the gurus where else I can post my question. Would you recommend any other group or forum?
Your suggestions would be appreciated.

As I previously suggested: LinuxQuestions.org.

Since the bottleneck appears to be within the system call, I suggest you also look at Kernel-related BB's (KernelTrap.org is a good one).

Thanks Otheus and Jim, I got quite detailed answer here:
semaphore access speed | KernelTrap

So, the 2.6.9 kernel is not the best for modern h/w.

I appreciate everybody's time!

Kudos to you for your tenacity! However, I don't think this is the end of it.

I did a little research on strcmp's answer. 2.6.9 was released in 2004 and is standard with RHEL 4, which shipped with glibc 2.3.4. Pentium 3's were old in 2004. RHEL 5 ships with kernel 2.6.18 and glibc 2.5.12.

So I did some benchmarks.

I followed strcmp's suggestion and used a "falling timer" method, where the loop starts and ends after the time() call notes a change in seconds. There's a 10 to 100 ms variance on either side of the fall, so I took an average of several runs. Then I divide the CPU speed (cycles/s) by the ops/s number to get "tics per op".

  • 2.6.18 / P3 / 800 MHz: 548300/s (average, 19 runs) = 1459 tics/op
  • 2.6.18 / AMD Opteron 285 / 2.6 GHz: 1689138 (avg 6 runs) = 1539 tics/op
  • 2.6.18 / AMD Opteron 270 / 1.0 GHz: 974228 (avg 7 runs) = 1026 tics/op
  • 2.6.9 / Xeon / 3.6 GHz: 917196 (avg, 4 runs) = 3925 tics/op
  • 2.6.9 / P3 / 1.25 GHz : 733927 (avg, 5 runs) = 1703 tics/op
  • 2.6.9 / Xeon / 2.3 GHz: 1127894 (avg, 10 runs) = 2608 tics/op

For tics/op, smaller is better. So the 2.6.18 kernel is indeed faster than the 2.6.9 kernels. The Xeon is MUCH slower. Presumably the kernels were compiled for a lowest common denominator. No optimization flags were enabled, but there was a difference in compilers: the 2.6.9 hosts used gcc 3.4.6, while the newer ones were built with gcc 4.1.1. Also, it should be noted that we don't have an AMD running 2.6.9 nor a Xeon running 2.6.18.

It may very well be that the problem is that these kernels were not compiled optimally for the various architectures. Why the Xeons are so much slower is quite surprising, given their characteristic use as HPC components.

Regardless, none of these results seem to explain the fundamental question: Why is SCO so much faster??

Here is my code:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#define NSEMS   2

/* change this per CPU to run between 8 and 12 s */
static const int maxloop = 10000000;

int main(int argc, char *argv[])
{
    time_t start,last,stop;
    long int i;
    int estimate = 100;
    int sid;
    key_t key;
    ushort vals[NSEMS] = { 0, 0 };

    key = ftok("/tmp",99);

    /* spin until the second changes, so measurement starts on a boundary */
    last=start=time(NULL);
    for (i = 0; i < 1000; ++i) {
        usleep(10);
        last=time(NULL);
        if (last > start) break;
    }
    start=last;
    last = 0;

    /* first 7/8ths of the loop: no time checks at all */
    for (i = maxloop/8; i < maxloop; i++) {
      if ((sid = semget(key, NSEMS, IPC_CREAT | 0777)) == -1) {
          perror("Can Not Get Semaphore ID");
      }
      if (semctl(sid, NSEMS, GETALL, vals) == -1) {
          perror("Can Not Get Semaphore Values");
      }
    }

/* do the last 1/8th until the second changes.
    If your processor reaches the maxloop before that,
    change the maxloop or the divisor or the "estimate" */

    stop=time(NULL);
    for (i = maxloop - maxloop/8; i < maxloop; ++i) {
      if ( !(i % estimate) ) {
        last=time(NULL);
        if (last > stop) break;
        stop=last;
      }

      /* repeat semaphore ops */
      if ((sid = semget(key, NSEMS, IPC_CREAT | 0777)) == -1) {
          perror("Can Not Get Semaphore ID");
      }
      if (semctl(sid, NSEMS, GETALL, vals) == -1) {
          perror("Can Not Get Semaphore Values");
      }
    }
    stop=last;

    printf("%.2f semop/s (%ld/%ld) [%d]\n",
           (double)i/(stop-start), i, (long)(stop-start), estimate);
    return 0;
}

Thanks for the code, when I ran it I got this:
2.6.9/Xeon/3.2 GHz
(3 runs)
129186.76 semop/s (8784700/68) [100]
129367.65 semop/s (8797000/68) [100]
129257.35 semop/s (8789500/68) [100]

SCO/Xeon/3.2 GHz
(3 runs)
527641.18 semop/s (8969900/17) [100]
564212.50 semop/s (9027400/16) [100]
535800.00 semop/s (9108600/17) [100]

I see SCO is capable of doing roughly four times as many semop/s as Linux.

BTW, the h/w is identical, I believe I have it posted in this looong thread.

For Linux, I will re-post the s/w versions here:
Fedora 2.6.9
gcc - ver.3.4.3
ldd - ver. 2.3.5
glibc - ver 2.3.5

This is a little off-topic, but I am afraid I do not fully understand the code.

My understanding is that the purpose of the initial loop is to start the whole measurement right at the change of a second. Right?

Next,
we have two loops: the first runs 8,750,000 times, given the maxloop value of 10,000,000,

Then we run a loop of up to 1,250,000 iterations, where every 100th iteration we check the time and break out if the second has changed.

What is this technique of checking only every Nth iteration for? Why not check the time on every iteration? Is it because that would add too many extra time() calls and muddy the measurements?

Your clarification would be appreciated.

That's quite ON topic and one reason I posted the code.

right.

You answered it. AFAIK each time() call involves a system call. I think there is a better way of handling this, but this is the first that came to mind.

I do appreciate you checking the code. That I'm prone to failure may be a modest understatement.

I checked the kernels. There were major changes in 1995 (before version 2.0), 1998 (before version 2.4), and then again with the introduction of the "O(1) scheduler" (not sure). Here's the code -- unchanged since 2.4 -- that does semctl(GETALL):

        sma = sem_lock(semid);
/* check condition omitted */
        nsems = sma->sem_nsems;
        err=-EIDRM;
        if (sem_checkid(sma,semid)) goto out_unlock;
        err = -EACCES;
        if (ipcperms (&sma->sem_perm, (cmd==SETVAL||cmd==SETALL)?S_IWUGO:S_IRUGO))
                goto out_unlock;

	case GETALL:	{
		ushort __user *array = arg.array;
		int i;

		if(nsems > SEMMSL_FAST) {
/* omitted code -- relevant only when nsems > 256 */
		}

		for (i = 0; i < sma->sem_nsems; i++)
			sem_io[i] = sma->sem_base[i].semval;
		sem_unlock(sma);
		err = 0;
		if(copy_to_user(array, sem_io, nsems*sizeof(ushort)))
			err = -EFAULT;
		goto out_free;

Here's the code from 2.0:

        case GETALL:
                if (ipcperms (ipcp, S_IRUGO)) return -EACCES;
                switch (cmd) {
/* omitted irrelevant code */
                case GETALL:
                        array = arg.array;
                        i = verify_area (VERIFY_WRITE, array, nsems*sizeof(ushort));
                        if (i)
                                return i;
                }
                break;
/* skipping case statements */

        if (semary[id] == IPC_UNUSED || semary[id] == IPC_NOID)
                return -EIDRM;
/* the next line provides the sem_checkid() call from 2.4/2.6 code */
        if (sma->sem_perm.seq != (unsigned int) semid / SEMMNI)
                return -EIDRM;

        switch (cmd) {
        case GETALL:
                if (ipcperms (ipcp, S_IRUGO))
                        return -EACCES;
                for (i = 0; i < sma->sem_nsems; i++)
                        sem_io[i] = sma->sem_base[i].semval;
                memcpy_tofs (array, sem_io, nsems*sizeof(ushort));
                break;
 

When you break it down, the only difference is the sem_lock() call, which is needed on multi-CPU systems. It could be that SCO is similarly limited. Why don't you try a Linux 2.0 distribution, like RedHat 5.2, which uses Linux 2.0.36 (and the same sem code as above)? Install it and benchmark the same code. That would be a great help to all of us, I think.

Migurus,

I emailed the maintainers of the code. This is a response I got back from Alan "Maddog" Cox (with permission to post here):

So here is yet another version using gettimeofday(). Don't bother posting the benchmarks here, unless they are significantly different. But prepare them for bugzilla:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#define NSEMS   2

static const long maxseconds = 1000;

int main(int argc, char *argv[])
{
    struct timeval tod_start,tod_stop;
    long int start,stop;
    long int maxloop = 5000000;
    long int i;
    int sid;
    key_t key;
    ushort vals[NSEMS] = { 0, 0 };

    if (argc > 1)
      maxloop = atol(argv[1]);
    i = maxloop;

    key = ftok("/tmp",99);

    gettimeofday(&tod_start, NULL);
    while (i--) {
      if ((sid = semget(key, NSEMS, IPC_CREAT | 0777)) == -1) {
          perror("Can Not Get Semaphore ID");
      }
      if (semctl(sid, NSEMS, GETALL, vals) == -1) {
          perror("Can Not Get Semaphore Values");
      }
    }
    gettimeofday(&tod_stop,NULL);

    /* only the difference matters; the maxseconds offset just keeps the numbers smaller */
    start = 1000*1000*(tod_start.tv_sec - maxseconds) + tod_start.tv_usec;
    stop  = 1000*1000*(tod_stop.tv_sec  - maxseconds) + tod_stop.tv_usec;

    printf("%.2f semop/s (%ld/%ld)\n",
      (double)maxloop/(stop-start)*1000*1000, maxloop, stop-start);
    return 0;
}