semaphore access speed

I am investigating a locking scheme that uses semaphores. To evaluate basic system speed, I run a loop that repeatedly fetches some semaphore info and displays it:

while : ; do ./semshow; done > res.txt

I ran this on 3 boxes: two similar modern HP Xeon boxes, one running SCO OpenServer 5 and the other running Fedora (kernel 2.6.9), plus one old PIII box under a recent Linux (no further details).
The results are very counterintuitive:

H/W    OS                              avg number of runs per sec
Xeon   SCO                             1700
Xeon   Fedora                          500
PIII   Linux (recent distro, unknown)  900

All three systems were pretty much idle at the test run time.

I would like to ask: what factors would make an ancient Unix outperform modern OSes, and how does a PIII box outperform a modern Xeon box under a similar OS? Any pointers would be appreciated.

The semshow program is very basic; see the listing below:

 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/timeb.h>
#include <time.h>
#include "semlib.h"

#define MODE_CREATE     0
#define MODE_REMOVE     1

key_t   IPCKEY;

int main(int argc, char *argv[])
{
        int             sid;
        pid_t           last_rpid, last_wpid;
        char            dbuf[80];
        union semun     arg;
        unsigned short  vals[NSEMS];
        struct timeb    tb;
        struct tm       *tp;

        if((IPCKEY = get_ipc_key()) == -1)
        {
                errexit("Can Not Obtain IPC Key");
        }
        if((sid = semget(IPCKEY, NSEMS, 0)) == -1)
        {
                errexit("Can Not Get Semaphore ID");
        }
        memset(vals, 0, sizeof(vals));
        arg.array = &vals[0];
        if(semctl(sid, NSEMS, GETALL, arg) == -1)
        {
                errexit("Can Not Get Semaphore Values");
        }
        if((last_rpid = semctl(sid, RDLOCK, GETPID)) == -1)
        {
                errexit("Can Not Get Semaphore R-Pid");
        }
        if((last_wpid = semctl(sid, WRLOCK, GETPID)) == -1)
        {
                errexit("Can Not Get Semaphore W-Pid");
        }
        ftime(&tb);
        tp = localtime(&tb.time);
        strftime(dbuf, sizeof(dbuf) - 1, "%T", tp);
        printf("%12li.%03i %s  RD:[%i]  WR:[%i] %i/%i\n",
                (long)tb.time, tb.millitm, dbuf,
                vals[RDLOCK], vals[WRLOCK],
                (int)last_rpid, (int)last_wpid);
        exit(0);
}

semlib.h has these defs:

 
#define NSEMS   2
#define RDLOCK  0
#define WRLOCK  1
union   semun   {
        int             val;
        struct semid_ds *buf;
        unsigned short  *array;
};
 

Here's one answer: your time measurement is buggy. See the man page for ftime() under Linux.

The Fedora box has
/lib/tls/libc.so.6 -> libc-2.3.5.so
so I assume the glibc version is 2.3.5.

But that is beside the point.

The sheer number of times the system ran the program within a second should indicate the overall power of the system, so to speak. To my astonishment, the PIII under Linux managed to run almost twice as many cycles within a second as the Xeon 3.2 GHz under Linux. Then the same Xeon hardware under old SCO OSR ran twice as fast as the PIII box.

But Problem #1 is that you are measuring this the wrong way. You have too many dependencies on library calls that are irrelevant to what you are trying to measure. Take out ALL the semaphore code, re-test, and see what times you get (a sketch of such a control appears below). Or run a 10,000x loop around the semaphore calls, then scale down so that microsecond mistimings don't come into play.

Problem #2 is that semaphores require bus access and are relatively independent of processor speed.
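As a point of reference, here is a minimal sketch of the "no semaphore" control described above (the file name nulltest.c is made up for illustration): it only prints a timestamp and exits, so running it under the same while-loop measures pure fork/exec, printf and process-exit overhead.

#include <stdio.h>
#include <time.h>

int main(void)
{
        /* one timestamped line, no IPC calls at all */
        printf("%12li\n", (long)time(NULL));
        return 0;
}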

I replaced ftime() with gettimeofday() and re-ran; the results are the same.
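For reference, the gettimeofday()-based timestamp code looks roughly like this (a sketch, not necessarily the exact change I made):

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
        struct timeval  tv;
        struct tm       *tp;
        char            dbuf[80];

        /* same fields as the ftime() version: seconds, milliseconds,
           and an HH:MM:SS string */
        gettimeofday(&tv, NULL);
        tp = localtime(&tv.tv_sec);
        strftime(dbuf, sizeof(dbuf) - 1, "%T", tp);
        printf("%12li.%03li %s\n",
                (long)tv.tv_sec, (long)(tv.tv_usec / 1000), dbuf);
        return 0;
}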

To summarize again:
same h/w: the program under SCO runs 3 times faster than under Linux;
same OS: the program runs 2 times faster on the old PIII than on the modern Xeon.

Any reasons? Any ideas?

I'm sorry, but you still haven't eliminated all the "loose variables". Put your semaphore code in a loop of 1000 or so and bench again.

I simplified my program to the bare minimum: it gets the IPC key, gets the semaphore ID, reads the semaphore values, and then outputs one line (time and semaphore values). That is all it does.

Then I run it in the following manner:
$ while : ; do tstshow; done > x.txt
and I hit the interrupt key after waiting for some time.

sample of the x.txt file content:
1221785538 [1,99]
1221785538 [1,99]
1221785538 [1,99]

Then, to see how many times the system was able to run it, I do:
$ cut -c-12 x.txt | uniq -c

sample of the output:
616 1221785538
615 1221785539
612 1221785540

That tells me the system is capable of running this simple process only roughly 600 times per second.

Just to be consistent, I will repeat the results here:
PIII under Linux - 900 times/sec
Xeon 3.2 Linux - 600 times/sec
Xeon 3.2 SCO - 1800 times/sec
The systems were idle at the time of the test.
How would you explain the results I got? I am trying to find a bottleneck on the modern h/w running Linux.

Please see my new simplified code below:

 
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#define NSEMS   2

/* get_ipc_key() and errexit() come from my semlib code */

int main(int argc, char *argv[])
{
        int     sid;
        ushort  vals[NSEMS] = {0, 0};

        if((sid = semget(get_ipc_key(), NSEMS, 0)) == -1)
        {
                errexit("Can Not Get Semaphore ID");
        }
        if(semctl(sid, NSEMS, GETALL, vals) == -1)
        {
                errexit("Can Not Get Semaphore Values");
        }
        printf("%12li [%i,%i]\n", (long)time(NULL), vals[0], vals[1]);
        exit(0);
}

Since you don't seem to follow instructions, I don't expect you to benchmark the following code. In case you do, however, and your results are still wildly unexpected, please make a note of it here.

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#define NSEMS   2

int main(int argc, char *argv[])
{
    time_t start, stop;
    long int i;
    int sid;
    key_t key;
    ushort vals[NSEMS] = { 0, 0 };

    key = ftok("/tmp", 99);
    start = time(NULL);
    for (i = 0; i < 5000000; i++) {
        if ((sid = semget(key, NSEMS, IPC_CREAT | 0777)) == -1) {
            perror("Can Not Get Semaphore ID");
        }
        if (semctl(sid, NSEMS, GETALL, vals) == -1) {
            perror("Can Not Get Semaphore Values");
        }
    }
    stop = time(NULL);
    printf("%.2f semop/s [%i,%i]\n", (double)i / (stop - start), vals[0], vals[1]);
    /* note: main() falls off the end without a return value, so the exit
       status is indeterminate (in practice, often printf()'s return value) */
}

Just for emphasis, your results include the overhead of starting a large number of processes, and in each of them, printing a line of text to a buffered block device, and exiting the process. Bottlenecks in any of these operations are much more likely to explain the results than the thing you are trying to test.

I compiled and ran the loop test code posted by Otheus.

Xeon/SCO results: 555555.56 semop/s [0,0]
Xeon/Linux results: 128205.13 semop/s [0,0]

I don't have access to the PIII box right now, but since the results are proportionally the same as what I saw in my own test, I don't think it is crucial to have the PIII results here.

As for the overhead mentioned by Era: I am aware of it, and it closely resembles the situation I am evaluating, so it is a valid test and not an oversight.

So, does anyone have any idea what the bottleneck could be, or where to look? I really appreciate your patience with me and your persistence in getting to the bottom of it.

Accessing a semaphore - assuming there is no deadlock - is a direct memory access operation, but adding a whole page (or several) to process memory has overhead. It can involve an expensive call, brk(), if no memory is already there.

It depends on what

size mycode

reports in total, rounded up to at least a PAGE_SIZE (multiple) boundary (usually a stack frame boundary), which can leave several extra pages of memory. As a start, look at what is shown as allocated to the heap. You can call sbrk(0) to find the end of process memory.
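A minimal sketch of that check (assumed here, not taken from the posted code): print the break before and after the IPC calls and see whether the heap grows.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        void *before, *after;

        before = sbrk(0);       /* current end of the data segment */
        /* ... the semget()/semctl() calls would go here ... */
        after = sbrk(0);
        printf("break before: %p  after: %p\n", before, after);
        return 0;
}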

shmget allocates from the heap in most implementations. The /proc/sys/kernel/sh* proc entries have shared memory information; kernel settings control shared memory operations.
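(For semaphores specifically, the corresponding limits are in /proc/sys/kernel/sem - SEMMSL, SEMMNS, SEMOPM, SEMMNI - and ipcs -l prints a summary of the SysV IPC limits.)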

Try this:

gcc -pg -g -o otheus otheus.c
./otheus
gprof otheus

That will display cumulative time spent and the number of calls for each of the function calls. You can see if/where there is a problem. If there is a problem with semaphores, then semget will be your most likely culprit - allocating pages of memory.

PS: call utimes(struct tms *) to get actual times; the granularity is CLK_TCK, usually much better than time().

gettimeofday() can be used to get wall time even more accurately as well. utimes() returns clock_t wall time.

Jim,
I never tried profiling before, so I ran into some problems here:
$ gcc -pg -o tstloop tstloop.c
$ tstloop
128205.13 semop/s [0,0]
$ gprof tstloop
gprof: gmon.out file is missing call-graph data

Any ideas?

The choice of time() vs. gettimeofday() (which, in fact, I did use) is quite irrelevant: I am trying to see how many times per second the system is capable of creating a process that simply reads a semaphore.

Thanks.

------

Jim, utimes() doesn't do what you think it does. gettimeofday() is more correct, but then the code is a bit more complicated. Microsecond differences would not produce the disparities in performance he is seeing.
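For what it's worth, a sketch of what the gettimeofday()-based timing would look like around the benchmark loop (the elapsed() helper is made up for illustration):

#include <stdio.h>
#include <sys/time.h>

static double elapsed(struct timeval *a, struct timeval *b)
{
        /* wall-clock seconds between a and b, microsecond resolution */
        return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

int main(void)
{
        struct timeval start, stop;
        long i;

        gettimeofday(&start, NULL);
        for (i = 0; i < 5000000; i++) {
                /* the semget()/semctl() calls from the benchmark go here */
        }
        gettimeofday(&stop, NULL);
        printf("%.2f iterations/s\n", i / elapsed(&start, &stop));
        return 0;
}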

That's strange. Any chance you're running a different program that's elsewhere in the path? Try for instance running "./tstloop".

Also, for gprof, use the -l option (small L). I get:

$ gprof -l  ./semget
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 57.10      0.13     0.13                             main (semget.c:21 @ 80485f0)
 25.12      0.18     0.06                             main (semget.c:18 @ 804861f)
  9.14      0.20     0.02                             main (semget.c:17 @ 8048614)
  6.85      0.22     0.02                             main (semget.c:16 @ 80485e4)

Line 21 is the semctl() function. Line 18 is the semget() call.

Now that we have concrete results, I'd cross-post this on the LinuxQuestions.org site.

The next step would be to use the debug/profiling version of libc (which I don't have floating around) and see if we can use gprof to find the bottleneck therein. Or, look at the source. It would seem to be of great benefit to have this improved.

We can marginally rule out page faults and the like by looking at the output from /usr/bin/time -v. A minor page fault means the OS reserved a page of memory and modified internal tables, but the process wasn't suspended due to I/O or anything like that. It turns out all of these are due to either the profiling code or the program invocation.

$ /usr/bin/time -vv  ./semget
555555.56 semop/s [0,0]
Command exited with non-zero status 24
        Command being timed: "./semget"
        User time (seconds): 2.85
        System time (seconds): 6.12
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 0
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 135
        Voluntary context switches: 7
        Involuntary context switches: 53
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 24

to Jim:

tstloop is the only program, I did re-run
$ ./tstloop
$ gprof -l ./tstloop
gprof: gmon.out file is missing call-graph data

Thank you for taking it to the other forum.

to Otheus:

tstloop compiled with profiling:

 
$ /usr/bin/time -vv ./tstloop
125000.00 semop/s [0,0]
Command exited with non-zero status 24
        Command being timed: "tstloop"
        User time (seconds): 15.05
        System time (seconds): 24.33
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:39.39
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 0
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 121
        Voluntary context switches: 1
        Involuntary context switches: 326
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 24

and here are the results from the program compiled without profiling:

 
$ /usr/bin/time -vv ./tstloop
128205.13 semop/s [0,0]
Command exited with non-zero status 24
        Command being timed: "./tstloop"
        User time (seconds): 15.39
        System time (seconds): 23.19
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 0
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 101
        Voluntary context switches: 1
        Involuntary context switches: 112
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 24

Try running gprof with -p and -l. My newer version might not need the -p.

The -p flag helped; please see the results below:

$ gprof -p ./tstloop
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 99.70      3.64     3.64                             main