Test program taking much more time on high-end server T5440 than on low-end server T5220

Hi all,
I have written the following program and run it on both a T5440 [1.4 GHz, 95 GB RAM, 32 cores, 256 logical (virtual) processors] and a T5220 [UltraSPARC-T2 (chipid 0, clock 1165 MHz), 8 GB RAM, 1 core, 8 virtual processors], both on the same OS version. I found that the T5440 takes more time than the T5220. Please find the details below.

test1.cpp

#include <iostream>
#include <pthread.h>
 
using namespace std;
#define NUM_OF_THREADS 20
 
struct ABCDEF {
char A[1024];
char B[1024];
};
 
void *start_func(void *)
{
    long long i = 6000;
    while(i--)
    {
        ABCDEF* sdf = new ABCDEF;
        delete sdf;
        sdf = NULL;
    }
    return NULL;
}
int main(int argc, char* argv[])
{
    pthread_t tid[50];
    for(int i=0; i<NUM_OF_THREADS; i++)
    {
        pthread_create(&tid[i], NULL, start_func, NULL);
        cout << "Creating thread " << i << endl;
    }
 
    for(int i=0; i<NUM_OF_THREADS; i++)
    {
        pthread_join(tid[i], NULL);
        cout << "Waiting for thread " << i << endl;
    }
}

Executing the above program on the T5440 takes:
real 0.78
user 3.94
sys 0.05

Executing the above program on the T5220 takes:
real 0.23
user 1.43
sys 0.03

It seems that the T5440, which is the high-end server, takes almost 3 times as long as the T5220, which is the low-end server.

However, I have one more observation. I tried the following program:

test2.cpp

#include <iostream>
#include <pthread.h>
 
using namespace std;
#define NUM_OF_THREADS 20
 
struct ABCDEF {
char A[1024];
char B[1024];
};
 
int main(int argc, char* argv[])
{
    long long i = 6000000;
    while(i--)
    {
        ABCDEF*  sdf = new ABCDEF;
        delete sdf;
        sdf = NULL;
    }
    return 0;
}

It seems that the T5440 server is faster in this case compared to the T5220 server.

Could anyone please help me find the exact reason for this behaviour, as my application is also slow on this T5440 server?

Thanks in advance !!!

regards,
Sanjay

Did you compile with the fastest settings for this platform on the slower machine? The SunWSPro compiler has a lot of optimizations, some of them very architecture-specific.
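
For example (illustrative only; the exact flags depend on which compiler release you have installed):

# GNU g++ as shipped in /usr/sfw: enable optimization (the test was built with -g only)
/usr/sfw/bin/g++ -O3 -Wno-deprecated test1.cpp -lpthread -o test1

# Sun Studio CC: -fast turns on a bundle of aggressive, platform-specific optimizations, -mt builds for multithreading
CC -fast -mt test1.cpp -o test1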

Multicore CPUs can be slower on a single thread than CPUs that are not trimmed down to fit so many cores on one chip. Gamers still like 1-2 core machines, as the parallelization of a game is pretty low. If you run 32 or 64 copies at once, you might see the difference.

Then, there is the question of what is running concurrently on each server. Another app may be competing for CPU, RAM, disk or network bandwidth. Some of this can be tricky to observe.

Thanks a lot for the reply.

I compiled the test program on both servers and executed the corresponding binaries on each machine. Also, no other application is running on either of these servers.

Additionally, I tried one more experiment and found the following results.

The attached program (ABC.cpp) is compiled with "/usr/sfw/bin/g++ -g -Wno-deprecated ABC.cpp -lpthread" and run with "time -p ./a.out".

High Performance Architecture (kansparc54144) - root/labbws54144

4 socket(s)
32 core(s)
256 logical (virtual) processor(s)
The physical processor has 64 virtual processors (0-63)
UltraSPARC-T2+ (chipid 0, clock 1414 MHz)
The physical processor has 64 virtual processors (64-127)
UltraSPARC-T2+ (chipid 1, clock 1414 MHz)
The physical processor has 64 virtual processors (128-191)
UltraSPARC-T2+ (chipid 2, clock 1414 MHz)
The physical processor has 64 virtual processors (192-255)
UltraSPARC-T2+ (chipid 3, clock 1414 MHz)
Memory size: 98016 Megabytes

SunOS Generic_144488-17 sun4v sparc SUNW,T5440

Case 1: memory operations only (allocation, set, de-allocation) - with line 129 of the test program commented out
real 0.78
user 3.94
sys 0.05

Case 2: memory operations (allocation, set, de-allocation) plus computation (matrix multiplication)
real 14.54
user 280.18
sys 0.07

Low Performance Architecture (kansparc6744) - root/6744@labbws

1 socket(s)
1 core(s)
8 logical (virtual) processor(s)
The physical processor has 8 virtual processors (0-7)
UltraSPARC-T2 (chipid 0, clock 1165 MHz)
Memory size: 8192 Megabytes

SunOS 5.10 Generic_144488-17 sun4v sparc SUNW,SPARC-Enterprise-T5220

Case 1: memory operations only (allocation, set, de-allocation) - with line 129 of the test program commented out
real 0.23
user 1.43
sys 0.03

Case 2: memory operations (allocation, set, de-allocation) plus computation (matrix multiplication)
real 66.50
user 525.30
sys 0.44

MY CONCLUSION:
The high-performance architecture performs well in case 2 but badly in case 1 (???).

I don't understand this behaviour. Could you please provide some information on it?

regards,
Sanjay

Your program is CPU-bound and doesn't do any I/O, so it doesn't take much advantage of the CMT architecture.

The difference in results might be due to the migration of threads from one core to another.

Have a look at this blog post for a piece of code you can add so your threads stay bound to the same CPU during their execution:

https://blogs.oracle.com/d/entry/binding_to_the_current_processor

Thanks a lot, jlliagre, for the reply!

Your suggestion was very helpful for my analysis. I put the code you mentioned in the link into my test program. The results actually changed after that: it took 107 seconds to execute my program on the high-end server (32 cores, 4 sockets, 256 virtual CPUs, 95 GB RAM, 1414 MHz) and 130 seconds on the other server (8 cores, 1 socket, 64 virtual CPUs, 32 GB RAM, 1165 MHz).

However, I am still not able to understand, in fact I am more confused: why does performance improve for a multi-core, multi-processor high-end server after binding my test program to a single CPU?

I am also not able to understand how the migration of threads on a multi-core, multi-processor machine degrades performance.
Could you please help me understand the reason for this?

Thanks a lot for your time.

regards,
Sanjay Singh

If a thread moves from one core (or chip) to another, the cache there is cold. Often, everything one core writes, the other discards from its cache. There may be similar problems with the VM translation cache (TLB).

It takes time for processes to move around from CPU to CPU. Caches must be re-populated and RAM perhaps re-fetched. Prevent the process from moving and these losses are minimized.

Well, as processes get dispatched to CPUs, some state must be reloaded every time, like the VM translation cache (TLB), even if it is the same CPU as the last dispatch, because something else has been running there in the meantime, even if only an idle process. RAM, however, is cached inside the CPU, possibly at two or more levels, by physical rather than virtual address, so the cache is process-insensitive; but the farther away, CPU-wise, the next dispatch of that process lands, the more cache misses occur until the cache is reloaded from RAM. Some CPUs have a variation on this scheme, where a VM translation miss is effectively a first-level cache miss.

Furthermore, cache coherency snooping removes lines from a cache when they are written by other CPUs, so even if no other process has used a CPU since your process was last there, the cache hit rate is reduced for modified cache lines, which are often 16 or more bytes wide. If any byte on a line is modified, the whole line is invalidated in every other cache as that modified data makes its way to RAM.
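
If you want to see that effect in isolation, here is a minimal sketch (not your program, just an illustration; the 64-byte cache-line size is an assumption): two threads each hammer their own counter, and commenting out the padding puts both counters on one cache line, so every write by one thread invalidates the line in the other CPU's cache and the run gets noticeably slower (time each variant with time(1)).

#include <pthread.h>
#include <stdio.h>

// Two per-thread counters. With the padding in place they live on different
// cache lines; remove pad[] and both counters share a line.
struct Counters {
    volatile long a;
    char pad[64];              // assumed cache-line size, purely illustrative
    volatile long b;
};

static Counters counters;

static void *inc_a(void *) { for (long i = 0; i < 100000000; i++) counters.a++; return NULL; }
static void *inc_b(void *) { for (long i = 0; i < 100000000; i++) counters.b++; return NULL; }

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc_a, NULL);
    pthread_create(&t2, NULL, inc_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}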

RAM is a lot slower than the first-level cache, and caches get faster as you get closer to the CPU core, so the cost of cache misses is huge in CPU cycles. That is why programs that run like lightning once started still take time to get loaded and produce the first loop's data.

Fetching from disk to RAM adds to that delay, since disk is also much slower than RAM. If it wasn't, disk I/O could stop the CPUs dead.

Can you post your updated code?

Please find below the updated code, which binds the process to a CPU. It takes around 107 seconds to complete the execution.

#include <iostream>
#include <stdio.h>
#include <pthread.h>
#include <sys/processor.h>
#include <time.h>
#include <unistd.h>
using namespace std;
#define NUM_OF_THREADS 20
struct ABCDEF {
char A[1024];
char B[1024];
};
void bindnow()
{
  processorid_t proc = getcpuid();
  if (processor_bind(P_LWPID, P_MYID, proc, 0))
    { printf("Warning: Binding failed\n"); }
  else
    { printf("Bound to CPU %i\n", proc); }
}
 
void *start_func(void *)
{
    long long i = 6000000;
    //bindnow();
    while(i--)
    {
        ABCDEF* sdf = new ABCDEF;
        delete sdf;
        sdf = NULL;
    }
    return NULL;
}
int main(int argc, char* argv[])
{
    pthread_t tid[50];
    struct timespec tps, tpe;
    if ((clock_gettime(CLOCK_REALTIME, &tps) != 0) || (clock_gettime(CLOCK_REALTIME, &tpe) != 0)) {
        perror("clock_gettime");
        return -1;
    }
    bindnow();
    for(int i=0; i<NUM_OF_THREADS; i++)
    {
        pthread_create(&tid[i], NULL, start_func, NULL);
        cout << "Creating thread " << i << endl;
    }
     
    for(int i=0; i<NUM_OF_THREADS; i++)
    {
        pthread_join(tid[i], NULL);
        cout << "Waiting for thread " << i << endl;
    }
    clock_gettime(CLOCK_REALTIME, &tpe);
    printf("%lu s, %lu ns\n", tpe.tv_sec - tps.tv_sec,
           tpe.tv_nsec - tps.tv_nsec);
}
[root]kansparc54144:/ /usr/sfw/bin/g++ -g -Wno-deprecated ss2.cpp -lpthread -lrt -o ss2
[root]kansparc54144:/ ./ss2
Bound to CPU 64
Creating thread 0
Creating thread 1
Creating thread 2
Creating thread 3
Creating thread 4
Creating thread 5
Creating thread 6
Creating thread 7
Creating thread 8
Creating thread 9
Creating thread 10
Creating thread 11
Creating thread 12
Creating thread 13
Creating thread 14
Creating thread 15
Creating thread 16
Creating thread 17
Creating thread 18
Creating thread 19
Waiting for thread 0
Waiting for thread 1
Waiting for thread 2
Waiting for thread 3
Waiting for thread 4
Waiting for thread 5
Waiting for thread 6
Waiting for thread 7
Waiting for thread 8
Waiting for thread 9
Waiting for thread 10
Waiting for thread 11
Waiting for thread 12
Waiting for thread 13
Waiting for thread 14
Waiting for thread 15
Waiting for thread 16
Waiting for thread 17
Waiting for thread 18
Waiting for thread 19
107 s, 416364341 ns

Also, I commented out the bindnow() call in main and added a bindnow() call in start_func, as shown below. It takes around 486 seconds to complete the execution.

void *start_func(void *)
{
    long long i = 6000000;
    bindnow();
    while(i--)
    {
        ABCDEF* sdf = new ABCDEF;
        delete sdf;
        sdf = NULL;
    }
    return NULL;
}
int main(int argc, char* argv[])
{
    pthread_t tid[50];
    struct timespec tps, tpe;
    if ((clock_gettime(CLOCK_REALTIME, &tps) != 0) || (clock_gettime(CLOCK_REALTIME, &tpe) != 0)) {
        perror("clock_gettime");
        return -1;
    }
    //bindnow();
    for(int i=0; i<NUM_OF_THREADS; i++)
    {
        pthread_create(&tid[i], NULL, start_func, NULL);
        cout << "Creating thread " << i << endl;
    }
    ... 
[root]kansparc54144:/ /usr/sfw/bin/g++ -g -Wno-deprecated ss2.cpp -lpthread -lrt -o ss2
[root]kansparc54144:/ ./ss2
Creating thread Bound to CPU 64
0
Creating thread 1
Bound to CPU 192
Creating thread 2
Bound to CPU 0
Creating thread Bound to CPU 129
3
Creating thread 4
Bound to CPU 211
Creating thread 5
Bound to CPU 101
Creating thread 6
Bound to CPU 19
Creating thread 7
Bound to CPU 142
Creating thread 8
Bound to CPU 192
Creating thread 9
Bound to CPU 110
Creating thread 10
Bound to CPU 0
Creating thread 11
Bound to CPU 147
Creating thread 12
Bound to CPU 229
Creating thread 13
Bound to CPU 119
Creating thread 14
Bound to CPU 9
Creating thread 15
Bound to CPU 147
Creating thread 16
Bound to CPU 101
Creating thread 17
Bound to CPU 247
Creating thread 18
Bound to CPU 19
Creating thread 19
Bound to CPU 147
Waiting for thread 0
Waiting for thread 1
Waiting for thread 2
Waiting for thread 3
Waiting for thread 4
Waiting for thread 5
Waiting for thread 6
Waiting for thread 7
Waiting for thread 8
Waiting for thread 9
Waiting for thread 10
Waiting for thread 11
Waiting for thread 12
Waiting for thread 13
Waiting for thread 14
Waiting for thread 15
Waiting for thread 16
Waiting for thread 17
Waiting for thread 18
Waiting for thread 19
486 s, 3873742799 ns

So.... When you don't call bindnow() it takes many times longer?

Hi All,

Thanks a lot for the replies. They helped me a lot in finding the issue I was facing with my application. It was due to the multiple processors.
I bound my application to a processor set with the following code:

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>
#include <iostream>

using namespace std;

void ProcessorSetAdd()
{
    psetid_t psid;
    processorid_t ci;

    /* Create an empty processor set */
    if (pset_create(&psid) != 0)
    {
        cout << "pset_create() failed" << endl;
    }
    /* Assign CPUs 8-15 to the processor set */
    //for (ci = 0; ci < 63; ci++)
    for (ci = 8; ci < 16; ci++)
    {
        if (pset_assign(psid, ci, NULL) != 0)
        {
            cout << "pset_assign() failed for " << ci << endl;
        }
    }
    /* Bind the current process to the processor set */
    if (pset_bind(psid, P_PID, P_MYID, NULL) != 0)
    {
        cout << "pset_bind() failed" << endl;
    }
    /* Report what ended up in the set */
    int pType;
    processorid_t cpuList[16];
    uint_t noOfCPU = sizeof(cpuList) / sizeof(cpuList[0]);
    pset_info(psid, &pType, &noOfCPU, cpuList);
    cout << "Number of CPUs in the set: " << noOfCPU << endl;
    cout << "Type of processor set: " << pType << endl;
}

It gave the same performance as the T5220 server.

Thanks a lot once again, everybody.

regards,
sanjay

Try using a multiple (2x is usually good) of the CPU core count for the worker thread count, like 64. That way, the work is divided equally across all CPU cores, whether there are 8 or 32, and if any thread blocks, there is another ready to use the core.
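
A minimal sketch of picking the thread count at run time instead of hard-coding 20, assuming sysconf(_SC_NPROCESSORS_ONLN) is available (it is on Solaris 10):

#include <unistd.h>
#include <stdio.h>

// Size the worker pool from the number of online virtual processors.
// Roughly 2x keeps every hardware thread busy even when some workers block.
int main()
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1)
        ncpu = 1;
    long nthreads = 2 * ncpu;
    printf("online CPUs: %ld, suggested worker threads: %ld\n", ncpu, nthreads);
    return 0;
}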

I have a couple of observations:

  1. The only thing the test program is testing is the ability of the standard malloc()/free() implementation to repeatedly allocate then free then allocate again the same blocks of memory to multiple threads. I question the usefulness of such a test.

  2. The calculation of time spent ignores nanosecond rollover (see the sketch below).
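
For item 2, a minimal sketch of the corrected elapsed-time arithmetic (link with -lrt on Solaris, as in the earlier compile commands):

#include <time.h>
#include <stdio.h>

// Compute elapsed time handling the nanosecond borrow, instead of printing
// the tv_sec and tv_nsec differences independently.
int main()
{
    struct timespec tps, tpe;
    clock_gettime(CLOCK_REALTIME, &tps);
    /* ... work ... */
    clock_gettime(CLOCK_REALTIME, &tpe);

    time_t sec  = tpe.tv_sec - tps.tv_sec;
    long   nsec = tpe.tv_nsec - tps.tv_nsec;
    if (nsec < 0) {            // borrow one second when the nanoseconds roll over
        sec  -= 1;
        nsec += 1000000000L;
    }
    printf("%ld s, %ld ns\n", (long)sec, nsec);
    return 0;
}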

A very good point indeed. I overlooked the hidden malloc() behind the C++ new and the hidden free() behind delete.

Linking with libmtmalloc should significantly boost the performance.

/usr/sfw/bin/g++ -g -Wno-deprecated ss2.cpp -lpthread -lrt -lmtmalloc -o ss2

Yes, but it's still benchmarking nothing particularly useful.

Benchmarking multithreaded malloc/free might make sense.

If you write your objects to an mmap'd file, extending it as needed, you never need to malloc or free. Of course, writing the code so that objects get reused rather than deleted avoids that churn, in C++, Java or whatever OO language.
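
As a rough illustration of the reuse idea (a sketch only: it uses an anonymous mapping rather than a growing file, and placement new to construct the object in the same storage every iteration):

#include <sys/mman.h>
#include <new>
#include <cstdio>

struct ABCDEF {
    char A[1024];
    char B[1024];
};

int main()
{
    // Reserve one mapping up front and construct the object in place,
    // so the hot loop never touches malloc/free at all.
    void *mem = mmap(NULL, sizeof(ABCDEF), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    long long i = 6000000;
    while (i--) {
        ABCDEF *sdf = new (mem) ABCDEF;   // placement new: reuse the same storage
        sdf->A[0] = 'x';
        sdf->~ABCDEF();                   // destroy the object, keep the storage
    }

    munmap(mem, sizeof(ABCDEF));
    return 0;
}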