Solaris SPARC speed issue

Hello helpful Unix gurus!

First, I appreciate any help I can get.

I have a product that we ported years ago to Solaris SPARC 7 (around 1998, I think) and that is compiled as a 32-bit executable. It has run (with various modifications over the years) on every Solaris SPARC platform, as far as I know.

Recently I have a customer that expects the program to run much faster. This program is a number-crunching type of utility that goes off and does file input, parsing, bit manipulation and output, and then exits, so normally it is pretty fast. It is written in C/C++ and is compiled to mostly native code with a few calls to the C run-time library (static linkage).

So my question is this: is the fact that this is a 32-bit SPARC executable slowing it down on a 64-bit SPARC? It is running on the following Solaris platform:

uname -a
SunOS appomahadev2 5.10 Generic_147440-01 sun4v sparc sun4v Solaris

isainfo -v
64-bit sparcv9 applications
        hpc vis3 fmaf asi_blk_init vis2 vis popc
32-bit sparc applications
        hpc vis3 fmaf asi_blk_init vis2 vis popc v8plus div32 mul32

I am asking to see:

  1. Is there a known reason why a 32-bit executable should run slower on a 64-bit platform?

  2. Would re-compiling this code on a newer 64-bit Solaris SPARC OS generate a SIGNIFICANTLY faster executable?

Getting a new Solaris SPARC box seems to be expensive and is not something we really want to do unless we need to.

Does anyone know of a cross compiling method whereby we can compile on a cheaper platform and target this platform?
Any ideas?

So you are running on a Niagara-type CPU (T1, T2, ...). These are CPUs with a lot of cores and threads but limited single-thread performance. Since you say it is an old application, it is most likely single-threaded and not very well suited to such an architecture (or vice versa).

Recompiling in 64-bit mode could speed things up a little if there is a lot of 64-bit arithmetic going on in the program, but I would not expect too much from this step.

Alternatively, you might redesign your application code to be multi-threaded. It might then run 8 times faster than its current speed on sun4v.

Thank you both for your insight.

The program is a single-threaded process; it has to be that way.
However, we will try to arrange to run the executable in parallel if possible.

If I understand you correctly, then running, say, 10 of these processes at the same time instead of in sequence "should" be much faster (rather than just a little faster, as on a normal single-core machine). Is that correct?

That really depends on your process characteristics. If your process is heavily doing floating-point arithmetic, there would be a notable gain with a T2 (but not that much on a T1). If your process is I/O or memory bound, there should be no significant gain. If the process is CPU bound, you can expect it to run at least 8 times faster. If your process's I/Os introduce latencies that can be parallelized, you can expect it to run up to 64 times faster.

I'm assuming an 8 core, 8 thread per core CPU.
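If it helps, here is a minimal Bourne-shell sketch of that "run N copies in parallel" idea: start the jobs in the background and wait for all of them. The job function and the count of 8 are placeholders; substitute your real invocation, e.g. ./crunch "input$i.dat" (hypothetical names).

```shell
#!/bin/sh
# Sketch: run 8 independent jobs in parallel instead of sequentially.
job() {
    # Stand-in for the real work, e.g.:
    #   ./crunch "input$1.dat" > "output$1.dat"
    sleep 1
}

start=$(date +%s)
i=1
while [ "$i" -le 8 ]; do
    job "$i" &          # launch in the background
    i=$((i + 1))
done
wait                    # block until all 8 background jobs finish
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

On an 8-thread CPU the elapsed time should be close to one job's runtime rather than eight times that, which is exactly the kind of win being described for a CMT box.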

Again, thank you for your responses.

So please let me clarify a few points to be sure.

  1. 'hergp' is saying that if they are using a T1/T2-type multi-core SPARC, then the issue is not so much the 32- vs 64-bit compile as it is single-threaded vs multi-threaded processing. Correct?
    On Windows x86 vs x64 we found that the gain from compiling to x64 is just under 20% (comparing one process running on the same input data). Would you expect something similar on SPARC?

  2. 'jlliagre' is saying that if we can design our program to be multi-threaded (which I don't think we can), or if we run the program as parallel processes on different sets of data instead of in sequence, then we should see some improvement. However, the amount will depend on the type of bottleneck we are experiencing (which we do not know yet). For example, if it is straight CPU bound we should see the best improvement, whereas if it is I/O bound there may be less improvement, etc.

  3. How do I know whether we are running a T1 or a T2 with regard to floating point? I noticed you said it is Niagara based on the sun4v. How do I tell what T level they have?

  4. What Solaris SPARC hardware would be better for doing this type of processing? I am not sure we have a choice but I am certainly curious.

---------- Post updated at 04:40 PM ---------- Previous update was at 09:38 AM ----------

One other thing. They have two servers, one a T2 and the other a T3. Specifically:

SunOS 5.10 Generic_147440-09 sun4u sparc SUNW,Sun-Fire-V245 (T2)
SunOS 5.10 Generic_147440-01 sun4v sparc sun4v (T3)

What does this tell us about single thread / multi-thread or floating point etc?
How does sun4v differ from sun4u? How does T2 differ from T3?

Thank you.

Just to inject another point of view: hergp and jlliagre are correct about threading.

To answer your 32/64-bit question: time it yourself on the same dataset with two different builds. We have done that on a different Solaris architecture and found only a small improvement.
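A minimal sketch of that side-by-side timing. The crunch32/crunch64 names and sample.dat are hypothetical; point them at your actual 32-bit and 64-bit binaries and a representative input file.

```shell
#!/bin/sh
# Sketch: wall-clock two builds of the same program on the same input.
bench() {
    s=$(date +%s)
    "$@" > /dev/null 2>&1   # discard output; we only care about elapsed time
    e=$(date +%s)
    echo "$1 took $((e - s))s"
}

# Hypothetical build names; substitute your own.
bench ./crunch32 sample.dat
bench ./crunch64 sample.dat
```

For finer resolution you could use /usr/bin/time or timex instead of second-granularity timestamps; the point is simply to run both binaries against identical data.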

If your number cruncher uses big arrays, it is possible that you are wasting CPU. If your code constantly forces the CPU to bring in pages of data and to do a lot of searching in the cached pages, you are possibly wasting CPU cycles on memory management.

Consider running your code and, at the same time, running trapstat. Thanks to this we got data that supported using a larger page size effectively. This does NOT necessarily involve coding; a minor change to the way you invoke your code is all that is needed. See the ppgsz man page for a very simple way to do this. Do this if and only if trapstat reveals an MMU issue. As I do not know much about your architecture, this may not be as beneficial as it was on our M4000.

Have a read:
Multiple Page Size Support - Siwiki
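A hedged sketch of how those two steps might look on Solaris 10. trapstat generally needs root, and ./crunch with input.dat are hypothetical stand-ins for the real binary and dataset; the script skips itself on non-Solaris systems where the tools are absent.

```shell
#!/bin/sh
# Sketch: sample MMU/TLB trap overhead while the job runs, then retry
# with larger pages if the miss time is significant. Solaris 10 only.
if command -v trapstat >/dev/null 2>&1; then
    trapstat -T 5 &                 # report TLB misses by page size every 5s
    ts_pid=$!
    # Ask for 4 MB heap and stack pages for this one invocation;
    # no recompilation is required.
    ppgsz -o heap=4M,stack=4M ./crunch input.dat
    kill "$ts_pid"
    status=ran
else
    status=skipped                  # not on Solaris; tools unavailable
fi
echo "trapstat probe: $status"
```

If the trapstat output shows a large percentage of time lost to TLB misses on small pages, the ppgsz run should show a measurable improvement; otherwise large pages are not your bottleneck.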

Yes.

No. The gain observed when moving from x86 to x64 is generally due to the small number of registers available on x86 vs the larger one on x64. SPARC 32 bit doesn't have this deficiency.

Currently, your process is likely CPU bound; otherwise you wouldn't have opened this thread in the first place.

Run

psrinfo -pv
prtdiag -v

to get the CPU model you have.

SPARC64, T2+, T4

The V245 is not a T2; it uses a single-thread, single-core CPU.

Use psrinfo -pv to see what kind of CPU it is.

sun4v is CMT (chip multi-threading)
sun4u = UltraSPARC I, II, III, IV, IV+, SPARC64
sun4v = UltraSPARC T1, T2, T2+, T4
See Category:SPARC microprocessors - Wikipedia, the free encyclopedia for details on each model.

What compiler are you using? What are your compiler optimization options?

What do you mean by "mostly native code"? C/C++ should be fully native code.
Also, from Solaris 10 onward, static link is not supported (and not possible) with the C run time (libc). It is only provided as a shared object.
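For reference, here is a sketch of what optimized 32-bit and 64-bit builds might look like with the Sun Studio compiler (the "native" compiler mentioned earlier). The file names are hypothetical and the exact flags depend on your Studio release, so treat this as a starting point, not a recipe.

```shell
# 32-bit build, high optimization, tuned for the build machine:
#   cc -xO4 -xtarget=native crunch.c -o crunch32

# 64-bit build for comparison (Studio 12 and later accept -m64;
# older releases use -xarch=v9 instead):
#   cc -xO4 -m64 crunch.c -o crunch64

# Note: from Solaris 10 onward libc is dynamic-only, so drop any
# leftover -Bstatic / -dn flags from old makefiles.
```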

Thank you again for all those answers. Sorry I was not able to reply sooner.

To answers your questions:

Regarding the compiler and optimizations, I am not sure what settings we were using for the 32-bit compile on Solaris 7; I would have to look for that info. I know it used the native compiler and not GNU.

Regarding the 'native code' comment: my mistake. What I really meant to say is that it is mostly C code with very few calls into the C run-time library or the OS, and very few dependencies on other code. Even the C++ is really minimal. The point is that the time it spends is mainly in our own code; the main things we use that are not in the code are the basic file I/O calls.

I am sure we can switch to dynamic C run-time linkage; it was only done statically because it has been done that way historically. This code was originally ported in 1992.

Thank you all again for your help. I feel fully informed and will talk to our customer about it all. We also plan to get a newer platform for porting. :-)