Performance bottleneck on server, need help

We suspect a performance issue on our server when running Informatica jobs. Two things look suspicious:

  • Cache memory never comes down, even when top shows more than 99% of memory used.
  • There may be I/O or network contention, or the cache is clogged.

top - 20:58:20 up 16 days,  4:37, 16 users,  load average: 7.50, 4.85, 3.82
Tasks: 386 total,   2 running, 376 sleeping,   7 stopped,   1 zombie
Cpu(s): 35.1%us, 10.1%sy,  0.0%ni, 32.5%id, 20.9%wa,  0.0%hi,  1.3%si,  0.2%st
Mem:  32877500k total, 32626692k used,   250808k free,   291804k buffers
Swap: 20971516k total,    37056k used, 20934460k free, 18906004k cached


free -g
             total       used       free     shared    buffers     cached
Mem:            31         31          0          0          0         18
-/+ buffers/cache:         12         18
Swap:           19          0         19

If we kill all Informatica services and jobs, the utilization in top comes down to:

top - 20:16:17 up 16 days,  3:55, 18 users,  load average: 1.26, 0.94, 0.85
Tasks: 366 total,   1 running, 357 sleeping,   7 stopped,   1 zombie
Cpu(s):  8.0%us,  3.9%sy,  0.0%ni, 88.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32877500k total, 19378856k used, 13498644k free,   225320k buffers
Swap: 20971516k total,    37060k used, 20934456k free, 18426624k cached

A) I am not sure why the cache memory is not released.
B) Even when utilization in top reaches 32 GB, i.e. while Informatica jobs are running, cache is still at 18 GB. Shouldn't the cache be released, given that the Informatica jobs are hanging?

I strongly feel this has nothing to do with server memory, because swap is barely used. But before I go to Informatica and raise a flag, I want to make sure nothing is wrong on the server side. What more can I check to confirm that CPU, I/O, and network are fine? I cannot think of any other server-side factor that could affect this.

Your bottleneck is I/O. Compare the wa figure under load with the figure at normal load:
20.9%wa -> Under load: the percentage of time the CPU sat idle waiting for I/O to complete. (Slow)
0.1%wa -> Normal load.
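
If you want to track that wa figure over time rather than eyeball top, here is a minimal Python sketch. It assumes a Linux /proc/stat with the usual field order (user, nice, system, idle, iowait, ...); the function names are mine, not from any tool, and the 1-second interval is arbitrary.

```python
import os
import time

# Field order of the aggregate "cpu" line in /proc/stat (assumed, per proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn 'cpu  1 2 3 ...' into a dict of cumulative tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the all-CPU summary."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = parse_cpu_line(read_cpu_line())
    time.sleep(1)  # sampling interval; an arbitrary choice
    second = parse_cpu_line(read_cpu_line())
    print("iowait over the last second: %.1f%%" % iowait_percent(first, second))
```

Run it in a loop while the Informatica jobs are active and again while they are stopped; a large gap between the two, like the 20.9% vs 0.1% above, points at I/O.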

I am going to assume this is Linux, based on how top looks, since you did not say.
"cached" is memory that no application is using; rather than let it go to waste, the kernel uses it to cache disk data. Whenever an application needs memory, the kernel takes it back from the cache. You should not be concerned about it; consider it free memory.
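
To make that concrete, here is a small Python sketch of the arithmetic, using the figures from your "under load" top output. It treats all buffers and cache as reclaimable, which is an approximation (some cached pages, such as shared memory, cannot simply be dropped). The function name is just for illustration.

```python
KIB_PER_GIB = 1024 * 1024

# Figures copied from the "under load" top output in the question, in KiB.
total = 32877500
used = 32626692
free = 250808
buffers = 291804
cached = 18906004


def reclaimable_view(used, free, buffers, cached):
    """Return (really_used, really_free) in KiB, counting buffers/cache as free."""
    really_used = used - buffers - cached
    really_free = free + buffers + cached
    return really_used, really_free


really_used, really_free = reclaimable_view(used, free, buffers, cached)
print("used by applications: %.1f GiB" % (really_used / KIB_PER_GIB))
print("available on demand:  %.1f GiB" % (really_free / KIB_PER_GIB))
# prints:
# used by applications: 12.8 GiB
# available on demand:  18.5 GiB
```

This is the same calculation behind the "-/+ buffers/cache" line of your free -g output, which shows 12 used and 18 free.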

Sorry, I forgot to mention that it's Red Hat Linux and we are running it on Amazon cloud, so I am not sure whether the I/O problem is due to the box being in the cloud. Is there a way to isolate which process is affected by the I/O problem?

Regards,
Jey

Take a look at iotop.
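
If iotop is not installed, a rough stand-in can be sketched in Python. This is not how iotop itself works; it assumes your kernel exposes per-process counters in /proc/<pid>/io (older kernels may not), and you need to run it as root to see other users' processes. The helper names and the 2-second interval are my own choices.

```python
import os
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def snapshot(proc_root="/proc"):
    """Map pid -> (read_bytes, write_bytes) for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io = parse_proc_io(f.read())
        except OSError:
            continue  # process exited, file missing, or permission denied
        result[int(entry)] = (io.get("read_bytes", 0), io.get("write_bytes", 0))
    return result


def busiest(before, after, limit=5):
    """Rank pids by bytes read plus written between the two snapshots."""
    deltas = []
    for pid, (r1, w1) in after.items():
        r0, w0 = before.get(pid, (0, 0))  # process started during the interval
        deltas.append(((r1 - r0) + (w1 - w0), pid))
    deltas.sort(reverse=True)
    return deltas[:limit]


def command_line(pid):
    """Best-effort command line for a pid; '?' if it cannot be read."""
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            return f.read().replace(b"\0", b" ").decode(errors="replace").strip()
    except OSError:
        return "?"


if __name__ == "__main__" and os.path.isdir("/proc"):
    first = snapshot()
    time.sleep(2)  # sampling interval; an arbitrary choice
    second = snapshot()
    for delta, pid in busiest(first, second):
        print("%8d  %12d bytes  %s" % (pid, delta, command_line(pid)))
```

The counters in /proc/<pid>/io are cumulative since the process started, which is why the sketch takes two snapshots and ranks by the difference. Run it while the Informatica jobs are hanging to see which processes are generating the disk traffic.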