AIX CPU usage

baladelaware73 · October 1, 2016, 12:17pm

Hi,
We have two LPARs; both have the same capacity and, I believe, the same configuration. The ulimit settings for the oracle user are unlimited on both LPARs. We installed Oracle databases with the same configuration on both LPARs, and the databases sync every second, so the data volume is the same. Both LPARs/databases have identical jobs running at exactly the same times.
We run the net-backup process at the same time on both LPARs every night. LPAR1 finishes in 30 minutes but LPAR2 takes 2 hours to complete.
When I observed them, I noticed that the backup processes on LPAR1 consume 35 to 40% CPU per PID, while the backup processes on LPAR2 consume only 5 to 10% CPU per PID.
My understanding is that the backup on LPAR2 is delayed because it is not using the available CPU; please correct me if I'm wrong. Secondly, how can we make or instruct AIX to use more CPU for the backup processes running on LPAR2?

-Thanks in advance
-Bala

rbatte1 · October 3, 2016, 4:28am

If you are certain that you are using the same disk infrastructure and there is no lower class of IO rate given by the SAN (or any other infrastructure, e.g. different number of paths, fabric port settings etc.) then my first, perhaps wildly inaccurate, guess would be a missing index.

This could lead to more IO and therefore less CPU use: there is less CPU work to do because the processes spend more of their time waiting for IO.

If the LPAR is otherwise idle, have a look at the output from vmstat 10 10 and see what the Wait column (towards the very end) tells you.

Other things to consider might be lower memory which might be causing paging. From the same output, look at the pages in & out (somewhere in the middle) and that might illustrate that. You can also have a look at lsps -a and if it's a bit heavy that could be an indication of low memory.
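As a rough sketch of the vmstat check, the snippet below averages one column of vmstat output so the two LPARs can be compared with a single number each. The function name vmstat_avg is made up for illustration, and it assumes the usual layout where a header row names the columns (wa for IO wait, pi/po for paging) and data rows start with a number.

```sh
# Average one named column (default: wa) from vmstat output read on stdin.
# The column is located by its header name, so it still works when the
# LPAR adds extra columns such as pc and ec after wa.
# Note: "sy" appears twice in the header; the last match wins for that name.
vmstat_avg() {
    awk -v col="${1:-wa}" '
        # Until the header row is found, look for the requested column name.
        idx == 0 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        # Data rows start with a number (the r column).
        $1 ~ /^[0-9]+$/ { sum += $idx; n++ }
        END { if (n) printf "%.1f\n", sum / n }
    '
}

# Usage on each LPAR while the backup is running:
#   vmstat 10 10 | vmstat_avg wa    # average IO wait
#   vmstat 10 10 | vmstat_avg pi    # average page-ins
#   lsps -a                         # paging space usage
```

The first data row of vmstat reports averages since boot, so with only a few samples it can skew the result.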

Beyond that, you would be looking into the oracle parameter file for variations.

I hope that this helps,
Robin

zaxxon · October 3, 2016, 5:24am

As Robin says, have a look at IO in particular.

In general it might be helpful to have nmon running for monitoring, so you can compare lots of different system parameters from both hosts side by side.

baladelaware73 · October 3, 2016, 1:50pm

IO is high; I can see that from the database side. I thought IO on LPAR2 is high because it does not get sufficient CPU. Is that a correct statement?

zaxxon · October 4, 2016, 1:29am

It depends on what you call IO. With the OS tool iostat, for example, you can see how many bytes have been read/written, but also the number of IOs and the percentage of time spent on IO.
I am no Oracle DBA and have no access so I don't know what you see.

Way back we had some SAN storage that was far from reaching its maximum bandwidth in MB/s, but it was handling so many small IOs that it could not take any more.

So with just one or two sentences from you describing it, it is hard to guess what the reason could be.
Since the task runs for a long time, it is really recommended to monitor for that whole period.
Please do yourself a favour: download nmon (I think it is actually installed with the OS these days), have it run in the background and collect this performance data on both boxes.
It is really easy to set up, and when you have the data, use lpar2rrd to paint some pictures so you can compare the graphs and get a clue.

Just google for nmon ibm and you will also find the documentation with examples.
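A minimal sketch of such a recording, assuming nmon's usual recording flags (-f to write a .nmon file, -s for the interval in seconds, -c for the number of snapshots). The helper name nmon_cmd is made up; it only works out the snapshot count for the period you want to cover and prints the command.

```sh
# Print the nmon command that records for a given number of minutes.
#   $1 = minutes to cover, $2 = seconds between snapshots (default 30)
nmon_cmd() {
    minutes=$1
    interval=${2:-30}
    count=$(( minutes * 60 / interval ))
    echo "nmon -f -s $interval -c $count"
}

# Usage: cover a 2.5 hour window around the nightly backup.
#   nmon_cmd 150        # prints the command so it can be checked first
#   $(nmon_cmd 150)     # runs it; nmon keeps recording in the background
```

Start it on both LPARs shortly before the backup window so the two recordings cover the same period.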

rbatte1 · October 4, 2016, 4:34am

I would still say to check that all the indices you expect exist and are valid on the server with the problem, something like:-

-- Partitioned indexes show N/A here; check DBA_IND_PARTITIONS for those.
SELECT OWNER, INDEX_NAME, STATUS
FROM DBA_INDEXES
WHERE STATUS <> 'VALID'
ORDER BY OWNER, INDEX_NAME;

If DBA_INDEXES is not available to you, you can use DBA_OBJECTS instead and restrict it to OBJECT_TYPE = 'INDEX'.

You could also rebuild the indices or recalculate the statistics that Oracle uses to decide how it accesses the data/indices, with something like ALTER INDEX ... REBUILD or ANALYZE ... COMPUTE STATISTICS.
Have you got an explain plan from the database to show you what it is doing when it is running a long transaction?

My favoured explanation is that it is either missing an index or not using one, consequently you end up reading the whole table (lots of IO) and performing the equivalent of a massive if in a loop to get the data you want. The explain plan is key. If you see something about a full table scan on a huge table, that is a very good bet.

Try to get a comparable one on the good side and see if there is a difference.

It's been ages since I had this sort of thing, so I can't remember how you generate it. I think we might even have used a client tool (Toad or Oracle Discoverer perhaps?) to give us the information. There will be a DBA command-line way too, I'm sure. From memory (no pun intended) it might be to do with the view V$SQL_PLAN, but I no longer have access to systems like that.

I hope that this helps,
Robin

baladelaware73 · October 4, 2016, 10:06am

This is an RMAN incremental backup. Why would it use indexes for copying files? I would agree if this were a long-running SQL or DML statement.

rbatte1 · October 4, 2016, 11:39am

Perhaps it would have been useful to know this earlier. I now suggest you look at the network settings. If there is a mismatch between the speeds of the switch port and the NIC then you might hit trouble, or if there is an extra hop across the network (even a change of subnet on the same physical switch), that can slow bulk data transfer down too.

There are more things to consider now. I too would not expect a backup to read any indices. I think it should read the object definitions and the full tables.

Can you explain more about your network between the LPARs, the device managing the backup and the recipient that gets the data?

If these are not run concurrently, is there anything else with a very high data volume being transferred over the network around the same time?

Robin

jim_mcnamara · October 4, 2016, 5:01pm

Oooo. RMAN.

ugh.
[story]
There is a sort of scripting language that is used to set up parameters for an RMAN backup. If you set "threads" to a high value (usually more than 4) you can get I/O saturation on most types of I/O devices, as well as significant CPU usage.

Our DBA set threads to 20 on our one huge customer db. Incremental backups at 0400 completely ate the server and squashed all of our DB batch jobs for about two hours every night. On Saturday morning when full backups ran, all 16 cpus were nearly maxed out, nobody could work on the system until the backup was over - like 6 hours later. A refresh of the TEST db took 40+ hours. In this case the TEST server was hammered into oblivion.

It took several months to get the DBA to decide RMAN tuning was indeed the problem.
One change and all the system issues went away.
[/story]

I would strongly suggest that, if your DBA does not know RMAN, you open a support ticket with Oracle and get competent help.

baladelaware73 · October 5, 2016, 10:13am

I suspected network issues between the LPAR2 db server and the net-backup server, but the full backup of LPAR2, which runs at the weekend, completes in the same time as on LPAR1. This performance issue is only seen with the RMAN incremental backup. I also verified that the same RMAN script runs on LPAR1 and LPAR2, and the RMAN logs have the same contents on both sides. Below is the topas output taken while the incremental backup was running.

Thanks

---------- Post updated at 10:13 AM ---------- Previous update was at 10:07 AM ----------

Apologies for the formatting issues; pasting it again.

CPU          LPAR1     LPAR2
user%        12.7      6.4
kern%        11.2      6
wait%        1.1       3.8
idle%        75        83.9
Physc        2.91      1.34
Entc%        36.43     16.73

Network      LPAR1     LPAR2
BPS          30.7M     6.41M
I-Pkts       2.43k     825.2
O-pkts       21.8k     4.62K
B-in         395k      116K
B-out        30.3M     6.30M

Disk         LPAR1     LPAR2
Busy%        8         10.7
BPS          3.14G     944M
TPS          6.79K     2.39K
B-Read       3.14G     942M
B-write      1.14M     2.32M

FileSystem   LPAR1     LPAR2
BPS          784M      250M
TPS          2.24K     2.13K
B-Read       784M      250M
B-write      428K      269K
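
For a rough side-by-side comparison, the figures above can be turned into LPAR1/LPAR2 ratios. This is only a sketch: the helper names to_num and ratio are made up for illustration, and it assumes the K/M/G suffixes are powers of 1024.

```sh
# Convert a topas-style figure such as 3.14G, 944M or 2.39K to a plain number.
# Assumption: K, M and G are powers of 1024 (upper or lower case).
to_num() {
    echo "$1" | awk '{
        v = $1 + 0                              # awk keeps the numeric prefix
        u = toupper(substr($1, length($1)))     # last character = unit, if any
        if (u == "K") v *= 1024
        else if (u == "M") v *= 1024 * 1024
        else if (u == "G") v *= 1024 * 1024 * 1024
        printf "%.2f\n", v
    }'
}

# Print how many times larger the first figure is than the second.
ratio() {
    awk -v a="$(to_num "$1")" -v b="$(to_num "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

# Usage with the values from the table:
#   ratio 3.14G 944M     # disk BPS
#   ratio 30.7M 6.41M    # network BPS
#   ratio 2.91 1.34      # Physc
```

With the numbers posted, disk BPS comes out at 3.4, network BPS at 4.8 and Physc at 2.2, so LPAR1 is reading and sending several times more data per second than LPAR2.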

zaxxon · October 6, 2016, 2:01am

Too bad you keep ignoring the suggestion to try nmon; it would help you, and anybody else reading this in the future too.
Anyway, there are big differences in BPS in the Disk and FileSystem sections: LPAR1 moves roughly three times as much per second as LPAR2. Maybe run a test with dd or something else that goes more or less directly to the storage (SAN?) to check whether it performs the same for both LPARs.
Also check with the storage/SAN guys whether they can see more traffic on the port for LPAR2 than for LPAR1, throttling its IO.
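
A minimal sketch of such a dd read test. The function name read_test is made up for illustration, and /dev/rhdisk4 in the usage line is only a placeholder for whichever disk holds the database files on your systems. It only reads, so it is non-destructive, but raw devices normally need root.

```sh
# Sequential read test: read COUNT blocks of 1 MiB from SRC and discard them.
# Using the same block size and count on both LPARs keeps the runs comparable.
read_test() {
    src=$1
    count=${2:-1024}          # 1024 x 1 MiB = 1 GiB by default
    dd if="$src" of=/dev/null bs=1024k count="$count"
}

# Usage (placeholder device name), timed with the shell's own time keyword:
#   time read_test /dev/rhdisk4 2048
```

Run it on both LPARs at a quiet time and compare the elapsed times. A second run over the same blocks may be served from the storage cache, so the first run is the more telling one.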

Your output does not show anything about the current memory situation, including paging space. That would also be visible in nmon output.
I remember that especially RMAN used lots of paging space in my former environment.

Btw.: Please do not just post an update with new formatting but correct the garbage post you have left next time, thanks