Performance problem - waiting on cache

keelba · January 21, 2011, 11:16am

My server is running HP-UX 11.23 and one Oracle database. The server has 8 CPUs and is mostly idle all the time. Buffer cache is set to 10%min/max with 5GB memory on the server.

I have a user complaining that a batch process is all of a sudden taking a long time to finish. The DBA gave me the PID of the offending query and when I look up that process in Glance it shows the reason for waiting is CACHE. I set the update interval to 1 second and it never changes from cache. The process is consuming little CPU but still waits on cache.

Eventually the process will finish and die but it takes much, much longer than it used to just a few days ago. I know of no other processes that exhibit this behavior yet everyday this one process does the same thing, even with a new PID.

My hypothesis is that the server thinks that some data resides in cache but does not so it waits and waits. This doesn't really make sense to me but I cannot think of anything else that might cause this.

I have tried scouring the net for answers and have found none. If someone has any kind of logical guess of what might cause this behavior I'd love to hear it. It might help me figure out what is really going on.

Thanks,
Kevin

methyl · January 21, 2011, 12:45pm

Something to read:
http://h21007.www2.hp.com/portal/download/files/unprot/devresource/Docs/TechPapers/UXPerfCookBook.pdf

What hardware? What discs? Any SANs? Any NFS?
What version of Oracle?

The Oracle DBA should be able to give you detailed statistics for activity of the Oracle database engine and detailed statistics for activity of the Oracle session concerned.
It may just be reading millions of records without using an index ... or something like that. Look very closely for Oracle sorts - these are a notorious bottleneck with default Oracle SGA settings because they use disc if they have not been given enough memory.

Perderabo · January 21, 2011, 1:08pm

Is this a typo... 8 CPU's and 5 GB of memory? 5 GB per CPU for a total of 40 GB? Or really just 5 GB? So your cache is just half a GB? And you're running Oracle?

Well, here's my guess... you need more memory. Lots more memory. We have 4 cpus and 32 GB on our (Linux) Oracle servers. Ask Oracle support to suggest how much memory you need. They will quickly tell you that 5 GB is not enough.

keelba · January 21, 2011, 1:45pm

You're right, it was a typo. I was just checking if you were paying attention. =)
Actually, I was on Glance and got the wrong information as I was talking to the end user about the problem and got my numbers mixed up. There is actually 32GB on the box.

The server is a Superdome. We have an EMC DMX4. No NFS. Oracle is 10.2.0.

In regards to your comments about Oracle and indexing, etc., wouldn't I see high disk usage if this was true? If the process was waiting on data sorts then it should go to sleep correct until Oracle comes back with some data? The disks are not very busy at all. Glance shows 23% busy and sar shows about 6K blks/s total usage. avwait is 0 and avserv is around 2-4.

We are already scheduling a reboot to see if by chance this will clear it up. Maybe it will, maybe it won't, but even if it does, I'd still like to be able to explain what is happening to the higher powers when they asked why we needed an emergency reboot.

Perderabo · January 21, 2011, 2:28pm

32 GB is a much better figure but still a bit low for 8 cpus. And I still say memory is the problem. How much of that is devoted to Oracle's shared memory segments? ipcs -mb should show that.

To read data from a filesystem the data is loaded into cache (unless it's already there) and then delivered to the process. If cache is full a cache buffer must be freed. Once the buffer is free the read can continue. When you write to a file you change the data in a cache buffer. The buffer is said to be "dirty". A dirty buffer must be written to disk before it can be freed. You need more cache buffers and this probably means more memory.

Locking cache at 10% is probably not a great idea. I would try min at 10% and max at 20%. The would give the kernel some more freedom in managing what memory it has.

keelba · January 21, 2011, 2:52pm

T ID KEY MODE OWNER GROUP SEGSZ
Shared Memory:
m 131 0x00000000 --rw-r----- oracle dba 2202075136
m 132 0x00000000 --rw-r----- oracle dba 2197815296
m 133 0x00000000 --rw-r----- oracle dba 1411452928
m 134 0x97604490 --rw-r----- oracle dba 16384

We just got through rebooting the server and ran another process and it is exhibiting the same behavior.

I can look into increasing the DBC but that may be difficult to get through and take some time with our change management.

What's strange is that this very process has been running fine for years. Then, within the past week, it has all of a sudden been acting strangely. I am trying to figure out what has changed but so far it appears nothing has.

Someone recommended I run truss against the process. I've never run truss before so I do not know what to expect from it but it's worth a shot.

Also, I am not certain that what I am seeing in Glance is a problem. I think it is abnormal to see a process waiting on cache but this may be absolutely normal and running the way it should based on the query. The problem could be somewhere else but I have no idea where else to look. The issue we're trying to solve is "a certain query is all of a sudden taking a long time to run".

Thanks everyone for the help so far.

---------- Post updated at 01:52 PM ---------- Previous update was at 01:41 PM ----------

Quick update: I ran the truss (HP's tusc command) and all I see is line after line of:

read *****************
lseek *****************
read *****************
lseek *****************
read *****************
lseek *****************
read *****************
lseek *****************
read *****************
lseek *****************

and on and on and on. I'm inclined to believe now that there is not anything wrong with the server but getting our DBAs to delve deeper can be quite an issue. My job is to prove that nothing is wrong with the server or else find the problem and fix it.

ctsgnb · January 21, 2011, 2:57pm

Check that the query is correctly designed
Check your memory sizing
Check your I/O configuration (asynchronous?)
Check your OS kernel parameters as well as your Oracle parameters

Do you have direct I/O ?
(also have a look at post #29 of this thread )

Any error message in the alert<DB_SID>.log ?

Get in touch with your Oracle DBA and ask him (or get it on your own) , the corresponding Oracle Installation Guide, you will find some sizing requirements as well as rule of thumb and advice for parameter tuning (also for some of your OS kernel parameters).

Any other appli running on that superdom ?

Is this oracle instance a new install or does it run for a long time and suddenly has performance issue ?

Are you running PA-RISC or Itanium?

You might also read this and read carefully the OS-Tuning and Kernel Tuning part on page 18 (and of course also the rest of the document ... )

You might want to work closely with your DBA if you are not familiar with oracle because this kind of troubleshooting may require OS+DBA knowledge.

Also ask your DBA if he has activated the automatic SGA and PGA management (he should, since those functionnalities are available in your oracle version)

keelba · January 21, 2011, 3:42pm

The Superdome is Itanium, sorry I did not specify that earlier. There are no other applications or databases running on this server.

I am trying to work with a DBA to find the cause. The database, server, and batch process have not changed in over a year. The uptime was more than 365 days. This is our pre-production database so, I am told, the data has not grown either.

The more I look at it, the more it looks like everything is running great from the OS perspective, really fast, in fact. It looks like all of the data is loaded up into the dbc and the query is repeatedly hitting it again and again and again. I cannot see the query but I am told that they are reading from several tables trying to build a new table. Again, they've done this nightly without issue until recently. In my book, something has changed but here I am, the Sys Admin, with all fingers pointing at me as if it has something to do with the server.

Thanks again folks, for all of your help. I'll keep digging and looking into your recommendations.

ctsgnb · January 21, 2011, 3:43pm

Oracle Database 10g runs I/O operations from both shared memory and process-private regions using the new HP-UX asynchronous driver. However, I/O operations through the asynchronous driver are not asynchronous in nature. This is because Oracle Database must perform a blocking wait to check the status of I/O operations submitted to the asynchronous driver. This causes some Oracle processes, such as the database writer process, to essentially process synchronous I/O.

This is an extract from this link

seems like asynchronous I/O on HP-UX with Oracle is only supported for raw FS

How many DB writer are currently set up on your DB ?

Few considerations about it you might want to prefer:

a) the FS type hosting the datafile can handle async I/O (oracle param DISK_ASYNCH_IO = true) so the DB writer can let the OS deal with the update (without having to do "wait" events)

b) if your FS type doesn't support async I/O you can setup more than 1 DBWR (DBWR_PROCESSES/db_writer_processes) but since the oracle buffer cache is divided in as much working sets, you might want to adjust the DB_BLOCK_LRU_LATCHES.

c) Last solution, activate the forking to simulate asynch I/O by activating the use of slave processes (DBWn_IO_SLAVES) which are in fact synchronous I/O (DISK_ASYNCH_IO = false). Risks of that last solution are buffer overhead, wait, memory overhead (IPCs & context switch)

Scrutinizer · January 21, 2011, 4:18pm

As I recall waiting for cache in Glance means the process is waiting for the file system buffer cache to be updated, which is normal for buffered IO. Everytime when this was all I could see I and there was hardly any CPU usage and the system seemed to function normally, it turned out that someone had been making changes to a query anyway

ctsgnb · January 21, 2011, 4:25pm

Some product like ODM (Oracle disk manager) i think, can also help to avoid double buffering (OS+DB) and may help to save the some RAM space used by the buffer cache system so that the saved space can for example be reused to increase the database buffer cache.

keelba · January 21, 2011, 4:29pm

Good news! The process always drops the indexes and rebuilds it every day before it starts. In our meeting right now the DBA said she just now rebuilt the indexes and the query finished immediately. At least now we have enough ammunition to point back to the database.

Thank you again everyone for all of your suggestions.

ctsgnb · January 21, 2011, 4:35pm

LoL

... Check that the query is correctly designed

That was the first sentence of my first post

keelba · January 21, 2011, 4:37pm

Well, they've already admitted that the query should be modified to run in parallel mode but they do not want to make this change until they've finished their current testing over the next 4-6 weeks.