Hi,
like to know how I can find out the reason behind a system rebooted.
If the system rebooted by itself --or-- if it was rebooted by a user and user removed the entry from errpt and history.
this thread, we will dedicate to find out HOW/WHO
Hi,
like to know how I can find out the reason behind a system rebooted.
If the system rebooted by itself --or-- if it was rebooted by a user and user removed the entry from errpt and history.
this thread, we will dedicate to find out HOW/WHO
Post output of the commands:
uptime
last reboot
who -b
errpt | head -20
Could a power outage not explain the lack of logging in the matter?
No Scott.
The system should have enough power in reserve to log an EPOW (Emergency Power Off Warning) even in a total power outage, before it goes down.....
If the disk containing the error log file is already inaccessible then the system should have enough power in reserve to log the error to the firmware NVRAM and it will be written out to the AIX error log when power is restored.
You'll see the usual error logging turned on message at boot time and then you'll see an error with a slightly earlier time stamp written out after the error daemon is up an running.
AIX system have been designed this way for many years.
---------- Post updated at 12:10 AM ---------- Previous update was at 12:06 AM ----------
filosophizer,
If the errpt entry was removed there will be a jump in the sequence numbers....
Although AIX often drops one even on a reboot - never understood, or looked into, why.
What does the error report show?
Thanks for the reply.
Here is the output of the commands
root@PRD /> uptime
10:53PM up 5 days, 14:15, 2 users, load average: 4.93, 4.46, 3.76
root@PRD /> errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
E87EF1BE 0527150016 P O dumpcheck The largest dump device is too small.
E87EF1BE 0526150016 P O dumpcheck The largest dump device is too small.
E87EF1BE 0525150016 P O dumpcheck The largest dump device is too small.
E87EF1BE 0524150016 P O dumpcheck The largest dump device is too small.
E87EF1BE 0523150016 P O dumpcheck The largest dump device is too small.
26623394 0523104416 T H fscsi2 COMMUNICATION PROTOCOL ERROR
26623394 0523104316 T H fscsi2 COMMUNICATION PROTOCOL ERROR
26623394 0523104316 T H fscsi2 COMMUNICATION PROTOCOL ERROR
B6267342 0522235516 P H hdisk3 DISK OPERATION ERROR
E87EF1BE 0522150016 P O dumpcheck The largest dump device is too small.
A6DF45AA 0522083916 I O RMCdaemon The daemon is started.
D221BD55 0522083816 I O perftune RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4 0522083816 T H ent1 ETHERNET DOWN
EC0BCCD4 0522083816 T H ent2 ETHERNET DOWN
54E8A127 0522083516 T H sissas0 DEVICE OR MEDIA ERROR
B6267342 0522083516 P H hdisk3 DISK OPERATION ERROR
B6267342 0519220616 P H hdisk3 DISK OPERATION ERROR
root@PRD />
root@PRD /> last reboot
reboot ~ May 22 08:38
reboot ~ May 01 09:37
reboot ~ Apr 21 17:01
reboot ~ Apr 18 19:02
reboot ~ Apr 18 16:50
reboot ~ Apr 18 16:16
reboot ~ Mar 16 06:25
root@PRD /> who -b
. system boot May 22 08:38
IBM How to Investigate a System Reboot - United States
IBM Server running AIX with Oracle RAC reboots itself - United States
The /var/adm/wtmp account file
This binary file is used to store various types of login information. One type of information stored in this file is user login records. These records document the user name and time of login. Pseudo user names are used for shutdown and reboot. So when a system is shut down using one of the shut down commands, a record with the user name shutdown will be logged into the wtmp file. Similarly when a system is booted, a record with the user name reboot will be written into the wtmp file. Some shut down commands have flags that can be used to suppress login records in the wtmp file.
Note: Technically a reboot is a warm boot but the pseudo user name reboot is written into the wtmp file for both warm boots and cold boots.
Here is last command from it:
---------------------------
root pts/2 opmanager May 22 08:54 - 08:54 (00:00)
root pts/1 dbapc May 22 08:49 - 08:55 (00:05)
root pts/1 opmanager May 22 08:49 - 08:49 (00:00)
oracle ftp dbapc May 22 08:45 - 08:55 (00:10)
oracle ftp dbapc May 22 08:45 - 08:45 (00:00)
oracle pts/0 dbapc May 22 08:39 - 10:11 (01:32)
reboot ~ May 22 08:38
root pts/0 opmanager May 20 10:00 - 10:00 (00:00)
root pts/0 opmanager May 20 09:54 - 09:55 (00:00)
root pts/0 opmanager May 20 09:49 - 09:49 (00:00)
I see that right after reboot user oracle from PC: dbapc logged in. It could be possible that Mr. DBAPC was smart enough, did a reboot and removed the entry from errpt by
errpt -J NUMBER
Task is to find out
1- Who did the reboot, (user initiated) or abnormal shutdown (don't think so, there was power outage)
2- RAC reboot -- RAC could have done reboot but again errpt would have captured this
If your system is part of a RAC cluster, then it may be that it was the cluster's decision to evict that particular node. Evicting a node is done as quickly as possible to preserve the integrity of the database. This means: immediate shutdown of the host/node, without logging a shutdown record in /var/adm/wtmp, not calling sync to flush file buffers and not sending processes a SIGTERM.
If your node has been evicted, there should be information about this in the RAC cluster logs on the remaining nodes of the cluster...
Scrutunizer
checking RAC logs
root@PRD /etc/oracle/oprocd> ls -ltra
-rw-r--r-- 1 root system 304 May 22 10:05 prd.oprocd.log.2016-05-22-10:10:59
drwxrwx--- 2 root system 256 May 22 10:11 stop
-rw-r--r-- 1 root system 175 May 22 10:11 prd.oprocd.log
-rwxr--r-- 1 root system 512 May 22 10:11 prd.oprocd.lgl
drwxrwx--- 2 root system 256 May 22 10:11 fatal
drwxrwx--- 2 root system 256 May 22 10:11 check
root@PRD /etc/oracle/oprocd> cat prd.oprocd.log
May 22 10:11:00.148 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 10:11:00.194 | INF | fatal mode startup, setting process to fatal mode
root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd> cat prd.oprocd.log.2016-05-22-10:10:59
May 22 08:39:39.546 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 08:39:39.628 | INF | fatal mode startup, setting process to fatal mode
May 22 10:05:26.457 | INF | shutting down from client request
May 22 10:05:26.457 | INF | exiting current process in NORMAL mode
But, shouldn't the timestamp on the RAC logs be before the reboot, the timestamps certainly are after the reboot
if you notice AIX logs: reboot happened at 8:38 where as the RAC log is 10:11
a 2 hour gap, ?!
Am I reading it correct ?
Actually this is something you should investigate: There are disk errors noted prior to the boot (at 8:35) and two network interfacces going down (at 8:38). If the system is part of an HACMP cluster it might well be that HACMP did it (the so-called "dead-man-switch") to preserve cluster integrity.
For reasons similar to what has been said about RAC (which is a "cluster" on application level) HACMP doesn't take "no" for an answer when downing systems either. halt -q
is perhaps the most gentle and longest drawn-out method it employs.
I hope this helps.
bakunin
bakunin, thanks for the reply.
It is not part of HACMP cluster. Only RAC
hdisk3 -- that is a disk which is a free disk / which has some bad sectors, but it should not let the server reboot.
I am suspecting that a user did the reboot and removed the entry from error report, and trying to search the logs.
So your node is called "prd" ? How many nodes are in the RAC cluster? Or is it a one-node cluster? Did you check the logs for this node on the other nodes? I doubt you will find any information on the node itself, you must check the logs on the other nodes, the nodes that did not go down...
Scrutunizer
This is a 2 node RAC (version 10) , you mentioned that I should check logs on the other node which is called DR ;
what logs or what can I look into ?
below is something I thought of posting but, any suggestions would be appreciated.
root@DR /etc/oracle/oprocd> ls -ltra
-rw-r--r-- 1 root system 175 Mar 17 00:09 dr.oprocd.log.2016-03-17-00:19:23
-rw-r--r-- 1 root system 175 Mar 17 00:19 dr.oprocd.log.2016-03-21-23:36:51
-rw-r--r-- 1 root system 175 Mar 21 23:36 dr.oprocd.log.2016-03-22-00:32:19
-rw-r--r-- 1 root system 863 Apr 03 09:45 dr.oprocd.log.2016-04-03-09:46:36
-rw-r--r-- 1 root system 304 Apr 24 14:50 dr.oprocd.log.2016-04-24-15:21:08
-rw-r--r-- 1 root system 175 Apr 24 15:21 dr.oprocd.log.2016-04-24-16:15:52
-rw-r--r-- 1 root system 304 Apr 25 11:55 dr.oprocd.log.2016-04-25-11:56:46
drwxrwx--- 2 root system 256 Apr 25 11:56 stop
-rw-r--r-- 1 root system 175 Apr 25 11:56 dr.oprocd.log
-rwxr--r-- 1 root system 512 Apr 25 11:56 dr.oprocd.lgl
drwxrwx--- 2 root system 256 Apr 25 11:56 fatal
drwxrwx--- 2 root system 256 Apr 25 11:56 check
drwxrwxr-x 5 root system 12288 Apr 25 11:56 .
root@DR /etc/oracle/oprocd> errpt | more
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
B6267342 0529235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0529150016 P O dumpcheck The largest dump device is too small.
B6267342 0528235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0528150016 P O dumpcheck The largest dump device is too small.
B6267342 0527235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0527150016 P O dumpcheck The largest dump device is too small.
B6267342 0526235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0526150016 P O dumpcheck The largest dump device is too small.
B6267342 0525235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0525150016 P O dumpcheck The largest dump device is too small.
B6267342 0524235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0524150016 P O dumpcheck The largest dump device is too small.
B6267342 0523235516 P H hdisk32 DISK OPERATION ERROR
E87EF1BE 0523150016 P O dumpcheck The largest dump device is too small.
B6267342 0523132516 P H hdisk32 DISK OPERATION ERROR
B6267342 0523114216 P H hdisk32 DISK OPERATION ERROR
B6267342 0523113916 P H hdisk32 DISK OPERATION ERROR
26623394 0523113616 T H fscsi2 COMMUNICATION PROTOCOL ERROR
26623394 0523113616 T H fscsi2 COMMUNICATION PROTOCOL ERROR
F3931284 0523112616 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0523112616 T H ent2 ETHERNET DOWN
F3931284 0523112316 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0523112316 T H ent2 ETHERNET DOWN
F3931284 0523112016 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0523112016 T H ent2 ETHERNET DOWN
F3931284 0523112016 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0523112016 T H ent2 ETHERNET DOWN
54E8A127 0523101216 T H sissas0 DEVICE OR MEDIA ERROR
E87EF1BE 0522150016 P O dumpcheck The largest dump device is too small.
F3931284 0522083816 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0522083816 T H ent2 ETHERNET DOWN
8650BE3F 0522083816 I H ent9 ETHERCHANNEL RECOVERY
F3931284 0522083816 I H ent0 ETHERNET NETWORK RECOVERY MODE
59224136 0522083816 P H ent9 ETHERCHANNEL FAILOVER
EC0BCCD4 0522083816 T H ent0 ETHERNET DOWN
F3931284 0522083416 I H ent2 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0522083416 T H ent2 ETHERNET DOWN
F3931284 0522083416 I H ent0 ETHERNET NETWORK RECOVERY MODE
EC0BCCD4 0522083416 T H ent0 ETHERNET DOWN
F3931284 0522083316 I H ent0 ETHERNET NETWORK RECOVERY MODE
F3931284 0522083316 I H ent2 ETHERNET NETWORK RECOVERY MODE
E87EF1BE 0521150016 P O dumpcheck The largest dump device is too small.
E87EF1BE 0520150016 P O dumpcheck The largest dump device is too small.
B50A3F81 0520104416 P H ent9 TOTAL ETHERCHANNEL FAILURE
EC0BCCD4 0520104416 T H ent2 ETHERNET DOWN
EC0BCCD4 0520104416 T H ent0 ETHERNET DOWN
F3931284 0520101316 I H ent0 ETHERNET NETWORK RECOVERY MODE
59224136 0520101316 P H ent9 ETHERCHANNEL FAILOVER
F3931284 0520101316 I H ent2 ETHERNET NETWORK RECOVERY MODE
1788894A 0520101316 P H ent9 ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4 0520101316 T H ent0 ETHERNET DOWN
8650BE3F 0520101316 I H ent9 ETHERCHANNEL RECOVERY
EC0BCCD4 0520101316 T H ent2 ETHERNET DOWN
F3931284 0520101316 I H ent0 ETHERNET NETWORK RECOVERY MODE
59224136 0520101316 P H ent9 ETHERCHANNEL FAILOVER
F3931284 0520101316 I H ent2 ETHERNET NETWORK RECOVERY MODE
1788894A 0520101316 P H ent9 ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4 0520101316 T H ent0 ETHERNET DOWN
8650BE3F 0520101316 I H ent9 ETHERCHANNEL RECOVERY
EC0BCCD4 0520101316 T H ent2 ETHERNET DOWN
F3931284 0520101016 I H ent0 ETHERNET NETWORK RECOVERY MODE
59224136 0520101016 P H ent9 ETHERCHANNEL FAILOVER
F3931284 0520101016 I H ent2 ETHERNET NETWORK RECOVERY MODE
1788894A 0520100516 P H ent9 ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4 0520100516 T H ent0 ETHERNET DOWN
EC0BCCD4 0520100516 T H ent2 ETHERNET DOWN
root@DR /etc/oracle/oprocd> uptime
08:48AM up 35 days, 16:33, 3 users, load average: 3.73, 3.59, 3.58
root@DR /etc/oracle/oprocd> last reboot
reboot ~ Apr 24 16:14
reboot ~ Apr 24 15:20
Read the first answer from this thread:
Changing host Time backwards - two node RAC | Oracle Community
According to what you've pasted from Oracle logs, your clock was 2 hours ahead and then it was changed somehow. Or at least Oracle thought, that it was changed 2 hours back. Such situation, even if the clock would be 1 second back, leads to server reboot by Oracle Clusterware. Even on AIX.
Thank You agent.kgb,
very interesting post, I checked the link which you have posted
It appears, in my case, that someone played with the time, intentionally, otherwise how would this happen? RAC has been running for 3 years and it never happened, suddenly it happens and that too, 2 hours difference.
Now, I will have to find out if time was changed on the AIX by user.
I suppose the right term for it is unintentionally - duh! Well, joking aside, it might well be that someone not aware of that (actually i wouldn't have known that either, kudos to agent.kgb!) has changed the system time and caused the reboot.
Usually, if you use NTP, the time adjustment is set to "slew", which means that if time is off it is slowly adjusted, either by stretching the ticking of the clock out or by compressing it, so that the right time is eventually reached but now "jumps" occur. If your NTP setting is different, it might have adjusted an off time immediately and this way caused the reboot. Look at /etc/ntp.conf
to see what your NTP client is set to.
Here is some additional information about configuring NTP.
I hope this helps.
bakunin
looking at what Bakunin suggested
RAC Node PRD -- which rebooted by itself
root@PRD /> cat /etc/ntp.conf
# @(#)48 1.2 src/tcpip/etc/ntp.conf, ntp, tcpip610 2/16/96 10:16:34
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# tcpip610 src/tcpip/etc/ntp.conf 1.2
#
# Licensed Materials - Property of IBM
#
# Restricted Materials of IBM
#
#
#
# Default NTP configuration file.
#
# Broadcast client, no authentication.
#
broadcastclient
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace
root@PRD />
RAC Node DR
root@DR /> cat /etc/ntp.conf
broadcastclient
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace
root@DR />