System rebooted itself but errpt did not catch

filosophizer · May 24, 2016, 3:05pm

Hi,

like to know how I can find out the reason behind a system rebooted.

If the system rebooted by itself --or-- if it was rebooted by a user and user removed the entry from errpt and history.

this thread, we will dedicate to find out HOW/WHO

agent.kgb · May 24, 2016, 3:32pm

Post output of the commands:

uptime
last reboot
who -b
errpt | head -20

Scott · May 24, 2016, 3:35pm

Could a power outage not explain the lack of logging in the matter?

dukessd · May 24, 2016, 7:10pm

No Scott.
The system should have enough power in reserve to log an EPOW (Emergency Power Off Warning) even in a total power outage, before it goes down.....
If the disk containing the error log file is already inaccessible then the system should have enough power in reserve to log the error to the firmware NVRAM and it will be written out to the AIX error log when power is restored.
You'll see the usual error logging turned on message at boot time and then you'll see an error with a slightly earlier time stamp written out after the error daemon is up an running.
AIX system have been designed this way for many years.

---------- Post updated at 12:10 AM ---------- Previous update was at 12:06 AM ----------

filosophizer,
If the errpt entry was removed there will be a jump in the sequence numbers....
Although AIX often drops one even on a reboot - never understood, or looked into, why.
What does the error report show?

filosophizer · May 27, 2016, 4:00pm

Thanks for the reply.

Here is the output of the commands

root@PRD /> uptime
  10:53PM   up 5 days,  14:15,  2 users,  load average: 4.93, 4.46, 3.76

root@PRD /> errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E87EF1BE   0527150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0526150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0525150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0524150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0523150016 P O dumpcheck      The largest dump device is too small.
26623394   0523104416 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523104316 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523104316 T H fscsi2         COMMUNICATION PROTOCOL ERROR
B6267342   0522235516 P H hdisk3         DISK OPERATION ERROR
E87EF1BE   0522150016 P O dumpcheck      The largest dump device is too small.
A6DF45AA   0522083916 I O RMCdaemon      The daemon is started.
D221BD55   0522083816 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4   0522083816 T H ent1           ETHERNET DOWN
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
54E8A127   0522083516 T H sissas0        DEVICE OR MEDIA ERROR
B6267342   0522083516 P H hdisk3         DISK OPERATION ERROR
B6267342   0519220616 P H hdisk3         DISK OPERATION ERROR
root@PRD />

root@PRD /> last reboot
reboot    ~                                   May 22 08:38
reboot    ~                                   May 01 09:37
reboot    ~                                   Apr 21 17:01
reboot    ~                                   Apr 18 19:02
reboot    ~                                   Apr 18 16:50
reboot    ~                                   Apr 18 16:16
reboot    ~                                   Mar 16 06:25

root@PRD /> who -b
   .        system boot May 22 08:38

IBM How to Investigate a System Reboot - United States
IBM Server running AIX with Oracle RAC reboots itself - United States
The /var/adm/wtmp account file
This binary file is used to store various types of login information. One type of information stored in this file is user login records. These records document the user name and time of login. Pseudo user names are used for shutdown and reboot. So when a system is shut down using one of the shut down commands, a record with the user name shutdown will be logged into the wtmp file. Similarly when a system is booted, a record with the user name reboot will be written into the wtmp file. Some shut down commands have flags that can be used to suppress login records in the wtmp file.

Note: Technically a reboot is a warm boot but the pseudo user name reboot is written into the wtmp file for both warm boots and cold boots.

Here is last command from it:


---------------------------

root      pts/2        opmanager       May 22 08:54 - 08:54  (00:00)
root      pts/1        dbapc                  May 22 08:49 - 08:55  (00:05)
root      pts/1        opmanager       May 22 08:49 - 08:49  (00:00)
oracle    ftp          dbapc                  May 22 08:45 - 08:55  (00:10)
oracle    ftp          dbapc                  May 22 08:45 - 08:45  (00:00)
oracle    pts/0        dbapc                  May 22 08:39 - 10:11  (01:32)
reboot    ~                                   May 22 08:38 
root      pts/0        opmanager       May 20 10:00 - 10:00  (00:00)
root      pts/0        opmanager       May 20 09:54 - 09:55  (00:00)
root      pts/0        opmanager       May 20 09:49 - 09:49  (00:00)

I see that right after reboot user oracle from PC: dbapc logged in. It could be possible that Mr. DBAPC was smart enough, did a reboot and removed the entry from errpt by

 errpt -J NUMBER

Task is to find out
1- Who did the reboot, (user initiated) or abnormal shutdown (don't think so, there was power outage)

2- RAC reboot -- RAC could have done reboot but again errpt would have captured this

Scrutinizer · May 28, 2016, 1:44am

If your system is part of a RAC cluster, then it may be that it was the cluster's decision to evict that particular node. Evicting a node is done as quickly as possible to preserve the integrity of the database. This means: immediate shutdown of the host/node, without logging a shutdown record in /var/adm/wtmp, not calling sync to flush file buffers and not sending processes a SIGTERM.

If your node has been evicted, there should be information about this in the RAC cluster logs on the remaining nodes of the cluster...

filosophizer · May 28, 2016, 6:32am

Scrutunizer

checking RAC logs

root@PRD /etc/oracle/oprocd> ls -ltra

-rw-r--r--    1 root     system          304 May 22 10:05 prd.oprocd.log.2016-05-22-10:10:59
drwxrwx---    2 root     system          256 May 22 10:11 stop
-rw-r--r--    1 root     system          175 May 22 10:11 prd.oprocd.log
-rwxr--r--    1 root     system          512 May 22 10:11 prd.oprocd.lgl
drwxrwx---    2 root     system          256 May 22 10:11 fatal
drwxrwx---    2 root     system          256 May 22 10:11 check


root@PRD /etc/oracle/oprocd> cat prd.oprocd.log
May 22 10:11:00.148 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 10:11:00.194 | INF | fatal mode startup, setting process to fatal mode

root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd> cat prd.oprocd.log.2016-05-22-10:10:59
May 22 08:39:39.546 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 08:39:39.628 | INF | fatal mode startup, setting process to fatal mode
May 22 10:05:26.457 | INF | shutting down from client request
May 22 10:05:26.457 | INF | exiting current process in NORMAL mode

But, shouldn't the timestamp on the RAC logs be before the reboot, the timestamps certainly are after the reboot

if you notice AIX logs: reboot happened at 8:38 where as the RAC log is 10:11

a 2 hour gap, ?!

Am I reading it correct ?

bakunin · May 28, 2016, 8:24am

filosophizer:

D221BD55   0522083816 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4   0522083816 T H ent1           ETHERNET DOWN
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
54E8A127   0522083516 T H sissas0        DEVICE OR MEDIA ERROR

Actually this is something you should investigate: There are disk errors noted prior to the boot (at 8:35) and two network interfacces going down (at 8:38). If the system is part of an HACMP cluster it might well be that HACMP did it (the so-called "dead-man-switch") to preserve cluster integrity.

For reasons similar to what has been said about RAC (which is a "cluster" on application level) HACMP doesn't take "no" for an answer when downing systems either. halt -q is perhaps the most gentle and longest drawn-out method it employs.

I hope this helps.

bakunin

filosophizer · May 29, 2016, 5:53am

bakunin, thanks for the reply.

It is not part of HACMP cluster. Only RAC

hdisk3 -- that is a disk which is a free disk / which has some bad sectors, but it should not let the server reboot.

I am suspecting that a user did the reboot and removed the entry from error report, and trying to search the logs.

Scrutinizer · May 29, 2016, 6:49am

So your node is called "prd" ? How many nodes are in the RAC cluster? Or is it a one-node cluster? Did you check the logs for this node on the other nodes? I doubt you will find any information on the node itself, you must check the logs on the other nodes, the nodes that did not go down...

filosophizer · May 30, 2016, 1:35pm

Scrutunizer

This is a 2 node RAC (version 10) , you mentioned that I should check logs on the other node which is called DR ;

what logs or what can I look into ?

below is something I thought of posting but, any suggestions would be appreciated.

root@DR /etc/oracle/oprocd> ls -ltra
-rw-r--r--    1 root     system          175 Mar 17 00:09 dr.oprocd.log.2016-03-17-00:19:23
-rw-r--r--    1 root     system          175 Mar 17 00:19 dr.oprocd.log.2016-03-21-23:36:51
-rw-r--r--    1 root     system          175 Mar 21 23:36 dr.oprocd.log.2016-03-22-00:32:19
-rw-r--r--    1 root     system          863 Apr 03 09:45 dr.oprocd.log.2016-04-03-09:46:36
-rw-r--r--    1 root     system          304 Apr 24 14:50 dr.oprocd.log.2016-04-24-15:21:08
-rw-r--r--    1 root     system          175 Apr 24 15:21 dr.oprocd.log.2016-04-24-16:15:52
-rw-r--r--    1 root     system          304 Apr 25 11:55 dr.oprocd.log.2016-04-25-11:56:46
drwxrwx---    2 root     system          256 Apr 25 11:56 stop
-rw-r--r--    1 root     system          175 Apr 25 11:56 dr.oprocd.log
-rwxr--r--    1 root     system          512 Apr 25 11:56 dr.oprocd.lgl
drwxrwx---    2 root     system          256 Apr 25 11:56 fatal
drwxrwx---    2 root     system          256 Apr 25 11:56 check
drwxrwxr-x    5 root     system        12288 Apr 25 11:56 .


root@DR /etc/oracle/oprocd> errpt | more

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
B6267342   0529235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0529150016 P O dumpcheck      The largest dump device is too small.
B6267342   0528235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0528150016 P O dumpcheck      The largest dump device is too small.
B6267342   0527235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0527150016 P O dumpcheck      The largest dump device is too small.
B6267342   0526235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0526150016 P O dumpcheck      The largest dump device is too small.
B6267342   0525235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0525150016 P O dumpcheck      The largest dump device is too small.
B6267342   0524235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0524150016 P O dumpcheck      The largest dump device is too small.
B6267342   0523235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0523150016 P O dumpcheck      The largest dump device is too small.
B6267342   0523132516 P H hdisk32        DISK OPERATION ERROR
B6267342   0523114216 P H hdisk32        DISK OPERATION ERROR
B6267342   0523113916 P H hdisk32        DISK OPERATION ERROR
26623394   0523113616 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523113616 T H fscsi2         COMMUNICATION PROTOCOL ERROR
F3931284   0523112616 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112616 T H ent2           ETHERNET DOWN
F3931284   0523112316 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112316 T H ent2           ETHERNET DOWN
F3931284   0523112016 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112016 T H ent2           ETHERNET DOWN
F3931284   0523112016 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112016 T H ent2           ETHERNET DOWN
54E8A127   0523101216 T H sissas0        DEVICE OR MEDIA ERROR
E87EF1BE   0522150016 P O dumpcheck      The largest dump device is too small.
F3931284   0522083816 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
8650BE3F   0522083816 I H ent9           ETHERCHANNEL RECOVERY
F3931284   0522083816 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0522083816 P H ent9           ETHERCHANNEL FAILOVER
EC0BCCD4   0522083816 T H ent0           ETHERNET DOWN
F3931284   0522083416 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083416 T H ent2           ETHERNET DOWN
F3931284   0522083416 I H ent0           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083416 T H ent0           ETHERNET DOWN
F3931284   0522083316 I H ent0           ETHERNET NETWORK RECOVERY MODE
F3931284   0522083316 I H ent2           ETHERNET NETWORK RECOVERY MODE
E87EF1BE   0521150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0520150016 P O dumpcheck      The largest dump device is too small.
B50A3F81   0520104416 P H ent9           TOTAL ETHERCHANNEL FAILURE
EC0BCCD4   0520104416 T H ent2           ETHERNET DOWN
EC0BCCD4   0520104416 T H ent0           ETHERNET DOWN
F3931284   0520101316 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101316 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101316 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520101316 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520101316 T H ent0           ETHERNET DOWN
8650BE3F   0520101316 I H ent9           ETHERCHANNEL RECOVERY
EC0BCCD4   0520101316 T H ent2           ETHERNET DOWN
F3931284   0520101316 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101316 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101316 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520101316 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520101316 T H ent0           ETHERNET DOWN
8650BE3F   0520101316 I H ent9           ETHERCHANNEL RECOVERY
EC0BCCD4   0520101316 T H ent2           ETHERNET DOWN
F3931284   0520101016 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101016 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101016 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520100516 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520100516 T H ent0           ETHERNET DOWN
EC0BCCD4   0520100516 T H ent2           ETHERNET DOWN

root@DR /etc/oracle/oprocd> uptime
  08:48AM   up 35 days,  16:33,  3 users,  load average: 3.73, 3.59, 3.58

root@DR /etc/oracle/oprocd> last reboot
reboot    ~                                   Apr 24 16:14
reboot    ~                                   Apr 24 15:20

agent.kgb · May 30, 2016, 2:20pm

Read the first answer from this thread:

Changing host Time backwards - two node RAC | Oracle Community

According to what you've pasted from Oracle logs, your clock was 2 hours ahead and then it was changed somehow. Or at least Oracle thought, that it was changed 2 hours back. Such situation, even if the clock would be 1 second back, leads to server reboot by Oracle Clusterware. Even on AIX.

filosophizer · May 30, 2016, 3:43pm

Thank You agent.kgb,
very interesting post, I checked the link which you have posted

It appears, in my case, that someone played with the time, intentionally, otherwise how would this happen? RAC has been running for 3 years and it never happened, suddenly it happens and that too, 2 hours difference.

Now, I will have to find out if time was changed on the AIX by user.

bakunin · May 30, 2016, 4:20pm

I suppose the right term for it is unintentionally - duh! Well, joking aside, it might well be that someone not aware of that (actually i wouldn't have known that either, kudos to agent.kgb!) has changed the system time and caused the reboot.

Usually, if you use NTP, the time adjustment is set to "slew", which means that if time is off it is slowly adjusted, either by stretching the ticking of the clock out or by compressing it, so that the right time is eventually reached but now "jumps" occur. If your NTP setting is different, it might have adjusted an off time immediately and this way caused the reboot. Look at /etc/ntp.conf to see what your NTP client is set to.

Here is some additional information about configuring NTP.

I hope this helps.

bakunin

filosophizer · May 30, 2016, 5:27pm

looking at what Bakunin suggested

RAC Node PRD -- which rebooted by itself

root@PRD /> cat /etc/ntp.conf
# @(#)48        1.2  src/tcpip/etc/ntp.conf, ntp, tcpip610 2/16/96 10:16:34
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# tcpip610 src/tcpip/etc/ntp.conf 1.2
#
# Licensed Materials - Property of IBM
#
# Restricted Materials of IBM
#
#
#
# Default NTP configuration file.
#
#   Broadcast client, no authentication.
#
broadcastclient
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace
root@PRD />

RAC Node DR

root@DR /> cat /etc/ntp.conf
broadcastclient
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace
root@DR />