WTMPX File corrupted

Hi All

I work on Solaris 8, 9 and 10 platforms and have encountered a problem: my wtmpx files appear to be corrupted, as all entries contain the date 1970 (the Unix epoch).

Now this is obviously not the case, so my query is:

1 - Can the existing wtmpx files be manipulated to provide correct dates?

2 - Is it possible to recreate the wtmpx files with historic data?

3 - Is this a known bug, and is there a known fix?

Many thanks for your anticipated support.

I have seen data "messed up" on boxes that have excessively large utmpx or wtmpx files.
Or there are disk space issues.

There is no known bug (or patch) for this problem (AFAIK), so I would look at your file sizes, and maybe use dd to rescue the good data from earlier in the file.

The utmpd daemon writes to utmpx and wtmpx, so you can stop the daemon for a minute while you rename the old files.
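A rough sketch of that, assuming /etc/init.d/utmpd exists on your Solaris 8/9 build (Solaris 10 uses svcadm, shown further down the thread); copying the file aside and truncating it, rather than renaming it, keeps the permissions intact:

/etc/init.d/utmpd stop
cp -p /var/adm/wtmpx /var/adm/wtmpx.save
> /var/adm/wtmpx
/etc/init.d/utmpd start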

Thanks Jim

The method I use to read the wtmpx file is:

/usr/lib/acct/fwtmp < /var/adm/wtmpx

Piping the output to screen or to a file gives very different results.

Piping to screen shows the dates as 1970 (with no recent entries), whereas piping to a file gives plenty of white space.

You mention using dd; how would I use it, and what does it do?

Regards

How big is the wtmpx file? If it is very near 2 GB, it is too big.

ls -la /var/adm/wtmpx

Is (or was) the disc partition containing the wtmpx file full?
Is the output from "who -u" correct (it comes from the utmpx file) ?
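For example, to check those two points (default /var/adm location assumed):

df -k /var/adm        # any filesystem holding wtmpx at or near 100% capacity?
who -u                # this comes from utmpx - are the times sane?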

There is a known bug in Solaris if the hardware realtime clock was wrong (or had a flat battery) at the time the system was booted. Correcting the time with the unix "date" command is not enough.

"Piping to a file"? sounds unlikely.
Something like this (assuming /var/tmp has enough space).

/usr/lib/acct/fwtmp < /var/adm/wtmpx  > /var/tmp/temp_wtmpx
more /var/tmp/temp_wtmpx

Then look at the early records to see how long this file has been in use.
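For example (assuming the export above completed):

head -5 /var/tmp/temp_wtmpx     # oldest records - shows when this file started
tail -5 /var/tmp/temp_wtmpx     # newest records - should show recent, sane dates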

There is no automatic cleanup process, you need to write one.
I normally start a new wtmpx file each week and keep 8 weeks worth.
Some sites would require more history, but bear in mind that once a wtmpx file gets to be a year old, the output from commands like "last" can be nonsense.

When starting a new file, copy the old file to a new name with "cp -p" and then null the wtmpx file (only) with ">wtmpx". Do not rename the wtmpx file or you will have to create a new file and fix the permissions and stop/start the accounting process.
Do not touch the utmpx file. If you null that one you will have to reboot the server to fix the problem.
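A minimal sketch of that weekly routine, run from cron (the dated naming and the 56-day cutoff are just my choices, not a standard):

cd /var/adm
cp -p wtmpx wtmpx.`date +%Y%m%d`
> wtmpx
find /var/adm -name 'wtmpx.[0-9]*' -mtime +56 -exec rm {} \;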

I've used dd to read "records" up to a certain point in the old file and move them to a new file, discarding the rest.

/usr/lib/acct/wtmpfix - this will attempt to correct the dates, but it does not always work. The fact that it exists testifies to wtmpx corruption being an old problem.
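For the record, it reads wtmpx-format input and writes a corrected copy to stdout, something like this (the .fixed name is just an example):

/usr/lib/acct/wtmpfix /var/adm/wtmpx > /var/adm/wtmpx.fixed
/usr/lib/acct/fwtmp < /var/adm/wtmpx.fixed | more    # eyeball the dates before putting it in place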

methyl is right about rotating accounting files, very important to do.

I made up the 100000 in the dd example below; you have to determine the point where the data in the file goes south and can't be fixed:

cd /usr/lib/acct
# print each record number followed by its last six fields (the timestamp),
# so you can spot where the dates go bad
fwtmp < /var/adm/wtmpx |
    awk '{z=NF; printf("%7d ", FNR); for(i=z-5; i<=z; i++) {printf("%s ", $i)}; print ""}'
# Solaris 10 uses svcadm to control utmpd
svcadm disable svc:/system/utmp:default
dd bs=372 count=100000 if=/var/adm/wtmpx of=/var/adm/wtmpx.new
cp /var/adm/wtmpx.new /var/adm/wtmpx
svcadm enable svc:/system/utmp:default
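Before that final cp it is worth sanity-checking the truncated copy, e.g.:

fwtmp < /var/adm/wtmpx.new | tail -20     # the last records kept should show sane dates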

You want to keep as many records as possible.

And I don't think it is a bug, per se. Sun used to explicitly tell you to rotate accounting logs to avoid corruption. And fwtmp was made for sysadmins who did not read that warning; I guess they got tired of hearing about it. :-)


I've tried to avoid mentioning "wtmpfix" because it is a very specific repair tool and not really relevant to this problem. Historically I have run "wtmpfix" every day on a system which was feeding data to a commercial stats package.

I have used "fwtmp" to export a wtmp file for basic repairs following a power fail and a few more times when a computer has been started with the clock set incorrectly or ran out of disc space. I would normally copy the old file and start a new file before attempting any repair. If the length of history actually matters, you can combine the repaired file with the current file. It's easier to have a script which runs "last" on current and saved files.

I don't rely on wtmp for long-term "last login" history and prefer keeping a brief rolling history in the user's home directory. In fact, for me the main use of wtmp is for basic weekly server usage statistics tabulated by IP network. I also maintain the "btmp" files in parallel and check them automatically for basic hack attempts.

In my experience, a corrupt wtmpx (or wtmp) file is usually due to a write to the file being interrupted in the middle of a record. This means that log entries after this event are shifted by a number of bytes that is not a whole record.
The file has fixed-length records. When reading a corrupted file from the start, there is somewhere a record which is shorter than the record length, and the reading program gets out of sync with the records.
So the way to fix the file is to find and remove the incomplete record. This can be done in a binary-capable editor such as Emacs (I have used that), where you look for recurring patterns to find the start of records, and when you find the short record you remove it and save the file. Formatting the file with fwtmp will help you find the number of records you need to pass before reaching the faulty one.
Possibly a simpler method would be to use dd in intelligent ways to first read the uncorrupted part of the file and then skip an offset of a number of bytes until you get output which can be formatted correctly by fwtmp.
What I am getting at is that you don't have to throw away the last part of the file, the information can be recovered by using my method.
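A hedged sketch of that dd approach, with made-up numbers purely for illustration: say the fwtmp output goes bad at record 5000 (records are 372 bytes each, so the bad record starts at byte 4999 * 372 = 1859628), and by trial and error you find the good data resumes 120 bytes further in, at byte 1859748:

dd bs=372 count=4999 if=/var/adm/wtmpx of=/var/tmp/wtmpx.head
dd bs=1 skip=1859748 if=/var/adm/wtmpx of=/var/tmp/wtmpx.tail    # slow with bs=1, but exact
cat /var/tmp/wtmpx.head /var/tmp/wtmpx.tail > /var/tmp/wtmpx.repaired
/usr/lib/acct/fwtmp < /var/tmp/wtmpx.repaired | more    # both halves should now format with sane dates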