AIX Health Check

Adnans2k · June 25, 2015, 3:35pm

Hi everyone, I am new to the Unix admin position and need some help. My management wants a report on how our overall AIX servers / environment are doing. I've been researching and found multiple commands to run on each LPAR. I have a few questions and also wanted to share the commands I'm running: are these commands enough to show the environment is doing well, or should I do something different? Also, I'm not a scripter, but is there a way I can somehow generate a script to automate this task on every LPAR in the system? (Our system is very small, maybe 40 LPARs.)

Please let me know if I should provide more details to maybe get a better response.

Thanks in advance

commands gathered so far:-

topas
svmon -G -O unit=GB
nmon then press n
df -g
netstat -rn
lsvg -p rootvg
vmstat
iostat

bakunin · June 25, 2015, 5:57pm

That is a pretty wide field you are plowing there. Systems administration is not so much a question of doing something but of describing, painstakingly exactly, what has to be done. I'll be glad to provide commands for everything that needs to be done, but let us first discuss what you mean when you say "how ... is doing".

Many of the commands you quoted are related to performance issues. You might want to read a little introduction to this for a discussion of what "performance" is. But the question is: do you think performance issues need constant monitoring? Are your systems that mission-critical performance-wise?

Many systems are in fact not. They need to run and might need to finish certain tasks on time, but whether they finish such a task half an hour earlier or later won't even be noticed. Most performance issues are in fact driven by the (complaining) customer. You don't need to monitor the systems in this respect at all: once they are too slow, the customers will tell you. Further, to some extent you can trust the colleagues who set up the systems to have sized them more or less correctly for the respective purpose. (Now, this is not always the case, but in a well-cared-for shop it mostly is. If you work in one which is not: don't try to develop monitoring, get out of there while you can!)

After talking so much about things you don't have to care for (daily), here are a few things you do have to monitor: things which regularly (in my experience) happen and are showstoppers:

Full file systems: this happens with a certain regularity and the upshot ranges from annoying to fatal. Get a full root-fs and AIX starts to throw fits. Get a full /tmp -fs and ksh (at least ksh93) produces unusual hiccups. Have a full /var and printing, spooling, job scheduling and much more will mostly not work any more (it might not even be possible to log on to the system because /var/adm/wtmp cannot be written to - i had this once). Even more troublesome is when the FS with the archive logs for the database is full. "Archiver stuck" makes the Oracle database stand more or less still, doing nothing while grabbing every ounce of processor- and memory-resources there is until the machine finally crashes.
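
A check for full file systems can be one of those few-liners. The following is only a minimal sketch, not a finished monitor: the function name fs_over_threshold and the default limit of 90 percent are made up for illustration, and it assumes the POSIX "df -P" layout, where the fifth column is the usage in percent and the sixth is the mount point (mount points containing blanks would break it).

```shell
#!/usr/bin/ksh
# Sketch: list file systems at or above a usage limit.
# Reads "df -P" style output on stdin; the limit (default 90) is an assumption.
fs_over_threshold() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR == 1 { next }                     # skip the header line
        {
            pct = $5
            sub(/%/, "", pct)                # "96%" becomes 96
            if (pct + 0 >= limit)
                printf "%s %s%%\n", $6, pct  # mount point and usage
        }'
}
```

Called as "df -P | fs_over_threshold 90" it prints one line per file system that needs attention and nothing at all when everything is fine, which makes it easy to run from cron.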

Application not running: You might not like the idea but some application programs are just Serious-/Hardworking-/Ideal-/Thorough- -ly programmed, if you know what i mean. There are memory sinks which make it necessary to restart them regularly, there are processes that exit without so much as an error message, and all other sorts of nightmares you can imagine - and then some. Monitoring an application usually means checking whether certain processes are running (sometimes whether a certain number of them are running) and raising an alarm if this is not the case.
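
Such a process check could look roughly like the sketch below. The function name proc_count_ok is invented for this example, and it assumes one command name per input line, as produced by "ps -e -o comm=" (if your ps does not accept the "=" to suppress the header, the header line simply never matches).

```shell
#!/usr/bin/ksh
# Sketch: alarm if fewer than a minimum number of processes with a given
# command name are running. Reads one command name per line on stdin.
proc_count_ok() {
    name=$1
    min=${2:-1}                              # default: at least one instance
    found=$(awk -v n="$name" '$1 == n { c++ } END { print c + 0 }')
    if [ "$found" -lt "$min" ]; then
        printf 'ALERT: %s has %s running, expected at least %s\n' \
            "$name" "$found" "$min"
        return 1
    fi
    return 0
}
```

A typical call would be "ps -e -o comm= | proc_count_ok sshd 1". The exit status tells a calling script whether to raise the alarm.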

Network-/Disk-errors: you might wonder why i mix up such seemingly different areas, but the differences between SAN-services and LAN-services are starting to blur and the two are beginning to grow together. In a shop your size you probably have no physical disks any more but some sort of SAN box providing the storage. Some fabrics are notorious for losing paths temporarily (i remember this being the case with AIX 5.3 and Hitachi storage - ultimately an AIX FC-driver problem). Depending on your precise setup it might be a good idea to test the network connection to some vital partners and to check the connectivity to the disks.
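
Both checks can be sketched in a few lines. This assumes the default "lspath" output of three columns (status, device, parent adapter) and treats every status other than "Enabled" as a problem; the function names bad_paths and host_reachable are made up for illustration.

```shell
#!/usr/bin/ksh
# Sketch: report every disk path whose status is not "Enabled".
# Reads "lspath" style lines (status device parent) on stdin.
bad_paths() {
    awk '
        NF >= 3 && $1 != "Enabled" {
            printf "%s via %s is %s\n", $2, $3, $1
            bad = 1
        }
        END { exit (bad ? 1 : 0) }           # non-zero if any path was bad
    '
}

# Sketch: succeed if a vital partner answers a single ping.
host_reachable() {
    ping -c 1 "$1" >/dev/null 2>&1
}
```

Typical calls would be "lspath | bad_paths" and "host_reachable somehost || echo somehost is not answering", with "somehost" standing in for whatever partner matters in your setup.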

Backup-errors: There is a joke: the thing you positively do NOT want to hear your systems administrator say is: "ahem, you do have a backup, yes?" As funny as it sounds: at some time, for everyone, the excrement hits the air moving rotor and you are in deep kimchi. You need a backup in this case, and it is usually exactly this moment when you find out that every backup you took in the last three years consists only of the message "couldn't continue, exiting now". Believe me, telling management about this very rarely gets you an immediate and substantial raise. Backups fail sometimes and this is no problem at all, but you need to know when it happens: one missing backup doesn't matter, but the same system complaining every day about the backup being unsuccessful should ring every alarm bell there is.
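
The "one failure is fine, the same system failing every day is not" rule can be sketched as below. Everything about the input is an assumption: a hypothetical status log in chronological order with one line per backup run in the form "date host OK" or "date host FAIL". Your backup software will have its own log format, so the first step in practice is to turn that into something this simple.

```shell
#!/usr/bin/ksh
# Sketch: report hosts whose most recent backups failed several times in a row.
# Input (hypothetical format, oldest line first): "<date> <host> OK|FAIL"
backup_streaks() {
    limit=${1:-3}                            # default: alarm after 3 in a row
    awk -v limit="$limit" '
        $3 == "FAIL" { streak[$2]++ }        # extend the failure streak
        $3 == "OK"   { streak[$2] = 0 }      # a good backup resets it
        END {
            for (h in streak)
                if (streak[h] >= limit)
                    printf "%s: %d failed backups in a row\n", h, streak[h]
        }
    ' | sort
}
```

A host with a single failed run followed by a good one never shows up. Only a current streak of at least "limit" failures is reported.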

VIOS: These are the most important systems you have! If they are not working, no other LPAR is working (at least not in a way that could be noticed outside the managed system). In particular, things like SEAs, SEA takeovers and similar events might be a good idea to track.

There is cron to set up a regular pattern of little scripts to carry out. When you take systems administration seriously you will need to pick up at least some scripting skills so the best time to start learning is right now. Don't be afraid, scripting is a lot of fun and you won't need big scripts to do what i talked about above. Some of the things will be one- or few-liners and you will pick that up in a moment. And, again: scripting is FUN! A creative and fulfilling process! On AIX you have the best shell there is for scripting at your command: the Korn Shell. I guarantee you once we get you started you will never want to stop.

I hope this helps.

bakunin

agent.kgb · June 26, 2015, 4:03am

I would suggest using something like ganglia or lpar2rrd - both tools generate "manager-friendly" charts, although the installation procedures are not that easy...

zaxxon · June 26, 2015, 4:20am

I absolutely second what bakunin and agent.kgb wrote. It can't hurt, though, to set up nmon in your crontab to write some performance data to files automatically, so that you have something in hand when those complaints about performance reach you. It makes the investigation afterwards much easier. This data can also be fed to nmon2rrd which agent.kgb mentioned. Check the IBM Wiki (NMON Documentation) for setting it up with cron.
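
As a rough sketch of what such a setup amounts to: nmon in recording mode takes an interval in seconds (-s), a number of snapshots (-c) and an output directory (-m), and -f writes the data to a file. The helper below only works out the snapshot count for one day and prints the resulting command line. The function name nmon_daily_cmd, the 300 second default and the directory /var/perf/nmon are assumptions for illustration.

```shell
#!/usr/bin/ksh
# Sketch: print an nmon recording command that covers 24 hours.
nmon_daily_cmd() {
    interval=${1:-300}                       # seconds between snapshots
    outdir=${2:-/var/perf/nmon}              # hypothetical output directory
    count=$((86400 / interval))              # snapshots needed for one day
    printf 'nmon -f -s %s -c %s -m %s\n' "$interval" "$count" "$outdir"
}

# A matching crontab entry, started at midnight, might then look like this
# (check the path to nmon on your own system):
# 0 0 * * * /usr/bin/nmon -f -s 300 -c 288 -m /var/perf/nmon
```

With a 300 second interval this comes to 288 snapshots per day, one recording file per day, and you still need something that cleans up old files.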

If you pick up bakunin's advice to write some small scripts, the AIX Error Report (check man errpt) is the central place where problems of any kind are gathered in a list with timestamp, details, category etc.
You could write or acquire a filter script that checks this and, for instance, sends you a mail if anything bad occurs.
You can also add a stanza to the ODM that automatically triggers an action like a mail, a script etc. to inform you. A script might be the preferred action, since you will certainly want to filter the entries in errpt and also prevent a message flood in case something produces entries at a rate like 100 per second. I had this from a connected jukebox once and was happy not to have a plain mail being sent as the action.
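
A filter of the kind described above might start out like this sketch. It assumes the default one-line-per-entry "errpt" listing with the columns IDENTIFIER, TIMESTAMP, T (type), C (class), RESOURCE_NAME and DESCRIPTION, keeps only permanent errors (T = P) and collapses repeats into a count, so 100 identical entries become one line instead of 100 mails. The function name errpt_summary is made up.

```shell
#!/usr/bin/ksh
# Sketch: summarise permanent errors from an "errpt" listing on stdin.
# Repeated entries for the same identifier and resource are counted once.
errpt_summary() {
    awk '
        NR == 1 { next }                     # skip the column header
        $3 == "P" {                          # T column: P = permanent error
            key = $1 " " $4 " " $5           # identifier, class, resource
            count[key]++
        }
        END { for (k in count) printf "%d x %s\n", count[k], k }
    ' | sort
}
```

A typical call would be "errpt | errpt_summary". Mailing the result only when it is non-empty, and remembering what was already reported, are the obvious next steps.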

This ODM mechanism is called "errnotify" and is documented in the official IBM documentation online. Here is also a very good summary of the error handling capabilities and facilities on AIX:
AIX for System Administrators

A very good blog in every regard anyway.

If your systems are connected to an HMC, you can additionally check the fault events that come in there.

Adnans2k · June 29, 2015, 10:14am

Thank you so much for your support I appreciate it.

But since I'm new, can anyone recommend any preferred sites or YouTube channels where I can learn scripting to get these automated?

rbatte1 · June 29, 2015, 10:19am

A regular human review of the output from errpt and, if necessary, errpt -a would be useful.

You can also run the hardware diagnostics to get reports on allocated real hardware (virtual devices are skipped) through the diag panels. I did know how to run this from the command line, but I've forgotten. :o

Robin

agent.kgb · June 29, 2015, 10:20am

I wouldn't recommend to use bash on AIX, but I think this guide can help to start scripting:

bakunin · June 29, 2015, 11:41am

Here is my favourite book about Korn Shell scripting: "The Korn Shell Programming Tutorial" by Barry J. Rosenberg. It will teach you everything you need to know from the beginning up to a medium advanced level of scripting.

But, again: what you need is not a tool, what you really need is to clarify WHAT you want to monitor. Before you find out WHAT to do, every discussion about HOW to do it is moot. I for my part will be glad to help you with this and i am sure the experts here will too, but until there is a clear picture of what you want to do i won't suggest any tools. There might be good or bad tools for your purpose, but first we have to establish what this purpose is.

Furthermore: you have heard a lot about "topas" and "nmon" and some other tools here. These are very good if you want to get a quick and thorough overview of a system. Often this view is already plotted in graphs without having to bother with plotting tools.

I prefer to use system tools instead ("vmstat", "iostat", "ps", ...) because these offer a lot more flexibility first and because they do not aggregate data second. Aggregation of data is good to get overviews, but when you have to analyse a problem you might need the underlying data to get meaningful results. In such a case it is good to have the source, not some arbitrary summation thereof.
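
Keeping the raw data can be as simple as putting a timestamp in front of every line the tool prints and appending that to a file. The following is only a sketch: the function name stamp_lines and the log path in the usage note are invented, and your vmstat may well have its own timestamp option that makes this unnecessary.

```shell
#!/usr/bin/ksh
# Sketch: prefix every line read from stdin with the current date and time,
# so unaggregated output from vmstat, iostat etc. can be analysed later.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

Used as "vmstat 60 60 | stamp_lines >> /var/perf/vmstat.log" this records one sample per minute for an hour, and every value stays available exactly as the tool reported it.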

I hope this helps.

bakunin

MichaelFelt · July 7, 2015, 10:29am

Again, as Bakunin has clearly stated - it is about what you need to accomplish - and then learn the tools that will help you there.

A little known tool - that has been with AIX since roughly 2009 (AIX 6.1 TL4) is called AIX Runtime Expert.

Note: if you google AIX Runtime Expert you will also see many references to IBM Systems Director->Profile Manager. That "was" the GUI that was developed to help work with this "engine".

In 50 words or less - AIX Runtime Expert (artex.base.rte and artex.base.samples) is a script/XML engine that can collect and compare AIX system configurations (aka profiles).

The key commands are: artexget and artexdiff (to get and compare results). There are other commands to list, merge profiles as well as to apply a profile to the local or a remote system.

Basic documentation is easily available in the AIX 7.1 differences guide (enhancements in AIX 6.1 TL6 and AIX 7.1 TL0 in 2010) at: https://books.google.nl/books?id=m6XEAgAAQBAJ&pg=PA181&lpg=PA181&dq=aix+runtime+expert&source=bl&ots=dtE2QMomU-&sig=DUEh7vt9mrTq_a3jVsnx1zU1U2I&hl=en&sa=X&ei=P96bVfq_G4P6UoutiJgL&ved=0CEIQ6AEwBg#v=onepage&q=aix%20runtime%20expert&f=false

Or the more classic information at: IBM Knowledge Center AIX Runtime Expert

bakunin · July 7, 2015, 10:48am

Hey! Great to see you back, Michael!

True. In fact, out of disdain for that Systems Director IBM chose to pester everybody with for the last few years, i never looked into it until you mentioned it. At first glance it looks like a possibly valuable addition to my toolbox. Good catch!

bakunin

techy1 · July 13, 2015, 1:42pm

Youtube does have a few videos on scripting if you just search for "bash scripting" or "shell scripting" and you can find some 101 online classes. There is nothing that I've seen focused on system monitoring (I didn't see specified what you need to monitor so I assume system)

something I keep on me is a book called "pro bash scripting" just in case I get stuck and need a quick reference. (this book often goes missing because other people like it as well)

One thing to note: since you're learning scripting, I personally wouldn't start with monitoring "performance".

Reason: shell scripting can be a bit tricky at first. One small mistake and you can run into a loop where system resources are consumed because of bad code in the script. (Maybe not the best example, but once you start making mistakes you'll see what I mean.)

I would suggest running the commands manually and understanding what each one does first. iostat and vmstat are my preference, as bakunin mentioned.

If they are just asking you for report summaries on the servers' usage, lpar2rrd, as mentioned, is a great way to give them just that.

Check out their site. It is a bit tricky to set up at times, but once you finish, management teams love it. Plus, they can check it whenever they want without asking you for a report.

I also do run nmon reports as well since it is more detailed (no one other than myself reviews these).

One of my personal fav. I view once every 6 months (if I can) is "HMCViewer"

for learning AIX, there is a sticky on the AIX page:

Before monitoring "performance", understanding the system and how it works is the biggest key. Without this it's more like building a house from the roof down.

IBM Redbooks are free and very useful.

Hope this helps.

MichaelFelt · July 13, 2015, 4:07pm

If it is movies you want I recommend Nigel's collection of movies - at - https://www.youtube.com/user/nigelargriffiths

Adnans2k · July 15, 2015, 8:20am

Thanks for all the help everyone.