Daily checks for AIX business critical boxes.

Hi all,

I would like to know which sanity checks should be done on a daily basis, without fail, on all business-critical AIX boxes.

Disk space, network connectivity, and security.

Sorry, but your question is too general to be answered.

For instance, it might be critical to watch filesystem capacity if a lot of data is getting stored on the system and the data is not coming in a steady predictable stream. On the other hand there are systems where very little data is stored and the filesystem doesn't have to be watched closely at all. For some systems looking at it every month is enough, for others it is vital to monitor it hourly, yet many systems are somewhere between these extremes.

Make your question a bit more specific and we might be able to help you better.

bakunin

Is 'none' a valid answer?

If you have sufficient monitoring in place, there is no good reason to check on the boxes directly on a daily basis at all - because I get a ticket or am called out in case of any issues. I do monthly capacity checks across my boxes and compare them with previous months - but basically that is all ...
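To illustrate the monthly comparison, here is a minimal sketch in Python. It assumes you already collect "percent used" per filesystem once a month; the function names, the sample figures, and the 5-point limit are made up for illustration and are not taken from any real setup.

```python
from typing import Dict, List


def capacity_growth(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Change in percent used (percentage points) for filesystems in both snapshots."""
    return {
        fs: round(current[fs] - previous[fs], 1)
        for fs in current
        if fs in previous
    }


def fast_growers(previous: Dict[str, float], current: Dict[str, float],
                 limit: float = 5.0) -> List[str]:
    """Filesystems whose usage grew by at least `limit` points (hypothetical limit)."""
    growth = capacity_growth(previous, current)
    return sorted(fs for fs, delta in growth.items() if delta >= limit)


# Hypothetical snapshots: percent used per filesystem, last month vs. this month.
last_month = {"/": 40.0, "/var": 60.0, "/data": 70.0}
this_month = {"/": 41.0, "/var": 68.5, "/data": 70.0, "/new": 10.0}
```

With these sample figures, `fast_growers(last_month, this_month)` would single out `/var`. A filesystem that only exists in the newer snapshot is ignored, because there is nothing to compare it with yet.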

Kind regards
zxmaus

OK, as you said, some systems have to be monitored hourly, so I want to know which things have to be monitored hourly. Is it restricted to filesystems and memory, or is there more?

I don't have any real-world experience, hence the question :)

Can you please explain what "sufficient monitoring" covers?

I think there is a difference between a ticket being issued and checks on business-critical boxes.

Hi,
In my company a ticket means a callout within one minute; the response time for us SAs is 5 minutes for prod and 15 minutes for non-prod. We have a lot of business-critical systems (global trading and transaction systems), so we cannot afford any downtime.

We monitor CPU (wait, idle, and usage), avm memory and paging space, disk space (with thresholds defined per filesystem), processes (by name and count), logfiles (for defined keywords), errpt of course, SAN (e.g. whether all paths are up), network, NFS shares, whether systems are pingable/reachable and whether throughput is within thresholds, and backups. We even monitor whether the monitoring is up ... and basically everything else you could possibly think of.
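As a rough idea of what "thresholds defined per filesystem" can look like, here is a small Python sketch. It is not our actual monitoring tool; the mount points and percentages in `THRESHOLDS` and both function names are assumptions chosen for the example. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from typing import Callable, Dict, List

# Hypothetical per-filesystem thresholds (percent used); tune these per system.
THRESHOLDS: Dict[str, float] = {"/": 80.0, "/var": 90.0, "/tmp": 95.0}


def percent_used(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo filesystems can report a size of zero
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(
    thresholds: Dict[str, float],
    usage_fn: Callable[[str], float] = percent_used,
) -> List[str]:
    """Build one alert line for every filesystem at or above its threshold."""
    alerts = []
    for path, limit in thresholds.items():
        used = usage_fn(path)
        if used >= limit:
            alerts.append(f"{path}: {used:.1f}% used (threshold {limit:.0f}%)")
    return alerts
```

The `usage_fn` parameter is only there so the logic can be tried with made-up numbers; in real use you would call `over_threshold(THRESHOLDS)` from whatever scheduler you have and turn each returned line into a ticket or callout.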

Kind regards
zxmaus

Ask yourself what it is that keeps a system going (that is: fulfilling its purpose). This is your answer.

Whether something has to be monitored every minute, hour, day, week or month depends on the system and the characteristics of its purpose. There is no general answer because there is no "general system".

If you ask "which is the best car" without specifying the purpose, the only thing one could answer is: that depends. If you want to transport tons of goods, it might be some large truck and not the Ferrari; if you want to win races, it might be the other way round; and if you want to go off-road, you will quickly find out that both are quite bad compared to a Land Rover.

Coming back to your question: what keeps a system going?

a) environmental issues

  • energy
  • climate/temperature control
  • ....

b) OS level

  • availability of processing resources - CPU
  • availability of memory
  • availability of storage space - filesystem
  • OS resource consumption: process table, etc.
  • availability of network bandwidth
  • ...

c) application specific

  • depends on the application, things like queue lengths, transaction times, ...

Be aware that this list is far from complete; it's just the most obvious things. Feel free to add whatever is important for your system to continue working. As a rule: for everything that is important for the system to keep fulfilling its purpose, you need a "sensor" - a logfile, a piece of software, a blinking warning lamp, whatever.

Some of the things might already be covered: you do not have to watch climate control if the system is in a data center where air conditioning is provided and taken care of without you doing anything. You still might want to watch over fans, etc. and get an alarm if the system starts overheating.

As for the things left on the list: how often something has to be checked depends on the system and what it is used for. Because these checks usually take some processing power (in most cases little programs do the work), it is generally good to do the checks as often as necessary and as rarely as possible. If you have a system where data never gets stored (a gateway system, for instance), a check of the filesystems every minute is superfluous; on a database server it might be necessary. The same goes for CPU, network and all the other things on the list.

So there is no such thing as a "thing that has to be monitored hourly", because, whatever the thing in question is, depending on the specifics of the system, checking once an hour might be overkill or far too infrequent.
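To make the "as often as necessary, as rarely as possible" idea concrete, here is a minimal Python sketch in which every system type carries its own check intervals. The interval values, the two profiles and the function name `due_checks` are invented for the example; they are not recommendations.

```python
from typing import Dict, List

# Hypothetical check intervals in seconds, one profile per kind of system.
GATEWAY: Dict[str, int] = {"filesystem": 86400, "network": 60, "cpu": 300}
DATABASE: Dict[str, int] = {"filesystem": 60, "network": 300, "cpu": 60}


def due_checks(intervals: Dict[str, int], last_run: Dict[str, float], now: float) -> List[str]:
    """Return the checks whose interval has elapsed, or which have never run."""
    due = []
    for name, interval in intervals.items():
        last = last_run.get(name)
        if last is None or now - last >= interval:
            due.append(name)
    return due
```

If all checks last ran at second 0 and it is now second 120, the gateway profile only wants the network check again, while the database profile already wants filesystem and CPU. Same checks, different rhythm, depending on what the box is for.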

I hope this helps.

bakunin

Thanks a Ton

@bakunin, @zxmaus