Monitoring various things (mainly activity) on different Unix boxes

Hi there,

I want to ask you guys what you think about my problem.
I work as a sysadmin on about 7000 workstations, and to save money and energy we've decided to switch off as many workstations as possible during the night (probably by shutting them down via cron and powering them back on via Wake-on-LAN).
We're also planning to develop a custom portal that lets users flag the workstation they're using as one that must not be shut down.

All of this is currently under discussion. For now I need to report the nightly activity on a selection of workstations (RHEL4, RHEL5 and HP-UX 11.11). Here is what I need:

  • whether a local user is logged on to the workstation
  • whether a remote user is logged on to the workstation
  • the activity of those users, if applicable
  • the running processes

Of course I first thought of the "ps -ef" and "who" commands, but there are about 50 workstations to monitor (for 2 weeks or so), I'm not really a shell script guru, and the collected data must be easy for me to compile into a report...

What do you think is the best option?

Thanks

If you just need minimal output showing who is logged in and the system load, then uptime or w is your friend, e.g.:

gech:/home/vbe $ rsh ant -n w
connect to address 10.XXX.YYY.2 port 544: Connection refused
Trying krb4 rsh...
connect to address 10.XXX.YYY.2 port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
  5:07pm  up 24 days,  5:59,  7 users,  load average: 0.01, 0.01, 0.01
User     tty           login@  idle   JCPU   PCPU  what
vbe      pts/0        11:16am581:49 119:10 119:10  top
vbe      pts/1        11:17am192:07                ssh us99
vbe      pts/2         3:32pm192:57                ksh
vbe      pts/3         4:30pm  1:25                more case_usage_001.txt
vbe      pts/4         3:24pm  5:52                ksh
vbe      pts/5         3:25pm            1      1  -ksh
vbe      pts/6         2:25pm                      more -s

What do you really need?
(You can always write a script using vmstat, iostat, etc.)
On the HP side, if you have an /opt/perf/bin directory, you may find monitoring tools there such as MeasureWare (mwa)...

Thanks, but I guess I didn't explain it right:
I need to monitor who is connected (and doing what) during the night (e.g. from 9pm to 7am the next morning), let's say every 15 minutes.

I thought about a shell script launched by a cron job that writes the output of who and ps -ef to text files with an incrementing name (like who.1.txt, who.2.txt and so on). Every day when I start work, I would process all the data gathered during the night into a report saying who was on each host and what was going on.
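A minimal sketch of such a snapshot script, under stated assumptions: the function name take_snapshot, the output directory and the file naming are hypothetical, and a timestamp is used in place of an incrementing counter so that files sort chronologically without keeping any state between runs.

```shell
#!/bin/sh
# Hypothetical snapshot script: names, paths and layout are assumptions.

# take_snapshot OUTDIR: write one file per run containing "who" and "ps -ef".
take_snapshot() {
    outdir=$1
    host=$(uname -n)
    stamp=$(date +%Y%m%d-%H%M)
    # File name is host.YYYYmmdd-HHMM.txt, e.g. one file per host per quarter hour.
    out="$outdir/$host.$stamp.txt"
    mkdir -p "$outdir" || return 1
    {
        echo "### host=$host time=$stamp"
        echo "### who"
        who
        echo "### ps -ef"
        ps -ef
    } > "$out" 2>&1
    # Print the file name so a caller can post-process it.
    echo "$out"
}

# When run as a script, the first argument is the output directory.
if [ $# -ge 1 ]; then
    take_snapshot "$1" > /dev/null
fi
```

A crontab entry along these lines would run it every 15 minutes from 21:00 to 06:45 (the script path and directory are placeholders; the minutes are listed explicitly because not every cron supports step syntax):

    0,15,30,45 21-23,0-6 * * * /path/to/snapshot.sh /var/tmp/nightwatch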

The purpose of all that is to check whether there's some unknown activity at night that could be killed by switching off workstations (a user's crontab, for example) and to take the required actions to protect it.

Maybe my plan isn't the best (or at least not the most effective), which is why I came to ask if you've got a better idea or if I'm on the right track.

That is exactly what w does (read the man pages...).
You could try executing your job from a "master" box using rdist (though it's been a long time since I last did such things), or use cron/at on all the boxes and have them all write to the same place (using NFS?).

Information about who is connected is available in syslog.

Please look at syslogd on your box: you can increase, decrease or separate the various kinds of logging on your system, so that your scripts and/or log management software can parse them.

As for "doing what", it really depends. A better approach would be to think about what you don't want users to do to the system or its information.
That's why you have per-user kernel limits, Unix permissions and ACLs, and secure protocols to communicate and authenticate with (ssh, ssl, kerberos).

Try to make folks think about what they want, who will do it and with what permissions.

Root access can be fine-tuned and logged (command by command) as you wish using sudo.

Thinking about it, you could very well have jobs running without anyone connected... How will you find that out? With ps?
You will have to go through all /var/spool/cron... etc...
I usually keep a userlist file that I use to kill everything at 20:00, so that I can do some cleanup and sanity checks before backups. People who have specific jobs or who need to work later have to check with me, and I remove them from the list temporarily.
Could you not use that approach for a start? (Machines are not supposed to be working from 21:00 to 9:00, so who or what are the exceptions?)
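Going through the cron spool by hand gets tedious across many boxes, so here is a rough sketch of listing every active crontab entry per user. The function name list_cron_jobs is hypothetical, and the spool directory is passed in as an argument because its location varies by platform (often /var/spool/cron on RHEL and /var/spool/cron/crontabs on HP-UX, but check your own systems). Reading it normally requires root.

```shell
#!/bin/sh
# Hypothetical helper: the name and the directory layout are assumptions.

# list_cron_jobs DIR: print "user: entry" for each active crontab line,
# assuming DIR holds one crontab file per user, named after that user.
list_cron_jobs() {
    dir=$1
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$f" ] || continue
        user=$(basename "$f")
        # Drop blank lines and comments, prefix the rest with the user name.
        awk -v u="$user" '!/^[ \t]*(#|$)/ { print u ": " $0 }' "$f"
    done
}

# When run as a script, the first argument is the spool directory.
if [ $# -ge 1 ]; then
    list_cron_jobs "$1"
fi
```

This only covers per-user crontabs. at jobs and system-wide cron files live elsewhere and would need the same treatment.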

That's interesting. This is pretty much the goal we're aiming for:
We want to establish exclusion lists of hosts that must never be affected by the energy saving under any circumstances (I'm thinking of the simulation workstations running fluid calculations, which sometimes take days to complete) and other critical workstations.
This will be the VIP list.
Then we'll add another exclusion list that the users themselves control. Let's say user1 has a fairly standard workstation, so it's tagged as an energy-saving one. For some reason, one day a week he uses it to process a large amount of data; in that case, he adds his host to the temporary list himself to exclude it from the process just that one time.

But all of that will come soon enough, first we need to know what we are dealing with.

The reason why I'm investigating the nightly thing is because nobody has any clue of what exactly is going on at night. I can't think of any other way to do it and it could be pretty bad if I miss something...

---------- Post updated at 07:39 PM ---------- Previous update was at 07:20 PM ----------

The syslog approach is a pretty good idea, actually. I will look into it on Monday for sure. Thanks for the tip.

Concerning the "doing what" part: nobody but the sysadmin team has root access, and some users have limited sudo rights (for ifconfig or some tools they use). We really have a lot of different roles for workstations (thank God I'm only in charge of the workstation side), so I really don't know how I can accurately monitor running processes...

OK, the syslog solution is a no-go: nothing can be modified on those workstations.
I think I'm going to run commands on the workstations over rsh, collect the results in text files, and then use some Perl to process them and extract the information I need.
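For the collection side, a minimal sketch of a loop run from one central box, leaving nothing installed on the workstations. The function name collect, the host list format (one host per line, # for comments) and the output layout are assumptions. The remote shell client is kept in a variable because its name differs by platform (on HP-UX the client is remsh).

```shell
#!/bin/sh
# Hypothetical collector: names, host list format and layout are assumptions.

# Remote shell client; override with RSH=remsh (HP-UX) or RSH=ssh if available.
RSH="${RSH:-rsh}"

# collect HOSTFILE OUTDIR: run "who" and "ps -ef" on every host in HOSTFILE
# and store the result in OUTDIR/host.YYYYmmdd-HHMM.txt.
collect() {
    hostfile=$1
    outdir=$2
    stamp=$(date +%Y%m%d-%H%M)
    mkdir -p "$outdir" || return 1
    while read -r host; do
        # Ignore blank lines and comments in the host list.
        case $host in ''|'#'*) continue ;; esac
        # -n keeps the remote shell from swallowing the host list on stdin.
        $RSH "$host" -n 'who; echo "### ps -ef"; ps -ef' \
            > "$outdir/$host.$stamp.txt" 2>&1 \
            || echo "$host unreachable at $stamp" >> "$outdir/errors.log"
    done < "$hostfile"
}

# When run as a script: collect.sh HOSTFILE OUTDIR
if [ $# -eq 2 ]; then
    collect "$1" "$2"
fi
```

Hosts are polled one after another, so a workstation that is down can stall the loop until the connection times out. With 50 hosts and a 15-minute interval that is probably acceptable, but it is worth checking errors.log after the first night.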