Prize of being an Admin - Part 2

I was reading admin_xor's thread "Prize of being an Admin" and thought I'd share an experience of mine, which is kind of the opposite of what he did - I didn't tell anybody what happened :smiley:

We were porting one of the subsystems from Solaris to Linux. As part of that we developed many wrapper scripts. One of them is an rsh wrapper script deployed on the system, which internally uses ssh if security is enabled and native rsh otherwise (the native rsh is placed in a different path so that it does not show up in $PATH, and the wrapper rsh script is placed in /usr/bin). For some testing purpose, I changed the ssh command inside the rsh wrapper script to "rsh" and forgot to change it back. So, you know what happened next. Any rsh call went into a continuous loop, calling the rsh wrapper script over and over again, and this clogged the CPU in no time. I made this change in the testing team's setup. And the worst part was that I did it in 2 of their setups.
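For illustration, here is a minimal sketch of how such a wrapper could protect itself against this kind of loop. It is not the actual script from our setup: the paths, the variable names (NATIVE_RSH, SSH_BIN, RSH_WRAPPER_ACTIVE) and both function names are assumptions made up for the example.

```shell
#!/bin/sh
# Hypothetical rsh wrapper sketch. All paths and names are assumptions.
NATIVE_RSH=/usr/lib/native/rsh   # assumed location of native rsh, outside $PATH
SSH_BIN=/usr/bin/ssh             # assumed location of ssh

# Pick the backend by absolute path, so that a bare "rsh" can never
# resolve back to this wrapper through $PATH.
choose_backend() {
    if [ "$1" = "yes" ]; then
        echo "$SSH_BIN"
    else
        echo "$NATIVE_RSH"
    fi
}

# Refuse to continue if the wrapper was started, directly or indirectly,
# by another instance of itself. The marker travels through the environment.
guard_recursion() {
    if [ -n "${RSH_WRAPPER_ACTIVE:-}" ]; then
        echo "rsh wrapper: recursive call detected, aborting" >&2
        return 1
    fi
    RSH_WRAPPER_ACTIVE=1
    export RSH_WRAPPER_ACTIVE
}

# A real wrapper would end with something like:
#   guard_recursion || exit 1
#   exec "$(choose_backend "$SECURITY_ENABLED")" "$@"
```

With a guard like this, the second invocation would exit with an error instead of spawning itself again, so a forgotten test edit shows up as one failed command rather than a clogged CPU.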

Next day I came to the office and there was a big fuss all over the place. I didn't bother because it wasn't assigned to me, and I had totally forgotten that what I did was causing this. After a couple of days, the issue was assigned to me and then "oops", I realized it. Now what? Of course I didn't tell them :D. Hearing about what happened to admin_xor for what he did, imagine what would've happened to me.

Later I told them that it was a "human" :stuck_out_tongue: error and that there was no issue with the system. But then they asked how it could happen to 2 systems. I was like "it happened man, forget it" :smiley: - no, I didn't say that, I told them we'd monitor it. I assured them that it was human error, that we would monitor the system and investigate again if it recurred, and that for now it was not worth spending time on this - obviously I knew that, because I am the culprit.

--ahamed

'Tis a brave person who admits his/her failings to the public...

Bazza...

Would be even braver to admit it internally.

I guess your employer and colleagues will sooner or later be aware of this posting as you gave plenty of information to identify the case.

Not sure they'll appreciate it, especially the ones who didn't find the root cause after a couple of days ...

Oof, that must have been painful to own up to. Nicely done though!

First observation: sh!t happens! That is a proven, reliable fact, and an environment which can't cope with that is designed wrongly from the start. If you need a service not to be disrupted you shouldn't allow people to develop on it, because development will cause one hiccup or another over time. Further, you need to take precautions against the failure of every single part of the system if it is to survive. Suppose that instead of your error some hardware had crashed, the network had been disrupted, whatever. This is what HA solutions are for, for instance.

No SysAdmin in his right mind will let a manager determined to "save" on hardware off this hook: do you want to bet the project's future on me never making an accidental typo? (As it is, i have actually said exactly this in a design conference - and got my testing system.) And, by the way: when they decide about new furniture for their offices, any intention to save is usually abandoned immediately, so wtf?

Second aspect: whenever you do something it is your utmost responsibility to test what you have done. Immediately! So how can you create such a loop and not notice it? How can you implement this change even twice? This is not a question of introducing an error - that happens to all of us. It is a matter of noticing you have done something wrong, and this has to do with the style of work: if i delete a file, i do an immediate "ls" to verify it (and it alone) is gone, if i do a "cd" i do a "pwd" to verify i am in the right directory, etc., etc. This slows me down by perhaps 5%, but when i think i have something done i usually have it done - without any error. The 5% are easily recovered by not having to do the error correction and/or recovery others eventually have to do.
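The "delete, then verify" habit could be sketched as a small shell function. This is only an illustration under assumptions: the name rm_checked is made up, and counting directory entries before and after is just one possible way to check that the file "and it alone" is gone.

```shell
# Hypothetical helper: remove one file, then check that it,
# and only it, disappeared from its directory.
rm_checked() {
    target=$1
    dir=$(dirname -- "$target")
    # Count directory entries (including dotfiles) before the removal.
    before=$(( $(ls -A -- "$dir" | wc -l) ))
    rm -- "$target" || return 1
    after=$(( $(ls -A -- "$dir" | wc -l) ))
    if [ -e "$target" ]; then
        echo "still there: $target" >&2
        return 1
    fi
    # Exactly one entry should have vanished.
    if [ "$after" -ne $((before - 1)) ]; then
        echo "unexpected change in $dir: $before entries before, $after after" >&2
        return 1
    fi
}
```

Called as rm_checked /some/dir/file, it returns non-zero if the file survives or if the entry count changed by anything other than one. It will also complain if another process touches the directory at the same moment, which is acceptable for an interactive sanity check.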

So, i hope for your best, but you should really change your work ethics and learn from this accident. My 2 cents.

bakunin


I agree with everything Bakunin has said but, as a rule and as an IT pro, I never delete anything. If I'm going to edit a file, I copy it (usually with a different suffix) and if I'm going to delete a file, I rename (mv) it.

Then, when something stops working (and in IT anything that can go wrong usually does) and I need to know what the hell was in that file that I edited/deleted, I can find out.

Rule is to think hard before you edit or delete anything!!!
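As a rough sketch of that habit, the two helpers below copy a file aside before an edit and rename a file instead of deleting it. The function names and the suffixes (.bak with a timestamp, .deleted) are assumptions chosen for the example, not a fixed convention.

```shell
# Hypothetical helper: keep a timestamped copy before editing a file.
backup_before_edit() {
    cp -p -- "$1" "$1.$(date +%Y%m%d%H%M%S).bak"
}

# Hypothetical helper: "delete" a file by renaming it, refusing to
# overwrite an earlier renamed copy.
soft_delete() {
    if [ -e "$1.deleted" ]; then
        echo "refusing to overwrite $1.deleted" >&2
        return 1
    fi
    mv -- "$1" "$1.deleted"
}
```

Usage would be backup_before_edit /path/to/file before opening the editor, and soft_delete /path/to/file where rm would otherwise be used. The old contents then stay around until somebody deliberately cleans them up.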

A list of fatal commands that really happened

rm * .tmp      # one space too many (meant: rm *.tmp)
last | reboot  # grep missing (meant: last | grep reboot)
hostname -f    # on Solaris sets hostname to -f
ifconfig -a 1  # on Solaris sets all interfaces to 0.0.0.1

From a senior admin standpoint I always do two things when hiring a new Jr. Admin. I ask them to count to 5 and I tell them if they ever break anything to tell me the minute it happens and what they were doing.

I would rather they tell me quickly, fix it, and let it be a learning point with little repercussions on the first go-round than have them lie to me and have me spend 8 frustrating hours trying to diagnose and then fix the issue. I have fired people for that.

Now, can you figure out why I ask them to count to 5? :slight_smile:

I can count till 5 for sure, and I am still learning Linux and enjoying the show

At one of my previous positions the company was looking to save money, so they would take the average Windows admin salary, bump it a bit, and not make UNIX knowledge and experience a requirement in the job req. So what they would get was Windows admins who wanted to become Unix admins.

Getting them to count to 5 verbally would confirm this, because the Windows admin would always start with 1, whereas an experienced Unix admin would generally start with 0 because of the way interfaces, disks and Unix in general start numbering at 0.

I would have failed your test :frowning:

Regards,
Alister

but it makes you think about it, doesn't it?

Me too.

Windows admins can count?

Oh, they can... just not in proper base 10. They usually start with 1 rather than 0.

:slight_smile:

Here's a third UNIX admin that'd fail your test, working from the assumption I was talking to a human, not a computer.

What's it supposed to prove again?

Ok, poor assumption on my part, but it always passed in my environment (but then they never sent me a UNIX admin).

suggestions?

Would it not be fair to say that Windows admins are much simpler and easier to get along and talk with? By no means do I want to be disrespectful to anyone.
My point is that when things are too simple, as Unix is meant to be with everything being a file, they actually become complicated.
What feels simpler: calling something a device file, or media? Just think from the perspective of an everyday user.
I feel Unix was meant to be simple, but the people who carry it have made it complicated.
I recall my first job, when I just had to bounce Tomcat. I was fresh out of school and could not catch the terminology, so I went ahead and asked a very senior guy, and found out they just wanted a restart.
Why can't we call a restart a restart, although we type restart?

Let's not turn this into a religious war. I'm sure Windows admins would complain at being called "much simpler", even :wink:

I disagree that Windows standards are more "intuitive". Ask anyone who's new to computing, they'll have no idea what they're looking at. It's just what everyone learns in the office...

UNIX can certainly be obscure in some ways. It's an operating system for programmers. The straightforward interface keeps programs simple, not necessarily the system. You don't need to be a programmer to use it, but you'll be at a disadvantage if you don't. It also opens up a lot of possibilities.

It's just slang, not a UNIX term. Like all slang in all languages, it can be baffling to anyone who's not a native speaker.

So, I think one way to check for UNIX experience, would be to ask them to write a quick shell script without consulting a cheat sheet...

"Unix is user-friendly. It just isn't promiscuous about which users it's friendly with." - Steven King

"Unix is simple. It just takes a genius to understand its simplicity." - Dennis Ritchie

xkcd: tar

Not a religious war on my part. I am the only UNIX admin in a mostly Windows shop... yet I'm still here, not going anywhere, and my footprint is actually increasing. We must be good for something....

I'd also like to add that I don't think Windows admins are lesser beings; the implication was that they weren't used to our environment and it was something to watch for. Our app layer was coded by Windows programmers, so the numbering systems and naming conventions weren't consistent. If you made assumptions either way it would bite you...

It's best that you admit your mistake straight away. They'll hate you for it in the beginning but praise you for having the courage to do so.