[Opinion] A Public Answer To Rob McNelly

bakunin · June 22, 2016, 5:27pm

Why Do We Need Root on the HMC?

In this article in IBMSystems Magazine Rob McNelly asked the question

Why Don't We Have Root on the HMC?

and he goes on to justify why we indeed shouldn't have root - kinda. I think his arguments are not as valid as he perhaps thinks they are and what's more i think he deserves an answer as public as his statement. I will paraphrase some of his statements as i understand them, but you should read his linked article yourself to finally judge if i have misrepresented or misunderstood him.

First, Mister McNelly says it is "in the nature" of Sysadmins to believe they need root everywhere. This might be the case for some immature hacker kids. Fact is, i - and certainly every other responsible sysadmin i know - only switch to root if i really need to do it, not because it is my "habit" to do so. It is just the nature of my work which calls for the power of the superuser: otherwise i wouldn't know how to increase filesystem sizes, unlock user accounts or start up/shut down systems - these are the most common requests i face every day. But my "normal" work, which doesn't require these extraordinary powers - writing scripts, working out procedures, ... - i do with my ordinary user account. The only group i carry is "staff" and the only thing different from any other user account is the size of my HOME directory (~200MB) because i generate reports and lists using UNIX text filters rather than these abominable "office" suites. (As a rule of thumb: data that really matters is not stored within an Excel sheet.)

The second reason Mister McNelly cites is that an (arbitrarily) administrated system (as opposed to an appliance) is a support nightmare. Now i can appreciate this argument! But guess what: any system with a variable configuration is more difficult to support than a system with a fixed config. Maybe IBM should lock out all users from all their AIX systems as this would make supporting the OS much easier, no?

And why does the HMC have to be a separate system anyway? Let's face it: it is basically an (acceptably but not outstandingly well designed) web application and a supplemental set of commands to do on the command line what can be done within the web application. Can't that be an application which can be installed? What needs a separate system here?

For instance, i have installed the "EMC solutions enabler" on an AIX LPAR to administrate my array of VMax storage systems. It is a set of executables i just use within scripts of my own and it writes plain log files i can read. I'll give you that, to use non-standard SCSI commands to communicate with the VMax which requires "gatekeeper devices" to be created is probably a pretty bad idea - there was a thing invented for that kind of service, i believe it was called "networks". But save for that, the management software for the system is a normal application. Why can't that be done for the HMC software?

Yes, i can understand Mister McNelly's point that installing "everything and the kitchen sink" on the HMC can create problems - just like cramming several applications onto any other single system will likely cause problems and is a very bad design decision. But i wouldn't do that, just like i wouldn't design any other system that poorly. Still, i could make my work easier by storing some really necessary files on the HMC without being forbidden to organize my HOME by that ridiculous restricted shell. I mean: does it really make support easier when i am forced to have 50 files in my home instead of having them organized in neat subdirectories (which i can't create)? Who is helped by the fact that i cannot pipe the output of, say, lssyscfg, into a grep? I might even want to use the same shell i use throughout my whole AIX installation - Korn Shell - instead of being forced to use bash solely on the HMC.
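For readers stuck with the restricted shell, one possible workaround is to run the HMC command over ssh and do the filtering on the admin workstation, where pipes and text filters are available. The following is only a minimal sketch: the host name hmc01.example.com and the helper not_in_state are made up for illustration, and it assumes the HMC user may run remote commands over ssh and that lssyscfg -r sys -F name,state prints one comma-separated "name,state" line per managed system.

```shell
#!/bin/sh
# Sketch only: filter HMC output locally instead of inside the restricted shell.
# HMC_HOST is a hypothetical placeholder; adjust both values for your environment.
HMC_HOST="${HMC_HOST:-hmc01.example.com}"
HMC_USER="${HMC_USER:-hscroot}"

# Hypothetical helper: read "name,state" lines on stdin and print the names
# of all systems whose state differs from the expected one (default: Operating).
not_in_state() {
    expected="${1:-Operating}"
    awk -F',' -v want="$expected" '$2 != want { print $1 }'
}

# Usage (needs ssh access to a real HMC, therefore left commented out):
# ssh "$HMC_USER@$HMC_HOST" 'lssyscfg -r sys -F name,state' | not_in_state Operating
```

This does not remove the restriction, of course; it only moves grep, awk and friends to a machine where they are allowed.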

So, do i want root on the HMC, as McNelly finally asks? No, most of the time a decent user account with a normal, not-restricted shell would suffice. But to manage this account - in the same responsible way i manage the rest of my 350 LPARs - i'd like to become root now and then to do whatever administrators do. Of course i know how to jailbreak the HMC (like perhaps every halfway capable admin does), but why do i need to "break into" a system i have set up, a system i run and for which i (well, actually my company) have paid good money?

If IBM would put the effort they put into making it harder to become root into further development of the HMC software itself - wouldn't it help people (outside their support staff)? It reminds me somewhat of the situation with iPhones, Android phones, CyanogenMod and that awful decision to make the replacement of batteries impossible. I understand that it helps protect the cash flow because this way it is easier to gain money from customers without doing more.

But on one hand: i may have to bear it, but i do not have to like it. And on the other hand: we are not talking about some mobile phone for 69.99. We are talking about the two HMCs i use to manage one and a half dozen p780s and p880s, about 2 million dollars apiece. Do you think it is necessary to squeeze out some minimal additional benefit by pestering me with a restricted shell for my daily work? And if you really think i couldn't handle the responsibility for such a vital system: don't you think i should be removed from the position where i manage the LPARs running the corporate SAP systems too?

Just my 2 cents for the whole HMC discussion.

bakunin

dukessd · June 22, 2016, 7:30pm

I'll chip in... discussion is very welcome.

A large number of Power sysadmins are simply not able, or capable, of doing their jobs.

When they get their hands on an HMC (let alone large corporate bank staff hacking up the ODM and asking IBM how to fix their mess - on two year old code - without a reboot - in an AIX LPAR) they "think" (relative term) they see linux and just simply go MAD.

How can IBM support any of that?

The HMC is an important box so needs to be treated with respect, as you said it's vital to support an expensive estate.

In the early days, to some extent it still is, HMC users / admin "think" they can do all sorts of "things" on "their" "linux box"...

How could that be supported without making the HMC a black box and simply not letting it happen?

Do you use vio commands on a vio server, or just oem_setup_env to save time - it'll catch you out sooner or later...

Do you hack the ODM if a command doesn't work the way you want - it'll catch you out sooner or later...

Do it by the book or suffer the consequences.

Just my experience, hope it helps somebody or some poor system with an irresponsible admin ;0)

If you have a beef about HMC, AIX, admin rights, Etc. raise a PMR and if IBM say "not supported" they'll give you the process to raise a DCR. If the DCR (design change request) is rejected at least they'll let you know why they think your idea is not possible or plausible.

bakunin · June 23, 2016, 3:29am

Amen. You are right, but i think you are missing the point: first, a determined "non-expert" (to avoid words more to the point) will be able to mess up anything. As i said, when locking him out of the HMC helps, why not lock him out of any other LPAR too?

Second: this is digging into a much larger area so i'll try to keep it short. The reason that so few capable admins for AIX are there is because IBM did (and, IMHO, still does) a very bad job at educating them. If i am a Linux admin and want to hone my skills i get myself a PC for $300 and start hacking. I will perhaps make it go FUBAR a few times but all this will teach me valuable lessons and i will be all the more capable once i work on really productive systems professionally. If i am an AIX admin i do - what? Buy myself a system for ~ $20k only to find out i can't even create an LPAR because i need to shell out another $50k in various licenses for one thing or the other? This might be OK for a bank, but is beyond my financial reach.

In addition the IBM documentation once used to be exhaustive. It isn't any more. In fact it is quite incomplete, bookmarks to the documentation tend to be invalidated within hours so that you start over searching for the same pages (which sometimes are not to be found again, however) and even if you find what you search for, the information is often incomplete and leaves many questions open.

Of course there are courses: EUR 5k for 4 days of class and what they tell you is basically: "use SMIT and you are on the safe side". I don't care how to do something, i want to understand what i do. I found out over and over again that the people holding the course knew even less than me.

As i said: by making what is the HMC today into an application, as did EMC, as did IBM with their PSSP, as do many other developers of all sorts of management software. There is no reason that it has to be so complicated that it needs the "black box" to run smoothly.

Well, i do all that on occasion, especially when the "by-the-book" methods didn't work out. And i was only so much amused when i was finally allowed to throw everything i learned on AIX out of the window and had to learn a second, completely different set of commands to do on a VIOS the same things i do on an LPAR.

For the book by which i should do it: if it doesn't tell me what i need to know to do it right it is simply incomplete and/or badly written. Don't hold me accountable for IBM delivering bad/wrong/incomplete/misleading documentation.

Having used AIX since version 3.2 i know this process. It is just my opinion that IBM made some wrong (design) decisions and even though i cannot help it i do not have to appreciate it either. And i do not have to take Rob McNelly's apologetic stance towards this without objection.

Finally, on a more philosophical point about systems design in general: if you design a system to cater to the dumbest possible administrator you are likely to get the dumbest possible system which even the smartest possible admin can't make any more intelligent.

bakunin

gull04 · June 23, 2016, 9:58am

Before I chuck my couple of cents worth into the bucket here, a quick précis on me and what I'm doing at the moment.

I'm nearing retirement, I've worked on a huge range of equipment - for a long list of names, pretty much all gone now. Probably worked on more than 20 flavours of *NIX for companies like Data General, Sun, Olivetti, Norsk Data, Wordplex, Motorola, Intergraph and a number of others.

For the last 15 years I have been a "Data Centre Migration Specialist", whatever one of those is. At the moment I am subcontracted to a client by IBM. At this point I should say that I am not permanently employed by IBM, but this is the fourth time that I've been contracted out by IBM. The current job is to move the data centre of a major player in the UK utility market into a new headquarters building, a project expected to last at least another 18 months.

The IBM estate is pretty mixed and aged, I have a number of P770s, P740s, P570s and RS6000s running a number of levels of AIX from 4.3 to 7.1 - with 7.2 about to go on the floor in the form of a number of S824s - and there are a total of four HMCs. I have also got quite a number of Linux (200) and Sun (350) servers to move, the end client has hardware support from Oracle, IBM and HP-CDS and OS support from IBM and Oracle.

So now my 2¢ worth:-

I can agree with most of what has been said above, I can understand IBM wanting to lock the HMC appliance down as much as possible and I understand the sysadmin desire to have full control of any machine on the network as Bakunin says - if there's not a competency issue. In truth, my main reason for coming down on the restricted side of this argument is exactly that - competency! I have a number of systems that have been up and running for longer than many of my support contacts have been systems admins, I don't actually have privileged access to many of the systems - I have elevated access or "root" access on none of the systems. Should I need root access, it has to be requested, approved and I am issued with a one-time password.

I find it to be a total pain, but that is the implemented system. On investigation, the reason for the system being implemented was, you guessed it, competency! Cited examples, well, I could give you any number. But an example that I think sums it up quite well is one that was easy to recover from, but could have been catastrophic had it been a customer-facing system with, say, five or six thousand users instead of a development system with just a couple of hundred developers: the "root" user executed a recursive delete command with a space in it from the root directory and effectively deleted the full contents of the server - mostly source code and development tools.

I have worked in the *NIX world since 1981, over that time I have watched the skill level of the sysadmin degrade, a lot of it revolves around training - my first "Sysadmin I" course was five weeks long and I never actually saw a machine. It was all spent sitting at a Wyse 30 terminal, with a number of other trainees. Now I see sysadmins working for major vendors, with no training whatsoever.

I am in many respects happy that these administrative and management appliances have been made idiot proof as much as possible, but also very wary - just when you find that you have secured the systems against Idiot V1.0, you'll find that the management will upgrade to Idiot V2.0.

IMHO only training and experience make for a competent sysadmin, but unfortunately these things come with a high price tag. Inexperienced resource is easy to find and cheap to run, moving the support offshore can exacerbate the problem - through language, not competency, although my personal experience has been that you have the same ratio of competent/incompetent people evenly distributed around the world.

I have tried to keep myself current with as much as I can, even attending further training - here I definitely agree with Bakunin. When I'm doing stuff "I want to know what I'm actually doing", after many years of AIX - and using "smit" on both AIX and Solaris (for information, smit was ported to Solaris by a major financial company in the UK), I knew about pressing F6 to see what was going to be run by the system. The standard of knowledge of the instructor made it obvious that he had almost no experience, as he couldn't answer some of the simplest questions and answered others incorrectly - at which point I actually asked to see the manager of the training facility to request reimbursement.

So when I see the standard of people moving into the sysadmin world, I can understand why the move to making things safe through idiot proofing. My approach would be to weed out the idiots and provision better training, but unfortunately that costs more.

Gull04

MadeInGermany · June 23, 2016, 2:54pm

Oh, you missed 11 exciting years :D
There is a strong belief that a new style of IT (cloud, virtualization, orchestration, automation, auto-scaling, self-healing, ...) will make traditional system administration obsolete. Instead management-by-click will emerge.
Just order your desired IT-functions on your smartphone, and voila - your new company can go!

dukessd · June 23, 2016, 9:23pm

Sounds like we all agree then.
"management-by-click" - of course that'll all just work as expected...
Where is the sad face emoticon?

gull04 · June 24, 2016, 2:46am

Hi Guys,

As to "management-by-click", well maybe. But just in case, my Pig is outside, saddled and ready to fly.

Gull04

Peasant · June 24, 2016, 7:39am

It's the monies.

If you have quality documentation on how things work under the hood, you get fewer calls to support (IBM, Oracle, HP or their bastard support firms), which equals less cash for them.

There is no love. Just plain cash hunting everywhere you turn.
I've seen in my short career (about 10 years, as opposed to you UNIX masters) folks intentionally delivering broken stuff just to fix it afterwards and fill their hourly/monthly quota.

The other problem is a new generation of kids emerging from technical and other faculties who know nothing and, worse, don't even want to learn.

All they do is run scripts someone else wrote.
Click puppets/AI installers/PXE someone else configured.
Everything needs to be done before, so mindless automatons can do their jobs.

Looks like 90% of IT today is in 'human centipede' mode, just forwarding crap around.

Excuse me upfront if i'm dull to you

Regards
Peasant.

bakunin · July 31, 2016, 10:45pm

It took some time to prove my point, but here it is. This is what happened last week:

On Tuesday both of my HMCs were no longer getting a connection to any of the managed systems although both were responding quite normally at their public interfaces both via ssh and the web GUI.

Our environment consists of about 20 p780 and p880, along with some smaller systems (p740s) thrown in for good measure. On that run some 350 LPARs of various sizes. Yes, we are a big shop and having no HMC to manage it is kind of a problem.

The first thing i did was trying to reboot one of the HMCs. As i can not really diagnose any problem because all the tools necessary for that are not available this was the best and fastest i could do to bring the system to a defined state. The reboot did take place, i did see the IPs of the service processors but not the given names of the managed systems any more. It did, of course, change nothing.

Since both HMCs lost all their connections at seemingly the same time i came up with the theory that maybe the network was responsible for that. So i got me a network admin and we traced the switch onto which all the service processors and the HMCs were connected. This management network is closed and unrouted, but we were able to confirm it worked and all the correct ARP information was there.

Note: forget about finding out things like the MAC address of an interface on the HMC. Because the ifconfig command to do so is such a complicated thing, IBM made things very easy for me by not confusing me at all with such information and made us dig into the logs of the switch to make sure the MAC addresses were what they should be. Thank you, IBM, for making my work so much easier.

At this point i opened a Prio-2-call at IBM. It was 2:00 pm and i expected to be called within the next 30 minutes. As it is, when i started to work with AIX more than 20 years ago this would have been a Prio-3-call and the phone would have rung within minutes. Times have changed.

It was Wednesday, somewhat past 14:30, when IBM finally deemed me worth an answer. The dispatcher first asked if i would agree to continue in English, which i allowed. (Big mistake. The English the technician spoke was barely understandable at all and i probably would have better understood his native Bulgarian even though i don't speak that at all.)

First i told him what exactly i did up to this point, including the network trace and the reboot. He told me he would send some procedures i should carry out for him per mail but it would take him ten minutes or so to prepare the mail. No problem! Something happens, finally. After 30 minutes i was wondering, after 1 hour i was angry. After waiting for two hours i called the hotline again and asked who they thought i am. Within minutes the same technician called me and told me that "something went wrong with my email because he tried several times but it always came back with 'address unresolvable'" or so. OK, things can happen - but: couldn't he have called me and asked?? He obviously knew how to call me, no?

Well, after i sent a mail to him myself he was able to answer that. I got a mail about how to create the hscpe user and use that to create a dump. I did so, then uploaded the dumps from both HMCs to IBM's support site. (If you ever have to do that: a dump is some 2.5 GB in size, so it takes some time.)

On Thursday i got a mail from the guy, telling me that the good news was that nothing was amiss with my hardware. He advised me to check for loose network connections. I wrote back a rather acerbic comment that i had done that first and had already told him so, painstakingly describing the network traces we did. Anyway, i went to the datacenter and made sure all the network connections were there (and, what a surprise, it turned out that an interface whose MAC address i was able to determine from the switch's ARP cache was indeed connected to that switch). I was told i would be passed over to second-level support.

On Friday nothing was to be heard from IBM. I suppose they were searching for the person doing the second-level support for this planet. In the meantime my colleague had a breakthrough, though: it is not possible to do a simple df on a HMC because that would perhaps disrupt the intricate work IBM has done with the HMC's software, but by issuing

lshmcfs

he was able to detect that on both HMCs the /var filesystem was 100% full. Yes, there is a method to remedy that, namely the chhmcfs command, but - as usual - it didn't work. So the final solution was to break into the HMC, become root and do what UNIX admins have always done: clean up the filesystem by using rm. After several reboots and several rediscovery rounds we saw - kudos to my colleague - all our managed systems again.
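On an ordinary UNIX system, where df is available, a full filesystem like this is usually caught by a small periodic check long before it hurts. The following is a minimal sketch of such a check, not something that runs in the HMC's restricted shell: the helper name full_filesystems is made up, it expects POSIX df -P output on stdin, and it assumes mount points without spaces. The output format of lshmcfs is deliberately not parsed here, because it is not shown in this thread.

```shell
#!/bin/sh
# Sketch only: list filesystems at or above a usage threshold.
# Expects "df -P" style input on stdin:
#   Filesystem 1024-blocks Used Available Capacity Mounted on
# Assumption: mount points contain no spaces.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                    # skip the header line
            use = $5
            sub(/%/, "", use)       # "100%" -> "100"
            if (use + 0 >= limit) print $6 " " use "%"
        }'
}

# Usage on a system where df is available:
# df -P | full_filesystems 95
```

Run from cron with a mail to the admin, something like this would have flagged /var well before the connections to the managed systems were lost.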

Conclusion:

Yes, it was my fault not to have had the idea about the /var FS earlier. I was tricked by both HMCs losing connection at about the same time and investigated in the completely wrong direction. On the other hand, this is not a UNIX system, it is an appliance. Why am i supposed to act as an admin checking filesystems when i was first denied all the tools admins have?

Second, my life was made so much easier by being forced to rely on tricks like pulling MAC addresses out of the switch's logs instead of simply issuing ifconfig. Find out how long a system is up: uptime. Find out how long a HMC is up: impossible. Check how many packets are being sent/received on a UNIX system: entstat or netstat. Find out the same on a HMC: impossible. This list goes on and on.

And finally: even if i had diagnosed the problem correctly it wouldn't have helped me any. We actually tried the "official" methods of cleaning up before, but they didn't work at all (as usual - i have seen them fail more often than not). Only breaking in and using normal UNIX commands did what was expected. And why did IBM not see that full FS in the 2.6GB dump they required me to upload? Do i really want to take the risk of my multi-million-dollar environment becoming completely unusable because i have a system at the center which i can neither diagnose nor administrate and it takes support three days to fail?

Why do i pay six-figure amounts of money only to be pestered by questions which i have answered before they were even asked just because the standard questionnaire says so? I can print that damned questionnaire out and read it to myself for free without having to wait a day just to be called back.

Now, please tell me again what this "appliance" is for and why it is making my life easier.

bakunin

agent.kgb · September 23, 2016, 4:23am

Answer from Rob McNelly:

IBM Systems Magazine - More on the HMC and root

MichaelFelt · September 28, 2016, 6:55am

I responded at IBM Systems Magazine - in the hope that more of IBM will see it. My concluding remark is:

There is actually, or perhaps was, an easy path to become root by opening a PMR. And, in a prior life - as an AIX instructor I taught customers (aka students) how to open a PMR (we did so during the class) - and I also showed how to reuse the password from the previous class (officially the passwords are only valid from midnight to midnight of the day issued - guess how to reuse it :P)

While I can understand the desire for root on HMC I long ago decided I would not even 'desire' it - but take IBM at its word about being an appliance and make sure - read demand - it works as an appliance.

I am quite capable of changing a pump in a car, washing machine or heating system. I am quite capable of administrating an HMC as root. However, all of these devices are sold and serviced by the seller as an appliance. If the pump is not working - I expect someone asap (per terms of the SLA) to replace the pump.

(Hope you like my metaphor!)

zaxxon · September 30, 2016, 4:49am

:D:D Sorry, I have to laugh, but those guys in Bulgaria give me déjà vu: another big company seems to have found the same cost-friendly country to place their support in. Those poor chaps often acted just as you described and sometimes didn't want to pass calls to the next level, which was also no big help.

So what Rob wrote in his answer, that you as customer should escalate etc. is in my eyes not a nice but maybe a common business behaviour these days with some big companies, to have the customer involved to ensure the quality of the vendor's support.
We heard this from the other big company too, but having to involve an escalation manager etc. gets tedious after some time as well and one asks oneself what is going wrong there, that I have to make so much effort to get some help or sometimes at least someone who even understands what my problem is.
If you put people there, that have not enough experience to offer good support, then this is a problem of the vendor and must not be a problem for the customer. It feels a bit like the concept of green banana software being used for support structures.

So why is the HMC so locked up...
Yes, in my long time as an AIX admin I did not like it at all and absolutely agree with you that people who are responsible for plenty of mission-critical servers with sensitive applications/users, and who already have the knowledge at hand to get along with the HMC, should be allowed to do so by default. They can still open a support call if they get stuck.

Because if an admin has no clue and screws up one or many LPARs, he will usually be in more serious trouble than the one that screws up a HMC, which usually comes redundant with 2 of them, where not all important LPARs are always redundant.
And don't forget the VIOS - do something wrong there and you have a good chance that really lots of LPARs get problems, so what.
In the end in a professional environment one will have a backup for the LPAR, VIOS as well as for the HMC.
And severe LPAR damage has most often a direct impact to users ie. our customers, even if it is "just" a cluster switch that takes some minutes but gets maybe 10k users disconnected and maybe some unpleasant attention by your boss/managers. Trouble with the HMC will usually go unnoticed by our users.

So the HMC is at least locked up for 2 reasons in my eyes:

a) The customer has to rely strongly on the support of the vendor. This is a dependency and some kind of "bonding" of the customer to the vendor. The vendor gets cash, the customer has a helping hand and feels good with it; simply put, they are just good friends and will most likely have more business in the future. So far the possible theory.

b) It was said in the discussion, that the admins often have not enough skill/experience - true, but these guys have been in business way back in time and such will be in the future.
I have the impression that, in favour of cost-efficient support structures, they have tried to make the HMC easy to maintain for their support, not because the customer side is so inexperienced.

cheers
zaxxon