Any hope for this bootlooping Sun V210?

Hi,

First post here!

I have a Sun V210 that I use occasionally for build testing things on big-endian. I switched it on the other day, and it ain't comin' up. I was wondering if anyone on this fine forum knows if it can be brought back from the dead.

With the SCC card in, and connected to the serial management line, if I switch on I get:

ALOM - POST run incomplete previously, no POST this time

ALOM BOOTMON v1.6.10
ALOM Build Release: 001
Reset register: e8000000 EHRS ESRS LLRS CSRS


Check for Handshake


Returned from Boot Monitor and Handshake



Clearing Memory Cells
Memory Clean Complete


Loading the r
ALOM - POST run incomplete previously, no POST this time

ALOM BOOTMON v1.6.10
ALOM Build Release: 001
Reset register: e8000000 EHRS ESRS LLRS CSRS


Check for Handshake


Returned from Boot Monitor and Handshake



Clearing Memory Cells
Memory Clean Complete


Loading the r
ALOM - POST run incomplete previously, no POST this time

The fans are screaming away while these messages are looping.

If I remove the SCC card, there is more info (but no fans):

ALOM - POST run incomplete previ
ALOM - Could not get all data from I2C - min post, no power on
ALOM - Could not get diag-switch from I2C

ALOM BOOTMON v1.6.10
ALOM Build Release: 001
Reset register: e0000000 EHRS ESRS LLRS


ALOM POST 1.0


Dual Port Memory Test, PASSED.

TTY External - Internal Loopback Test
TTY External - Internal Loopback Test, PASSED.

TTYC - Internal Loopback Test
TTYC - Internal Loopback Test, PASSED.

TTYD - Internal Loopback Test
TTYD - Internal Loopback Test, PASSED.


Memory Data Lines Test
Memory Data Lines Test, PASSED.

Memory Address Lines Test
  Slide address bits to test open address lines


ERROR: ALOM POST TEST
H/W under test    = Memory System Address Lines
    Test name     = Memory Address Lines Test
    Subtest name  = Memory Address Test 2

    Failure: Writing 0xFF to offset Address of 00000020

              Testing Address Line - SSP_ADDR<26> 

    Most LIKELY cause(s) of this failure include:

      The interconnection may be bad

          SSP_ADDR lines from U0301 to U0503
          SSP_DAT lines from U0301 to U0503

      U0503 the DRAM may be bad

END_ERROR



ERROR: ALOM POST TEST
H/W under test    = Memory System Address Lines
    Test name     = Memory Address Lines Test
    Subtest name  = Memory Address Test 4

    Failure: Base write to 0x0 affected offset of 00000020

              Testing Address Line - SSP_ADDR<26> 

    Most LIKELY cause(s) of this failure include:

      The interconnection may be bad

          SSP_ADDR lines from U0301 to U0503
          SSP_DAT lines from U0301 to U0503

      U0503 the DRAM may be bad

END_ERROR
ERROR: ALOM POST TEST
H/W under test    = Memory System Address Lines
    Test name     = Memory Address Lines Test
    Subtest name  = Memory Address Test 2

    Failure: Writing 0xFF to offset Address of 00200000

              Testing Address Line - SSP_ADDR<22> 

    Most LIKELY cause(s) of this failure include:

      The interconnection may be bad

          SSP_ADDR lines from U0301 to U0503
          SSP_DAT lines from U0301 to U0503

      U0503 the DRAM may be bad

END_ERROR
ERROR: ALOM POST TEST
H/W under test    = Memory System Address Lines
    Test name     = Memory Address Lines Test
    Subtest name  = Memory Address Test 4

    Failure: Base write to 0x0 affected offset of 00200000

              Testing Address Line - SSP_ADDR<22> 

    Most LIKELY cause(s) of this failure include:

      The interconnection may be bad

          SSP_ADDR lines from U0301 to U0503
          SSP_DAT lines from U0301 to U0503

      U0503 the DRAM may be bad

END_ERROR

  Test for shorted address lines
ERROR: ALOM POST TEST
H/W under test    = Memory System Address Lines
    Test name     = Memory Address Lines Test
    Subtest name  = Memory Address Test 5

    Failure: Writing data.
    Error at memory address: 00000020
    Good data was:           00000006
    Bad data was:            00000008
    XOR data was:            0000000e

    Most LIKELY cause(s) of this failure include:


I'm able to invoke the escape menu, and I've tried resetting the ALOM from there. No cigar.

So looks like bad RAM or bad RAM controller? I tried removing all RAM and booting. No change.
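For reference, the numbers in the POST errors can be checked by hand. A minimal sketch in Python, using only values copied from the output above; the variable names are mine, and how ALOM maps an offset to an SSP_ADDR<n> line number is something I don't know, so the bit indexes below simply count from the least significant bit:

```python
# Values copied from "Memory Address Test 5" in the POST output above.
good = 0x00000006
bad = 0x00000008

# The "XOR data was" line is just good XOR bad.
xor = good ^ bad
print(f"XOR: {xor:08x}")

# Data bits that differ between the expected and the observed value.
differing_bits = [bit for bit in range(32) if (xor >> bit) & 1]
print("differing data bits:", differing_bits)

# Each failing offset has a single bit set, i.e. one address line toggled
# at a time. Bit index counts from the least significant bit; this is NOT
# claimed to match ALOM's SSP_ADDR<n> numbering.
offset_bits = {offset: offset.bit_length() - 1 for offset in (0x00000020, 0x00200000)}
for offset, bit in offset_bits.items():
    print(f"offset {offset:08x}: bit {bit}")
```

So three data bits came back wrong at a single address, and both failing offsets are single-bit addresses, which fits the "address lines" diagnosis in the log but doesn't tell me whether it is the DRAM or the interconnect.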

I notice a lot of jumpers on the main board, but searching the internet, I can't find their functions. I wonder if any of those could help?

Any hope for this poor machine? Thanks

Hi,

Having had a quick look through the logs that you've posted - I suspect that the ALOM has the issue!

Although it's a long shot here, the system should disable any faulty FRUs and allow you to log in over the network. This presumes that the system is set to do that and that the faulty FRU has some redundancy.

You can then use the eeprom command to move the console to TTYB (the port marked 10101 on the rear panel), for example with eeprom input-device=ttyb, and you should then see the ok prompt there. You may also have to set the output device (output-device) as well.

Regards

Gull04

Hi gull04,

Thanks for the reply.

I'll give that a shot!

I was hoping that the serial ALOM could be revived somehow, but I guess not.

I do actually have a spare mainboard that I could put in, but I've been hesitant because it needs an ALOM password reset and I don't have a Solaris install to hand to run scadm :\

Hi Vext01,

I'm maybe a bit rusty on the vSeries now, but the default ALOM account was "admin" - you may be able to return to that state by removing the button battery for a while. Also on the V210 remember to transfer the SCC card (the server won't boot without it, as it holds both the MAC address and hostid details).

Regards

Gull04

Sadly the ALOM password has been reset by whoever owned the motherboard before.

Even more sadly, the ALOM password is stored in flash memory, so removing the battery doesn't kill the password.

I found this article (which I'm unable to link to because I'm a new member -- the title is "unbricking a sun fire v210" if you wanted to search for it).

He says:

But he does say:

So I think I can install Solaris via that way, and then reset the ALOM password with scadm.

Hi,

You could try that, if you have a Solaris DVD you could boot that and reset the ALOM password - on the basis that you can get to the ok prompt.

Regards

Gull04

Well, I was unable to install Solaris on the other machine, as it was a "managed system", meaning that you can't boot it with the SCC card ejected. It would just shut down if you tried to turn it on.

I used another SCC card with a known password to boot it. This confirms the password is stored on the SCC card.

The problem I'm faced with now is that I can't get a 'OK>' prompt with 'console -f', even after resetting all ALOM settings to defaults with 'setdefaults -a'.

Any ideas why this would be?

Hi,

So what if any output do you see when you do a poweron from the ALOM?

Regards

Gull04

Hi gull04,

Thanks for your ongoing patience with this thread!

I'm not at the machine, but off the top of my head `poweron`:

  • does indeed turn the system on and spin up the fans.
  • generates the "system power on event" message.
  • produces a few fan failure messages.

About the fans: one fan failure message was for a CPU fan. I removed that CPU entirely, as I believe the system should work with only 1 CPU anyway. There was also one chassis fan failure. Again, I don't think that would stop the system booting. (I have spares which I will fit if I can get the system to boot.)

IIRC, once the system is in the poweron state, `console -f` (followed by enter twice) should give an 'ok' prompt.

I can get you the exact ALOM messages later when I'm at the system.

------ Post updated at 06:48 PM ------

I replaced the fans on the CPU.

The console output looks like this:

sc> poweron
sc> 
SC Alert: SC System booted.

sc> console -f
Enter #. to return to ALOM.

SC Alert: VOLTAGE_SENSOR @ MB.BAT.V_BAT has exceeded low warning threshold.

I can hit enter as many times as I like, but no 'ok' prompt.

If the CPU were bad, the ALOM would know, right?

------ Post updated at 07:24 PM ------

I also tried resetting NVRAM, and providing a bootscript like:

setenv output-mode ttya

no cigar.

Hi,

Can you show the output of sc> showlogs and sc> showenvironment? I seem to remember reading that the button battery (either a CR2025 or a CR2032, I think) shouldn't make a difference if it fails, but that there had been instances of being unable to get the server powered on.

Regards

Gull04

I'll get you the output from those commands later when I'm at the machine again.

And I'll get a new battery in case that is the issue!

Thanks

Hi Vext01,

The battery may not be the issue - although I'm sure that it's not helping in any way. From memory there was a Sun document published (probably about 10 years ago) that mentioned issues with the ALOM when the battery was discharged. It shouldn't have made any difference as the settings were all stored in NVRAM, but the flat battery could cause unpredictable behaviour (or something like that).

I'd like to give this some more attention, but I'm preparing loads of stuff or auctioning things for a sequence of "Disaster Rehearsals" - so can only be here and active when the Auditors and Business Users are off patting themselves on the back!

Regards

Gull04

Hi Gull,

I've replaced the battery and it hasn't helped, I'm afraid.

Any last ideas before I source a new mainboard?

Cheers

Hi Guys,

I've read this discussion with interest. My one comment is that, AFAIR, the default ALOM userid/password combination for this series of SPARC is either admin/changeme or, on later models, root/changeme.

And, yes, the means of resetting the ALOM password is to install Solaris (or boot a 'live' DVD) and use the 'scadm' command.

Without successfully logging into the ALOM it won't let you do much.

Hi hicksd8,

I did eventually get into the LOM by swapping the SCC card for one with a known password, but the first thing I will do if I can get the `OK` prompt is install Solaris and reset the password on the first SCC card.

If only I could get the 'OK' OpenBoot prompt :(

It looks like your 'poweron' command is not properly executing and the mobo is not powering up. Hence, no OK> prompt after 'console'.

Here's another live thread where I'm having a similar discussion which might be worth your reading, perhaps not.

Netra 240 is similar to V240 is similar to V210

Yes, that's what the first mainboard did. The ALOM is corrupted I think. I gave up on that board.

Note that more recently in this thread I'm using a different mainboard.

Hi Vext01,

I have just been through the thread again and have a couple of quick questions/observations.

Firstly, it's fairly obvious (I think) that you are able to get into the ALOM and interact with it - just the ALOM doesn't seem to be communicating correctly with the system.

Secondly, you mention that you removed a CPU. I'm not certain on this, but from the dim and dark recesses of my memory I have this nagging feeling that these vSeries servers need CPU 0 to boot - although I would expect some better diagnostics than you are getting.

Finally, did you have any joy with the sc> showlogs or the sc> showenvironment commands?

Regards

Gull04

Hi Gull04,

As for CPU, I put both CPUs back in yesterday. I'll try some spare CPUs in slot 0 in case the CPU is faulty, although I'd have hoped that the ALOM would have realised if it was.

Here's the output of those commands:

sc> poweron
sc> showlogs

Log entries since DEC 03 10:13:29
----------------------------------
DEC 03 10:13:29 : 00060003: "SC System booted."
DEC 03 10:13:32 : 00060000: "SC Login: User admin Logged on."
DEC 03 10:16:23 : 00040001: "SC Request to Power On Host."
sc> showlogs -v
Persistent event log
--------------------
DEC 01 18:49:12 : 00040029: "Host system has shut down."
DEC 01 18:51:07 : 0004003e: "Different SCC detected. SC will reset itself momentarily."
DEC 01 18:51:39 : 00040002: "Host System has Reset"
DEC 01 18:51:52 : 00060003: "SC System booted."
DEC 01 18:52:57 : 00040071: "DISK @ HDD0 has been removed."
DEC 01 18:53:09 : 00040072: "DISK @ HDD1 has been inserted."
DEC 01 18:53:21 : 00040072: "DISK @ HDD0 has been inserted."
DEC 01 18:53:33 : 00040071: "DISK @ HDD0 has been removed."
DEC 01 18:54:16 : 00060004: "SC Request to Reset Host."
DEC 01 18:54:45 : 00040066: "ENCLOSURE_FAN @ F3.RS has FAILED."
DEC 01 18:55:11 : 00060016: "SC Request to execute XIR Reset on the Host."
DEC 01 18:57:44 : 00040002: "Host System has Reset"
DEC 01 18:57:46 : 00060003: "SC System booted."
DEC 01 18:59:31 : 0004000e: "SC Request to Power Off Host Immediately."
DEC 01 18:59:41 : 00040029: "Host system has shut down."

Log entries since DEC 03 10:13:29
----------------------------------
DEC 03 10:13:29 : 00060003: "SC System booted."
DEC 03 10:13:32 : 00060000: "SC Login: User admin Logged on."
DEC 03 10:16:23 : 00040001: "SC Request to Power On Host."

sc> showenvironment


=============== Environmental Status ===============


--------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
--------------------------------------------------------------------------------
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
--------------------------------------------------------------------------------
MB.P0.T_CORE    OK         48     --      --      --     110      115      118
MB.P1.T_CORE    OK         62     --      --      --     110      115      118
MB.T_ENC        OK         20     -6      -3       5      40       48       51

--------------------------------------
Front Status Panel:
--------------------------------------
Keyswitch position: NORMAL

--------------------------------------------------------
System Indicator Status:
--------------------------------------------------------
MB.LOCATE            MB.SERVICE           MB.ACT              
--------------------------------------------------------
OFF                  OFF                  OFF                 

--------------------------------------------
System Disks:
--------------------------------------------
Disk   Status            Service  OK2RM
--------------------------------------------
HDD0   OK                OFF      OFF
HDD1   NOT PRESENT       OFF      OFF

----------------------------------------------------------
Fans (Speeds Revolution Per Minute):
----------------------------------------------------------
Sensor           Status           Speed   Warn    Low
----------------------------------------------------------
F0.RS            UNAVAILABLE         --     --   1000
F1.RS            UNAVAILABLE         --     --   1000
F2.RS            UNAVAILABLE         --     --   1000
F3.RS            UNAVAILABLE         --     --   1000
MB.P0.F0.RS      UNAVAILABLE         --   2000   2000
MB.P0.F1.RS      UNAVAILABLE         --   2000   2000
MB.P1.F0.RS      UNAVAILABLE         --   2000   2000
MB.P1.F1.RS      UNAVAILABLE         --   2000   2000

--------------------------------------------------------------------------------
Voltage sensors (in Volts):
--------------------------------------------------------------------------------
Sensor         Status       Voltage LowSoft LowWarn HighWarn HighSoft
--------------------------------------------------------------------------------
MB.P0.V_CORE   OK             1.47      --    1.26    1.54       --
MB.P1.V_CORE   OK             1.49      --    1.26    1.54       --
MB.V_VTT       OK             1.28      --    1.17    1.43       --
MB.V_GBE_+2V5  OK             2.50      --    2.25    2.75       --
MB.V_GBE_CORE  OK             1.20      --    1.08    1.32       --
MB.V_VCCTM     OK             2.54      --    2.25    2.75       --
MB.V_+2V5      OK             2.58      --    2.34    2.86       --
MB.V_+1V5      OK             1.52      --    1.35    1.65       --
MB.BAT.V_BAT   OK             3.21      --    2.70      --       --

--------------------------------------------
Power Supply Indicators: 
--------------------------------------------
Supply    Active  Service  
--------------------------------------------
PS0       ON      OFF

------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply  Status          Underspeed  Overtemp  Overvolt  Undervolt  Overcurrent
------------------------------------------------------------------------------
PS0     OK              OFF         OFF       OFF       OFF        OFF

----------------------
Current sensors: 
----------------------
Sensor          Status
----------------------
MB.FF_SCSI       OK
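
In case it helps anyone compare captures between attempts, here is a minimal sketch that splits showlogs lines into timestamp, event code and message, then lists whatever was logged after the last power-on request. The regular expression is only my guess at the line layout based on the output above, and the names (LOG_LINE, events, after_poweron) are made up for illustration:

```python
import re

# Assumed layout, inferred from the capture above:
#   MON DD HH:MM:SS : <8 hex digits>: "<message>"
LOG_LINE = re.compile(
    r'^(?P<stamp>[A-Z]{3} \d{2} \d{2}:\d{2}:\d{2}) : '
    r'(?P<code>[0-9a-f]{8}): "(?P<message>.*)"$'
)

# Lines copied from the "Log entries since DEC 03" section above.
sample = """\
DEC 03 10:13:29 : 00060003: "SC System booted."
DEC 03 10:13:32 : 00060000: "SC Login: User admin Logged on."
DEC 03 10:16:23 : 00040001: "SC Request to Power On Host."
"""

# Keep only lines that match the assumed layout.
events = []
for line in sample.splitlines():
    match = LOG_LINE.match(line.strip())
    if match:
        events.append(match.groupdict())

# Index of the most recent power-on request, then everything logged after it.
last_poweron = max(
    i for i, event in enumerate(events) if "Power On Host" in event["message"]
)
after_poweron = events[last_poweron + 1:]

print(f"{len(events)} events parsed")
print("events after last power-on request:", after_poweron)
```

With this capture the list after the power-on request is empty: the SC logged the request and then nothing else from the host.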

Hi Vext01,

As you've had the machine in bits and have replaced the guts, I'm going to stick my neck out and suggest that the connections should be good, as they should all have been re-seated during component changes. I'd maybe be tempted to swap the PSU if you have one, but the maintenance indicator would generally be on if it was faulty.

It might be that just pulling and re-seating the power supply would be enough.

I'm kind of running out of steam on this: you've changed the MOBO and the CPUs, and the fault has remained pretty consistent. From the environmentals it doesn't seem to be power related. It's unlikely that this will make any difference, but can you capture the ALOM config and then run sc> resetsc? You'll have to go through setting up the ALOM again (although you should just need to change the password, as the rest should work).
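
As a quick sanity check on the power side, the voltage table from the showenvironment output can be compared against its own warning thresholds. This is only a rough sketch: the rows are copied by hand from the post above, None stands for a "--" entry, and the helper name out_of_range is made up:

```python
# Rows copied from the "Voltage sensors" table in the showenvironment output.
# Columns: sensor, reading, LowWarn, HighWarn (None where the table shows "--").
readings = [
    ("MB.P0.V_CORE", 1.47, 1.26, 1.54),
    ("MB.P1.V_CORE", 1.49, 1.26, 1.54),
    ("MB.V_VTT", 1.28, 1.17, 1.43),
    ("MB.V_GBE_+2V5", 2.50, 2.25, 2.75),
    ("MB.V_GBE_CORE", 1.20, 1.08, 1.32),
    ("MB.V_VCCTM", 2.54, 2.25, 2.75),
    ("MB.V_+2V5", 2.58, 2.34, 2.86),
    ("MB.V_+1V5", 1.52, 1.35, 1.65),
    ("MB.BAT.V_BAT", 3.21, 2.70, None),
]


def out_of_range(value, low, high):
    """Return True if value is below low or above high; None means no limit."""
    too_low = low is not None and value < low
    too_high = high is not None and value > high
    return too_low or too_high


flagged = [name for name, value, low, high in readings if out_of_range(value, low, high)]
print(flagged or "all voltage readings are inside their warning thresholds")
```

Every reading sits inside its thresholds, including the battery after the swap, which is consistent with the ALOM reporting OK for each sensor.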

Additionally, something has just come to the fore. It didn't happen to me - it was an engineer that I worked with. He had a fault on a V210, or it may have been a V240, where he replaced the motherboard and couldn't get to the ok prompt.

It turned out that he'd mixed the matched pairs of memory, and that causes the OBP not to start. It didn't supply an error message either. I've been in touch with him, and although the DIMMs were the same speed, they had different manufacturers and that was enough.

Regards

Gull04