instability

My server, which runs Planet Lisp, xach.com, etc., has been flaking out with increasing frequency lately.

Here's the setup:

  • Relion 1XT 1U Pentium 4 3.0GHz from Penguin Computing
  • 2GB ECC memory
  • Two 80GB SATA drives in md-raid 1 mounted on /
  • Fedora Core 4, kernel 2.6.13-1.1532_FC4

It's gotten to the point where it is locking up every few days. I can't even compile a kernel; it either segfaults or I get this in random include files:

error: static or type qualifiers in non-parameter array declarator

Everything screams "hardware problem". I'm extremely bummed about it. I didn't save any of the material to ship the unit back, if that proves necessary, and I don't really want to have weeks of downtime waiting for some resolution. Anyone have any comments or suggestions?

UPDATE: The server will be going down for overnight maintenance today. Planet Lisp should be back up sometime on Thursday.

Comments

Personally, I'd start with running a memory test (my usual tool is memtest86, written out as a boot sector on a CD, but it works just fine on a floppy). I'd suggest running the memory test without any OS running on the machine. Obviously, this means there will be some downtime. Switch it to exhaustive testing; I think it's set for a slightly less complete array of tests by default. It'll most probably take a while to complete, though.

If that's fine, I'd start suspecting the disk controller. If the CPU were fried, you'd most probably get a crash while trying to run memtest86.
Swap that RAM out.

(Anonymous)

Memory or disk?

Since you run ECC RAM it's probably not the RAM, maybe the mainboard?

Anyway, what I'd recommend is a fresh install. Maybe some data block got corrupted on disk?

Taking out the CPU and RAM and reseating them (and doing the same with all the connectors) might help too.

I once had a PC that had random crashes, usually after about 20 minutes of uptime. I took out the RAM, and after putting it back in the machine stopped booting. Solution: one of the RAM slots was broken. With the RAM in the other slots it worked perfectly again...

(Anonymous)

Re: Memory or disk?

Since you run ECC RAM it's probably not the RAM, maybe the mainboard?
Not so. If the RAM is faulty, it won't correct properly. I've seen failures in ECC with the memory tester I recommended below.
I would suspect RAM. How much free RAM do you have on the system? Perhaps you have bad memory in a high range, like at 1.5GB, and it isn't noticeable until you try to compile a kernel.
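
To answer the free-RAM question on a Linux box, here is a rough sketch in Python that parses the text of /proc/meminfo. The function names `parse_meminfo` and `read_meminfo` are made up for illustration; the only thing assumed about the system is the usual `Name:   value kB` layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts and are returned as-is.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not "Name: <number> [unit]".
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Something like `m = read_meminfo(); print(m["MemFree"], "kB free of", m["MemTotal"])` gives a quick idea of how much of the 2GB is normally untouched. It says nothing about whether that memory is good; only a real memory tester can tell you that.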

If not RAM, it may be bad L2 cache.

memtest86 is really the right way to test this.

Is the system crashing with a kernel panic and a stack trace, or does it just go dead?
Kernel panic and stack trace the last few times.
Got any details? If it is not bad RAM (and I think it will be), then you may have some sort of oddball kernel bug to track down. I've had production machines that ran beautifully for a year and then got past a certain workload threshold and had kernel panics due to SCSI/RAID controller driver bugs. I have also seen kernel panics (not in recent years) on multiprocessor machines with certain drivers that were not properly spinlocked, which only showed up under the right set of conditions (but you don't have an MP machine).

Sometimes the only way to really catch things like this is to redirect the console to a serial port, set up a laptop with minicom logging the serial port output, and wait.
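
If minicom isn't handy on the logging machine, the capture side can be sketched in a few lines of Python. This is only an illustration under some assumptions: the crashing server sends its console to the serial port (for example with the kernel parameter `console=ttyS0,115200`), the port on the logging machine has already been set to the same speed (e.g. with `stty`), and the device path and log file name used in the note below are placeholders. The function name `log_stream` is made up for this sketch.

```python
import time


def log_stream(src, dst, clock=time.time):
    """Copy lines from src to dst, prefixing each with a local timestamp.

    src and dst are any text-mode file objects. Returns the number of
    lines copied once src reaches end of file.
    """
    count = 0
    for raw in src:
        line = raw.rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"{stamp} {line}\n")
        # Flush after every line so the last lines before a panic are
        # on disk even if the logger itself is interrupted.
        dst.flush()
        count += 1
    return count
```

On the laptop it could be run as `log_stream(open("/dev/ttyS0", errors="replace"), open("console.log", "a"))` and left going until the next lockup. The timestamps make it easier to line the panic up with whatever the server was doing at the time.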

(Anonymous)

sounds like a memory problem

http://www.memtest.org/ has the best memory tester I've ever seen... better than ones I've purchased. I've seen failures in pass 13-14, so if I can make it past pass 15 I know the memory is OK. I recommend you run this overnight.

(Anonymous)

Re: sounds like a memory problem

btw, you can use http://www.memtest86.com/, too, but the one I first suggested is a more advanced fork of this. We've switched over to the .org version.

Good story in this vein
