old-www/LDP/LG/issue30/tag_lockups.html

271 lines
12 KiB
HTML

<!--startcut ======================================================= -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<META NAME="generator" CONTENT="lgazmail v1.1pre8">
<TITLE>The Answer Guy 30: Hardware Lockups due to Graphics Load</TITLE>
</head>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
ALINK="#FF0000">
<!--endcut ========================================================= -->
<H4>"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>
<P> <hr> <P>
<!-- =============================================================== -->
<H1 align="center"><A NAME="answer">
<img src="../gx/dennis/qbubble.gif" alt="" border="0" align="middle">
<a href="./index.html">The Answer Guy</a>
<img src="../gx/dennis/bbubble.gif" alt="" border="0" align="middle">
</A></H1> <BR>
<H4 align="center">By James T. Dennis,
<a href="mailto:linux-questions-only@ssc.com">linux-questions-only@ssc.com</a><BR>
Starshine Technical Services,
<A HREF="http://www.starshine.org/">http://www.starshine.org/</A> </H4>
<p><hr><p>
<H3><img src="../gx/dennis/qbub.gif" alt="(?)" width="50" height="28"
align="left" border="0">Hardware Lockups due to Graphics Load</H3>
<p><strong>From Brad Alexander on 30 May 1998
<!-- begin body -->
<BR><BR>
Hi Jim,
<br><br>
This isn't Linux-specific, but I'm having a problem and I'm hoping you can
help me come up with a workaround that isn't going to cost a lot of money.
<br><br>
I have an Intel P-100 on an Amptron AM-7900 board with 64MB of EDO RAM (2
32MB sticks), a gob of hard drives (a 2.2MB Quantum Fireball IDE and a
FutureDomain SCSI controller with a 420MB Conner, a 1GB Seagate, 1GB
Micropolis and 1GB Quantum Empire), a Diamond Stealth 64 with 2MB DRAM, and
a SoundBlaster 16 Plug'n'Pray.
<br><br>
I'm running a heavily modified RedHat 5.0 machine with an 800MB DOS
partition on <TT>/dev/hda1</TT> and a 200MB win95 partition on
<TT>/dev/hda3</TT> (Linux's <TT>/+/usr</TT> is on <TT>/dev/hda2</TT>).
<br><br>
I have been seeing system lockups for quite a while now. I noticed them
when running xlock in random mode initially, then noticed that I was also
starting to have problems with some of my dos apps, like Jane's Longbow and
Duke Nukem locking up. Under Linux, I settled on using <TT>xlock</TT> in
galaxy mode, and the lockups dropped to every couple of weeks. (Note that
during this time, I upgraded memory from 4 8MB sticks to 2 32s.)
<br><br>
Everything went all right until I upgraded to RedHat 5.0, with XFree86
3.3.1. The lockups increased to about every 2 days. Once I upgraded to
XFree86 3.3.2, they dropped back down to about once a week.
<br><br>
I'm basically using you as a sounding board to see if I might have missed
something. I'm thinking its hardware, but where? The stealth? The lockups
seem to occur during graphics app use, <TT>xlock</TT>, or the <TT>gimp</TT>.
The motherboard? The chip? What can I start replacing without sinking a whole
bunch of money into it?
<br><br>
Thanks in advance,
<br>--Brad Alexander
</strong></p>
<blockquote><img src="../gx/dennis/bbub.gif"width="50" height="28" alt="(!)"
align="left" border="0">
Well, the first thought would be to try a different video card.
I don't have too much confidence that the problem is truly
related to the video card's activity --- so it's just a
diagnostics start.
<br><br>
To see if this really is related to graphics, boot up the
system in text mode (don't run X, change your runlevel or
initdefault to one of the non-xdm modes if necessary). Now you
can run a couple of kernel builds on it (that's usually a
pretty good stress test. Try '<TT>make -j</TT>' to work it harder.
<br><br>
It would also be helpful to know what sort of lockup you're
getting. It may be that you could still login via a serial
port (using a null modem and a laptop or any other nearby
computer or terminal). Do do this simply add a line like
<blockquote><code>t1:23:respawn:/sbin/agetty -L 38400,19200,9600,2400,1200 ttyS1 vt100
</code></blockquote>
... to your <TT>/etc/inittab</TT>. This should allow you to use one of
your serial lines to login. It is possible for the Linux X
Windows system and console to be dead while the kernel and
other processes are still up and running. Another test is to
ping it from another system (if you have an ethernet LAN
connected to this machine). Even if telnet doesn't work you
want to ping it to see if the kernel is still responding.
<br><br>
It's also probably worth trying the software watchdog timer
code in the newer kernels. These allow you to configure a
kernel module to emulate a hardware watchdog timer card. These
WDT devices are basically a "dead man's switch" for your
system. If the timer isn't periodically updated by the kernel
(or by some <EM>other thread in the kernel</EM>, in the case of the
emulated WDT) then the WDT triggers a system reset.
<br><br>
Obviously a software emulation of this isn't quite as reliable
as a hardware WDT --- since a completely hung kernel will
never get around to calling on that module's thread of
execution. However, it isn't too unlikely that the hang is in
some specific kernel thread and that some other thread
continues to execute after other parts have died.
<br><br>
Frankly I'm not sure what the difference between the kernel
watchdog emulation code and the boot "<TT>panic=</TT>" parameter. But
that's definitely another thing to try (just add something
like <TT>panic=60</TT> to your lilo "<TT>append=</TT>" directive, or
manually when you boot up your system). I guess that the difference
would be that there may be some conditions under which the
kernel could get into a comatose or unresponsive state without
panic'ing (if it got tricked into some really long timeout wait
or something). The <TT>panic=</TT> option forces the Linux kernel to
reboot after a "panic" (a critical error condition detected by
the kernel, usually a corrupted table that fails its consistency and
integrity checks).
<br><br>
Normally the kernel would just display a "panic" message and
sit there waiting for human intervention. These are very
rare (other than the old "VFS kernel panic, unable to mount
root" that occurs when you have your kernel misconfigured for
your arrangement of hard drives --- or when you change the
hardware setting of your disk drives without updating your
kernel (with the '<TT>rdev</TT>' command to set the root device flags)
and/or without updating your <TT>LILO</TT> or <TT>LOADLIN</TT> commands
(which are usually used to pass these flags to your kernel to
over-ride the compiled in defaults).
<br><br>
Other than that common case I think I've only seen one or two
Linux kernel panics in the last 6 years. I've only had about a
half dozen unexplained system lockups over that period --- and
that's on about fifty Linux machines that I've managed during
various portions of that time. These lockups might have been
panics in situations that were so bad the kernel couldn't even
display an error message, there's no way to know).
<br><br>
I've only had to reboot unresponsive Linux boxes about a dozen
or so times in all the years I've used it. This was only a
problem in the late .99 and early 1.0x kernels when I was
running a very busy FTP/Web server that was simply overloaded
-- the TCP/IP stack would get so congested that the system
would timeout between my login name and password --- at the
console (I'd've loved a working SAK --- secure attention key
back then). I was glad to see the major TCP/IP re-write in
between 1.2 and 2.x.
<br><br>
I'm not trying to tout Linux' horn here --- (well, maybe a
little). The point is that I don't get panics and lockups
often enough to see how the <TT>panic=</TT> parameter and the
softdog/watchdog code would work in those situations.
<br><br>
However, if you enabled the <TT>panic=</TT> and/or the softdog
kernel option, you may see that the machine reboots without a minute
or two after your lockup (wait for ten or fifteen). This tells
you that some part of the kernel was still running (and that
the hardware isn't completely wigged out).
<br><br>
Beyond that the things to do are to take out all non-essential
hardware (the sound card would be a great choice --- and the
SCSI card, since you mention that your Linux partitions are on
the IDE drives. As with most technical computing issues the it
eventually boils down to a matter of cost. You mentioned a
couple of times how you don't want to spend money on solving
this problem. Ultimately the time you spend fighting with it
translates to money --- and you'll have to eventually ask what
your time is worth.
<br><br>
(The deeper part of this question is that you may find that
your home machine isn't worth the time <EM>or the money</EM> and you
may content yourself to just use any machines that you
encounter at work, or whatever. Strange as that sounds I've
had friends who refuse to keep a computer around the house
specifically because they "spend enough time with them at work"
and feels that "home is for family time").
<br><br>
At the same time I don't recommend throwing replacement
components at the problem without understanding the nature of
the problem. However, it may be that the best solution is to
replace the motherboard and/or the video card and/or the RAM.
<br><br>
Troubleshooting computers is difficult work. Whole books have
been devoted to the subject (I like the Win L. Rosch Hardware
Bible personally --- read it years ago and should probably get
an updated copy). There are also parts of the process that
can't be gained from any book --- that you must learn by
experience and figure out through some combination of analysis
and intuition. As our computers become more sophisticated the
balance seems to lean more for the intuition.
</blockquote>
<!-- end body -->
<!--================================================================-->
<P> <hr> <P>
<H5 align="center"><a href="http://www.linuxgazette.com/copying.html"
>Copyright &copy;</a> 1998, James T. Dennis <BR>
Published in <I>Linux Gazette</I> Issue 30 July 1998</H5>
<P> <hr> <P>
<!--================================================================-->
<table width="98%"><tr valign="center" align="center">
<td rowspan="3"><A HREF="./lg_answer30.html"><IMG
SRC="../gx/dennis/answernew.gif"
ALT="[ Answer Guy Index ]"></A></td>
<td><A HREF="tag_SCOkeys.html">SCOkeys</A></td>
<td><A HREF="tag_chroot.html">chroot</A></td>
<td><A HREF="tag_dosemu-db.html">dosemu-db</A></td>
<td><A HREF="tag_NTauth.html">NTauth</A></td>
<td><A HREF="tag_cdr.html">cdr</A></td>
<td><A HREF="tag_3270.html">3270</A></td>
<td><A HREF="linux-questions-only@ssc.comport.html">comport</A></td>
</tr><tr valign="center" align="center">
<td><A HREF="tag_lilostop.html">lilostop</A></td>
<td><A HREF="tag_emulate.html">emulate</A></td>
<td><A HREF="tag_ppadrivers.html">ppadrivers</A></td>
<td><A HREF="tag_database.html">database</A></td>
<td><A HREF="tag_vacation.html">vacation</A></td>
<td><A HREF="tag_nullmodem.html">nullmodem</A></td>
<td><A HREF="tag_lockups.html">lockups</A></td>
</tr><tr valign="center" align="center">
<td><A HREF="tag_gzipC.html">gzipC</A></td>
<td><A HREF="tag_newlook.html">newlook</A></td>
<td><A HREF="tag_c500.html">c500</A></td>
<td><A HREF="tag_solprint.html">solprint</A></td>
<td><A HREF="tag_vc1shell.html">vc1shell</A></td>
<td><A HREF="tag_memleak.html">memleak</A></td>
<td><A HREF="tag_tvcard.html">tvcard</A></td>
</tr></table>
<P> <hr> <P>
<!--================================================================-->
<A HREF="./index.html"><IMG SRC="../gx/indexnew.gif"
ALT="[ Table Of Contents ]"></A>
<A HREF="../index.html"><IMG SRC="../gx/homenew.gif"
ALT="[ Front Page ]"></A>
<A HREF="lg_bytes30.html"><IMG SRC="../gx/back2.gif"
ALT="[ Previous Section ]"></A>
<A HREF="./vrenios.html"><IMG SRC="../gx/fwd.gif"
ALT="[ Next Section ]"></A>
<!--startcut ======================================================= -->
</body>
</html>
<!--endcut ========================================================= -->