<!--startcut ======================================================= -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<META NAME="generator" CONTENT="lgazmail v1.4F.k">
<TITLE>The Answer Gang 74: random crashes - how to prepare bug report?</TITLE>
</HEAD><BODY BGCOLOR="#FFFFFF" TEXT="#000000"
LINK="#3366FF" VLINK="#A000A0">
<!--endcut ========================================================= -->
<P> <hr>
<!--startcut ======================================================= -->
<CENTER>
<!-- *** BEGIN navbar *** -->
<!-- *** END navbar *** -->
</CENTER>
</p>
<!--endcut ========================================================= -->
<!--startcut ======================================================= -->
<P> <hr>
<!-- begin tagnav ::::::::::::::::::::::::::::::::::::::::::::::::::-->
<p align="center">
<table width="100%" border="0"><tr>
<td align="right" valign="center"
><IMG ALT="" SRC="../../gx/navbar/left.jpg"
WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="middle" border="0"
><A HREF="..//"
><IMG SRC="../../gx/navbar/toc.jpg" align="middle"
ALT="[ Table Of Contents ]" border="0"></A
><A HREF="../lg_answer74.html"
><IMG SRC="../../gx/dennis/answertoc.jpg" align="middle"
ALT="[ Answer Guy Current Index ]" border="0"></A></td>
<td align="center" valign="center"><A HREF="../lg_answer74.html#greeting"><img align="middle"
src="../../gx/dennis/smily.gif" alt="greetings" border="0"></A> &nbsp;
<A HREF="../tag/bios.html">Meet&nbsp;the&nbsp;Gang</A> &nbsp;
<A HREF="1.html">1</A> &nbsp;
<A HREF="2.html">2</A> &nbsp;
<A HREF="3.html">3</A> &nbsp;
<A HREF="4.html">4</A> &nbsp;
<A HREF="5.html">5</A> &nbsp;
<A HREF="6.html">6</A> &nbsp;
<A HREF="7.html">7</A> &nbsp;
<A HREF="8.html">8</A> &nbsp;
<A HREF="9.html">9</A>
</td>
<td align="left" valign="center"><A HREF="../../tag/kb.html"
><IMG SRC="../../gx/dennis/answerpast.jpg" align="middle"
ALT="[ Index of Past Answers ]" border="0"></A
><IMG ALT="" SRC="../../gx/navbar/right.jpg" align="middle"
WIDTH="14" HEIGHT="45" BORDER="0"></td></tr></table>
</p>
<!-- end tagnav ::::::::::::::::::::::::::::::::::::::::::::::::::::-->
<!--endcut ========================================================= -->
<P> <hr> <P>
<!-- ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: -->
<center>
<H1><A NAME="answer">
<img src="../../gx/dennis/qbubble.gif" alt="(?)"
border="0" align="middle">
<font color="#B03060">The Answer Gang</font>
<img src="../../gx/dennis/bbubble.gif" alt="(!)"
border="0" align="middle">
</A></H1>
<BR>
<H4>By Jim Dennis, Ben Okopnik, Dan Wilder, Breen, Chris, and the Gang,
the Editors of Linux Gazette...
and You!
<br>Send questions (or interesting answers) to
<a href="mailto:linux-questions-only@ssc.com">linux-questions-only@ssc.com</a>
</H4>
<p><em><font color="#990000">There is no guarantee</font></em>
that your questions here will <b>ever</b> be answered.
<em><font color="#990000">Readers at confidential sites</font></em>
must provide permission to publish. However,
<em><font color="#990000">you can be published anonymously</font></em>
- just let us know!
</p>
<p>TAG <a href="../tag/bios.html">Member bios</a>
| <a href="../../tag/members-faq.html">FAQ</a>
| <a href="../../tag/kb.html">Knowledge base</a></p>
</center>
<!-- ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: -->
<p><hr><p>
<!-- begin 9 -->
<H3 align="left"><img src="../../gx/dennis/qbubble.gif"
height="50" width="60" alt="(?) " border="0"
>random crashes - how to prepare bug report?</H3>
<p><strong>From N.P.Strickland
</strong></p>
<p align="right"><strong>Answered By Thomas Adam, Mike Ellis, Ben Okopnik, Huibert Alblas
</strong></p>
<P><STRONG>
Hi,
</STRONG></P>
<P><STRONG>
My Linux machine is crashing randomly once every couple of days - it
freezes up and will not respond to anything (including ctrl-alt-del,
or ping from another machine) except the on/off switch. The load on
the machine is light, and the work it is doing is not particularly
unusual.
</STRONG></P>
<P><STRONG>
1) Can anyone suggest how I could gather useful information about what is
going on?
</STRONG></P>
<P><STRONG>
I put a line like this in <TT>/etc/syslog.conf</TT>:
</STRONG></P>
<blockquote><code><font color="#000033"><br> *.debug;mail.none;authpriv.none;cron.none /var/log/messages
</font></code></blockquote>
<P><STRONG>
As far as I understand it, this should get all possible debugging
information out of syslogd, although I'm not completely clear whether
any more could be squeezed out of klogd. In any case, I'm not getting
any messages around the time of a crash. I've also turned on all the
logging options that I can find in the processes that I am running,
without any helpful effect.
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Thomas]
Have you added any memory to your machine recently?
This has been known to "crash" machines randomly.
</blockQuote>
<blockQuote>
What programs do you have running by default? Perhaps
you could send me (us) an output of the "pstree"
command so that we can see which process is linked to
what.
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Mike]
Quite right, Thomas. If you have two or more memory modules (DIMMs
probably) in your machine, try removing one of them if you can. If the
fault appears to go away, try putting the module back in and see if
the fault re-appears. If the fault never goes away, put the first
module back, remove another, and try again.
</blockQuote>
<blockQuote>
As you're running a 2.4 kernel, make sure you have plenty of swap. Sadly
the 2.4 kernels aren't as good as the older 2.2 at making maximum use of
swap, with the result that you are now strongly recommended to...
look at <A HREF="../../issue62/lg_tips62.html#tips/12"
>http://www.linuxgazette.com/issue62/lg_tips62.html#tips/12</A> if you
need help. I haven't heard tales of this causing random lock-ups, but you
never know!
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Halb]
Yes, the early 2.4 kernels had 'some' trouble with swap space.
But around the time of 2.4.9 a completely new (built from scratch) VM was
introduced by Andrea Arcangeli, and incorporated by Linus as of 2.4.10.
</blockQuote>
<blockQuote><DL><DT>
You can read a good story on:
<DD><A HREF="http://www.byte.com/documents/s=1436/byt20011024s0002/1029_moshe.html"
>http://www.byte.com/documents/s=1436/byt20011024s0002/1029_moshe.html</A>
</DL></blockQuote>
<blockQuote>
It is an interesting, not too long story.
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [JimD]
</blockQuote>
<blockQuote>
However, if you're using the new tmpfs, it might be wise to
err on the side of generosity when allocating swap space. Using
tmpfs, your <TT>/tmp</TT> (and/or <TT>/var/tmp</TT> or other designated directories)
can be sharing space with your swap (kernel VM paging).
</blockQuote>
<blockQuote>
Still, one or two swap partitions of 127MB should be plenty for
most situations. I still like to keep my swap partitions smaller
than 127MB (the historical limit was 128, but cylinder boundaries
usually round "up"). I also recommend putting one swap partition
on each physical drive (spindle) to allow the kernel to balance
the load across them (small performance gain, but negligible cost
on modern hard disks).
</blockQuote>
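<blockQuote>
To see how much swap is actually configured, and how it is spread
across drives, the kernel's <TT>/proc/swaps</TT> table can be summed up.
The following is only a rough Python sketch: the sample text, the device
names and sizes in it, and the helper names <TT>parse_swaps</TT> and
<TT>total_kib</TT> are made up for illustration. It assumes the usual
five-column layout with sizes in kilobytes.
</blockQuote>

```python
# Made-up sample in the usual /proc/swaps layout (sizes in KiB).
SAMPLE = """\
Filename                                Type            Size    Used    Priority
/dev/hda2                               partition       130040  0       -1
/dev/hdc2                               partition       130040  1024    -2
"""


def parse_swaps(text):
    """Return a list of (device, size_kib, used_kib) tuples."""
    entries = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or unexpected lines
        name, _kind, size, used, _priority = fields
        entries.append((name, int(size), int(used)))
    return entries


def total_kib(entries):
    """Total configured swap in KiB."""
    return sum(size for _name, size, _used in entries)


if __name__ == "__main__":
    # On a live system, read the real table instead of SAMPLE:
    #     with open("/proc/swaps") as f:
    #         text = f.read()
    for name, size, used in parse_swaps(SAMPLE):
        print(f"{name}: {size / 1024:.1f} MiB total, {used / 1024:.1f} MiB used")
    print(f"all swap: {total_kib(parse_swaps(SAMPLE)) / 1024:.1f} MiB")
```

<blockQuote>
One line per physical drive in the output would match the
one-partition-per-spindle layout suggested above.
</blockQuote>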
<P><STRONG>
<IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
2) If I can get any usable information about the problem, does anyone
know where I should send it?
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Thomas]
Here, to both me and the rest of TAG.
</blockQuote>
<P><STRONG>
<IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
If I knew that it was a kernel problem, I'd try the linux-kernel
mailing list. But that looks pretty intimidating, so I'd want to be
sure I knew what I was talking about first! Also, I guess that some
kind of hardware problem is more likely.
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Thomas]
I'm still hedging my bets on memory... if it is a kernel
problem then you could try recompiling it using the
latest stable release.
</blockQuote>
<P><STRONG>
<IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
I'm using <A HREF="http://www.redhat.com/">Red Hat</A> 7.2, which includes the 2.4.7-10 kernel, on a
machine with an Intel Pentium 4 CPU running at 1.5 GHz and 512M of
RAM. Crashes occur even when I am not running X and no users are
logged on. The main process that I am running is the Jakarta Tomcat
web server, which runs a Java servlet, which runs the symbolic
mathematics program Maple as an external process. As far as I can
tell from the logs, when the last crash occurred, there had been no
request to the web server for some time. It's just possible that a
request triggered the crash, which prevented the request from being
logged, but I doubt it.
</STRONG></P>
<P><STRONG>
Thanks in advance for any suggestions.
</STRONG></P>
<P><STRONG>
Neil Strickland
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Thomas]
I might also suggest that you run the "strace"
command on processes you think might be crashing.
That will then tell you where and how... if nothing
else.
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Ben]
I'm pretty much of the same mind as Thomas on this one. Linux is pretty
much bullet-proof; what tends to cause crashes of this sort is hardware -
and that critical path doesn't include too many things, particularly when
the key word is "random". Memory would be the first thing I'd suspect (and
would test by replacement); the hard drive would be the second. I've
<em>heard</em> of wonky motherboards causing problems, but have never experienced
it myself. I've seen a power supply cause funky behavior before - even
though that was on a non-Linux system, it would be much the same - and...
that's pretty much it.
</blockQuote>
<blockQuote>
"strace", in my opinion, is not something you can run on a production
system. It's great for troubleshooting, but running a web server under it?
I just tried running "thttpd" under it, and it took approximately 30
seconds just to connect to the localhost - and about 15 more to cd into a
directory. Not feasible.
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Thomas]
Hum, perhaps I wasn't too clear on that point. What I
meant was that he should run strace on only one
process which he thinks <EM>might</EM> be causing the crash.
Hence the reason why I initially asked for his
"pstree" output.
</blockQuote>
<blockQuote>
But I agree, strace is not that good when trying to
analyse a "labour intensive" program such as a
webserver, but then I fail to see why
one would want to run "strace" on such a program
anyway... after all, <A HREF="http://www.apache.org/">Apache</A> is stable enough
<IMG SRC="../../gx/dennis/smily.gif" ALT=":-)"
height="24" width="20" align="middle">
</blockQuote>
<HR width="10%" align="left"><P><STRONG>
<IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
Thanks again for all your help.
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Mike &amp; Ben]
You're welcome.
</blockQuote>
<P><STRONG><FONT COLOR="#000066"><EM><IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
Memory would be the first thing I'd suspect (and would test by replacement);
</EM></FONT></STRONG></P>
<P><STRONG>
I downloaded memtest86 (from <A HREF="http://www.teresaudio.com/memtest86"
>http://www.teresaudio.com/memtest86</A>) and
ran through its default tests twice (that took about 40 minutes - I
haven't yet tried the additional tests, which are supposed to take
four or five hours, altogether). Nothing came up. Do you think
that's reliable, or would you test by replacement anyway?
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Mike]
The problem may be an intermittent fault: if the tests take 40 minutes and
the machine usually runs for (say) 4 days, you've effectively given it
less than a 1% chance of finding the problem [40/(4*24*60)]. I'd still
seriously consider a test by replacement and/or removal of DIMMs.
</blockQuote>
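<blockQuote>
The arithmetic in brackets above can be checked in a couple of lines.
This is just a worked version of that estimate in Python; the function
name <TT>coverage_fraction</TT> is invented here, and the 4-day figure is
the same "say" value used above, not a measurement.
</blockQuote>

```python
def coverage_fraction(test_minutes, days_between_crashes):
    """Fraction of one crash interval that a test run actually covers."""
    interval_minutes = days_between_crashes * 24 * 60
    return test_minutes / interval_minutes


if __name__ == "__main__":
    # 40 minutes of testing against a 4-day interval: 40 / 5760
    print(f"{coverage_fraction(40, 4):.2%}")
    # A full 24-hour run against the same interval: 1440 / 5760
    print(f"{coverage_fraction(24 * 60, 4):.2%}")
```

<blockQuote>
The first figure comes out at roughly 0.7%, and the second at 25%,
which is one way of seeing why a much longer test run is worth the wait.
</blockQuote>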
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Ben]
My rule of memory testing, for many years now, has been "a minimum of 24
hours - 48 is better - and hit it with freeze spray at the end." For a
system that needs to be up and running, however, "shotgunning" (wholesale
replacement of suspect hardware) is what offers the highest chance of quick
resolution.
</blockQuote>
<P><STRONG><FONT COLOR="#000066"><EM><IMG SRC="../../gx/dennis/qbub.gif" ALT="(?)"
HEIGHT="28" WIDTH="50" BORDER="0"
>
the hard drive would be the second
</EM></FONT></STRONG></P>
<P><STRONG><FONT COLOR="#000066"><EM>
I've seen a power supply cause funky behavior before
</EM></FONT></STRONG></P>
<P><STRONG>
These don't sound like easy things to test
<IMG SRC="../../gx/dennis/unsmily.gif" ALT=":-("
height="24" width="20" align="middle">
. Do you have any
suggestions?
</STRONG></P>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Mike]
They aren't, sadly. Testing by replacement is really the best option for
these sorts of problems, but beware, we had a machine here with a dodgy
PSU recently which cost us a lot more than a new PSU )-: By the time we'd
tracked down the problem we had...
</blockQuote>
<blockQuote><ul>
<LI>three suspect hard-drives
<LI>two suspect 128M DIMMs
<LI>two suspect motherboards
<LI>two suspect PIII processors
<LI>one suspect network card
<LI>one suspect video card
<LI>one suspect CD-ROM drive
<LI>one suspect floppy drive
<LI>one suspect keyboard
<LI>one suspect mouse
<LI>and a partridge in a pear tree
</ul></blockQuote>
<blockQuote>
The whole lot had to be disposed of because we had used the faulty PSU with
them, and the fault was that it generated occasional over-volt spikes
during power-up. These potentially weakened any or all of the other
components in the system rendering them unsuitable for mission-critical
applications (we actually purchased a cheap case, marked all the bits as
suspect and built them into a gash machine for playing with).
</blockQuote>
<blockQuote>
In your case, try cloning the hard-drive and replacing that. You can use
dd to clone the drive - <TT>dd if=/dev/current_hard_disc of=/dev/new_hard_disc bs=4096</TT>
- assuming the hard-drives are the same size. Don't use the
partitions, though - <TT>/dev/hda</TT> and <TT>/dev/hdc</TT> will work, <TT>/dev/hda1</TT> and
<TT>/dev/hdc1</TT> won't, since the partition table and MBR won't be copied. Using
the raw devices will also copy any other partitions if you've got them.
</blockQuote>
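<blockQuote>
For anyone who would like to check the copy afterwards, the same
block-by-block clone can be sketched in Python together with a checksum
comparison. This is not a replacement for dd, only an illustration: the
function names <TT>clone</TT> and <TT>digest</TT> are invented here, the
4096-byte block size simply mirrors the <TT>bs=4096</TT> above, and the
sketch has only been reasoned through for ordinary files, not for real
disks.
</blockQuote>

```python
import hashlib

BLOCK_SIZE = 4096  # mirrors bs=4096 in the dd command


def clone(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst one block at a time; return the bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # end of the source
                break
            dst.write(block)
            copied += len(block)
    return copied


def digest(path, length, block_size=BLOCK_SIZE):
    """SHA-256 of the first `length` bytes of `path`."""
    h = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining:
            block = f.read(min(block_size, remaining))
            if not block:  # file is shorter than `length`
                break
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()
```

<blockQuote>
A hypothetical run would be <TT>n = clone("/dev/hda", "/dev/hdc")</TT>
followed by comparing <TT>digest("/dev/hda", n)</TT> with
<TT>digest("/dev/hdc", n)</TT>, as root and with nothing on either drive
mounted. Limiting the checksum to the first <TT>n</TT> bytes keeps the
comparison meaningful if the new drive turns out to be slightly larger.
</blockQuote>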
<blockQuote>
&lt;Ding/&gt; One bright idea that has just occurred to me - are you using any
external devices? If, for example, you've got an external SCSI scanner
on the same chain as your internal SCSI discs, a dodgy connection or
termination could potentially cause random crashes. It might also be
worthwhile checking any USB or FireWire devices you've got connected. I
doubt serial or parallel devices would cause a problem, but it might be
worth checking just in case. Internal connections are also suspect - a
CD-ROM drive on the same IDE chain as your boot disc might cause
problems: you might even like to remove it completely if you don't use it
often. Any PCI cards are also candidates for suspicion - make sure they're
all plugged in fully.
</blockQuote>
<blockQuote>
Let us know how you get on!
</blockQuote>
<blockQuote>
Cheers,
</blockQuote>
<blockQuote>
Mike.
</blockQuote>
<blockQuote>
<IMG SRC="../../gx/dennis/bbub.gif" ALT="(!)"
HEIGHT="28" WIDTH="50" BORDER="0"
> [Ben]
Unfortunately, all my best suggestions come down to the above two. I used
to look for noise in power supply output with an oscilloscope -
interestingly enough, it was a fairly reliable method of sussing out the
problematic ones - but I suspect that it's not a common skill today.
There are a number of HDD testers out there, all hiding behind the
innocuous guise of disk performance measurement tools... but
Professor Moriarty is not fooled.
<IMG SRC="../../gx/dennis/smily.gif" ALT=":)"
height="24" width="20" align="middle">
</blockQuote>
<blockQuote>
Seriously, if running one of those (e.g., "bonnie++") for a few hours
doesn't make your HDD fall over and lie there twitching, you're probably
all right on that score.
</blockQuote>
<!-- end 9 -->
<P> <hr> </p>
<!-- *** BEGIN copyright *** -->
<H5 align="center">This page edited and maintained by the Editors
of <I>Linux Gazette</I>
<a href="http://www.linuxgazette.com/copying.html"
>Copyright &copy;</a> 2002
<BR>Published in issue 74 of <I>Linux Gazette</I> January 2002</H5>
<H6 ALIGN="center">HTML script maintained by
<A HREF="mailto:star@starshine.org">Heather Stern</a> of
Starshine Technical Services,
<A HREF="http://www.starshine.org/">http://www.starshine.org/</A>
</H6>
<!-- *** END copyright *** -->
<P> <hr>
<!--startcut ======================================================= -->
<CENTER>
<!-- *** BEGIN navbar *** -->
<!-- *** END navbar *** -->
</CENTER>
</p>
<!--endcut ========================================================= -->
<!--startcut ======================================================= -->
</BODY></HTML>
<!--endcut ========================================================= -->