old-www/LDP/LG/issue84/okopnik.html

322 lines
19 KiB
HTML

<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Perl One-Liner of the Month: The Adventure of the Misnamed Files LG #84</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="hawk.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue84/okopnik.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="orr.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<!--endcut ============================================================-->
<TABLE BORDER><TR><TD WIDTH="200">
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
WIDTH="200" HEIGHT="41" border="0"></A>
<BR CLEAR="all">
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
</TD><TD WIDTH="380">
<CENTER>
<BIG><BIG><STRONG><FONT COLOR="maroon">Perl One-Liner of the Month: The Adventure of the Misnamed Files</FONT></STRONG></BIG></BIG>
<BR>
<STRONG>By <A HREF="../authors/okopnik.html">Ben Okopnik</A></STRONG>
</CENTER>
</TD></TR>
</TABLE>
<P>
<!-- END header -->
<p><i>A REPORTER'S NOTE</i> </p>
<p><i>Recently, I became acquainted with a set of documents which journal
the adventures and experiences of none other than Woomert Foonly, the world-famous
but strangely reticent Hard-Nosed Computer Detective. To the best of my knowledge,
the information they contain is authentic. My anonymous correspondent - whom
I suspect to be Frink Ooblick, the great man's sidekick and confidant, although
my sworn promise forbids me to investigate - had emailed me a short note
which engaged my interest, then sent me an encrypted file which took several
nights of concerted hacking effort to decrypt (he seems to think that this
indicates a sense of humor on his part). This has become the pattern: once
in a while, I'll receive a file from a sender whose name matches a complex
regular expression (the procmail recipe for this has grown to several pages,
and now seems to be mutating on its own). I then have to drop whatever I'm
doing and begin hacking at high speed - the encryption method is, in some
manner which I've been unable to puzzle out, time-dependent, and becomes
much more difficult to break in a few short hours.</i> </p>
<p><i>In our early exchanges, I had been granted permission to publish this
material. My correspondent has stated that he chooses to keep his identity
private, and is satisfied to receive no credit for his work. I present it
here, though I can't claim authorship, in the hope of casting some light
on the work of that great detective whose exploits had until now been shrouded
in deepest mystery.</i> </p>
<p><b><i>Ben Okopnik</i></b> <br>
<i>On board S/V "Ulysses", October 10th, 2002</i> </p>
<p> </p>
<hr width="100%">
<p>The filesystem was quiet and dark; all the disk writes had been synched,
the hard drive had spun down, and the CPU had shifted into low-power mode.
Even the usually exuberant Frink seemed subdued on this occasion, and was
quietly double-checking their remote-system passwords and permissions, a
necessary precaution before they could leave the comfort of their '/home'
in the armored SSH transporter. </p>
<p>Woomert, however, felt calm and ready for action. This was where he preferred
to operate, in the twilight zone between power modes; in these conditions,
even the dreaded Heisenbugs <a href="#1">[1]</a> - though his current assignment
did not involve anything nearly that dangerous - would be somnolent, and
could often be trapped unaware. </p>
<p>His client, severely distraught and sobbing into a lace handkerchief, had
confessed that her file naming scheme had gone completely out of control -
wild strings had invaded her precious naming convention, formerly so full
of sense and intuitively obvious to even the casual user. The employee responsible
for this outrage had been severely LARTed <a href="#2">[2]</a>, but the police
detectives had simply shrugged in bafflement, and all other avenues pointed
to grim scenarios of manually renaming hundreds, if not thousands, of files.
True, the files contained the preferred names enclosed in the HTML <tt>'&lt;title&gt;'</tt>
tags - but the amount of work necessary to do it by hand was staggering.
Woomert was her last hope. </p>
<center>
<p>&nbsp;*&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp;
*&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *</p>
</center>
<p>Moving quietly, Woomert approached the inode marked "<tt>/var/apache/htdocs</tt>".
Finding it had taken a bit of top-down traversal, but his familiarity with
the <tt>File::Find</tt> module <a href="#3">[3]</a> had made short work of
that; the client had sobbed out that the rogue file names matched the <tt>/^[A-Z][0-9]+\.html?$/</tt>
pattern <a href="#4">[4]</a> - in other words, they all started with a capital
letter followed by one or more digits, and ended with a ".htm" or a ".html"
extension. Given that hint, it had taken only seconds to locate the rogues'
lair. </p>
<p>As he entered, the disgraceful state of affairs became immediately evident:
<a href="misc/okopnik/ROTM_badboy.html">disreputable-looking</a> filenames hung around
on every corner, misbehaving and intimidating the passerby; others, dressed
in nothing more than transparent symlinks, leaned out of xterm windows and
lewdly propositioned the passing data. Something had to be done, and soon -
the newer filenames that came into the area were almost immediately
corrupted by the ubiquitous bad examples. </p>
<p>&nbsp;- "Sheesh, Woomert," whispered Frink, "this looks bad. What are you
going to do? There's <i>thousands</i> of them!" </p>
<p>&nbsp;- "Don't worry, kid." Woomert calmly ambled up to the command line
interface, his hat pulled down low against the headlight glare of the heavy
HTTP traffic. "I've just downloaded the latest version of Perl. They don't
stand a chance." Pulling on his typing gloves, he rapidly executed a one-line
command string. </p>
<pre><hr width="100%">perl -wlne'END{print$n}eof&amp;&amp;$n++;/&lt;title&gt;([^&lt;]+)/i&amp;&amp;$n--' *<br><hr
width="100%"></pre>
The results were astonishing: even as the monitor displayed a large '0',
every one of the miscreants suddenly stopped whatever they were doing and
whirled around to stare at the two of them. They could obviously sense the
sudden danger represented by these two strangers in trenchcoats; the largest
of them, an ugly character with "<tt>X6664934755666.htm</tt>" tattooed on
his chest immediately headed in their direction while reaching into his pocket.
His intentions were clearly not related to presenting Woomert and Frink with
flowers and the private DSA key to his domain.
<p>&nbsp;- "Quick, Woomert," cried Frink, "do something! It looks like he's
going to throw a Nimda, or even a Code Red!" </p>
<p>Woomert glanced over at his nervous sidekick. </p>
<p>&nbsp;- "I told you, kid, <b>relax</b>. Number one, we've got Perl..."
His lightning-fast fingers drummed out another virtuoso solo on the console:
</p>
<pre><hr width="100%">perl -wlne'/title&gt;([^&lt;]+)/i&amp;&amp;rename<tt>$ARGV</tt>,"<tt>$1.html</tt>"' *<br><hr
width="100%"></pre>
&nbsp;- "...and number two," as the wild scene around them faded, only to
reform as a neat, clean neighborhood with neatly arranged files proudly wearing
names like <br>
'<tt>History 227, Lecture 35: Origins of the Roman Revolution.html</tt>',
"we're running Linux. Viruses are pretty much someone else's problem."
<center>
<p>&nbsp;*&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp;
*&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *&nbsp;&nbsp; *</p>
</center>
<p>Later that evening, after they had collected their well-earned fee from
the grateful client and were relaxing with a fine high-altitude Lee Shan tea
that Woomert had brought back from his recent telnet to the Far East, Frink
finally ventured to ask the questions that had been on his mind ever since
that fateful showdown. </p>
<p>&nbsp;- "Woomert, I saw you fire off those command lines, but I couldn't
follow what you were doing. I could see the regular expression, and even noticed
the implicit loop, but what was all the rest of it?" </p>
<p>&nbsp;- "Elementary, my dear Frink. If you'll recall the first line..."
</p>
<pre><hr width="100%">perl -wlne'END{print<tt>$n</tt>}eof&amp;&amp;<tt>$n</tt>++;/&lt;title&gt;([^&lt;]+)/i&amp;&amp;<tt>$n</tt>--' *<br><hr
width="100%"></pre>
"...you'll see that I invoked Perl with the following command switches:"
<pre>-w Enable warnings</pre>
<pre>-l Enable line-end processing</pre>
<pre>-n Implicit non-printing loop</pre>
<pre>-e Execute the following commands</pre>
"By enabling warnings, I had told Perl to check my syntax, something that
should be done every time you run a script. I then specified line-end processing,
in effect adding a newline to each printed string. Then, I told it to loop
through the contents of each file, and run the string in the single quotes
as a script."
<p>"As you had so astutely noted, I had indeed set up a loop. What you may
have missed, however, was that there were actually <i>two</i> concurrent loops:
I had specified a list of files via the shell filespec of '*', and Perl would
read them in, one at a time. It's also important to note that the syntax
of the regex <i>inside</i> the quotes which enclose the script looks similar
to but is very different from the regex <i>outside</i> - the former is parsed
by Perl, using its internal regex engine; the latter is handled by the shell,
which uses a far simpler system called 'globbing'." </p>
<p>&nbsp;- "Wonderful!" Frink was as excited as a young pup on his first hunt.
"And what did you do in the script itself?" </p>
<p>"First, I wanted to double-check that my regex actually matched what I
thought it should. The easiest way was to count the number of files - I incremented
'<tt>$n</tt>' every time the 'eof' (end-of-file) function returned 'true'
- and subtract the number of matches. If the total had been greater than
0, that would have indicated a mismatch somewhere. Fortunately..." </p>
<p>&nbsp;- "Yes, I remember - it printed a zero." </p>
<p>"Which meant that everything was correct. The <tt>'END{print$n}'</tt> statement
was executed at the end of the run - note that this is independent of its
position in the script, although most people put it at the end. I saved a
keystroke by putting it first - a statement that follows a block, as in the
case of that <tt>'eof&amp;&amp;$n++'</tt>, does not require a semicolon.
In Perl Golf <a href="#5">[5]</a>, every stroke matters!" </p>
<p>"Next, let us examine the regex that I'd used; that's the heart of this
script." </p>
<p><tt>/&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Begin the regex</tt>
<br>
<tt>title&gt;&nbsp;&nbsp;&nbsp; # Match this literal string</tt> <br>
<tt>([^&lt;]+)&nbsp;&nbsp; # Capture one or more characters not matching '&lt;'
in $1</tt> <br>
<tt>/i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # End regex, use the "ignore
case" modifier</tt> </p>
<p>The '/'s enclose the regex that we're trying to match; that's standard
Perl syntax, which you seem to have recognized. See that '+', there? I'm taking
advantage of Perl's "greediness" in regex interpretation: '+' means 'one
or more of the preceding character or group', but with the implication of
'match as many as possible'. It will grab everything until a literal '&lt;'
(beginning of an HTML&nbsp;tag) or the end of the current line. So, every
time the pattern matched the line, I updated '<tt>$n</tt>' by using the logical
'and' coupled with the decrement operator." </p>
<p>"As an overview, here is an equivalent script that shows all of the above
in a more readable fashion:" </p>
<pre><hr width="100%">#!/usr/bin/perl -w</pre>
<pre>while ( &lt;&gt; ){ # Equivalent to "-n"</pre>
<pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <tt>$n</tt>++ if eof;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <tt>$n</tt>-- if /&lt;title&gt;([^&lt;]+)/i;</pre>
<pre>}</pre>
<pre>print "<tt>$n</tt>\n" # The "\n" would have been added by "-l"<br><hr
width="100%"></pre>
"Obviously, this script would be invoked as <tt>'perl &lt;scriptname&gt;
*'</tt>, or simply <tt>'./scriptname *'</tt> if it had been made executable."
<p>"As a final item of note, look at the 'active' statement in the script,
the only one that makes any changes or creates any output. It is simply 'print'.
In fact, the whole line was a test - I wanted to make certain that everything
worked properly before committing actual changes to disk, something I believe
to be a wise policy. I could see, from the ugly looks of that crowd, that
I would not get two chances at the actual renaming; at least one of them
had an <tt>'rm -rf /'</tt>, and would not have hesitated to use it." </p>
<p>&nbsp;- "Heavens, Woomert!" Frink's shock was evident in his features.
"You must be as brave as a lion, to face something like that." </p>
<p>The famous detective glanced at the shiny stainless-steel and Kevlar "<tt>chroot</tt>"
call that he had extracted from his pocket and smiled. </p>
<p>&nbsp;- "Well, I had a few tricks held in reserve, anyway. On to the actual
renaming, eh? If you remember the expression itself..." </p>
<pre><hr width="100%">perl -wlne'/title&gt;([^&lt;]+)/i&amp;&amp;rename<tt>$ARGV</tt>,"<tt>$1.html</tt>"' *<br><hr
width="100%"></pre>
"...you'll note that much of it resembles the first one; however, there are
a few novel features. First, I still used "-l" in the option set, but now
the reason was somewhat different: since the captured strings in '<tt>$1</tt>'
were likely to contain a newline ('\n'), we needed a way to get rid of it.
Perl is very clever about removing leading and trailing whitespace and handling
odd characters when using 'rename', but <tt>'filename\n.html'</tt> would
cause problems. Fortunately, '-n' also 'auto-chomps' the incoming lines -
meaning that it will remove the newline character as the line is read in."
<p>"Next, <tt>'$ARGV'</tt> is a Perl variable containing the name of the file
that is currently being read. Since '<tt>$1</tt>' held the result of our
first capture within the regex (the contents within the first set of parentheses
are stored in '<tt>$1</tt>', the second in '<tt>$2</tt>', and so on), renaming
was an easy task. It would also let us regularize the extensions - 'html'
for all of them." </p>
<p>"If I were to spell out the above line in a more conservative - and perhaps
more readable - fashion, it would look like this: </p>
<pre><hr width="100%">#!/usr/bin/perl -w</pre>
<pre>while ( &lt;&gt; ){</pre>
<pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; chomp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Equivalent to "-l"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( /title&gt;([^&lt;]+)/i ){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; rename <tt>$ARGV</tt>, "<tt>$1.html</tt>"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</pre>
<pre>}<br><hr width="100%"></pre>
&nbsp;- "Since they were bearing down on us, though..."
<p>&nbsp;- "Precisely; those extra characters could have made the difference
between life and death. I must say that I didn't expect them to react so fiercely
to a simple match-and-print, but they say that filesystems are getting smarter
and smarter - according to a Western guru <a href="#6">[6]</a> with whom
I once held converse, there were at least <i>five</i> journaling filesystems
available for Linux, and I've heard of many FS-related kernel patches since.
Fortunately, we were more than equal to the task." </p>
<p>"Now, if you'll be good enough to pass me that Rotterdam redfish paste
and that San Francisco sourdough, I'll tell you of the next case that's coming
up. Pay attention, young Frink - this promises to be a good one..." <br>
&nbsp; <br>
&nbsp; </p>
<p> </p>
<hr width="100%"><a name="1"></a>[1] (From the Jargon File:) A bug that disappears
or alters its behavior when one attempts to probe or isolate it.
<p><a name="2"></a>[2] (From the Jargon File:) Luser Attitude Readjustment
Tool (properly applied upside the head of the appropriate clueless person.)
</p>
<p><a name="3"></a>[3] See "perldoc File::Find". </p>
<p><a name="4"></a>[4] Matching patterns in Perl consist of so-called "regular
expressions". For more information on REs, see "perldoc perlre". </p>
<p><a name="5"></a>[5] Perl Golf is a highly twisted form of Perl programming
in which brevity is king, and readability is gleefully thrown out of the nearest
window. Woomert is an avid golfer who often produces unreadable (but effective)
gibberish in Perl; one-liners (lines of Perl less than 80 characters in length)
are often examples of Perl Golf. <b><i>NOTE</i></b>: This game is played
to impress other Perl hackers, and to produce short but effective command-line
tools. Using this in code that others are supposed to work with or code that
you write for pay is truly bad form, and can (<b>should!</b>) come back to
haunt you. </p>
<p><a name="6"></a>[6] Per Jim Dennis, 2001. There may be even more today...
</p>
<!-- *** BEGIN copyright *** -->
<hr>
<CENTER><SMALL><STRONG>
Copyright &copy; 2002, Ben Okopnik.
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 84 of <i>Linux Gazette</i>, November 2002
</STRONG></SMALL></CENTER>
<!-- *** END copyright *** -->
<HR>
<!--startcut ==========================================================-->
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="hawk.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue84/okopnik.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="orr.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->