<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Perl One-Liner of the Month: The Adventure of the Misnamed Files LG #84</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="hawk.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue84/okopnik.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg" WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="orr.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->

<!--endcut ============================================================-->

<TABLE BORDER><TR><TD WIDTH="200">
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
WIDTH="200" HEIGHT="41" border="0"></A>
<BR CLEAR="all">
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
</TD><TD WIDTH="380">

<CENTER>
<BIG><BIG><STRONG><FONT COLOR="maroon">Perl One-Liner of the Month: The Adventure of the Misnamed Files</FONT></STRONG></BIG></BIG>
<BR>
<STRONG>By <A HREF="../authors/okopnik.html">Ben Okopnik</A></STRONG>
</CENTER>

</TD></TR>
</TABLE>
<P>

<!-- END header -->

<p><i>A REPORTER'S NOTE</i> </p>
<p><i>Recently, I became acquainted with a set of documents which journal
the adventures and experiences of none other than Woomert Foonly, the world-famous
but strangely reticent Hard-Nosed Computer Detective. To the best of my knowledge,
the information they contain is authentic. My anonymous correspondent - whom
I suspect to be Frink Ooblick, the great man's sidekick and confidant, although
my sworn promise forbids me to investigate - had emailed me a short note
which engaged my interest, then sent me an encrypted file which took several
nights of concerted hacking effort to decrypt (he seems to think that this
indicates a sense of humor on his part). This has become the pattern: once
in a while, I'll receive a file from a sender whose name matches a complex
regular expression (the procmail recipe for this has grown to several pages,
and now seems to be mutating on its own). I then have to drop whatever I'm
doing and begin hacking at high speed - the encryption method is, in some
manner which I've been unable to puzzle out, time-dependent, and becomes
much more difficult to break in a few short hours.</i> </p>
<p><i>In our early exchanges, I had been granted permission to publish this
material. My correspondent has stated that he chooses to keep his identity
private, and is satisfied to receive no credit for his work. I present it
here, though I can't claim authorship, in the hope of casting some light
on the work of that great detective whose exploits had until now been shrouded
in deepest mystery.</i> </p>
<p><b><i>Ben Okopnik</i></b> <br>
<i>On board S/V "Ulysses", October 10th, 2002</i> </p>
<p> </p>
<hr width="100%">
<p>The filesystem was quiet and dark; all the disk writes had been synched,
the hard drive had spun down, and the CPU had shifted into low-power mode.
Even the usually exuberant Frink seemed subdued on this occasion, and was
quietly double-checking their remote-system passwords and permissions, a
necessary precaution before they could leave the comfort of their '/home'
in the armored SSH transporter. </p>
<p>Woomert, however, felt calm and ready for action. This was where he preferred
to operate, in the twilight zone between power modes; in these conditions,
even the dreaded Heisenbugs <a href="#1">[1]</a> - though his current assignment
did not involve anything nearly that dangerous - would be somnolent, and
could often be trapped unaware. </p>
<p>His client, severely distraught and sobbing into a lace handkerchief, had
confessed that her file naming scheme had gone completely out of control -
wild strings had invaded her precious naming convention, formerly so full
of sense and intuitively obvious to even the casual user. The employee responsible
for this outrage had been severely LARTed <a href="#2">[2]</a>, but the police
detectives had simply shrugged in bafflement, and all other avenues pointed
to grim scenarios of manually renaming hundreds, if not thousands, of files.
True, the files contained the preferred names enclosed in the HTML <tt>'&lt;title&gt;'</tt>
tags - but the amount of work necessary to do it by hand was staggering.
Woomert was her last hope. </p>
<center>
<p> * * * * *
* * * * *</p>
</center>

<p>Moving quietly, Woomert approached the inode marked "<tt>/var/apache/htdocs</tt>".
Finding it had taken a bit of top-down traversal, but his familiarity with
the <tt>File::Find</tt> module <a href="#3">[3]</a> had made short work of
that; the client had sobbed out that the rogue file names matched the <tt>/^[A-Z][0-9]+\.html?$/</tt>
pattern <a href="#4">[4]</a> - in other words, they all started with a capital
letter followed by one or more digits, and ended with a ".htm" or a ".html"
extension. Given that hint, it had taken only seconds to locate the rogues'
lair. </p>
<p>As he entered, the disgraceful state of affairs became immediately evident:
<a href="misc/okopnik/ROTM_badboy.html">disreputable-looking</a> filenames hung around
on every corner, misbehaving and intimidating the passersby; others, dressed
in nothing more than transparent symlinks, leaned out of xterm windows and
lewdly propositioned the passing data. Something had to be done, and soon -
the newer filenames that came into the area were almost immediately
corrupted by the ubiquitous bad examples. </p>
<p> - "Sheesh, Woomert," whispered Frink, "this looks bad. What are you
going to do? There's <i>thousands</i> of them!" </p>
<p> - "Don't worry, kid." Woomert calmly ambled up to the command line
interface, his hat pulled down low against the headlight glare of the heavy
HTTP traffic. "I've just downloaded the latest version of Perl. They don't
stand a chance." Pulling on his typing gloves, he rapidly executed a one-line
command string. </p>
<pre><hr width="100%">perl -wlne'END{print$n}eof&&$n++;/&lt;title&gt;([^<]+)/i&&$n--' *<br><hr width="100%"></pre>
The results were astonishing: even as the monitor displayed a large '0',
every one of the miscreants suddenly stopped whatever they were doing and
whirled around to stare at the two of them. They could obviously sense the
sudden danger represented by these two strangers in trenchcoats; the largest
of them, an ugly character with "<tt>X6664934755666.htm</tt>" tattooed on
his chest, immediately headed in their direction while reaching into his pocket.
His intentions were clearly not related to presenting Woomert and Frink with
flowers and the private DSA key to his domain.
<p> - "Quick, Woomert," cried Frink, "do something! It looks like he's
going to throw a Nimda, or even a Code Red!" </p>
<p>Woomert glanced over at his nervous sidekick. </p>
<p> - "I told you, kid, <b>relax</b>. Number one, we've got Perl..."
His lightning-fast fingers drummed out another virtuoso solo on the console:
</p>
<pre><hr width="100%">perl -wlne'/title>([^<]+)/i&&rename<tt>$ARGV</tt>,"<tt>$1.html</tt>"' *<br><hr width="100%"></pre>
 - "...and number two," as the wild scene around them faded, only to
reform as a neat, clean neighborhood with neatly arranged files proudly wearing
names like <br>
'<tt>History 227, Lecture 35: Origins of the Roman Revolution.html</tt>',
"we're running Linux. Viruses are pretty much someone else's problem."
<center>
<p> * * * * *
* * * * *</p>
</center>

<p>Later that evening, after they had collected their well-earned fee from
the grateful client and were relaxing with a fine high-altitude Lee Shan tea
that Woomert had brought back from his recent telnet to the Far East, Frink
finally ventured to ask the questions that had been on his mind ever since
that fateful showdown. </p>
<p> - "Woomert, I saw you fire off those command lines, but I couldn't
follow what you were doing. I could see the regular expression, and even noticed
the implicit loop, but what was all the rest of it?" </p>
<p> - "Elementary, my dear Frink. If you'll recall the first line..."
</p>
<pre><hr width="100%">perl -wlne'END{print<tt>$n</tt>}eof&&<tt>$n</tt>++;/&lt;title&gt;([^<]+)/i&&<tt>$n</tt>--' *<br><hr width="100%"></pre>
"...you'll see that I invoked Perl with the following command switches:"
<pre>-w Enable warnings</pre>

<pre>-l Enable line-end processing</pre>

<pre>-n Implicit non-printing loop</pre>

<pre>-e Execute the following commands</pre>
"By enabling warnings, I had told Perl to alert me to questionable constructs,
such as variables used only once or values used before being defined; that
should be done every time you run a script. I then specified line-end processing,
which adds a newline to each printed string and, in combination with '-n',
strips it from each line as it is read in. Then, I told it to loop
through the contents of each file, line by line, and run the string in the
single quotes as a script on each line."
<p>"As you had so astutely noted, I had indeed set up a loop. What you may
have missed, however, was that there were actually <i>two</i> nested loops:
I had specified a list of files via the shell filespec of '*', and Perl would
read them in, one at a time, going through each one line by line. It's also
important to note that the syntax of the regex <i>inside</i> the quotes which
enclose the script looks similar to, but is very different from, the pattern
<i>outside</i> - the former is parsed
by Perl, using its internal regex engine; the latter is handled by the shell,
which uses a far simpler system called 'globbing'." </p>
<p> - "Wonderful!" Frink was as excited as a young pup on his first hunt.
"And what did you do in the script itself?" </p>
<p>"First, I wanted to double-check that my regex actually matched what I
thought it should. The easiest way was to count the number of files - I incremented
'<tt>$n</tt>' every time the 'eof' (end-of-file) function returned 'true'
- and subtract the number of matches. If the total had been anything other than
0, that would have indicated a mismatch somewhere. Fortunately..." </p>
<p> - "Yes, I remember - it printed a zero." </p>
<p>"Which meant that everything was correct. The <tt>'END{print$n}'</tt> statement
was executed at the end of the run - note that this is independent of its
position in the script, although most people put it at the end. I saved a
keystroke by putting it first - a block needs no semicolon after its closing
brace, so the <tt>'eof&&$n++'</tt> that follows it can start right away.
In Perl Golf <a href="#5">[5]</a>, every stroke matters!" </p>
<p>"Next, let us examine the regex that I'd used; that's the heart of this
script." </p>
<p><tt>/ # Begin the regex</tt>
<br>
<tt>title> # Match this literal string</tt> <br>
<tt>([^<]+) # Capture one or more characters not matching '<' in $1</tt> <br>
<tt>/i # End regex, use the "ignore case" modifier</tt> </p>
<p>The '/'s enclose the regex that we're trying to match; that's standard
Perl syntax, which you seem to have recognized. See that '+', there? I'm taking
advantage of Perl's "greediness" in regex interpretation: '+' means 'one
or more of the preceding character or group', but with the implication of
'match as many as possible'. It will grab everything until a literal '<'
(beginning of an HTML tag) or the end of the current line. So, every
time the pattern matched the line, I updated '<tt>$n</tt>' by using the logical
'and' coupled with the decrement operator." </p>
<p>"As an overview, here is an equivalent script that shows all of the above
in a more readable fashion:" </p>
<pre><hr width="100%">#!/usr/bin/perl -w</pre>

<pre>while ( <> ){ # Equivalent to "-n"</pre>

<pre>    <tt>$n</tt>++ if eof;<br>    <tt>$n</tt>-- if /&lt;title&gt;([^<]+)/i;</pre>

<pre>}</pre>

<pre>print "<tt>$n</tt>\n" # The "\n" would have been added by "-l"<br><hr width="100%"></pre>
"Obviously, this script would be invoked as <tt>'perl &lt;scriptname&gt;
*'</tt>, or simply <tt>'./scriptname *'</tt> if it had been made executable."
<p>"As a final item of note, look at the 'active' statement in the script,
the only one that makes any changes or creates any output. It is simply 'print'.
In fact, the whole line was a test - I wanted to make certain that everything
worked properly before committing actual changes to disk, something I believe
to be a wise policy. I could see, from the ugly looks of that crowd, that
I would not get two chances at the actual renaming; at least one of them
had an <tt>'rm -rf /'</tt>, and would not have hesitated to use it." </p>
<p> - "Heavens, Woomert!" Frink's shock was evident in his features.
"You must be as brave as a lion, to face something like that." </p>
<p>The famous detective glanced at the shiny stainless-steel and Kevlar "<tt>chroot</tt>"
call that he had extracted from his pocket and smiled. </p>
<p> - "Well, I had a few tricks held in reserve, anyway. On to the actual
renaming, eh? If you remember the expression itself..." </p>
<pre><hr width="100%">perl -wlne'/title>([^<]+)/i&&rename<tt>$ARGV</tt>,"<tt>$1.html</tt>"' *<br><hr width="100%"></pre>
"...you'll note that much of it resembles the first one; however, there are
a few novel features. First, I still used "-l" in the option set, but now
the reason was somewhat different: since the captured strings in '<tt>$1</tt>'
were likely to contain a newline ('\n'), we needed a way to get rid of it.
Perl's 'rename' will happily accept spaces and other odd characters in a
file name, but <tt>'filename\n.html'</tt> would
cause problems. Fortunately, '-l', when used with '-n', also 'auto-chomps' the
incoming lines - meaning that it will remove the newline character as the line is read in."
<p>"Next, <tt>'$ARGV'</tt> is a Perl variable containing the name of the file
that is currently being read. Since '<tt>$1</tt>' held the result of our
first capture within the regex (the contents within the first set of parentheses
are stored in '<tt>$1</tt>', the second in '<tt>$2</tt>', and so on), renaming
was an easy task. It would also let us regularize the extensions - 'html'
for all of them." </p>
<p>"If I were to spell out the above line in a more conservative - and perhaps
more readable - fashion, it would look like this:" </p>
<pre><hr width="100%">#!/usr/bin/perl -w</pre>

<pre>while ( <> ){</pre>

<pre>    chomp; # Equivalent to "-l"<br>    if ( /title>([^<]+)/i ){<br>        rename <tt>$ARGV</tt>, "<tt>$1.html</tt>"<br>    }</pre>

<pre>}<br><hr width="100%"></pre>
 - "Since they were bearing down on us, though..."
<p> - "Precisely; those extra characters could have made the difference
between life and death. I must say that I didn't expect them to react so fiercely
to a simple match-and-print, but they say that filesystems are getting smarter
and smarter - according to a Western guru <a href="#6">[6]</a> with whom
I once held converse, there were at least <i>five</i> journaling filesystems
available for Linux, and I've heard of many FS-related kernel patches since.
Fortunately, we were more than equal to the task." </p>
<p>"Now, if you'll be good enough to pass me that Rotterdam redfish paste
and that San Francisco sourdough, I'll tell you of the next case that's coming
up. Pay attention, young Frink - this promises to be a good one..." <br>
<br>
</p>
<p> </p>
<hr width="100%"><a name="1"></a>[1] (From the Jargon File:) A bug that disappears
or alters its behavior when one attempts to probe or isolate it.
<p><a name="2"></a>[2] (From the Jargon File:) Luser Attitude Readjustment
Tool (properly applied upside the head of the appropriate clueless person.)
</p>
<p><a name="3"></a>[3] See "perldoc File::Find". </p>
<p><a name="4"></a>[4] Matching patterns in Perl consist of so-called "regular
expressions". For more information on REs, see "perldoc perlre". </p>
<p><a name="5"></a>[5] Perl Golf is a highly twisted form of Perl programming
in which brevity is king, and readability is gleefully thrown out of the nearest
window. Woomert is an avid golfer who often produces unreadable (but effective)
gibberish in Perl; one-liners (lines of Perl fewer than 80 characters in length)
are often examples of Perl Golf. <b><i>NOTE</i></b>: This game is played
to impress other Perl hackers, and to produce short but effective command-line
tools. Using this in code that others are supposed to work with or code that
you write for pay is truly bad form, and can (<b>should!</b>) come back to
haunt you. </p>
<p><a name="6"></a>[6] Per Jim Dennis, 2001. There may be even more today...
</p>

<!-- *** BEGIN copyright *** -->
<hr>
<CENTER><SMALL><STRONG>
Copyright © 2002, Ben Okopnik.
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 84 of <i>Linux Gazette</i>, November 2002
</STRONG></SMALL></CENTER>
<!-- *** END copyright *** -->
<HR>

<!--startcut ==========================================================-->
</BODY></HTML>
<!--endcut ============================================================-->