357 lines
20 KiB
HTML
357 lines
20 KiB
HTML
<!--startcut ==============================================-->
|
|
<!-- *** BEGIN HTML header *** -->
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML><HEAD>
|
|
<title>Learning Perl, part 2 LG #64</title>
|
|
</HEAD>
|
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
|
ALINK="#FF0000">
|
|
<!-- *** END HTML header *** -->
|
|
|
|
<CENTER>
|
|
<A HREF="http://www.linuxgazette.com/">
|
|
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
|
|
WIDTH="600" HEIGHT="124" border="0"></H1></A>
|
|
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="nielsen.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue64/okopnik.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="sipos.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
<P>
|
|
</CENTER>
|
|
|
|
<!--endcut ============================================================-->
|
|
|
|
<H4 ALIGN="center">
|
|
"Linux Gazette...<I>making Linux just a little more fun!</I>"
|
|
</H4>
|
|
|
|
<P> <HR> <P>
|
|
<!--===================================================================-->
|
|
|
|
<center>
|
|
<H1><font color="maroon">Learning Perl, part 2</font></H1>
|
|
<H4>By <a href="mailto:ben-fuzzybear@yahoo.com">Ben Okopnik</a></H4>
|
|
</center>
|
|
<P> <HR> <P>
|
|
|
|
<!-- END header -->
|
|
|
|
|
|
|
|
|
|
<BLOCKQUOTE>
|
|
<tt>"I realized at that point that there was a huge ecological niche
|
|
between the C language and Unix shells. C was good for manipulating complex
|
|
things - you can call it 'manipulexity.' And the shells were good
|
|
at whipping up things - what I call 'whipupitude.' But there was this big
|
|
blank area where neither C nor shell were good, and that's where I aimed
|
|
Perl."</tt>
|
|
<br><tt> -- Larry Wall, author of Perl</tt>
|
|
</BLOCKQUOTE>
|
|
|
|
<p><b>Overview</b>
|
|
<p>In the first part, we talked about some basics and general issues in
|
|
Perl - writing a script, hash-bangs, style - as well as a number of specifics,
|
|
such as scalars, arrays, hashes, operators, and quoting methods. This month,
|
|
we'll take a look at the intrinsic Perl tools that make it so easy to use
|
|
from the command line, as well as their equivalents in scripts. We'll also
|
|
go a little deeper into quoting methods, and get a bit of a start on regexes
|
|
(regular expressions, or REs) - one of the most powerful tools in Perl,
|
|
and one that deserves an entire book all its own. <a href="#1">[1]</a>
|
|
<br>
|
|
<br>
|
|
<p><b>Quote Mechanisms</b>
|
|
<p>Most of you will be familiar with the standard quoting mechanisms in
|
|
Unix: the single and the double quote, which I'd already mentioned in my
|
|
previous article, have much the same functionality in Perl as they do in
|
|
the shell. Sometimes, though, escaping all the in-line metacharacters can
|
|
be a bit painful. Imagine trying to print a string like this:
|
|
<p><tt>``/// Don't say "shan't," "can't," or "won't." ///''</tt>
|
|
<p>Good grief! What can we do with a mess like that?
|
|
<p>Well, we could put in a whole bunch of escapes ("\"), but that would
|
|
be a pain - as well as a case of the LTS ("Leaning Toothpick Syndrome"):
|
|
<p><tt>print '\`\`\/\/\/ Don\'t...</tt>
|
|
<p><shudder> Obviously not a good answer. For times like these, Perl
|
|
provides alternate quoting mechanisms:
|
|
<p><tt>q// # Single quotes</tt>
|
|
<br><tt>qq// # Double quotes</tt>
|
|
<br><tt>qx// # Back quotes, for shell
|
|
execution</tt>
|
|
<br><tt>qw// # Word list - useful for
|
|
populating arrays</tt>
|
|
<p>Note also that the delimiter does not have to be '/', but can be any
|
|
character. Now our job becomes a bit easier:
|
|
<p><tt>print q-``/// Don't say "shan't," "can't," or "won't." ///''-;</tt>
|
|
<p>Simple, eh? By the way, this is something you would use only inside
|
|
a script; the shell interpretation mechanism would make a horrendous mess
|
|
of this if you tried it from the command line, especially things like back
|
|
quotes and slashes.
|
|
<br>
|
|
<br>
|
|
<p><b>Perl Invocation</b>
|
|
<p>"Hear my plea, O Perl of Great Wisdom!" Oh, never mind; I think that
|
|
was standard in Perl3, and is now deprecated... :)
|
|
<p>The most commonly-used switch in invoking Perl, if you're running it
|
|
from the command line, is '-e'; this one tells Perl to execute whatever
|
|
comes immediately after it. In fact, '-e' must be the last switch used
|
|
on the command line because <i>everything</i> after it is considered to
|
|
be part of the script!
|
|
<p><tt>perl -we 'print "The Gods send thread for the Web begun.\n"'</tt>
|
|
<p>"-w" is the "warn" switch that I mentioned the last time. It tells you
|
|
about all the non-fatal errors in your code, including variables that you
|
|
set but didn't use (invaluable for finding mistyped variable names) as
|
|
well as many, many other things. You should always - yes, <b>always</b>
|
|
- use "-w", whether on the command line or in a script.
|
|
<p>"-n" is the "non-printing loop" switch, which causes Perl to iterate
|
|
over the input, one line at a time - somewhat like "awk". If you want to
|
|
print a given line, you'll need to specify a condition for it:
|
|
<p><tt>perl -wne 'print if /holiday/' schedule.txt</tt>
|
|
<p>Perl will loop through "schedule.txt" and print any line that contains
|
|
the word "holiday", so you can get depressed about how little time off
|
|
you actually have.
|
|
<p>"-p" is the invocation for a "printing loop", which acts just like "-n"
|
|
except that it prints every line that it loops over. This is very useful
|
|
for "sed"-like operations, like modifying a file and writing it back out
|
|
(we'll discuss 's///', the substitution operator, in just a bit):
|
|
<p><tt>perl -wpe 's/holiday/Party time!/' schedule.txt</tt>
|
|
<p>This will perform the substitution on the first occurrence of the word
|
|
'holiday' in any given line (see "perldoc perlre" for discussion of modifiers
|
|
used with 's///', such as 'g'lobal.)
|
|
<p>The "-i" switch works well in combination with either of the above,
|
|
depending on the desired action; it allows you to perform an "in-place"
|
|
edit, i.e. make the changes in the specified file (optionally performing
|
|
a backup beforehand) rather than printing them out to the screen. Note
|
|
that we can't just tack an "i" onto the "wpe" string: it takes an optional
|
|
argument - the extension to be appended to the backup copy - and the text
|
|
that follows it is what specifies that extension.
|
|
<p><tt>perl -i~ -wpe 's/holiday/Party time!/' schedule.txt</tt>
|
|
<p>The above line will produce a "schedule.txt" with the modified text
|
|
in it, and a "schedule.txt~" that is the original file. "-i" without any
|
|
extension overwrites the original file; this is far more convenient than
|
|
producing a modified file and renaming it back to the original, but be
|
|
<b>sure</b> that your code is correct, or you'll wipe out your original
|
|
data!
|
|
<br>
|
|
<br>
|
|
<p><b>RegExes, or "Has The Cat Been Walking On My Keyboard Again?"</b>
|
|
<p>One of the most powerful tools available in Perl, the regular expression
|
|
is the way to match almost any imaginable character arrangement. Here (necessarily)
|
|
I'll cover only the very basics; if you find that you need more information,
|
|
dig into the "perlre" manpage that comes with Perl. That should keep you
|
|
busy for a while. :)
|
|
<p>REs are used for pattern matching, most commonly with the "m//" (matching)
|
|
and "s///" substitution) operators. Note that the delimiters in these,
|
|
just like in the quoting mechanisms, are not restricted to '/'; in fact,
|
|
the leading 'm' in the matching operator is required <i>only</i> if a non-default
|
|
delimiter is used. Otherwise, just the "//" is sufficient.
|
|
<p>Here are some of the metacharacters used with REs. Note that there are
|
|
many more; these are just enough to get us started:
|
|
<p><tt>. Matches any character
|
|
except the newline</tt>
|
|
<br><tt>^ Match the beginning
|
|
of the line</tt>
|
|
<br><tt>$ Match the end of the
|
|
line</tt>
|
|
<br><tt>| Alternation (match
|
|
"left|right|up|down|sideways")</tt>
|
|
<br><tt>* Match 0 or more times</tt>
|
|
<br><tt>+ Match 1 or more times</tt>
|
|
<br><tt>? Match 0 or 1 times</tt>
|
|
<br><tt>{n} Match exactly n times</tt>
|
|
<br><tt>{n,} Match at least n times</tt>
|
|
<br><tt>{n,m} Match at least n but not more than m times</tt>
|
|
<br>
|
|
<p>As an example, let's say that we have a file with a list of names:
|
|
<p><tt>Anne Bonney</tt>
|
|
<br><tt>Bartholomew Roberts</tt>
|
|
<br><tt>Charles Bellamy</tt>
|
|
<br><tt>Diego Grillo</tt>
|
|
<br><tt>Edward Teach</tt>
|
|
<br><tt>Francois Gautier</tt>
|
|
<br><tt>George Watling</tt>
|
|
<br><tt>Henry Every</tt>
|
|
<br><tt>Israel Hands</tt>
|
|
<br><tt>John Derdrake</tt>
|
|
<br><tt>KuoHsing Yeh</tt>
|
|
<br><tt>...</tt>
|
|
<p>and we want to replace the first name with 'Captain'. Obviously, we
|
|
would go through the file with a printing loop and do a substution if it
|
|
matched our criteria:<tt></tt>
|
|
<p><tt>s/^.+ /Captain /;</tt>
|
|
<p>The caret ('^') matches at the beginning of the line, the ".+" says
|
|
"any character, repeated 1 or more times", and the space matches a space.
|
|
Once we find what we're looking for, we're going to replace it with 'Captain'
|
|
followed by a space - since the string that we're replacing contains one,
|
|
we'll need to put it back.
|
|
<p>Let's say that we also knew that somewhere in the file, there are a
|
|
couple of names that contain apostrophes (<tt>Francois L'Ollonais</tt>),
|
|
and we wanted to skip them - or anything else that contained 'non-letter'
|
|
characters. Let's expand the regex a bit:
|
|
<p><tt>s/^[A-Z][a-z]* /Captain /;</tt>
|
|
<p>We've used the "character class" specifiers, "[]", to first match one
|
|
character between 'A' and 'Z' - note that only <b>one</b> character is
|
|
matched by this mechanism, a very important distinction! - followed by
|
|
a one-character match of 'a' through 'z' and an asterisk, which, again,
|
|
says "zero or more of the preceding character".
|
|
<p>Oops, wait! How about "<tt>KuoHsing</tt>"? The match would fail on the
|
|
'H', since upper-case characters were not included in the specified range.
|
|
OK, we'll modify the regex:
|
|
<p><tt>s/^\w* /Captain /;</tt>
|
|
<p>The '\w' is a "word character" - once again, it matches only one character
|
|
- that includes 'A-Z', 'a-z', and '_'. It is preferable to [A-Za-z_] because
|
|
it uses the value of $LOCALE (a system value) to determine what characters
|
|
should or should not be part of words - and this is important in languages
|
|
other than English. As well, '\w' is easier to type than '[A-Za-z_]'.
|
|
<p>Let's try something a bit different: What if we still wanted to
|
|
match all the first names, but now, rather than replacing them, we wanted
|
|
to swap them around with the last names, separate the two with a comma,
|
|
and precede the last name with the word 'Captain'? With regexes at our
|
|
command, it's not a problem:
|
|
<p><tt>s/^(\w*) (\w*)$/Captain $2, $1/;</tt>
|
|
<p>Note the parentheses and the "$1" and "$2" variables: the
|
|
parentheses "capture" the enclosed part of the regex, which we can
|
|
then refer to via the variables (the first captured piece is $1, the second
|
|
is $2, and so on.) So, here is the above regex in English:
|
|
<p><i>Starting from the beginning of the line, (begin capture into $1)
|
|
match any "word character" repeated zero or more times (end capture) and
|
|
followed by a space, (begin capture into $2) followed by any "word character"
|
|
repeated zero or more times (end capture) until the end of the line. Return
|
|
the word 'Captain' followed by a space, which is followed by the value
|
|
of $2, a comma, a space, and the value of $1.</i>
|
|
<p>I'd say that regexes are a <b>very</b> compact way to say all of the
|
|
above. At times like these, it becomes pretty obvious that Larry Wall is
|
|
a professional linguist. :)
|
|
<p>These are just simple examples of what goes into building a regex. I
|
|
must admit to cheating a bit: name-parsing is probably one of the biggest
|
|
challenges out there, and I could have spun these example out as long as
|
|
I wanted. Considering that the possibilities include "John deJongh", "Jan
|
|
M.
|
|
<br>van de Geijn", "Kathleen O'Hara-Mears", "Siu Tim Au Yeung", "Nang-Soa-Anee
|
|
Bongoj Niratpattanasai", and "Mjölby J. de Wærn" (remember to
|
|
use those LOCALE-aware matches, right?), the field is pretty broad and
|
|
very odd in spots. (Miss Niratpattanasai, after looking at something like
|
|
"John Smith". would probably agree. :)
|
|
<br>
|
|
<p>Here's an important factor to be aware of in the regex mechanism: by
|
|
default, it does "greedy matching". In other words, given a phrase like
|
|
<p><b><tt>Acciones son amores, no besos ni apachurrones</tt></b>
|
|
<p>and a regex like
|
|
<p><tt>/A.*es/</tt>
|
|
<p>it would match the following:
|
|
<p><b><tt>Acciones son amores, no besos ni apachurrones</tt></b>
|
|
<br><b><tt>|___________________________________________|</tt></b><tt></tt>
|
|
<p>Hmmm. Everything from the first 'A' (followed by zero or more of any
|
|
character) to the <i>last</i> 'es'. How can we match just the first instance,
|
|
then? To counteract the greed, Perl provides a "generosity" modifier to
|
|
quantifiers such as '*', '+', and '?':
|
|
<p><tt>/A.*?es/</tt>
|
|
<p><b><tt>Acciones son amores, no besos ni apachurrones</tt></b>
|
|
<br><b><tt>|______|</tt></b>
|
|
<p>There. Much better. For future reference, remember: if you're breaking
|
|
up a string by matching its pieces with a series of regexes, and the last
|
|
"chunks" are coming up empty, you've probably got a "greed" problem.
|
|
<br>
|
|
<br>
|
|
<p><b>The Default Buffer/Variable</b>
|
|
<p>Some of you, especially those who have done some programming in the
|
|
past, have probably been curious about some of the code constructs above,
|
|
like
|
|
<p><tt>print if /holiday/;</tt>
|
|
<p>"Print <b>what</b> if <b>what?</b> Where is the variable that we're
|
|
checking for the match? Shouldn't it be something like '<tt>if $x == /holiday/</tt>',
|
|
the way it is in the shell?"
|
|
<p>I'm glad you asked that question. :)
|
|
<p>Perl uses an interesting concept, found in a few other languages, of
|
|
the <i>default buffer</i> - also referred to as the <i>default variable</i>
|
|
and the <i>default pattern space</i>. Not surprisingly, it's used in the
|
|
looping constructs - when we use the "-n/-p" syntax in the Perl invocation,
|
|
it is the variable used to hold the current line - as well as in substitution
|
|
and matching, and a number of other places. The '$_' variable is the default
|
|
for all of the above; when a variable is not specified in a place where
|
|
you'd expect one, '$_' is usually the "culprit." In fact, '$_' is rather
|
|
difficult to explain - it turns up in so many places that coming up with
|
|
an algorithm is seemingly impossible - but it is wonderfully easy and intuitive
|
|
to <i>use</i>, once you get the idea.
|
|
<p>Consider the following:
|
|
<p><tt>perl -wne 'if ( $_ =~ /Henry/ ) { print $_; } pirates</tt>
|
|
<p>If a line in the "pirates" file, above, matches "Henry", it will be
|
|
printed. Fine; but now, let's play some amateur "Perl Golf" - that's a
|
|
contest among Perl hackers to see how many (key)strokes can be taken off
|
|
a piece of code and still leave it functional.
|
|
<p>Since we already know that Perl reads each line into '$_', we'll just
|
|
get rid of all the explicit declarations of it:
|
|
<p><tt>perl -wne 'if ( /Henry/ ) { print; } pirates</tt>
|
|
<p>Perl "knows" that we're matching against the default variable, and it
|
|
"knows" that the "print" statement applies to the same thing. Now, we apply
|
|
a little Perl idiom:
|
|
<p><tt>perl -wne 'print if /Henry/' pirates</tt>
|
|
<p>Isn't that nice? Perl actually allows you to write out your code with
|
|
the condition following the action; kinda the way you'd say things in English.
|
|
Oh, and we've snipped off the semicolon on the end because we don't need
|
|
it: it's a statement <i>separator</i>, and there's no statement following
|
|
<br>"/Henry/".
|
|
<p><grin> For those of you playing along at home, try
|
|
<p><tt>perl -ne'/Henry/&&print' pirates</tt>
|
|
<p>It shouldn't be <b>that</b> hard to figure out; the '&&' operator
|
|
in Perl works the same way as it does in the shell. Perl Golf is fun to
|
|
play, but be careful: it's easy to write code that will work but will
|
|
require lots of head-scratching to understand. Don't Do That. I may have
|
|
to maintain your code tomorrow... just like you may have to maintain mine.
|
|
<br>
|
|
<p>In the first example, note the "binding operator", '=~', which checks
|
|
for a match in the supplied variable. This is what you would use if you
|
|
were matching against a variable other than "$_". There is also a "negative
|
|
match" operator, '!~', which returns true if the match fails (the inverse
|
|
of '=~'.)
|
|
<p>Note also that the available modifiers for simple statements, like that
|
|
above, include not only the "if", but also "unless", "while", "until",
|
|
and "for". All of these, and more, are coming up in Part 3...
|
|
<br>
|
|
<br>
|
|
<p><b>Ben Okopnik</b>
|
|
<br><tt>perl -we '$perl=0;JsP $perl "perl"; $perl->perl(0)'\</tt>
|
|
<br><tt> 2>&1|perl -ne '{print ((split//)[19,29,20,4,5,1,2,</tt>
|
|
<br><tt>15,13,14,12,52,5,21,12,52,8,5,14,1,6,37,12,52,75])}'</tt>
|
|
<br>
|
|
<hr WIDTH="100%">
|
|
<br><a NAME="1"></a>[1]. And in fact, has one - "Mastering Regular Expressions"
|
|
by Jeffrey E. Friedl is considered to be a reference on the subject. It
|
|
includes some wonderful examples, and literally teaches the reader to "think
|
|
in regex".
|
|
<br>
|
|
<p><tt>References:</tt><tt></tt>
|
|
<p><tt>Relevant Perl man pages (available on any pro-Perl-y configured
|
|
system):</tt><tt></tt>
|
|
<p><tt>perl - overview
|
|
perlfaq - Perl FAQ</tt>
|
|
<br><tt>perltoc - doc TOC
|
|
perldata - data structures</tt>
|
|
<br><tt>perlsyn - syntax
|
|
perlop - operators/precedence</tt>
|
|
<br><tt>perlrun - execution
|
|
perlfunc - builtin functions</tt>
|
|
<br><tt>perltrap - traps for the unwary perlstyle - style guide</tt><tt></tt>
|
|
<p><tt>"perldoc", "perldoc -q" and "perldoc -f"</tt>
|
|
|
|
|
|
|
|
|
|
<!-- *** BEGIN copyright *** -->
|
|
<P> <hr> <!-- P -->
|
|
<H5 ALIGN=center>
|
|
|
|
Copyright © 2001, Ben Okopnik.<BR>
|
|
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
|
|
Published in Issue 64 of <i>Linux Gazette</i>, March 2001</H5>
|
|
<!-- *** END copyright *** -->
|
|
|
|
<!--startcut ==========================================================-->
|
|
<HR><P>
|
|
<CENTER>
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="nielsen.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue64/okopnik.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="sipos.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
</CENTER>
|
|
</BODY></HTML>
|
|
<!--endcut ============================================================-->
|