<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
|
<HTML
|
|
><HEAD
|
|
><TITLE
|
|
>A Brief Introduction to Regular Expressions</TITLE
|
|
><META
|
|
NAME="GENERATOR"
|
|
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
|
|
REL="HOME"
|
|
TITLE="Advanced Bash-Scripting Guide"
|
|
HREF="index.html"><LINK
|
|
REL="UP"
|
|
TITLE="Regular Expressions"
|
|
HREF="regexp.html"><LINK
|
|
REL="PREVIOUS"
|
|
TITLE="Regular Expressions"
|
|
HREF="regexp.html"><LINK
|
|
REL="NEXT"
|
|
TITLE="Globbing"
|
|
HREF="globbingref.html"></HEAD
|
|
><BODY
|
|
CLASS="SECT1"
|
|
BGCOLOR="#FFFFFF"
|
|
TEXT="#000000"
|
|
LINK="#0000FF"
|
|
VLINK="#840084"
|
|
ALINK="#0000FF"
|
|
><DIV
|
|
CLASS="NAVHEADER"
|
|
><TABLE
|
|
SUMMARY="Header navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TH
|
|
COLSPAN="3"
|
|
ALIGN="center"
|
|
>Advanced Bash-Scripting Guide: </TH
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="left"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="regexp.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="80%"
|
|
ALIGN="center"
|
|
VALIGN="bottom"
|
|
>Chapter 18. Regular Expressions</TD
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="right"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="globbingref.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"></DIV
|
|
><DIV
|
|
CLASS="SECT1"
|
|
><H1
|
|
CLASS="SECT1"
|
|
><A
|
|
NAME="AEN17129"
|
|
></A
|
|
>18.1. A Brief Introduction to Regular Expressions</H1
|
|
><P
|
|
>An expression is a string of characters. Those characters
|
|
having an interpretation above and beyond their literal
|
|
meaning are called <I
|
|
CLASS="FIRSTTERM"
|
|
>metacharacters</I
|
|
>.
|
|
A quote symbol, for example, may denote speech by a person,
|
|
<I
|
|
CLASS="FIRSTTERM"
|
|
>ditto</I
|
|
>, or a meta-meaning
|
|
|
|
<A
|
|
NAME="AEN17134"
|
|
HREF="#FTN.AEN17134"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[1]</SPAN
|
|
></A
|
|
>
|
|
|
|
for the symbols that follow. Regular Expressions are sets
|
|
of characters and/or metacharacters that match (or specify)
|
|
patterns.</P
|
|
><P
|
|
>A Regular Expression contains one or more of the
|
|
following:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
><I
|
|
CLASS="FIRSTTERM"
|
|
>A character set</I
|
|
>. These are the
|
|
characters retaining their literal meaning. The
|
|
simplest type of Regular Expression consists
|
|
<EM
|
|
>only</EM
|
|
> of a character set, with no
|
|
metacharacters.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="ANCHORREF"
|
|
></A
|
|
></P
|
|
><P
|
|
><I
|
|
CLASS="FIRSTTERM"
|
|
>An anchor</I
|
|
>. These designate
|
|
(<I
|
|
CLASS="FIRSTTERM"
|
|
>anchor</I
|
|
>) the position in the line of
|
|
text that the RE is to match. For example, <SPAN
|
|
CLASS="TOKEN"
|
|
>^</SPAN
|
|
>,
|
|
and <SPAN
|
|
CLASS="TOKEN"
|
|
>$</SPAN
|
|
> are anchors.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><I
|
|
CLASS="FIRSTTERM"
|
|
>Modifiers</I
|
|
>. These expand or narrow
|
|
(<I
|
|
CLASS="FIRSTTERM"
|
|
>modify</I
|
|
>) the range of text the RE is
|
|
to match. Modifiers include the asterisk, brackets, and
|
|
the backslash.</P
|
|
></LI
|
|
></UL
|
|
><P
|
|
>The main uses for Regular Expressions
|
|
(<I
|
|
CLASS="FIRSTTERM"
|
|
>RE</I
|
|
>s) are text searches and string
|
|
manipulation. An RE <I
|
|
CLASS="FIRSTTERM"
|
|
>matches</I
|
|
> a single
|
|
character or a set of characters -- a string or a part of
|
|
a string.</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="ASTERISKREG"
|
|
></A
|
|
>The asterisk --
|
|
<SPAN
|
|
CLASS="TOKEN"
|
|
>*</SPAN
|
|
> -- matches any number of
|
|
repeats of the character string or RE preceding it,
|
|
including <EM
|
|
>zero</EM
|
|
> instances.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"1133*"</SPAN
|
|
> matches <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>11 +
|
|
one or more 3's</I
|
|
></TT
|
|
>:
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>113</I
|
|
></TT
|
|
>, <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>1133</I
|
|
></TT
|
|
>,
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>1133333</I
|
|
></TT
|
|
>, and so forth.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="REGEXDOT"
|
|
></A
|
|
>The <I
|
|
CLASS="FIRSTTERM"
|
|
>dot</I
|
|
>
|
|
-- <SPAN
|
|
CLASS="TOKEN"
|
|
>.</SPAN
|
|
> -- matches
|
|
any one character, except a newline.
|
|
<A
|
|
NAME="AEN17189"
|
|
HREF="#FTN.AEN17189"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[2]</SPAN
|
|
></A
|
|
>
|
|
</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"13."</SPAN
|
|
> matches <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>13 + any
single character (including a
space)</I
|
|
></TT
|
|
>: <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>1133</I
|
|
></TT
|
|
>,
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>11333</I
|
|
></TT
|
|
>, but not
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>13</I
|
|
></TT
|
|
> (additional character
|
|
missing).</P
|
|
><P
|
|
>See <A
|
|
HREF="textproc.html#CWSOLVER"
|
|
>Example 16-18</A
|
|
> for a demonstration
|
|
of <I
|
|
CLASS="FIRSTTERM"
|
|
>dot single-character</I
|
|
>
|
|
matching.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="CARETREF"
|
|
></A
|
|
>The caret -- <SPAN
|
|
CLASS="TOKEN"
|
|
>^</SPAN
|
|
>
|
|
-- matches the beginning of a line, but sometimes, depending
|
|
on context, negates the meaning of a set of characters in
|
|
an RE.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="DOLLARSIGNREF"
|
|
></A
|
|
></P
|
|
><P
|
|
>The dollar sign -- <SPAN
|
|
CLASS="TOKEN"
|
|
>$</SPAN
|
|
> -- at the end of an
|
|
RE matches the end of a line.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"XXX$"</SPAN
|
|
> matches <SPAN
|
|
CLASS="TOKEN"
|
|
>XXX</SPAN
|
|
> at the
|
|
end of a line.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"^$"</SPAN
|
|
> matches blank lines.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="BRACKETSREF"
|
|
></A
|
|
></P
|
|
><P
|
|
>Brackets -- <SPAN
|
|
CLASS="TOKEN"
|
|
>[...]</SPAN
|
|
> -- enclose a set of characters
|
|
to match in a single RE.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[xyz]"</SPAN
|
|
> matches any one of the characters
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>x</I
|
|
></TT
|
|
>, <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>y</I
|
|
></TT
|
|
>,
|
|
or <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>z</I
|
|
></TT
|
|
>.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[c-n]"</SPAN
|
|
> matches any one of the
|
|
characters in the range <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>c</I
|
|
></TT
|
|
>
|
|
to <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>n</I
|
|
></TT
|
|
>.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[B-Pk-y]"</SPAN
|
|
> matches any one of the
|
|
characters in the ranges <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>B</I
|
|
></TT
|
|
>
|
|
to <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>P</I
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>k</I
|
|
></TT
|
|
> to
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>y</I
|
|
></TT
|
|
>.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[a-z0-9]"</SPAN
|
|
> matches any single lowercase
|
|
letter or any digit.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[^b-d]"</SPAN
|
|
> matches any character
|
|
<EM
|
|
>except</EM
|
|
> those in
|
|
the range <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>b</I
|
|
></TT
|
|
> to
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>d</I
|
|
></TT
|
|
>. This is an instance of
|
|
<SPAN
|
|
CLASS="TOKEN"
|
|
>^</SPAN
|
|
> negating or inverting the meaning
|
|
of the following RE (taking on a role similar to
|
|
<SPAN
|
|
CLASS="TOKEN"
|
|
>!</SPAN
|
|
> in a different context).</P
|
|
><P
|
|
>Combined sequences of bracketed characters match
|
|
common word patterns. <SPAN
|
|
CLASS="QUOTE"
|
|
>"[Yy][Ee][Ss]"</SPAN
|
|
> matches
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>yes</I
|
|
></TT
|
|
>, <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>Yes</I
|
|
></TT
|
|
>,
|
|
<TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>YES</I
|
|
></TT
|
|
>, <TT
|
|
CLASS="REPLACEABLE"
|
|
><I
|
|
>yEs</I
|
|
></TT
|
|
>,
|
|
and so forth.
|
|
<SPAN
|
|
CLASS="QUOTE"
|
|
>"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"</SPAN
|
|
>
|
|
matches any Social Security number.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="REGEXBS"
|
|
></A
|
|
></P
|
|
><P
|
|
>The backslash -- <SPAN
|
|
CLASS="TOKEN"
|
|
>\</SPAN
|
|
> -- <A
|
|
HREF="escapingsection.html#ESCP"
|
|
>escapes</A
|
|
> a special character, which
|
|
means that character gets interpreted literally (and is
|
|
therefore no longer <I
|
|
CLASS="FIRSTTERM"
|
|
>special</I
|
|
>).</P
|
|
><P
|
|
>A <SPAN
|
|
CLASS="QUOTE"
|
|
>"\$"</SPAN
|
|
> reverts back to its
|
|
literal meaning of <SPAN
|
|
CLASS="QUOTE"
|
|
>"$"</SPAN
|
|
>, rather than its
|
|
RE meaning of end-of-line. Likewise a <SPAN
|
|
CLASS="QUOTE"
|
|
>"\\"</SPAN
|
|
>
|
|
has the literal meaning of <SPAN
|
|
CLASS="QUOTE"
|
|
>"\"</SPAN
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="ANGLEBRAC"
|
|
></A
|
|
></P
|
|
><P
|
|
><A
|
|
HREF="escapingsection.html#ESCP"
|
|
>Escaped</A
|
|
> <SPAN
|
|
CLASS="QUOTE"
|
|
>"angle
|
|
brackets"</SPAN
|
|
> -- <SPAN
|
|
CLASS="TOKEN"
|
|
>\&lt;...\&gt;</SPAN
|
|
> -- mark word
|
|
boundaries.</P
|
|
><P
|
|
>The angle brackets must be escaped, since otherwise
|
|
they have only their literal character meaning.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"\&lt;the\&gt;"</SPAN
|
|
> matches the word
|
|
<SPAN
|
|
CLASS="QUOTE"
|
|
>"the,"</SPAN
|
|
> but not the words <SPAN
|
|
CLASS="QUOTE"
|
|
>"them,"</SPAN
|
|
>
|
|
<SPAN
|
|
CLASS="QUOTE"
|
|
>"there,"</SPAN
|
|
> <SPAN
|
|
CLASS="QUOTE"
|
|
>"other,"</SPAN
|
|
> etc.</P
|
|
><P
|
|
> <TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>cat textfile</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>This is line 1, of which there is only one instance.
|
|
This is the only instance of line 2.
|
|
This is line 3, another line.
|
|
This is line 4.</TT
|
|
>
|
|
|
|
|
|
<TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>grep 'the' textfile</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>This is line 1, of which there is only one instance.
|
|
This is the only instance of line 2.
|
|
This is line 3, another line.</TT
|
|
>
|
|
|
|
|
|
<TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>grep '\&lt;the\&gt;' textfile</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>This is the only instance of line 2.</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
></LI
|
|
></UL
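><P
>The basic metacharacters listed above can be tried out with a short
script. What follows is a minimal sketch rather than part of the original
text: the sample lines and the helper function
<TT
CLASS="REPLACEABLE"
><I
>count</I
></TT
> are invented for illustration, and
<B
CLASS="COMMAND"
>grep -c</B
> is used only to count the lines that match.</P
><PRE
CLASS="PROGRAMLISTING"
>
```shell
#!/bin/bash
# Sketch only: the sample lines and count() are hypothetical.
export LC_ALL=C              # keep bracket ranges predictable

sample='113
13
1133333
XXX
aXXXb

the end
other
cost: $5'

# Print the number of lines in the sample that match the basic RE in $1.
count() { printf '%s\n' "$sample" | grep -c -- "$1"; }

n_star=$(count '1133*')      # 11 followed by one or more 3's
n_dot=$(count '13.')         # 13 plus one more character
n_eol=$(count 'XXX$')        # XXX at the end of a line
n_blank=$(count '^$')        # empty lines
n_word=$(count '^[Tt]he')    # "the" or "The" at the start of a line
n_neg=$(count '^[^0-9]')     # lines whose first character is not a digit
n_lit=$(count '\$')          # a literal dollar sign

printf '%s\n' "$n_star $n_dot $n_eol $n_blank $n_word $n_neg $n_lit"
```
</PRE
><P
>Note that the line consisting of just 13 is not counted for the
dot pattern, since no character follows the 3.</P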
|
|
><TABLE
|
|
CLASS="SIDEBAR"
|
|
BORDER="1"
|
|
CELLPADDING="5"
|
|
><TR
|
|
><TD
|
|
><DIV
|
|
CLASS="SIDEBAR"
|
|
><A
|
|
NAME="AEN17316"
|
|
></A
|
|
><P
|
|
></P
|
|
><P
|
|
>The only way to be certain that a particular RE works is to
|
|
test it.</P
|
|
><P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
>TEST FILE: tstfile # No match.
|
|
# No match.
|
|
Run grep "1133*" on this file. # Match.
|
|
# No match.
|
|
# No match.
|
|
This line contains the number 113. # Match.
|
|
This line contains the number 13. # No match.
|
|
This line contains the number 133. # No match.
|
|
This line contains the number 1133. # Match.
|
|
This line contains the number 113312. # Match.
|
|
This line contains the number 1112. # No match.
|
|
This line contains the number 113312312. # Match.
|
|
This line contains no numbers at all. # No match.</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
><TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>grep "1133*" tstfile</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>Run grep "1133*" on this file. # Match.
|
|
This line contains the number 113. # Match.
|
|
This line contains the number 1133. # Match.
|
|
This line contains the number 113312. # Match.
|
|
This line contains the number 113312312. # Match.</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><P
|
|
></P
|
|
></DIV
|
|
></TD
|
|
></TR
|
|
></TABLE
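><P
>Testing an RE, as the sidebar recommends, is easy to automate. The
sketch below assumes a hypothetical helper named
<TT
CLASS="REPLACEABLE"
><I
>check_re</I
></TT
>, which is not part of any standard tool. It feeds one string to
<B
CLASS="COMMAND"
>grep -q</B
> and reports whether the outcome agrees with what was expected.</P
><PRE
CLASS="PROGRAMLISTING"
>
```shell
#!/bin/bash
# Sketch only: check_re and its argument order are hypothetical.
# Usage: check_re PATTERN STRING EXPECT   (EXPECT is "match" or "nomatch")
check_re() {
  if printf '%s\n' "$2" | grep -q -- "$1"; then
    outcome=match
  else
    outcome=nomatch
  fi
  if [ "$outcome" = "$3" ]; then
    echo "PASS: '$1' on '$2' gives $outcome"
  else
    echo "FAIL: '$1' on '$2' gives $outcome, expected $3"
  fi
}

check_re '1133*' 'This line contains the number 113.'  match
check_re '1133*' 'This line contains the number 133.'  nomatch
check_re '1133*' 'This line contains the number 1112.' nomatch
```
</PRE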
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
STYLE="list-style-type: square"
|
|
><DIV
|
|
CLASS="FORMALPARA"
|
|
><P
|
|
><B
|
|
><A
|
|
NAME="EXTREGEX"
|
|
></A
|
|
>Extended REs. </B
|
|
>These add further metacharacters to the basic set. They are used
|
|
in <A
|
|
HREF="textproc.html#EGREPREF"
|
|
>egrep</A
|
|
>,
|
|
<A
|
|
HREF="awk.html#AWKREF"
|
|
>awk</A
|
|
>, and <A
|
|
HREF="wrapper.html#PERLREF"
|
|
>Perl</A
|
|
>.</P
|
|
></DIV
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="QUEXREGEX"
|
|
></A
|
|
></P
|
|
><P
|
|
>The question mark -- <SPAN
|
|
CLASS="TOKEN"
|
|
>?</SPAN
|
|
> -- matches zero or
|
|
one of the previous RE. It is generally used for matching
|
|
single characters.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="PLUSREF"
|
|
></A
|
|
></P
|
|
><P
|
|
>The plus -- <SPAN
|
|
CLASS="TOKEN"
|
|
>+</SPAN
|
|
> -- matches one or more of the
|
|
previous RE. It serves a role similar to the <SPAN
|
|
CLASS="TOKEN"
|
|
>*</SPAN
|
|
>, but
|
|
does <EM
|
|
>not</EM
|
|
> match zero occurrences.</P
|
|
><P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
># GNU sed and grep accept "+" in a basic RE,
# but there it must be escaped. gawk takes it unescaped.
|
|
|
|
echo a111b | sed -ne '/a1\+b/p'
|
|
echo a111b | grep 'a1\+b'
|
|
echo a111b | gawk '/a1+b/'
|
|
# All of the above are equivalent.
|
|
|
|
# Thanks, S.C.</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
><P
|
|
><A
|
|
NAME="ESCPCB"
|
|
></A
|
|
></P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
HREF="escapingsection.html#ESCP"
|
|
>Escaped</A
|
|
> <SPAN
|
|
CLASS="QUOTE"
|
|
>"curly
|
|
brackets"</SPAN
|
|
> -- <SPAN
|
|
CLASS="TOKEN"
|
|
>\{ \}</SPAN
|
|
> -- indicate the number
|
|
of occurrences of a preceding RE to match.</P
|
|
><P
|
|
>In a basic RE the curly brackets must be escaped, since
otherwise they have only their literal character meaning.
POSIX extended REs use them unescaped, as an interval
expression.</P
|
|
><P
|
|
><SPAN
|
|
CLASS="QUOTE"
|
|
>"[0-9]\{5\}"</SPAN
|
|
> matches exactly five digits
|
|
(characters in the range of 0 to 9).</P
|
|
><DIV
|
|
CLASS="NOTE"
|
|
><P
|
|
></P
|
|
><TABLE
|
|
CLASS="NOTE"
|
|
WIDTH="90%"
|
|
BORDER="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="25"
|
|
ALIGN="CENTER"
|
|
VALIGN="TOP"
|
|
><IMG
|
|
SRC="../images/note.gif"
|
|
HSPACE="5"
|
|
ALT="Note"></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
><P
|
|
>Curly brackets are not available as an RE in the
|
|
<SPAN
|
|
CLASS="QUOTE"
|
|
>"classic"</SPAN
|
|
> (non-POSIX compliant) version
|
|
of <A
|
|
HREF="awk.html#AWKREF"
|
|
>awk</A
|
|
>.
|
|
<A
|
|
NAME="GNUGAWK"
|
|
></A
|
|
>However, the GNU extended version
|
|
of <I
|
|
CLASS="FIRSTTERM"
|
|
>awk</I
|
|
>, <B
|
|
CLASS="COMMAND"
|
|
>gawk</B
|
|
>,
|
|
has the <TT
|
|
CLASS="OPTION"
|
|
>--re-interval</TT
|
|
> option that permits
|
|
them (without being escaped).</P
|
|
><P
|
|
> <TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>echo 2222 | gawk --re-interval '/2{3}/'</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>2222</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
><P
|
|
><B
|
|
CLASS="COMMAND"
|
|
>Perl</B
|
|
> and some
|
|
<B
|
|
CLASS="COMMAND"
|
|
>egrep</B
|
|
> versions do not require escaping
|
|
the curly brackets.</P
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="PARENGRPS"
|
|
></A
|
|
></P
|
|
><P
|
|
>Parentheses -- <B
|
|
CLASS="COMMAND"
|
|
>( )</B
|
|
> -- enclose a group of
|
|
REs. They are useful with the following
|
|
<SPAN
|
|
CLASS="QUOTE"
|
|
>"<SPAN
|
|
CLASS="TOKEN"
|
|
>|</SPAN
|
|
>"</SPAN
|
|
> operator and in <A
|
|
HREF="string-manipulation.html#EXPRPAREN"
|
|
>substring extraction</A
|
|
> using <A
|
|
HREF="moreadv.html#EXPRREF"
|
|
>expr</A
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>The -- <B
|
|
CLASS="COMMAND"
|
|
>|</B
|
|
> -- <SPAN
|
|
CLASS="QUOTE"
|
|
>"or"</SPAN
|
|
> RE operator
|
|
matches any one of a set of alternative REs.</P
|
|
><P
|
|
> <TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>egrep 're(a|e)d' misc.txt</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>People who read seem to be better informed than those who do not.
|
|
The clarinet produces sound by the vibration of its reed.</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
></LI
|
|
></UL
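><P
>The extended metacharacters can be exercised in the same way. The
following is a sketch under stated assumptions: it uses
<B
CLASS="COMMAND"
>grep -E</B
>, the POSIX spelling of
<B
CLASS="COMMAND"
>egrep</B
>, and the word list and the helper
<TT
CLASS="REPLACEABLE"
><I
>ecount</I
></TT
> are invented for illustration.</P
><PRE
CLASS="PROGRAMLISTING"
>
```shell
#!/bin/bash
# Sketch only: the word list and ecount() are hypothetical.
export LC_ALL=C

words='color
colour
colouur
read
reed
red
a1b
a111b
ab
12345
1234'

# Print the number of lines in the word list that match the extended RE in $1.
ecount() { printf '%s\n' "$words" | grep -E -c -- "$1"; }

n_opt=$(ecount '^colou?r$')      # zero or one "u"
n_plus=$(ecount '^a1+b$')        # one or more "1"
n_five=$(ecount '^[0-9]{5}$')    # exactly five digits, unescaped brackets
n_alt=$(ecount '^re(a|e)d$')     # "read" or "reed", but not "red"

printf '%s\n' "$n_opt $n_plus $n_five $n_alt"
```
</PRE
><P
>The anchors in each pattern force the whole line to match, which makes
the difference between zero, one, and several repeats visible in the
counts.</P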
|
|
><DIV
|
|
CLASS="NOTE"
|
|
><P
|
|
></P
|
|
><TABLE
|
|
CLASS="NOTE"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="25"
|
|
ALIGN="CENTER"
|
|
VALIGN="TOP"
|
|
><IMG
|
|
SRC="../images/note.gif"
|
|
HSPACE="5"
|
|
ALT="Note"></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
><P
|
|
>Some versions of <B
|
|
CLASS="COMMAND"
|
|
>sed</B
|
|
>,
|
|
<B
|
|
CLASS="COMMAND"
|
|
>ed</B
|
|
>, and <B
|
|
CLASS="COMMAND"
|
|
>ex</B
|
|
> support
|
|
escaped versions of the extended Regular Expressions
|
|
described above, as do the GNU utilities.</P
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
STYLE="list-style-type: square"
|
|
><DIV
|
|
CLASS="FORMALPARA"
|
|
><P
|
|
><B
|
|
><A
|
|
NAME="POSIXREF"
|
|
></A
|
|
>POSIX Character Classes. </B
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:class:]</B
|
|
></TT
|
|
></P
|
|
></DIV
|
|
><P
|
|
>This is an alternate method of specifying a range of
|
|
characters to match.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:alnum:]</B
|
|
></TT
|
|
> matches alphabetic or
|
|
numeric characters. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>A-Za-z0-9</B
|
|
></TT
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:alpha:]</B
|
|
></TT
|
|
> matches alphabetic
|
|
characters. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>A-Za-z</B
|
|
></TT
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:blank:]</B
|
|
></TT
|
|
> matches a space or a
|
|
tab.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:cntrl:]</B
|
|
></TT
|
|
> matches control
|
|
characters.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:digit:]</B
|
|
></TT
|
|
> matches (decimal)
|
|
digits. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>0-9</B
|
|
></TT
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:graph:]</B
|
|
></TT
|
|
> (graphic printable
|
|
characters). Matches characters in the range of <A
|
|
HREF="special-chars.html#ASCIIDEF"
|
|
>ASCII</A
|
|
> 33 - 126. This is
|
|
the same as <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:print:]</B
|
|
></TT
|
|
>, below,
|
|
but excluding the space character.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:lower:]</B
|
|
></TT
|
|
> matches lowercase
|
|
alphabetic characters. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>a-z</B
|
|
></TT
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:print:]</B
|
|
></TT
|
|
> (printable
|
|
characters). Matches characters in the range of ASCII 32 -
|
|
126. This is the same as <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:graph:]</B
|
|
></TT
|
|
>,
|
|
above, but adding the space character.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><A
|
|
NAME="WSPOSIX"
|
|
></A
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:space:]</B
|
|
></TT
|
|
>
|
|
matches whitespace characters (space, horizontal tab,
newline, vertical tab, form feed, and carriage return).</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:upper:]</B
|
|
></TT
|
|
> matches uppercase
|
|
alphabetic characters. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>A-Z</B
|
|
></TT
|
|
>.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>[:xdigit:]</B
|
|
></TT
|
|
> matches hexadecimal
|
|
digits. This is equivalent to
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>0-9A-Fa-f</B
|
|
></TT
|
|
>.</P
|
|
><DIV
|
|
CLASS="IMPORTANT"
|
|
><P
|
|
></P
|
|
><TABLE
|
|
CLASS="IMPORTANT"
|
|
WIDTH="90%"
|
|
BORDER="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="25"
|
|
ALIGN="CENTER"
|
|
VALIGN="TOP"
|
|
><IMG
|
|
SRC="../images/important.gif"
|
|
HSPACE="5"
|
|
ALT="Important"></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
><P
|
|
>POSIX character classes generally require quoting
|
|
or <A
|
|
HREF="testconstructs.html#DBLBRACKETS"
|
|
>double brackets</A
|
|
>
|
|
([[ ]]).</P
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
><P
|
|
> <TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>grep '[[:digit:]]' test.file</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>abc=723</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
><P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
># ...
|
|
if [[ $arow =~ [[:digit:]] ]] # Numerical input?
|
|
then # POSIX char class
|
|
if [[ $acol =~ [[:alpha:]] ]] # Number followed by a letter? Illegal!
|
|
# ...
|
|
# From ktour.sh example script.</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
><P
|
|
>These character classes may even be used with <A
|
|
HREF="globbingref.html"
|
|
>globbing</A
|
|
>, to a limited
|
|
extent.</P
|
|
><P
|
|
> <TABLE
|
|
BORDER="1"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="SCREEN"
|
|
><TT
|
|
CLASS="PROMPT"
|
|
>bash$ </TT
|
|
><TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>ls -l ?[[:digit:]][[:digit:]]?</B
|
|
></TT
|
|
>
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>-rw-rw-r-- 1 bozo bozo 0 Aug 21 14:47 a33b</TT
|
|
>
|
|
</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
</P
|
|
><P
|
|
>POSIX character classes are used in
|
|
<A
|
|
HREF="textproc.html#EX49"
|
|
>Example 16-21</A
|
|
> and <A
|
|
HREF="textproc.html#LOWERCASE"
|
|
>Example 16-22</A
|
|
>.</P
|
|
></LI
|
|
></UL
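><P
>POSIX character classes work in
<B
CLASS="COMMAND"
>grep</B
> bracket expressions and as
<B
CLASS="COMMAND"
>tr</B
> operands. The sketch below is illustrative only: the sample text is
invented, and the results assume the POSIX (C) locale, which the script
selects explicitly.</P
><PRE
CLASS="PROGRAMLISTING"
>
```shell
#!/bin/bash
# Sketch only: the sample text is hypothetical.
export LC_ALL=C

line='abc=723 XYZ'

# tr: keep only the digits, then map lowercase to uppercase.
digits=$(printf '%s\n' "$line" | tr -cd '[:digit:]')
upper=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]')

# grep: count the lines that contain at least one letter.
n_alpha=$(printf '%s\n' "$line" '723' | grep -c '[[:alpha:]]')

# [:blank:] covers space and tab only; [:space:] also covers
# carriage return, among others.
n_blank=$(printf 'a\rb\n' | grep -c '[[:blank:]]')
n_space=$(printf 'a\rb\n' | grep -c '[[:space:]]')

echo "$digits $upper $n_alpha $n_blank $n_space"
```
</PRE
><P
>In other locales the alphabetic classes may match more than the ASCII
letters, which is one reason to prefer them over explicit ranges.</P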
|
|
><P
|
|
><A
|
|
HREF="sedawk.html#SEDREF"
|
|
>Sed</A
|
|
>, <A
|
|
HREF="awk.html#AWKREF"
|
|
>awk</A
|
|
>, and <A
|
|
HREF="wrapper.html#PERLREF"
|
|
>Perl</A
|
|
>, used as filters in scripts, take
|
|
REs as arguments when "sifting" or transforming files or I/O
|
|
streams. See <A
|
|
HREF="contributed-scripts.html#BEHEAD"
|
|
>Example A-12</A
|
|
> and <A
|
|
HREF="contributed-scripts.html#TREE"
|
|
>Example A-16</A
|
|
>
|
|
for illustrations of this.</P
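><P
>A small sketch of such filters follows, with
<B
CLASS="COMMAND"
>sed</B
> and
<B
CLASS="COMMAND"
>awk</B
> reading a stream on standard input. The input lines are invented for
illustration, and Perl is left out only to keep the example short.</P
><PRE
CLASS="PROGRAMLISTING"
>
```shell
#!/bin/bash
# Sketch only: the input lines are hypothetical.
export LC_ALL=C

input='alpha 10
beta 200
gamma 7'

# Sifting: print only the lines that contain three consecutive digits.
sifted=$(printf '%s\n' "$input" | sed -n '/[0-9][0-9][0-9]/p')

# Transforming: delete the trailing number from every line.
stripped=$(printf '%s\n' "$input" | sed 's/ [0-9][0-9]*$//')

# awk: for lines starting with "a" or "b", print the second field.
fields=$(printf '%s\n' "$input" | awk '/^[ab]/ { print $2 }')

printf '%s\n' "$sifted" "$stripped" "$fields"
```
</PRE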
|
|
><P
|
|
>The standard reference on this complex topic is Friedl's
|
|
<I
|
|
CLASS="CITETITLE"
|
|
>Mastering Regular
|
|
Expressions</I
|
|
>. <I
|
|
CLASS="CITETITLE"
|
|
>Sed &amp;
|
|
Awk</I
|
|
>, by Dougherty and Robbins, also gives a very
|
|
lucid treatment of REs. See the <A
|
|
HREF="biblio.html"
|
|
><I
|
|
>Bibliography</I
|
|
></A
|
|
> for
|
|
more information on these books.</P
|
|
></DIV
|
|
><H3
|
|
CLASS="FOOTNOTES"
|
|
>Notes</H3
|
|
><TABLE
|
|
BORDER="0"
|
|
CLASS="FOOTNOTES"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="5%"
|
|
><A
|
|
NAME="FTN.AEN17134"
|
|
HREF="x17129.html#AEN17134"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[1]</SPAN
|
|
></A
|
|
></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="95%"
|
|
><P
|
|
><A
|
|
NAME="METAMEANINGREF"
|
|
></A
|
|
>A
|
|
<I
|
|
CLASS="FIRSTTERM"
|
|
>meta-meaning</I
|
|
> is the meaning of a
|
|
term or expression on a higher level of abstraction. For
|
|
example, the <I
|
|
CLASS="FIRSTTERM"
|
|
>literal</I
|
|
> meaning
|
|
of <I
|
|
CLASS="FIRSTTERM"
|
|
>regular expression</I
|
|
> is an
|
|
ordinary expression that conforms to accepted usage. The
|
|
<I
|
|
CLASS="FIRSTTERM"
|
|
>meta-meaning</I
|
|
> is drastically different,
|
|
as discussed at length in this chapter.</P
|
|
></TD
|
|
></TR
|
|
><TR
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="5%"
|
|
><A
|
|
NAME="FTN.AEN17189"
|
|
HREF="x17129.html#AEN17189"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[2]</SPAN
|
|
></A
|
|
></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="95%"
|
|
><P
|
|
>Since <A
|
|
HREF="sedawk.html#SEDREF"
|
|
>sed</A
|
|
>, <A
|
|
HREF="awk.html#AWKREF"
|
|
>awk</A
|
|
>, and <A
|
|
HREF="textproc.html#GREPREF"
|
|
>grep</A
|
|
> process single lines, there
|
|
will usually not be a newline to match. In those cases where
|
|
there is a newline in a multiple line expression, the dot
|
|
will match the newline.
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
>#!/bin/bash
|
|
|
|
sed -e 'N;s/.*/[&amp;]/' &lt;&lt; EOF # Here Document
|
|
line1
|
|
line2
|
|
EOF
|
|
# OUTPUT:
|
|
# [line1
|
|
# line2]
|
|
|
|
|
|
|
|
echo
|
|
|
|
awk '{ $0=$1 "\n" $2; if (/line.1/) {print}}' &lt;&lt; EOF
|
|
line 1
|
|
line 2
|
|
EOF
|
|
# OUTPUT:
|
|
# line
|
|
# 1
|
|
|
|
|
|
# Thanks, S.C.
|
|
|
|
exit 0</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><DIV
|
|
CLASS="NAVFOOTER"
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"><TABLE
|
|
SUMMARY="Footer navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="regexp.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="index.html"
|
|
ACCESSKEY="H"
|
|
>Home</A
|
|
></TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="globbingref.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
>Regular Expressions</TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="regexp.html"
|
|
ACCESSKEY="U"
|
|
>Up</A
|
|
></TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
>Globbing</TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
></BODY
|
|
></HTML
|
|
>