LDP/LDP/guide/docbook/Bash-Beginners-Guide/chap4.xml

<chapter id="chap_04"><title>Regular expressions</title>

<abstract>
<para>In this chapter we discuss:</para>
<para><itemizedlist>
<listitem><para>Using regular expressions</para></listitem>
<listitem><para>Regular expression metacharacters</para></listitem>
<listitem><para>Finding patterns in files or output</para></listitem>
<listitem><para>Character ranges and classes in Bash</para></listitem>
</itemizedlist></para>
</abstract>
<sect1 id="sect_04_01"><title>Regular expressions</title>
<sect2 id="sect_04_01_01"><title>What are regular expressions?</title>
<para>A <emphasis>regular expression<indexterm><primary>regular expressions</primary><secondary>definition</secondary></indexterm></emphasis> is a pattern that describes a set of strings.  Regular expressions are constructed analogously to arithmetic expressions by using various operators to combine smaller expressions.</para>
<para>The fundamental building blocks are the regular expressions that match a single character.  Most characters, including all letters and digits, are regular expressions that match themselves.  Any metacharacter with special meaning may be quoted by preceding it with a backslash.</para>


</sect2>
<sect2 id="sect_04_01_02"><title>Regular expression metacharacters</title>
<para>A regular expression<indexterm><primary>regular expressions</primary><secondary>metacharacters</secondary></indexterm> may be followed by one of several repetition operators<indexterm><primary>regular expressions</primary><secondary>operators</secondary></indexterm> (metacharacters):</para>
<table id="table_04_01" frame="all"><title>Regular expression operators</title>
<tgroup cols="2" align="left" colsep="1" rowsep="1"><thead>
<row><entry>Operator</entry><entry>Effect</entry></row>
</thead>
<tbody>
<row><entry>.</entry><entry>Matches any single character.</entry></row>
<row><entry>?</entry><entry>The preceding item is optional and will be matched, at most, once.</entry></row>
<row><entry>*</entry><entry>The preceding item will be matched zero or more times.</entry></row>
<row><entry>+</entry><entry>The preceding item will be matched one or more times.</entry></row>
<row><entry>{N}</entry><entry>The preceding item is matched exactly N times.</entry></row>
<row><entry>{N,}</entry><entry>The preceding item is matched N or more times.</entry></row>
<row><entry>{N,M}</entry><entry>The preceding item is matched at least N times, but not more than M times.</entry></row>
<row><entry>-</entry><entry>represents the range if it's not first or last in a list or the ending point of a range in a list.</entry></row>
<row><entry>^</entry><entry>Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.</entry></row>
<row><entry>$</entry><entry>Matches the empty string at the end of a line.</entry></row>
<row><entry>\b</entry><entry>Matches the empty string at the edge of a word.</entry></row>
<row><entry>\B</entry><entry>Matches the empty string provided it's not at the edge of a word.</entry></row>
<row><entry>\&lt;</entry><entry>Match the empty string at the beginning of word.</entry></row>
<row><entry>\&gt;</entry><entry>Match the empty string at the end of word.</entry></row>
</tbody>
</tgroup>
</table>
<para>Two regular expressions<indexterm><primary>regular expressions</primary><secondary>concatenation</secondary></indexterm> may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.</para>
<para>Two regular expressions<indexterm><primary>regular expressions</primary><secondary>joining</secondary></indexterm> may be joined by the infix operator <quote>|</quote>; the resulting regular expression matches any string matching either subexpression.</para>
<para>Repetition<indexterm><primary>regular expressions</primary><secondary>repetition</secondary></indexterm> takes precedence over concatenation, which in turn takes precedence over alternation.  A whole subexpression may be enclosed in parentheses to override these precedence rules.</para>


</sect2>
<sect2 id="sect_04_01_03"><title>Basic versus extended regular expressions</title>
<para>In basic<indexterm><primary>regular expressions</primary><secondary>basic expressions</secondary></indexterm> regular expressions the metacharacters <quote>?</quote>, <quote>+</quote>, <quote>{</quote>, <quote>|</quote>, <quote>(</quote>, and <quote>)</quote> lose their special meaning; instead use the backslashed versions <quote>\?</quote>, <quote>\+</quote>, <quote>\{</quote>, <quote>\|</quote>, <quote>\(</quote>, and <quote>\)</quote>.</para>
<para>Check in your system documentation whether commands using regular expressions support extended expressions.</para>
</sect2>

</sect1>

<sect1 id="sect_04_02"><title>Examples using grep</title>
<sect2 id="sect_04_02_01"><title>What is grep?</title>
<para><command>grep</command> searches<indexterm><primary>regular expressions</primary><secondary>examples</secondary></indexterm> the input files for lines containing a match to a given pattern list.  When it finds a match in a line, it copies the line to standard output (by default), or whatever other sort of output you have requested with options.</para>
<para>Though <command>grep<indexterm><primary>commands</primary><secondary>grep</secondary></indexterm></command> expects to do the matching on text, it has no limits on input line length other than available memory, and it can match arbitrary characters within a line.  If the final byte of an input file is not a <emphasis>newline</emphasis>, <command>grep</command> silently supplies one.  Since newline is also a separator for the list of patterns, there is no way to match newline characters in a text.</para>
<para>Some examples:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>root</parameter> <filename>/etc/passwd</filename></command>
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin

<prompt>cathy ~&gt;</prompt> <command>grep <option>-n</option> <parameter>root</parameter> <filename>/etc/passwd</filename></command>
1:root:x:0:0:root:/root:/bin/bash
12:operator:x:11:0:operator:/root:/sbin/nologin

<prompt>cathy ~&gt;</prompt> <command>grep <option>-v</option> <parameter>bash</parameter> <filename>/etc/passwd </filename></command>| <command>grep <option>-v</option> <parameter>nologin</parameter></command>
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
news:x:9:13:news:/var/spool/news:
mailnull:x:47:47::/var/spool/mqueue:/dev/null
xfs:x:43:43:X Font Server:/etc/X11/fs:/bin/false
rpc:x:32:32:Portmapper RPC user:/:/bin/false
nscd:x:28:28:NSCD Daemon:/:/bin/false
named:x:25:25:Named:/var/named:/bin/false
squid:x:23:23::/var/spool/squid:/dev/null
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
apache:x:48:48:Apache:/var/www:/bin/false

<prompt>cathy ~&gt;</prompt> <command>grep <option>-c</option> <parameter>false</parameter> <filename>/etc/passwd</filename></command>
7

<prompt>cathy ~&gt;</prompt> <command>grep <option>-i</option> <parameter>ps</parameter> <filename>~/.bash*</filename></command> | <command>grep <option>-v</option> <parameter>history</parameter></command>
/home/cathy/.bashrc:PS1="\[\033[1;44m\]$USER is in \w\[\033[0m\] "

</screen>
<para>With the first command, user <emphasis>cathy</emphasis> displays the lines from <filename>/etc/passwd</filename> containing the string <emphasis>root</emphasis>.</para>
<para>Then she displays the line numbers containing this search string.</para>
<para>With the third command she checks which users are not using <command>bash</command>, but accounts with the <command>nologin</command> shell are not displayed.</para>
<para>Then she counts the number of accounts that have <filename>/bin/false</filename> as the shell.</para>
<para>The last command displays the lines from all the files in her home directory starting with <filename>~/.bash</filename>, excluding matches containing the string <emphasis>history</emphasis>, so as to exclude matches from <filename>~/.bash_history</filename> which might contain the same string, in upper or lower cases.  Note that the search is for the <emphasis>string</emphasis> <quote>ps</quote>, and not for the <emphasis>command</emphasis> <command>ps</command>.</para>
<para>Now let's see what else we can do with grep, using regular expressions.</para>
</sect2>
<sect2 id="sect_04_02_02"><title>Grep and regular expressions</title>
<note><title>If you are not on Linux</title><para>We use GNU <command>grep</command> in these examples, which supports extended regular expressions.  GNU <command>grep</command> is the default on Linux systems.  If you are working on proprietary systems, check with the <option>-V</option> option which version you are using.  GNU <command>grep</command> can be downloaded from <ulink url="http://gnu.org/directory/" />.</para></note>
<sect3 id="sect_04_02_02_01"><title>Line and word anchors</title>
<para>From the previous example, we now exclusively want to display lines starting<indexterm><primary>regular expressions</primary><secondary>line anchors</secondary></indexterm> with the string <quote>root</quote>:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>^root</parameter> <filename>/etc/passwd</filename></command>
root:x:0:0:root:/root:/bin/bash
</screen>
<para>If we want to see which accounts have no shell assigned whatsoever, we search for lines ending in <quote>:</quote>:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>:$</parameter> <filename>/etc/passwd</filename></command>
news:x:9:13:news:/var/spool/news:
</screen>
<para>To check that <varname>PATH</varname> is exported in <filename>~/.bashrc</filename>, first select <quote>export</quote> lines and then search for lines starting with the string <quote>PATH</quote>, so as not to display <varname>MANPATH</varname> and other possible<indexterm><primary>regular expressions</primary><secondary>word anchors</secondary></indexterm> paths:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>export</parameter> <filename>~/.bashrc</filename></command> | <command>grep <parameter>'\&lt;PATH'</parameter></command>
  export PATH="/bin:/usr/lib/mh:/lib:/usr/bin:/usr/local/bin:/usr/ucb:/usr/dbin:$PATH"
</screen>
<para>Similarly, <emphasis>\&gt;</emphasis> matches the end of a word.</para>
<para>If you want to find a string that is a separate word (enclosed by spaces), it is better use the <option>-w</option>, as in this example where we are displaying information for the root partition:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <option>-w</option> <parameter>/</parameter> <filename>/etc/fstab</filename></command>
LABEL=/                 /                       ext3    defaults        1 1
</screen>
<para>If this option is not used, all the lines from the file system table will be displayed.</para>
</sect3>

<sect3 id="sect_04_02_02_02"><title>Character classes</title>
<para>A <emphasis>bracket expression<indexterm><primary>regular expressions</primary><secondary>bracket expressions</secondary></indexterm></emphasis> is a list of characters enclosed by <quote>[</quote> and <quote>]</quote>.  It matches any single character<indexterm><primary>regular expressions</primary><secondary>character classes</secondary></indexterm> in that list; if the first character of the list is the caret, <quote>^</quote>, then it matches any character NOT in the list.  For example, the regular expression <quote>[0123456789]</quote> matches any single digit.</para>
<para>Within a bracket expression, a <emphasis>range expression<indexterm><primary>regular expressions</primary><secondary>ranges</secondary></indexterm></emphasis> consists of two characters separated by a hyphen.  It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set.  For example, in the default C locale, <quote>[a-d]</quote> is equivalent to <quote>[abcd]</quote>.  Many locales sort characters in dictionary order, and in these locales <quote>[a-d]</quote> is typically not equivalent to <quote>[abcd]</quote>; it might be equivalent to <quote>[aBbCcDd]</quote>, for example.  To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the <varname>LC_ALL</varname> environment variable to the value <quote>C</quote>.</para>
<para>Finally, certain named classes of characters are predefined within bracket expressions.  See the <command>grep</command> man or info pages for more information about these predefined expressions.</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>[yf]</parameter> <filename>/etc/group</filename></command>
sys:x:3:root,bin,adm
tty:x:5:
mail:x:12:mail,postfix
ftp:x:50:
nobody:x:99:
floppy:x:19:
xfs:x:43:
nfsnobody:x:65534:
postfix:x:89:
</screen>
<para>In the example, all the lines containing either a <quote>y</quote> or <quote>f</quote> character are displayed.</para>
</sect3>

<sect3 id="sect_04_02_02_04"><title>Wildcards</title>
<para>Use the <quote>.</quote> for a single<indexterm><primary>regular expressions</primary><secondary>wildcards</secondary></indexterm> character match.  If you want to get a list of all five-character English dictionary words starting with <quote>c</quote> and ending in <quote>h</quote> (handy for solving crosswords):</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>'\&lt;c...h\&gt;'</parameter> <filename>/usr/share/dict/words</filename></command>
catch
clash
cloth
coach
couch
cough
crash
crush
</screen>
<para>If you want to display lines containing the literal dot character, use the <option>-F</option> option to <command>grep</command>.</para>
<para>For matching multiple characters, use the asterisk.  This example selects all words starting with <quote>c</quote> and ending in <quote>h</quote> from the system's dictionary:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>'\&lt;c.*h\&gt;'</parameter> <filename>/usr/share/dict/words</filename></command>
caliph
cash
catch
cheesecloth
cheetah
--output omitted--
</screen>
<para>If you want to find the literal asterisk character in a file or output, use single quotes.  Cathy in the example below first tries finding the asterisk character in <filename>/etc/profile</filename> without using quotes, which does not return any lines.  Using quotes, output is generated:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>grep <parameter>*</parameter> <filename>/etc/profile</filename></command>

<prompt>cathy ~&gt;</prompt> <command>grep <parameter>'*'</parameter> <filename>/etc/profile</filename></command>
for i in /etc/profile.d/*.sh ; do
</screen>
</sect3>
</sect2>

</sect1>
<sect1 id="sect_04_03"><title>Pattern matching using Bash features</title>
<sect2 id="sect_04_03_01"><title>Character ranges</title>
<para>Apart from <command>grep</command> and regular expressions, there's a good deal of pattern matching<indexterm><primary>features</primary><secondary>pattern matching</secondary></indexterm> that you can do directly in the shell, without having to use an external program.</para>
<para>As you already know, the asterisk (*) and the question mark (?) match any string<indexterm><primary>pattern matching</primary><secondary>using Bash</secondary></indexterm> or any single character, respectively.  Quote these special characters to match them literally:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>touch <filename>"*"</filename></command>

<prompt>cathy ~&gt;</prompt> <command>ls <filename>"*"</filename></command>
*
</screen>
<para>But you can also use the square braces to match any enclosed character or range of characters, if pairs of characters are separated by a hyphen.  An example:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>ls <option>-ld</option> <filename>[a-cx-z]*</filename></command>
drwxr-xr-x    2 cathy	 cathy		4096 Jul 20  2002 app-defaults/
drwxrwxr-x    4 cathy    cathy          4096 May 25  2002 arabic/
drwxrwxr-x    2 cathy    cathy          4096 Mar  4 18:30 bin/
drwxr-xr-x    7 cathy    cathy          4096 Sep  2  2001 crossover/
drwxrwxr-x    3 cathy    cathy          4096 Mar 22  2002 xml/
</screen>
<para>This lists all files in <emphasis>cathy</emphasis>'s home directory, starting with <quote>a</quote>, <quote>b</quote>, <quote>c</quote>, <quote>x</quote>, <quote>y</quote> or <quote>z</quote>.</para>
<para>If the first character within the braces is <quote>!</quote> or <quote>^</quote>, any character not enclosed will be matched.  To match the dash (<quote>-</quote>), include it as the first or last character in the set.  The sorting depends on the current locale and of the value of the <varname>LC_COLLATE<indexterm><primary>variables</primary><secondary>LC_COLLATE</secondary></indexterm></varname> variable, if it is set.  Mind that other locales might interpret <quote>[a-cx-z]</quote> as <quote>[aBbCcXxYyZz]</quote> if sorting is done in dictionary order.  If you want to be sure to have the traditional interpretation of ranges, force this behavior by setting <varname>LC_COLLATE</varname> or <varname>LC_ALL<indexterm><primary>variables</primary><secondary>LC_ALL</secondary></indexterm></varname> to <quote>C</quote>.</para>
</sect2>
<sect2 id="sect_04_03_02"><title>Character classes</title>
<para>Character classes<indexterm><primary>pattern matching</primary><secondary>character classes</secondary></indexterm> can be specified within the square braces, using the syntax <command>[:CLASS:]</command>, where CLASS<indexterm><primary>character classes</primary><secondary>types</secondary></indexterm> is defined in the POSIX standard and has one of the values</para>
<para><quote>alnum</quote>, <quote>alpha</quote>, <quote>ascii</quote>, <quote>blank</quote>, <quote>cntrl</quote>, <quote>digit</quote>, <quote>graph</quote>, <quote>lower</quote>, <quote>print</quote>, <quote>punct</quote>, <quote>space</quote>, <quote>upper</quote>, <quote>word</quote> or <quote>xdigit</quote>.</para>
<para>Some examples:</para>
<screen>
<prompt>cathy ~&gt;</prompt> <command>ls <option>-ld</option> <filename>[[:digit:]]*</filename></command>
drwxrwxr-x    2 cathy	cathy		4096 Apr 20 13:45 2/

<prompt>cathy ~&gt;</prompt> <command>ls <option>-ld</option> <filename>[[:upper:]]*</filename></command>
drwxrwxr--    3 cathy   cathy           4096 Sep 30  2001 Nautilus/
drwxrwxr-x    4 cathy   cathy           4096 Jul 11  2002 OpenOffice.org1.0/
-rw-rw-r--    1 cathy   cathy         997376 Apr 18 15:39 Schedule.sdc
</screen>
<para>When the <option>extglob<indexterm><primary>options</primary><secondary>extglob</secondary></indexterm></option> shell option is enabled (using the <command>shopt</command> built-in), several extended pattern matching operators are recognized.  Read more in the Bash info pages, section <menuchoice><guimenu>Basic shell features</guimenu><guisubmenu>Shell Expansions</guisubmenu><guisubmenu>Filename Expansion</guisubmenu><guimenuitem>Pattern Matching</guimenuitem></menuchoice>.</para>
</sect2>


</sect1>
<sect1 id="sect_04_04"><title>Summary</title>
<para>Regular expressions are powerful tools for selecting particular lines from files or output.  A lot of UNIX commands use regular expressions: <command>vim</command>, <command>perl</command>, the <application>PostgreSQL</application> database and so on.  They can be made available in any language or application using external libraries, and they even found their way to non-UNIX systems.  For instance, regular expressions are used in the Excell spreadsheet that comes with the MicroSoft Windows Office suite.  In this chapter we got the feel of the <command>grep</command> command, which is indispensable in any UNIX environment.</para>
<note><para>The <command>grep</command> command can do much more than the few tasks we discussed here; we only used it as an example for regular expressions.  The GNU <command>grep</command> version comes with plenty of documentation, which you are strongly advised to read!</para></note>
<para>Bash has built-in features for matching patterns and can recognize character classes and ranges.</para>

</sect1>
<sect1 id="sect_04_05"><title>Exercises</title>
<para>These exercises will help you master regular expressions.</para>

<orderedlist>
<listitem><para>Display a list of all the users on your system who log in with the Bash shell as a default.</para></listitem>
<listitem><para>From the <filename>/etc/group</filename> directory, display all lines starting with the string <quote>daemon</quote>.</para></listitem>
<listitem><para>Print all the lines from the same file that don't contain the string.</para></listitem>
<listitem><para>Display localhost information from the <filename>/etc/hosts</filename> file, display the line number(s) matching the search string and count the number of occurrences of the string.</para></listitem>
<listitem><para>Display a list of <filename>/usr/share/doc</filename> subdirectories containing information about shells.</para></listitem>
<listitem><para>How many <filename>README</filename> files do these subdirectories contain?  Don't count anything in the form of <quote>README.a_string</quote>.</para></listitem>
<listitem><para>Make a list of files in your home directory that were changed less that 10 hours ago, using <command>grep</command>, but leave out directories.</para></listitem>
<listitem><para>Put these commands in a shell script that will generate comprehensible output.</para></listitem>
<listitem><para>Can you find an alternative for <command>wc <option>-l</option></command>, using <command>grep</command>?</para></listitem>
<listitem><para>Using the file system table (<filename>/etc/fstab</filename> for instance), list local disk devices.</para></listitem>
<listitem><para>Make a script that checks whether a user exists in <filename>/etc/passwd</filename>.  For now, you can specify the user name in the script, you don't have to work with arguments and conditionals at this stage.</para></listitem>
<listitem><para>Display configuration files in <filename>/etc</filename> that contain numbers in their names.</para></listitem>
</orderedlist>

</sect1>

</chapter>