.\" Copyright (c) 2001-2003 The Open Group, All Rights Reserved
.TH "LEX" P 2003 "IEEE/The Open Group" "POSIX Programmer's Manual"
.\" lex
.SH NAME
lex \- generate programs for lexical tasks (\fBDEVELOPMENT\fP)
.SH SYNOPSIS
.LP
\fBlex\fP \fB[\fP\fB-t\fP\fB][\fP\fB-n|-v\fP\fB][\fP\fIfile\fP \fB...\fP\fB]\fP
.SH DESCRIPTION
.LP
The \fIlex\fP utility shall generate C programs to be used in lexical
processing of character input, and that can be used as an
interface to \fIyacc\fP. The C programs shall be generated from \fIlex\fP
source code and
conform to the ISO\ C standard. Usually, the \fIlex\fP utility shall
write the program it generates to the file
\fBlex.yy.c\fP; the state of this file is unspecified if \fIlex\fP
exits with a non-zero exit status. See the EXTENDED
DESCRIPTION section for a complete description of the \fIlex\fP input
language.
.SH OPTIONS
.LP
The \fIlex\fP utility shall conform to the Base Definitions volume
of IEEE\ Std\ 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
.LP
The following options shall be supported:
.TP 7
\fB-n\fP
Suppress the summary of statistics usually written with the \fB-v\fP
option. If no table sizes are specified in the \fIlex\fP
source code and the \fB-v\fP option is not specified, then \fB-n\fP
is implied.
.TP 7
\fB-t\fP
Write the resulting program to standard output instead of \fBlex.yy.c\fP.
.TP 7
\fB-v\fP
Write a summary of \fIlex\fP statistics to the standard output. (See
the discussion of \fIlex\fP table sizes in Definitions in lex.) If
the \fB-t\fP option is specified and \fB-n\fP is not specified, this
report shall
be written to standard error. If table sizes are specified in the
\fIlex\fP source code, and if the \fB-n\fP option is not
specified, the \fB-v\fP option may be enabled.
.sp
.SH OPERANDS
.LP
The following operand shall be supported:
.TP 7
\fIfile\fP
A pathname of an input file. If more than one such \fIfile\fP is specified,
all files shall be concatenated to produce a
single \fIlex\fP program. If no \fIfile\fP operands are specified,
or if a \fIfile\fP operand is \fB'-'\fP, the standard
input shall be used.
.sp
.SH STDIN
.LP
The standard input shall be used if no \fIfile\fP operands are specified,
or if a \fIfile\fP operand is \fB'-'\fP. See
INPUT FILES.
.SH INPUT FILES
.LP
The input files shall be text files containing \fIlex\fP source code,
as described in the EXTENDED DESCRIPTION section.
.SH ENVIRONMENT VARIABLES
.LP
The following environment variables shall affect the execution of
\fIlex\fP:
.TP 7
\fILANG\fP
Provide a default value for the internationalization variables that
are unset or null. (See the Base Definitions volume of
IEEE\ Std\ 1003.1-2001, Section 8.2, Internationalization Variables
for
the precedence of internationalization variables used to determine
the values of locale categories.)
.TP 7
\fILC_ALL\fP
If set to a non-empty string value, override the values of all the
other internationalization variables.
.TP 7
\fILC_COLLATE\fP
.sp
Determine the locale for the behavior of ranges, equivalence classes,
and multi-character collating elements within regular
expressions. If this variable is not set to the POSIX locale, the
results are unspecified.
.TP 7
\fILC_CTYPE\fP
Determine the locale for the interpretation of sequences of bytes
of text data as characters (for example, single-byte as
opposed to multi-byte characters in arguments and input files), and
the behavior of character classes within regular expressions.
If this variable is not set to the POSIX locale, the results are unspecified.
.TP 7
\fILC_MESSAGES\fP
Determine the locale that should be used to affect the format and
contents of diagnostic messages written to standard
error.
.TP 7
\fINLSPATH\fP
Determine the location of message catalogs for the processing of \fILC_MESSAGES\fP.
.sp
.SH ASYNCHRONOUS EVENTS
.LP
Default.
.SH STDOUT
.LP
If the \fB-t\fP option is specified, the text file of C source code
output of \fIlex\fP shall be written to standard
output.
.LP
If the \fB-t\fP option is not specified:
.IP " *" 3
Implementation-defined informational, error, and warning messages
concerning the contents of \fIlex\fP source code input shall
be written to either the standard output or standard error.
.LP
.IP " *" 3
If the \fB-v\fP option is specified and the \fB-n\fP option is not
specified, \fIlex\fP statistics shall also be written to
either the standard output or standard error, in an implementation-defined
format. These statistics may also be generated if table
sizes are specified with a \fB'%'\fP operator in the \fIDefinitions\fP
section, as long as the \fB-n\fP option is not
specified.
.LP
.SH STDERR
.LP
If the \fB-t\fP option is specified, implementation-defined informational,
error, and warning messages concerning the contents
of \fIlex\fP source code input shall be written to the standard error.
.LP
If the \fB-t\fP option is not specified:
.IP " 1." 4
Implementation-defined informational, error, and warning messages
concerning the contents of \fIlex\fP source code input shall
be written to either the standard output or standard error.
.LP
.IP " 2." 4
If the \fB-v\fP option is specified and the \fB-n\fP option is not
specified, \fIlex\fP statistics shall also be written to
either the standard output or standard error, in an implementation-defined
format. These statistics may also be generated if table
sizes are specified with a \fB'%'\fP operator in the \fIDefinitions\fP
section, as long as the \fB-n\fP option is not
specified.
.LP
.SH OUTPUT FILES
.LP
A text file containing C source code shall be written to \fBlex.yy.c\fP,
or to the standard output if the \fB-t\fP option is
present.
.SH EXTENDED DESCRIPTION
.LP
Each input file shall contain \fIlex\fP source code, which is a table
of regular expressions with corresponding actions in the
form of C program fragments.
.LP
When \fBlex.yy.c\fP is compiled and linked with the \fIlex\fP library
(using the \fB-l\ l\fP operand with \fIc99\fP), the resulting program
shall read character input from the standard input and shall
partition it into strings that match the given expressions.
.LP
When an expression is matched, these actions shall occur:
.IP " *" 3
The input string that was matched shall be left in \fIyytext\fP as
a null-terminated string; \fIyytext\fP shall either be an
external character array or a pointer to a character string. As explained
in Definitions in lex,
the type can be explicitly selected using the \fB%array\fP or \fB%pointer\fP
declarations, but the default is
implementation-defined.
.LP
.IP " *" 3
The external \fBint\fP \fIyyleng\fP shall be set to the length of
the matching string.
.LP
.IP " *" 3
The expression's corresponding program fragment, or action, shall
be executed.
.LP
.LP
During pattern matching, \fIlex\fP shall search the set of patterns
for the single longest possible match. Among rules that
match the same number of characters, the rule given first shall be
chosen.
.LP
The general format of \fIlex\fP source shall be:
.sp
.RS
.nf

\fIDefinitions\fP
\fB%%\fP
\fIRules\fP
\fB%%\fP
\fIUser Subroutines\fP
.fi
.RE
.LP
The first \fB"%%"\fP is required to mark the beginning of the rules
(regular expressions and actions); the second
\fB"%%"\fP is required only if user subroutines follow.
.LP
Any line in the \fIDefinitions\fP section beginning with a <blank>
shall be assumed to be a C program fragment and shall
be copied to the external definition area of the \fBlex.yy.c\fP file.
Similarly, anything in the \fIDefinitions\fP section
included between delimiter lines containing only \fB"%{"\fP and \fB"%}"\fP
shall also be copied unchanged to the external
definition area of the \fBlex.yy.c\fP file.
.LP
Any such input (beginning with a <blank> or within \fB"%{"\fP and
\fB"%}"\fP delimiter lines) appearing at the
beginning of the \fIRules\fP section before any rules are specified
shall be written to \fBlex.yy.c\fP after the declarations of
variables for the \fIyylex\fP() function and before the first line
of code in \fIyylex\fP(). Thus, user variables local to
\fIyylex\fP() can be declared here, as well as application code to
execute upon entry to \fIyylex\fP().
.LP
The action taken by \fIlex\fP when encountering any input beginning
with a <blank> or within \fB"%{"\fP and
\fB"%}"\fP delimiter lines appearing in the \fIRules\fP section but
coming after one or more rules is undefined. The presence
of such input may result in an erroneous definition of the \fIyylex\fP()
function.
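.LP
As an informal illustration that is not part of the standard text, the
following minimal sketch shows where each kind of input described above
is placed. The names \fIentries\fP and \fIlines\fP are hypothetical and
chosen only for this example; the sketch assumes the default
\fIyywrap\fP() from the \fIlex\fP library.
.sp
.RS
.nf

\fB%{
/* Copied unchanged to the external definition area of lex.yy.c. */
#include <stdio.h>
static int entries;   /* hypothetical: number of calls to yylex() */
static int lines;     /* hypothetical: number of <newline>s seen */
%}
%%
    entries++;        /* begins with a <blank>: runs on entry to yylex() */
\\n      { lines++; ECHO; }
%%
int main(void)
{
    yylex();
    fprintf(stderr, "%d lines, %d calls to yylex()\\n", lines, entries);
    return 0;
}
\fP
.fi
.RE
.LP
All other input is handled by the default action, so the generated
program copies its input to the output while counting lines.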
.SS Definitions in lex
.LP
\fIDefinitions\fP appear before the first \fB"%%"\fP delimiter. Any
line in this section not contained between \fB"%{"\fP
and \fB"%}"\fP lines and not beginning with a <blank> shall be assumed
to define a \fIlex\fP substitution string. The
format of these lines shall be:
.sp
.RS
.nf

\fIname substitute\fP
.fi
.RE
.LP
If a \fIname\fP does not meet the requirements for identifiers in
the ISO\ C standard, the result is undefined. The string
\fIsubstitute\fP shall replace the string {\fIname\fP} when it is
used in a rule. The \fIname\fP string shall be recognized in
this context only when the braces are provided and when it does not
appear within a bracket expression or within double-quotes.
.LP
In the \fIDefinitions\fP section, any line beginning with a \fB'%'\fP
(percent sign) character and followed by an
alphanumeric word beginning with either \fB's'\fP or \fB'S'\fP shall
define a set of start conditions. Any line beginning
with a \fB'%'\fP followed by a word beginning with either \fB'x'\fP
or \fB'X'\fP shall define a set of exclusive start
conditions. When the generated scanner is in a \fB%s\fP state, patterns
with no state specified shall also be active; in a
\fB%x\fP state, such patterns shall not be active. The rest of the
line, after the first word, shall be considered to be one or
more <blank>-separated names of start conditions. Start condition
names shall be constructed in the same way as definition
names. Start conditions can be used to restrict the matching of regular
expressions to one or more states as described in Regular Expressions
in lex.
.LP
Implementations shall accept either of the following two mutually-exclusive
declarations in the \fIDefinitions\fP section:
.TP 7
\fB%array\fP
Declare the type of \fIyytext\fP to be a null-terminated character
array.
.TP 7
\fB%pointer\fP
Declare the type of \fIyytext\fP to be a pointer to a null-terminated
character string.
.sp
.LP
The default type of \fIyytext\fP is implementation-defined. If an
application refers to \fIyytext\fP outside of the scanner
source file (that is, via an \fBextern\fP), the application shall
include the appropriate \fB%array\fP or \fB%pointer\fP
declaration in the scanner source file.
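.LP
The declarations described so far can be combined as in the following
sketch, which is an illustration only and not part of the standard
text. The substitution names \fBDIGIT\fP and \fBEXP\fP are hypothetical.
.sp
.RS
.nf

\fB%{
#include <stdio.h>
%}
%pointer
DIGIT    [0-9]
EXP      [Ee][+-]?[0-9]+
%%
{DIGIT}+"."{DIGIT}*{EXP}?    printf("real: %s\\n", yytext);
"{DIGIT}"                    printf("the literal text {DIGIT}\\n");
\fP
.fi
.RE
.LP
In the first rule each {\fIname\fP} is replaced by its substitute; in
the second rule the braces appear within double-quotes, so no
substitution occurs. Because \fB%pointer\fP is given, another source
file in this sketch could refer to the matched text with the
declaration \fBextern char *yytext;\fP.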
.LP
Implementations shall accept declarations in the \fIDefinitions\fP
section for setting certain internal table sizes. The
declarations are shown in the following table.
.sp
.ce 1
\fBTable: Table Size Declarations in \fIlex\fP\fP
.TS C
center tab(@); l2 l2 l.
\fBDeclaration\fP@\fBDescription\fP@\fBMinimum Value\fP
%\fBp\fP \fIn\fP@Number of positions@2500
%\fBn\fP \fIn\fP@Number of states@500
%\fBa\fP \fIn\fP@Number of transitions@2000
%\fBe\fP \fIn\fP@Number of parse tree nodes@1000
%\fBk\fP \fIn\fP@Number of packed character classes@1000
%\fBo\fP \fIn\fP@Size of the output array@3000
.TE
.LP
In the table, \fIn\fP represents a positive decimal integer, preceded
by one or more <blank>s. The exact meaning of these
table size numbers is implementation-defined. The implementation shall
document how these numbers affect the \fIlex\fP utility and
how they are related to any output that may be generated by the implementation
should limitations be encountered during the
execution of \fIlex\fP. It shall be possible to determine from this
output which of the table size values needs to be modified to
permit \fIlex\fP to successfully generate tables for the input language.
The values in the column Minimum Value represent the
lowest values conforming implementations shall provide.
.SS Rules in lex
.LP
The rules in \fIlex\fP source files are a table in which the left
column contains regular expressions and the right column
contains actions (C program fragments) to be executed when the expressions
are recognized.
.sp
.RS
.nf

\fIERE action
ERE action\fP\fB...
\fP
.fi
.RE
.LP
The extended regular expression (ERE) portion of a row shall be separated
from \fIaction\fP by one or more <blank>s. A
regular expression containing <blank>s shall be recognized under one
of the following conditions:
.IP " *" 3
The entire expression appears within double-quotes.
.LP
.IP " *" 3
The <blank>s appear within double-quotes or square brackets.
.LP
.IP " *" 3
Each <blank> is preceded by a backslash character.
.LP
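.LP
The three conditions above can be sketched as follows. This is an
illustration only, not part of the standard text, and the patterns are
arbitrary examples.
.sp
.RS
.nf

\fB%{
#include <stdio.h>
%}
%%
"hello world"    printf("entire expression quoted\\n");
hello[ ]there    printf("<blank> within square brackets\\n");
good\\ bye        printf("<blank> preceded by a backslash\\n");
\fP
.fi
.RE
.LP
In each rule the first unprotected <blank> separates the ERE from its
action.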
.SS User Subroutines in lex
.LP
Anything in the user subroutines section shall be copied to \fBlex.yy.c\fP
following \fIyylex\fP().
.SS Regular Expressions in lex
.LP
The \fIlex\fP utility shall support the set of extended regular expressions
(see the Base Definitions volume of
IEEE\ Std\ 1003.1-2001, Section 9.4, Extended Regular Expressions),
with the following additions and exceptions to the syntax:
.TP 7
\fB"..."\fP
Any string enclosed in double-quotes shall represent the characters
within the double-quotes as themselves, except that
backslash escapes (which appear in the following table) shall be recognized.
Any backslash-escape sequence shall be terminated by
the closing quote. For example, \fB"\\01""1"\fP represents
a single string: the octal value 1 followed by the character
\fB'1'\fP.
.TP 7
<\fIstate\fP>\fIr\fP,\ <\fIstate1,state2,\fP...>\fIr\fP
.sp
The regular expression \fIr\fP shall be matched only when the program
is in one of the start conditions indicated by \fIstate\fP,
\fIstate1\fP, and so on; see Actions in lex. (As an exception to
the typographical conventions of
the rest of this volume of IEEE\ Std\ 1003.1-2001, in this case <\fIstate\fP>
does not represent a metavariable, but
the literal angle-bracket characters surrounding a symbol.) The start
condition shall be recognized as such only at the beginning
of a regular expression.
.TP 7
\fIr\fP/\fIx\fP
The regular expression \fIr\fP shall be matched only if it is followed
by an occurrence of regular expression \fIx\fP
(\fIx\fP is the instance of trailing context, further defined below).
The token returned in \fIyytext\fP shall only match
\fIr\fP. If the trailing portion of \fIr\fP matches the beginning
of \fIx\fP, the result is unspecified. The \fIr\fP expression
cannot include further trailing context or the \fB'$'\fP (match-end-of-line)
operator; \fIx\fP cannot include the \fB'^'\fP
(match-beginning-of-line) operator, nor trailing context, nor the
\fB'$'\fP operator. That is, only one occurrence of trailing
context is allowed in a \fIlex\fP regular expression, and the \fB'^'\fP
operator only can be used at the beginning of such an
expression.
.TP 7
{\fIname\fP}
When \fIname\fP is one of the substitution symbols from the \fIDefinitions\fP
section, the string, including the enclosing
braces, shall be replaced by the \fIsubstitute\fP value. The \fIsubstitute\fP
value shall be treated in the extended regular
expression as if it were enclosed in parentheses. No substitution
shall occur if {\fIname\fP} occurs within a bracket expression
or within double-quotes.
.sp
.LP
Within an ERE, a backslash character shall be considered to begin
an escape sequence as specified in the table in the Base
Definitions volume of IEEE\ Std\ 1003.1-2001, Chapter 5, File Format
Notation
(\fB'\\\\'\fP, \fB'\\a'\fP, \fB'\\b'\fP, \fB'\\f'\fP, \fB'\\n'\fP,
\fB'\\r'\fP, \fB'\\t'\fP, \fB'\\v'\fP). In
addition, the escape sequences in the following table shall be recognized.
.LP
A literal <newline> cannot occur within an ERE; the escape sequence
\fB'\\n'\fP can be used to represent a
<newline>. A <newline> shall not be matched by a period operator.
.br
.sp
.ce 1
\fBTable: Escape Sequences in \fIlex\fP\fP
.TS C
center tab(@); l1 lw(30)1 lw(30).
\fBEscape\fP@T{
.na
\fB\ \fP
.ad
T}@T{
.na
\fB\ \fP
.ad
T}
\fBSequence\fP@T{
.na
\fBDescription\fP
.ad
T}@T{
.na
\fBMeaning\fP
.ad
T}
\\\fIdigits\fP@T{
.na
A backslash character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.
.ad
T}@T{
.na
The character whose encoding is represented by the one, two, or three-digit octal integer. If the size of a byte on the system is greater than nine bits, the valid escape sequence used to represent a byte is implementation-defined. Multi-byte characters require multiple, concatenated escape sequences of this type, including the leading \fB'\\'\fP for each byte.
.ad
T}
\\x\fIdigits\fP@T{
.na
A backslash character followed by the longest sequence of hexadecimal-digit characters (0123456789abcdefABCDEF). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.
.ad
T}@T{
.na
The character whose encoding is represented by the hexadecimal integer.
.ad
T}
\\c@T{
.na
A backslash character followed by any character not described in this table or in the table in the Base Definitions volume of IEEE\ Std\ 1003.1-2001, Chapter 5, File Format Notation (\fB'\\\\'\fP, \fB'\\a'\fP, \fB'\\b'\fP, \fB'\\f'\fP, \fB'\\n'\fP, \fB'\\r'\fP, \fB'\\t'\fP, \fB'\\v'\fP).
.ad
T}@T{
.na
The character \fB'c'\fP, unchanged.
.ad
T}
.TE
.TP 7
\fBNote:\fP
If a \fB'\\x'\fP sequence needs to be immediately followed by a hexadecimal
digit character, a sequence such as
\fB"\\x1""1"\fP can be used, which represents a character containing
the value 1, followed by the character
\fB'1'\fP.
.sp
.LP
The order of precedence given to extended regular expressions for
\fIlex\fP differs from that specified in the Base Definitions
volume of IEEE\ Std\ 1003.1-2001, Section 9.4, Extended Regular
Expressions. The order of precedence for \fIlex\fP shall be as shown
in the following table, from high to low.
.TP 7
\fBNote:\fP
The escaped characters entry is not meant to imply that these are
operators, but they are included in the table to show their
relationships to the true operators. The start condition, trailing
context, and anchoring notations have been omitted from the
table because of the placement restrictions described in this section;
they can only appear at the beginning or ending of an
ERE.
.sp
.sp
.sp
.ce 1
\fBTable: ERE Precedence in \fIlex\fP\fP
.TS C
center tab(@); l2 l.
\fBExtended Regular Expression\fP@\fBPrecedence\fP
collation-related bracket symbols@[= =] [: :] [. .]
escaped characters@\\<\fIspecial character\fP>
bracket expression@[ ]
quoting@"..."
grouping@( )
definition@{\fIname\fP}
single-character RE duplication@* + ?
concatenation@
interval expression@{m,n}
alternation@|
.TE
.LP
The ERE anchoring operators \fB'^'\fP and \fB'$'\fP do not appear
in the table. With \fIlex\fP regular expressions, these
operators are restricted in their use: the \fB'^'\fP operator can
only be used at the beginning of an entire regular expression,
and the \fB'$'\fP operator only at the end. The operators apply to
the entire regular expression. Thus, for example, the pattern
\fB"(^abc)|(def$)"\fP is undefined; it can instead be written as two
separate rules, one with the regular expression
\fB"^abc"\fP and one with \fB"def$"\fP, which share a common action
via the special \fB'|'\fP action (see below). If the
pattern were written \fB"^abc|def$"\fP, it would match either \fB"abc"\fP
or \fB"def"\fP on a line by itself.
.LP
Unlike the general ERE rules, embedded anchoring is not allowed by
most historical \fIlex\fP implementations. An example of
embedded anchoring would be for patterns such as \fB"(^|\ )foo(\ |$)"\fP
to match \fB"foo"\fP when it exists as a
complete word. This functionality can be obtained using existing \fIlex\fP
features:
.sp
.RS
.nf

\fB^foo/[ \\n]      |
" foo"/[ \\n]    /* Found foo as a separate word. */
\fP
.fi
.RE
.LP
Note also that \fB'$'\fP is a form of trailing context (it is equivalent
to \fB"/\\n"\fP) and as such cannot be used with
regular expressions containing another instance of the operator (see
the preceding discussion of trailing context).
.LP
The additional regular expressions trailing-context operator \fB'/'\fP
can be used as an ordinary character if presented
within double-quotes, \fB"/"\fP; preceded by a backslash, \fB"\\/"\fP;
or within a bracket expression, \fB"[/]"\fP. The
start-condition \fB'<'\fP and \fB'>'\fP operators shall be special
only in a start condition at the beginning of a
regular expression; elsewhere in the regular expression they shall
be treated as ordinary characters.
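.LP
The difference between the trailing-context operator and an ordinary
slash can be sketched as follows. This is an illustration only, not
part of the standard text; the patterns and messages are arbitrary.
.sp
.RS
.nf

\fB%{
#include <stdio.h>
%}
%%
[0-9]+/"px"        printf("pixels: %s\\n", yytext);
[0-9]+"/"[0-9]+    printf("fraction: %s\\n", yytext);
[0-9]+             printf("number: %s\\n", yytext);
\fP
.fi
.RE
.LP
For the input \fB12px\fP the first rule is selected and \fIyytext\fP
holds only \fB12\fP; the characters \fBpx\fP stay in the input and are
scanned next. In the second rule the slash is within double-quotes, so
it is an ordinary character and becomes part of \fIyytext\fP.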
.SS Actions in lex
.LP
The action to be taken when an ERE is matched can be a C program fragment
or the special actions described below; the program
fragment can contain one or more C statements, and can also include
special actions. The empty C statement \fB';'\fP shall be a
valid action; any string in the \fBlex.yy.c\fP input that matches
the pattern portion of such a rule is effectively ignored or
skipped. However, the absence of an action shall not be valid, and
the action \fIlex\fP takes in such a condition is
undefined.
.LP
The specification for an action, including C statements and special
actions, can extend across several lines if enclosed in
braces:
.sp
.RS
.nf

\fIERE\fP \fB<\fP\fIone or more blanks\fP\fB> {\fP \fIprogram statement
                           program statement\fP \fB}
\fP
.fi
.RE
.LP
The default action when a string in the input to a \fBlex.yy.c\fP
program is not matched by any expression shall be to copy the
string to the output. Because the default behavior of a program generated
by \fIlex\fP is to read the input and copy it to the
output, a minimal \fIlex\fP source program that has just \fB"%%"\fP
shall generate a C program that simply copies the input to
the output unchanged.
.LP
Four special actions shall be available:
.sp
.RS
.nf

\fB|   ECHO;   REJECT;   BEGIN
\fP
.fi
.RE
.TP 7
\fB|\fP
The action \fB'|'\fP means that the action for the next rule is the
action for this rule. Unlike the other three actions,
\fB'|'\fP cannot be enclosed in braces or be semicolon-terminated;
the application shall ensure that it is specified alone, with
no other actions.
.TP 7
\fBECHO;\fP
Write the contents of the string \fIyytext\fP on the output.
.TP 7
\fBREJECT;\fP
Usually only a single expression is matched by a given string in the
input. \fBREJECT\fP means "continue to the next
expression that matches the current input", and shall cause whatever
rule was the second choice after the current rule to be
executed for the same input. Thus, multiple rules can be matched and
executed for one input string or overlapping input strings.
For example, given the regular expressions \fB"xyz"\fP and \fB"xy"\fP
and the input \fB"xyz"\fP, usually only the regular
expression \fB"xyz"\fP would match. The next attempted match would
start after \fBz\fP. If the last action in the
\fB"xyz"\fP rule is \fBREJECT\fP, both this rule and the \fB"xy"\fP
rule would be executed. The \fBREJECT\fP action may be
implemented in such a fashion that flow of control does not continue
after it, as if it were equivalent to a \fBgoto\fP to another
part of \fIyylex\fP(). The use of \fBREJECT\fP may result in somewhat
larger and slower scanners.
.TP 7
\fBBEGIN\fP
The action:
.sp
.RS
.nf

\fBBEGIN\fP \fInewstate\fP\fB;
\fP
.fi
.RE
.LP
switches the state (start condition) to \fInewstate\fP. If the string
\fInewstate\fP has not been declared previously as a
start condition in the \fIDefinitions\fP section, the results are
unspecified. The initial state is indicated by the digit
\fB'0'\fP or the token \fBINITIAL\fP.
.sp
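.LP
The \fB'|'\fP and \fBBEGIN\fP actions can be combined with an exclusive
start condition as in the following sketch, which is an illustration
only and not part of the standard text. The start condition name
\fBCOMMENT\fP is hypothetical.
.sp
.RS
.nf

\fB%x COMMENT
%%
"/*"             BEGIN COMMENT;
<COMMENT>"*/"    BEGIN INITIAL;
<COMMENT>.       |
<COMMENT>\\n      ;
\fP
.fi
.RE
.LP
While the scanner is in the \fBCOMMENT\fP state, only the rules
prefixed with <COMMENT> are active, and the last two rules share the
empty action, so the body of the comment is skipped. In the initial
state all other input is copied to the output by the default action.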
.LP
The functions or macros described below are accessible to user code
included in the \fIlex\fP input. It is unspecified whether
they appear in the C code output of \fIlex\fP, or are accessible only
through the \fB-l\ l\fP operand to \fIc99\fP (the \fIlex\fP library).
.TP 7
\fBint\ \fP \fIyylex\fP(\fBvoid\fP)
.sp
Performs lexical analysis on the input; this is the primary function
generated by the \fIlex\fP utility. The function shall return
zero when the end of input is reached; otherwise, it shall return
non-zero values (tokens) determined by the actions that are
selected.
.TP 7
\fBint\ \fP \fIyymore\fP(\fBvoid\fP)
.sp
When called, indicates that when the next input string is recognized,
it is to be appended to the current value of \fIyytext\fP
rather than replacing it; the value in \fIyyleng\fP shall be adjusted
accordingly.
.TP 7
\fBint\ \fP \fIyyless\fP(\fBint\ \fP \fIn\fP)
.sp
Retains \fIn\fP initial characters in \fIyytext\fP, NUL-terminated,
and treats the remaining characters as if they had not been
read; the value in \fIyyleng\fP shall be adjusted accordingly.
.TP 7
\fBint\ \fP \fIinput\fP(\fBvoid\fP)
.sp
Returns the next character from the input, or zero on end-of-file.
It shall obtain input from the stream pointer \fIyyin\fP,
although possibly via an intermediate buffer. Thus, once scanning
has begun, the effect of altering the value of \fIyyin\fP is
undefined. The character read shall be removed from the input stream
of the scanner without any processing by the scanner.
.TP 7
\fBint\ \fP \fIunput\fP(\fBint\ \fP \fIc\fP)
.sp
Returns the character \fB'c'\fP to the input; \fIyytext\fP and \fIyyleng\fP
are undefined until the next expression is
matched. The result of using \fIunput\fP() for more characters than
have been input is unspecified.
.sp
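.LP
The effect of \fIyyless\fP() can be sketched as follows. This is an
illustration only, not part of the standard text; the pattern
\fBfoobar\fP is an arbitrary example.
.sp
.RS
.nf

\fB%%
foobar    { ECHO; yyless(3); }
[a-z]+    ECHO;
\fP
.fi
.RE
.LP
For the input \fBfoobar\fP, the first rule writes the whole match and
then retains only \fBfoo\fP in \fIyytext\fP, so the characters
\fBbar\fP are treated as if they had not been read. They are then
matched by the second rule and written again.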
.LP
The following functions shall appear only in the \fIlex\fP library
accessible through the \fB-l\ l\fP operand; they can
therefore be redefined by a conforming application:
.TP 7
\fBint\ \fP \fIyywrap\fP(\fBvoid\fP)
.sp
Called by \fIyylex\fP() at end-of-file; the default \fIyywrap\fP()
shall always return 1. If the application requires
\fIyylex\fP() to continue processing with another source of input,
then the application can include a function \fIyywrap\fP(),
which associates another file with the external variable \fBFILE *\fP
\fIyyin\fP and shall return a value of zero.
.TP 7
\fBint\ \fP \fImain\fP(\fBint\ \fP \fIargc\fP, \fBchar *\fP\fIargv\fP[])
.sp
Calls \fIyylex\fP() to perform lexical analysis, then exits. The user
code can contain \fImain\fP() to perform
application-specific operations, calling \fIyylex\fP() as applicable.
.sp
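.LP
An application-supplied \fIyywrap\fP() and \fImain\fP() might look like
the following sketch, which is an illustration only and not part of the
standard text. The variable \fIremaining\fP is hypothetical, and for
brevity the sketch does not close files that have been read.
.sp
.RS
.nf

\fB%{
#include <stdio.h>
static char **remaining;   /* hypothetical: operands not yet read */
%}
%%
%%
int yywrap(void)
{
    while (*remaining != NULL) {
        FILE *fp = fopen(*remaining++, "r");
        if (fp != NULL) {
            yyin = fp;     /* associate the next file with yyin */
            return 0;      /* yylex() continues with the new input */
        }
    }
    return 1;              /* no further input */
}

int main(int argc, char *argv[])
{
    remaining = argv + 1;
    if (argc > 1 && yywrap() != 0)
        return 1;          /* none of the operands could be opened */
    yylex();
    return 0;
}
\fP
.fi
.RE
.LP
With an empty \fIRules\fP section the default action copies all input
to the output, so the generated program writes the named files one
after another.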
.LP
Except for \fIinput\fP(), \fIunput\fP(), and \fImain\fP(), all external
and static names generated by \fIlex\fP shall begin
with the prefix \fByy\fP or \fBYY\fP.
.SH EXIT STATUS
.LP
The following exit values shall be returned:
.TP 7
\ 0
Successful completion.
.TP 7
>0
An error occurred.
.sp
.SH CONSEQUENCES OF ERRORS
.LP
Default.
.LP
\fIThe following sections are informative.\fP
.SH APPLICATION USAGE
.LP
Conforming applications are warned that in the \fIRules\fP section,
an ERE without an action is not acceptable, but need not be
detected as erroneous by \fIlex\fP. This may result in compilation
or runtime errors.
.LP
The purpose of \fIinput\fP() is to take characters off the input stream
and discard them as far as the lexical analysis is
concerned. A common use is to discard the body of a comment once the
beginning of a comment is recognized.
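.LP
A rule of that kind might be sketched as follows. This is an
illustration only, not part of the standard text. The check against
\fBEOF\fP is an assumption added for implementations whose
\fIinput\fP() does not return zero at end-of-file.
.sp
.RS
.nf

\fB%{
#include <stdio.h>
%}
%%
"/*"    {
            int c, prev = 0;
            /* Discard input up to and including the closing delimiter. */
            while ((c = input()) != 0 && c != EOF) {
                if (prev == '*' && c == '/')
                    break;
                prev = c;
            }
        }
\fP
.fi
.RE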
.LP
The \fIlex\fP utility is not fully internationalized in its treatment
of regular expressions in the \fIlex\fP source code or
generated lexical analyzer. It would seem desirable to have the lexical
analyzer interpret the regular expressions given in the
\fIlex\fP source according to the environment specified when the lexical
analyzer is executed, but this is not possible with the
current \fIlex\fP technology. Furthermore, the very nature of the
lexical analyzers produced by \fIlex\fP must be closely tied to
the lexical requirements of the input language being described, which
is frequently locale-specific anyway. (For example, writing
an analyzer that is used for French text is not automatically useful
for processing other languages.)
.SH EXAMPLES
.LP
The following is an example of a \fIlex\fP program that implements
a rudimentary scanner for a Pascal-like syntax:
.sp
.RS
.nf

\fB%{
/* Need this for the calls to atoi() and atof() below. */
#include <stdlib.h>
/* Need this for printf(), perror(), fopen(), and stdin below. */
#include <stdio.h>
%}
.sp

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.sp

%%
.sp

{DIGIT}+    {
    printf("An integer: %s (%d)\\n", yytext,
        atoi(yytext));
    }
.sp

{DIGIT}+"."{DIGIT}*    {
    printf("A float: %s (%g)\\n", yytext,
        atof(yytext));
    }
.sp

if|then|begin|end|procedure|function    {
    printf("A keyword: %s\\n", yytext);
    }
.sp

{ID}    printf("An identifier: %s\\n", yytext);
.sp

"+"|"-"|"*"|"/"    printf("An operator: %s\\n", yytext);
.sp

"{"[^}\\n]*"}"    ;    /* Eat up one-line comments. */
.sp

[ \\t\\n]+    ;    /* Eat up white space. */
.sp

\&.    printf("Unrecognized character: %s\\n", yytext);
.sp

%%
.sp

int main(int argc, char *argv[])
{
    ++argv, --argc;  /* Skip over program name. */
    if (argc > 0) {
        yyin = fopen(argv[0], "r");
        if (yyin == NULL) {
            perror(argv[0]);  /* The input file could not be opened. */
            return 1;
        }
    } else
        yyin = stdin;
.sp

    yylex();
    return 0;
}
\fP
.fi
.RE
.SH RATIONALE
.LP
Even though the \fB-c\fP option and references to the C language are
retained in this description, \fIlex\fP may be
generalized to other languages, as was done at one time for EFL, the
Extended FORTRAN Language. Since the \fIlex\fP input
specification is essentially language-independent, versions of this
utility could be written to produce Ada, Modula-2, or Pascal
code, and there are known historical implementations that do so.
.LP
The current description of \fIlex\fP bypasses the issue of dealing
with internationalized EREs in the \fIlex\fP source code or
generated lexical analyzer. If it follows the model used by \fIawk\fP
(the source code is
assumed to be presented in the POSIX locale, but input and output
are in the locale specified by the environment variables), then
the tables in the lexical analyzer produced by \fIlex\fP would interpret
EREs specified in the \fIlex\fP source in terms of the
environment variables specified when \fIlex\fP was executed. The desired
effect would be to have the lexical analyzer interpret
the EREs given in the \fIlex\fP source according to the environment
specified when the lexical analyzer is executed, but this is
not possible with the current \fIlex\fP technology.
.LP
The description of octal and hexadecimal-digit escape sequences agrees
with the ISO\ C standard use of escape sequences. See
the RATIONALE for \fIed\fP for a discussion of bytes larger than 9
bits being represented by octal values.
Hexadecimal values can represent larger bytes and multi-byte characters
directly, using as many digits as required.
.LP
There is no detailed output format specification. The observed behavior
of \fIlex\fP under four different historical
implementations was that none of these implementations consistently
reported the line numbers for error and warning messages.
Furthermore, there was a desire that \fIlex\fP be allowed to output
additional diagnostic messages. Leaving message formats
unspecified avoids these formatting questions and problems with internationalization.
.LP
Although the \fB%x\fP specifier for \fIexclusive\fP start conditions
is not historical practice, it is believed to be a
minor change to historical implementations and greatly enhances the
usability of \fIlex\fP programs since it permits an
application to obtain the expected functionality with fewer statements.
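.LP
As an illustration only (not taken from any historical implementation),
the following fragment uses an exclusive start condition, given the
hypothetical name \fBCOMMENT\fP, to discard C-style comments:
.sp
.RS
.nf

\fB%x COMMENT
%%
"/*"              BEGIN COMMENT;
<COMMENT>"*/"     BEGIN INITIAL;
<COMMENT>.|\\n     ;  /* Discard the comment body. */
\fP
.fi
.RE
.LP
While \fBCOMMENT\fP is active, only the rules prefixed with
\fB<COMMENT>\fP can match. Had the condition been declared with \fB%s\fP
instead, the unprefixed rules of the scanner would remain active inside
comments, so each of them would need an explicit \fB<INITIAL>\fP prefix
to obtain the same behavior.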
.LP
The \fB%array\fP and \fB%pointer\fP declarations were added as a compromise
between historical systems. The System V-based
\fIlex\fP copies the matched text to a \fIyytext\fP array. The \fIflex\fP
program, supported in BSD and GNU systems, uses a
pointer. In the latter case, significant performance improvements
are available for some scanners. Most historical programs should
require no change in porting from one system to another because the
string being referenced is null-terminated in both cases. (The
method used by \fIflex\fP is to null-terminate the token
in place by remembering the character that used to come right
after the token and restoring it before continuing on to the next
scan.) Multi-file programs with external references to
\fIyytext\fP outside the scanner source file should continue to operate
on their historical systems, but would require one of the
new declarations to be considered strictly portable.
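.LP
As an illustration of the multi-file case, consider a hypothetical
program in which a second source file refers to \fIyytext\fP. The
function name \fIreport_token\fP() and the split into two files are
invented for this sketch. The scanner source requests the array form
explicitly:
.sp
.RS
.nf

\fB%{
void report_token(void);
%}
%array
%%
[a-z]+    report_token();
\fP
.fi
.RE
.LP
The other source file then declares \fIyytext\fP in the matching form:
.sp
.RS
.nf

\fB#include <stdio.h>

extern char yytext[];    /* With %pointer: extern char *yytext; */

void report_token(void)
{
    printf("token: %s\\n", yytext);
}
\fP
.fi
.RE
.LP
Without one of the two declarations in the scanner source, the external
declaration would have to guess which form the implementation uses.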
.LP
The description of EREs avoids unnecessary duplication of ERE details
because their meanings within a \fIlex\fP ERE are the
same as those for EREs in this volume of IEEE\ Std\ 1003.1-2001.
.LP
The reason for the undefined condition associated with text beginning
with a <blank> or within \fB"%{"\fP and
\fB"%}"\fP delimiter lines appearing in the \fIRules\fP section is
historical practice. Both the BSD and System V \fIlex\fP
copy the indented (or enclosed) input in the \fIRules\fP section (except
at the beginning) to unreachable areas of the
\fIyylex\fP() function (the code is written directly after a \fIbreak\fP
statement). In some cases, the System V \fIlex\fP generates an error
message or a syntax error, depending on the form of indented
input.
.LP
The intention in breaking the list of functions into those that may
appear in \fBlex.yy.c\fP \fIversus\fP those that only
appear in \fBlibl.a\fP is that only those functions in \fBlibl.a\fP
can be reliably redefined by a conforming application.
.LP
The descriptions of standard output and standard error are somewhat
complicated because historical \fIlex\fP implementations
chose to issue diagnostic messages to standard output (unless \fB-t\fP
was given). IEEE\ Std\ 1003.1-2001 allows this
behavior, but leaves an opening for the more expected behavior of
using standard error for diagnostics. Also, the System V behavior
of writing the statistics when any table sizes are given is allowed,
while BSD-derived systems can avoid it. The programmer can
always obtain precisely the desired results by using either the \fB-t\fP
or the \fB-n\fP option.
.LP
The OPERANDS section does not mention the use of \fB-\fP as a synonym
for standard input; not all historical implementations
support such usage for any of the \fIfile\fP operands.
.LP
A description of the \fItranslation table\fP was deleted from early
proposals because of its relatively low usage in historical
applications.
.LP
The change to the definition of the \fIinput\fP() function that allows
buffering of input presents the opportunity for major
performance gains in some applications.
.LP
The following examples clarify the differences between \fIlex\fP regular
expressions and regular expressions appearing
elsewhere in this volume of IEEE\ Std\ 1003.1-2001. For regular expressions
of the form \fB"r/x"\fP, the string
matching \fIr\fP is always returned; confusion may arise when the
beginning of \fIx\fP matches the trailing portion of \fIr\fP.
For example, given the regular expression \fB"a*b/cc"\fP and the input
\fB"aaabcc"\fP, \fIyytext\fP would contain the
string \fB"aaab"\fP on this match. But given the regular expression
\fB"x*/xy"\fP and the input \fB"xxxy"\fP, the token
\fBxxx\fP, not \fBxx\fP, is returned by some implementations because
\fBxxx\fP matches \fB"x*"\fP.
.LP
In the rule \fB"ab*/bc"\fP, the \fB"b*"\fP at the end of \fIr\fP
extends \fIr\fP's match into the beginning of the
trailing context, so the result is unspecified. If this rule were
\fB"ab/bc"\fP, however, the rule would match the text
\fB"ab"\fP when it is followed by the text \fB"bc"\fP. In this latter
case, the matching of \fIr\fP cannot extend into the
beginning of \fIx\fP, so the result is specified.
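.LP
As an illustration of the two well-defined cases above, a minimal
sketch of a rules section follows. The \fIprintf\fP() actions are
added here only to make the matched text visible and are not part of
the examples in the standard; a complete program would also need a
definitions section that includes \fB<stdio.h>\fP. The comments restate
the results described above:
.sp
.RS
.nf

\fB%%
a*b/cc    printf("token: %s\\n", yytext);  /* "aaabcc": yytext is "aaab" */
ab/bc     printf("token: %s\\n", yytext);  /* "abbc": yytext is "ab" */
\fP
.fi
.RE
.LP
In both rules \fIr\fP cannot extend into \fIx\fP. Rules such as
\fB"ab*/bc"\fP, where it can, are best avoided in sources that are
meant to be portable.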
.SH FUTURE DIRECTIONS
.LP
None.
.SH SEE ALSO
.LP
\fIc99\fP, \fIed\fP, \fIyacc\fP
.SH COPYRIGHT
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
-- Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
Electrical and Electronics Engineers, Inc and The Open Group. In the
event of any discrepancy between this version and the original IEEE and
The Open Group Standard, the original IEEE and The Open Group Standard
is the referee document. The original Standard can be obtained online at
http://www.opengroup.org/unix/online.html .