\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@setfilename flex.info
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex). Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header
@copying
The flex manual is placed under the same licensing conditions as the
rest of flex:
Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.
Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.
This code is derived from software contributed to Berkeley by
Vern Paxson.
The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
@enumerate
@item
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate
Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying
@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex
This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.
This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.
This manual was written by @value{authors}.
@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::
@detailmenu
--- The Detailed Node Listing ---
Format of the Input File
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
Scanner Options
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
Reentrant C Scanners
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
The Reentrant API in Detail
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
Memory Management
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
Serialized Tables
* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::
FAQ
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
Appendices
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
Indices
* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright
@cindex copyright of flex
@cindex distributing flex
@insertcopying
@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs
@cindex bugs, reporting
@cindex reporting bugs
If you find a bug in @code{flex}, please report it using
the SourceForge Bug Tracking facilities which can be found on
@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.
@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction
@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
program which recognizes lexical patterns in text. The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate. The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}. @code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable. When the executable is run, it analyzes its
input for occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding C code.
@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples
First some simple examples to get the flavor of how one uses
@code{flex}.
@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username}, will replace it with the user's
login name:
@example
@verbatim
%%
username printf( "%s", getlogin() );
@end verbatim
@end example
@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded. In this
input, there is just one rule. @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
beginning of the rules.
Here's another simple example:
@cindex counting characters and lines
@example
@verbatim
        int num_lines = 0, num_chars = 0;
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
int main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
num_lines, num_chars );
}
@end verbatim
@end example
This scanner counts the number of characters and the number of lines in
its input. It produces no output other than the final report on the
character and line counts. The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}. There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).
A somewhat more complicated example:
@cindex Pascal-like language
@example
@verbatim
/* scanner for a toy Pascal-like language */
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf( "An integer: %s (%d)\n", yytext,
atoi( yytext ) );
}
{DIGIT}+"."{DIGIT}* {
printf( "A float: %s (%g)\n", yytext,
atof( yytext ) );
}
if|then|begin|end|procedure|function {
printf( "A keyword: %s\n", yytext );
}
{ID} printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}" /* eat up one-line comments */
[ \t\n]+ /* eat up whitespace */
. printf( "Unrecognized character: %s\n", yytext );
%%
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
yylex();
}
@end verbatim
@end example
This is the beginnings of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.
The details of this example will be explained in the following
sections.
@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File
@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input
The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.
@cindex format of input file
@example
@verbatim
definitions
%%
rules
%%
user code
@end verbatim
@end example
@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu
@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section
@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.
@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:
@example
@verbatim
name definition
@end verbatim
@end example
The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash). The definition is taken to begin at the first
non-whitespace character following the name and continuing to the end of
the line. The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}. For example,
@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
DIGIT [0-9]
ID [a-z][a-z0-9]*
@end verbatim
@end example
defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to
@cindex pattern aliases, use of
@example
@verbatim
{DIGIT}+"."{DIGIT}*
@end verbatim
@end example
is identical to
@example
@verbatim
([0-9])+"."([0-9])*
@end verbatim
@end example
and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.
@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.
@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves.
@cindex %top
A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:
@example
@verbatim
%top{
/* This code goes at the "top" of the generated file. */
#include <stdint.h>
#include <inttypes.h>
}
@end verbatim
@end example
Multiple @code{%top} blocks are allowed, and their order is preserved.
@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section
@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:
@example
@verbatim
pattern action
@end verbatim
@end example
where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.
In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance; @pxref{Lex and
Posix}, for other such features).
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.
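As a minimal sketch of code placed before the first rule, the following
rules section declares a variable that is local to the scanning routine.
The counter name @code{vowels_seen} and the rules themselves are
hypothetical and serve only as an illustration:
@example
@verbatim
%%
        /* Indented text before the first rule: a declaration local
           to yylex(), set up each time yylex() is entered. */
        int vowels_seen = 0;    /* hypothetical per-line counter */

[aeiou]     ++vowels_seen;
\n          printf( "%d\n", vowels_seen ); vowels_seen = 0;
.           /* ignore everything else */
@end verbatim
@end example
Because the declaration is indented, it is copied into the scanning
routine instead of being read as a rule.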
@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section
@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
@node Comments in the Input, , User Code Section, Format
@section Comments in the Input
@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment. Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:
@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize
If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.
All the comments in the following example are valid:
@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}
/* Definitions Section */
%x STATE_X
%%
    /* Rules Section */
ruleA /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC ECHO;
ruleD ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */
@end verbatim
@end example
@node Patterns, Matching, Format, Top
@chapter Patterns
@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions. These are:
@cindex patterns, syntax
@table @samp
@item x
match the character 'x'
@item .
any character (byte) except newline
@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'
@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'
@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.
@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline
@item [a-z]@{-@}[aeiou]
the lowercase consonants
@item r*
zero or more r's, where r is any regular expression
@item r+
one or more r's
@item r?
zero or one r's (that is, ``an optional r'')
@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's
@item r@{2,@}
two or more r's
@item r@{4@}
exactly 4 r's
@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).
@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}
@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\X}. Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})
@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)
@cindex octal characters in patterns
@item \123
the character with octal value 123
@item \x2a
the character with hexadecimal value 2a
@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)
@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.
@samp{i} means case-insensitive. @samp{-i} means case-sensitive.
@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.
@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.
The following are all valid:
@verbatim
(?:foo) same as (foo)
(?i:ab7) same as ([aA][bB]7)
(?-i:ab) same as (ab)
(?s:.) same as [\x00-\xFF]
(?-s:.) same as [^\n]
(?ix-s: a . b) same as ([Aa][^\n][bB])
(?x:a b) same as ("ab")
(?x:a\ b) same as ("a b")
(?x:a" "b) same as ("a b")
(?x:a[ ]b) same as ("a b")
(?x:a
/* comment */
b
c) same as (abc)
@end verbatim
@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.
@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}
@item r|s
either an @samp{r} or an @samp{s}
@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed. So the action
only sees the text matched by @samp{r}. This type of pattern is called
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)
@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline). Equivalent to @samp{r/\n}.
@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.
@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).
@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.
@item <*>r
an @samp{r} in any start condition, even an exclusive one.
@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.
@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table
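The start condition and @samp{<<EOF>>} forms in the table above can be
combined as in the following sketch, which discards C-style comments.
It assumes an exclusive start condition, given the illustrative name
@code{COMMENT} and declared with @samp{%x}; the @code{BEGIN} action,
the predefined @code{INITIAL} condition, and @code{yyterminate()} are
covered in later chapters (@pxref{Start Conditions}).
@example
@verbatim
%x COMMENT
%%
"/*"                BEGIN(COMMENT);
<COMMENT>"*/"       BEGIN(INITIAL);
<COMMENT>.|\n       /* discard the comment text */
<COMMENT><<EOF>>    {
                    fprintf( stderr, "unterminated comment\n" );
                    yyterminate();
                    }
@end verbatim
@end example
Text outside a comment matches none of these rules, so it is handled by
the default rule and copied to the output.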
Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.
@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option). For example,
@cindex patterns, grouping and precedence
@example
@verbatim
foo|bar*
@end verbatim
@end example
is the same as
@example
@verbatim
(foo)|(ba(r*))
@end verbatim
@end example
since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}). This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:
@example
@verbatim
foo|(bar)*
@end verbatim
@end example
And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:
@cindex patterns, repetitions with grouping
@example
@verbatim
(foo|bar)*
@end verbatim
@end example
@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}. These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters, which
themselves must appear between the @samp{[} and @samp{]} of the
character class (other elements may occur inside the character class,
too). The valid expressions are:
@cindex patterns, valid character classes
@example
@verbatim
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
@end verbatim
@end example
These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function. For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true, i.e., any alphabetic or numeric character. Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.
For example, the following character classes are all equivalent:
@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:][0-9]]
[a-zA-Z0-9]
@end verbatim
@end example
A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.
@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.
@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding). When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled:
@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable
@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched. Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.
Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.
@example
@verbatim
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example
Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.
@item
The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.
@item
The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in,
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.
@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern and, like @samp{/} and @samp{$},
cannot be grouped inside parentheses. A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
@item
The following are invalid:
@cindex patterns, invalid trailing context
@example
@verbatim
foo/bar$
<sc1>foo<sc2>bar
@end verbatim
@end example
Note that the first of these can be written @samp{foo/bar\n}.
@item
The following will result in @samp{$} or @samp{^} being treated as a normal character:
@cindex patterns, special characters treated as non-special
@example
@verbatim
foo|(bar$)
foo|^bar
@end verbatim
@end example
If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):
@cindex patterns, end of line
@example
@verbatim
foo |
bar$ /* action goes here */
@end verbatim
@end example
A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize
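The trailing context operator discussed above can be sketched as
follows. The rules and the @samp{px} unit are hypothetical and were
chosen only to illustrate the operator:
@example
@verbatim
%%
[0-9]+/"px"     printf( "pixel count: %s\n", yytext );
[0-9]+          printf( "plain number: %s\n", yytext );
"px"            /* the unit is consumed by its own rule */
.|\n            /* discard everything else */
@end verbatim
@end example
For the input @samp{12px}, the first rule should be chosen over the
second because the trailing @samp{px} counts toward the length of its
match. The action nevertheless sees only @samp{12} in @code{yytext},
and the @samp{px} is returned to the input, where the third rule
matches it.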
@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched
@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
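The selection rule above can be modelled in a few lines of C. This is only
an illustrative sketch and not how @code{flex} itself works (the generated
scanner is table-driven); the function name @code{choose_rule} and the array
of per-rule match lengths are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical model of flex's rule selection.
 * match_len[i] is the number of characters rule i would match at the
 * current input position (0 meaning "no match"); rules are indexed in
 * the order they appear in the flex input file.
 * Returns the index of the winning rule, or -1 if no rule matches
 * (in which case the default rule would apply). */
int choose_rule(const size_t *match_len, int n_rules)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < n_rules; ++i) {
        /* Strict comparison: on a tie the earlier rule is kept. */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

With match lengths of 3, 5 and 5, the sketch picks the second rule (index 1):
it matches the most text and is listed before the other rule of equal length.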
@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.
@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:
@cindex minimal scanner
@example
@verbatim
%%
@end verbatim
@end example
which generates a scanner that simply copies its input (one character at
a time) to its output.
@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.
@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:
@example
@verbatim
extern char yytext[];
@end verbatim
@end example
This definition is erroneous when used with @code{%pointer}, but correct
for @code{%array}.
The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by defining @code{YYLMAX} (with @code{#define}) to a
different value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer} @code{yytext} grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.
@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes
(@pxref{Cxx}).
@node Actions, Generated Scanner, Matching, Top
@chapter Actions
@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:
@cindex deleting lines from input
@example
@verbatim
%%
"zap me"
@end verbatim
@end example
This example will copy all other characters in the input to the output
since they will be matched by the default rule.
Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:
@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
%%
[ \t]+ putchar( ' ' );
[ \t]+$ /* ignore this token */
@end verbatim
@end example
@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action spans till the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).
@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.
Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
return.
@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(adding characters to its end--these will overwrite later characters in
the input stream). This however does not apply when using @code{%array}
(@pxref{Matching}). In that case, @code{yytext} may be freely modified
in any way.
@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).
@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:
@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.
@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).
@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:
@example
@verbatim
        int word_count = 0;
%%
frob        special(); REJECT;
[^ \t\n]+   ++word_count;
@end verbatim
@end example
Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token. Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:
@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
%%
a |
ab |
abc |
abcd ECHO; REJECT;
.|\n /* eat up any unmatched character */
@end verbatim
@end example
The first three rules share the fourth's action since they use the
special @samp{|} action.
@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).
Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.
@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:
@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
%%
mega- ECHO; yymore();
kludge ECHO;
@end verbatim
@end example
First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of
@code{yytext}
so the
@code{ECHO}
for the @samp{kludge} rule will actually write @samp{mega-kludge}.
@end table
@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.
@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:
@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
%%
foobar ECHO; yyless(3);
[a-z]+ ECHO;
@end verbatim
@end example
An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.
Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.
@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.
@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
{
int i;
/* Copy yytext because unput() trashes yytext */
char *yycopy = strdup( yytext );
unput( ')' );
for ( i = yyleng - 1; i >= 0; --i )
    unput( yycopy[i] );
unput( '(' );
free( yycopy );
}
@end verbatim
@end example
Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.
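The back-to-front requirement follows from the last-in, first-out behaviour
of pushed-back characters. The following stand-alone C sketch models that
behaviour with a small stack; it is not @code{flex}'s buffer implementation,
and the names @code{pushback}, @code{pb_unput}, @code{pb_next} and
@code{pb_unput_string} are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical model of a pushback area: a simple LIFO stack. */
struct pushback {
    char   buf[PUSHBACK_MAX];
    size_t top;               /* number of characters currently held */
};

/* Push one character back; returns 0 on success, -1 if the area is full. */
int pb_unput(struct pushback *pb, char c)
{
    if (pb->top == PUSHBACK_MAX)
        return -1;
    pb->buf[pb->top++] = c;
    return 0;
}

/* Read the next character: the most recently pushed one, or -1 if empty. */
int pb_next(struct pushback *pb)
{
    return pb->top ? (unsigned char) pb->buf[--pb->top] : -1;
}

/* Push a whole string so that it is later read in its original order:
 * the last character goes in first, the first character last. */
int pb_unput_string(struct pushback *pb, const char *s)
{
    for (size_t i = strlen(s); i > 0; --i)
        if (pb_unput(pb, s[i - 1]) != 0)
            return -1;
    return 0;
}
```

Pushing the string @code{(ab)} this way and then reading four characters
yields them in their original left-to-right order.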
@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).
@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.
@cindex input()
@code{input()} reads the next character from the input stream. For
example, the following is one way to eat up C comments:
@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
%%
"/*"        {
            int c;
            for ( ; ; )
                {
                while ( (c = input()) != '*' &&
                        c != EOF )
                    ;    /* eat up text of comment */
                if ( c == '*' )
                    {
                    while ( (c = input()) == '*' )
                        ;
                    if ( c == '/' )
                        break;    /* found the end */
                    }
                if ( c == EOF )
                    {
                    error( "EOF in comment" );
                    break;
                    }
                }
            }
@end verbatim
@end example
@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @code{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)
@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).
@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner
@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:
@example
@verbatim
int yylex()
{
... various definitions and the actions in here ...
}
@end verbatim
@end example
@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:
@cindex yylex, overriding the prototype of
@example
@verbatim
#define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example
to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).
@code{flex} generates @samp{C99} function definitions by
default. However flex does have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This is to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. For this
reason, future versions of @code{flex} may generate standard C99 code
only, leaving K&R-style functions to the historians. Currently, if you
do @strong{not} want @samp{C99} definitions, then you must use
@code{%option noansi-definitions}.
@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.
@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either @file{yyin} is pointed at a new input file (in which case
scanning continues from that file), or @code{yyrestart()} is called.
@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).
@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.
@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}. The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.
@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):
@example
@verbatim
%{
#define YY_INPUT(buf,result,max_size) \
{ \
int c = getchar(); \
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
}
%}
@end verbatim
@end example
This definition will change the input processing to occur one character
at a time.
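The same contract (fill @code{buf} with at most @code{max_size} characters
and report how many were stored, or 0 at end of input) can also be satisfied
with block reads from some other source. The sketch below shows one possible
helper that reads from a region of memory; @code{mem_source},
@code{mem_read} and @code{my_source} are hypothetical names invented for
this illustration, and for real in-memory scanning the facilities described
in @ref{Scanning Strings} are usually the better choice.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *data;   /* start of the text to scan */
    size_t      len;    /* total number of characters */
    size_t      pos;    /* how many have been handed out so far */
};

/* Copy up to max_size characters into buf and return the count.
 * A return value of 0 means end of input, which is the value the
 * text above gives for YY_NULL on Unix systems. */
size_t mem_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t remaining = src->len - src->pos;
    size_t n = remaining < max_size ? remaining : max_size;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}

/* A scanner might hook this in from its definitions section with
 * something like the following (my_source is a hypothetical variable):
 *
 *   #define YY_INPUT(buf,result,max_size) { result = mem_read( &my_source, buf, max_size ); }
 */
```

Unlike the one-character definition above, this version hands the scanner as
many characters per call as it asks for.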
@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then checks the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.
@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.
@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.
@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions
@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,
@example
@verbatim
<STRING>[^"]* { /* eat up the string body ... */
...
}
@end verbatim
@end example
will be active only when the scanner is in the @code{STRING} start
condition, and
@cindex start conditions, multiple
@example
@verbatim
<INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
...
}
@end verbatim
@end example
will be active only when the current start condition is either
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.
@cindex start conditions, inclusive vs.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive. If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:
@cindex start conditions, inclusive
@example
@verbatim
%s example
%%
<example>foo do_something();
bar something_else();
@end verbatim
@end example
is equivalent to
@cindex start conditions, exclusive
@example
@verbatim
%x example
%%
<example>foo do_something();
<INITIAL,example>bar something_else();
@end verbatim
@end example
Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive (@code{%s}) start condition.
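The activation rules can be summarized as a small predicate. The C sketch
below is only a model of the behaviour described above, not code taken from
@code{flex}; the function name @code{rule_is_active} and its three
parameters are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of start-condition activation.
 *
 * has_qualifier     - the rule carries a <...> start-condition prefix
 * lists_current     - that prefix names the current start condition
 *                     (ignored when has_qualifier is false)
 * current_exclusive - the current start condition was declared with %x
 *
 * A qualified rule is active only when its prefix names the current
 * start condition. An unqualified rule is active unless the current
 * start condition is exclusive. */
bool rule_is_active(bool has_qualifier, bool lists_current,
                    bool current_exclusive)
{
    if (has_qualifier)
        return lists_current;
    return !current_exclusive;
}
```

In terms of the two examples above: the unqualified @code{bar} rule is
active in @code{example} when that condition is declared with @code{%s}
(the predicate returns true), but not when it is declared with @code{%x}
(the predicate returns false).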
@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier
@code{<*>}
matches every start condition. Thus, the above example could also
have been written:
@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
%x example
%%
<example>foo do_something();
<*>bar something_else();
@end verbatim
@end example
The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:
@cindex start conditions, behavior of default rule
@example
@verbatim
<*>.|\n ECHO;
@end verbatim
@end example
@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)
@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:
@cindex start conditions, using BEGIN
@example
@verbatim
        int enter_special;
%x SPECIAL
%%
        if ( enter_special )
            BEGIN(SPECIAL);
<SPECIAL>blahblahblah
...more rules follow...
@end verbatim
@end example
To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}. But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:
@cindex start conditions, for different interpretations of same input
@example
@verbatim
%{
#include <math.h>
#include <stdlib.h>    /* atof(), atoi() */
%}
%s expect
%%
expect-floats        BEGIN(expect);
<expect>[0-9]+"."[0-9]+      {
            printf( "found a float, = %f\n",
                    atof( yytext ) );
            }
<expect>\n           {
            /* that's the end of the line, so
             * we need another "expect-floats"
             * before we'll recognize any more
             * numbers
             */
            BEGIN(INITIAL);
            }
[0-9]+      {
            printf( "found an integer, = %d\n",
                    atoi( yytext ) );
            }
"."         printf( "found a dot\n" );
@end verbatim
@end example
@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.
@cindex recognizing C comments
@example
@verbatim
%x comment
%%
        int line_num = 1;
"/*"         BEGIN(comment);
<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example
This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.
Note that start condition names are really integer values and
can be stored as such. Thus, the above could be extended in the
following fashion:
@cindex start conditions, integer values
@cindex using integer values of start condition names
@example
@verbatim
%x comment foo
%%
        int line_num = 1;
        int comment_caller;
"/*"         {
             comment_caller = INITIAL;
             BEGIN(comment);
             }
...
<foo>"/*"    {
             comment_caller = foo;
             BEGIN(comment);
             }
<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
@end verbatim
@end example
@cindex YY_START, example
Furthermore, you can access the current start condition using the
integer-valued @code{YY_START} macro. For example, the above
assignments to @code{comment_caller} could instead be written
@cindex getting current start state with YY_START
@example
@verbatim
comment_caller = YY_START;
@end verbatim
@end example
@vindex YY_START
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).
For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header.
@xref{option-header}. @xref{option-prefix}.
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):
@cindex matching C-style double-quoted strings
@example
@verbatim
%x str
%%
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
\"      string_buf_ptr = string_buf; BEGIN(str);
<str>\"        { /* saw closing quote - all done */
        BEGIN(INITIAL);
        *string_buf_ptr = '\0';
        /* return string constant token type and
         * value to parser
         */
        }
<str>\n        {
        /* error - unterminated string constant */
        /* generate error message */
        }
<str>\\[0-7]{1,3} {
        /* octal escape sequence */
        unsigned int result;    /* %o expects an unsigned int */
        (void) sscanf( yytext + 1, "%o", &result );
        if ( result > 0xff )
                {
                /* error, constant is out-of-bounds */
                }
        else
                *string_buf_ptr++ = result;
        }
<str>\\[0-9]+ {
        /* generate error - bad escape sequence; something
         * like '\48' or '\0777777'
         */
        }
<str>\\n  *string_buf_ptr++ = '\n';
<str>\\t  *string_buf_ptr++ = '\t';
<str>\\r  *string_buf_ptr++ = '\r';
<str>\\b  *string_buf_ptr++ = '\b';
<str>\\f  *string_buf_ptr++ = '\f';
<str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
<str>[^\\\n\"]+        {
        char *yptr = yytext;
        while ( *yptr )
                *string_buf_ptr++ = *yptr++;
        }
@end verbatim
@end example
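The range check in the octal-escape rule can also be tried outside a
scanner. The helper below is a stand-alone sketch, not part of
@code{flex}: the name @code{decode_octal_escape} is hypothetical, and it
uses @code{strtoul()} in place of the @code{sscanf()} call shown above.

```c
#include <stdlib.h>

/* Hypothetical helper: decode an escape of the form \NNN, where text
 * points at the backslash and is followed by one to three octal digits
 * (as guaranteed by the pattern \\[0-7]{1,3}).
 * Stores the decoded character in *out and returns 0, or returns -1
 * without touching *out if the value does not fit in one byte. */
int decode_octal_escape(const char *text, unsigned char *out)
{
    unsigned long value = strtoul(text + 1, NULL, 8);

    if (value > 0xff)
        return -1;              /* constant is out-of-bounds */

    *out = (unsigned char) value;
    return 0;
}
```

For instance, the three-digit escape @code{\101} decodes to the character
@samp{A} (octal 101 is decimal 65), while @code{\777} is rejected because
its value, 511, exceeds @code{0xff}.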
@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}. A start condition scope is begun with:
@example
@verbatim
<SCs>{
@end verbatim
@end example
where @code{<SCs>} is a list of one or more start conditions. Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}. So, for example,
@cindex extended scope of start conditions
@example
@verbatim
<ESC>{
"\\n" return '\n';
"\\r" return '\r';
"\\f" return '\f';
"\\0" return '\0';
}
@end verbatim
@end example
is equivalent to:
@example
@verbatim
<ESC>"\\n" return '\n';
<ESC>"\\r" return '\r';
<ESC>"\\f" return '\f';
<ESC>"\\0" return '\0';
@end verbatim
@end example
Start condition scopes may be nested.
@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack