SSL nfa.rs
Sprache: unbekannt
|
|
use core::{fmt, mem};
use alloc::{boxed::Box, format, string::String, sync::Arc, vec, vec::Vec};
#[cfg(feature = "syntax")]
use crate::nfa::thompson::{
compiler::{Compiler, Config},
error::BuildError,
};
use crate::{
nfa::thompson::builder::Builder,
util::{
alphabet::{self, ByteClassSet, ByteClasses},
captures::{GroupInfo, GroupInfoError},
look::{Look, LookMatcher, LookSet},
primitives::{
IteratorIndexExt, PatternID, PatternIDIter, SmallIndex, StateID,
},
sparse_set::SparseSet,
},
};
/// A byte oriented Thompson non-deterministic finite automaton (NFA).
///
/// A Thompson NFA is a finite state machine that permits unconditional epsilon
/// transitions, but guarantees that there exists at most one non-epsilon
/// transition for each element in the alphabet for each state.
///
/// An NFA may be used directly for searching, for analysis or to build
/// a deterministic finite automaton (DFA).
///
/// # Cheap clones
///
/// Since an NFA is a core data type in this crate that many other regex
/// engines are based on top of, it is convenient to give ownership of an NFA
/// to said regex engines. Because of this, an NFA uses reference counting
/// internally. Therefore, it is cheap to clone and it is encouraged to do so.
///
/// # Capabilities
///
/// Using an NFA for searching via the
/// [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM) provides the most amount
/// of "power" of any regex engine in this crate. Namely, it supports the
/// following in all cases:
///
/// 1. Detection of a match.
/// 2. Location of a match, including both the start and end offset, in a
/// single pass of the haystack.
/// 3. Location of matching capturing groups.
/// 4. Handles multiple patterns, including (1)-(3) when multiple patterns are
/// present.
///
/// # Capturing Groups
///
/// Groups refer to parenthesized expressions inside a regex pattern. They look
/// like this, where `exp` is an arbitrary regex:
///
/// * `(exp)` - An unnamed capturing group.
/// * `(?P<name>exp)` or `(?<name>exp)` - A named capturing group.
/// * `(?:exp)` - A non-capturing group.
/// * `(?i:exp)` - A non-capturing group that sets flags.
///
/// Only the first two forms are said to be _capturing_. Capturing
/// means that the last position at which they match is reportable. The
/// [`Captures`](crate::util::captures::Captures) type provides convenient
/// access to the match positions of capturing groups, which includes looking
/// up capturing groups by their name.
///
/// # Byte oriented
///
/// This NFA is byte oriented, which means that all of its transitions are
/// defined on bytes. In other words, the alphabet of an NFA consists of the
/// 256 different byte values.
///
/// While DFAs nearly demand that they be byte oriented for performance
/// reasons, an NFA could conceivably be *Unicode codepoint* oriented. Indeed,
/// a previous version of this NFA supported both byte and codepoint oriented
/// modes. A codepoint oriented mode can work because an NFA fundamentally uses
/// a sparse representation of transitions, which works well with the large
/// sparse space of Unicode codepoints.
///
/// Nevertheless, this NFA is only byte oriented. This choice is primarily
/// driven by implementation simplicity, and also in part memory usage. In
/// practice, performance between the two is roughly comparable. However,
/// building a DFA (including a hybrid DFA) really wants a byte oriented NFA.
/// So if we do have a codepoint oriented NFA, then we also need to generate
/// byte oriented NFA in order to build an hybrid NFA/DFA. Thus, by only
/// generating byte oriented NFAs, we can produce one less NFA. In other words,
/// if we made our NFA codepoint oriented, we'd need to *also* make it support
/// a byte oriented mode, which is more complicated. But a byte oriented mode
/// can support everything.
///
/// # Differences with DFAs
///
/// At the theoretical level, the precise difference between an NFA and a DFA
/// is that, in a DFA, for every state, an input symbol unambiguously refers
/// to a single transition _and_ that an input symbol is required for each
/// transition. At a practical level, this permits DFA implementations to be
/// implemented at their core with a small constant number of CPU instructions
/// for each byte of input searched. In practice, this makes them quite a bit
/// faster than NFAs _in general_. Namely, in order to execute a search for any
/// Thompson NFA, one needs to keep track of a _set_ of states, and execute
/// the possible transitions on all of those states for each input symbol.
/// Overall, this results in much more overhead. To a first approximation, one
/// can expect DFA searches to be about an order of magnitude faster.
///
/// So why use an NFA at all? The main advantage of an NFA is that it takes
/// linear time (in the size of the pattern string after repetitions have been
/// expanded) to build and linear memory usage. A DFA, on the other hand, may
/// take exponential time and/or space to build. Even in non-pathological
/// cases, DFAs often take quite a bit more memory than their NFA counterparts,
/// _especially_ if large Unicode character classes are involved. Of course,
/// an NFA also provides additional capabilities. For example, it can match
/// Unicode word boundaries on non-ASCII text and resolve the positions of
/// capturing groups.
///
/// Note that a [`hybrid::regex::Regex`](crate::hybrid::regex::Regex) strikes a
/// good balance between an NFA and a DFA. It avoids the exponential build time
/// of a DFA while maintaining its fast search time. The downside of a hybrid
/// NFA/DFA is that in some cases it can be slower at search time than the NFA.
/// (It also has less functionality than a pure NFA. It cannot handle Unicode
/// word boundaries on non-ASCII text and cannot resolve capturing groups.)
///
/// # Example
///
/// This shows how to build an NFA with the default configuration and execute a
/// search using the Pike VM.
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
///
/// let re = PikeVM::new(r"foo[0-9]+")?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let expected = Some(Match::must(0, 0..8));
/// re.captures(&mut cache, b"foo12345", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: resolving capturing groups
///
/// This example shows how to parse some simple dates and extract the
/// components of each date via capturing groups.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// };
///
/// let vm = PikeVM::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})")?;
/// let mut cache = vm.create_cache();
///
/// let haystack = "2012-03-14, 2013-01-01 and 2014-07-05";
/// let all: Vec<Captures> = vm.captures_iter(
/// &mut cache, haystack.as_bytes()
/// ).collect();
/// // There should be a total of 3 matches.
/// assert_eq!(3, all.len());
/// // The year from the second match is '2013'.
/// let span = all[1].get_group_by_name("y").unwrap();
/// assert_eq!("2013", &haystack[span]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// This example shows that only the last match of a capturing group is
/// reported, even if it had to match multiple times for an overall match
/// to occur.
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
///
/// let re = PikeVM::new(r"([a-z]){4}")?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let haystack = b"quux";
/// re.captures(&mut cache, haystack, &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(Span::from(3..4)), caps.get_group(1));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone)]
pub struct NFA(
// We make NFAs reference counted primarily for two reasons. First is that
// the NFA type itself is quite large (at least 0.5KB), and so it makes
// sense to put it on the heap by default anyway. Second is that, for Arc
// specifically, this enables cheap clones. This tends to be useful because
// several structures (the backtracker, the Pike VM, the hybrid NFA/DFA)
// all want to hang on to an NFA for use during search time. We could
// provide the NFA at search time via a function argument, but this makes
// for an unnecessarily annoying API. Instead, we just let each structure
// share ownership of the NFA. Using a deep clone would not be smart, since
// the NFA can use quite a bit of heap space.
Arc<Inner>,
);
impl NFA {
/// Parse the given regular expression using a default configuration and
/// build an NFA from it.
///
/// If you want a non-default configuration, then use the NFA
/// [`Compiler`] with a [`Config`].
///
/// # Example
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
///
/// let re = PikeVM::new(r"foo[0-9]+")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let expected = Some(Match::must(0, 0..8));
/// re.captures(&mut cache, b"foo12345", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[cfg(feature = "syntax")]
pub fn new(pattern: &str) -> Result<NFA, BuildError> {
NFA::compiler().build(pattern)
}
/// Parse the given regular expressions using a default configuration and
/// build a multi-NFA from them.
///
/// If you want a non-default configuration, then use the NFA
/// [`Compiler`] with a [`Config`].
///
/// # Example
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
///
/// let re = PikeVM::new_many(&["[0-9]+", "[a-z]+"])?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let expected = Some(Match::must(1, 0..3));
/// re.captures(&mut cache, b"foo12345bar", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[cfg(feature = "syntax")]
pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<NFA, BuildError> {
NFA::compiler().build_many(patterns)
}
/// Returns an NFA with a single regex pattern that always matches at every
/// position.
///
/// # Example
///
/// ```
/// use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
///
/// let re = PikeVM::new_from_nfa(NFA::always_match())?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let expected = Some(Match::must(0, 0..0));
/// re.captures(&mut cache, b"", &mut caps);
/// assert_eq!(expected, caps.get_match());
/// re.captures(&mut cache, b"foo", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn always_match() -> NFA {
// We could use NFA::new("") here and we'd get the same semantics, but
// hand-assembling the NFA (as below) does the same thing with a fewer
// number of states. It also avoids needing the 'syntax' feature
// enabled.
//
// Technically all we need is the "match" state, but we add the
// "capture" states so that the PikeVM can use this NFA.
//
// The unwraps below are OK because we add so few states that they will
// never exhaust any default limits in any environment.
let mut builder = Builder::new();
let pid = builder.start_pattern().unwrap();
assert_eq!(pid.as_usize(), 0);
let start_id =
builder.add_capture_start(StateID::ZERO, 0, None).unwrap();
let end_id = builder.add_capture_end(StateID::ZERO, 0).unwrap();
let match_id = builder.add_match().unwrap();
builder.patch(start_id, end_id).unwrap();
builder.patch(end_id, match_id).unwrap();
let pid = builder.finish_pattern(start_id).unwrap();
assert_eq!(pid.as_usize(), 0);
builder.build(start_id, start_id).unwrap()
}
/// Returns an NFA that never matches at any position.
///
/// This is a convenience routine for creating an NFA with zero patterns.
///
/// # Example
///
/// ```
/// use regex_automata::nfa::thompson::{NFA, pikevm::PikeVM};
///
/// let re = PikeVM::new_from_nfa(NFA::never_match())?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, b"", &mut caps);
/// assert!(!caps.is_match());
/// re.captures(&mut cache, b"foo", &mut caps);
/// assert!(!caps.is_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn never_match() -> NFA {
// This always succeeds because it only requires one NFA state, which
// will never exhaust any (default) limits.
let mut builder = Builder::new();
let sid = builder.add_fail().unwrap();
builder.build(sid, sid).unwrap()
}
/// Return a default configuration for an `NFA`.
///
/// This is a convenience routine to avoid needing to import the `Config`
/// type when customizing the construction of an NFA.
///
/// # Example
///
/// This example shows how to build an NFA with a small size limit that
/// results in a compilation error for any regex that tries to use more
/// heap memory than the configured limit.
///
/// ```
/// use regex_automata::nfa::thompson::{NFA, pikevm::PikeVM};
///
/// let result = PikeVM::builder()
/// .thompson(NFA::config().nfa_size_limit(Some(1_000)))
/// // Remember, \w is Unicode-aware by default and thus huge.
/// .build(r"\w+");
/// assert!(result.is_err());
/// ```
#[cfg(feature = "syntax")]
pub fn config() -> Config {
Config::new()
}
/// Return a compiler for configuring the construction of an `NFA`.
///
/// This is a convenience routine to avoid needing to import the
/// [`Compiler`] type in common cases.
///
/// # Example
///
/// This example shows how to build an NFA that is permitted match invalid
/// UTF-8. Without the additional syntax configuration here, compilation of
/// `(?-u:.)` would fail because it is permitted to match invalid UTF-8.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::syntax,
/// Match,
/// };
///
/// let re = PikeVM::builder()
/// .syntax(syntax::Config::new().utf8(false))
/// .build(r"[a-z]+(?-u:.)")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let expected = Some(Match::must(0, 1..5));
/// re.captures(&mut cache, b"\xFFabc\xFF", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[cfg(feature = "syntax")]
pub fn compiler() -> Compiler {
Compiler::new()
}
/// Returns an iterator over all pattern identifiers in this NFA.
///
/// Pattern IDs are allocated in sequential order starting from zero,
/// where the order corresponds to the order of patterns provided to the
/// [`NFA::new_many`] constructor.
///
/// # Example
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new_many(&["[0-9]+", "[a-z]+", "[A-Z]+"])?;
/// let pids: Vec<PatternID> = nfa.patterns().collect();
/// assert_eq!(pids, vec![
/// PatternID::must(0),
/// PatternID::must(1),
/// PatternID::must(2),
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn patterns(&self) -> PatternIter<'_> {
PatternIter {
it: PatternID::iter(self.pattern_len()),
_marker: core::marker::PhantomData,
}
}
/// Returns the total number of regex patterns in this NFA.
///
/// This may return zero if the NFA was constructed with no patterns. In
/// this case, the NFA can never produce a match for any input.
///
/// This is guaranteed to be no bigger than [`PatternID::LIMIT`] because
/// NFA construction will fail if too many patterns are added.
///
/// It is always true that `nfa.patterns().count() == nfa.pattern_len()`.
///
/// # Example
///
/// ```
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa = NFA::new_many(&["[0-9]+", "[a-z]+", "[A-Z]+"])?;
/// assert_eq!(3, nfa.pattern_len());
///
/// let nfa = NFA::never_match();
/// assert_eq!(0, nfa.pattern_len());
///
/// let nfa = NFA::always_match();
/// assert_eq!(1, nfa.pattern_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn pattern_len(&self) -> usize {
self.0.start_pattern.len()
}
/// Return the state identifier of the initial anchored state of this NFA.
///
/// The returned identifier is guaranteed to be a valid index into the
/// slice returned by [`NFA::states`], and is also a valid argument to
/// [`NFA::state`].
///
/// # Example
///
/// This example shows a somewhat contrived example where we can easily
/// predict the anchored starting state.
///
/// ```
/// use regex_automata::nfa::thompson::{NFA, State, WhichCaptures};
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build("a")?;
/// let state = nfa.state(nfa.start_anchored());
/// match *state {
/// State::ByteRange { trans } => {
/// assert_eq!(b'a', trans.start);
/// assert_eq!(b'a', trans.end);
/// }
/// _ => unreachable!("unexpected state"),
/// }
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn start_anchored(&self) -> StateID {
self.0.start_anchored
}
/// Return the state identifier of the initial unanchored state of this
/// NFA.
///
/// This is equivalent to the identifier returned by
/// [`NFA::start_anchored`] when the NFA has no unanchored starting state.
///
/// The returned identifier is guaranteed to be a valid index into the
/// slice returned by [`NFA::states`], and is also a valid argument to
/// [`NFA::state`].
///
/// # Example
///
/// This example shows that the anchored and unanchored starting states
/// are equivalent when an anchored NFA is built.
///
/// ```
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa = NFA::new("^a")?;
/// assert_eq!(nfa.start_anchored(), nfa.start_unanchored());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn start_unanchored(&self) -> StateID {
self.0.start_unanchored
}
/// Return the state identifier of the initial anchored state for the given
/// pattern, or `None` if there is no pattern corresponding to the given
/// identifier.
///
/// If one uses the starting state for a particular pattern, then the only
/// match that can be returned is for the corresponding pattern.
///
/// The returned identifier is guaranteed to be a valid index into the
/// slice returned by [`NFA::states`], and is also a valid argument to
/// [`NFA::state`].
///
/// # Errors
///
/// If the pattern doesn't exist in this NFA, then this returns an error.
/// This occurs when `pid.as_usize() >= nfa.pattern_len()`.
///
/// # Example
///
/// This example shows that the anchored and unanchored starting states
/// are equivalent when an anchored NFA is built.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new_many(&["^a", "^b"])?;
/// // The anchored and unanchored states for the entire NFA are the same,
/// // since all of the patterns are anchored.
/// assert_eq!(nfa.start_anchored(), nfa.start_unanchored());
/// // But the anchored starting states for each pattern are distinct,
/// // because these starting states can only lead to matches for the
/// // corresponding pattern.
/// let anchored = Some(nfa.start_anchored());
/// assert_ne!(anchored, nfa.start_pattern(PatternID::must(0)));
/// assert_ne!(anchored, nfa.start_pattern(PatternID::must(1)));
/// // Requesting a pattern not in the NFA will result in None:
/// assert_eq!(None, nfa.start_pattern(PatternID::must(2)));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn start_pattern(&self, pid: PatternID) -> Option<StateID> {
self.0.start_pattern.get(pid.as_usize()).copied()
}
/// Get the byte class set for this NFA.
///
/// A byte class set is a partitioning of this NFA's alphabet into
/// equivalence classes. Any two bytes in the same equivalence class are
/// guaranteed to never discriminate between a match or a non-match. (The
/// partitioning may not be minimal.)
///
/// Byte classes are used internally by this crate when building DFAs.
/// Namely, among other optimizations, they enable a space optimization
/// where the DFA's internal alphabet is defined over the equivalence
/// classes of bytes instead of all possible byte values. The former is
/// often quite a bit smaller than the latter, which permits the DFA to use
/// less space for its transition table.
#[inline]
pub(crate) fn byte_class_set(&self) -> &ByteClassSet {
&self.0.byte_class_set
}
/// Get the byte classes for this NFA.
///
/// Byte classes represent a partitioning of this NFA's alphabet into
/// equivalence classes. Any two bytes in the same equivalence class are
/// guaranteed to never discriminate between a match or a non-match. (The
/// partitioning may not be minimal.)
///
/// Byte classes are used internally by this crate when building DFAs.
/// Namely, among other optimizations, they enable a space optimization
/// where the DFA's internal alphabet is defined over the equivalence
/// classes of bytes instead of all possible byte values. The former is
/// often quite a bit smaller than the latter, which permits the DFA to use
/// less space for its transition table.
///
/// # Example
///
/// This example shows how to query the class of various bytes.
///
/// ```
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa = NFA::new("[a-z]+")?;
/// let classes = nfa.byte_classes();
/// // 'a' and 'z' are in the same class for this regex.
/// assert_eq!(classes.get(b'a'), classes.get(b'z'));
/// // But 'a' and 'A' are not.
/// assert_ne!(classes.get(b'a'), classes.get(b'A'));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn byte_classes(&self) -> &ByteClasses {
&self.0.byte_classes
}
/// Return a reference to the NFA state corresponding to the given ID.
///
/// This is a convenience routine for `nfa.states()[id]`.
///
/// # Panics
///
/// This panics when the given identifier does not reference a valid state.
/// That is, when `id.as_usize() >= nfa.states().len()`.
///
/// # Example
///
/// The anchored state for a pattern will typically correspond to a
/// capturing state for that pattern. (Although, this is not an API
/// guarantee!)
///
/// ```
/// use regex_automata::{nfa::thompson::{NFA, State}, PatternID};
///
/// let nfa = NFA::new("a")?;
/// let state = nfa.state(nfa.start_pattern(PatternID::ZERO).unwrap());
/// match *state {
/// State::Capture { slot, .. } => {
/// assert_eq!(0, slot.as_usize());
/// }
/// _ => unreachable!("unexpected state"),
/// }
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn state(&self, id: StateID) -> &State {
&self.states()[id]
}
/// Returns a slice of all states in this NFA.
///
/// The slice returned is indexed by `StateID`. This provides a convenient
/// way to access states while following transitions among those states.
///
/// # Example
///
/// This demonstrates that disabling UTF-8 mode can shrink the size of the
/// NFA considerably in some cases, especially when using Unicode character
/// classes.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa_unicode = NFA::new(r"\w")?;
/// let nfa_ascii = NFA::new(r"(?-u)\w")?;
/// // Yes, a factor of 45 difference. No lie.
/// assert!(40 * nfa_ascii.states().len() < nfa_unicode.states().len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn states(&self) -> &[State] {
&self.0.states
}
/// Returns the capturing group info for this NFA.
///
/// The [`GroupInfo`] provides a way to map to and from capture index
/// and capture name for each pattern. It also provides a mapping from
/// each of the capturing groups in every pattern to their corresponding
/// slot offsets encoded in [`State::Capture`] states.
///
/// Note that `GroupInfo` uses reference counting internally, such that
/// cloning a `GroupInfo` is very cheap.
///
/// # Example
///
/// This example shows how to get a list of all capture group names for
/// a particular pattern.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new(r"(a)(?P<foo>b)(c)(d)(?P<bar>e)")?;
/// // The first is the implicit group that is always unnammed. The next
/// // 5 groups are the explicit groups found in the concrete syntax above.
/// let expected = vec![None, None, Some("foo"), None, None, Some("bar")];
/// let got: Vec<Option<&str>> =
/// nfa.group_info().pattern_names(PatternID::ZERO).collect();
/// assert_eq!(expected, got);
///
/// // Using an invalid pattern ID will result in nothing yielded.
/// let got = nfa.group_info().pattern_names(PatternID::must(999)).count();
/// assert_eq!(0, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn group_info(&self) -> &GroupInfo {
&self.0.group_info()
}
/// Returns true if and only if this NFA has at least one
/// [`Capture`](State::Capture) in its sequence of states.
///
/// This is useful as a way to perform a quick test before attempting
/// something that does or does not require capture states. For example,
/// some regex engines (like the PikeVM) require capture states in order to
/// work at all.
///
/// # Example
///
/// This example shows a few different NFAs and whether they have captures
/// or not.
///
/// ```
/// use regex_automata::nfa::thompson::{NFA, WhichCaptures};
///
/// // Obviously has capture states.
/// let nfa = NFA::new("(a)")?;
/// assert!(nfa.has_capture());
///
/// // Less obviously has capture states, because every pattern has at
/// // least one anonymous capture group corresponding to the match for the
/// // entire pattern.
/// let nfa = NFA::new("a")?;
/// assert!(nfa.has_capture());
///
/// // Other than hand building your own NFA, this is the only way to build
/// // an NFA without capturing groups. In general, you should only do this
/// // if you don't intend to use any of the NFA-oriented regex engines.
/// // Overall, capturing groups don't have many downsides. Although they
/// // can add a bit of noise to simple NFAs, so it can be nice to disable
/// // them for debugging purposes.
/// //
/// // Notice that 'has_capture' is false here even when we have an
/// // explicit capture group in the pattern.
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build("(a)")?;
/// assert!(!nfa.has_capture());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn has_capture(&self) -> bool {
self.0.has_capture
}
/// Returns true if and only if this NFA can match the empty string.
/// When it returns false, all possible matches are guaranteed to have a
/// non-zero length.
///
/// This is useful as cheap way to know whether code needs to handle the
/// case of a zero length match. This is particularly important when UTF-8
/// modes are enabled, as when UTF-8 mode is enabled, empty matches that
/// split a codepoint must never be reported. This extra handling can
/// sometimes be costly, and since regexes matching an empty string are
/// somewhat rare, it can be beneficial to treat such regexes specially.
///
/// # Example
///
/// This example shows a few different NFAs and whether they match the
/// empty string or not. Notice the empty string isn't merely a matter
/// of a string of length literally `0`, but rather, whether a match can
/// occur between specific pairs of bytes.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, util::syntax};
///
/// // The empty regex matches the empty string.
/// let nfa = NFA::new("")?;
/// assert!(nfa.has_empty(), "empty matches empty");
/// // The '+' repetition operator requires at least one match, and so
/// // does not match the empty string.
/// let nfa = NFA::new("a+")?;
/// assert!(!nfa.has_empty(), "+ does not match empty");
/// // But the '*' repetition operator does.
/// let nfa = NFA::new("a*")?;
/// assert!(nfa.has_empty(), "* does match empty");
/// // And wrapping '+' in an operator that can match an empty string also
/// // causes it to match the empty string too.
/// let nfa = NFA::new("(a+)*")?;
/// assert!(nfa.has_empty(), "+ inside of * matches empty");
///
/// // If a regex is just made of a look-around assertion, even if the
/// // assertion requires some kind of non-empty string around it (such as
/// // \b), then it is still treated as if it matches the empty string.
/// // Namely, if a match occurs of just a look-around assertion, then the
/// // match returned is empty.
/// let nfa = NFA::compiler()
/// .syntax(syntax::Config::new().utf8(false))
/// .build(r"^$\A\z\b\B(?-u:\b\B)")?;
/// assert!(nfa.has_empty(), "assertions match empty");
/// // Even when an assertion is wrapped in a '+', it still matches the
/// // empty string.
/// let nfa = NFA::new(r"\b+")?;
/// assert!(nfa.has_empty(), "+ of an assertion matches empty");
///
/// // An alternation with even one branch that can match the empty string
/// // is also said to match the empty string overall.
/// let nfa = NFA::new("foo|(bar)?|quux")?;
/// assert!(nfa.has_empty(), "alternations can match empty");
///
/// // An NFA that matches nothing does not match the empty string.
/// let nfa = NFA::new("[a&&b]")?;
/// assert!(!nfa.has_empty(), "never matching means not matching empty");
/// // But if it's wrapped in something that doesn't require a match at
/// // all, then it can match the empty string!
/// let nfa = NFA::new("[a&&b]*")?;
/// assert!(nfa.has_empty(), "* on never-match still matches empty");
/// // Since a '+' requires a match, using it on something that can never
/// // match will itself produce a regex that can never match anything,
/// // and thus does not match the empty string.
/// let nfa = NFA::new("[a&&b]+")?;
/// assert!(!nfa.has_empty(), "+ on never-match still matches nothing");
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn has_empty(&self) -> bool {
self.0.has_empty
}
/// Whether UTF-8 mode is enabled for this NFA or not.
///
/// When UTF-8 mode is enabled, all matches reported by a regex engine
/// derived from this NFA are guaranteed to correspond to spans of valid
/// UTF-8. This includes zero-width matches. For example, the regex engine
/// must guarantee that the empty regex will not match at the positions
/// between code units in the UTF-8 encoding of a single codepoint.
///
/// See [`Config::utf8`] for more information.
///
/// This is enabled by default.
///
/// # Example
///
/// This example shows how UTF-8 mode can impact the match spans that may
/// be reported in certain cases.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::{self, pikevm::PikeVM},
/// Match, Input,
/// };
///
/// let re = PikeVM::new("")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// // UTF-8 mode is enabled by default.
/// let mut input = Input::new("☃");
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(Some(Match::must(0, 0..0)), caps.get_match());
///
/// // Even though an empty regex matches at 1..1, our next match is
/// // 3..3 because 1..1 and 2..2 split the snowman codepoint (which is
/// // three bytes long).
/// input.set_start(1);
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());
///
/// // But if we disable UTF-8, then we'll get matches at 1..1 and 2..2:
/// let re = PikeVM::builder()
/// .thompson(thompson::Config::new().utf8(false))
/// .build("")?;
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(Some(Match::must(0, 1..1)), caps.get_match());
///
/// input.set_start(2);
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(Some(Match::must(0, 2..2)), caps.get_match());
///
/// input.set_start(3);
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());
///
/// input.set_start(4);
/// re.search(&mut cache, &input, &mut caps);
/// assert_eq!(None, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn is_utf8(&self) -> bool {
self.0.utf8
}
/// Returns true when this NFA is meant to be matched in reverse.
///
/// Generally speaking, when this is true, it means the NFA is supposed to
/// be used in conjunction with moving backwards through the haystack. That
/// is, from a higher memory address to a lower memory address.
///
/// It is often the case that lower level routines dealing with an NFA
/// don't need to care about whether it is "meant" to be matched in reverse
/// or not. However, there are some specific cases where it matters. For
/// example, the implementation of CRLF-aware `^` and `$` line anchors
/// needs to know whether the search is in the forward or reverse
/// direction. In the forward direction, neither `^` nor `$` should match
/// when a `\r` has been seen previously and a `\n` is next. However, in
/// the reverse direction, neither `^` nor `$` should match when a `\n`
/// has been seen previously and a `\r` is next. This fundamentally changes
/// how the state machine is constructed, and thus needs to be altered
/// based on the direction of the search.
///
/// This is automatically set when using a [`Compiler`] with a configuration
/// where [`Config::reverse`] is enabled. If you're building your own NFA
/// by hand via a [`Builder`]
#[inline]
pub fn is_reverse(&self) -> bool {
self.0.reverse
}
/// Returns true if and only if all starting states for this NFA correspond
/// to the beginning of an anchored search.
///
/// Typically, an NFA will have both an anchored and an unanchored starting
/// state. Namely, because it tends to be useful to have both and the cost
/// of having an unanchored starting state is almost zero (for an NFA).
/// However, if all patterns in the NFA are themselves anchored, then even
/// the unanchored starting state will correspond to an anchored search
/// since the pattern doesn't permit anything else.
///
/// # Example
///
/// This example shows a few different scenarios where this method's
/// return value varies.
///
/// ```
/// use regex_automata::nfa::thompson::NFA;
///
/// // The unanchored starting state permits matching this pattern anywhere
/// // in a haystack, instead of just at the beginning.
/// let nfa = NFA::new("a")?;
/// assert!(!nfa.is_always_start_anchored());
///
/// // In this case, the pattern is itself anchored, so there is no way
/// // to run an unanchored search.
/// let nfa = NFA::new("^a")?;
/// assert!(nfa.is_always_start_anchored());
///
/// // When multiline mode is enabled, '^' can match at the start of a line
/// // in addition to the start of a haystack, so an unanchored search is
/// // actually possible.
/// let nfa = NFA::new("(?m)^a")?;
/// assert!(!nfa.is_always_start_anchored());
///
/// // Weird cases also work. A pattern is only considered anchored if all
/// // matches may only occur at the start of a haystack.
/// let nfa = NFA::new("(^a)|a")?;
/// assert!(!nfa.is_always_start_anchored());
///
/// // When multiple patterns are present, if they are all anchored, then
/// // the NFA is always anchored too.
/// let nfa = NFA::new_many(&["^a", "^b", "^c"])?;
/// assert!(nfa.is_always_start_anchored());
///
/// // But if one pattern is unanchored, then the NFA must permit an
/// // unanchored search.
/// let nfa = NFA::new_many(&["^a", "b", "^c"])?;
/// assert!(!nfa.is_always_start_anchored());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn is_always_start_anchored(&self) -> bool {
self.start_anchored() == self.start_unanchored()
}
/// Returns the look-around matcher associated with this NFA.
///
/// A look-around matcher determines how to match look-around assertions.
/// In particular, some assertions are configurable. For example, the
/// `(?m:^)` and `(?m:$)` assertions can have their line terminator changed
/// from the default of `\n` to any other byte.
///
/// If the NFA was built using a [`Compiler`], then this matcher
/// can be set via the [`Config::look_matcher`] configuration
/// knob. Otherwise, if you've built an NFA by hand, it is set via
/// [`Builder::set_look_matcher`].
///
/// # Example
///
/// This shows how to change the line terminator for multi-line assertions.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::{self, pikevm::PikeVM},
/// util::look::LookMatcher,
/// Match, Input,
/// };
///
/// let mut lookm = LookMatcher::new();
/// lookm.set_line_terminator(b'\x00');
///
/// let re = PikeVM::builder()
/// .thompson(thompson::Config::new().look_matcher(lookm))
/// .build(r"(?m)^[a-z]+$")?;
/// let mut cache = re.create_cache();
///
/// // Multi-line assertions now use NUL as a terminator.
/// assert_eq!(
/// Some(Match::must(0, 1..4)),
/// re.find(&mut cache, b"\x00abc\x00"),
/// );
/// // ... and \n is no longer recognized as a terminator.
/// assert_eq!(
/// None,
/// re.find(&mut cache, b"\nabc\n"),
/// );
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn look_matcher(&self) -> &LookMatcher {
&self.0.look_matcher
}
/// Returns the union of all look-around assertions used throughout this
/// NFA. When the returned set is empty, it implies that the NFA has no
/// look-around assertions and thus zero conditional epsilon transitions.
///
/// This is useful in some cases enabling optimizations. It is not
/// unusual, for example, for optimizations to be of the form, "for any
/// regex with zero conditional epsilon transitions, do ..." where "..."
/// is some kind of optimization.
///
/// This isn't only helpful for optimizations either. Sometimes look-around
/// assertions are difficult to support. For example, many of the DFAs in
/// this crate don't support Unicode word boundaries or handle them using
/// heuristics. Handling that correctly typically requires some kind of
/// cheap check of whether the NFA has a Unicode word boundary in the first
/// place.
///
/// # Example
///
/// This example shows how this routine varies based on the regex pattern:
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, util::look::Look};
///
/// // No look-around at all.
/// let nfa = NFA::new("a")?;
/// assert!(nfa.look_set_any().is_empty());
///
/// // When multiple patterns are present, since this returns the union,
/// // it will include look-around assertions that only appear in one
/// // pattern.
/// let nfa = NFA::new_many(&["a", "b", "a^b", "c"])?;
/// assert!(nfa.look_set_any().contains(Look::Start));
///
/// // Some groups of assertions have various shortcuts. For example:
/// let nfa = NFA::new(r"(?-u:\b)")?;
/// assert!(nfa.look_set_any().contains_word());
/// assert!(!nfa.look_set_any().contains_word_unicode());
/// assert!(nfa.look_set_any().contains_word_ascii());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn look_set_any(&self) -> LookSet {
self.0.look_set_any
}
/// Returns the union of all prefix look-around assertions for every
/// pattern in this NFA. When the returned set is empty, it implies none of
/// the patterns require moving through a conditional epsilon transition
/// before inspecting the first byte in the haystack.
///
/// This can be useful for determining what kinds of assertions need to be
/// satisfied at the beginning of a search. For example, typically DFAs
/// in this crate will build a distinct starting state for each possible
/// starting configuration that might result in look-around assertions
/// being satisfied differently. However, if the set returned here is
/// empty, then you know that the start state is invariant because there
/// are no conditional epsilon transitions to consider.
///
/// # Example
///
/// This example shows how this routine varies based on the regex pattern:
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, util::look::Look};
///
/// // No look-around at all.
/// let nfa = NFA::new("a")?;
/// assert!(nfa.look_set_prefix_any().is_empty());
///
/// // When multiple patterns are present, since this returns the union,
/// // it will include look-around assertions that only appear in one
/// // pattern. But it will only include assertions that are in the prefix
/// // of a pattern. For example, this includes '^' but not '$' even though
/// // '$' does appear.
/// let nfa = NFA::new_many(&["a", "b", "^ab$", "c"])?;
/// assert!(nfa.look_set_prefix_any().contains(Look::Start));
/// assert!(!nfa.look_set_prefix_any().contains(Look::End));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn look_set_prefix_any(&self) -> LookSet {
self.0.look_set_prefix_any
}
// FIXME: The `look_set_prefix_all` computation was not correct, and it
// seemed a little tricky to fix it. Since I wasn't actually using it for
// anything, I just decided to remove it in the run up to the regex 1.9
// release. If you need this, please file an issue.
/*
/// Returns the intersection of all prefix look-around assertions for every
/// pattern in this NFA. When the returned set is empty, it implies at
/// least one of the patterns does not require moving through a conditional
/// epsilon transition before inspecting the first byte in the haystack.
/// Conversely, when the set contains an assertion, it implies that every
/// pattern in the NFA also contains that assertion in its prefix.
///
/// This can be useful for determining what kinds of assertions need to be
/// satisfied at the beginning of a search. For example, if you know that
/// [`Look::Start`] is in the prefix intersection set returned here, then
/// you know that all searches, regardless of input configuration, will be
/// anchored.
///
/// # Example
///
/// This example shows how this routine varies based on the regex pattern:
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, util::look::Look};
///
/// // No look-around at all.
/// let nfa = NFA::new("a")?;
/// assert!(nfa.look_set_prefix_all().is_empty());
///
/// // When multiple patterns are present, since this returns the
/// // intersection, it will only include assertions present in every
/// // prefix, and only the prefix.
/// let nfa = NFA::new_many(&["^a$", "^b$", "$^ab$", "^c$"])?;
/// assert!(nfa.look_set_prefix_all().contains(Look::Start));
/// assert!(!nfa.look_set_prefix_all().contains(Look::End));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn look_set_prefix_all(&self) -> LookSet {
self.0.look_set_prefix_all
}
*/
/// Returns the memory usage, in bytes, of this NFA.
///
/// This does **not** include the stack size used up by this NFA. To
/// compute that, use `std::mem::size_of::<NFA>()`.
///
/// # Example
///
/// This example shows that large Unicode character classes can use quite
/// a bit of memory.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa_unicode = NFA::new(r"\w")?;
/// let nfa_ascii = NFA::new(r"(?-u:\w)")?;
///
/// assert!(10 * nfa_ascii.memory_usage() < nfa_unicode.memory_usage());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn memory_usage(&self) -> usize {
use core::mem::size_of;
size_of::<Inner>() // allocated on the heap via Arc
+ self.0.states.len() * size_of::<State>()
+ self.0.start_pattern.len() * size_of::<StateID>()
+ self.0.group_info.memory_usage()
+ self.0.memory_extra
}
}
impl fmt::Debug for NFA {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
self.0.fmt(f)
}
}
/// The "inner" part of the NFA. We split this part out so that we can easily
/// wrap it in an `Arc` above in the definition of `NFA`.
///
/// See builder.rs for the code that actually builds this type. This module
/// does provide (internal) mutable methods for adding things to this
/// NFA before finalizing it, but the high level construction process is
/// controlled by the builder abstraction. (Which is complicated enough to
/// get its own module.)
#[derive(Default)]
pub(super) struct Inner {
/// The state sequence. This sequence is guaranteed to be indexable by all
/// starting state IDs, and it is also guaranteed to contain at most one
/// `Match` state for each pattern compiled into this NFA. (A pattern may
/// not have a corresponding `Match` state if a `Match` state is impossible
/// to reach.)
states: Vec<State>,
/// The anchored starting state of this NFA.
start_anchored: StateID,
/// The unanchored starting state of this NFA.
start_unanchored: StateID,
/// The starting states for each individual pattern. Starting at any
/// of these states will result in only an anchored search for the
/// corresponding pattern. The vec is indexed by pattern ID. When the NFA
/// contains a single regex, then `start_pattern[0]` and `start_anchored`
/// are always equivalent.
start_pattern: Vec<StateID>,
/// Info about the capturing groups in this NFA. This is responsible for
/// mapping groups to slots, mapping groups to names and names to groups.
group_info: GroupInfo,
/// A representation of equivalence classes over the transitions in this
/// NFA. Two bytes in the same equivalence class must not discriminate
/// between a match or a non-match. This map can be used to shrink the
/// total size of a DFA's transition table with a small match-time cost.
///
/// Note that the NFA's transitions are *not* defined in terms of these
/// equivalence classes. The NFA's transitions are defined on the original
/// byte values. For the most part, this is because they wouldn't really
/// help the NFA much since the NFA already uses a sparse representation
/// to represent transitions. Byte classes are most effective in a dense
/// representation.
byte_class_set: ByteClassSet,
/// This is generated from `byte_class_set`, and essentially represents the
/// same thing but supports different access patterns. Namely, this permits
/// looking up the equivalence class of a byte very cheaply.
///
/// Ideally we would just store this, but because of annoying code
/// structure reasons, we keep both this and `byte_class_set` around for
/// now. I think I would prefer that `byte_class_set` were computed in the
/// `Builder`, but right now, we compute it as states are added to the
/// `NFA`.
byte_classes: ByteClasses,
/// Whether this NFA has a `Capture` state anywhere.
has_capture: bool,
/// When the empty string is in the language matched by this NFA.
has_empty: bool,
/// Whether UTF-8 mode is enabled for this NFA. Briefly, this means that
/// all non-empty matches produced by this NFA correspond to spans of valid
/// UTF-8, and any empty matches produced by this NFA that split a UTF-8
/// encoded codepoint should be filtered out by the corresponding regex
/// engine.
utf8: bool,
/// Whether this NFA is meant to be matched in reverse or not.
reverse: bool,
/// The matcher to be used for look-around assertions.
look_matcher: LookMatcher,
/// The union of all look-around assertions that occur anywhere within
/// this NFA. If this set is empty, then it means there are precisely zero
/// conditional epsilon transitions in the NFA.
look_set_any: LookSet,
/// The union of all look-around assertions that occur as a zero-length
/// prefix for any of the patterns in this NFA.
look_set_prefix_any: LookSet,
/*
/// The intersection of all look-around assertions that occur as a
/// zero-length prefix for any of the patterns in this NFA.
look_set_prefix_all: LookSet,
*/
/// Heap memory used indirectly by NFA states and other things (like the
/// various capturing group representations above). Since each state
/// might use a different amount of heap, we need to keep track of this
/// incrementally.
memory_extra: usize,
}
impl Inner {
/// Runs any last finalization bits and turns this into a full NFA.
pub(super) fn into_nfa(mut self) -> NFA {
self.byte_classes = self.byte_class_set.byte_classes();
// Do epsilon closure from the start state of every pattern in order
// to compute various properties such as look-around assertions and
// whether the empty string can be matched.
let mut stack = vec![];
let mut seen = SparseSet::new(self.states.len());
for &start_id in self.start_pattern.iter() {
stack.push(start_id);
seen.clear();
// let mut prefix_all = LookSet::full();
let mut prefix_any = LookSet::empty();
while let Some(sid) = stack.pop() {
if !seen.insert(sid) {
continue;
}
match self.states[sid] {
State::ByteRange { .. }
| State::Dense { .. }
| State::Fail => continue,
State::Sparse(_) => {
// This snippet below will rewrite this sparse state
// as a dense state. By doing it here, we apply this
// optimization to all hot "sparse" states since these
// are the states that are reachable from the start
// state via an epsilon closure.
//
// Unfortunately, this optimization did not seem to
// help much in some very limited ad hoc benchmarking.
//
// I left the 'Dense' state type in place in case we
// want to revisit this, but I suspect the real way
// to make forward progress is a more fundamental
// rearchitecting of how data in the NFA is laid out.
// I think we should consider a single contiguous
// allocation instead of all this indirection and
// potential heap allocations for every state. But this
// is a large re-design and will require API breaking
// changes.
// self.memory_extra -= self.states[sid].memory_usage();
// let trans = DenseTransitions::from_sparse(sparse);
// self.states[sid] = State::Dense(trans);
// self.memory_extra += self.states[sid].memory_usage();
continue;
}
State::Match { .. } => self.has_empty = true,
State::Look { look, next } => {
prefix_any = prefix_any.insert(look);
stack.push(next);
}
State::Union { ref alternates } => {
// Order doesn't matter here, since we're just dealing
// with look-around sets. But if we do richer analysis
// here that needs to care about preference order, then
// this should be done in reverse.
stack.extend(alternates.iter());
}
State::BinaryUnion { alt1, alt2 } => {
stack.push(alt2);
stack.push(alt1);
}
State::Capture { next, .. } => {
stack.push(next);
}
}
}
self.look_set_prefix_any =
self.look_set_prefix_any.union(prefix_any);
}
NFA(Arc::new(self))
}
/// Returns the capturing group info for this NFA.
pub(super) fn group_info(&self) -> &GroupInfo {
&self.group_info
}
/// Add the given state to this NFA after allocating a fresh identifier for
/// it.
///
/// This panics if too many states are added such that a fresh identifier
/// could not be created. (Currently, the only caller of this routine is
/// a `Builder`, and it upholds this invariant.)
pub(super) fn add(&mut self, state: State) -> StateID {
match state {
State::ByteRange { ref trans } => {
self.byte_class_set.set_range(trans.start, trans.end);
}
State::Sparse(ref sparse) => {
for trans in sparse.transitions.iter() {
self.byte_class_set.set_range(trans.start, trans.end);
}
}
State::Dense { .. } => unreachable!(),
State::Look { look, .. } => {
self.look_matcher
.add_to_byteset(look, &mut self.byte_class_set);
self.look_set_any = self.look_set_any.insert(look);
}
State::Capture { .. } => {
self.has_capture = true;
}
State::Union { .. }
| State::BinaryUnion { .. }
| State::Fail
| State::Match { .. } => {}
}
let id = StateID::new(self.states.len()).unwrap();
self.memory_extra += state.memory_usage();
self.states.push(state);
id
}
/// Set the starting state identifiers for this NFA.
///
/// `start_anchored` and `start_unanchored` may be equivalent. When they
/// are, then the NFA can only execute anchored searches. This might
/// occur, for example, for patterns that are unconditionally anchored.
/// e.g., `^foo`.
pub(super) fn set_starts(
&mut self,
start_anchored: StateID,
start_unanchored: StateID,
start_pattern: &[StateID],
) {
self.start_anchored = start_anchored;
self.start_unanchored = start_unanchored;
self.start_pattern = start_pattern.to_vec();
}
/// Sets the UTF-8 mode of this NFA.
pub(super) fn set_utf8(&mut self, yes: bool) {
self.utf8 = yes;
}
/// Sets the reverse mode of this NFA.
pub(super) fn set_reverse(&mut self, yes: bool) {
self.reverse = yes;
}
/// Sets the look-around assertion matcher for this NFA.
pub(super) fn set_look_matcher(&mut self, m: LookMatcher) {
self.look_matcher = m;
}
/// Set the capturing groups for this NFA.
///
/// The given slice should contain the capturing groups for each pattern,
/// The capturing groups in turn should correspond to the total number of
/// capturing groups in the pattern, including the anonymous first capture
/// group for each pattern. If a capturing group does have a name, then it
/// should be provided as a Arc<str>.
///
/// This returns an error if a corresponding `GroupInfo` could not be
/// built.
pub(super) fn set_captures(
&mut self,
captures: &[Vec<Option<Arc<str>>>],
) -> Result<(), GroupInfoError> {
self.group_info = GroupInfo::new(
captures.iter().map(|x| x.iter().map(|y| y.as_ref())),
)?;
Ok(())
}
/// Remap the transitions in every state of this NFA using the given map.
/// The given map should be indexed according to state ID namespace used by
/// the transitions of the states currently in this NFA.
///
/// This is particularly useful to the NFA builder, since it is convenient
/// to add NFA states in order to produce their final IDs. Then, after all
/// of the intermediate "empty" states (unconditional epsilon transitions)
/// have been removed from the builder's representation, we can re-map all
/// of the transitions in the states already added to their final IDs.
pub(super) fn remap(&mut self, old_to_new: &[StateID]) {
for state in &mut self.states {
state.remap(old_to_new);
}
self.start_anchored = old_to_new[self.start_anchored];
self.start_unanchored = old_to_new[self.start_unanchored];
for id in self.start_pattern.iter_mut() {
*id = old_to_new[*id];
}
}
}
impl fmt::Debug for Inner {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
writeln!(f, "thompson::NFA(")?;
for (sid, state) in self.states.iter().with_state_ids() {
let status = if sid == self.start_anchored {
'^'
} else if sid == self.start_unanchored {
'>'
} else {
' '
};
writeln!(f, "{}{:06?}: {:?}", status, sid.as_usize(), state)?;
}
let pattern_len = self.start_pattern.len();
if pattern_len > 1 {
writeln!(f, "")?;
for pid in 0..pattern_len {
let sid = self.start_pattern[pid];
writeln!(f, "START({:06?}): {:?}", pid, sid.as_usize())?;
}
}
writeln!(f, "")?;
writeln!(
f,
"transition equivalence classes: {:?}",
self.byte_classes,
)?;
writeln!(f, ")")?;
Ok(())
}
}
/// A state in an NFA.
///
/// In theory, it can help to conceptualize an `NFA` as a graph consisting of
/// `State`s. Each `State` contains its complete set of outgoing transitions.
///
/// In practice, it can help to conceptualize an `NFA` as a sequence of
/// instructions for a virtual machine. Each `State` says what to do and where
/// to go next.
///
/// Strictly speaking, the practical interpretation is the most correct one,
/// because of the [`Capture`](State::Capture) state. Namely, a `Capture`
/// state always forwards execution to another state unconditionally. Its only
/// purpose is to cause a side effect: the recording of the current input
/// position at a particular location in memory. In this sense, an `NFA`
/// has more power than a theoretical non-deterministic finite automaton.
///
/// For most uses of this crate, it is likely that one may never even need to
/// be aware of this type at all. The main use cases for looking at `State`s
/// directly are if you need to write your own search implementation or if you
/// need to do some kind of analysis on the NFA.
#[derive(Clone, Eq, PartialEq)]
pub enum State {
/// A state with a single transition that can only be taken if the current
/// input symbol is in a particular range of bytes.
ByteRange {
/// The transition from this state to the next.
trans: Transition,
},
/// A state with possibly many transitions represented in a sparse fashion.
/// Transitions are non-overlapping and ordered lexicographically by input
/// range.
///
/// In practice, this is used for encoding UTF-8 automata. Its presence is
/// primarily an optimization that avoids many additional unconditional
/// epsilon transitions (via [`Union`](State::Union) states), and thus
/// decreases the overhead of traversing the NFA. This can improve both
/// matching time and DFA construction time.
Sparse(SparseTransitions),
/// A dense representation of a state with multiple transitions.
Dense(DenseTransitions),
/// A conditional epsilon transition satisfied via some sort of
/// look-around. Look-around is limited to anchor and word boundary
/// assertions.
///
/// Look-around states are meant to be evaluated while performing epsilon
/// closure (computing the set of states reachable from a particular state
/// via only epsilon transitions). If the current position in the haystack
/// satisfies the look-around assertion, then you're permitted to follow
/// that epsilon transition.
Look {
/// The look-around assertion that must be satisfied before moving
/// to `next`.
look: Look,
/// The state to transition to if the look-around assertion is
/// satisfied.
next: StateID,
},
/// An alternation such that there exists an epsilon transition to all
/// states in `alternates`, where matches found via earlier transitions
/// are preferred over later transitions.
Union {
/// An ordered sequence of unconditional epsilon transitions to other
/// states. Transitions earlier in the sequence are preferred over
/// transitions later in the sequence.
alternates: Box<[StateID]>,
},
/// An alternation such that there exists precisely two unconditional
/// epsilon transitions, where matches found via `alt1` are preferred over
/// matches found via `alt2`.
///
/// This state exists as a common special case of Union where there are
/// only two alternates. In this case, we don't need any allocations to
/// represent the state. This saves a bit of memory and also saves an
/// additional memory access when traversing the NFA.
BinaryUnion {
/// An unconditional epsilon transition to another NFA state. This
/// is preferred over `alt2`.
alt1: StateID,
/// An unconditional epsilon transition to another NFA state. Matches
/// reported via this transition should only be reported if no matches
/// were found by following `alt1`.
alt2: StateID,
},
/// An empty state that records a capture location.
///
/// From the perspective of finite automata, this is precisely equivalent
--> --------------------
--> maximum size reached
--> --------------------
[ zur Elbe Produktseite wechseln0.44Quellennavigators
Analyse erneut starten
]
|
2026-04-04
|