/*!
A DFA that can return spans for matching capturing groups.
This module is the home of a [one-pass DFA](DFA).
This module also contains a [`Builder`] and a [`Config`] for building and
configuring a one-pass DFA.
*/
// A note on naming and credit:
//
// As far as I know, Russ Cox came up with the practical vision and
// implementation of a "one-pass regex engine." He mentions and describes it
// briefly in the third article of his regexp article series:
//
// https://swtch.com/~rsc/regexp/regexp3.html
//
// Cox's implementation is in RE2, and the implementation below is most
// heavily inspired by RE2's. The key thing they have in common is that
// their transitions are defined over an alphabet of bytes. In contrast,
// Go's regex engine also has a one-pass engine, but its transitions are
// more firmly rooted in Unicode codepoints. The ideas are the same, but the
// implementations are different.
//
// RE2 tends to call this a "one-pass NFA." Here, we call it a "one-pass DFA."
// They're both true in their own ways:
//
// * The "one-pass" criterion is generally a property of the NFA itself. In
// particular, it is said that an NFA is one-pass if, after each byte of input
// during a search, there is at most one "VM thread" remaining to take for the
// next byte of input. That is, there is never any ambiguity as to the path to
// take through the NFA during a search.
//
// * On the other hand, once a one-pass NFA has its representation converted
// to something where a constant number of instructions is used for each byte
// of input, the implementation looks a lot more like a DFA. It's technically
// more powerful than a DFA since it has side effects (storing offsets inside
// of slots activated by a transition), but it is far closer to a DFA than an
// NFA simulation.
//
// Thus, in this crate, we call it a one-pass DFA.
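//
// To make the "side effects" point concrete, here is a minimal hand-written
// sketch of a one-pass automaton for the anchored pattern '(a+)(b+)'. It is
// only an illustration and is NOT how this module represents its states: the
// function name, the state numbering and the 4-bit slot mask are all
// hypothetical. Each transition carries a mask of capture slots to set to
// the current offset, and exactly one transition is followed per byte.
#[allow(dead_code)]
fn sketch_onepass_groups(haystack: &[u8]) -> Option<[Option<usize>; 4]> {
    // Returns (next state, slots to save) for a (state, byte) pair. State 0
    // is the dead state, state 1 is the start state and state 3 is the only
    // match state. Slots 0 and 1 delimit group 1, slots 2 and 3 group 2.
    fn step(state: u8, byte: u8) -> (u8, u8) {
        match (state, byte) {
            (1, b'a') => (2, 0b0001), // group 1 starts here
            (2, b'a') => (2, 0b0000),
            (2, b'b') => (3, 0b0110), // group 1 ends, group 2 starts
            (3, b'b') => (3, 0b0000),
            _ => (0, 0b0000),
        }
    }

    let mut slots: [Option<usize>; 4] = [None; 4];
    let mut last_match = None;
    let mut state = 1u8;
    for (at, &byte) in haystack.iter().enumerate() {
        let (next, mask) = step(state, byte);
        if next == 0 {
            break;
        }
        // The side effect: record the current offset in every slot that
        // this transition activates.
        for i in 0..4 {
            if (mask & (1u8 << i)) != 0 {
                slots[i] = Some(at);
            }
        }
        state = next;
        if state == 3 {
            // Group 2 ends just after the byte we consumed.
            let mut found = slots;
            found[3] = Some(at + 1);
            last_match = Some(found);
        }
    }
    last_match
}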
use alloc::{vec, vec::Vec};
use crate::{
dfa::{remapper::Remapper, DEAD},
nfa::thompson::{self, NFA},
util::{
alphabet::ByteClasses,
captures::Captures,
escape::DebugByte,
int::{Usize, U32, U64, U8},
look::{Look, LookSet, UnicodeWordBoundaryError},
primitives::{NonMaxUsize, PatternID, StateID},
search::{Anchored, Input, Match, MatchError, MatchKind, Span},
sparse_set::SparseSet,
},
};
/// The configuration used for building a [one-pass DFA](DFA).
///
/// A one-pass DFA configuration is a simple data object that is typically used
/// with [`Builder::configure`]. It can be cheaply cloned.
///
/// A default configuration can be created either with `Config::new`, or
/// perhaps more conveniently, with [`DFA::config`].
#[derive(Clone, Debug, Default)]
pub struct Config {
match_kind: Option<MatchKind>,
starts_for_each_pattern: Option<bool>,
byte_classes: Option<bool>,
size_limit: Option<Option<usize>>,
}
impl Config {
/// Return a new default one-pass DFA configuration.
pub fn new() -> Config {
Config::default()
}
/// Set the desired match semantics.
///
/// The default is [`MatchKind::LeftmostFirst`], which corresponds to the
/// match semantics of Perl-like regex engines. That is, when multiple
/// patterns would match at the same leftmost position, the pattern that
/// appears first in the concrete syntax is chosen.
///
/// Currently, the only other kind of match semantics supported is
/// [`MatchKind::All`]. This corresponds to "classical DFA" construction
/// where all possible matches are visited.
///
/// When it comes to the one-pass DFA, it is rarer for preference order and
/// "longest match" to actually disagree, since if they did disagree, then
/// the regex typically isn't one-pass. For example, searching `Samwise`
/// for `Sam|Samwise` will report `Sam` for leftmost-first matching and
/// `Samwise` for "longest match" or "all" matching. However, this regex is
/// not one-pass if taken literally. The equivalent regex, `Sam(?:|wise)`
/// is one-pass and `Sam|Samwise` may be optimized to it.
///
/// The other main difference is that "all" match semantics don't support
/// non-greedy matches. "All" match semantics always try to match as much
/// as possible.
pub fn match_kind(mut self, kind: MatchKind) -> Config {
self.match_kind = Some(kind);
self
}
/// Whether to compile a separate start state for each pattern in the
/// one-pass DFA.
///
/// When enabled, a separate **anchored** start state is added for each
/// pattern in the DFA. When this start state is used, then the DFA will
/// only search for matches for the pattern specified, even if there are
/// other patterns in the DFA.
///
/// The main downside of this option is that it can potentially increase
/// the size of the DFA and/or increase the time it takes to build the DFA.
///
/// You might want to enable this option when you want to both search for
/// anchored matches of any pattern and search for anchored matches of
/// one particular pattern while using the same DFA. (Otherwise, you would
/// need to compile a new DFA for each pattern.)
///
/// By default this is disabled.
///
/// # Example
///
/// This example shows how to build a multi-regex and then search for
/// matches for any of the patterns or matches for a specific pattern.
///
/// ```
/// use regex_automata::{
/// dfa::onepass::DFA, Anchored, Input, Match, PatternID,
/// };
///
/// let re = DFA::builder()
/// .configure(DFA::config().starts_for_each_pattern(true))
/// .build_many(&["[a-z]+", "[0-9]+"])?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
/// let haystack = "123abc";
/// let input = Input::new(haystack).anchored(Anchored::Yes);
///
/// // A normal multi-pattern search will show pattern 1 matches.
/// re.try_search(&mut cache, &input, &mut caps)?;
/// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
///
/// // If we only want to report pattern 0 matches, then we'll get no
/// // match here.
/// let input = input.anchored(Anchored::Pattern(PatternID::must(0)));
/// re.try_search(&mut cache, &input, &mut caps)?;
/// assert_eq!(None, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn starts_for_each_pattern(mut self, yes: bool) -> Config {
self.starts_for_each_pattern = Some(yes);
self
}
/// Whether to attempt to shrink the size of the DFA's alphabet or not.
///
/// This option is enabled by default and should never be disabled unless
/// one is debugging a one-pass DFA.
///
/// When enabled, the DFA will use a map from all possible bytes to their
/// corresponding equivalence class. Each equivalence class represents a
/// set of bytes that does not discriminate between a match and a non-match
/// in the DFA. For example, the pattern `[ab]+` has at least two
/// equivalence classes: a set containing `a` and `b` and a set containing
/// every byte except for `a` and `b`. `a` and `b` are in the same
/// equivalence class because they never discriminate between a match and a
/// non-match.
///
/// The advantage of this map is that the size of the transition table
/// can be reduced drastically from (approximately) `#states * 256 *
/// sizeof(StateID)` to `#states * k * sizeof(StateID)` where `k` is the
/// number of equivalence classes (rounded up to the nearest power of 2).
/// As a result, total space usage can decrease substantially. Moreover,
/// since a smaller alphabet is used, DFA compilation becomes faster as
/// well.
///
/// **WARNING:** This is only useful for debugging DFAs. Disabling this
/// does not yield any speed advantages. Namely, even when this is
/// disabled, a byte class map is still used while searching. The only
/// difference is that every byte will be forced into its own distinct
/// equivalence class. This is useful for debugging the actual generated
/// transitions because it lets one see the transitions defined on actual
/// bytes instead of the equivalence classes.
pub fn byte_classes(mut self, yes: bool) -> Config {
self.byte_classes = Some(yes);
self
}
/// Set a size limit on the total heap used by a one-pass DFA.
///
/// This size limit is expressed in bytes and is applied during
/// construction of a one-pass DFA. If the DFA's heap usage exceeds
/// this configured limit, then construction is stopped and an error is
/// returned.
///
/// The default is no limit.
///
/// # Example
///
/// This example shows a one-pass DFA that fails to build because of
/// a configured size limit. This particular example also serves as a
/// cautionary tale demonstrating just how big DFAs with large Unicode
/// character classes can get.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{dfa::onepass::DFA, Match};
///
/// // 6MB isn't enough!
/// DFA::builder()
/// .configure(DFA::config().size_limit(Some(6_000_000)))
/// .build(r"\w{20}")
/// .unwrap_err();
///
/// // ... but 7MB probably is!
/// // (Note that DFA sizes aren't necessarily stable between releases.)
/// let re = DFA::builder()
/// .configure(DFA::config().size_limit(Some(7_000_000)))
/// .build(r"\w{20}")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
/// let haystack = "A".repeat(20);
/// re.captures(&mut cache, &haystack, &mut caps);
/// assert_eq!(Some(Match::must(0, 0..20)), caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// While one needs a little more than 6MB to represent `\w{20}`, it
/// turns out that you only need a little more than 4KB to represent
/// `(?-u:\w{20})`. So only use Unicode if you need it!
pub fn size_limit(mut self, limit: Option<usize>) -> Config {
self.size_limit = Some(limit);
self
}
/// Returns the match semantics set in this configuration.
pub fn get_match_kind(&self) -> MatchKind {
self.match_kind.unwrap_or(MatchKind::LeftmostFirst)
}
/// Returns whether this configuration has enabled anchored starting states
/// for every pattern in the DFA.
pub fn get_starts_for_each_pattern(&self) -> bool {
self.starts_for_each_pattern.unwrap_or(false)
}
/// Returns whether this configuration has enabled byte classes or not.
/// This is typically a debugging oriented option, as disabling it confers
/// no speed benefit.
pub fn get_byte_classes(&self) -> bool {
self.byte_classes.unwrap_or(true)
}
/// Returns the DFA size limit of this configuration if one was set.
/// The size limit is the total number of bytes on the heap that a DFA is
/// permitted to use. If the DFA exceeds this limit during construction,
/// then construction is stopped and an error is returned.
pub fn get_size_limit(&self) -> Option<usize> {
self.size_limit.unwrap_or(None)
}
/// Overwrite the default configuration such that the options in `o` are
/// always used. If an option in `o` is not set, then the corresponding
/// option in `self` is used. If it's not set in `self` either, then it
/// remains not set.
pub(crate) fn overwrite(&self, o: Config) -> Config {
Config {
match_kind: o.match_kind.or(self.match_kind),
starts_for_each_pattern: o
.starts_for_each_pattern
.or(self.starts_for_each_pattern),
byte_classes: o.byte_classes.or(self.byte_classes),
size_limit: o.size_limit.or(self.size_limit),
}
}
}
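// A minimal sketch of the byte class idea described for
// 'Config::byte_classes' above. This is an illustration under stated
// assumptions and not the 'ByteClasses' implementation: both function names
// are hypothetical, and the sketch only ever splits the alphabet at range
// boundaries, so every class is a contiguous run of bytes. For '[ab]+' that
// yields three classes (bytes below 'a', the bytes 'a' and 'b', and bytes
// above 'b'), which is consistent with "at least two" in the docs above.
#[allow(dead_code)]
fn sketch_byte_classes(ranges: &[(u8, u8)]) -> ([u8; 256], usize) {
    // ends_class[b] is true when byte b is the last byte of its class.
    let mut ends_class = [false; 256];
    for &(start, end) in ranges {
        if start > 0 {
            ends_class[usize::from(start) - 1] = true;
        }
        ends_class[usize::from(end)] = true;
    }
    // Walk the alphabet in order, starting a new class after each boundary.
    let mut map = [0u8; 256];
    let mut class = 0u8;
    for byte in 0..256 {
        map[byte] = class;
        if ends_class[byte] && byte < 255 {
            class += 1;
        }
    }
    (map, usize::from(class) + 1)
}

// The number of transition table entries for the given number of states,
// with the class count rounded up to the nearest power of 2 as described
// above. Passing 256 as the class count models a table without byte classes.
#[allow(dead_code)]
fn sketch_table_entries(states: usize, class_count: usize) -> usize {
    states * class_count.next_power_of_two()
}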
/// A builder for a [one-pass DFA](DFA).
///
/// This builder permits configuring options for the syntax of a pattern, the
/// NFA construction and the DFA construction. This builder is different from a
/// general purpose regex builder in that it permits fine-grained configuration
/// of the construction process. The trade-off for this is complexity, and
/// the possibility of setting a configuration that might not make sense. For
/// example, there are two different UTF-8 modes:
///
/// * [`syntax::Config::utf8`](crate::util::syntax::Config::utf8) controls
/// whether the pattern itself can contain sub-expressions that match invalid
/// UTF-8.
/// * [`thompson::Config::utf8`] controls whether empty matches that split a
/// Unicode codepoint are reported or not.
///
/// Generally speaking, callers will want to either enable all of these or
/// disable all of these.
///
/// # Example
///
/// This example shows how to disable UTF-8 mode in the syntax and the NFA.
/// This is generally what you want for matching on arbitrary bytes.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{
/// dfa::onepass::DFA,
/// nfa::thompson,
/// util::syntax,
/// Match,
/// };
///
/// let re = DFA::builder()
/// .syntax(syntax::Config::new().utf8(false))
/// .thompson(thompson::Config::new().utf8(false))
/// .build(r"foo(?-u:[^b])ar.*")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";
/// re.captures(&mut cache, haystack, &mut caps);
/// // Notice that `(?-u:[^b])` matches invalid UTF-8,
/// // but the subsequent `.*` does not! Disabling UTF-8
/// // on the syntax permits this.
/// //
/// // N.B. This example does not show the impact of
/// // disabling UTF-8 mode on the Thompson NFA Config,
/// // since that only impacts regexes that can
/// // produce matches of length 0.
/// assert_eq!(Some(Match::must(0, 0..8)), caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone, Debug)]
pub struct Builder {
config: Config,
#[cfg(feature = "syntax")]
thompson: thompson::Compiler,
}
impl Builder {
/// Create a new one-pass DFA builder with the default configuration.
pub fn new() -> Builder {
Builder {
config: Config::default(),
#[cfg(feature = "syntax")]
thompson: thompson::Compiler::new(),
}
}
/// Build a one-pass DFA from the given pattern.
///
/// If there was a problem parsing or compiling the pattern, then an error
/// is returned.
#[cfg(feature = "syntax")]
pub fn build(&self, pattern: &str) -> Result<DFA, BuildError> {
self.build_many(&[pattern])
}
/// Build a one-pass DFA from the given patterns.
///
/// When matches are returned, the pattern ID corresponds to the index of
/// the pattern in the slice given.
#[cfg(feature = "syntax")]
pub fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<DFA, BuildError> {
let nfa =
self.thompson.build_many(patterns).map_err(BuildError::nfa)?;
self.build_from_nfa(nfa)
}
/// Build a DFA from the given NFA.
///
/// # Example
///
/// This example shows how to build a DFA if you already have an NFA in
/// hand.
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, nfa::thompson::NFA, Match};
///
/// // This shows how to set non-default options for building an NFA.
/// let nfa = NFA::compiler()
/// .configure(NFA::config().shrink(true))
/// .build(r"[a-z0-9]+")?;
/// let re = DFA::builder().build_from_nfa(nfa)?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
/// re.captures(&mut cache, "foo123bar", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..9)), caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn build_from_nfa(&self, nfa: NFA) -> Result<DFA, BuildError> {
// Why take ownership if we're just going to pass a reference to the
// NFA to our internal builder? Well, the first thing to note is that
// an NFA uses reference counting internally, so either choice is going
// to be cheap. So there isn't much cost either way.
//
// The real reason is that a one-pass DFA, semantically, shares
// ownership of an NFA. This is unlike other DFAs that don't share
// ownership of an NFA at all, primarily because they want to be
// self-contained in order to support cheap (de)serialization.
//
// But then why pass a '&nfa' below if we want to share ownership?
// Well, it turns out that using a '&NFA' in our internal builder
// separates its lifetime from the DFA we're building, and this turns
// out to make code a bit more composable. e.g., We can iterate over
// things inside the NFA while borrowing the builder as mutable because
// we know the NFA cannot be mutated. So TL;DR --- this weirdness is
// "because borrow checker."
InternalBuilder::new(self.config.clone(), &nfa).build()
}
/// Apply the given one-pass DFA configuration options to this builder.
pub fn configure(&mut self, config: Config) -> &mut Builder {
self.config = self.config.overwrite(config);
self
}
/// Set the syntax configuration for this builder using
/// [`syntax::Config`](crate::util::syntax::Config).
///
/// This permits setting things like case insensitivity, Unicode and
/// multi-line mode.
///
/// These settings only apply when constructing a one-pass DFA directly
/// from a pattern.
#[cfg(feature = "syntax")]
pub fn syntax(
&mut self,
config: crate::util::syntax::Config,
) -> &mut Builder {
self.thompson.syntax(config);
self
}
/// Set the Thompson NFA configuration for this builder using
/// [`nfa::thompson::Config`](crate::nfa::thompson::Config).
///
/// This permits setting things like whether additional time should be
/// spent shrinking the size of the NFA.
///
/// These settings only apply when constructing a DFA directly from a
/// pattern.
#[cfg(feature = "syntax")]
pub fn thompson(&mut self, config: thompson::Config) -> &mut Builder {
self.thompson.configure(config);
self
}
}
/// An internal builder for encapsulating the state necessary to build a
/// one-pass DFA. Typical use is just `InternalBuilder::new(..).build()`.
///
/// There is no separate pass for determining whether the NFA is one-pass or
/// not. We just try to build the DFA. If during construction we discover that
/// it is not one-pass, we bail out. This is likely to lead to some undesirable
/// expense in some cases, so it might make sense to try and identify common
/// patterns in the NFA that make it definitively not one-pass. That way, we
/// can avoid ever trying to build a one-pass DFA in the first place. For
/// example, '\w*\s' is not one-pass, and since '\w' is Unicode-aware by
/// default, it's probably not a trivial cost to try and build a one-pass DFA
/// for it and then fail.
///
/// Note that some (immutable) fields are duplicated here. For example, the
/// 'nfa' and 'classes' fields are both in the 'DFA'. They are the same thing,
/// but we duplicate them because it makes composition easier below. Otherwise,
/// since the borrow checker can't see through method calls, the mutable borrow
/// we use to mutate the DFA winds up preventing borrowing from any other part
/// of the DFA, even though we aren't mutating those parts. We only do this
/// because the duplication is cheap.
#[derive(Debug)]
struct InternalBuilder<'a> {
/// The DFA we're building.
dfa: DFA,
/// An unordered collection of NFA state IDs that we haven't yet tried to
/// build into a DFA state.
///
/// This collection does not ultimately wind up including every NFA state
/// ID. Instead, each ID represents a "start" state for a sub-graph of the
/// NFA. The set of NFA states we then use to build a DFA state consists
/// of that "start" state and all states reachable from it via epsilon
/// transitions.
uncompiled_nfa_ids: Vec<StateID>,
/// A map from NFA state ID to DFA state ID. This is useful for easily
/// determining whether an NFA state has been used as a "starting" point
/// to build a DFA state yet. If it hasn't, then it is mapped to DEAD,
/// and since DEAD is specially added and never corresponds to any NFA
/// state, it follows that a mapping to DEAD implies the NFA state has
/// no corresponding DFA state yet.
nfa_to_dfa_id: Vec<StateID>,
/// A stack used to traverse the NFA states that make up a single DFA
/// state. Traversal occurs until the stack is empty, and we only push to
/// the stack when the state ID isn't in 'seen'. Actually, even more than
/// that, if we try to push something on to this stack that is already in
/// 'seen', then we bail out on construction completely, since it implies
/// that the NFA is not one-pass.
stack: Vec<(StateID, Epsilons)>,
/// The set of NFA states that we've visited via 'stack'.
seen: SparseSet,
/// Whether a match NFA state has been observed while constructing a
/// one-pass DFA state. Once a match state is seen, assuming we are using
/// leftmost-first match semantics, then we don't add any more transitions
/// to the DFA state we're building.
matched: bool,
/// The config passed to the builder.
///
/// This is duplicated in dfa.config.
config: Config,
/// The NFA we're building a one-pass DFA from.
///
/// This is duplicated in dfa.nfa.
nfa: &'a NFA,
/// The equivalence classes that make up the alphabet for this DFA.
///
/// This is duplicated in dfa.classes.
classes: ByteClasses,
}
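// A minimal sketch of how the 'stack' and 'seen' fields above detect that an
// NFA is not one-pass. It is a simplification made for illustration only:
// the function name is hypothetical, the NFA is reduced to a plain adjacency
// list of epsilon transitions, and byte transitions, look-around assertions
// and capture slots are ignored. The rule it shows is the one described
// above: reaching an NFA state that is already in 'seen' means there are two
// epsilon paths to the same state, and that ambiguity rules out one-pass.
#[allow(dead_code)]
fn sketch_closure_is_unambiguous(eps: &[&[usize]], start: usize) -> bool {
    let mut seen = vec![false; eps.len()];
    let mut stack = vec![start];
    seen[start] = true;
    while let Some(id) = stack.pop() {
        // Push in reverse so that earlier alternatives are popped first.
        for &next in eps[id].iter().rev() {
            if seen[next] {
                // A second epsilon path to 'next': not one-pass.
                return false;
            }
            seen[next] = true;
            stack.push(next);
        }
    }
    true
}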
impl<'a> InternalBuilder<'a> {
/// Create a new builder with an initial empty DFA.
fn new(config: Config, nfa: &'a NFA) -> InternalBuilder {
let classes = if !config.get_byte_classes() {
// A one-pass DFA will always use the equivalence class map, but
// disabling byte classes is useful for debugging. Namely, this will
// cause all transitions to be defined over their actual bytes
// instead of an opaque equivalence class identifier. The former is
// much easier to grok as a human.
ByteClasses::singletons()
} else {
nfa.byte_classes().clone()
};
// Normally a DFA alphabet includes the EOI symbol, but we don't need
// that in the one-pass DFA since we handle look-around explicitly
// without encoding it into the DFA. Thus, we don't need to delay
// matches by 1 byte. However, we reuse the space that *would* be used
// by the EOI transition by putting match information there (like which
// pattern matches and which look-around assertions need to hold). So
// this means our real alphabet length is 1 fewer than what the byte
// classes report, since we don't use EOI.
let alphabet_len = classes.alphabet_len().checked_sub(1).unwrap();
let stride2 = classes.stride2();
let dfa = DFA {
config: config.clone(),
nfa: nfa.clone(),
table: vec![],
starts: vec![],
// Since one-pass DFAs have a smaller state ID max than
// StateID::MAX, it follows that StateID::MAX is a valid initial
// value for min_match_id since no state ID can ever be greater
// than it. In the case of a one-pass DFA with no match states, the
// min_match_id will keep this sentinel value.
min_match_id: StateID::MAX,
classes: classes.clone(),
alphabet_len,
stride2,
pateps_offset: alphabet_len,
// OK because PatternID::MAX*2 is guaranteed not to overflow.
explicit_slot_start: nfa.pattern_len().checked_mul(2).unwrap(),
};
InternalBuilder {
dfa,
uncompiled_nfa_ids: vec![],
nfa_to_dfa_id: vec![DEAD; nfa.states().len()],
stack: vec![],
seen: SparseSet::new(nfa.states().len()),
matched: false,
config,
nfa,
classes,
}
}
/// Build the DFA from the NFA given to this builder. If the NFA is not
/// one-pass, then return an error. An error may also be returned if a
/// particular limit is exceeded. (Some limits, like the total heap memory
/// used, are configurable. Others, like the total patterns or slots, are
/// hard-coded based on representational limitations.)
fn build(mut self) -> Result<DFA, BuildError> {
self.nfa.look_set_any().available().map_err(BuildError::word)?;
for look in self.nfa.look_set_any().iter() {
// This is a future incompatibility check where if we add any
// more look-around assertions, then the one-pass DFA either
// needs to reject them (what we do here) or it needs to have its
// Transition representation modified to be capable of storing the
// new assertions.
if look.as_repr() > Look::WordUnicodeNegate.as_repr() {
return Err(BuildError::unsupported_look(look));
}
}
if self.nfa.pattern_len().as_u64() > PatternEpsilons::PATTERN_ID_LIMIT
{
return Err(BuildError::too_many_patterns(
PatternEpsilons::PATTERN_ID_LIMIT,
));
}
if self.nfa.group_info().explicit_slot_len() > Slots::LIMIT {
return Err(BuildError::not_one_pass(
"too many explicit capturing groups (max is 16)",
));
}
assert_eq!(DEAD, self.add_empty_state()?);
// This is where the explicit slots start. We care about this because
// we only need to track explicit slots. The implicit slots---two for
// each pattern---are tracked as part of the search routine itself.
let explicit_slot_start = self.nfa.pattern_len() * 2;
self.add_start_state(None, self.nfa.start_anchored())?;
if self.config.get_starts_for_each_pattern() {
for pid in self.nfa.patterns() {
self.add_start_state(
Some(pid),
self.nfa.start_pattern(pid).unwrap(),
)?;
}
}
// NOTE: One wonders what the effects of treating 'uncompiled_nfa_ids'
// as a stack are. It is really an unordered *set* of NFA state IDs.
// If it, for example, in practice led to discovering whether a regex
// was or wasn't one-pass later than if we processed NFA state IDs in
// ascending order, then that would make this routine more costly in
// the somewhat common case of a regex that isn't one-pass.
while let Some(nfa_id) = self.uncompiled_nfa_ids.pop() {
let dfa_id = self.nfa_to_dfa_id[nfa_id];
// Once we see a match, we keep going, but don't add any new
// transitions. Normally we'd just stop, but we have to keep
// going in order to verify that our regex is actually one-pass.
self.matched = false;
// The NFA states we've already explored for this DFA state.
self.seen.clear();
// The NFA states to explore via epsilon transitions. If we ever
// try to push an NFA state that we've already seen, then the NFA
// is not one-pass because it implies there are multiple epsilon
// transition paths that lead to the same NFA state. In other
// words, there is ambiguity.
self.stack_push(nfa_id, Epsilons::empty())?;
while let Some((id, epsilons)) = self.stack.pop() {
match *self.nfa.state(id) {
thompson::State::ByteRange { ref trans } => {
self.compile_transition(dfa_id, trans, epsilons)?;
}
thompson::State::Sparse(ref sparse) => {
for trans in sparse.transitions.iter() {
self.compile_transition(dfa_id, trans, epsilons)?;
}
}
thompson::State::Dense(ref dense) => {
for trans in dense.iter() {
self.compile_transition(dfa_id, &trans, epsilons)?;
}
}
thompson::State::Look { look, next } => {
let looks = epsilons.looks().insert(look);
self.stack_push(next, epsilons.set_looks(looks))?;
}
thompson::State::Union { ref alternates } => {
for &sid in alternates.iter().rev() {
self.stack_push(sid, epsilons)?;
}
}
thompson::State::BinaryUnion { alt1, alt2 } => {
self.stack_push(alt2, epsilons)?;
self.stack_push(alt1, epsilons)?;
}
thompson::State::Capture { next, slot, .. } => {
let slot = slot.as_usize();
let epsilons = if slot < explicit_slot_start {
// If this is an implicit slot, we don't care
// about it, since we handle implicit slots in
// the search routine. We can get away with that
// because there are 2 implicit slots for every
// pattern.
epsilons
} else {
// Offset our explicit slots so that they start
// at index 0.
let offset = slot - explicit_slot_start;
epsilons.set_slots(epsilons.slots().insert(offset))
};
self.stack_push(next, epsilons)?;
}
thompson::State::Fail => {
continue;
}
thompson::State::Match { pattern_id } => {
// If we found two different paths to a match state
// for the same DFA state, then we have ambiguity.
// Thus, it's not one-pass.
if self.matched {
return Err(BuildError::not_one_pass(
"multiple epsilon transitions to match state",
));
}
self.matched = true;
// Shove the matching pattern ID and the 'epsilons'
// into the current DFA state's pattern epsilons. The
// 'epsilons' includes the slots we need to capture
// before reporting the match and also the conditional
// epsilon transitions we need to check before we can
// report a match.
self.dfa.set_pattern_epsilons(
dfa_id,
PatternEpsilons::empty()
.set_pattern_id(pattern_id)
.set_epsilons(epsilons),
);
// N.B. It is tempting to just bail out here when
// compiling a leftmost-first DFA, since we will never
// compile any more transitions in that case. But we
// actually need to keep going in order to verify that
// we actually have a one-pass regex. e.g., We might
// see more Match states (e.g., for other patterns)
// that imply that we don't have a one-pass regex.
// So instead, we mark that we've found a match and
// continue on. When we go to compile a new DFA state,
// we just skip that part. But otherwise check that the
// one-pass property is upheld.
}
}
}
}
self.shuffle_states();
Ok(self.dfa)
}
/// Shuffle all match states to the end of the transition table and set
/// 'min_match_id' to the ID of the first such match state.
///
/// The point of this is to make it extremely cheap to determine whether
/// a state is a match state or not. We need to check on this on every
/// transition during a search, so it being cheap is important. This
/// permits us to check it by simply comparing two state identifiers, as
/// opposed to looking for the pattern ID in the state's `PatternEpsilons`.
/// (Which requires a memory load and some light arithmetic.)
fn shuffle_states(&mut self) {
let mut remapper = Remapper::new(&self.dfa);
let mut next_dest = self.dfa.last_state_id();
for i in (0..self.dfa.state_len()).rev() {
let id = StateID::must(i);
let is_match =
self.dfa.pattern_epsilons(id).pattern_id().is_some();
if !is_match {
continue;
}
remapper.swap(&mut self.dfa, next_dest, id);
self.dfa.min_match_id = next_dest;
next_dest = self.dfa.prev_state_id(next_dest).expect(
"match states should be a proper subset of all states",
);
}
remapper.remap(&mut self.dfa);
}
/// Compile the given NFA transition into the DFA state given.
///
/// 'Epsilons' corresponds to any conditional epsilon transitions that need
/// to be satisfied to follow this transition, and any slots that need to
/// be saved if the transition is followed.
///
/// If this transition indicates that the NFA is not one-pass, then
/// this returns an error. (This occurs, for example, if the DFA state
/// already has a transition defined for the same input symbols as the
/// given transition, *and* the result of the old and new transitions is
/// different.)
fn compile_transition(
&mut self,
dfa_id: StateID,
trans: &thompson::Transition,
epsilons: Epsilons,
) -> Result<(), BuildError> {
let next_dfa_id = self.add_dfa_state_for_nfa_state(trans.next)?;
for byte in self
.classes
.representatives(trans.start..=trans.end)
.filter_map(|r| r.as_u8())
{
let oldtrans = self.dfa.transition(dfa_id, byte);
let newtrans =
Transition::new(self.matched, next_dfa_id, epsilons);
// If the old transition points to the DEAD state, then we know
// 'byte' has not been mapped to any transition for this DFA state
// yet. So set it unconditionally. Otherwise, we require that the
// old and new transitions are equivalent. If they aren't, there is
// ambiguity and thus the regex is not one-pass.
if oldtrans.state_id() == DEAD {
self.dfa.set_transition(dfa_id, byte, newtrans);
} else if oldtrans != newtrans {
return Err(BuildError::not_one_pass(
"conflicting transition",
));
}
}
Ok(())
}
/// Add a start state to the DFA corresponding to the given NFA starting
/// state ID.
///
/// If adding a state would blow any limits (configured or hard-coded),
/// then an error is returned.
///
/// If the starting state is an anchored state for a particular pattern,
/// then callers must provide the pattern ID for that starting state.
/// Callers must also ensure that the first starting state added is the
/// start state for all patterns, and then each anchored starting state for
/// each pattern (if necessary) added in order. Otherwise, this panics.
fn add_start_state(
&mut self,
pid: Option<PatternID>,
nfa_id: StateID,
) -> Result<StateID, BuildError> {
match pid {
// With no pid, this should be the start state for all patterns
// and thus be the first one.
None => assert!(self.dfa.starts.is_empty()),
// With a pid, we want it to be at self.dfa.starts[pid+1].
Some(pid) => assert!(self.dfa.starts.len() == pid.one_more()),
}
let dfa_id = self.add_dfa_state_for_nfa_state(nfa_id)?;
self.dfa.starts.push(dfa_id);
Ok(dfa_id)
}
/// Add a new DFA state corresponding to the given NFA state. If adding a
/// state would blow any limits (configured or hard-coded), then an error
/// is returned. If a DFA state already exists for the given NFA state,
/// then that DFA state's ID is returned and no new states are added.
///
/// It is not expected that this routine is called for every NFA state.
/// Instead, an NFA state ID will usually correspond to the "start" state
/// for a sub-graph of the NFA, where all states in the sub-graph are
/// reachable via epsilon transitions (conditional or unconditional). That
/// sub-graph of NFA states is ultimately what produces a single DFA state.
fn add_dfa_state_for_nfa_state(
&mut self,
nfa_id: StateID,
) -> Result<StateID, BuildError> {
// If we've already built a DFA state for the given NFA state, then
// just return that. We definitely do not want to have more than one
// DFA state in existence for the same NFA state, since all but one of
// them will likely become unreachable. And at least some of them are
// likely to wind up being incomplete.
let existing_dfa_id = self.nfa_to_dfa_id[nfa_id];
if existing_dfa_id != DEAD {
return Ok(existing_dfa_id);
}
// If we don't have any DFA state yet, add it and then add the given
// NFA state to the list of states to explore.
let dfa_id = self.add_empty_state()?;
self.nfa_to_dfa_id[nfa_id] = dfa_id;
self.uncompiled_nfa_ids.push(nfa_id);
Ok(dfa_id)
}
/// Unconditionally add a new empty DFA state. If adding it would exceed
/// any limits (configured or hard-coded), then an error is returned. The
/// ID of the new state is returned on success.
///
/// The added state is *not* a match state.
fn add_empty_state(&mut self) -> Result<StateID, BuildError> {
let state_limit = Transition::STATE_ID_LIMIT;
// Note that unlike dense and lazy DFAs, we specifically do NOT
// premultiply our state IDs here. The reason is that we want to pack
// our state IDs into 64-bit transitions with other info, so the fewer
// the bits we use for state IDs the better. If we premultiply, then
// our state ID space shrinks. We justify this by the assumption that
// a one-pass DFA is just already doing a fair bit more work than a
// normal DFA anyway, so an extra multiplication to compute a state
// transition doesn't seem like a huge deal.
let next_id = self.dfa.table.len() >> self.dfa.stride2();
let id = StateID::new(next_id)
.map_err(|_| BuildError::too_many_states(state_limit))?;
if id.as_u64() > Transition::STATE_ID_LIMIT {
return Err(BuildError::too_many_states(state_limit));
}
self.dfa
.table
.extend(core::iter::repeat(Transition(0)).take(self.dfa.stride()));
// The default empty value for 'PatternEpsilons' is sadly not all
// zeroes. Instead, a special sentinel is used to indicate that there
// is no pattern. So we need to explicitly set the pattern epsilons to
// the correct "empty" PatternEpsilons.
self.dfa.set_pattern_epsilons(id, PatternEpsilons::empty());
if let Some(size_limit) = self.config.get_size_limit() {
if self.dfa.memory_usage() > size_limit {
return Err(BuildError::exceeded_size_limit(size_limit));
}
}
Ok(id)
}
/// Push the given NFA state ID and its corresponding epsilons (slots and
/// conditional epsilon transitions) on to a stack for use in a depth first
/// traversal of a sub-graph of the NFA.
///
/// If the given NFA state ID has already been pushed on to the stack, then
/// it indicates the regex is not one-pass and this correspondingly returns
/// an error.
fn stack_push(
&mut self,
nfa_id: StateID,
epsilons: Epsilons,
) -> Result<(), BuildError> {
// If we already have seen a match and we are compiling a leftmost
// first DFA, then we shouldn't add any more states to look at. This is
// effectively how preference order and non-greediness is implemented.
// if !self.config.get_match_kind().continue_past_first_match()
// && self.matched
// {
// return Ok(());
// }
if !self.seen.insert(nfa_id) {
return Err(BuildError::not_one_pass(
"multiple epsilon transitions to same state",
));
}
self.stack.push((nfa_id, epsilons));
Ok(())
}
}
/// A one-pass DFA for executing a subset of anchored regex searches while
/// resolving capturing groups.
///
/// A one-pass DFA can be built from an NFA that is one-pass. An NFA is
/// one-pass when there is never any ambiguity about how to continue a search.
/// For example, `a*a` is not one-pass because during a search, it's not
/// possible to know whether to continue matching the `a*` or to move on to
/// the single `a`. However, `a*b` is one-pass, because for every byte in the
/// input, it's always clear when to move on from `a*` to `b`.
///
/// # Only anchored searches are supported
///
/// In this crate, especially for DFAs, unanchored searches are implemented by
/// treating the pattern as if it had a `(?s-u:.)*?` prefix. While the prefix
/// is one-pass on its own, adding anything after it, e.g., `(?s-u:.)*?a` will
/// make the overall pattern not one-pass. Why? Because the `(?s-u:.)` matches
/// any byte, and there is therefore ambiguity as to when the prefix should
/// stop matching and something else should start matching.
///
/// Therefore, one-pass DFAs do not support unanchored searches. In addition
/// to many regexes simply not being one-pass, it implies that one-pass DFAs
/// have limited utility. With that said, when a one-pass DFA can be used, it
/// can potentially provide a dramatic speed up over alternatives like the
/// [`BoundedBacktracker`](crate::nfa::thompson::backtrack::BoundedBacktracker)
/// and the [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM). In particular,
/// a one-pass DFA is the only DFA capable of reporting the spans of matching
/// capturing groups.
///
/// To clarify, when we say that unanchored searches are not supported, what
/// that actually means is:
///
/// * The high level routines, [`DFA::is_match`] and [`DFA::captures`], always
/// do anchored searches.
/// * Since iterators are most useful in the context of unanchored searches,
/// there is no `DFA::captures_iter` method.
/// * For lower level routines like [`DFA::try_search`], an error will be
/// returned if the given [`Input`] is configured to do an unanchored search or
/// search for an invalid pattern ID. (Note that an [`Input`] is configured to
/// do an unanchored search by default, so passing the result of `Input::new`
/// without further configuration is guaranteed to return an error.)
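///
/// The last point can be sketched as follows. This assumes the default
/// configuration. The error value itself is not inspected, only whether the
/// search was accepted:
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, Anchored, Input, Match};
///
/// let re = DFA::new(r"[a-z]+")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// // An `Input` is unanchored by default, so the search is rejected.
/// let unanchored = Input::new("abc");
/// assert!(re.try_search(&mut cache, &unanchored, &mut caps).is_err());
///
/// // Explicitly requesting an anchored search is accepted.
/// let anchored = Input::new("abc").anchored(Anchored::Yes);
/// re.try_search(&mut cache, &anchored, &mut caps)?;
/// assert_eq!(Some(Match::must(0, 0..3)), caps.get_match());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```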
///
/// # Other limitations
///
/// In addition to the [configurable heap limit](Config::size_limit) and
/// the requirement that a regex pattern be one-pass, there are some other
/// limitations:
///
/// * There is an internal limit on the total number of explicit capturing
/// groups that appear across all patterns. It is somewhat small and there is
/// no way to configure it. If your pattern(s) exceed this limit, then building
/// a one-pass DFA will fail.
/// * If the number of patterns exceeds an internal unconfigurable limit, then
/// building a one-pass DFA will fail. This limit is quite large and you're
/// unlikely to hit it.
/// * If the total number of states exceeds an internal unconfigurable limit,
/// then building a one-pass DFA will fail. This limit is quite large and
/// you're unlikely to hit it.
///
/// # Other examples of regexes that aren't one-pass
///
/// One particularly unfortunate example is that enabling Unicode can cause
/// regexes that were one-pass to no longer be one-pass. Consider the regex
/// `(?-u)\w*\s` for example. It is one-pass because there is no overlap
/// between the ASCII definitions of `\w` and `\s`. But `\w*\s`
/// (i.e., with Unicode enabled) is *not* one-pass because `\w` and `\s` get
/// translated to UTF-8 automata. And while the *codepoints* in `\w` and `\s`
/// do not overlap, the underlying UTF-8 encodings do. Indeed, because of the
/// overlap between UTF-8 automata, the use of Unicode character classes will
/// tend to vastly increase the likelihood of a regex not being one-pass.
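///
/// A minimal sketch of this difference, assuming default settings apart from
/// the inline `(?-u)` flag:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::dfa::onepass::DFA;
///
/// // ASCII-only classes do not overlap, so this is expected to build.
/// assert!(DFA::new(r"(?-u)\w*\s").is_ok());
/// // With Unicode enabled, the UTF-8 encodings overlap, so this is
/// // expected to fail.
/// assert!(DFA::new(r"\w*\s").is_err());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```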
///
/// # How does one know if a regex is one-pass or not?
///
/// At the time of writing, the only way to know is to try and build a one-pass
/// DFA. The one-pass property is checked while constructing the DFA.
///
/// This does mean that you might potentially waste some CPU cycles and memory
/// by optimistically trying to build a one-pass DFA. But this is currently the
/// only way. In the future, building a one-pass DFA might be able to use some
/// heuristics to detect common violations of the one-pass property and bail
/// more quickly.
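///
/// One way to act on this is sketched below. `try_onepass` is a hypothetical
/// helper, not part of this crate, and it assumes that any build failure (not
/// only a one-pass violation) should lead to a fallback:
///
/// ```
/// use regex_automata::dfa::onepass::DFA;
///
/// // Hypothetical helper: `ok()` discards the error, so a syntax error or
/// // an exceeded limit is treated the same as "not one-pass".
/// fn try_onepass(pattern: &str) -> Option<DFA> {
///     DFA::new(pattern).ok()
/// }
///
/// assert!(try_onepass(r"a*b").is_some());
/// // Not one-pass, so a caller would fall back to another engine here.
/// assert!(try_onepass(r"a*a").is_none());
/// ```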
///
/// # Resource usage
///
/// Unlike a general DFA, a one-pass DFA has stricter bounds on its resource
/// usage. Namely, construction of a one-pass DFA has a time and space
/// complexity of `O(n)`, where `n ~ nfa.states().len()`. (A general DFA's time
/// and space complexity is `O(2^n)`.) This smaller time bound is achieved
/// because there is at most one DFA state created for each NFA state. If
/// additional DFA states would be required, then the pattern is not one-pass
/// and construction will fail.
///
/// Note though that currently, this DFA uses a fully dense representation.
/// This means that while its space complexity is no worse than an NFA, it may
/// in practice use more memory because of higher constant factors. The reason
/// for this trade off is two-fold. Firstly, a dense representation makes the
/// search faster. Secondly, the bigger an NFA, the more unlikely it is to be
/// one-pass. Therefore, most one-pass DFAs are usually pretty small.
///
/// # Example
///
/// This example shows that the one-pass DFA implements Unicode word boundaries
/// correctly while simultaneously reporting spans for capturing groups that
/// participate in a match. (This is the only DFA that implements full support
/// for Unicode word boundaries.)
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{dfa::onepass::DFA, Match, Span};
///
/// let re = DFA::new(r"\b(?P<first>\w+)[[:space:]]+(?P<last>\w+)\b")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Шерлок Холмс", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..23)), caps.get_match());
/// assert_eq!(Some(Span::from(0..12)), caps.get_group_by_name("first"));
/// assert_eq!(Some(Span::from(13..23)), caps.get_group_by_name("last"));
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: iteration
///
/// Unlike other regex engines in this crate, this one does not provide
/// iterator search functions. This is because a one-pass DFA only supports
/// anchored searches, and so iterator functions are generally not applicable.
///
/// However, if you know that all of your matches are
/// directly adjacent, then an iterator can be used. The
/// [`util::iter::Searcher`](crate::util::iter::Searcher) type can be used for
/// this purpose:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{
/// dfa::onepass::DFA,
/// util::iter::Searcher,
/// Anchored, Input, Span,
/// };
///
/// let re = DFA::new(r"\w(\d)\w")?;
/// let (mut cache, caps) = (re.create_cache(), re.create_captures());
/// let input = Input::new("a1zb2yc3x").anchored(Anchored::Yes);
///
/// let mut it = Searcher::new(input).into_captures_iter(caps, |input, caps| {
/// Ok(re.try_search(&mut cache, input, caps)?)
/// }).infallible();
/// let caps0 = it.next().unwrap();
/// assert_eq!(Some(Span::from(1..2)), caps0.get_group(1));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone)]
pub struct DFA {
/// The configuration provided by the caller.
config: Config,
/// The NFA used to build this DFA.
///
/// NOTE: We probably don't need to store the NFA here, but we use enough
/// bits from it that it's convenient to do so. And there really isn't much
/// cost to doing so either, since an NFA is reference counted internally.
nfa: NFA,
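/// The transition table. Given a state ID 'sid' and a byte of haystack
/// 'byte', the next state is given by
/// `table[(sid << stride2) + classes[byte]]`. (State IDs are not
/// premultiplied, so the shift is required to find the state's row.)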
///
/// The stride of this table (i.e., the number of columns) is always
/// a power of 2, even if the alphabet length is smaller. This makes
/// converting between state IDs and state indices very cheap.
///
/// Note that the stride always includes room for one extra "transition"
/// that isn't actually a transition. It is a 'PatternEpsilons' that is
/// used for match states only. Because of this, the maximum number of
/// active columns in the transition table is 257, which means the maximum
/// stride is 512 (the next power of 2 greater than or equal to 257).
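///
/// As a rough sketch of the indexing arithmetic (using a hypothetical
/// `Vec<u64>` in place of the real table, and assuming state IDs are not
/// premultiplied, as noted in `add_empty_state`):
///
/// ```
/// // Hypothetical layout: 3 states with a stride of 4 (stride2 = 2).
/// let stride2 = 2usize;
/// let table: Vec<u64> = (0..12).collect();
/// // The row for a state starts at `sid << stride2`.
/// let index = |sid: usize, class: usize| (sid << stride2) + class;
/// assert_eq!(9, table[index(2, 1)]);
/// // The number of states is recovered by undoing the shift.
/// assert_eq!(3, table.len() >> stride2);
/// ```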
table: Vec<Transition>,
/// The DFA state IDs of the starting states.
///
/// `starts[0]` is always present and corresponds to the starting state
/// when searching for matches of any pattern in the DFA.
///
/// `starts[i]` where i>0 corresponds to the starting state for the pattern
/// ID 'i-1'. These starting states are optional.
starts: Vec<StateID>,
/// Every state ID >= this value corresponds to a match state.
///
/// This is what a search uses to detect whether a state is a match state
/// or not. It requires only a simple comparison instead of bit-unpacking
/// the PatternEpsilons from every state.
min_match_id: StateID,
/// The alphabet of this DFA, split into equivalence classes. Bytes in the
/// same equivalence class can never discriminate between a match and a
/// non-match.
classes: ByteClasses,
/// The number of elements in each state in the transition table. This may
/// be less than the stride, since the stride is always a power of 2 and
/// the alphabet length can be anything up to and including 256.
alphabet_len: usize,
/// The number of columns in the transition table, expressed as a power of
/// 2.
stride2: usize,
/// The offset at which the PatternEpsilons for a match state is stored in
/// the transition table.
///
/// PERF: One wonders whether it would be better to put this in a separate
/// allocation, since only match states have a non-empty PatternEpsilons
/// and the number of match states tends to be dwarfed by the number of
/// non-match states. So this would save '8*len(non_match_states)' for each
/// DFA. The question is whether moving this to a different allocation will
/// lead to a perf hit during searches. You might think dealing with match
/// states is rare, but some regexes spend a lot of time in match states
/// gobbling up input. But... match state handling is already somewhat
/// expensive, so maybe this wouldn't do much? Either way, it's worth
/// experimenting.
pateps_offset: usize,
/// The first explicit slot index. This refers to the first slot appearing
/// immediately after the last implicit slot. It is always 'patterns.len()
/// * 2'.
///
/// We record this because we only store the explicit slots in our DFA
/// transition table that need to be saved. Implicit slots are handled
/// automatically as part of the search.
explicit_slot_start: usize,
}
impl DFA {
/// Parse the given regular expression using the default configuration and
/// return the corresponding one-pass DFA.
///
/// If you want a non-default configuration, then use the [`Builder`] to
/// set your own configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, Match};
///
/// let re = DFA::new("foo[0-9]+bar")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "foo12345barzzz", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..11)), caps.get_match());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[cfg(feature = "syntax")]
#[inline]
pub fn new(pattern: &str) -> Result<DFA, BuildError> {
DFA::builder().build(pattern)
}
/// Like `new`, but parses multiple patterns into a single "multi regex."
/// This similarly uses the default regex configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, Match};
///
/// let re = DFA::new_many(&["[a-z]+", "[0-9]+"])?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "abc123", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..3)), caps.get_match());
///
/// re.captures(&mut cache, "123abc", &mut caps);
/// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[cfg(feature = "syntax")]
#[inline]
pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError> {
DFA::builder().build_many(patterns)
}
/// Like `new`, but builds a one-pass DFA directly from an NFA. This is
/// useful if you already have an NFA, or even if you hand-assembled the
/// NFA.
///
/// # Example
///
/// This shows how to hand assemble a regular expression via its HIR,
/// compile an NFA from it and build a one-pass DFA from the NFA.
///
/// ```
/// use regex_automata::{
/// dfa::onepass::DFA,
/// nfa::thompson::NFA,
/// Match,
/// };
/// use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};
///
/// let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
/// ClassBytesRange::new(b'0', b'9'),
/// ClassBytesRange::new(b'A', b'Z'),
/// ClassBytesRange::new(b'_', b'_'),
/// ClassBytesRange::new(b'a', b'z'),
/// ])));
///
/// let config = NFA::config().nfa_size_limit(Some(1_000));
/// let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;
///
/// let re = DFA::new_from_nfa(nfa)?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
/// let expected = Some(Match::must(0, 0..1));
/// re.captures(&mut cache, "A", &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError> {
DFA::builder().build_from_nfa(nfa)
}
/// Create a new one-pass DFA that matches every input.
///
/// # Example
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, Match};
///
/// let dfa = DFA::always_match()?;
/// let mut cache = dfa.create_cache();
/// let mut caps = dfa.create_captures();
///
/// let expected = Match::must(0, 0..0);
/// dfa.captures(&mut cache, "", &mut caps);
/// assert_eq!(Some(expected), caps.get_match());
/// dfa.captures(&mut cache, "foo", &mut caps);
/// assert_eq!(Some(expected), caps.get_match());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn always_match() -> Result<DFA, BuildError> {
let nfa = thompson::NFA::always_match();
Builder::new().build_from_nfa(nfa)
}
/// Create a new one-pass DFA that never matches any input.
///
/// # Example
///
/// ```
/// use regex_automata::dfa::onepass::DFA;
///
/// let dfa = DFA::never_match()?;
/// let mut cache = dfa.create_cache();
/// let mut caps = dfa.create_captures();
///
/// dfa.captures(&mut cache, "", &mut caps);
/// assert_eq!(None, caps.get_match());
/// dfa.captures(&mut cache, "foo", &mut caps);
/// assert_eq!(None, caps.get_match());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn never_match() -> Result<DFA, BuildError> {
let nfa = thompson::NFA::never_match();
Builder::new().build_from_nfa(nfa)
}
/// Return a default configuration for a DFA.
///
/// This is a convenience routine to avoid needing to import the `Config`
/// type when customizing the construction of a DFA.
///
/// # Example
///
/// This example shows how to change the match semantics of this DFA from
/// its default "leftmost first" to "all." When using "all," non-greediness
/// doesn't apply and neither does preference order matching. Instead, the
/// longest match possible is always returned. (Although, by construction,
/// it's impossible for a one-pass DFA to have a different answer for
/// "preference order" vs "longest match.")
///
/// ```
/// use regex_automata::{dfa::onepass::DFA, Match, MatchKind};
///
/// let re = DFA::builder()
/// .configure(DFA::config().match_kind(MatchKind::All))
/// .build(r"(abc)+?")?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// re.captures(&mut cache, "abcabc", &mut caps);
/// // Normally, the non-greedy repetition would give us a 0..3 match.
/// assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn config() -> Config {
Config::new()
}
/// Return a builder for configuring the construction of a DFA.
///
/// This is a convenience routine to avoid needing to import the
/// [`Builder`] type in common cases.
///
/// # Example
///
/// This example shows how to use the builder to disable UTF-8 mode.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{
/// dfa::onepass::DFA,
/// nfa::thompson,
/// util::syntax,
/// Match,
/// };
///
/// let re = DFA::builder()
/// .syntax(syntax::Config::new().utf8(false))
/// .thompson(thompson::Config::new().utf8(false))
/// .build(r"foo(?-u:[^b])ar.*")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";
/// let expected = Some(Match::must(0, 0..8));
/// re.captures(&mut cache, haystack, &mut caps);
/// assert_eq!(expected, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn builder() -> Builder {
Builder::new()
}
/// Create a new empty set of capturing groups that is guaranteed to be
/// valid for the search APIs on this DFA.
///
/// A `Captures` value created for a specific DFA cannot be used with any
/// other DFA.
///
/// This is a convenience function for [`Captures::all`]. See the
/// [`Captures`] documentation for an explanation of its alternative
/// constructors that permit the DFA to do less work during a search, and
/// thus might make it faster.
#[inline]
pub fn create_captures(&self) -> Captures {
Captures::all(self.nfa.group_info().clone())
}
/// Create a new cache for this DFA.
///
/// The cache returned should only be used for searches for this
/// DFA. If you want to reuse the cache for another DFA, then you
/// must call [`Cache::reset`] with that DFA (or, equivalently,
/// [`DFA::reset_cache`]).
#[inline]
pub fn create_cache(&self) -> Cache {
Cache::new(self)
}
/// Reset the given cache such that it can be used for searching with
/// this DFA (and only this DFA).
///
/// A cache reset permits reusing memory already allocated in this cache
/// with a different DFA.
///
/// # Example
///
/// This shows how to re-purpose a cache for use with a different DFA.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{dfa::onepass::DFA, Match};
///
/// let re1 = DFA::new(r"\w")?;
/// let re2 = DFA::new(r"\W")?;
/// let mut caps1 = re1.create_captures();
/// let mut caps2 = re2.create_captures();
///
/// let mut cache = re1.create_cache();
/// assert_eq!(
/// Some(Match::must(0, 0..2)),
/// { re1.captures(&mut cache, "Δ", &mut caps1); caps1.get_match() },
/// );
///
/// // Using 'cache' with re2 is not allowed. It may result in panics or
/// // incorrect results. In order to re-purpose the cache, we must reset
/// // it with the one-pass DFA we'd like to use it with.
/// //
/// // Similarly, after this reset, using the cache with 're1' is also not
/// // allowed.
/// re2.reset_cache(&mut cache);
/// assert_eq!(
/// Some(Match::must(0, 0..3)),
/// { re2.captures(&mut cache, "☃", &mut caps2); caps2.get_match() },
/// );
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn reset_cache(&self, cache: &mut Cache) {
cache.reset(self);
}
/// Return the config for this one-pass DFA.
#[inline]
pub fn get_config(&self) -> &Config {
&self.config
}
/// Returns a reference to the underlying NFA.
#[inline]
pub fn get_nfa(&self) -> &NFA {
&self.nfa
}
/// Returns the total number of patterns compiled into this DFA.
///
/// In the case of a DFA that contains no patterns, this returns `0`.
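///
/// # Example
///
/// A short sketch using the constructors defined on this type, assuming the
/// `syntax` feature is enabled for `new_many`:
///
/// ```
/// use regex_automata::dfa::onepass::DFA;
///
/// let re = DFA::new_many(&["[a-z]+", "[0-9]+"])?;
/// assert_eq!(2, re.pattern_len());
/// // A DFA that never matches contains no patterns.
/// assert_eq!(0, DFA::never_match()?.pattern_len());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```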
#[inline]
pub fn pattern_len(&self) -> usize {
self.get_nfa().pattern_len()
}
/// Returns the total number of states in this one-pass DFA.
///
/// Note that unlike dense or sparse DFAs, a one-pass DFA does not expose
/// a low level DFA API. Therefore, this routine has little use other than
/// being informational.
#[inline]
pub fn state_len(&self) -> usize {
self.table.len() >> self.stride2()
}
/// Returns the total number of elements in the alphabet for this DFA.
///
/// That is, this returns the total number of transitions that each
/// state in this DFA must have. The maximum alphabet size is 256, which
/// corresponds to each possible byte value.
///
/// The alphabet size may be less than 256 though, and unless
/// [`Config::byte_classes`] is disabled, it is typically much less than
/// 256. Namely, bytes are grouped into equivalence classes such that no
/// two bytes in the same class can distinguish a match from a non-match.
/// For example, in the regex `^[a-z]+$`, the ASCII bytes `a-z` could
/// all be in the same equivalence class. This leads to a massive space
/// savings.
///
/// Note though that the alphabet length does _not_ necessarily equal the
/// total stride space taken up by a single DFA state in the transition
/// table. Namely, for performance reasons, the stride is always the
/// smallest power of two that is greater than or equal to the alphabet
/// length plus one (the extra slot stores match information for the
/// state). For this reason, [`DFA::stride`] or [`DFA::stride2`] are
/// often more useful. The alphabet length is typically useful only for
/// informational purposes.
///
/// Note also that unlike dense or sparse DFAs, a one-pass DFA does
/// not have a special end-of-input (EOI) transition. This is because
/// a one-pass DFA handles look-around assertions explicitly (like the
/// [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM)) and does not build
/// them into the transitions of the DFA.
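///
/// # Example
///
/// A sketch of how the alphabet length relates to the stride. It assumes
/// default byte classes for the first DFA and disables them for the second:
///
/// ```
/// use regex_automata::dfa::onepass::DFA;
///
/// let re = DFA::new("foo[0-9]+bar")?;
/// // With byte classes enabled, many bytes share an equivalence class.
/// assert!(re.alphabet_len() < 256);
/// // The stride is a power of two with room beyond the alphabet.
/// assert_eq!(re.stride(), 1 << re.stride2());
/// assert!(re.alphabet_len() < re.stride());
///
/// // Without byte classes, every byte value is its own class.
/// let re = DFA::builder()
///     .configure(DFA::config().byte_classes(false))
///     .build("foo[0-9]+bar")?;
/// assert_eq!(256, re.alphabet_len());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```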
#[inline]
pub fn alphabet_len(&self) -> usize {
self.alphabet_len
}
/// Returns the total stride for every state in this DFA, expressed as the