Spracherkennung für: .rs vermutete Sprache: Unknown {[0] [0] [0]} [Methode: Schwerpunktbildung, einfache Gewichte, sechs Dimensionen]
/*!
Defines a high-level intermediate (HIR) representation for regular expressions.
The HIR is represented by the [`Hir`] type, and it principally constructed via
[translation](translate) from an [`Ast`](crate::ast::Ast). Alternatively, users
may use the smart constructors defined on `Hir` to build their own by hand. The
smart constructors simultaneously simplify and "optimize" the HIR, and are also
the same routines used by translation.
Most regex engines only have an HIR like this, and usually construct it
directly from the concrete syntax. This crate however first parses the
concrete syntax into an `Ast`, and only then creates the HIR from the `Ast`,
as mentioned above. It's done this way to facilitate better error reporting,
and to have a structured representation of a regex that faithfully represents
its concrete syntax. Namely, while an `Hir` value can be converted back to an
equivalent regex pattern string, it is unlikely to look like the original due
to its simplified structure.
*/
use core::{char, cmp};
use alloc::{
boxed::Box,
format,
string::{String, ToString},
vec,
vec::Vec,
};
use crate::{
ast::Span,
hir::interval::{Interval, IntervalSet, IntervalSetIter},
unicode,
};
pub use crate::{
hir::visitor::{visit, Visitor},
unicode::CaseFoldError,
};
mod interval;
pub mod literal;
pub mod print;
pub mod translate;
mod visitor;
/// An error that can occur while translating an `Ast` to a `Hir`.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Error {
/// The kind of error.
kind: ErrorKind,
/// The original pattern that the translator's Ast was parsed from. Every
/// span in an error is a valid range into this string.
pattern: String,
/// The span of this error, derived from the Ast given to the translator.
span: Span,
}
impl Error {
/// Return the type of this error.
pub fn kind(&self) -> &ErrorKind {
&self.kind
}
/// The original pattern string in which this error occurred.
///
/// Every span reported by this error is reported in terms of this string.
pub fn pattern(&self) -> &str {
&self.pattern
}
/// Return the span at which this error occurred.
pub fn span(&self) -> &Span {
&self.span
}
}
/// The type of an error that occurred while building an `Hir`.
///
/// This error type is marked as `non_exhaustive`. This means that adding a
/// new variant is not considered a breaking change.
#[non_exhaustive]
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum ErrorKind {
/// This error occurs when a Unicode feature is used when Unicode
/// support is disabled. For example `(?-u:\pL)` would trigger this error.
UnicodeNotAllowed,
/// This error occurs when translating a pattern that could match a byte
/// sequence that isn't UTF-8 and `utf8` was enabled.
InvalidUtf8,
/// This error occurs when one uses a non-ASCII byte for a line terminator,
/// but where Unicode mode is enabled and UTF-8 mode is disabled.
InvalidLineTerminator,
/// This occurs when an unrecognized Unicode property name could not
/// be found.
UnicodePropertyNotFound,
/// This occurs when an unrecognized Unicode property value could not
/// be found.
UnicodePropertyValueNotFound,
/// This occurs when a Unicode-aware Perl character class (`\w`, `\s` or
/// `\d`) could not be found. This can occur when the `unicode-perl`
/// crate feature is not enabled.
UnicodePerlClassNotFound,
/// This occurs when the Unicode simple case mapping tables are not
/// available, and the regular expression required Unicode aware case
/// insensitivity.
UnicodeCaseUnavailable,
}
#[cfg(feature = "std")]
impl std::error::Error for Error {}
impl core::fmt::Display for Error {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
crate::error::Formatter::from(self).fmt(f)
}
}
impl core::fmt::Display for ErrorKind {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
use self::ErrorKind::*;
let msg = match *self {
UnicodeNotAllowed => "Unicode not allowed here",
InvalidUtf8 => "pattern can match invalid UTF-8",
InvalidLineTerminator => "invalid line terminator, must be ASCII",
UnicodePropertyNotFound => "Unicode property not found",
UnicodePropertyValueNotFound => "Unicode property value not found",
UnicodePerlClassNotFound => {
"Unicode-aware Perl class not found \
(make sure the unicode-perl feature is enabled)"
}
UnicodeCaseUnavailable => {
"Unicode-aware case insensitivity matching is not available \
(make sure the unicode-case feature is enabled)"
}
};
f.write_str(msg)
}
}
/// A high-level intermediate representation (HIR) for a regular expression.
///
/// An HIR value is a combination of a [`HirKind`] and a set of [`Properties`].
/// An `HirKind` indicates what kind of regular expression it is (a literal,
/// a repetition, a look-around assertion, etc.), where as a `Properties`
/// describes various facts about the regular expression. For example, whether
/// it matches UTF-8 or if it matches the empty string.
///
/// The HIR of a regular expression represents an intermediate step between
/// its abstract syntax (a structured description of the concrete syntax) and
/// an actual regex matcher. The purpose of HIR is to make regular expressions
/// easier to analyze. In particular, the AST is much more complex than the
/// HIR. For example, while an AST supports arbitrarily nested character
/// classes, the HIR will flatten all nested classes into a single set. The HIR
/// will also "compile away" every flag present in the concrete syntax. For
/// example, users of HIR expressions never need to worry about case folding;
/// it is handled automatically by the translator (e.g., by translating
/// `(?i:A)` to `[aA]`).
///
/// The specific type of an HIR expression can be accessed via its `kind`
/// or `into_kind` methods. This extra level of indirection exists for two
/// reasons:
///
/// 1. Construction of an HIR expression *must* use the constructor methods on
/// this `Hir` type instead of building the `HirKind` values directly. This
/// permits construction to enforce invariants like "concatenations always
/// consist of two or more sub-expressions."
/// 2. Every HIR expression contains attributes that are defined inductively,
/// and can be computed cheaply during the construction process. For example,
/// one such attribute is whether the expression must match at the beginning of
/// the haystack.
///
/// In particular, if you have an `HirKind` value, then there is intentionally
/// no way to build an `Hir` value from it. You instead need to do case
/// analysis on the `HirKind` value and build the `Hir` value using its smart
/// constructors.
///
/// # UTF-8
///
/// If the HIR was produced by a translator with
/// [`TranslatorBuilder::utf8`](translate::TranslatorBuilder::utf8) enabled,
/// then the HIR is guaranteed to match UTF-8 exclusively for all non-empty
/// matches.
///
/// For empty matches, those can occur at any position. It is the
/// responsibility of the regex engine to determine whether empty matches are
/// permitted between the code units of a single codepoint.
///
/// # Stack space
///
/// This type defines its own destructor that uses constant stack space and
/// heap space proportional to the size of the HIR.
///
/// Also, an `Hir`'s `fmt::Display` implementation prints an HIR as a regular
/// expression pattern string, and uses constant stack space and heap space
/// proportional to the size of the `Hir`. The regex it prints is guaranteed to
/// be _semantically_ equivalent to the original concrete syntax, but it may
/// look very different. (And potentially not practically readable by a human.)
///
/// An `Hir`'s `fmt::Debug` implementation currently does not use constant
/// stack space. The implementation will also suppress some details (such as
/// the `Properties` inlined into every `Hir` value to make it less noisy).
#[derive(Clone, Eq, PartialEq)]
pub struct Hir {
/// The underlying HIR kind.
kind: HirKind,
/// Analysis info about this HIR, computed during construction.
props: Properties,
}
/// Methods for accessing the underlying `HirKind` and `Properties`.
impl Hir {
/// Returns a reference to the underlying HIR kind.
pub fn kind(&self) -> &HirKind {
&self.kind
}
/// Consumes ownership of this HIR expression and returns its underlying
/// `HirKind`.
pub fn into_kind(mut self) -> HirKind {
core::mem::replace(&mut self.kind, HirKind::Empty)
}
/// Returns the properties computed for this `Hir`.
pub fn properties(&self) -> &Properties {
&self.props
}
/// Splits this HIR into its constituent parts.
///
/// This is useful because `let Hir { kind, props } = hir;` does not work
/// because of `Hir`'s custom `Drop` implementation.
fn into_parts(mut self) -> (HirKind, Properties) {
(
core::mem::replace(&mut self.kind, HirKind::Empty),
core::mem::replace(&mut self.props, Properties::empty()),
)
}
}
/// Smart constructors for HIR values.
///
/// These constructors are called "smart" because they do inductive work or
/// simplifications. For example, calling `Hir::repetition` with a repetition
/// like `a{0}` will actually return a `Hir` with a `HirKind::Empty` kind
/// since it is equivalent to an empty regex. Another example is calling
/// `Hir::concat(vec![expr])`. Instead of getting a `HirKind::Concat`, you'll
/// just get back the original `expr` since it's precisely equivalent.
///
/// Smart constructors enable maintaining invariants about the HIR data type
/// while also simulanteously keeping the representation as simple as possible.
impl Hir {
/// Returns an empty HIR expression.
///
/// An empty HIR expression always matches, including the empty string.
#[inline]
pub fn empty() -> Hir {
let props = Properties::empty();
Hir { kind: HirKind::Empty, props }
}
/// Returns an HIR expression that can never match anything. That is,
/// the size of the set of strings in the language described by the HIR
/// returned is `0`.
///
/// This is distinct from [`Hir::empty`] in that the empty string matches
/// the HIR returned by `Hir::empty`. That is, the set of strings in the
/// language describe described by `Hir::empty` is non-empty.
///
/// Note that currently, the HIR returned uses an empty character class to
/// indicate that nothing can match. An equivalent expression that cannot
/// match is an empty alternation, but all such "fail" expressions are
/// normalized (via smart constructors) to empty character classes. This is
/// because empty character classes can be spelled in the concrete syntax
/// of a regex (e.g., `\P{any}` or `(?-u:[^\x00-\xFF])` or `[a&&b]`), but
/// empty alternations cannot.
#[inline]
pub fn fail() -> Hir {
let class = Class::Bytes(ClassBytes::empty());
let props = Properties::class(&class);
// We can't just call Hir::class here because it defers to Hir::fail
// in order to canonicalize the Hir value used to represent "cannot
// match."
Hir { kind: HirKind::Class(class), props }
}
/// Creates a literal HIR expression.
///
/// This accepts anything that can be converted into a `Box<[u8]>`.
///
/// Note that there is no mechanism for storing a `char` or a `Box<str>`
/// in an HIR. Everything is "just bytes." Whether a `Literal` (or
/// any HIR node) matches valid UTF-8 exclusively can be queried via
/// [`Properties::is_utf8`].
///
/// # Example
///
/// This example shows that concatenations of `Literal` HIR values will
/// automatically get flattened and combined together. So for example, even
/// if you concat multiple `Literal` values that are themselves not valid
/// UTF-8, they might add up to valid UTF-8. This also demonstrates just
/// how "smart" Hir's smart constructors are.
///
/// ```
/// use regex_syntax::hir::{Hir, HirKind, Literal};
///
/// let literals = vec![
/// Hir::literal([0xE2]),
/// Hir::literal([0x98]),
/// Hir::literal([0x83]),
/// ];
/// // Each literal, on its own, is invalid UTF-8.
/// assert!(literals.iter().all(|hir| !hir.properties().is_utf8()));
///
/// let concat = Hir::concat(literals);
/// // But the concatenation is valid UTF-8!
/// assert!(concat.properties().is_utf8());
///
/// // And also notice that the literals have been concatenated into a
/// // single `Literal`, to the point where there is no explicit `Concat`!
/// let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
/// assert_eq!(&expected, concat.kind());
/// ```
#[inline]
pub fn literal<B: Into<Box<[u8]>>>(lit: B) -> Hir {
let bytes = lit.into();
if bytes.is_empty() {
return Hir::empty();
}
let lit = Literal(bytes);
let props = Properties::literal(&lit);
Hir { kind: HirKind::Literal(lit), props }
}
/// Creates a class HIR expression. The class may either be defined over
/// ranges of Unicode codepoints or ranges of raw byte values.
///
/// Note that an empty class is permitted. An empty class is equivalent to
/// `Hir::fail()`.
#[inline]
pub fn class(class: Class) -> Hir {
if class.is_empty() {
return Hir::fail();
} else if let Some(bytes) = class.literal() {
return Hir::literal(bytes);
}
let props = Properties::class(&class);
Hir { kind: HirKind::Class(class), props }
}
/// Creates a look-around assertion HIR expression.
#[inline]
pub fn look(look: Look) -> Hir {
let props = Properties::look(look);
Hir { kind: HirKind::Look(look), props }
}
/// Creates a repetition HIR expression.
#[inline]
pub fn repetition(mut rep: Repetition) -> Hir {
// If the sub-expression of a repetition can only match the empty
// string, then we force its maximum to be at most 1.
if rep.sub.properties().maximum_len() == Some(0) {
rep.min = cmp::min(rep.min, 1);
rep.max = rep.max.map(|n| cmp::min(n, 1)).or(Some(1));
}
// The regex 'a{0}' is always equivalent to the empty regex. This is
// true even when 'a' is an expression that never matches anything
// (like '\P{any}').
//
// Additionally, the regex 'a{1}' is always equivalent to 'a'.
if rep.min == 0 && rep.max == Some(0) {
return Hir::empty();
} else if rep.min == 1 && rep.max == Some(1) {
return *rep.sub;
}
let props = Properties::repetition(&rep);
Hir { kind: HirKind::Repetition(rep), props }
}
/// Creates a capture HIR expression.
///
/// Note that there is no explicit HIR value for a non-capturing group.
/// Since a non-capturing group only exists to override precedence in the
/// concrete syntax and since an HIR already does its own grouping based on
/// what is parsed, there is no need to explicitly represent non-capturing
/// groups in the HIR.
#[inline]
pub fn capture(capture: Capture) -> Hir {
let props = Properties::capture(&capture);
Hir { kind: HirKind::Capture(capture), props }
}
/// Returns the concatenation of the given expressions.
///
/// This attempts to flatten and simplify the concatenation as appropriate.
///
/// # Example
///
/// This shows a simple example of basic flattening of both concatenations
/// and literals.
///
/// ```
/// use regex_syntax::hir::Hir;
///
/// let hir = Hir::concat(vec![
/// Hir::concat(vec![
/// Hir::literal([b'a']),
/// Hir::literal([b'b']),
/// Hir::literal([b'c']),
/// ]),
/// Hir::concat(vec![
/// Hir::literal([b'x']),
/// Hir::literal([b'y']),
/// Hir::literal([b'z']),
/// ]),
/// ]);
/// let expected = Hir::literal("abcxyz".as_bytes());
/// assert_eq!(expected, hir);
/// ```
pub fn concat(subs: Vec<Hir>) -> Hir {
// We rebuild the concatenation by simplifying it. Would be nice to do
// it in place, but that seems a little tricky?
let mut new = vec![];
// This gobbles up any adjacent literals in a concatenation and smushes
// them together. Basically, when we see a literal, we add its bytes
// to 'prior_lit', and whenever we see anything else, we first take
// any bytes in 'prior_lit' and add it to the 'new' concatenation.
let mut prior_lit: Option<Vec<u8>> = None;
for sub in subs {
let (kind, props) = sub.into_parts();
match kind {
HirKind::Literal(Literal(bytes)) => {
if let Some(ref mut prior_bytes) = prior_lit {
prior_bytes.extend_from_slice(&bytes);
} else {
prior_lit = Some(bytes.to_vec());
}
}
// We also flatten concats that are direct children of another
// concat. We only need to do this one level deep since
// Hir::concat is the only way to build concatenations, and so
// flattening happens inductively.
HirKind::Concat(subs2) => {
for sub2 in subs2 {
let (kind2, props2) = sub2.into_parts();
match kind2 {
HirKind::Literal(Literal(bytes)) => {
if let Some(ref mut prior_bytes) = prior_lit {
prior_bytes.extend_from_slice(&bytes);
} else {
prior_lit = Some(bytes.to_vec());
}
}
kind2 => {
if let Some(prior_bytes) = prior_lit.take() {
new.push(Hir::literal(prior_bytes));
}
new.push(Hir { kind: kind2, props: props2 });
}
}
}
}
// We can just skip empty HIRs.
HirKind::Empty => {}
kind => {
if let Some(prior_bytes) = prior_lit.take() {
new.push(Hir::literal(prior_bytes));
}
new.push(Hir { kind, props });
}
}
}
if let Some(prior_bytes) = prior_lit.take() {
new.push(Hir::literal(prior_bytes));
}
if new.is_empty() {
return Hir::empty();
} else if new.len() == 1 {
return new.pop().unwrap();
}
let props = Properties::concat(&new);
Hir { kind: HirKind::Concat(new), props }
}
/// Returns the alternation of the given expressions.
///
/// This flattens and simplifies the alternation as appropriate. This may
/// include factoring out common prefixes or even rewriting the alternation
/// as a character class.
///
/// Note that an empty alternation is equivalent to `Hir::fail()`. (It
/// is not possible for one to write an empty alternation, or even an
/// alternation with a single sub-expression, in the concrete syntax of a
/// regex.)
///
/// # Example
///
/// This is a simple example showing how an alternation might get
/// simplified.
///
/// ```
/// use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};
///
/// let hir = Hir::alternation(vec![
/// Hir::literal([b'a']),
/// Hir::literal([b'b']),
/// Hir::literal([b'c']),
/// Hir::literal([b'd']),
/// Hir::literal([b'e']),
/// Hir::literal([b'f']),
/// ]);
/// let expected = Hir::class(Class::Unicode(ClassUnicode::new([
/// ClassUnicodeRange::new('a', 'f'),
/// ])));
/// assert_eq!(expected, hir);
/// ```
///
/// And another example showing how common prefixes might get factored
/// out.
///
/// ```
/// use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};
///
/// let hir = Hir::alternation(vec![
/// Hir::concat(vec![
/// Hir::literal("abc".as_bytes()),
/// Hir::class(Class::Unicode(ClassUnicode::new([
/// ClassUnicodeRange::new('A', 'Z'),
/// ]))),
/// ]),
/// Hir::concat(vec![
/// Hir::literal("abc".as_bytes()),
/// Hir::class(Class::Unicode(ClassUnicode::new([
/// ClassUnicodeRange::new('a', 'z'),
/// ]))),
/// ]),
/// ]);
/// let expected = Hir::concat(vec![
/// Hir::literal("abc".as_bytes()),
/// Hir::alternation(vec![
/// Hir::class(Class::Unicode(ClassUnicode::new([
/// ClassUnicodeRange::new('A', 'Z'),
/// ]))),
/// Hir::class(Class::Unicode(ClassUnicode::new([
/// ClassUnicodeRange::new('a', 'z'),
/// ]))),
/// ]),
/// ]);
/// assert_eq!(expected, hir);
/// ```
///
/// Note that these sorts of simplifications are not guaranteed.
pub fn alternation(subs: Vec<Hir>) -> Hir {
// We rebuild the alternation by simplifying it. We proceed similarly
// as the concatenation case. But in this case, there's no literal
// simplification happening. We're just flattening alternations.
let mut new = Vec::with_capacity(subs.len());
for sub in subs {
let (kind, props) = sub.into_parts();
match kind {
HirKind::Alternation(subs2) => {
new.extend(subs2);
}
kind => {
new.push(Hir { kind, props });
}
}
}
if new.is_empty() {
return Hir::fail();
} else if new.len() == 1 {
return new.pop().unwrap();
}
// Now that it's completely flattened, look for the special case of
// 'char1|char2|...|charN' and collapse that into a class. Note that
// we look for 'char' first and then bytes. The issue here is that if
// we find both non-ASCII codepoints and non-ASCII singleton bytes,
// then it isn't actually possible to smush them into a single class.
// (Because classes are either "all codepoints" or "all bytes." You
// can have a class that both matches non-ASCII but valid UTF-8 and
// invalid UTF-8.) So we look for all chars and then all bytes, and
// don't handle anything else.
if let Some(singletons) = singleton_chars(&new) {
let it = singletons
.into_iter()
.map(|ch| ClassUnicodeRange { start: ch, end: ch });
return Hir::class(Class::Unicode(ClassUnicode::new(it)));
}
if let Some(singletons) = singleton_bytes(&new) {
let it = singletons
.into_iter()
.map(|b| ClassBytesRange { start: b, end: b });
return Hir::class(Class::Bytes(ClassBytes::new(it)));
}
// Similar to singleton chars, we can also look for alternations of
// classes. Those can be smushed into a single class.
if let Some(cls) = class_chars(&new) {
return Hir::class(cls);
}
if let Some(cls) = class_bytes(&new) {
return Hir::class(cls);
}
// Factor out a common prefix if we can, which might potentially
// simplify the expression and unlock other optimizations downstream.
// It also might generally make NFA matching and DFA construction
// faster by reducing the scope of branching in the regex.
new = match lift_common_prefix(new) {
Ok(hir) => return hir,
Err(unchanged) => unchanged,
};
let props = Properties::alternation(&new);
Hir { kind: HirKind::Alternation(new), props }
}
/// Returns an HIR expression for `.`.
///
/// * [`Dot::AnyChar`] maps to `(?su-R:.)`.
/// * [`Dot::AnyByte`] maps to `(?s-Ru:.)`.
/// * [`Dot::AnyCharExceptLF`] maps to `(?u-Rs:.)`.
/// * [`Dot::AnyCharExceptCRLF`] maps to `(?Ru-s:.)`.
/// * [`Dot::AnyByteExceptLF`] maps to `(?-Rsu:.)`.
/// * [`Dot::AnyByteExceptCRLF`] maps to `(?R-su:.)`.
///
/// # Example
///
/// Note that this is a convenience routine for constructing the correct
/// character class based on the value of `Dot`. There is no explicit "dot"
/// HIR value. It is just an abbreviation for a common character class.
///
/// ```
/// use regex_syntax::hir::{Hir, Dot, Class, ClassBytes, ClassBytesRange};
///
/// let hir = Hir::dot(Dot::AnyByte);
/// let expected = Hir::class(Class::Bytes(ClassBytes::new([
/// ClassBytesRange::new(0x00, 0xFF),
/// ])));
/// assert_eq!(expected, hir);
/// ```
#[inline]
pub fn dot(dot: Dot) -> Hir {
match dot {
Dot::AnyChar => {
let mut cls = ClassUnicode::empty();
cls.push(ClassUnicodeRange::new('\0', '\u{10FFFF}'));
Hir::class(Class::Unicode(cls))
}
Dot::AnyByte => {
let mut cls = ClassBytes::empty();
cls.push(ClassBytesRange::new(b'\0', b'\xFF'));
Hir::class(Class::Bytes(cls))
}
Dot::AnyCharExcept(ch) => {
let mut cls =
ClassUnicode::new([ClassUnicodeRange::new(ch, ch)]);
cls.negate();
Hir::class(Class::Unicode(cls))
}
Dot::AnyCharExceptLF => {
let mut cls = ClassUnicode::empty();
cls.push(ClassUnicodeRange::new('\0', '\x09'));
cls.push(ClassUnicodeRange::new('\x0B', '\u{10FFFF}'));
Hir::class(Class::Unicode(cls))
}
Dot::AnyCharExceptCRLF => {
let mut cls = ClassUnicode::empty();
cls.push(ClassUnicodeRange::new('\0', '\x09'));
cls.push(ClassUnicodeRange::new('\x0B', '\x0C'));
cls.push(ClassUnicodeRange::new('\x0E', '\u{10FFFF}'));
Hir::class(Class::Unicode(cls))
}
Dot::AnyByteExcept(byte) => {
let mut cls =
ClassBytes::new([ClassBytesRange::new(byte, byte)]);
cls.negate();
Hir::class(Class::Bytes(cls))
}
Dot::AnyByteExceptLF => {
let mut cls = ClassBytes::empty();
cls.push(ClassBytesRange::new(b'\0', b'\x09'));
cls.push(ClassBytesRange::new(b'\x0B', b'\xFF'));
Hir::class(Class::Bytes(cls))
}
Dot::AnyByteExceptCRLF => {
let mut cls = ClassBytes::empty();
cls.push(ClassBytesRange::new(b'\0', b'\x09'));
cls.push(ClassBytesRange::new(b'\x0B', b'\x0C'));
cls.push(ClassBytesRange::new(b'\x0E', b'\xFF'));
Hir::class(Class::Bytes(cls))
}
}
}
}
/// The underlying kind of an arbitrary [`Hir`] expression.
///
/// An `HirKind` is principally useful for doing case analysis on the type
/// of a regular expression. If you're looking to build new `Hir` values,
/// then you _must_ use the smart constructors defined on `Hir`, like
/// [`Hir::repetition`], to build new `Hir` values. The API intentionally does
/// not expose any way of building an `Hir` directly from an `HirKind`.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum HirKind {
/// The empty regular expression, which matches everything, including the
/// empty string.
Empty,
/// A literalstring that matches exactly these bytes.
Literal(Literal),
/// A single character class that matches any of the characters in the
/// class. A class can either consist of Unicode scalar values as
/// characters, or it can use bytes.
///
/// A class may be empty. In which case, it matches nothing.
Class(Class),
/// A look-around assertion. A look-around match always has zero length.
Look(Look),
/// A repetition operation applied to a sub-expression.
Repetition(Repetition),
/// A capturing group, which contains a sub-expression.
Capture(Capture),
/// A concatenation of expressions.
///
/// A concatenation matches only if each of its sub-expressions match one
/// after the other.
///
/// Concatenations are guaranteed by `Hir`'s smart constructors to always
/// have at least two sub-expressions.
Concat(Vec<Hir>),
/// An alternation of expressions.
///
/// An alternation matches only if at least one of its sub-expressions
/// match. If multiple sub-expressions match, then the leftmost is
/// preferred.
///
/// Alternations are guaranteed by `Hir`'s smart constructors to always
/// have at least two sub-expressions.
Alternation(Vec<Hir>),
}
impl HirKind {
/// Returns a slice of this kind's sub-expressions, if any.
pub fn subs(&self) -> &[Hir] {
use core::slice::from_ref;
match *self {
HirKind::Empty
| HirKind::Literal(_)
| HirKind::Class(_)
| HirKind::Look(_) => &[],
HirKind::Repetition(Repetition { ref sub, .. }) => from_ref(sub),
HirKind::Capture(Capture { ref sub, .. }) => from_ref(sub),
HirKind::Concat(ref subs) => subs,
HirKind::Alternation(ref subs) => subs,
}
}
}
impl core::fmt::Debug for Hir {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
self.kind.fmt(f)
}
}
/// Print a display representation of this Hir.
///
/// The result of this is a valid regular expression pattern string.
///
/// This implementation uses constant stack space and heap space proportional
/// to the size of the `Hir`.
impl core::fmt::Display for Hir {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
crate::hir::print::Printer::new().print(self, f)
}
}
/// The high-level intermediate representation of a literal.
///
/// A literal corresponds to `0` or more bytes that should be matched
/// literally. The smart constructors defined on `Hir` will automatically
/// concatenate adjacent literals into one literal, and will even automatically
/// replace empty literals with `Hir::empty()`.
///
/// Note that despite a literal being represented by a sequence of bytes, its
/// `Debug` implementation will attempt to print it as a normal string. (That
/// is, not a sequence of decimal numbers.)
#[derive(Clone, Eq, PartialEq)]
pub struct Literal(pub Box<[u8]>);
impl core::fmt::Debug for Literal {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
crate::debug::Bytes(&self.0).fmt(f)
}
}
/// The high-level intermediate representation of a character class.
///
/// A character class corresponds to a set of characters. A character is either
/// defined by a Unicode scalar value or a byte. Unicode characters are used
/// by default, while bytes are used when Unicode mode (via the `u` flag) is
/// disabled.
///
/// A character class, regardless of its character type, is represented by a
/// sequence of non-overlapping non-adjacent ranges of characters.
///
/// Note that `Bytes` variant may be produced even when it exclusively matches
/// valid UTF-8. This is because a `Bytes` variant represents an intention by
/// the author of the regular expression to disable Unicode mode, which in turn
/// impacts the semantics of case insensitive matching. For example, `(?i)k`
/// and `(?i-u)k` will not match the same set of strings.
#[derive(Clone, Eq, PartialEq)]
pub enum Class {
/// A set of characters represented by Unicode scalar values.
Unicode(ClassUnicode),
/// A set of characters represented by arbitrary bytes (one byte per
/// character).
Bytes(ClassBytes),
}
impl Class {
/// Apply Unicode simple case folding to this character class, in place.
/// The character class will be expanded to include all simple case folded
/// character variants.
///
/// If this is a byte oriented character class, then this will be limited
/// to the ASCII ranges `A-Z` and `a-z`.
///
/// # Panics
///
/// This routine panics when the case mapping data necessary for this
/// routine to complete is unavailable. This occurs when the `unicode-case`
/// feature is not enabled and the underlying class is Unicode oriented.
///
/// Callers should prefer using `try_case_fold_simple` instead, which will
/// return an error instead of panicking.
pub fn case_fold_simple(&mut self) {
match *self {
Class::Unicode(ref mut x) => x.case_fold_simple(),
Class::Bytes(ref mut x) => x.case_fold_simple(),
}
}
/// Apply Unicode simple case folding to this character class, in place.
/// The character class will be expanded to include all simple case folded
/// character variants.
///
/// If this is a byte oriented character class, then this will be limited
/// to the ASCII ranges `A-Z` and `a-z`.
///
/// # Error
///
/// This routine returns an error when the case mapping data necessary
/// for this routine to complete is unavailable. This occurs when the
/// `unicode-case` feature is not enabled and the underlying class is
/// Unicode oriented.
pub fn try_case_fold_simple(
&mut self,
) -> core::result::Result<(), CaseFoldError> {
match *self {
Class::Unicode(ref mut x) => x.try_case_fold_simple()?,
Class::Bytes(ref mut x) => x.case_fold_simple(),
}
Ok(())
}
/// Negate this character class in place.
///
/// After completion, this character class will contain precisely the
/// characters that weren't previously in the class.
pub fn negate(&mut self) {
match *self {
Class::Unicode(ref mut x) => x.negate(),
Class::Bytes(ref mut x) => x.negate(),
}
}
/// Returns true if and only if this character class will only ever match
/// valid UTF-8.
///
/// A character class can match invalid UTF-8 only when the following
/// conditions are met:
///
/// 1. The translator was configured to permit generating an expression
/// that can match invalid UTF-8. (By default, this is disabled.)
/// 2. Unicode mode (via the `u` flag) was disabled either in the concrete
/// syntax or in the parser builder. By default, Unicode mode is
/// enabled.
pub fn is_utf8(&self) -> bool {
match *self {
Class::Unicode(_) => true,
Class::Bytes(ref x) => x.is_ascii(),
}
}
/// Returns the length, in bytes, of the smallest string matched by this
/// character class.
///
/// For non-empty byte oriented classes, this always returns `1`. For
/// non-empty Unicode oriented classes, this can return `1`, `2`, `3` or
/// `4`. For empty classes, `None` is returned. It is impossible for `0` to
/// be returned.
///
/// # Example
///
/// This example shows some examples of regexes and their corresponding
/// minimum length, if any.
///
/// ```
/// use regex_syntax::{hir::Properties, parse};
///
/// // The empty string has a min length of 0.
/// let hir = parse(r"")?;
/// assert_eq!(Some(0), hir.properties().minimum_len());
/// // As do other types of regexes that only match the empty string.
/// let hir = parse(r"^$\b\B")?;
/// assert_eq!(Some(0), hir.properties().minimum_len());
/// // A regex that can match the empty string but match more is still 0.
/// let hir = parse(r"a*")?;
/// assert_eq!(Some(0), hir.properties().minimum_len());
/// // A regex that matches nothing has no minimum defined.
/// let hir = parse(r"[a&&b]")?;
/// assert_eq!(None, hir.properties().minimum_len());
/// // Character classes usually have a minimum length of 1.
/// let hir = parse(r"\w")?;
/// assert_eq!(Some(1), hir.properties().minimum_len());
/// // But sometimes Unicode classes might be bigger!
/// let hir = parse(r"\p{Cyrillic}")?;
/// assert_eq!(Some(2), hir.properties().minimum_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn minimum_len(&self) -> Option<usize> {
match *self {
Class::Unicode(ref x) => x.minimum_len(),
Class::Bytes(ref x) => x.minimum_len(),
}
}
/// Returns the length, in bytes, of the longest string matched by this
/// character class.
///
/// For non-empty byte oriented classes, this always returns `1`. For
/// non-empty Unicode oriented classes, this can return `1`, `2`, `3` or
/// `4`. For empty classes, `None` is returned. It is impossible for `0` to
/// be returned.
///
/// # Example
///
/// This example shows some examples of regexes and their corresponding
/// maximum length, if any.
///
/// ```
/// use regex_syntax::{hir::Properties, parse};
///
/// // The empty string has a max length of 0.
/// let hir = parse(r"")?;
/// assert_eq!(Some(0), hir.properties().maximum_len());
/// // As do other types of regexes that only match the empty string.
/// let hir = parse(r"^$\b\B")?;
/// assert_eq!(Some(0), hir.properties().maximum_len());
/// // A regex that matches nothing has no maximum defined.
/// let hir = parse(r"[a&&b]")?;
/// assert_eq!(None, hir.properties().maximum_len());
/// // Bounded repeats work as you expect.
/// let hir = parse(r"x{2,10}")?;
/// assert_eq!(Some(10), hir.properties().maximum_len());
/// // An unbounded repeat means there is no maximum.
/// let hir = parse(r"x{2,}")?;
/// assert_eq!(None, hir.properties().maximum_len());
/// // With Unicode enabled, \w can match up to 4 bytes!
/// let hir = parse(r"\w")?;
/// assert_eq!(Some(4), hir.properties().maximum_len());
/// // Without Unicode enabled, \w matches at most 1 byte.
/// let hir = parse(r"(?-u)\w")?;
/// assert_eq!(Some(1), hir.properties().maximum_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn maximum_len(&self) -> Option<usize> {
match *self {
Class::Unicode(ref x) => x.maximum_len(),
Class::Bytes(ref x) => x.maximum_len(),
}
}
/// Returns true if and only if this character class is empty. That is,
/// it has no elements.
///
/// An empty character can never match anything, including an empty string.
pub fn is_empty(&self) -> bool {
match *self {
Class::Unicode(ref x) => x.ranges().is_empty(),
Class::Bytes(ref x) => x.ranges().is_empty(),
}
}
/// If this class consists of exactly one element (whether a codepoint or a
/// byte), then return it as a literal byte string.
///
/// If this class is empty or contains more than one element, then `None`
/// is returned.
pub fn literal(&self) -> Option<Vec<u8>> {
match *self {
Class::Unicode(ref x) => x.literal(),
Class::Bytes(ref x) => x.literal(),
}
}
}
impl core::fmt::Debug for Class {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
use crate::debug::Byte;
let mut fmter = f.debug_set();
match *self {
Class::Unicode(ref cls) => {
for r in cls.ranges().iter() {
fmter.entry(&(r.start..=r.end));
}
}
Class::Bytes(ref cls) => {
for r in cls.ranges().iter() {
fmter.entry(&(Byte(r.start)..=Byte(r.end)));
}
}
}
fmter.finish()
}
}
/// A set of characters represented by Unicode scalar values.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct ClassUnicode {
set: IntervalSet<ClassUnicodeRange>,
}
impl ClassUnicode {
/// Create a new class from a sequence of ranges.
///
/// The given ranges do not need to be in any specific order, and ranges
/// may overlap. Ranges will automatically be sorted into a canonical
/// non-overlapping order.
pub fn new<I>(ranges: I) -> ClassUnicode
where
I: IntoIterator<Item = ClassUnicodeRange>,
{
ClassUnicode { set: IntervalSet::new(ranges) }
}
/// Create a new class with no ranges.
///
/// An empty class matches nothing. That is, it is equivalent to
/// [`Hir::fail`].
pub fn empty() -> ClassUnicode {
ClassUnicode::new(vec![])
}
/// Add a new range to this set.
pub fn push(&mut self, range: ClassUnicodeRange) {
self.set.push(range);
}
/// Return an iterator over all ranges in this class.
///
/// The iterator yields ranges in ascending order.
pub fn iter(&self) -> ClassUnicodeIter<'_> {
ClassUnicodeIter(self.set.iter())
}
/// Return the underlying ranges as a slice.
pub fn ranges(&self) -> &[ClassUnicodeRange] {
self.set.intervals()
}
/// Expand this character class such that it contains all case folded
/// characters, according to Unicode's "simple" mapping. For example, if
/// this class consists of the range `a-z`, then applying case folding will
/// result in the class containing both the ranges `a-z` and `A-Z`.
///
/// # Panics
///
/// This routine panics when the case mapping data necessary for this
/// routine to complete is unavailable. This occurs when the `unicode-case`
/// feature is not enabled.
///
/// Callers should prefer using `try_case_fold_simple` instead, which will
/// return an error instead of panicking.
pub fn case_fold_simple(&mut self) {
self.set
.case_fold_simple()
.expect("unicode-case feature must be enabled");
}
/// Expand this character class such that it contains all case folded
/// characters, according to Unicode's "simple" mapping. For example, if
/// this class consists of the range `a-z`, then applying case folding will
/// result in the class containing both the ranges `a-z` and `A-Z`.
///
/// # Error
///
/// This routine returns an error when the case mapping data necessary
/// for this routine to complete is unavailable. This occurs when the
/// `unicode-case` feature is not enabled.
pub fn try_case_fold_simple(
&mut self,
) -> core::result::Result<(), CaseFoldError> {
self.set.case_fold_simple()
}
/// Negate this character class.
///
/// For all `c` where `c` is a Unicode scalar value, if `c` was in this
/// set, then it will not be in this set after negation.
pub fn negate(&mut self) {
self.set.negate();
}
/// Union this character class with the given character class, in place.
pub fn union(&mut self, other: &ClassUnicode) {
self.set.union(&other.set);
}
/// Intersect this character class with the given character class, in
/// place.
pub fn intersect(&mut self, other: &ClassUnicode) {
self.set.intersect(&other.set);
}
/// Subtract the given character class from this character class, in place.
pub fn difference(&mut self, other: &ClassUnicode) {
self.set.difference(&other.set);
}
/// Compute the symmetric difference of the given character classes, in
/// place.
///
/// This computes the symmetric difference of two character classes. This
/// removes all elements in this class that are also in the given class,
/// but all adds all elements from the given class that aren't in this
/// class. That is, the class will contain all elements in either class,
/// but will not contain any elements that are in both classes.
pub fn symmetric_difference(&mut self, other: &ClassUnicode) {
self.set.symmetric_difference(&other.set);
}
/// Returns true if and only if this character class will either match
/// nothing or only ASCII bytes. Stated differently, this returns false
/// if and only if this class contains a non-ASCII codepoint.
pub fn is_ascii(&self) -> bool {
self.set.intervals().last().map_or(true, |r| r.end <= '\x7F')
}
/// Returns the length, in bytes, of the smallest string matched by this
/// character class.
///
/// Returns `None` when the class is empty.
pub fn minimum_len(&self) -> Option<usize> {
let first = self.ranges().get(0)?;
// Correct because c1 < c2 implies c1.len_utf8() < c2.len_utf8().
Some(first.start.len_utf8())
}
/// Returns the length, in bytes, of the longest string matched by this
/// character class.
///
/// Returns `None` when the class is empty.
pub fn maximum_len(&self) -> Option<usize> {
let last = self.ranges().last()?;
// Correct because c1 < c2 implies c1.len_utf8() < c2.len_utf8().
Some(last.end.len_utf8())
}
/// If this class consists of exactly one codepoint, then return it as
/// a literal byte string.
///
/// If this class is empty or contains more than one codepoint, then `None`
/// is returned.
pub fn literal(&self) -> Option<Vec<u8>> {
let rs = self.ranges();
if rs.len() == 1 && rs[0].start == rs[0].end {
Some(rs[0].start.encode_utf8(&mut [0; 4]).to_string().into_bytes())
} else {
None
}
}
/// If this class consists of only ASCII ranges, then return its
/// corresponding and equivalent byte class.
pub fn to_byte_class(&self) -> Option<ClassBytes> {
if !self.is_ascii() {
return None;
}
Some(ClassBytes::new(self.ranges().iter().map(|r| {
// Since we are guaranteed that our codepoint range is ASCII, the
// 'u8::try_from' calls below are guaranteed to be correct.
ClassBytesRange {
start: u8::try_from(r.start).unwrap(),
end: u8::try_from(r.end).unwrap(),
}
})))
}
}
/// An iterator over all ranges in a Unicode character class.
///
/// The lifetime `'a` refers to the lifetime of the underlying class.
#[derive(Debug)]
pub struct ClassUnicodeIter<'a>(IntervalSetIter<'a, ClassUnicodeRange>);
impl<'a> Iterator for ClassUnicodeIter<'a> {
type Item = &'a ClassUnicodeRange;
fn next(&mut self) -> Option<&'a ClassUnicodeRange> {
self.0.next()
}
}
/// A single range of characters represented by Unicode scalar values.
///
/// The range is closed. That is, the start and end of the range are included
/// in the range.
#[derive(Clone, Copy, Default, Eq, PartialEq, PartialOrd, Ord)]
pub struct ClassUnicodeRange {
start: char,
end: char,
}
impl core::fmt::Debug for ClassUnicodeRange {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
let start = if !self.start.is_whitespace() && !self.start.is_control()
{
self.start.to_string()
} else {
format!("0x{:X}", u32::from(self.start))
};
let end = if !self.end.is_whitespace() && !self.end.is_control() {
self.end.to_string()
} else {
format!("0x{:X}", u32::from(self.end))
};
f.debug_struct("ClassUnicodeRange")
.field("start", &start)
.field("end", &end)
.finish()
}
}
impl Interval for ClassUnicodeRange {
type Bound = char;
#[inline]
fn lower(&self) -> char {
self.start
}
#[inline]
fn upper(&self) -> char {
self.end
}
#[inline]
fn set_lower(&mut self, bound: char) {
self.start = bound;
}
#[inline]
fn set_upper(&mut self, bound: char) {
self.end = bound;
}
/// Apply simple case folding to this Unicode scalar value range.
///
/// Additional ranges are appended to the given vector. Canonical ordering
/// is *not* maintained in the given vector.
fn case_fold_simple(
&self,
ranges: &mut Vec<ClassUnicodeRange>,
) -> Result<(), unicode::CaseFoldError> {
let mut folder = unicode::SimpleCaseFolder::new()?;
if !folder.overlaps(self.start, self.end) {
return Ok(());
}
let (start, end) = (u32::from(self.start), u32::from(self.end));
for cp in (start..=end).filter_map(char::from_u32) {
for &cp_folded in folder.mapping(cp) {
ranges.push(ClassUnicodeRange::new(cp_folded, cp_folded));
}
}
Ok(())
}
}
impl ClassUnicodeRange {
/// Create a new Unicode scalar value range for a character class.
///
/// The returned range is always in a canonical form. That is, the range
/// returned always satisfies the invariant that `start <= end`.
pub fn new(start: char, end: char) -> ClassUnicodeRange {
ClassUnicodeRange::create(start, end)
}
/// Return the start of this range.
///
/// The start of a range is always less than or equal to the end of the
/// range.
pub fn start(&self) -> char {
self.start
}
/// Return the end of this range.
///
/// The end of a range is always greater than or equal to the start of the
/// range.
pub fn end(&self) -> char {
self.end
}
/// Returns the number of codepoints in this range.
pub fn len(&self) -> usize {
let diff = 1 + u32::from(self.end) - u32::from(self.start);
// This is likely to panic in 16-bit targets since a usize can only fit
// 2^16. It's not clear what to do here, other than to return an error
// when building a Unicode class that contains a range whose length
// overflows usize. (Which, to be honest, is probably quite common on
// 16-bit targets. For example, this would imply that '.' and '\p{any}'
// would be impossible to build.)
usize::try_from(diff).expect("char class len fits in usize")
}
}
/// A set of characters represented by arbitrary bytes (where one byte
/// corresponds to one character).
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct ClassBytes {
set: IntervalSet<ClassBytesRange>,
}
impl ClassBytes {
/// Create a new class from a sequence of ranges.
///
/// The given ranges do not need to be in any specific order, and ranges
/// may overlap. Ranges will automatically be sorted into a canonical
/// non-overlapping order.
pub fn new<I>(ranges: I) -> ClassBytes
where
I: IntoIterator<Item = ClassBytesRange>,
{
ClassBytes { set: IntervalSet::new(ranges) }
}
/// Create a new class with no ranges.
///
/// An empty class matches nothing. That is, it is equivalent to
/// [`Hir::fail`].
pub fn empty() -> ClassBytes {
ClassBytes::new(vec![])
}
/// Add a new range to this set.
pub fn push(&mut self, range: ClassBytesRange) {
self.set.push(range);
}
/// Return an iterator over all ranges in this class.
///
/// The iterator yields ranges in ascending order.
pub fn iter(&self) -> ClassBytesIter<'_> {
ClassBytesIter(self.set.iter())
}
/// Return the underlying ranges as a slice.
pub fn ranges(&self) -> &[ClassBytesRange] {
self.set.intervals()
}
/// Expand this character class such that it contains all case folded
/// characters. For example, if this class consists of the range `a-z`,
/// then applying case folding will result in the class containing both the
/// ranges `a-z` and `A-Z`.
///
/// Note that this only applies ASCII case folding, which is limited to the
/// characters `a-z` and `A-Z`.
pub fn case_fold_simple(&mut self) {
self.set.case_fold_simple().expect("ASCII case folding never fails");
}
/// Negate this byte class.
///
/// For all `b` where `b` is a any byte, if `b` was in this set, then it
/// will not be in this set after negation.
pub fn negate(&mut self) {
self.set.negate();
}
/// Union this byte class with the given byte class, in place.
pub fn union(&mut self, other: &ClassBytes) {
self.set.union(&other.set);
}
/// Intersect this byte class with the given byte class, in place.
pub fn intersect(&mut self, other: &ClassBytes) {
self.set.intersect(&other.set);
}
/// Subtract the given byte class from this byte class, in place.
pub fn difference(&mut self, other: &ClassBytes) {
self.set.difference(&other.set);
}
/// Compute the symmetric difference of the given byte classes, in place.
///
/// This computes the symmetric difference of two byte classes. This
/// removes all elements in this class that are also in the given class,
/// but all adds all elements from the given class that aren't in this
/// class. That is, the class will contain all elements in either class,
/// but will not contain any elements that are in both classes.
pub fn symmetric_difference(&mut self, other: &ClassBytes) {
self.set.symmetric_difference(&other.set);
}
/// Returns true if and only if this character class will either match
/// nothing or only ASCII bytes. Stated differently, this returns false
/// if and only if this class contains a non-ASCII byte.
pub fn is_ascii(&self) -> bool {
self.set.intervals().last().map_or(true, |r| r.end <= 0x7F)
}
/// Returns the length, in bytes, of the smallest string matched by this
/// character class.
///
/// Returns `None` when the class is empty.
pub fn minimum_len(&self) -> Option<usize> {
if self.ranges().is_empty() {
None
} else {
Some(1)
}
}
/// Returns the length, in bytes, of the longest string matched by this
/// character class.
///
/// Returns `None` when the class is empty.
pub fn maximum_len(&self) -> Option<usize> {
if self.ranges().is_empty() {
None
} else {
Some(1)
}
}
/// If this class consists of exactly one byte, then return it as
/// a literal byte string.
///
/// If this class is empty or contains more than one byte, then `None`
/// is returned.
pub fn literal(&self) -> Option<Vec<u8>> {
let rs = self.ranges();
if rs.len() == 1 && rs[0].start == rs[0].end {
Some(vec![rs[0].start])
} else {
None
}
}
/// If this class consists of only ASCII ranges, then return its
/// corresponding and equivalent Unicode class.
pub fn to_unicode_class(&self) -> Option<ClassUnicode> {
if !self.is_ascii() {
return None;
}
Some(ClassUnicode::new(self.ranges().iter().map(|r| {
// Since we are guaranteed that our byte range is ASCII, the
// 'char::from' calls below are correct and will not erroneously
// convert a raw byte value into its corresponding codepoint.
ClassUnicodeRange {
start: char::from(r.start),
end: char::from(r.end),
}
})))
}
}
/// An iterator over all ranges in a byte character class.
///
/// The lifetime `'a` refers to the lifetime of the underlying class.
#[derive(Debug)]
pub struct ClassBytesIter<'a>(IntervalSetIter<'a, ClassBytesRange>);
impl<'a> Iterator for ClassBytesIter<'a> {
type Item = &'a ClassBytesRange;
fn next(&mut self) -> Option<&'a ClassBytesRange> {
self.0.next()
}
}
/// A single range of characters represented by arbitrary bytes.
///
/// The range is closed. That is, the start and end of the range are included
/// in the range.
#[derive(Clone, Copy, Default, Eq, PartialEq, PartialOrd, Ord)]
pub struct ClassBytesRange {
start: u8,
end: u8,
}
impl Interval for ClassBytesRange {
type Bound = u8;
#[inline]
fn lower(&self) -> u8 {
self.start
}
#[inline]
fn upper(&self) -> u8 {
self.end
}
#[inline]
fn set_lower(&mut self, bound: u8) {
self.start = bound;
}
#[inline]
fn set_upper(&mut self, bound: u8) {
self.end = bound;
}
/// Apply simple case folding to this byte range. Only ASCII case mappings
/// (for a-z) are applied.
///
/// Additional ranges are appended to the given vector. Canonical ordering
/// is *not* maintained in the given vector.
fn case_fold_simple(
&self,
ranges: &mut Vec<ClassBytesRange>,
) -> Result<(), unicode::CaseFoldError> {
if !ClassBytesRange::new(b'a', b'z').is_intersection_empty(self) {
let lower = cmp::max(self.start, b'a');
let upper = cmp::min(self.end, b'z');
ranges.push(ClassBytesRange::new(lower - 32, upper - 32));
}
if !ClassBytesRange::new(b'A', b'Z').is_intersection_empty(self) {
let lower = cmp::max(self.start, b'A');
let upper = cmp::min(self.end, b'Z');
ranges.push(ClassBytesRange::new(lower + 32, upper + 32));
}
Ok(())
}
}
impl ClassBytesRange {
/// Create a new byte range for a character class.
///
/// The returned range is always in a canonical form. That is, the range
/// returned always satisfies the invariant that `start <= end`.
pub fn new(start: u8, end: u8) -> ClassBytesRange {
ClassBytesRange::create(start, end)
}
/// Return the start of this range.
///
/// The start of a range is always less than or equal to the end of the
/// range.
pub fn start(&self) -> u8 {
self.start
}
/// Return the end of this range.
///
/// The end of a range is always greater than or equal to the start of the
/// range.
pub fn end(&self) -> u8 {
self.end
}
/// Returns the number of bytes in this range.
pub fn len(&self) -> usize {
usize::from(self.end.checked_sub(self.start).unwrap())
.checked_add(1)
.unwrap()
}
}
impl core::fmt::Debug for ClassBytesRange {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
f.debug_struct("ClassBytesRange")
.field("start", &crate::debug::Byte(self.start))
.field("end", &crate::debug::Byte(self.end))
.finish()
}
}
/// The high-level intermediate representation for a look-around assertion.
///
/// An assertion match is always zero-length. Also called an "empty match."
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum Look {
/// Match the beginning of text. Specifically, this matches at the starting
/// position of the input.
Start = 1 << 0,
/// Match the end of text. Specifically, this matches at the ending
/// position of the input.
End = 1 << 1,
/// Match the beginning of a line or the beginning of text. Specifically,
/// this matches at the starting position of the input, or at the position
/// immediately following a `\n` character.
StartLF = 1 << 2,
/// Match the end of a line or the end of text. Specifically, this matches
/// at the end position of the input, or at the position immediately
/// preceding a `\n` character.
EndLF = 1 << 3,
/// Match the beginning of a line or the beginning of text. Specifically,
/// this matches at the starting position of the input, or at the position
/// immediately following either a `\r` or `\n` character, but never after
/// a `\r` when a `\n` follows.
StartCRLF = 1 << 4,
/// Match the end of a line or the end of text. Specifically, this matches
/// at the end position of the input, or at the position immediately
/// preceding a `\r` or `\n` character, but never before a `\n` when a `\r`
/// precedes it.
EndCRLF = 1 << 5,
/// Match an ASCII-only word boundary. That is, this matches a position
/// where the left adjacent character and right adjacent character
/// correspond to a word and non-word or a non-word and word character.
WordAscii = 1 << 6,
/// Match an ASCII-only negation of a word boundary.
WordAsciiNegate = 1 << 7,
/// Match a Unicode-aware word boundary. That is, this matches a position
/// where the left adjacent character and right adjacent character
/// correspond to a word and non-word or a non-word and word character.
WordUnicode = 1 << 8,
/// Match a Unicode-aware negation of a word boundary.
WordUnicodeNegate = 1 << 9,
}
impl Look {
/// Flip the look-around assertion to its equivalent for reverse searches.
/// For example, `StartLF` gets translated to `EndLF`.
///
/// Some assertions, such as `WordUnicode`, remain the same since they
/// match the same positions regardless of the direction of the search.
#[inline]
pub const fn reversed(self) -> Look {
match self {
Look::Start => Look::End,
Look::End => Look::Start,
Look::StartLF => Look::EndLF,
Look::EndLF => Look::StartLF,
Look::StartCRLF => Look::EndCRLF,
Look::EndCRLF => Look::StartCRLF,
Look::WordAscii => Look::WordAscii,
Look::WordAsciiNegate => Look::WordAsciiNegate,
Look::WordUnicode => Look::WordUnicode,
Look::WordUnicodeNegate => Look::WordUnicodeNegate,
}
}
/// Return the underlying representation of this look-around enumeration
/// as an integer. Giving the return value to the [`Look::from_repr`]
/// constructor is guaranteed to return the same look-around variant that
/// one started with within a semver compatible release of this crate.
#[inline]
pub const fn as_repr(self) -> u16 {
// AFAIK, 'as' is the only way to zero-cost convert an int enum to an
// actual int.
self as u16
}
/// Given the underlying representation of a `Look` value, return the
/// corresponding `Look` value if the representation is valid. Otherwise
/// `None` is returned.
#[inline]
pub const fn from_repr(repr: u16) -> Option<Look> {
match repr {
0b00_0000_0001 => Some(Look::Start),
0b00_0000_0010 => Some(Look::End),
0b00_0000_0100 => Some(Look::StartLF),
0b00_0000_1000 => Some(Look::EndLF),
0b00_0001_0000 => Some(Look::StartCRLF),
0b00_0010_0000 => Some(Look::EndCRLF),
0b00_0100_0000 => Some(Look::WordAscii),
0b00_1000_0000 => Some(Look::WordAsciiNegate),
0b01_0000_0000 => Some(Look::WordUnicode),
0b10_0000_0000 => Some(Look::WordUnicodeNegate),
_ => None,
}
}
/// Returns a convenient single codepoint representation of this
/// look-around assertion. Each assertion is guaranteed to be represented
/// by a distinct character.
///
/// This is useful for succinctly representing a look-around assertion in
/// human friendly but succinct output intended for a programmer working on
/// regex internals.
#[inline]
pub const fn as_char(self) -> char {
match self {
Look::Start => 'A',
Look::End => 'z',
Look::StartLF => '^',
Look::EndLF => '$',
Look::StartCRLF => 'r',
Look::EndCRLF => 'R',
Look::WordAscii => 'b',
Look::WordAsciiNegate => 'B',
Look::WordUnicode => '',
Look::WordUnicodeNegate => '',
}
}
}
/// The high-level intermediate representation for a capturing group.
///
/// A capturing group always has an index and a child expression. It may
/// also have a name associated with it (e.g., `(?P<foo>\w)`), but it's not
/// necessary.
///
/// Note that there is no explicit representation of a non-capturing group
/// in a `Hir`. Instead, non-capturing grouping is handled automatically by
/// the recursive structure of the `Hir` itself.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Capture {
/// The capture index of the capture.
pub index: u32,
--> --------------------
--> maximum size reached
--> --------------------