A parser for a simplified, regular subset of JavaScript regular expressions that doesn’t support capturing.
Because it’s regular, the subset doesn’t support:
- backreferences
Because it doesn’t support capturing, it doesn’t support:
- capturing groups (`(…)` unless `(?:…)`)
- greediness modifiers
Because it’s simplified, it doesn’t support:
- assertions (anchors, word boundaries, `(?=…)`, and `(?!…)`)
- escapes with easy alternatives that are obscure (`\cX`), uncommon (`\f`, `\v`), or syntactically awkward (`\0`)
- escapes that aren’t necessary in any context
- `\s` and `\S` (what they match is not obvious)
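
The excluded escapes can be compared with spellings the subset does accept by using the built-in `RegExp` directly. The sketch below is only an illustration: `matchedCodeUnits` is a hypothetical helper, not part of this package, and the chosen pairs are examples of the "easy alternatives" mentioned above.

```javascript
// Hypothetical helper: lists every UTF-16 code unit that a one-character
// pattern matches when compiled by the built-in RegExp with no flags.
function matchedCodeUnits(source) {
  const re = new RegExp(source);
  const units = [];
  for (let unit = 0; unit <= 0xffff; unit++) {
    if (re.test(String.fromCharCode(unit))) units.push(unit);
  }
  return units;
}

// Each pair: [escape left out of the subset, spelling the subset accepts].
const pairs = [
  ['\\f', '\\x0c'], // form feed
  ['\\v', '\\x0b'], // vertical tab
  ['\\0', '\\x00'], // NUL
  ['\\cJ', '\\n'],  // control-J is LF
];

for (const [excluded, alternative] of pairs) {
  const same =
    String(matchedCodeUnits(excluded)) === String(matchedCodeUnits(alternative));
  console.log(`${excluded} matches the same code units as ${alternative}: ${same}`);
}

// \s is less obvious: besides ASCII whitespace it also matches, among others,
// U+00A0 (no-break space) and U+FEFF (byte order mark).
const whitespace = matchedCodeUnits('\\s');
console.log(whitespace.includes(0x00a0), whitespace.includes(0xfeff)); // true true
```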
Syntax
When syntactically valid, a pattern has the same meaning as it does in JavaScript (i.e. when passed to the `RegExp` constructor) with no flags.

    pattern = disjunction

    disjunction = alternative [ "|" disjunction ]

    alternative = *term

    term = atom [ quantifier ]

    quantifier = "*" /                        ; zero or more
                 "+" /                        ; one or more
                 "?" /                        ; zero or one
                 "{" *DIGIT "}" /             ; exactly count. counts are at most Number.MAX_SAFE_INTEGER.
                 "{" *DIGIT ",}" /            ; at least count
                 "{" *DIGIT "," *DIGIT "}"    ; at least first count and at most second. must be a non-empty range.

    atom = pattern-character /    ; the character itself
           "." /                  ; any character except CR, LF, U+2028, and U+2029
           "\" atom-escape /
           character-class /
           "(?:" pattern ")"

    character-class = "[" [ "^" ]    ; indicates a negated character class
                      *range "]"

    range = range-character "-" range-character /    ; must be a non-empty range
            range-character /
            "\" predefined-range

    range-character = range-plain-character /
                      "\" range-escape

    character-escape = "n" /    ; LF
                       "r" /    ; CR
                       "t" /    ; tab
                       "x" 2hex-digit /
                       "u" 4hex-digit

    predefined-range = "d" / "D" /    ; [0-9], [^0-9]
                       "w" / "W"      ; [0-9A-Za-z_], [^…]

    atom-escape = character-escape / predefined-range / pattern-metacharacter / "/"

    range-escape = character-escape / range-metacharacter / "/" / "["

    range-metacharacter = "^" / "\" / "-" / "]"

    pattern-metacharacter = "^" / "$" / "\" / "." / "*" / "+" / "?" / "(" / ")" / "[" / "]" / "{" / "}" / "|"

    hex-digit = HEXDIG /    ; 0-9A-F
                %x61-66     ; a-f

`pattern-character` is any UTF-16 code unit that is not a `pattern-metacharacter`. Similarly, `range-plain-character` is any UTF-16 code unit that is not a `range-metacharacter`.
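
To make the two constraints on the `quantifier` rule concrete (counts no larger than `Number.MAX_SAFE_INTEGER`, and a non-empty range), here is a minimal sketch that turns the text of one quantifier into `min` and `max` bounds. `quantifierBounds` is a hypothetical name and this is not the parser's own code; the sketch also assumes every count has at least one digit and picks `SyntaxError` and `RangeError` arbitrarily.

```javascript
// Hypothetical helper, not part of the documented API: maps the text of one
// quantifier to the { min, max } bounds it stands for.
function quantifierBounds(token) {
  if (token === '*') return { min: 0, max: Infinity };
  if (token === '+') return { min: 1, max: Infinity };
  if (token === '?') return { min: 0, max: 1 };

  // "{n}", "{n,}", or "{n,m}". Assumption: every count has at least one digit.
  const match = /^\{(\d+)(?:(,)(\d*))?\}$/.exec(token);
  if (match === null) {
    throw new SyntaxError(`not a quantifier: ${token}`);
  }

  const min = Number(match[1]);
  let max;
  if (match[2] === undefined) {
    max = min;              // "{n}": exactly n
  } else if (match[3] === '') {
    max = Infinity;         // "{n,}": at least n
  } else {
    max = Number(match[3]); // "{n,m}": at least n and at most m
  }

  // Counts are at most Number.MAX_SAFE_INTEGER (2^53 - 1).
  if (!Number.isSafeInteger(min) || (max !== Infinity && !Number.isSafeInteger(max))) {
    throw new RangeError(`count too large in ${token}`);
  }
  // "{5,2}" describes an empty range and is rejected.
  if (max < min) {
    throw new RangeError(`empty range in ${token}`);
  }
  return { min, max };
}

console.log(quantifierBounds('{2,5}')); // { min: 2, max: 5 }
console.log(quantifierBounds('{3,}'));  // { min: 3, max: Infinity }
```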
Parser output format
Every node has a `type` property (a string). The node types and their properties are:

- `Disjunction`
  - `alternatives`, a non-empty array of `Alternative` nodes
- `Alternative`
  - `terms`, an array of `Term` nodes
- `Term`
  - `atom`, a `Character`, `CharacterClass`, or `Disjunction` node
  - `quantifier`, `null` or an object with the following properties:
    - `min`, the minimum number of repetitions indicated by the quantifier; an integer from 0 to 2^53−1
    - `max`, the maximum number of repetitions indicated by the quantifier; an integer from `min` to 2^53−1, or `Infinity`
- `Character`
  - `value`, an integer UTF-16 code unit
- `CharacterClass`
  - `negated`, a boolean
  - `ranges`, an array of objects with inclusive integer `start` and `end` properties representing UTF-16 code units

Nodes can be modified safely.
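
As a rough illustration of consuming this format, the sketch below walks a tree depth-first and tests a code unit against a `CharacterClass`. The tree is written by hand in the documented shape rather than produced by the parser, and `walk` and `classAccepts` are hypothetical helper names, not part of the API.

```javascript
// Hand-written tree in the documented format: one alternative with a single
// term, a non-negated class containing "a" (97) and "b" (98), repeated 1+ times.
const tree = {
  type: 'Disjunction',
  alternatives: [{
    type: 'Alternative',
    terms: [{
      type: 'Term',
      atom: {
        type: 'CharacterClass',
        negated: false,
        ranges: [{ start: 97, end: 97 }, { start: 98, end: 98 }],
      },
      quantifier: { min: 1, max: Infinity },
    }],
  }],
};

// Visits every node depth-first, calling visit(node) on each one.
function walk(node, visit) {
  visit(node);
  switch (node.type) {
    case 'Disjunction':
      node.alternatives.forEach(alternative => walk(alternative, visit));
      break;
    case 'Alternative':
      node.terms.forEach(term => walk(term, visit));
      break;
    case 'Term':
      walk(node.atom, visit);
      break;
    // Character and CharacterClass nodes have no children.
  }
}

// Returns true when a CharacterClass node accepts the given UTF-16 code unit.
// Ranges are inclusive; `negated` inverts the result.
function classAccepts(node, codeUnit) {
  const inSomeRange = node.ranges.some(
    range => range.start <= codeUnit && codeUnit <= range.end
  );
  return node.negated ? !inSomeRange : inSomeRange;
}

const types = [];
walk(tree, node => types.push(node.type));
console.log(types.join(' > ')); // Disjunction > Alternative > Term > CharacterClass

const charClass = tree.alternatives[0].terms[0].atom;
console.log(classAccepts(charClass, 'a'.charCodeAt(0))); // true
console.log(classAccepts(charClass, 'c'.charCodeAt(0))); // false
```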
API
`.parse(string)`
Returns a `Disjunction` node or throws a `PatternError`.

`.PatternError`
The type of error thrown by `.parse`.
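
One way to separate invalid patterns from other failures is to check the thrown value against `PatternError`. This is a sketch under assumptions: the package name is not given here, so the exported object is passed in as `api`; `tryParse` is a hypothetical helper; and `PatternError` is assumed to be a constructor whose instances carry a `message`, as `Error` subclasses do.

```javascript
// `api` stands for the object exported by this package (with `parse` and
// `PatternError`). It is a parameter because the module name is not stated.
function tryParse(api, source) {
  try {
    return { ok: true, tree: api.parse(source) };
  } catch (error) {
    if (error instanceof api.PatternError) {
      // The pattern is outside the supported syntax.
      return { ok: false, message: error.message };
    }
    throw error; // not a pattern problem, so let it propagate
  }
}
```

A caller can then branch on `ok` instead of wrapping every call to `.parse` in its own `try` block.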
Example
The result of parsing a pattern that matches one or more occurrences of `a` or `b` (for example `[ab]+`):

    {
      type: 'Disjunction',
      alternatives: [
        {
          type: 'Alternative',
          terms: [
            {
              type: 'Term',
              atom: {
                type: 'CharacterClass',
                negated: false,
                ranges: [
                  { start: 97, end: 97 },
                  { start: 98, end: 98 }
                ]
              },
              quantifier: { min: 1, max: Infinity }
            }
          ]
        }
      ]
    }