# tinylex

A simple iterative lexer written in TypeScript.

*Under development.*
Install:

```shell
npm install tinylex
```
Import:

```javascript
const { Lexer } = require('tinylex')
```
Code:

```javascript
const code = `#
# Darklord source
#
summon "messenger"

forge harken(msg) {
  messenger(msg || 'All shall flee before me!')
}

craft lieutenants = 12
craft message = "I have " + leutenants + " servants"

harken.wield(message)`
```
Rules:

```javascript
const KEYWORDS = ['summon', 'forge', 'craft', 'wield', 'if', 'while', 'true', 'false', 'null']
const KEYWORD = new RegExp(`^(?:${KEYWORDS.join('|')})`)
const COMMENT = /^\s*#[^\n]*\n/
const IDENTIFIER = /^[a-z]\w*/
const NUMBER = /^[+-]?\d+\.?\d*/
const STRING_SINGLE = /^'([^']*)'/
const STRING_DOUBLE = /^"([^"]*)"/
const LOGICAL = /^(?:\|\||&&)/
const WHITESPACE = /^\s/

const rules = [
  [COMMENT, 'COMMENT'],
  [KEYWORD, 0],
  [IDENTIFIER, 'IDENTIFIER'],
  [NUMBER, 'NUMBER'],
  [LOGICAL, 0],
  [STRING_DOUBLE, 'STRING'],
  [STRING_SINGLE, 'STRING'],
  [WHITESPACE]
]
```
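The patterns are plain anchored JavaScript regular expressions, so their behaviour can be checked in isolation. A standalone sanity check (no tinylex required; the sample inputs are made up for illustration):

```javascript
// Standalone check of anchored lexer patterns like the ones above.
const NUMBER = /^[+-]?\d+\.?\d*/
const STRING_DOUBLE = /^"([^"]*)"/
const LOGICAL = /^(?:\|\||&&)/

// Group 0 is the full lexeme.
console.log(NUMBER.exec('12 servants')[0])        // '12'
// Group 1 captures only the text between the quotes.
console.log(STRING_DOUBLE.exec('"messenger"')[1]) // 'messenger'
// The ^ anchor means a match only succeeds at the cursor position.
console.log(LOGICAL.test('|| fallback'))          // true
console.log(LOGICAL.test('a || b'))               // false
```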
Instantiate:

```javascript
const lexer = new Lexer(code, rules)
```
Consume:

```javascript
for (let token of lexer) console.log(token)
```
or
```javascript
while (!lexer.done()) console.log(lexer.lex())
```
or
```javascript
const tokens = [...lexer]
console.log(tokens)
```
or
```javascript
const tokens = lexer.tokenize()
console.log(tokens)
```
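All of these consumption styles rest on standard JavaScript iteration. As a rough sketch of how such an interface fits together (an illustrative toy with hypothetical `done`/`lex` method names, not tinylex's actual implementation):

```javascript
// Toy iterable lexer-like object, for illustration only.
class ToyLexer {
  constructor (tokens) {
    this.tokens = tokens
    this.pos = 0
  }
  done () {
    return this.pos >= this.tokens.length
  }
  lex () {
    return this.tokens[this.pos++]
  }
  // Implementing the iterator protocol enables both for...of
  // and the spread operator.
  * [Symbol.iterator] () {
    while (!this.done()) yield this.lex()
  }
}

const lexer = new ToyLexer([['CRAFT', 'craft'], ['EOF', 'EOF']])
console.log([...lexer]) // [ [ 'CRAFT', 'craft' ], [ 'EOF', 'EOF' ] ]
```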
Result:

```javascript
// ------------------------------------------------------------------
// generated tokens
// ------------------------------------------------------------------
[ 'COMMENT', '#' ]
[ 'COMMENT', '# Darklord source' ]
[ 'COMMENT', '#' ]
[ 'SUMMON', 'summon' ]
[ 'STRING', 'messenger' ]
[ 'FORGE', 'forge' ]
[ 'IDENTIFIER', 'harken' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'msg' ]
[ ')', ')' ]
[ '{', '{' ]
[ 'IDENTIFIER', 'messenger' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'msg' ]
[ '||', '||' ]
[ 'STRING', 'All shall flee before me!' ]
[ ')', ')' ]
[ '}', '}' ]
[ 'CRAFT', 'craft' ]
[ 'IDENTIFIER', 'lieutenants' ]
[ '=', '=' ]
[ 'NUMBER', '12' ]
[ 'CRAFT', 'craft' ]
[ 'IDENTIFIER', 'message' ]
[ '=', '=' ]
[ 'STRING', 'I have ' ]
[ '+', '+' ]
[ 'IDENTIFIER', 'leutenants' ]
[ '+', '+' ]
[ 'STRING', ' servants' ]
[ 'IDENTIFIER', 'harken' ]
[ '.', '.' ]
[ 'WIELD', 'wield' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'message' ]
[ ')', ')' ]
[ 'EOF', 'EOF' ]
```
## Rules

```javascript
const rules = [
  [COMMENT, 'COMMENT'],        // ['COMMENT', '# Darklord source']
  [KEYWORD, 0],                // ['SUMMON', 'summon']
  [IDENTIFIER, 'IDENTIFIER'],  // ['IDENTIFIER', 'harken']
  [NUMBER, 'NUMBER'],          // ['NUMBER', '12']
  [LOGICAL, 0],                // ['||', '||']
  [STRING_DOUBLE, 'STRING'],   // ['STRING', 'messenger']
  [STRING_SINGLE, 'STRING'],   // ['STRING', 'All shall flee...']
  [WHITESPACE]
]
```
Rules are specified in the form `[RegExp, string|number|function|null|undefined]`:
`RegExp`
: the match criteria, specified as a regular expression object.

`string`
: the name of the token, e.g. `'COMMENT'`, as in `[COMMENT, 'COMMENT']`. The token content is taken from match group 0 (the lexeme) of the RegExp match object, which produces the token `['COMMENT', '# Darklord source']`. If the RegExp contains a match group, then match group 1 is used instead, as is the case for the string rules, e.g. `/^"([^"]*)"/`, which captures the portion of the match between the quotes. This only works for match group 1.
`number`
: the number of the match group to use for both the token name and content, as in `[KEYWORD, 0]`, which produces the token `['SUMMON', 'summon']`. This means that if your regular expression contains a match group, you can use it to generate both the name and the value of the token: `[SOME_REGEXP, 1]`.
`null|undefined`
: no token is created from the match; the match is discarded altogether, as in `[WHITESPACE]`, which swallows whitespace with no other effect. The cursor is still advanced by the length of the lexeme (match group 0).
`function`
: a function used to create the token, discard the match, and/or advance the cursor by some positive, non-zero integer amount (TinyLex always advances the cursor to avoid infinite loops). Functions here can also push multiple tokens if desired. If the function returns `null` or `undefined`, the cursor is advanced by the length of the lexeme (match group 0). If the function returns a number <= 1, the cursor is advanced by one. The function's `this` context is set to the lexer instance.
```javascript
// We could use a function to swallow whitespace.
[WHITESPACE, function (match, tokens) {
  // Advance the cursor by one. If we don't return a number, the
  // cursor is advanced by the size of the lexeme (match group 0),
  // so in this case returning 1 is no different from returning
  // null or undefined.
  return 1
}]
```
```javascript
// We could use a function to customize the token in some way.
[LOGICAL, function (match, tokens) {
  const lexeme = match[0]
  tokens.push([lexeme, lexeme])
  // We don't actually need to do this because by default the
  // cursor is advanced by the lexeme length (match group 0).
  return lexeme.length
}]
```
Note: when using a rule function you must push one or more tokens onto the `tokens` array unless you intend to discard the match. If no tokens are pushed, no token is generated.
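Tying the four rule forms together, the per-match dispatch described above can be modelled roughly as follows. This is a simplified standalone sketch of the documented semantics; `applyRule` and its signature are inventions for illustration, not tinylex internals:

```javascript
// Simplified model of rule dispatch: given a regex match and a rule
// action, emit tokens and return how far the cursor should advance.
function applyRule (match, action, tokens) {
  const lexeme = match[0]
  if (typeof action === 'string') {
    // String: the action names the token; content is group 1 if the
    // pattern captured one, otherwise group 0 (the lexeme).
    tokens.push([action, match[1] !== undefined ? match[1] : lexeme])
  } else if (typeof action === 'number') {
    // Number: the chosen group supplies both the (uppercased) name
    // and the content.
    tokens.push([match[action].toUpperCase(), match[action]])
  } else if (typeof action === 'function') {
    // Function: may push tokens itself and return a custom advance.
    // A return <= 1 advances by one; no numeric return falls back to
    // the lexeme length.
    const advance = action(match, tokens)
    return typeof advance === 'number' ? Math.max(1, advance) : lexeme.length
  }
  // null/undefined action (and the string/number cases): advance
  // past the lexeme.
  return lexeme.length
}

const tokens = []
applyRule(/^"([^"]*)"/.exec('"messenger"'), 'STRING', tokens)
applyRule(/^craft/.exec('craft x'), 0, tokens)
console.log(tokens) // [ [ 'STRING', 'messenger' ], [ 'CRAFT', 'craft' ] ]
```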
## onToken

This function, if given, is called for every token. It can modify the contents of the token, return an entirely new token, or discard some or all tokens (except for the final `EOF` token, which can be transformed but not removed). Register it by calling `lexer.onToken` with a function; the function is called with its `this` context set to the lexer instance.
```javascript
const lexer = new Lexer(code, rules)

// The callback function will have its 'this' context set
// to the lexer instance.
lexer.onToken(function (token) {
  return token
})
```
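The filtering behaviour can be modelled in isolation: each token passes through the callback, a returned token replaces the original, and a `null`/`undefined` return drops it, except for `EOF`, which is always kept. The `applyOnToken` helper below is a hypothetical standalone model of those semantics, not part of tinylex:

```javascript
// Standalone model of the onToken semantics described above.
function applyOnToken (tokens, callback) {
  const out = []
  for (const token of tokens) {
    const result = callback(token)
    if (result !== null && result !== undefined) {
      out.push(result)  // transformed (or unchanged) token
    } else if (token[0] === 'EOF') {
      out.push(token)   // the final EOF token cannot be removed
    }
  }
  return out
}

// Drop comments, keep everything else.
const filtered = applyOnToken(
  [['COMMENT', '# Darklord source'], ['CRAFT', 'craft'], ['EOF', 'EOF']],
  token => (token[0] === 'COMMENT' ? null : token)
)
console.log(filtered) // [ [ 'CRAFT', 'craft' ], [ 'EOF', 'EOF' ] ]
```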
## Options

The `onError` option specifies what to do when no match is found at the cursor.
`tokenize`
: (default) tokenize the next single character and advance the cursor by one.

`ignore`
: advance the cursor by one and do nothing else.

`throw`
: throw an error indicating that no match was found.
```javascript
// onError can be 'tokenize', 'throw', or 'ignore'.
const lexer = new Lexer(code, rules, { onError: 'tokenize' })
```
Note: `onError` is the only configuration option.