A lexer which can parse any text input into tokens, according to the provided regular expressions.
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
- Allows named groups in regular expressions, so you don't have to extract matched values yourself
- Allows post-processing tokens, to extract any additional information you require
The package is available as universal-lexer
in NPM, so you can use it in your project via
npm install universal-lexer
or yarn add universal-lexer
The code itself is written in ES6 and should work in any Node.js 6+ environment.
If you would like to use it in a browser or an older environment, a transpiled and bundled (UMD) version is included as well.
You can require universal-lexer/browser
in your code, or use the UniversalLexer
global in a browser environment:
// Load library
const UniversalLexer = require('universal-lexer/browser')
// Create lexer
const lexer = UniversalLexer.compile(definitions)
// ...
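In a plain browser environment without a module loader, the UMD bundle exposes the global mentioned above instead — a minimal sketch, where the exact script path is an assumption:
// Assumed script tag (the bundle path may differ):
// <script src="node_modules/universal-lexer/browser.js"></script>
// Afterwards the global is available:
const browserLexer = window.UniversalLexer.compile(definitions)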
You've got two sets of functions: the build functions generate lexer source code, while the compile functions return a ready-to-use tokenizing function:
// Load library
const UniversalLexer = require('universal-lexer')
// Build the source code for a lexer
const code1 = UniversalLexer.build([ { type: 'Colon', value: ':' } ])
const code2 = UniversalLexer.buildFromFile('json.yaml')
// Dynamically compile a function which can be used right away
const func1 = UniversalLexer.compile([ { type: 'Colon', value: ':' } ])
const func2 = UniversalLexer.compileFromFile('json.yaml')
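Since the build functions return the lexer source as a string (as suggested above), you can persist a generated lexer for later use — a minimal sketch using Node.js fs:
// Write the generated lexer source to a file
// (assumes the build functions return source code as a string)
const fs = require('fs')
fs.writeFileSync('lexer.js', code2)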
There are two ways of passing rules to this lexer: from a file, or from an array of definitions.
Simply pass the definitions to the lexer:
// Load library
const UniversalLexer = require('universal-lexer')
// Create token definition
const Colon = {
  type: 'Colon',
  value: ':'
}
// Build array of definitions
const definitions = [ Colon ]
// Create lexer
const lexer = UniversalLexer.compile(definitions)
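The compiled lexer is simply a function, so you can call it on input right away (the exact result shape is described later in this README):
// Tokenize a sample input with the compiled lexer
const result = lexer(':')
// result.tokens should contain a single 'Colon' token spanning start 0 to end 1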
A definition is a more complex object:
// Required fields: 'type' and either 'regex' or 'value'
{
  // Token name
  type: 'String',

  // String value which should be matched at the beginning of the remaining input
  // (alternative to 'regex')
  value: 'abc',

  // Regular expression to validate
  // whether the current match should really be parsed as this token.
  // Useful e.g. when you require a separator after a sentence,
  // but you don't want to include it in the token itself.
  valid: '"',

  // Regular expression flags for the 'valid' field
  validFlags: 'i',

  // Regular expression to find the current token.
  // You can use named groups as well: (?<name>expression)
  // Their matches will be attached to the token data.
  regex: '"(?<value>([^"]|\\.)+)"',

  // Regular expression flags for the 'regex' field
  regexFlags: 'i'
}
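For example, the named group in the regex above lands in the token's data — an illustration based on the result format shown later in this README:
// Matching the input '"abc"' with regex '"(?<value>([^"]|\\.)+)"'
// should yield roughly:
// { type: 'String', data: { value: 'abc' }, start: 0, end: 5 }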
Alternatively, you can compile a lexer straight from a YAML file:
// Load library
const UniversalLexer = require('universal-lexer')
const lexer = UniversalLexer.compileFromFile('scss.yaml')
For now, the YAML file should contain only a Tokens
property with the definitions.
Later it may gain more advanced features like macros (for a simpler syntax).
Example:
Tokens:
  # Whitespaces
  - type: NewLine
    value: "\n"
  - type: Space
    regex: '[ \t]+'
  # Math
  - type: Operator
    regex: '[-+*/]'
  # Color
  # It has a 'valid' field, to be sure that it's not e.g. 'blacker'
  # It will check that there is no word character directly after the match
  - type: Color
    regex: '(?<value>black|white)'
    valid: '(black|white)[^\w]'
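As a quick sketch of how these rules behave, assuming they were saved to a hypothetical colors.yaml file:
// Compile the rules above (the file name is hypothetical)
const tokenizeColors = UniversalLexer.compileFromFile('colors.yaml')
// 'black+white\n' should produce: Color('black'), Operator('+'), Color('white'), NewLine
// (the trailing newline lets the final 'white' pass its 'valid' check)
const colorTokens = tokenizeColors('black+white\n')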
Processing input data after you have created a lexer is pretty straightforward, as the compiled lexer is simply a function:
// Load library
const UniversalLexer = require('universal-lexer')
// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')
// Tokenize the input
const tokens = tokenize('some { background: code }').tokens
If you would like to do more advanced processing of the parsed tokens, you can pass a processor function as the second argument:
// Load library
const UniversalLexer = require('universal-lexer')
// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')
// This is the 'Literal' definition:
const Literal = {
  type: 'Literal',
  regex: '(?<value>([^\\t \\n;"\',{}()\\[\\]#=:~&\\\\]|(\\\\.))+)'
}
// Create a processor which will replace every '\X' with 'X' in the value
function process (token) {
  if (token.type === 'Literal') {
    token.data.value = token.data.value.replace(/\\(.)/g, '$1')
  }
  return token
}
// Also, you can return a new token
function process2 (token) {
  if (token.type !== 'Literal') {
    return token
  }
  return {
    type: 'Literal',
    data: {
      value: token.data.value.replace(/\\(.)/g, '$1')
    },
    start: token.start,
    end: token.end
  }
}
// Get all tokens...
const tokens = tokenize('some { background: code }', process).tokens
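To see the effect, note what happens to an escaped literal — an illustration assuming 'Literal' above is among the compiled rules:
// 'some\ literal' matches as a single Literal token (the escaped space
// goes through the (\\.) branch), and the processor strips the backslash:
const processed = tokenize('some\\ literal', process).tokens
// → [ { type: 'Literal', data: { value: 'some literal' }, ... } ]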
If you would like to get the beautified code of the lexer,
you can use the second argument of the compile
functions:
UniversalLexer.compile(definitions, true)
UniversalLexer.compileFromFile('scss.yaml', true)
On success, you will retrieve a simple object with an array of tokens:
{
  tokens: [
    { type: 'Whitespace', data: { value: ' ' }, start: 0, end: 5 },
    { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
  ]
}
When something goes wrong, you will get error information instead:
{
  error: 'Unrecognized token',
  index: 1,
  line: 1,
  column: 2
}
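A typical way to handle both outcomes — a minimal sketch based on the two shapes above:
// Handle both possible result shapes
const result = tokenize('some { background: code }')
if (result.error) {
  // Error shape: { error, index, line, column }
  console.error(result.error + ' at line ' + result.line + ', column ' + result.column)
} else {
  // Success shape: { tokens: [...] }
  for (const token of result.tokens) {
    console.log(token.type, token.data)
  }
}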
For now, you can see an example of JSON semantics in the examples/json.yaml
file.
After installing it globally (or inside NPM scripts), the universal-lexer
command is available:
Usage: universal-lexer [options] output.js
Options:
  --version       Show version number             [boolean]
  -s, --source    Semantics file                  [required]
  -b, --beautify  Should beautify code?           [boolean] [default: true]
  -h, --help      Show help                       [boolean]
Examples:
  universal-lexer -s json.yaml lexer.js  build lexer from semantics file
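The generated file can then be loaded like any other module — a sketch, assuming the emitted module exports the tokenizing function directly:
// Load the lexer generated by the CLI (the export shape is an assumption)
const tokenize = require('./lexer.js')
const result = tokenize('{ "key": "value" }')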
- 2.0.6 - bugfix for single characters
- 2.0.5 - fix mistake in README file (post-processing code)
- 2.0.4 - remove unneeded benchmark dependency
- 2.0.3 - add unit and E2E tests, fix small bugs
- 2.0.2 - added CLI command
- 2.0.1 - fix typo in README file
- 2.0.0 - optimize the lexer (up to 10x faster) through expression analysis and other improvements
- 1.0.8 - make the current position in syntax errors always start from 1
- 1.0.7 - optimize definitions with "value", make syntax errors developer-friendly
- 1.0.6 - optimize Lexer performance (20% faster on average)
- 1.0.5 - fix browser version to be put into NPM package properly
- 1.0.4 - bugfix for debugging
- 1.0.3 - add proper sanitization for debug HTML
- 1.0.2 - small fixes for README file
- 1.0.1 - added Rollup.js support to build a browser version