In this article I will explain an LR(1) parser I've been working on for... too long. But I'm really happy with the end result.
This article describing the parser is much longer than the parser itself, so feel free to skip ahead to the final product, a single file with less than 200 lines of code that can turn a grammar and input string into an AST.
The code in this article is javascript so that I can include some interactive demos but it should be straightforward to port this code to other languages. That is ultimately my goal in my larger project of making a self-hosting language that compiles to wasm. But the parser stands quite well on its own too.
I'm not going to explain much about what an LR(1) parser is or why you'd want to use one, since the wikipedia page or a book will do a much better job and this is not particularly my area of expertise: I just have a cool implementation I want to demonstrate here. But in brief, an LR(1) parser can parse a lot(*) of grammars, so it can be a useful approach for a general-purpose tool that can accept many grammars. There are other parsers with their own trade-offs, but this article is about LR(1).
* All deterministic context-free grammars by way of a transformation.
At the highest level, our program will take a grammar and source string as input and produce an abstract syntax tree as output.
However, to do this we will need to define internal representations for:
The action/goto table can be reused many times for the same grammar, so it could make sense to have a serializable representation.
You will often see grammars written in some flavor of BNF which look something like:
E -> E * B | E + B | B
B -> 0 | 1
Under this scheme, non-terminals start with a capital letter, ->
separates the left
hand non-terminal symbol from its definition, |
separates branches of the definition.
Non-terminals are everything else and whitespace can be ignored.
Don't get too hung up on the capital letter distinction, since you would for many performance reasons want to put a lexer in front of this LR(1) parser which can distinguish between terminals and non-terminals with more nuance.
To make this a more convenient machine format, let's expand the branches into a rule for each branch:
E -> E * B
E -> E + B
E -> B
B -> 0
B -> 1
And now we'll augment the grammar with an initial rule.
I'm not entirely sure why, but this seems to makes the internal implementation simpler.
The augmented symbol is often called S'
but you can call yours whatever you want.
This augmented entry point must be the first rule but the other rules can appear in any order.
S' -> E
E -> E * B
E -> E + B
E -> B
B -> 0
B -> 1
You could also automatically augment grammars but I'm doing that explicitly in the input for this parser.
Now converting this grammar to json, so we can lean on JSON.parse()
instead of writing
a custom BNF parser:
[
[ "S'", "E" ],
[ "E", "E", "*", "B" ],
[ "E", "E", "+", "B" ],
[ "E", "B" ],
[ "B", "0" ],
[ "B", "1" ]
]
This last representation will be the final format we use for grammars.
LR parsers need to store something referred to as an "item". An item is a rule from the grammar plus a "dot".
Items are often written in this form, with a "•" character somewhere on the right-hand side of the rule:
E -> • E + B
The dot represents how far along the parser is for the given rule.
LR(1) parsers also need to keep track of the lookahead non-terminal symbol. On top of that, our implementation needs to track which state comes next and index that the current item has in its state.
Instead of copying the rule and placing a dot into the rule itself, this implementation stores the rule and dot as integer indexes.
Putting that all together, each item has:
For example, the data for for the item E -> E + •B
could look like:
{ "rule": 1, "dot": 2, "lookahead": "0", "index": 4, "next": 10 }
This parser stores an itemset as an array so that each rule has an integer index, but the order of items in an itemset is not important.
Each itemset also has a key so that if the same collection of items is generated by separate iterations, those duplicate states will be merged.
Let's look at a different grammar:
[
[ "S'", "S" ],
[ "S", "X", "X" ],
[ "X", "a", "X" ],
[ "X", "b" ]
]
Here's what the itemsets (states) for the example grammar in the previous section looks like,
including the next
and rule
fields:
I0 S' -> •S, lookahead=$, next=1, rule=0
S -> •X X, lookahead=$, next=2, rule=1
X -> •a X, lookahead=a, next=3, rule=2
X -> •a X, lookahead=b, next=3, rule=2
X -> •b, lookahead=a, next=4, rule=3
X -> •b, lookahead=b, next=4, rule=3
I1 S' -> S•, lookahead=$, next=-1, rule=0
I2 S -> X •X, lookahead=$, next=5, rule=1
X -> •a X, lookahead=$, next=6, rule=2
X -> •b, lookahead=$, next=7, rule=3
I3 X -> a •X, lookahead=a, next=8, rule=2
X -> a •X, lookahead=b, next=8, rule=2
X -> •a X, lookahead=a, next=3, rule=2
X -> •b, lookahead=a, next=4, rule=3
X -> •a X, lookahead=b, next=3, rule=2
X -> •b, lookahead=b, next=4, rule=3
I4 X -> b•, lookahead=a, next=-1, rule=3
X -> b•, lookahead=b, next=-1, rule=3
I5 S -> X X•, lookahead=$, next=-1, rule=1
I6 X -> a •X, lookahead=$, next=9, rule=2
X -> •a X, lookahead=$, next=6, rule=2
X -> •b, lookahead=$, next=7, rule=3
I7 X -> b•, lookahead=$, next=-1, rule=3
I8 X -> a X•, lookahead=a, next=-1, rule=2
X -> a X•, lookahead=b, next=-1, rule=2
I9 X -> a X•, lookahead=$, next=-1, rule=2
Or that same example as something that looks more like a data structure:
[
[
{"rule":0,"dot":0,"lookahead":"$","index":0,"next":1},
{"rule":1,"dot":0,"lookahead":"$","next":2,"index":-1},
{"rule":2,"dot":0,"lookahead":"a","next":3,"index":-1},
{"rule":2,"dot":0,"lookahead":"b","next":3,"index":-1},
{"rule":3,"dot":0,"lookahead":"a","next":4,"index":-1},
{"rule":3,"dot":0,"lookahead":"b","next":4,"index":-1},
],
[
{"rule":0,"dot":1,"lookahead":"$","index":0,"next":-1},
],
[
{"rule":1,"dot":1,"lookahead":"$","index":1,"next":5},
{"rule":2,"dot":0,"lookahead":"$","next":6,"index":-1},
{"rule":3,"dot":0,"lookahead":"$","next":7,"index":-1},
],
[
{"rule":2,"dot":1,"lookahead":"a","index":2,"next":8},
{"rule":2,"dot":1,"lookahead":"b","index":3,"next":8},
{"rule":2,"dot":0,"lookahead":"a","next":3,"index":-1},
{"rule":3,"dot":0,"lookahead":"a","next":4,"index":-1},
{"rule":2,"dot":0,"lookahead":"b","next":3,"index":-1},
{"rule":3,"dot":0,"lookahead":"b","next":4,"index":-1},
],
[
{"rule":3,"dot":1,"lookahead":"a","index":4,"next":-1},
{"rule":3,"dot":1,"lookahead":"b","index":5,"next":-1},
],
[
{"rule":1,"dot":2,"lookahead":"$","index":0,"next":-1},
],
[
{"rule":2,"dot":1,"lookahead":"$","index":1,"next":9},
{"rule":2,"dot":0,"lookahead":"$","next":6,"index":-1},
{"rule":3,"dot":0,"lookahead":"$","next":7,"index":-1},
],
[
{"rule":3,"dot":1,"lookahead":"$","index":2,"next":-1},
],
[
{"rule":2,"dot":2,"lookahead":"a","index":0,"next":-1},
{"rule":2,"dot":2,"lookahead":"b","index":1,"next":-1},
],
[
{"rule":2,"dot":2,"lookahead":"$","index":0,"next":-1},
],
]
Finally, the action/goto table creates two arrays, an action
table and a
goto
table. The indexes in each array correspond to a state in the itemsets.
Elements in the action
table map terminals to shift N
, reduce
N
, or acc
(accept, when the parse is finished) operations.
Here's what the action/goto table looks like for the previous grammar and itemsets:
# a b $ S X
0 s3 s4 1 2
1 acc
2 s6 s7 5
3 s3 s4 8
4 r3 r3
5 r1
6 s6 s7 9
7 r3
8 r2 r2
9 r2
The action table is on the left under the terminal columns a
, b
,
and $
. The goto table is on the right under the non-terminal columns S
and X
.
Or again, as something more like a data structure:
{
"actions": [
{"S":"s1","X":"s2","a":"s3","b":"s4"},
{"$":"acc"},
{"X":"s5","a":"s6","b":"s7"},
{"X":"s8","a":"s3","b":"s4"},
{"a":"r3","b":"r3"},
{"$":"r1"},
{"X":"s9","a":"s6","b":"s7"},
{"$":"r3"},
{"a":"r2","b":"r2"},
{"$":"r2"},
],
"goto": [
{"S":1,"X":2},
{},
{"X":5},
{"X":8},
{},
{},
{"X":9},
{},
{},
{},
],
}
Now that we have some internal and external data structures figured out, we can define functions that operate on these data structures.
Again, feel free to skip ahead to the full source code, lr1.js.
The first()
function takes a grammar G
and a symbol A
as
inputs and returns the set of non-terminals that can appear immediately after the symbol
A
.
Or if A
is a terminal, returns a set containing A
.
This implementation uses an explicit stack rather than recursion to make tracking visitors for avoiding cycles simpler.
function first(G, A) {
let result = new Set
let visited = new Set
let stack = [A]
while (stack.length > 0) {
let X = stack.pop()
if (visited.has(X)) continue
visited.add(X)
if (!/^[A-Z]/.test(X) || X === '$') {
result.add(X)
continue
}
for (let i = 0; i < G.length; i++) {
let r = G[i]
if (r[0] === X && r[1] !== X) {
stack.push(r[1])
}
}
}
return result
}
The closure()
function takes a grammar G
and an array of
items
as input.
The return value is the set of items provided as an argument concatenated with
extra items. Whenever an item from the input has a non-terminal immediately after its dot,
all of the rules for that non-terminal are appended to the output as items with a dot set to 0
for each of the lookaheads for that item.
The lookaheads for the additional items are set to be the lookahead for the instigating item if the
dot is at the end of the rule or otherwise the union of the first()
for the current and
next symbols. I'm still not entirely sure why the current and next symbols are used here but putting
this causes the implementation to match all of the reference tables I could find.
function closure(G, items) {
let stack = items.slice()
let state = items.slice()
let m = 0
let visited = new Set
while (true) {
if (stack.length === 0) break
let item = stack.shift()
let key = item.rule + ',' + item.dot + ',' + item.lookahead
if (visited.has(key)) continue
visited.add(key)
state.push(item)
let r = G[item.rule]
let X1 = r[item.dot+1]
let X2 = r[item.dot+2]
for (let i = 0; i < G.length; i++) {
if (G[i][0] !== X1) continue
let ls = item.dot+2 === r.length
? new Set([item.lookahead])
: union(first(G, X1), first(G,X2))
for (let lookahead of ls) {
let nitem = {
rule: i,
dot: 0,
lookahead,
next: -1,
index: -1,
}
stack.push(nitem)
}
}
}
return state
}
Plus we use this set union helper:
function union(A,B) {
return new Set(Array.from(A).concat(Array.from(B)))
}
This closure()
implementation uses an explicit stack and iteration rather than
recursion to make tracking visitors simpler, just like first()
.
States
is a class that collects and deduplicates sets of items (states).
Items in a set do not have a defined order, so the items are sorted to create a unique key in
States.key()
.
add()
will not insert duplicate entries and returns an existing index when it receives
a set of items that has already been inserted.
class States {
static key(items) {
return items.map(item => {
return item.rule + ',' + item.dot + ',' + item.lookahead
}).sort().join('_')
}
constructor(G) {
this.G = G
this.states = []
this._stateKeys = new Map
}
add(items) {
let key = State.key(items)
let i = this._stateKeys.get(key)
if (i !== undefined) return i
i = this.states.length
this.states.push(items)
this._stateKeys.set(key, i)
return i
}
has(items) {
return this._stateKeys.has(State.key(items))
}
}
The last function we need to define before we can generate the states for a grammar is
groupNext()
.
This function takes a state (an array of items) as input and returns a map of the the items grouped by their next symbol as the key and a version of the item with the dot incremented by 1 as the value.
groupNext()
assigns the index into state
that the group originated from,
so that later itemSets()
can assign the next
property for the parent
state.
function groupNext(G, state) {
let groups = new Map
for (let i = 0; i < state.length; i++) {
let s = state[i]
let rule = G[s.rule]
if (s.dot+1 >= rule.length) continue
let X0 = rule[0]
let X1 = rule[s.dot+1]
let gs = groups.get(X1) ?? []
gs.push({
rule: s.rule,
dot: s.dot+1,
lookahead: s.lookahead,
index: i,
next: -1,
})
groups.set(X1, gs)
}
return groups
}
Finally, we have everything we need to generate the itemsets (states) for a given grammar. We'll use these states to generate the action/goto table. This was the trickiest part of the whole program to write and debug, along with the closure lookahead assignment.
The job of itemSets()
is to take the grammar G
and return all of the
itemsets (states) for the grammar.
function itemSets(G) {
let states = new States(G)
let I0 = closure(G, [ { rule: 0, dot: 0, lookahead: '$', index: 0, next: -1 } ])
let i0 = states.add(I0)
let stack = [ I0 ]
while (true) {
if (stack.length === 0) break
let I = stack.shift()
let groups = groupNext(G, I)
for (let T of Array.from(groups.keys()).sort()) {
let gs = groups.get(T)
let J = closure(G, gs)
if (!states.has(J)) stack.push(J)
let j = states.add(J)
for (let si = 0; si < gs.length; si++) {
let g = gs[si]
I[g.index].next = j
}
}
}
return states.states
}
The first state, I0
is the closure of the first rule as an item with dot 0
and a lookahead of $
.
This state initializes the stack for iteration.
Each iteration shifts the first item off the stack. Popping also works but the results will diverge more from reference materials.
The shifted set of items I
are fed into groupNext()
to group the items
according to their next symbol and incremet the item dot values.
The groups are sorted by symbol to better match reference materials and to generate deterministic
output but sorting is not required.
Then J
is generated as the closure of the items in each group gs
.
If J
hasn't appeared before, it is pushed onto the stack.
Whether or not J
has appeared before, we get a state index j
from
states.add()
.
Using the index
property for each item in gs
,
we assign the next
property in I
for the corresponding item in
I
for which the grouped item g
derives.
We will need this next
property later when it comes time to generate the action/goto
table.
Now that we have a function to produce itemsets for all of the states in our grammar, we can build an action/goto table.
createTable()
takes the grammar G
and an array of states
generated by itemSets()
to populate and return two arrays of Maps, one Map for each
state, for the action and goto tables.
Looping over each state, for each item
in the state,
set the action to:
$
, acceptitem.rule
item.next
Still inside the loop, if the current item symbol is a non-terminal, set the item symbol for this
state's goto to item.next
.
function createTable(G, states) {
let action = Array(states.length).fill(null).map(x => new Map)
let goTo = Array(states.length).fill(null).map(x => new Map)
for (let i = 0; i < states.length; i++) {
let I = states[i]
for (let j = 0; j < I.length; j++) {
let item = I[j]
if (item.rule === 0 && item.dot+1 === G[item.rule].length && item.lookahead === '$') {
action[i].set('$', 'acc')
} else if (item.dot+1 === G[item.rule].length) {
let a = item.lookahead
action[i].set(a, 'r' + item.rule)
} else {
let a = G[item.rule][item.dot+1]
action[i].set(a, 's' + item.next)
}
let A = G[item.rule][item.dot+1]
if (/^[A-Z]/.test(A)) {
goTo[i].set(A, item.next)
}
}
}
return { action, goTo }
}
Finally, we can generate an AST based on an { action, goTo }
table and an input string.
A stack
and nodes
array are kept in sync for iterations of the loop, with
stack
storing rule numbers and nodes
storing symbol
and
children
properties.
Each iteration of the loop looks at the state (the index of a rule in the grammar) on the top of the
stack.
Then the action a
for the lookahead
(the current token from the input
string) is read from the action table.
Actions that start with an "s" shift and "r" reduce, followed by the action number.
Shift actions push the current lookahead as a terminal symbol to nodes
and push the action number to the stack
.
They also advance the lookahead to the next input token.
Reduce actions treat the action number as a rule in the grammar G
.
The arity
is the number of symbols in this rule (minus one, the left-hand symbol).
Then the action handler pushes a new node with the non-terminal left-hand symbol and
arity
-many children
popped from nodes
.
arity
-many stack items are also popped while children
is populated.
Finally, the reduce action looks up the rule symbol (obtained from the action number) in the goto table for the new top of the stack. This value in the goto table is pushed into the stack.
There are also "acc" actions that accept the parse and return the AST.
function parse(G, { action, goTo }, input) {
let firstSymbol = G[0][0]
let offset = 0
let stack = [0]
let nodes = []
let lookahead = input[0]
while (true) {
let state = stack[stack.length-1]
let a = action[state].get(lookahead)
if (a === undefined) throw new Error(`no action for state: ${state} lookahead: ${lookahead}`)
let n = Number(a.slice(1))
if (/^s/.test(a)) {
nodes.push({ symbol: lookahead, offset })
stack.push(n)
lookahead = input[++offset]
} else if (/^r/.test(a)) {
let symbol = G[n][0]
let arity = G[n].length - 1
let children = Array(arity)
for (let i = 0; i < arity; i++) {
children[arity-1-i] = nodes.pop()
stack.pop()
}
nodes.push({ symbol, children })
state = stack[stack.length-1]
let gt = goTo[state].get(symbol)
if (gt === undefined) throw new Error(`undefined goto for state=${state} symbol=${symbol}`)
stack.push(gt)
} else if (a === 'acc') {
return { symbol: firstSymbol, children: nodes }
} else {
throw new Error(`unexpected value for action ${a}`)
}
}
}
The full public domain source is available here: lr1.js.
Use the interactive demo below to compose or edit a grammar and to generate itemsets, an action/goto table, or an AST from an input string.
# itemsets
...
# action/goto table
...
# ast output
...
Here are some more links that helped me to figure everything out and put all of this together.