Now that I've implemented an LR parser which handles indents nicely, I
want to write down how it works.
First, the basic LR parser. This is nothing new.
There is a set of tokens - primarily numbers. They can have content
attached, but that does not affect the parsing.
Any difference that affects parsing must make a different token.
So each reserved word is a token, and each special symbol is too.
Then 'identifier', 'number', and 'string' are all tokens.
These are "terminals" - anything produced by lexical analysis (aka
scanning) is a terminal token.
There are also "non-terminal" tokens. These are the "head" of
productions.
For each non-terminal there must be at least one production, possibly
several. Each production consists of a "head" and a "body". The body
is simply a list of tokens, whether terminals or non-terminals.
   If_statement -> if Expression then Statementlist else Statementlist fi
Here I have capitalised the non-terminals.
What this means is that if the head token is permitted somewhere, then
the body sequence can be used in its stead. Alternately, if the body
sequence is found where the head might be expected, then it can be
treated as if it *was* the head.
Parsing a text (a sequence of terminals) against a grammar like this
involves a sequence of "shift" and "reduce" steps. These happen in
the context of a stack. Each entry on the stack represents a single
token, though it may have other state attached.
It also happens in the context of a 1-token look-ahead: as well as
knowing what is on the stack, we know what the next terminal token is.
A "shift" operation pushes the look-ahead terminal onto the stack.
This then allows us to see what the next terminal is.
A "reduce" operation can happen when the last few tokens on the stack
exactly match the body of some production (and when all the earlier
tokens on the stack provide a valid preamble to the head of that
production).
When this happens we pop off the tokens representing the body, and
push on the head token.
So terminals get onto the stack from 'shift' and non-terminals get onto
the stack from 'reduce'.
Eventually a reduce step will push the special 'start' non-terminal
onto the stack. When that happens, we have finished.
So the first interesting question is: how do we know when to shift and
when to reduce - and which production to reduce? This is the fun bit.
First we need to introduce the idea of a 'state'. A state is
effectively some position in some production. So one state would
represent the position in the above production where 'then' has been
shifted; in this state we are expecting a Statementlist.
Another state might be after the Statementlist, when we are expecting
'else'. In any state, we know there are a bunch of tokens on the
stack, and there is something we are expecting.
Every entry on the stack already mentioned has a 'state' along with
the token. It is the state *after* that token was shifted or reduced.
Based on the state of the top element of the stack, we know what to
expect. Based on the state and the look-ahead terminal, we know what
to do.
To encode this knowledge we need a few different data structures.

1/ An action table. This maps from a state plus a look-ahead terminal
   to an action. The action is one of "Shift", "Accept" (which means
   the Start symbol has been reduced) or "Reduce-N", where N is the
   number of the production to be reduced.
2/ A go_to table, or state-transition table. This maps a state plus a
   token to a new state. Of the two states mentioned above, the 'go_to'
   entry for the first state combined with the token "Statementlist"
   would be the second state.
3/ A production table. For each production we need to know how many
   tokens to pop, which token to push, and maybe what action to perform
   when that happens. Often this will extract information from the
   popped tokens and store it in the pushed token, to build up an
   abstract syntax tree.
In simple terms the action of parsing then becomes:

1- Set current state to '0'.
2- Look up current state together with the look-ahead token in
   the action table.
 2a- If "Accept" - stop.
 2b- If "Shift", push 'state' and 'look-ahead' onto the stack.
 2c- If "Reduce", find the production, perform its action, pop the
     appropriate number of tokens, keeping the state from the first
     popped token as the current state.
     Then push the head token and current state onto the stack.
3- Look up current state and the top-of-stack token in the go-to table,
   set current state to the value found, and repeat from step 2.
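As a concrete illustration, the loop above can be sketched in Python with
hand-built tables for a tiny assumed grammar (0: Start -> E, 1: E -> E + n,
2: E -> n). For brevity the stack holds only states, and the shift targets
are folded into the action table rather than looked up separately in the
go-to table; the real implementation's tables and token representation will
differ.

```python
# Hand-built SLR tables for an assumed toy grammar:
#   0: Start -> E        1: E -> E + n        2: E -> n
ACTION = {
    (0, 'n'): ('shift', 1),
    (1, '+'): ('reduce', 2), (1, '$'): ('reduce', 2),
    (2, '+'): ('shift', 3),  (2, '$'): ('accept', None),
    (3, 'n'): ('shift', 4),
    (4, '+'): ('reduce', 1), (4, '$'): ('reduce', 1),
}
GOTO = {(0, 'E'): 2}                 # go-to entries for non-terminals
PRODUCTIONS = {1: ('E', 3), 2: ('E', 1)}   # head symbol, body length

def parse(tokens):
    """Run the shift/reduce loop; return True on Accept, False on error."""
    stack = [0]                      # stack of states
    tokens = tokens + ['$']          # end-of-input marker
    pos = 0
    while True:
        state, look = stack[-1], tokens[pos]
        act = ACTION.get((state, look))
        if act is None:
            return False             # no action for this pair: syntax error
        kind, arg = act
        if kind == 'accept':
            return True
        if kind == 'shift':          # push the look-ahead's state, advance
            stack.append(arg)
            pos += 1
        else:                        # reduce: pop body, push goto of head
            head, body_len = PRODUCTIONS[arg]
            del stack[-body_len:]
            stack.append(GOTO.get((stack[-1], head)))

print(parse(['n', '+', 'n']))
```

Running this on the input `n + n` shifts `n`, reduces it to `E`, shifts
`+` and `n`, reduces `E + n` to `E`, and accepts.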
All we need now are those tables. The production table is easily
extracted from whatever form the grammar is presented in. The action
and go-to tables take a bit of work. First we will need a few tools.
Firstly we need a way to store a set of small numbers. This will be used
both to hold a set of tokens, and a set of 'items', which will be
described later. Items are actually pairs of small numbers. These
sets need to support a test for membership, insertion, union, iteration,
and an equality test.
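A minimal sketch of such a set in Python, using an integer as a bitmask
(the name `SmallSet` is invented here; the real implementation presumably
uses bit arrays in its own language):

```python
class SmallSet:
    """Set of small non-negative integers stored as a bitmask."""
    def __init__(self, items=()):
        self.bits = 0
        for i in items:
            self.bits |= 1 << i

    def add(self, i):
        self.bits |= 1 << i

    def __contains__(self, i):          # membership test
        return (self.bits >> i) & 1 == 1

    def union(self, other):
        """Union 'other' into self; return True if self changed."""
        old = self.bits
        self.bits |= other.bits
        return self.bits != old

    def __iter__(self):                 # iterate members, smallest first
        b, i = self.bits, 0
        while b:
            if b & 1:
                yield i
            b >>= 1
            i += 1

    def __eq__(self, other):
        return self.bits == other.bits
```

Having `union` report whether anything changed is convenient later, when
the dynamic-programming passes need to loop until no set changes.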
Next we need a search table that these sets can be stored in. The
sets of items need to be indexed so we can check if a particular
itemset already exists.
Finally we will need some small lookup tables from token numbers to
small numbers. These will usually be fairly sparsely populated.
Given these we can start to find the states and create the goto table,
and it is here that we need to understand what an item is.
An 'item' is a production plus a count of the body tokens that have
already been shifted. So "5:0" refers to production 5 with no tokens
shifted yet. If production 5 had 4 body tokens, then "5:0", "5:1",
"5:2", "5:3", and "5:4" would all be valid items.
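Since an item is a pair of small numbers, it can be packed into a single
small number so that the same set machinery works for items as for tokens.
A sketch (the factor of 16 is an assumed upper bound on body length, not
from the source):

```python
MAX_DOT = 16          # assumed upper bound on tokens shifted per production

def item(prod, dot):
    """Pack production number and shifted-token count into one integer."""
    return prod * MAX_DOT + dot

def item_prod(it):
    return it // MAX_DOT

def item_dot(it):
    return it % MAX_DOT
```

So the item "5:3" becomes the single number 83, which can sit in the same
kind of small-number set as a token.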
"5:0" is a state where we are expecting the body of production 5.
If the second body token in production 3 were the head of production 5,
then "3:1" would also be a state where we are expecting the body of
production 5. So "3:1" and "5:0" are really the same state.
A "state" then is a collection of "items". Some of the items will
have a non-zero token count, and some will have a zero token count.
We need a data structure that contains a set of "items" - much like a
set of tokens, so probably the same data structure can be used. We
need to see if we already have a particular state, so they will need
to be stored in some sort of lookup table with the ability to test two
states to see if they are equal. It is best if the lookup only
compares the non-zero items, as they imply the zero items, but
computing the zero items for a new set is non-trivial. If we can avoid that
step when the state is already known, that is a useful optimisation.
We build the set of states effectively by recursing down the
production tree. We start with a state containing the item "0:0".
This is assigned the next free state number, which is "0".
For each state we need to produce all the states that it can reach.
These are the states which we reach after shifting a token, possibly via
reduction, so there could be one for each token. We will need a set
of tokens we have dealt with already.
First we need to "complete" the itemset. This involves adding all the
":0" items that are relevant.
a/ Initialise a set of possible 'next' tokens.
b/ For each item in the state, consider the 'next' token in the
   body, if there is one.
c/ If this token is already in our 'next' set, move on to the next item
   at 'b'; otherwise add it to the 'next' set.
d/ For every production from this token, add an item with a token
   count of zero, unless it is already present.
e/ We've modified the itemset we are iterating, so start again at
   the beginning at 'b'.
Now that we have a complete itemset and a set of possible 'next'
tokens, we can look for states to go to.
For each token in the 'next' set we potentially create a new itemset.
To do that we look at each item in our completed set and see if the
production's next token is the current 'next' token. If it is, we add
the item to the new itemset with the token count incremented.
Once we have this new itemset we check to see if it is already in the
search table. If not, we:

- add it to the table
- add it to the list of states still to process
- add 'current_state + current_token -> new_state' to the goto
  table.
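The completion step and the goto construction just described can be
sketched together in Python. The toy grammar and the representation
(productions as head/body pairs, items as (production, count) pairs,
itemsets as frozensets) are assumptions for illustration, not the real
code:

```python
# Grammar as a list of (head, body) pairs; production 0 is the start.
GRAMMAR = [
    ('Start', ['E']),
    ('E', ['E', '+', 'n']),
    ('E', ['n']),
]
NONTERMINALS = {head for head, _ in GRAMMAR}

def complete(items):
    """Add every ':0' item for each non-terminal expected next."""
    items = set(items)
    done = set()                       # 'next' tokens already handled
    changed = True
    while changed:                     # restart whenever the set grows
        changed = False
        for prod, dot in list(items):
            head, body = GRAMMAR[prod]
            if dot < len(body) and body[dot] not in done:
                tok = body[dot]
                done.add(tok)
                if tok in NONTERMINALS:
                    for p, (h, _) in enumerate(GRAMMAR):
                        if h == tok and (p, 0) not in items:
                            items.add((p, 0))
                            changed = True
    return frozenset(items)

def build_states():
    """Build all itemsets and the goto table with a to-do list."""
    states = [complete({(0, 0)})]      # state 0 holds the item "0:0"
    goto = {}
    todo = [0]
    while todo:
        s = todo.pop()
        # Collect the possible 'next' tokens for this state.
        nexts = {GRAMMAR[p][1][d] for p, d in states[s]
                 if d < len(GRAMMAR[p][1])}
        for tok in sorted(nexts):
            # Advance every item whose next token is 'tok'.
            core = {(p, d + 1) for p, d in states[s]
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == tok}
            new = complete(core)
            if new not in states:      # not in the search table yet
                states.append(new)
                todo.append(len(states) - 1)
            goto[(s, tok)] = states.index(new)
    return states, goto
```

For this grammar the loop settles on five states; the goto entry for
state 0 and token 'E' leads to the state containing "0:1" and "1:1".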
Once we have completed and created goto tables for every state, we are
ready to move on to creating the action table.
Part of this is fairly straightforward, part is somewhat laborious
but not particularly difficult. We'll do the straightforward bit
first.
For each state we need to create a lookup table to map from a terminal
to an action. So we iterate through the states, allocate a lookup
table, and then iterate through the items in that state.
If we find an item where the next token is a terminal, then we add a
'SHIFT' action for that terminal in the current state.
If we find an item where there is no next token - the token count
matches the body count for the production - then we want to add a
REDUCE action for that production. But which look-ahead tokens should we
add it for?
In many cases it won't matter. There is usually only one reducible
production, so if we arrange to reduce it whenever there is no
SHIFTable token, the parse will proceed correctly. But sometimes it
isn't that easy. Consider the rather contrived grammar:
   A -> B c
      | C d
   B -> x y
   C -> x y

If we see 'x' and then 'y', it would seem valid to reduce either of
the last two productions. Which we choose depends on the next
terminal. If it is 'c' we want to reduce to 'B'; if 'd' we want to
reduce to 'C'. To make that decision we need some idea of what the
'next' token might be in different circumstances.
This is the somewhat laborious part, but thankfully we have a
computer. We need to compute 3 things for every non-terminal.
1/ Whether it can produce an empty string or not. If there is a
   production

      A ->

   then clearly A can produce the empty string. Given that, if there
   is another production

      B -> A A

   then B can also produce the empty string. Rather than recursively
   walking from each non-terminal, we will use a dynamic programming
   approach where we repeatedly examine all productions, flagging those
   non-terminals which can produce 'empty', until we don't find any
   more non-terminals to flag.
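That dynamic-programming pass might look like this (toy productions
assumed; the real code would work over token numbers and a flag array):

```python
# Toy productions: A can derive empty directly, B only via A.
GRAMMAR = [
    ('A', []),
    ('B', ['A', 'A']),
    ('C', ['x', 'B']),
]

def can_be_empty():
    """Repeatedly flag heads whose whole body is already nullable."""
    empty = set()
    changed = True
    while changed:                     # keep scanning until a quiet pass
        changed = False
        for head, body in GRAMMAR:
            if head not in empty and all(t in empty for t in body):
                empty.add(head)
                changed = True
    return empty
```

Here 'C' never gets flagged, because its body contains the terminal 'x',
which can never be in the 'empty' set.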
2/ The set of terminals which can come 'first' in the expansion of
   each non-terminal. This clearly includes the first token from each
   production of that non-terminal if the token is a terminal, but also
   all the 'first' terminals for the first token of each production
   when that is a non-terminal. Further, if some of the first tokens
   in a production can produce the empty string, then the 'first'
   terminals of later tokens may also be needed.
   Again we will use a dynamic programming approach to build up these
   sets until they stop changing.
3/ The set of terminals that can 'follow' a given non-terminal. This
   is what we are really after. Given this, we will know which
   look-ahead terminal should trigger which reduction.
To correctly set the 'empty' flags we first clear them all (assume
nothing can produce 'empty'), and then we walk the list of productions;
if we find one where the body is empty, or only contains symbols
which are non-terminals that can produce empty, then we flag the head
non-terminal as "can produce empty".
If while doing this we set the flag on any non-terminal, we need to
loop back to the start and check every production again, as some
more productions might now be able to produce "empty". When we have
made a full pass with no changes we can move on.
The "first" set for each terminal is trivially the set which
contains just that terminal. We might choose to create all of those
for simplicity, or we might choose to handle terminals specially
when creating unions later in this process.
For each non-terminal we start out assuming the set of 'first'
terminals is the empty set. Then for every production we examine
all the tokens in the body up to and including (but not beyond) the
first token which cannot produce 'empty'. For each of these we union
the 'first' terminals for the token into the set of 'first' terminals
for the head of the production.
Once we have examined every production, if any 'first' set was
modified we loop back to the start and examine every production
again. We keep going until a full pass has not added any terminals
to any set.
Setting the 'empty' flag and computing 'first' can be combined.
This should reduce the total number of passes by 1, as there will
only be a single pass that does nothing. It might reduce the number
of passes a little more as more happens in each pass.
Now we get to use the "first" sets to produce "follow" sets.
We initially assume each 'follow' set is empty, and repeatedly scan
the production list adding things to the 'follow' sets until one
scan has made no changes (once again, a dynamic programming approach).
For each production, and for each token in that production, we add to
the token's 'follow' set the entire 'first' set of the next token,
and of every subsequent token until one is found that does not produce
'empty'. As this part of calculating 'follow' is based only on
'first', which does not change any more, we do not need to repeat it
until stable. Only the next step requires that.
For the last token in each production, and for every previous token
until one is found that does not produce 'empty', the 'follow' set
of the head token is added to the 'follow' set of the body token.
Once these sets are stable, we will know every terminal which can
possibly come after each non-terminal.
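The 'first' and 'follow' passes can be sketched together. The toy
grammar below is an assumption, chosen so that 'A' can produce 'empty'
(which exercises the "keep going past nullable tokens" rule); the real
implementation would work over token numbers and bit sets:

```python
GRAMMAR = [
    ('S', ['A', 'B', 'c']),
    ('A', ['a']),
    ('A', []),
    ('B', ['b']),
]
NONTERMINALS = {h for h, _ in GRAMMAR}

def nullable():
    """Dynamic-programming pass for the 'can produce empty' flags."""
    empty, changed = set(), True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            if head not in empty and all(t in empty for t in body):
                empty.add(head)
                changed = True
    return empty

def first_sets(empty):
    """FIRST of a terminal is itself; non-terminals start out empty."""
    first = {h: set() for h in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            for tok in body:
                f = first[tok] if tok in NONTERMINALS else {tok}
                if not f <= first[head]:
                    first[head] |= f
                    changed = True
                if tok not in empty:
                    break              # stop at first non-nullable token
    return first

def follow_sets(empty, first):
    follow = {h: set() for h in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            for i, tok in enumerate(body):
                if tok not in NONTERMINALS:
                    continue
                # FIRST of each later token, up to the first non-nullable.
                f = set()
                tail_nullable = True
                for nxt in body[i + 1:]:
                    f |= first[nxt] if nxt in NONTERMINALS else {nxt}
                    if nxt not in empty:
                        tail_nullable = False
                        break
                if tail_nullable:      # end of body reachable:
                    f |= follow[head]  # add FOLLOW of the head
                if not f <= follow[tok]:
                    follow[tok] |= f
                    changed = True
    return follow
```

For this grammar, FIRST(S) is {a, b} because A is nullable, FOLLOW(A)
is {b}, and FOLLOW(B) is {c}.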
Now that we have 'follow' we can examine each item in each itemset and
decide which SHIFT or REDUCE actions are appropriate for each
look-ahead terminal.
It is still possible that some conflict might occur - two different
productions might suggest actions for the same terminal.
If two productions both suggest 'SHIFT' for the same terminal, then
that isn't a conflict. We just SHIFT and don't worry about which
production we are in.
If one production suggests 'SHIFT' and one suggests 'REDUCE', then
that is technically a conflict, but it is usually safe to resolve it
in favour of the SHIFT. This implies that we take the longest valid
match, which is often what is desired.
An example that can cause this sort of conflict happens with a
grammar containing:
   Statement -> if ( Expression ) Statement
              | if ( Expression ) Statement else Statement
If we have

   if ( Expression ) if ( Expression ) Statement

on the stack, and we see an 'else', we could reduce to

   if ( Expression ) Statement

or we could shift the 'else', giving

   if ( Expression ) if ( Expression ) Statement else
In the 'shift' case the else clause would apply to the most recent
'if'; in the 'reduce' case it would end up applying to the more
distant 'if'. The 'shift' is probably preferred.
If two different productions suggest a 'REDUCE' for the same
look-ahead terminal, then that is a genuine conflict that cannot be
resolved. It means that the grammar is not LALR(1). It might still
be LR(1), but the simplifications of LALR(1) are over-simplifications
in this case. In many cases though, a REDUCE/REDUCE conflict signals
a genuine ambiguity in the grammar.
When we have a Number and we see a '+', do we reduce the Number to an
expression first, or shift the '+'?
------------------------------------------
I read the LALR paper:

http://3e8.org/pub/scheme/doc/parsing/Efficient%20Computation%20of%20LALR(1)%20Look-Ahead%20Sets.pdf

but I couldn't follow it. So I'll figure it out myself.
The simple "FOLLOW" function finds the 'next' terminals for each
non-terminal in any context at all. This can be too broad.
LALR finds a more focused FOLLOW function that is potentially different
in each state.
374 each non-terminal in the grammar.
375 This is 'first' of the next symbol and if that symbol can be empty,
376 then first of the following symbol etc.
377 But what is 'follow' of the non-terminal at the end of a production?
378 It is exactly FOLLOW of the non-terminal at the head of the
379 production. No, that is too simple. We've discarded info about the
Different approach. Calculate the follow sets as we calculate the
itemsets.
When we add an item to an itemset to complete it, we also add a
'follow' set for that item, which is 'first' of the next token (after
the dot), possibly combined with 'first' of the following tokens and possibly
'follow' of the item.
This 'follow' set remains with the item as it gets duplicated into
other itemsets. Once 'dot' gets to the end of the item, the 'follow'
set becomes active for that itemset and that item: i.e. if the
look-ahead is in that follow set, the given item is reduced.
That seems to be a completely general rule that doesn't merge
itemsets... except that it does: it merges the itemsets earlier.
i.e. when we have built an itemset with attached follow sets for each
item, we can only merge it if the follow sets are the same, or don't
conflict.

So: we need 'empty' and 'first'.
An itemset contains a list of:
   production + offset + follow-set
An itemset is consistent if the follow-sets are disjoint and don't
contain any terminal which is 'next' on any item.
So while building an itemset we keep a set of all 'next' terminals and
check for membership before adding.
We only merge two itemsets if merging the follow sets doesn't create
an inconsistent itemset.
Precedence can help avoid inconsistent sets.
Each production can potentially be given a precedence and an
associativity: left or right.
When considering two productions that can use a symbol, if one has a
higher precedence, it gets the symbol.
If they have the same precedence and have different associativity,
or no associativity, there is a conflict.
If they have associativity, and one is a shift and one a reduce, then
left-associative prefers the reduce and right-associative prefers the shift.
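These rules can be captured in a small helper. `resolve` and its
(level, associativity) tuple arguments are names invented for this
sketch; the real code presumably works with per-production precedence
numbers:

```python
def resolve(shift_prec, reduce_prec):
    """Decide a shift/reduce conflict from (level, assoc) pairs.

    Returns 'shift', 'reduce', or 'conflict'. Higher level wins outright;
    at equal levels, associativity decides."""
    s_level, s_assoc = shift_prec
    r_level, r_assoc = reduce_prec
    if s_level != r_level:
        return 'shift' if s_level > r_level else 'reduce'
    if s_assoc == r_assoc == 'left':
        return 'reduce'                # left-associative prefers reduce
    if s_assoc == r_assoc == 'right':
        return 'shift'                 # right-associative prefers shift
    return 'conflict'                  # differing or no associativity
```

For example, with a left-associative '+' on both sides, `a + b + c`
resolves in favour of reducing `a + b` first.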
We really need the 'go to' table, but the 'action' table might be optional.
For LR(0), each state can identify a production to reduce.
If there is no such production, we shift and follow the goto.
If there is no 'goto' state, it is an error.
For a simple LR(1), we try 'go to' first. If that works we shift. If
it doesn't we apply the one reduction. If there is no reduction we
have an error.
Firstly we create the LR(0) states and impose precedence and see if
there are any conflicts. If not, we finish.
Then we create first/follow and see if that resolves all conflicts.
Can we do this just for conflicts? A conflict is two or more
productions in an itemset which can act. Each conflicted state
indicates a single non-terminal. So we only need 'follow' for those.
But to create that we need 'follow' for the head of any 'reduce'
production. So to limit the set of follows needed, we would need to
know all the non-terminals which can be an ancestor of each state.
We can short-circuit the (SLR) FOLLOW step if a conflict is between
two productions with the same head ... is that even possible? Probably.
If creating 'follow' doesn't resolve conflicts, we need to try LALR.
This means, for each conflicted state, finding the 'follow' for the
heads of the items in 'this' state.
This involves following the 'goto' links backwards to all previous
states and asking them ... something.
If this (conflicted) state has a completed production from X, we must
have got here from a state which eventually led to this
state, and has a 'goto' for 'X'.
To find 'follow' for a given item X->a, we scan all states and see if
following the body of the production (a) from each state T leads to the
conflicted state.
If it does, we need to find FOLLOW(X, T) as the look-ahead tokens that
trigger the reduce.
FOLLOW(X,T) is IFOLLOW(X,T) unioned with other stuff.
IFOLLOW(X,T) is the 'first' of whatever comes after X in items in T;
if empty can come after X, we need to union in the follow of this item's
head.
It is all too hard. Let's not bother.

Just a simple LR(0) plus precedence, assuming SHIFT over REDUCE when
both are possible.
So we read a token, and if we cannot goto but can reduce, we reduce.
If we cannot reduce we have an error.
However: I want to know about conflicts, not have them hidden. So
while I don't want to use a separate action table, and don't want the
parser to know which look-ahead tokens are for which reduction,
I want the grammar processor to know.
Each itemset has a 'next' symset as well as 'items' and 'go_to'.
This maps from non-terminal to a magic number.
The magic number is an index into a list of symsets (which might be
shared). These are pure sets of terminals.
These tell us the possible 'next' terminals after any of the
productions from the given non-terminal in this state.
In state zero, there is one symset for 'start' and it contains only 'EOF'.
When we create a new symset, we do so given an old symset and a
token.
As we add each advanced item from the old itemset to the new, we copy
the 'next' symset for the head non-terminal.
As we add new items we calculate the 'next' for the new
non-terminal as the 'first' of the following symbols and, if they all
can produce 'empty', we union with the 'next' for the head of the
production we are basing it on.
A finished itemset is consistent if all the 'next' symsets for
productions that can reduce are disjoint, and are also disjoint with the
set of terminals which can come 'next'.
If it is inconsistent, that is an error.
It can be made consistent by applying precedence rules. This assigns
a number to each production, and a left/right to each number.
We don't record actions for items which have lower precedence than
items we do record actions for.
When we create an itemset we might find that it already exists. If it
does, we merge the new 'next' symbols (if any) into the old. These
changes potentially need to be carried forward, so we flag that
itemset for reprocessing.
Precedence is set in the .gram file by lines that start with a lone '$'.
The first word is one of 'left', 'right', or 'nonassoc', which creates a
precedence level and sets its associativity.
After that come terminals. A second '$' precedes a virtual-terminal
name which can be linked to the level.
At the end of a production, '$$' before the '${' introduces a
virtual-terminal which sets the precedence for this production,
over-riding any defaults.
Precedence setting must come before the first use of a terminal.
It's not working. What exactly do I store again?
In each state we need the 'follow' set for each non-terminal.
As we add non-terminals we calculate it by looking at 'first' of the stuff in
the production we started from, and possibly the 'follow' for the head of
that production.
For old non-terminals they come from the previous state.
So when we 'complete' a set by adding productions, we add new sets.
When we create a new set, we copy what we have.
It works. But can I do better?
- avoid multiple copies of 'next' sets by using a refcount.
- even more by using a lookup table.
Currently I have 756 follow sets with 186 itemsets,
but only 73 unique next sets.
346 calls to 'build' the 186 sets. Some called 4 times, most called twice.
After adding refcounts:
487 follow sets, still 73 unique.
With the itemsets we just compare the core, which is easily found.
With the LA sets we also want to just compare the "core", but that is
harder to identify.
I'd be happy for equality to be "new is a subset of found", but how then
do we impose an ordering?
As it is a sequential search, we could ask that the new is always >= the old.
That might do for a start...
I need an LA set for each item, not for each non-terminal.

In the core, LA is inherited from items in the previous state.
In the completion, LA is calculated from core and completion items, and
*may* be affected by changes to the LA of the core.
However, changes to the LA of the core will propagate to subsequent states,
so we need to do recalculation anyway, so recording that 'may' probably isn't
worth it.

So when we visit an item we always compute the completion so that we
can update the look-ahead sets.