Now that I've implemented an LR parser which handles indents nicely, I want to describe it.

First, the basic LR parser. This is nothing new. There is a set of tokens; a token is primarily just a number. It can have content attached, but that does not affect the parsing - any difference that affects parsing must make for a different token. So each reserved word is a token, each special symbol is too, and 'identifier', 'number', and 'string' are all tokens as well. These are "terminals" - anything produced by lexical analysis aka scanning is a terminal token.

There are also "non-terminal" tokens. These are the "heads" of productions. For each non-terminal there must be at least one production, possibly several. Each production consists of a "head" and a "body". The body is simply a list of tokens, whether terminals or non-terminals:

    If_statement -> if Expression then Statementlist else Statementlist fi

Here I have capitalised the non-terminals. What this means is that if the head token is permitted somewhere, then the body sequence can be used in its stead. Alternately, if the body sequence is found where the head might be expected, then it can be treated as if it *were* the head.

Parsing a text (a sequence of terminals) against a grammar like this involves a sequence of "shift" and "reduce" steps. This happens in the context of a stack: each entry on the stack represents a single token, though it may have other state attached. It also happens in the context of a one-token look-ahead: as well as knowing what is on the stack, we know what the next terminal token is. A "shift" operation pushes the look-ahead terminal onto the stack, which then allows us to see what the next terminal is. A "reduce" operation can happen when the last few tokens on the stack exactly match the body of some production (and when all the earlier tokens on the stack provide a valid preamble to the head of that production). When this happens we pop off the tokens representing the body, and push on the head token. So terminals get onto the stack from 'shift' and non-terminals get onto the stack from 'reduce'. Eventually a reduce step will push the special 'start' non-terminal onto the stack. When that happens, we have finished.

So the first interesting question is: how do we know when to shift and when to reduce - and which production to reduce? This is the fun bit.

First we need to introduce the idea of a 'state'. A state is effectively some position in some production. So one state would represent the position in the above production where 'then' has just been shifted: in this state we are expecting a Statementlist. Another state might be after the Statementlist, when we are expecting 'else'. In any state, we know there are a bunch of tokens on the stack, and there is something we are expecting. Every entry on the stack already mentioned has a 'state' along with the token - the state after that token was shifted or reduced. Based on the state of the top element of the stack, we know what to expect. Based on the state and the look-ahead terminal, we know what to do.

To encode this knowledge we need a few different data structures.

1/ An action table. This maps from a state plus a look-ahead terminal to an action. The action is one of "Shift", "Accept" (which means the Start symbol has been reduced), or "Reduce-N", where N is the number of the production to be reduced.

2/ A go_to table, or state-transition table. This maps a state plus a token to a new state. Of the two states mentioned above, the 'go_to' entry for the first state combined with the token "Statementlist" would be the second state.

3/ A production table. For each production we need to know how many tokens to pop, which token to push, and maybe what action to perform when that happens. Often this will extract information from the popped tokens and store it in the pushed token, to build up an abstract syntax tree.

In simple terms the action of parsing then becomes:

1- Set the current state to '0'.
2- Look up the current state together with the look-ahead token in the action table.
2a- If "Accept": stop.
2b- If "Shift": push the current state and the look-ahead token onto the stack.
2c- If "Reduce": find the production, perform its action, and pop the appropriate number of tokens, keeping the state from the first popped token as the current state. Then push the head token and the current state onto the stack.
3- Look up the current state and the top-of-stack token in the go-to table, and set the current state to the value found.
4- Go to 2.
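That loop is short enough to sketch in code. This is only an illustration, not the real implementation: the table encodings (plain dictionaries keyed by state and token) are invented for the purpose.

    # A sketch of the parse loop above. Assumed encodings:
    #   action[state, terminal] -> "shift", "accept", or ("reduce", production-number)
    #   go_to[state, token]     -> state
    #   productions[n]          -> (head-token, body-length)
    def parse(terminals, action, go_to, productions):
        stack = []                                  # entries are (state, token) pairs
        state = 0                                   # step 1
        terminals = iter(terminals)
        lookahead = next(terminals)
        while True:
            act = action[state, lookahead]          # step 2
            if act == "accept":                     # step 2a
                return stack
            if act == "shift":                      # step 2b
                stack.append((state, lookahead))
                lookahead = next(terminals)
            else:                                   # step 2c: act is ("reduce", n)
                _, n = act
                head, body_len = productions[n]
                if body_len:
                    state = stack[-body_len][0]     # state from the first popped token
                    del stack[-body_len:]
                stack.append((state, head))
            state = go_to[state, stack[-1][1]]      # step 3

Note that in this formulation a shift also goes through the go-to lookup in step 3, so go_to holds entries for terminals as well as non-terminals.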
All we need now are those tables. The production table is easily extracted from whatever form the grammar is presented in. The action and go-to tables take a bit of work.

First we will need a few tools. Firstly, we need to be able to store a set of small numbers. This will be used both to hold a set of tokens and to hold a set of 'items', which will be described later (items are actually pairs of small numbers). These sets need to support a test for membership, insertion, union, iteration, and an equality test. Next we need a search table that these sets can be stored in: the sets of items need to be indexed so we can check whether a particular itemset already exists. Finally we will need some small lookup tables from token numbers to small numbers. These will usually be fairly sparsely populated.

Given these we can start to find the states and create the goto table, and it is here that we need to understand what an item is. An 'item' is a production plus a count of the body tokens that have already been shifted. So "5:0" refers to production 5 with no tokens shifted yet. If production 5 had 4 body tokens, then "5:0", "5:1", "5:2", "5:3", and "5:4" would all be valid items. "5:0" is a state where we are expecting the body of production 5. If the second body token in production 3 were the head of production 5:

    3: A -> X Y Z
    5: Y -> B1 B2 B3 B4

then "3:1" would also be a state where we are expecting the body of production 5. So "3:1" and "5:0" are really the same state.

A "state", then, is a collection of "items". Some of the items will have a non-zero token count, and some will have a zero token count. We need a data structure that contains a set of "items" - much like a set of tokens, so probably the same data structure can be used. We need to be able to see if we already have a particular state, so states will need to be stored in some sort of lookup table with the ability to test two states for equality. It is best if the lookup only compares the non-zero items, as they imply the zero items - but computing the zero items for a new set is non-trivial, so if we can avoid that step when the state is already known, that is a useful optimisation.

We build the set of states effectively by recursing down the production tree. We start with a state containing just the item "0:0"; this is assigned the next free state number, which is "0". For each state we need to produce all the states that it can reach. These are the states we reach after shifting a token, possibly via reduction, so there could be one for each token; we will need a set of the tokens we have dealt with already.

First we need to "complete" the itemset. This involves adding all the ":0" items that are relevant:

a/ Initialise a set of possible 'next' tokens.
b/ For each item in the state, consider the 'next' token in the body, if there is one.
c/ If this token is already in our 'next' set, move on to the next item at b/.
d/ For every production from this token, add an item with a token count of zero.
e/ We've modified the itemset we are iterating over, so start again at the beginning at b/.

Now that we have a complete itemset and a set of possible 'next' tokens, we can look for states to go to. For each token in the 'next' set we potentially create a new itemset. To do that we look at each item in our completed set and see if the production's next token is the current 'next' token. If it is, we add the item to the new itemset with the token count incremented. Once we have this new itemset we check to see if it is already in the search table. If not, we:

- assign a number
- add it to the table
- add it to the list of states still to process
- add 'current_state + current_token -> new_state' to the goto table.
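Here is a rough sketch of those two steps - completion, and building the core of a goto state. The encoding (productions as (head, body) pairs; by_head mapping a non-terminal to its production numbers; items as (production, count) pairs) is again mine, for illustration only.

    def complete(items, productions, by_head):
        """Add all the relevant 'p:0' items; also collect the possible 'next' tokens."""
        items = set(items)
        nxt = set()                                  # a/ the possible 'next' tokens
        changed = True
        while changed:                               # e/ rescan whenever items were added
            changed = False
            for p, count in list(items):
                head, body = productions[p]
                if count == len(body):
                    continue                         # b/ no 'next' token in this item
                token = body[count]
                if token in nxt:
                    continue                         # c/ already dealt with this token
                nxt.add(token)
                for p2 in by_head.get(token, ()):    # d/ terminals have no productions
                    if (p2, 0) not in items:
                        items.add((p2, 0))
                        changed = True
        return items, nxt

    def goto_core(items, productions, token):
        """The items of the state reached by shifting 'token': counts incremented."""
        return frozenset((p, c + 1) for p, c in items
                         if c < len(productions[p][1]) and productions[p][1][c] == token)

The core returned by goto_core is what gets looked up in the search table; only if it is absent do we number it, queue it for processing, and record the goto entry.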
Once we have completed and created goto tables for every state, we are ready to move on to creating the action table. Part of this is fairly straightforward; part is somewhat laborious but not particularly difficult. We'll do the straightforward bit first.

For each state we need to create a lookup table which maps from a terminal to an action. So we iterate through the states, allocate a lookup table, and then iterate through the items in that state. If we find an item where the next token is a terminal, then we add a 'SHIFT' action for that terminal in the current state. If we find an item where there is no next token - the token count matches the body count for the production - then we want to add a REDUCE action for that production. But which look-ahead tokens should we add it for? In many cases it won't matter: there is usually only one reducible production, so if we arrange to reduce it whenever there is no SHIFTable token, the parse will proceed correctly. But sometimes it isn't that easy. Consider the rather contrived grammar:

    A -> B c | C d
    B -> x y
    C -> x y

If we see 'x' and then 'y', it would seem valid to reduce either of the last two productions. Which we choose depends on the next terminal: if it is 'c' we want to reduce to B, if 'd' we want to reduce to C. To make that decision we need some idea of what the 'next' token might be in different circumstances. This is the somewhat laborious part, but thankfully we have a computer. We need to compute three things for every non-terminal.

1/ Whether it can produce an empty string or not. If there is a production:

    A ->

then clearly A can produce the empty string. Given that, if there is another production:

    B -> A A

then B can also produce the empty string. Rather than recursively walking from each non-terminal, we will use a dynamic programming approach where we repeatedly examine all productions, flagging those non-terminals which can produce 'empty', until we don't find any more non-terminals to flag.

2/ The set of terminals which can come 'first' in the expansion of each non-terminal. This clearly includes the first token of each production of that non-terminal when that token is a terminal, but also all the 'first' terminals for the first token of each production when that is a non-terminal. Further, if some of the first tokens in a production can produce the empty string, then the 'first' terminals of later tokens may also be needed. Again we will use a dynamic programming approach to build up these lists until they stop changing.

3/ The set of terminals that can 'follow' a given non-terminal. This is what we are really after. Given this, we will know which look-ahead terminal should trigger which reduction.

Empty:

To correctly set the 'empty' flags we first clear them all (assume nothing can produce 'empty') and then walk the list of productions. If we find one where the body is empty, or only contains symbols which are non-terminals that can produce 'empty', then we flag the head non-terminal as "can produce empty". If while doing this we set the flag on any non-terminal, we need to loop back to the start and check every production again, as some more productions might now be able to produce "empty". When we have made a full pass with no changes we can move on.
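As a sketch, with the grammar simply a list of (head, body) pairs - my encoding, not the real one:

    def find_empty(productions):
        empty = set()                        # assume nothing can produce 'empty'
        changed = True
        while changed:                       # keep making full passes until no new flags
            changed = False
            for head, body in productions:
                # an empty body, or a body of all-'empty' non-terminals, flags the head
                if head not in empty and all(sym in empty for sym in body):
                    empty.add(head)
                    changed = True
        return empty

    # find_empty([("A", []), ("B", ["A", "A"])]) flags both A and B.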
First:

The "first" set for each terminal is trivially the set which contains just that terminal. We might choose to create all of those for simplicity, or we might choose to handle terminals specially when creating unions later in this process. For each non-terminal we start out assuming the set of 'first' terminals is the empty set. Then for every production we examine all the tokens in the body, up to and including (but not beyond) the first token which cannot produce 'empty'. For each of these we union the 'first' terminals for the token into the set of 'first' terminals for the head of the production. Once we have examined every production, if any 'first' set was modified we loop back to the start and examine every production again. We keep going until a full pass has not added any terminals to any set.

Setting the 'empty' flag and computing 'first' can be combined. This should reduce the total number of passes by one, as there will only be a single pass that does nothing. It might reduce the number of passes a little more, as more happens in each pass.

Follow:

Now we get to use the "first" sets to produce "follow" sets. We initially assume each 'follow' set is empty, and repeatedly scan the production list adding things to the 'follow' sets until one scan has made no changes (once again, a dynamic programming approach).

For each production, and for each token in that production, we add to that token's 'follow' set the entire 'first' set of the next token, and of every subsequent token until one is found that cannot produce 'empty'. As this part of calculating 'follow' is based only on 'first', which no longer changes, we do not need to repeat it until stable. Only the next step requires that: for the last token in each production, and for every previous token until one is found that cannot produce 'empty', the 'follow' set of the head token is added to the 'follow' set of the body token. Once these sets are stable, we will know every terminal which can possibly come after each non-terminal.
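Both passes, sketched in the same style as before; 'empty' is the set from the previous sketch, and terminals are handled specially (each is its own 'first' set) rather than being given sets of their own:

    def find_first(productions, empty):
        first = {head: set() for head, _ in productions}
        def first_of(sym):
            return first[sym] if sym in first else {sym}    # a terminal is its own 'first'
        changed = True
        while changed:
            changed = False
            for head, body in productions:
                for sym in body:                 # up to and including the first token
                    if not first_of(sym) <= first[head]:
                        first[head] |= first_of(sym)
                        changed = True
                    if sym not in empty:         # ... which cannot produce 'empty'
                        break
        return first

    def find_follow(productions, empty, first):
        follow = {head: set() for head, _ in productions}
        def first_of(sym):
            return first[sym] if sym in first else {sym}
        # The first part depends only on 'first', so a single pass is enough.
        for head, body in productions:
            for i, sym in enumerate(body):
                if sym in follow:                # only non-terminals need 'follow'
                    for nxt in body[i + 1:]:     # 'first' of each later token ...
                        follow[sym] |= first_of(nxt)
                        if nxt not in empty:     # ... until one cannot be 'empty'
                            break
        # The second part feeds 'follow' of the head to trailing tokens; iterate.
        changed = True
        while changed:
            changed = False
            for head, body in productions:
                for sym in reversed(body):
                    if sym in follow and not follow[head] <= follow[sym]:
                        follow[sym] |= follow[head]
                        changed = True
                    if sym not in empty:
                        break
        return follow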
Now that we have 'follow' we can examine each item in each itemset and decide which SHIFT or REDUCE actions are appropriate for each look-ahead terminal. It is still possible that some conflict might occur - two different productions might suggest actions for the same terminal. If two productions both suggest 'SHIFT' for the same terminal, that isn't a conflict: we just SHIFT and don't worry which production it belongs to. If one production suggests 'SHIFT' and one suggests 'REDUCE', then that is technically a conflict, but it is usually safe to resolve it in favour of the SHIFT. This implies that we take the longest valid match, which is often what is desired.

An example that can cause this sort of conflict happens with a grammar containing:

    Statement -> if ( Expression ) Statement
               | if ( Expression ) Statement else Statement

If we have

    if ( Expression ) if ( Expression ) Statement

on the stack, and we see an 'else', we could reduce to

    if ( Expression ) Statement

or shift to

    if ( Expression ) if ( Expression ) Statement else

In the 'shift' case the else clause would apply to the most recent 'if'; in the 'reduce' case it would end up applying to the more distant 'if'. The 'shift' is probably preferred.

If two different productions suggest a 'REDUCE' for the same look-ahead terminal, then that is a genuine conflict that cannot be resolved. It means that the grammar is not LALR(1). It might still be LR(1), but the simplifications of LALR(1) are an over-simplification in this case. In many cases though, a REDUCE/REDUCE conflict signals a genuine ambiguity in the grammar:

    Expr -> Expr + Expr | Val + Expr | Number
    Val -> Number

When we have a Number and we see a '+', do we reduce the Number to a Val or to an Expr?

------------------------------------------

I read the LALR paper: http://3e8.org/pub/scheme/doc/parsing/Efficient%20Computation%20of%20LALR(1)%20Look-Ahead%20Sets.pdf

It doesn't help, so I'll figure it out myself.

The simple "FOLLOW" function finds the 'next' terminals for each non-terminal in any context at all. This can be too broad. LALR finds a more focused FOLLOW function that is potentially different for each state. For a start we can consider 'follow' for each specific instance of each non-terminal in the grammar. This is 'first' of the next symbol and, if that symbol can be empty, 'first' of the following symbol, and so on. But what is 'follow' of a non-terminal at the end of a production? It is exactly FOLLOW of the non-terminal at the head of the production. No, that is too simple: we've discarded information about the state we are in.

Different approach: calculate the follow sets as we calculate the item sets. When we add an item to an itemset to complete it, we also add a 'follow' set for that item, which is 'first' of the next token (after the dot), possibly combined with 'first' of the following tokens, and possibly with the 'follow' of the item. This 'follow' set remains with the item as it gets duplicated into other itemsets. Once the dot gets to the end of the item, the 'follow' set becomes active for that itemset and that item: if the look-ahead is in that follow set, the given item is reduced.

That seems to be a completely general rule that doesn't merge anything... Yes it does: it merged the itemsets earlier. That is, once we have built an itemset with attached follow sets for each item, we can only merge it with another if the follow sets are the same, or don't conflict.

So: we need 'empty' and 'first'. An itemset contains a list of production + offset + follow-set. An itemset is consistent if the follow-sets are disjoint and don't contain any terminal which is 'next' on an item. So while building an itemset we keep a set of all 'next' terminals and check for membership before adding. We only merge two itemsets if merging the follow sets doesn't create an inconsistent itemset.

Precedence can help avoid inconsistent sets. Each production can potentially be given a precedence and an associativity: left or right. When considering two productions that can use a symbol, if one has a higher precedence, it gets the symbol. If they have the same precedence and have different associativity, or no associativity, there is a conflict. If they have associativity and one is a shift and one a reduce, then left-associativity prefers the reduce and right-associativity prefers the shift.
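As a sketch of that resolution rule, assuming precedence is encoded as a level number per terminal or production, with an associativity per level (my encoding, in the same illustrative style):

    def resolve(shift_terminal, reduce_production, level, assoc):
        """Decide a shift/reduce clash: 'shift', 'reduce', or a genuine 'conflict'."""
        if shift_terminal not in level or reduce_production not in level:
            return "conflict"                        # no precedence info to go on
        s, r = level[shift_terminal], level[reduce_production]
        if s != r:
            return "shift" if s > r else "reduce"    # higher precedence gets the symbol
        if assoc.get(r) == "left":
            return "reduce"                          # left-associative prefers the reduce
        if assoc.get(r) == "right":
            return "shift"                           # right-associative prefers the shift
        return "conflict"                            # same level, no associativity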
We really need the 'go to' table, but the 'action' table might be optional. For LR(0), each state can identify a production to reduce; if there is no such production, we shift and follow the goto, and if there is no 'goto' state it is an error. For a simple LR(1), we try 'go to' first. If that works, we shift. If it doesn't, we apply the one reduction. If there is no reduction, we error.

So: firstly we create the LR(0) states, impose precedence, and see if there are any conflicts. If not, we finish. Then we create first/follow and see if that resolves all conflicts. If yes, we finish.

Can we do this just for the conflicts? A conflict is two or more productions in an itemset which can act. Each conflicted state indicates a single non-terminal, so we only need 'follow' for those. But to create that we need 'follow' for the head of any 'reduce' production. So to limit the set of follows needed, we would need to know all the non-terminals which can be an ancestor of each state. We could short-circuit the (SLR) FOLLOW step if a conflict is between two productions with the same head... is that even possible? Probably. But unlikely?

If creating 'follow' doesn't resolve the conflicts, we need to try LALR. This means, for each conflicted state, finding the 'follow' for the heads of the items in this state. This involves following the 'goto' links backwards to all previous states and asking them... something. If this (conflicted) state has a completed production from X, we must have got here from a state which eventually leads to this state and has a 'goto' for X. To find 'follow' for a given item X->a, we scan all states and see if following the body of the production (a) from each state T leads to the current state (S). If it does, we need FOLLOW(X, T) as the look-ahead tokens which choose this item. FOLLOW(X,T) is IFOLLOW(X,T) unioned with other stuff. IFOLLOW(X,T) is the 'first' of whatever comes after X in items in T; if 'empty' can come after X, we need to union in the follow of those items too.

It is all too hard. Let's not bother. Just a simple LR(0) plus precedence, assuming SHIFT over REDUCE if there is no precedence info. So we read a token; if we cannot goto but we can reduce, we reduce; if we cannot reduce either, we have an error.

However: I want to know about conflicts, not have them hidden. So while I don't want to use a separate action table, and don't want the parser to know which look-ahead tokens are for which reduction, I do want the grammar processor to know. So: each itemset has a 'next' symset as well as 'items' and 'go_to'. This maps from a non-terminal to a magic number. The magic number is an index into a list of symsets (which might be shared). These are pure sets of terminals. They tell us the possible 'next' terminals after any of the productions from the given non-terminal in this state. In state zero, there is one symset, for 'start', and it contains only 'EOF'. When we create a new itemset, we do so given an old itemset and a non-terminal. As we add each advanced item from the old itemset to the new, we copy the 'next' symset for the head non-terminal. As we add new items to complete the set, we calculate the 'next' for each new non-terminal as the 'first' of the following symbols and, if they can all produce 'empty', we union in the 'next' for the head of the production we are basing it on.
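A sketch of how those 'next' sets can be carried along during completion, extending the earlier complete() sketch. Here 'la' maps each non-terminal in this state to the terminals that can follow it here; 'first' and 'empty' come from the passes above; in state zero 'la' would start as {'start': {'EOF'}}. The sharing of sets through a table of symsets is not modelled.

    def complete_with_next(items, la, productions, by_head, first, empty):
        def first_of(sym):
            return first[sym] if sym in first else {sym}
        changed = True
        while changed:
            changed = False
            for p, count in list(items):
                head, body = productions[p]
                if count == len(body) or body[count] not in by_head:
                    continue                     # nothing to complete for this item
                nt = body[count]
                # 'next' for nt: 'first' of what follows it in this item, plus the
                # 'next' of the item's own head if everything following can be 'empty'.
                want, rest_empty = set(), True
                for sym in body[count + 1:]:
                    want |= first_of(sym)
                    if sym not in empty:
                        rest_empty = False
                        break
                if rest_empty:
                    want |= la.get(head, set())
                if not want <= la.setdefault(nt, set()):
                    la[nt] |= want
                    changed = True
                for p2 in by_head[nt]:           # the usual ':0' completion items
                    if (p2, 0) not in items:
                        items.add((p2, 0))
                        changed = True
        return items, la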
A finished itemset is consistent if the 'next' symsets for all the productions that can reduce are disjoint, and are also disjoint with the set of terminals which can come 'next'. If it is inconsistent, that is an error. It can be made consistent by applying precedence rules. These assign a number to each production, and a left/right to each number. We don't record actions for items which have lower precedence than items we do record actions for.

When we create an itemset we might find that it already exists. If it does, we merge the new 'next' symbols (if any) into the old. These changes potentially need to be carried forward, so we flag that itemset for review.

Precedence is set in the .gram file by lines that start with a lone '$'. The first word is one of 'left', 'right' or 'nonassoc', which creates a precedence level and sets its associativity. After that come terminals. A second '$' precedes a virtual-terminal name which can be linked to the level. At the end of a production, '$$' before the '${' introduces a virtual terminal which sets the precedence for that production, over-riding any defaults. Precedence setting must come before the first use of a terminal.

-------

It's not working. What exactly do I store, again?

In each state we need the 'follow' set for each non-terminal. As we add non-terminals, we calculate this by looking at 'first' of the remaining tokens in the production we started from, and possibly the 'follow' for the head of that production. For old non-terminals, it comes from the previous state. So when we 'complete' a set by adding productions, we add new sets; when we create a new state, we copy what we have.

It works. But can I do better?

- avoid multiple copies of 'next' sets by using a refcount
- save even more by using a lookup table

Currently I have 756 follow sets for 186 itemsets, but only 73 unique 'next' sets, and 346 calls to 'build' the 186 sets - some called 4 times, most called twice. After adding the refcount: 487 follow sets, still 73 unique, and 345 calls to 'build'.

-----

My LR code is wrong. With the itemsets we just compare the core, which is easily found. With the LA sets we also want to compare just the "core" - but what is that? I'd be happy for equality to be "new is a subset of found", but how then do we impose an ordering? As it is a sequential search, we could ask that the new is always >= the old. That might do for a start...

No, all wrong. I need an LA set for each item, not for each non-terminal. In the core, the LA is inherited from items in the previous state. In the completion, the LA is calculated from the core and completion items, and *may* be affected by changes to the LA of the core. However changes to the LA of the core will propagate to subsequent states, so we need to do the recalculation anyway; recording that 'may' probably isn't helpful. So when we visit an item we always compute the completion, so that we can update the look-ahead sets.
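For what it's worth, here is roughly where those notes end up, in the same sketch style. The data layout and the recompute_completion helper are assumptions of mine, not the real code: goto_edges[s] lists (token, successor) pairs, la[s] maps each item in state s to its LA set, and items are (production, count) pairs as before.

    from collections import deque

    def propagate_la(productions, goto_edges, la, recompute_completion):
        """Push look-ahead changes around the state machine until stable."""
        pending = deque(goto_edges)                 # visit every state at least once
        while pending:
            s = pending.popleft()
            recompute_completion(s)                 # completion LAs depend on core LAs
            for token, s2 in goto_edges[s]:
                for (p, count), las in list(la[s].items()):
                    head, body = productions[p]
                    if count < len(body) and body[count] == token:
                        core = (p, count + 1)       # this item's image in the successor
                        if not las <= la[s2].setdefault(core, set()):
                            la[s2][core] |= las     # core LA inherited from predecessor
                            if s2 not in pending:
                                pending.append(s2)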