Apr 2013 What do I remember about how parsing works? And why don't I just use yacc? A: because I didn't invent that :-) We have productions from a non-terminal to a list of terminals and non-terminals. We have an input stream of terminals. Parsing involves a stack of symbols and two operations: shift and reduce. "shift" take the next terminal from input and pushes it onto the stack. "reduce" pops a sequence of symbol off the stack which match the right hand side of a production, and replace it with the non-terminal. On every "reduce" action some parsing code is run which typically gathers important info from the symbol list and attaches it to the non-terminal which is pushed. We finish with "program" as the only non-terminal on the stack, and empty input stream, and a full abstract syntax tree attached to the "program". OR an error. To choose between "shift" and "reduce" we use a state machine. Each state corresponds to a point between two symbols in a production, though multiple state may well get merged. For each state there is a list of terminals which should be shifted and a list of terminal that should trigger a 'reduce', together with the production that should be reduced. Each of these also determines the next state. Actually not. We have: a stack of "state" + anything the actions might want. Actions: given a state and a lookahead determine action: - shift: newstate - reduce: non-terminal, count, action - accept: all done When a reduce happens, we look up a separate "goto" table goto: state + non-terminal -> state We create these tables in 4 steps using some intermediate tables. build_first create the 'first' map which lists all the possible 'first' terminals for each non-terminal build_follow create the 'follow' map which lists all possible following terminals for each non-terminal build_itemsets create itemsets which maps an 'itemset' to closures. Each 'itemset' corresponds to a 'state' above. An 'item' is a production plus an index, 0 for the start, 1 after the first symbol etc. An 'itemset' is a collection of 'items' that can all exist at the same point. The 'closure' is a list of nonterminals and terminals which may follow this point. i.e. all that follow immediately, plus the first for each, plus the first of each of those, etc. This involves building the 'goto' table build_actions - if [A -> alpha . a beta] is in Ii and goto(Ii, a)==Ij, action[i,a] is "shift j" - if [A -> alpha . ] is in Ii then action[i,a] is "reduce A -> alpha" for all a if FOLLOW(a) build_first: Each terminal may only start with that terminal. Each non-terminal that can produce an empty string gets 'Empty' added. This is repeated until no more can be added (As setting Empty might allow other non-terms to produce empty) Then for each production, and for each symbol until a non-empty is found add the first of each symbol to the first for this symbol. Then repeat until no change. build_follow: build_follow requires the first() mapping from symbol to terminal sets. Initially, follow() everything is empty, except follow(Start) is End Then for each production for each symbol but the last, all of first(next-symbol) except Empty into follow(this) for last symbol, if first(last) contains empty, then all for follow(last) go to follow(this) or something like that build_itemsets first itemset has one item; "START 0" for each new itemset, we add closure and build_goto. if build_goto added itemsets, we add their closure and build their goto. To close an itemset: - collect lists of terminals and nonterminal which are 'next' in each production - for each non-terminal, add 'first()' ??? the code in LRparse.itcl looks confused - or I am. Having created the closure, we create the goto for the itemset plus each symbol in the closure. For an itemset+symbol, the 'goto' is the set of all items which are one step forward. over *this* symbol. This might be a new itemset, which will need a closure and gotos. So we assign a number to each terminal and a higher number to each non-terminal. We need to be able to store sets of these numbers. This would be an auto-grow array for which the first part is sorted. Every time (say) 8 entries are added, we sort the new entries in. We also assign an index to each production, which is a list of a head and several body parts An 'item' then is 2 numbers: production+offset. An 'itemset' is just like the terminal sets, but fatter. Each itemset gets assigned a number. A 'goto' is a map from itemset plus symbol (2 numbers) to another itemset. 'first' and 'follow' are arrays pointing to termsets. Infact there is probably one array for symbols where each has name, first, follow, first_production 'itemsets' needs content-lookup so could just use linear search, or occasional sort. 'goto' is a separate small table for each itemset which can be created sorted. 'actions' is a per-itemset list which is created sorted. ================================= Errors: grammar errors are "shift-reduce" conflicts and "reduce-reduce" conflicts. shift-reduce are resolved by shifting, but cause a warning reduce-reduce are resolved arbitrarily I guess. Best to avoid them. However we can use precedence to resolve these conflicts.... nearly as easy to be explicit about different nonterminals Parse errors happen where we can neither 'shift' or 'reduce'. We can discard unexpected symbols until we recognise something, but that may be never and might discard too much. Or we can reduce prematurely so we are likely to expect more significant symbols. What we actually do is replace an unexpected symbol by 'error' which should let us reduce.... don't really know how this is meant to work. We want to shift some and reduce some but how much of each? Maybe replace each bad symbol by 'error' ... but then we lose stuff. Maybe check each state on the stack to see if anything recognises the next symbol. If not, discard it and try again? Or something. So an error production is: A -> error term meaning that in the context of A, if we find an error, we look for 'term' and decide that is a (bad) A. So if we get an unexpected terminal, we pop states until one allows us to shift and error, then shift the error, then discard terminals until one is recognised. ================================ Lexer. hand crafted which "knows" about identifiers/reserved words numbers symbols newlines end of file. indentation strings comments Gets list of reserved words from grammer, and list of symbols ident/reserve: _alpha followed by _alphanum numbers: digit followed by digits, periods, a-z and +- immediately following an 'e' or 'E' newlines are .. newlines strings are: ' upto next' on same line, " upto next ", 3 of these upto next 3 on same line, or some multiline thing with no indent. comments are: // to end of line, or /* to */ but without another /* symbols: any ASCII after space and before delete other than digits, alpha and quotes. i.e. anything left. Unknown symbols are the "unknown" token. Grammer is: nonterm -> nonterm + foo += bar "$1 $2 - args for HEAD struct" | something else ; Any identifer that does appear on the LHS is a literal. Any symbol sequence is a symbol. So as we parse we add each word into symbol table. As we find proctions we mark some of them as non-terminals. First non-terminal is the 'start' symbol. This builds an abstract syntax tree. Each node in the tree has a name (the symbol) and either: for non-terminals: a list of other nodes for terminals: the id and value Each symbol on the rhs can be followed by a number which indicates where in the symbol list of the lhs this symbol should go. 1 is first, 2 is second. 0 has a special meaning. The list of symbols in parent is set to the list of symbols in this child, then others are appended. =============================== How do we handle fine indents in parsing? Each token carries not only the token name/value, but also the line-num, char-num of where it starts, and the indent size of that line. A 'reduce' ensures all tokens have compatible indents. If the first token is a terminal, it is ignored.(maybe) Indentation changes result in 'CLOSE' tokens statementlist -> statementlist statement statement_end | statementend -> ; | ; CLOSE | CLOSE No, that doesn't work - it could swallow 'CLOSE's. I suspect we need OPEN and CLOSE to be paired. Or we want 'SAME INDENT' and 'SMALLER INDENT' to be distinct. Python uses INDENT and DETENT and NEWLINE, which is sometimes ignored. Maybe: If a newline is found were it isn't expected it is quietly ignored. If it is followed by an indent, it and following newlines to a dedent are also ignored. Then block -> : NEWLINE INDENT statementlist DEDENT | { statementlist } statementlist -> statementlist statement ; | statementlist statement NEWLINE | statementlist statement ; NEWLINE ================================== Just clarifying how the states in the parser work. A 'state' is a collection of points in different productions that are all equivalent in some way. We have a stack. Every item on the stack is a pair of a 'state' and a 'frame'. A 'frame' is effectively the abstract syntax tree for some symbol which lead to that state. The two main parsing operations are: SHIFT(state): This pushes the given state onto the stack together with a frame built for the terminal. The given state is one step further along one or more productions from the state that was on top of the stack (and is now second from top). REDUCE(production): This pops one entry from the stack for each body part in the production and combines the frames as determined by the production to product a new frame. It also looks up the head in the goto table for the top-of-stack state and makes that the new state So in each case, a state plus a symbol founds goes to a new state. So given a state we examine each production and look at each following symbol. For each unique symbol found, we create a state which is one step forward in each production over precisely that symbol. To this state we add the '0' start for all productions for all non-terminals which could come next Once we get a new state, we create lists of the terms and non-terms which come next in all the different productions. Then for each nonterm we add all terms and non-terms at start of productions for that nonterm. Isn't this just 'first' all over again? No, because 'first' skips over 'empty' nonterminals, but we don't want to here. This pair of symsets then lives with the itemset and are used to build the goto table and to expand the itemsets to a full state. So each itemset needs a set of items and (computed from that) a set of following symbols. We need to be able to quick check if a given itemset is know. The set of following symbols is used partly to help complete the itemset, partly to build the goto table from that itemset. So we don't need to hold on to it if we build the complete itemset. So after we complete the itemset and build the goto table for it, an itemset is just a list of items. However: completing an itemset is a lot of work, and if we have already found it, that is pointless. So we want to check against known itemsets before completion. So we need to be able to see the non-completed form of existing itemsets. The completion of an itemset involves adding "X,0" items, All other items are non-zero. So as long as we get the encoding right, it should be easy to just compare the prefix. itemsets need some sort of stable identity to be stored in the goto tables, so we need to store a number in each itemset. How do we provide lookup? I could use a skip list... yes. --------------------------------------------- I've changed my mind ... I think. I don't want the grammar processing to be in the same program as the grammar matching. It is kind-a neat having just one program and being able to fiddle the grammar and re-run it. However there are impedance mismatches. 1/ symbol numbering. The matcher needs to 'know' the numbers of symbols as compile time constants. The could be done by having the lexer assign known numbers to known names while parsing the grammar. 2/ AST types. It requires a single data type with a simple list of children. This is overly simplistic and makes the matchers unnecessarily messy 3/ AST management code. This probably has to change with the grammar anyway and keeping it separate doesn't gain anything but loses much. If we had an interpreted language with introspection and 'eval', then it might make sense, but I don't really want that sort of language. So: Have the grammar processing output some C (or later 'plato') code to do the parsing. The grammar file can then define the ast data types and the per-production code. So we need some syntax to embed code and grammar in the same file. It would be best if it looks like C to emacs so I can edit it easily. Each production needs to know the type (always a struct I think) of the frame for the head and some code to run. The code has '$N' where variables can be substituted. $0 means HEAD frame, Old frames are (probably) freed so we need to move whatever is needed out. The generated code needs to include the go_to table which maps state to token to state, and the action table which maps state to terminal to action, where action is ACCEPT, SHIFT or REDUCE(n) where n is a production. The table of productions lists body size, token, (to feed to go_to) and code fragment. gram file contains: #include lines #define lines enum lines struct lines static lines $$$$$ to mark start of grammar head -> token token ${ code }$ #include lines go in .c #define lines go in .h struct lines go in .h other lines go in .c #define TOK_token N gets put in .h func define for parser goes in .h "known" table goes in .c go_to as an array of pointers to arrays of pairs of integer with length. To we index state into go_to then binary search token. action table is array of numbers -1 for accept -2 for reduce (N*256 + body_size) for production. production table is a switch statement which includes the code and assigned the result token. Wait - do I care about token names in the C code? I probably don't. If I want to use similar things I can create my own defines. So drop the TOK_token stuff... though it might make the generated C file more readable. ------------ I need to be able to capture raw input text that matches some sequence of symbols. This will be used to copy code out of the grammar file and into the code or include files. It is important to capture spacing and comments as well as the functional bits. Probably best to turn on capture when we see a newline, then if we decide we don't want that, we can turn off and discard it. So 'capture' collects every character processed which will typically be leading spaces and the target token. When capturing C code we count '{}' and if we find something we don't recognised, we add it to the previous stanza. Or maybe have '$$$$$' tokens to differentiate include from C code. ------------------------------------- What about indents and newlines? If I find a newline that I'm not expecting, I can ignore it but require future tokens to be indented more than the token which started the current production(s). But I don't really know which are the current productions until we reduce, by which time it might be a little late to report an error. With each frame on the stack we record the indent of the line which started that nonterminal. On each 'shift', we ensure the new symbol has at least the same indent. On reduce, we set the new indent based on first token reduced. But that isn't the whole picture. I think there are two different sorts of regions. In one, we don't really expect newline, we are happy for them to appear anywhere, as long as indenting never goes backwards. In the other, we do expect newlines, they terminate something but we allow extras as long as they are *more* indented than before. In effect the extra indent in a line continuation and hides the newline. if a and b : then C that was expected and the indent is structural if a and b: C First one not expected but due to indent, is ignored. Second one is expected and indent required and structural. Intent relative to 'if' line if a and b : C d e The newline was possible as a terminator, so the next indent becomes a line continuation. That could be a bit confusing, so we probably don't want to allow that? Each newline gets parsed as either INDENT - if it is followed by an increasing indent NEWLINE- if it is followed by the same indent UNDENT+- if it is followed by a reduced indent. If an indent is not expected, it and all newlines and indents and matching undents are ignored. If newline or undent is not ignored or expected, it is an error. That would make: if asdad { stuff } an error because that first NEWLINE isn't expected or ignored. I guess it should be optional before a '{'. IF -> if Expression NEWLINE { Statementlist } | if Expression { Statementlist } | if Expression : INDENT Statementlist UNDENT Maybe this: - If we see an unexpected INDENT, we hold it and continue. Next SHIFTed symbol gets it attached. - when we reduce all the attached indents get collected and then attached to the nonterm when it is 'shifted' in. - If we see an unexpected UNDENT, we hold it and continue. Any shift while we hold an UNDENT is an error (how to handle?) unless it holds matching INDENTS (i.e. is a reduced nonterm). That is promising but doesn't explain newlines and allows: a = b; c = d; Why is this a problem? It could be a single line, but is wrapped. I think it is a problem because a NEWLINE was allowed after the ';'. So it doesn't really work as one line. a = b ; c = d; is more messy. Newline is allowed there too. a = b ; c = d That makes sense. newline wasn't permitted. a = b + c ; d = f; That is OK. Why the difference? INDENT is attached to '+' which is not the start of the production that swallows it... actually it is. statementlist -> statementlist statement the 'statement' is 'd = f;', + was swallowed into statementlist. Maybe that means it is bad. So: When we reduce a production, and indents that are not on the first symbol must be matched by pending undents. Any indents that are on the first symbol must be carried to parent This is getting there, but doesn't work for { foo; } as shifting the last '}' will be an error as an undent is pending.... No, that is the old rule. Current rule is: - If we see an unexpected INDENT, we hold it and continue. Next SHIFTed symbol gets it attached. i.e. we read the next symbol and mark it as indented. - When we see an unexpected UNDENT, we keep count of it and continue. It is not attached (yet). - When we reduce a production, Any indents attached to the first token are moved to the created token. Any indents on other tokens must match saved undents, and are canceled. If there aren't enough saved undents, then it is an error. This means that if a nonterminal continues onto more lines, it must end the last line. So a a ; b b ; c c ; d d is not ok, but a a ; b b ; c c ; d d is as is a a ; b b ; c c ; d d That seems to make sense. Only I want to revise the above again. - If we see an INDENT or UNDENT which has no action, we count it, keeping a single count which is +ve or -ve or 0. - When we shift in a terminal, it collects the current count. - When we reduce a production, the count on the first token gets attached to the resulting nonterminal. The counts on the remaining tokens must be non-negative, and there must be enough excess UNDENTS to counter them. But what about NEWLINEs? If there is an action for it, we just accept it and eventually shift it. But what if no action? 1/ a = [ 1, 2, INDENT 3, 4, NEWLINE 5, 6, UNDENT ] Actually, that isn't an UNDENT, it is an UNDENT INDENT if anything. But in any case the NEWLINE is allowed and ignored 2/ if true { INDENT a = NEWLINE b + c UNDENT } First newline is confusing and should be illegal. But why? However is that different to '1'? Maybe because NEWLINE is expected in a little while? 3/ if true { a = b + c; d = e; } is even more like 1, and even worse than 2. Even with ';'s so no reason to expect newlines soon, it looks bad. But here it is the indent which is bad, not the newline. Is it because '1' has "a -> a b" (first token is common) while bads is "a -> b c d" - different token. Maybe we need to flag non-terminals which are, or are not, allowed to have unexpected newlines. So a statement-list may not, but other lists may. If a production contains a newline, then the head may not contain unexpected newlines. But how would we know? And indents are still allowed. So if the stack contains indents.... a = b + c is bad, but a = b + c is fine. a = b + c is also fine. New attempt. When we see an INDENT, UNDENT, or NEWLINE which would otherwise be an error, we record it. We should never end up recording an indent and undent at the same time. Probably not a new line either. When we SHIFT a terminal, we attach to it in the current recorded state, which is a simple number: 1 for NEWLINE, 2 for INDENT, -N for UNDENT, 0 for none of the above. When we REDUCE a production: Any state on the first symbol is carried up to the resulting nonterminal. Any other INDENTS must cancel with following UNDENTs. They can follow either in subsequent symbols, or in the current (unshifted) state. There must be no uncanceled UNDENTs on any symbol, but there can be in the state. These are carried forward. Any NEWLINEs between an INDENT and UNDENT are canceled together with the INDENTs. If there are any extra NEWLINES, and a NEWLINE has an ACTION for the current state, then that is an error - stray newline, indent expected. That seems good, though will need to be tested. Remaining question: Is an undent less than the previous indent allows? How is it parsed and what does it mean? if cond1 and (cond 2 or cond 3) and cond 4 This could be an UNDENT followed by an INDENT. However this fails the above rules. We would reduce "cond 2 or cond 3" which contains an INDENT before we see an UNDENT. So we would have '(' Cond-with-indent ')' [undent pending] ['and' is lookahead] Maybe each symbol gets a 'leading indent' and 'internal indent'. Internal indents can be canceled against undents and hide newlines. Leading indents must wait until they become internal. Only then can they cancel newlines and undents. I'm not sure about combining these things. Might need 2 numbers. Any newline after an indent is ignored. So there is no point keeping both and indent and a newline on a symbol - just the indent is enough. Undents can only ever be 'internal'. Intents can be leading or internal. So we have 'leading indent', 'internal undent', 'internal indent'. where indent can also be newline. When we shift in a terminal, it can have undent or indent. It cannot have both. If there is an indent, the undent must have been canceled. When we reduce, indent/newline/undent is canceled. If there is an undent followed by an indent, that is an error. so a symbol can never carry both an indent and an undent. If it carries and undent, it is probably a closing bracket. if a { b } '}' carries an undent which balances the indent on 'b'. What about: if a: foo else: bar The undent doesn't get attached to 'else' as indenting syntax is explict here. How about: if (a) { foo; } else { bar; } 'else' sees an undent and a newline. The undent balances the indent in the 'block' so it only carries the newline. But ... the 'else' - in both cases - gets a newline, and in the first it isn't allowed because it is at the same level as newlines being recognised terminators. But in this case the 'then' *must* be on the next line because we just had a line-oriented block. Maybe we just have to 'expect' that newline. Anyway, a symbol never carries both an indent and an undent. Can it carry a newline with either of those? And newline 'inside' the dent will get ignored, so the interesting possibilities are: newline indent and undent newline I think 'newline' can be leading while 'indent' is internal. However I don't think the other possibility is valid. 'leading_nl' is one of "nothing", "newline" "indent" There can only be a single indent. 'internal_nl' can be a number of undents, or a number of indents, or nothing. There are no 'internal newlines'. If they were in a dent, they were ignored. If they weren't then based on production they were either ignored or caused an error. So to process a production, we first move the leading nl out of the way. Then we look at leading, then internal for each node. If leading is: nothing: do nothing indent: increment current indent newline: if indented, ignore. if outdented, error. If nonterm allows, ignore else error Then add 'internal' to current indent. However once a negative indent has been seen, no further positives are allowed. (09may2013) Damn - I've broken something. My grammar has "NEWLINE" at end of statement, but parser returns "UNDENT" which isn't recognised so it is swallowed. So I think I really want every NEWLINE to be visible. So as same-level newline is "NEWLINE" A great-level newline is "NEWLINE INDENT" A lesser-level newline is "NEWLINE UNDENT+ INDENT?" If a grammer rule can be ended by a newline it just puts NEWLINE at the end. If a grammer rule requires an undent to close it, then it must also list an indent: IF -> if EXPR : INDENT commandlist UNDENT The newline before the INDENT gets held, then destroyed by the INDENT. The newline before the UNDENT is probably match by the commandlist, otherwise it is help, then destroyed by the UNDENT. So we are moving NEWLINE-destruction from lexer to parser. So the pending dents are NEWLINE then UNDENT then INDENT or NEWLINE The first NEWLINE is destroyed by following INDENT or UNDENT. And UNDENT cancels a previous INDENT, but an INDENT doesn't cancel an UNDENT. A newline following an INDENT is impossible for a terminal. So we can have UNDENT INDENT-or-NEWLINE where UNDENT can have a count. These are then attached to the next thing shifted, except that there cannot be both an UNDENT and an INDENT when the SHIFT happens, else error unexpected INDENT. So a shifted terminal has undent count and possible newline, or no undents and possible indent or newline When we reduce a series of symbols to a nonterminal: - the tags on the first symbol are carried to the non-terminal because they come completely before both. - remaining indents and undents must match, or must be able to cancel against pending undents, including the look-ahead symbol. - and newlines between indent and undent are ignored. - Arg. There is a thing which is one of undents undents newline indents newline -nothing- Each symbol has 2 of these: leading and internal. A Terminal always has -nothing- for the internal. These are collected from previous reductions and shifted ignored symbols. A non terminal gain them in a reduction. The leading thing of a new nonterminal is the leading thing of the first reduced symbol. The internal thing of a new nonterminal is a combination of the leading and internal things (in that order) of the remaining symbols, together with some undents that are pending, or in the lookahead. As we merge the remaining things: - indents may not follow undents - newlines following an indent and before next undent are discarded. - indent then undent are discarded. - newlines with no indent are either discarded or cause an error depending on context So result for internal of new nonterminal is either - some number of undents - some number of indents When we see an unexpected newline or indent it goes on the next terminal. When we see an unexpected undent it goes on the current top-of-stack as internal. So: pending: 0 = nothing 1 = newline 2 = indent leading: same internal: count, -ve undents, +ve indents I'm getting close. Stray indents must be disallowed in the same places where stray newlines are.... or something. And how do I know where these stray things are allowed? Hypothesis: There are two sorts of nonterminals - horizontal and vertical. Horizontal nonterminals are expected to be all on one line but can use indents for line continuation. Their productions do not contain a NEWLINE symbol. Vertical nonterminals are expected to run vertically. Their productions do often contain newlines, possibly indirectly. A horizontal nonterminal may contain an indent, but not an unindented newline. A vertical nonterminal may contain newlines but no indents. a = [ 1, 2, 3, 4, 5, 6, ] The list if number is horizontal and starts with an indent, so safe. The [list] structure is also horizontal and contains an indent The second indent is legal because it in horizontal. if a { a; b; c; d; } FOUND A MISTAKE. First token in production does not necessarily hold first terminal. It could have an empty production. That is easy to handle: reducing an empty production collects the pending stuff. double-damn! I do need internal newlines after all - maybe. { a = 1 } That indent is used by statementlist which is vertical so the newline in binding, which is horizontal, isn't protected by the indent and so is not allowed. i.e. the indent at the start doesn't protect things. type str = { a: int; b: char; } More thoughts next morning: - Newline before indent needs to be invisible, else the indent cannot effect line continuation DONE - hidden undents might blur with expected undents. Need to decide if care is needed: almost certainly is - I think the horiz/vert distinction is on productions, not nonterms. A production "A -> A ... " is vertical, others are horizontal. Maybe 'headed' and 'headless' are better term. A 'headless' production can contain unindented newlines. but not unexpected indents. A 'headed' production can be indented but must not have an unindented newline. And that afternoon.... a = 1 + 2 is now complaining because it sees an indent in "1+2" which is Term -> Term + Factor so is vertical and doesn't like indent. But the indent is really in "a = 1 + 2" which is horizontal so permitted. How to we express that? A production that starts with an indent or newline can contain with a sequence of newlines (if vertical) or an indent and a sequence of newlines (if horizontal). A production that does not start with Newline or Indent is deemed to be horizontal, so can contain an indent and some newlines. A production which contains or starts with an indent which is not canceled by a trailing undent passes that indent up to it's parent. So an indent must ultimately be on the first symbol of a recursive (vertical) production, or it must be the first indented symbol in a non-recursive (horizontal) production. Do I really need internal indents? A production starts: - on a newline (same indent as something earlier) - on an indented line - in the middle of a line Based on this and recursive nature we classify as horizontal or vertical (h if middle or non-recursive. v if non-middle and recursive) The first non-middle token in the body must start with indent or newline corresponding to horiz or vert status. Subsequent non-middle tokens may only start with a newline Any non-initial indent in a body must be closed before the production is reduced. So a nonterm can only carry an initial indent. a = 1 + 2 + 3 + 4 a = 1 + 2 Wrong again!!!! Once we see an unexpected indent, future newlines are ignored. Sometimes. Until we see the indent. Can we have multiple unexpected indents? Yes - if never expecting any. So we have a count of unexpected indents. When we see one we expected we push the counter with the indent and restore it when the undent arrives. If the pending production is horizontal, newlines are ignored. If vertical, newlines are recognised. (next morning - 11may2013) So when we see a NEWLINE which the grammar allows, how do we know whether to SHIFT it or SKIP it? We SKIP if the it is or will be part of a production that contains an indent line continuation. But it might be in lots of productions concurrently, and the LALR engine doesn't really tell us about productions until they are ready to reduce. Proposal: Require NEWLINEs to appear in grammer following a nonterminal which is the entire object that they complete So if any command can be terminated by newline or semi, then Termcommand -> Command NEWLINE | Command ; Commandlist -> CommandlistN | CommandlistN -> Command | CommandlistN Command Alternately if the newline is only needed for some commands: Command -> IfCommand | WhileCommand | Assignment ; | Assignment NEWLINE | FunCall ; | FunCall NEWLINE Then when we see a NEWLINE that is expected we trace-ahead how many stack frames would be absorbed before it is shifted. i.e. while action(state, NEWLINE) == Reduce(production): if production.len > 0: tos -= production.len state = tos.state state = goto(state, production.head) tos++ Now if any symbol from 'tos' to the real tos contains an Indent, we Skip the NEWLINE, otherwise we process all those reductions and then SHIFT the NEWLINE. Can we deduce this ahead of time so the matcher just does a lookup? The result needed is stack depth to examine, and that cannot be known ahead of time. So we need the 'production->head' mapping to be more accessible, but still need to search. How does this relate to 'recursive' and 'horiz'v'vert' ? Is the symbol before a NEWLINE there-fore horizontal? It shouldn't come up. We could flag every nonterm followed by NEWLINE as 'oneline', and flag all nonterms which produce recursion as 'recursive' and warn if any nonterm gets both flags. When we 'Skip' a newline, do we still record it in m.nl? I think not, it would be irrelevant. Do we want to search ahead every time we see a newline? We would need to skip to the next expected symbol, then see how many symbols are about to be reduced, and examine those symbols. Maybe this leads to a very different approach. INDENT and UNDENT are still the same: attached to next symbol and passed back. But NEWLINEs are not. When we see a newline, we optionally skip forward to the next expected symbol (or ERROR), then examine how much of the stack we would reduce. If there is a net indent, or a vertical production, we ignore the newline otherwise we remember the decision and start reducing. How does that work? a = b + c + d return The first newline appear only as INDENT and is recorded on the + The second would trigger a reduction back to "Assignment" "a = b + c" which contains an indent (on the '+') so the new is ignored and The '+' is seen and shifted. The third newline is likewise ignored. The UNDENT gets queued and the NEWLINE now sees no net indent and so it isn't ignored. So the Assignment is reduced, the NEWLINE is SHIFTED and we reduce a command. a = [ b, c, d, e, f, g, ] x = ... First newline is only seen as indent, attached to 'b'. Second newline is not expected so we peek at 'd'. 'd' is shiftable as current production is "listN -> listN , IDENTIFIER" This 'listN' contains the indent on 'b' as a leading indent, but as it is a recursive production, that is accepted as an indent and the newline is ignored. This continues until we have the comma after 'g' shifted. Look-ahead is now ']' which triggers a reduction to 'List' List -> listN | Which is not recursive. Though it contains a recursive which has an indent, so that might be cool So we only go back until we find a suitable indent. { thing a = b (c) + d First newline gets attached to start of 'thing' as indent next terminates 'thing' as it is the next to be reduced Third reduces a=b to 'assignment' which has no indent, so is not ignored. { if a { stuff } a = b After the '}' we don't expect a newline because we have complete command. We see a newline so are tempted to ignore it but need to check the indents. look-ahead of 'a' (like everything else) reduces commandlistN -> commandlistN command which is recursive and has a leading indent, so we are good. a = [ b, c ,d, e ,f, g, ] This is weird and probably should be illegal. The NEWLINE after 'c' is not expected so we peek at ','. This would be shifted so there is no indent to permit the 'ignore' and we trigger an error. Probably good. So to sum it all up. - Lexer returns: INDENT for a newline with an increased indent NEWLINE for a newline with the same indent NEWLINE UNDENT+ {NEWLINE | INDENT} for a newline with lesser indent. - Parser tracks: - depth of lexed "ignored_indent" When an INDENT or UNDENT isn't expect by the grammar - number of "unparsed_undents" which is incremented when we accept an unexpected indent, and decremented when we reduce an indent away. - When an INDENT is shifted, "ignored_indent" is attached to is, and then set to zero. "unparsed_undents" must be zero. - When an INDENT leads to ERROR, we tag the next terminal as indented and count the indent in "ignored_indent" - When an UNDENT is seen while the "ignored_indent is +ve, we decrement the UNDENT count and increment "unparsed_undents". However if there is a pending indent waiting to be shifted, we cancel that instead of incrementing "unparsed_undents". - When an UNDENT is seen while "ignored_indents" is zero, we either SHIFT it or trigger an ERROR. On SHIFT we find the previous UNDENT (which must exist) and restore the indent count. - When we reduce a production we check that it has only one indented symbol which may be internal or at the start, and this must correspond to whether it is horizontal or vertical. If it was at the start, the resulting symbol gets the indent. If it is internal, "unparsed_undents" is decremented if it is +ve, otherwise the indent is 'carried forward' to the next symbol shifted. - When we see a NEWLINE that the grammar expects, we each production that will be reduced before it is shifted, and if any contain a legal indent we discard the newline, else we accept it. - When we see a NEWLINE that the grammar does not expect we skip it till we find an expected symbol and repeat the 'search reduceable productions' process and throw and error if the right indent isn't present. if a and b: c d while true : New idea. end-of-line is different from start-of-line. End of line may or may not terminate a nonterminal Start of line is one of 'indent' or 'same'. a new line is coded as one of: Indent Eol Same Eol Undent+ Same Eol Undent+ Indent Eol, Indent and Undent can be in the grammar, Same cannot. Eol is ignored if not expected or a reduce production would contain an indent. NO that doesn't work - what would terminate that line? I was using the newline after the indent, but that doesn't exist any more. Try again: When a newline is ignored by an indent, we count that and ignore all indent/undents and newlines until the matching undent arrives. That is parsed as a Newline. This implies no nesting. What about: if a == b && c == d { print("Hello"); print("world"); } The first newline is an indent, so it is ignored but indent is remembered. The second newline?? The third shouldn't be ignored. Maybe the second is in a state where a recursive symbol is expected. But so is the first. if a == b && c == d { print(x); print(y); } Why is newline after 'b' different to the one after (x); ?? The one after the 'b' isn't expected. The one after '(x);' isn't either. So neither newlines are semantic. If the ';' was missing, the second newline would reduce to a vertical production, so is ignored. What about a = b + c * d +e ?? The first newline is followed by indent, so ignored. The second is at end if indented thing, so ignored. The third is still indented, so still ignored After the next undent a newline will fix it. So maybe we want: { newline undent}* [ newline | indent ] ?? OK, I have a new problem with indents. if condition : INDENT statementlist UNDENT else : INDENT statementlist UNDENT doesn't work, because it is a horizontal production, but the 'else' is not indented. Actually I have: If_Statement -> if Expression Block else Block and Block -> : INDENT Statementlist UNDENT | { Statementlist } Also with same productions: if expression { stuff; } doesn't work because the '{' is not indented. Can I find a clean way to allow these automatically, or do I need some explicit tagging in the grammar? Tagging with OptionalNewline seems to work and probably makes sense. It would be hard to prevent: function foo (thing) { .... at the same time as allowing '\n{' ------------------ Next steps (02May2013) - Get indenting and newlines worked out. - create basic structure of grammar. This will be a long way from final, but creates room to try things out. + A file contains declarations: types, functions, methods, constants, implementations maybe variables interface - between modules. import/export etc init function? + Types: array, struct, enum, union? pointer function parameterized, string int float char Bool closures + Code block: loops, conditionals, break/continue/return catch, defer? actions: assign (= := += *= ...), funcall, methodcall + expression: conditions, boolean, arithmetics, binary 'in' + bindings, variable declarations, local variables etc. local functions? are they just closures? ----------- TO FIX DONE - (" scanner doesn't notice quote DONE - grammar cannot say "no type of these nonterms - grammar to synthesise A' symbols DONE - better tracing - show full stack info - store 'indents' better. - need lots of test code. - use a 'tos' pointer rather than counter. - GPL notice? - easier to write 'main' function - make it easier to handle plain-text files. shift-symbol, record-indents, handle linebreaks