3 What do I remember about how parsing works?
4 And why don't I just use yacc?
5 A: because I didn't invent that :-)
7 We have productions from a non-terminal to a list of
8 terminals and non-terminals.
10 We have an input stream of terminals.
11 Parsing involves a stack of symbols and two operations: shift and reduce.
13 "shift" take the next terminal from input and pushes it onto the stack.
14 "reduce" pops a sequence of symbol off the stack which match the
15 right hand side of a production, and replace it with the non-terminal.
17 On every "reduce" action some parsing code is run which typically
18 gathers important info from the symbol list and attaches it to the
19 non-terminal which is pushed.
21 We finish with "program" as the only non-terminal on the stack,
22 and empty input stream, and a full abstract syntax tree attached to the
27 To choose between "shift" and "reduce" we use a state machine.
28 Each state corresponds to a point between two symbols in a production,
29 though multiple state may well get merged.
30 For each state there is a list of terminals which should be shifted
31 and a list of terminal that should trigger a 'reduce', together with
32 the production that should be reduced. Each of these also determines
36 We have: a stack of "state" + anything the actions might want.
38 Actions: given a state and a lookahead determine action:
40 - reduce: non-terminal, count, action
43 When a reduce happens, we look up a separate "goto" table
44 goto: state + non-terminal -> state
46 We create these tables in 4 steps using some intermediate tables.
48 create the 'first' map which lists all the possible 'first' terminals
51 create the 'follow' map which lists all possible following terminals
54 create itemsets which maps an 'itemset' to closures.
55 Each 'itemset' corresponds to a 'state' above.
56 An 'item' is a production plus an index, 0 for the start, 1 after the first
58 An 'itemset' is a collection of 'items' that can all exist at the same point.
59 The 'closure' is a list of nonterminals and terminals which may follow
60 this point. i.e. all that follow immediately, plus the first for each,
61 plus the first of each of those, etc.
63 This involves building the 'goto' table
67 - if [A -> alpha . a beta] is in Ii and goto(Ii, a)==Ij,
68 action[i,a] is "shift j"
69 - if [A -> alpha . ] is in Ii then
70 action[i,a] is "reduce A -> alpha" for all a if FOLLOW(a)
73 Each terminal may only start with that terminal.
74 Each non-terminal that can produce an empty string gets 'Empty' added.
75 This is repeated until no more can be added (As setting Empty might
76 allow other non-terms to produce empty)
77 Then for each production, and for each symbol until a non-empty is found
78 add the first of each symbol to the first for this symbol.
79 Then repeat until no change.
82 build_follow requires the first() mapping from symbol to terminal sets.
83 Initially, follow() everything is empty, except follow(Start) is End
84 Then for each production
85 for each symbol but the last, all of first(next-symbol) except Empty into
87 for last symbol, if first(last) contains empty, then all for follow(last)
89 or something like that
92 first itemset has one item; "START 0"
93 for each new itemset, we add closure and build_goto.
94 if build_goto added itemsets, we add their closure and build their goto.
97 - collect lists of terminals and nonterminal which are 'next'
99 - for each non-terminal, add 'first()' ???
100 the code in LRparse.itcl looks confused - or I am.
102 Having created the closure, we create the goto for the itemset plus
103 each symbol in the closure.
104 For an itemset+symbol, the 'goto' is the set of all items which are
105 one step forward. over *this* symbol. This might be a new itemset,
106 which will need a closure and gotos.
109 So we assign a number to each terminal and a higher number to each non-terminal.
110 We need to be able to store sets of these numbers. This would be an auto-grow array
111 for which the first part is sorted. Every time (say) 8 entries are added, we
112 sort the new entries in.
114 We also assign an index to each production, which is a list of a head and several
116 An 'item' then is 2 numbers: production+offset.
117 An 'itemset' is just like the terminal sets, but fatter.
118 Each itemset gets assigned a number.
119 A 'goto' is a map from itemset plus symbol (2 numbers) to another itemset.
121 'first' and 'follow' are arrays pointing to termsets.
122 Infact there is probably one array for symbols where each has
123 name, first, follow, first_production
124 'itemsets' needs content-lookup so could just use linear search, or occasional sort.
126 'goto' is a separate small table for each itemset which can be created sorted.
128 'actions' is a per-itemset list which is created sorted.
130 =================================
133 grammar errors are "shift-reduce" conflicts and "reduce-reduce" conflicts.
135 shift-reduce are resolved by shifting, but cause a warning
136 reduce-reduce are resolved arbitrarily I guess. Best to avoid them.
138 However we can use precedence to resolve these conflicts.... nearly
139 as easy to be explicit about different nonterminals
141 Parse errors happen where we can neither 'shift' or 'reduce'.
143 We can discard unexpected symbols until we recognise something, but
144 that may be never and might discard too much.
145 Or we can reduce prematurely so we are likely to expect more significant
147 What we actually do is replace an unexpected symbol by 'error' which should let
148 us reduce.... don't really know how this is meant to work. We want to shift
149 some and reduce some but how much of each? Maybe replace each bad symbol
150 by 'error' ... but then we lose stuff.
151 Maybe check each state on the stack to see if anything recognises the next symbol.
152 If not, discard it and try again? Or something.
154 So an error production is:
156 meaning that in the context of A, if we find an error, we look for 'term'
157 and decide that is a (bad) A.
158 So if we get an unexpected terminal, we pop states until one
159 allows us to shift and error, then shift the error, then discard terminals
160 until one is recognised.
163 ================================
165 hand crafted which "knows" about
166 identifiers/reserved words
175 Gets list of reserved words from grammer, and list of symbols
177 ident/reserve: _alpha followed by _alphanum
178 numbers: digit followed by digits, periods, a-z and +- immediately following
180 newlines are .. newlines
181 strings are: ' upto next' on same line, " upto next ", 3 of these upto next
182 3 on same line, or some multiline thing with no indent.
183 comments are: // to end of line, or /* to */ but without another /*
185 symbols: any ASCII after space and before delete other than digits,
186 alpha and quotes. i.e. anything left. Unknown symbols are the
191 nonterm -> nonterm + foo += bar "$1 $2 - args for HEAD struct"
194 Any identifer that does appear on the LHS is a literal.
195 Any symbol sequence is a symbol.
196 So as we parse we add each word into symbol table. As we
197 find proctions we mark some of them as non-terminals.
198 First non-terminal is the 'start' symbol.
200 This builds an abstract syntax tree.
202 Each node in the tree has a name (the symbol) and either:
203 for non-terminals: a list of other nodes
204 for terminals: the id and value
206 Each symbol on the rhs can be followed by a number which indicates where
207 in the symbol list of the lhs this symbol should go. 1 is first, 2 is second.
208 0 has a special meaning. The list of symbols in parent is set to the list
209 of symbols in this child, then others are appended.
211 ===============================
212 How do we handle fine indents in parsing?
214 Each token carries not only the token name/value, but also
215 the line-num, char-num of where it starts, and the indent size
216 of that line. A 'reduce' ensures all tokens have compatible indents.
217 If the first token is a terminal, it is ignored.(maybe)
219 Indentation changes result in 'CLOSE' tokens
220 statementlist -> statementlist statement statement_end
225 No, that doesn't work - it could swallow 'CLOSE's.
226 I suspect we need OPEN and CLOSE to be paired.
227 Or we want 'SAME INDENT' and 'SMALLER INDENT' to be distinct.
228 Python uses INDENT and DETENT and NEWLINE, which is sometimes ignored.
231 If a newline is found were it isn't expected it is quietly ignored.
232 If it is followed by an indent, it and following newlines to a dedent
235 block -> : NEWLINE INDENT statementlist DEDENT
237 statementlist -> statementlist statement ;
238 | statementlist statement NEWLINE
239 | statementlist statement ; NEWLINE
241 ==================================
242 Just clarifying how the states in the parser work.
244 A 'state' is a collection of points in different productions that are
245 all equivalent in some way.
247 We have a stack. Every item on the stack is a pair of a 'state' and a
248 'frame'. A 'frame' is effectively the abstract syntax tree for some
249 symbol which lead to that state.
251 The two main parsing operations are:
252 SHIFT(state): This pushes the given state onto the stack together
253 with a frame built for the terminal.
254 The given state is one step further along one or more productions from the
255 state that was on top of the stack (and is now second from top).
257 REDUCE(production): This pops one entry from the stack for each body
258 part in the production and combines the frames as determined by the
259 production to product a new frame.
260 It also looks up the head in the goto table for the top-of-stack
261 state and makes that the new state
263 So in each case, a state plus a symbol founds goes to a new state.
265 So given a state we examine each production and look at each following
267 For each unique symbol found, we create a state which is one step
268 forward in each production over precisely that symbol.
269 To this state we add the '0' start for all productions for all
270 non-terminals which could come next
272 Once we get a new state, we create lists of the terms and non-terms
273 which come next in all the different productions. Then for each
274 nonterm we add all terms and non-terms at start of productions
275 for that nonterm. Isn't this just 'first' all over again? No,
276 because 'first' skips over 'empty' nonterminals, but we don't want to here.
277 This pair of symsets then lives with the itemset and are used to build
278 the goto table and to expand the itemsets to a full state.
280 So each itemset needs a set of items and (computed from that) a set of
281 following symbols. We need to be able to quick check if a given
283 The set of following symbols is used partly to help complete the
284 itemset, partly to build the goto table from that itemset. So we
285 don't need to hold on to it if we build the complete itemset.
286 So after we complete the itemset and build the goto table for it,
287 an itemset is just a list of items.
289 However: completing an itemset is a lot of work, and if we have
290 already found it, that is pointless. So we want to check against
291 known itemsets before completion. So we need to be able to see the
292 non-completed form of existing itemsets.
293 The completion of an itemset involves adding "X,0" items, All other
294 items are non-zero. So as long as we get the encoding right, it
295 should be easy to just compare the prefix.
297 itemsets need some sort of stable identity to be stored in the goto
298 tables, so we need to store a number in each itemset.
299 How do we provide lookup? I could use a skip list... yes.
303 ---------------------------------------------
304 I've changed my mind ... I think.
305 I don't want the grammar processing to be in the same program as the
306 grammar matching. It is kind-a neat having just one program and being
307 able to fiddle the grammar and re-run it. However there are impedance
309 1/ symbol numbering. The matcher needs to 'know' the numbers of
310 symbols as compile time constants. The could be done by having the
311 lexer assign known numbers to known names while parsing the grammar.
312 2/ AST types. It requires a single data type with a simple list of
313 children. This is overly simplistic and makes the matchers
315 3/ AST management code. This probably has to change with the grammar
316 anyway and keeping it separate doesn't gain anything but loses
319 If we had an interpreted language with introspection and 'eval', then
320 it might make sense, but I don't really want that sort of language.
322 So: Have the grammar processing output some C (or later 'plato') code
323 to do the parsing. The grammar file can then define the ast data
324 types and the per-production code.
325 So we need some syntax to embed code and grammar in the same file. It
326 would be best if it looks like C to emacs so I can edit it easily.
328 Each production needs to know the type (always a struct I think) of
329 the frame for the head and some code to run.
331 The code has '$N' where variables can be substituted. $0 means HEAD
333 Old frames are (probably) freed so we need to move whatever is
336 The generated code needs to include the go_to table which maps
337 state to token to state, and the action table which maps
338 state to terminal to action, where action is ACCEPT, SHIFT or
339 REDUCE(n) where n is a production. The table of productions lists
340 body size, token, (to feed to go_to) and code fragment.
350 $$$$$ to mark start of grammar
352 head -> token token ${ code }$
354 #include lines go in .c
355 #define lines go in .h
356 struct lines go in .h
362 func define for parser goes in .h
364 "known" table goes in .c
365 go_to as an array of pointers to arrays of pairs of integer with
366 length. To we index state into go_to then binary search token.
368 action table is array of numbers
371 (N*256 + body_size) for production.
373 production table is a switch statement which includes the code
374 and assigned the result token.
376 Wait - do I care about token names in the C code? I probably don't.
377 If I want to use similar things I can create my own defines.
378 So drop the TOK_token stuff... though it might make the generated
379 C file more readable.
383 I need to be able to capture raw input text that matches some
384 sequence of symbols. This will be used to copy code out of the
385 grammar file and into the code or include files.
386 It is important to capture spacing and comments as well as the
388 Probably best to turn on capture when we see a newline, then if we
389 decide we don't want that, we can turn off and discard it.
390 So 'capture' collects every character processed which will typically
391 be leading spaces and the target token.
393 When capturing C code we count '{}' and if we find something we don't
394 recognised, we add it to the previous stanza.
395 Or maybe have '$$$$$' tokens to differentiate include from C code.
398 -------------------------------------
400 What about indents and newlines?
401 If I find a newline that I'm not expecting, I can ignore it but
402 require future tokens to be indented more than the token which
403 started the current production(s). But I don't really know which
404 are the current productions until we reduce, by which time it might be
405 a little late to report an error.
407 With each frame on the stack we record the indent of the line
408 which started that nonterminal. On each 'shift', we ensure
409 the new symbol has at least the same indent.
410 On reduce, we set the new indent based on first token reduced.
412 But that isn't the whole picture.
413 I think there are two different sorts of regions.
414 In one, we don't really expect newline, we are happy for them to
415 appear anywhere, as long as indenting never goes backwards.
416 In the other, we do expect newlines, they terminate something
417 but we allow extras as long as they are *more* indented than before.
418 In effect the extra indent in a line continuation and hides the
424 that was expected and the indent is structural
430 First one not expected but due to indent, is ignored.
431 Second one is expected and indent required and structural.
432 Intent relative to 'if' line
437 The newline was possible as a terminator, so the next indent
438 becomes a line continuation. That could be a bit confusing,
439 so we probably don't want to allow that?
441 Each newline gets parsed as either
442 INDENT - if it is followed by an increasing indent
443 NEWLINE- if it is followed by the same indent
444 UNDENT+- if it is followed by a reduced indent.
446 If an indent is not expected, it and all newlines and indents
447 and matching undents are ignored.
448 If newline or undent is not ignored or expected, it is an error.
455 an error because that first NEWLINE isn't expected or ignored.
456 I guess it should be optional before a '{'.
458 IF -> if Expression NEWLINE { Statementlist }
459 | if Expression { Statementlist }
460 | if Expression : INDENT Statementlist UNDENT
464 - If we see an unexpected INDENT, we hold it and continue.
465 Next SHIFTed symbol gets it attached.
466 - when we reduce all the attached indents get collected and
467 then attached to the nonterm when it is 'shifted' in.
468 - If we see an unexpected UNDENT, we hold it and continue.
469 Any shift while we hold an UNDENT is an error (how to handle?)
470 unless it holds matching INDENTS (i.e. is a reduced nonterm).
472 That is promising but doesn't explain newlines and allows:
475 Why is this a problem? It could be a single line, but is wrapped.
476 I think it is a problem because a NEWLINE was allowed after the ';'.
477 So it doesn't really work as one line.
480 is more messy. Newline is allowed there too.
483 That makes sense. newline wasn't permitted.
486 That is OK. Why the difference? INDENT is attached to '+' which
487 is not the start of the production that swallows it... actually it is.
488 statementlist -> statementlist statement
489 the 'statement' is 'd = f;', + was swallowed into statementlist.
490 Maybe that means it is bad.
491 So: When we reduce a production, and indents that are not on the first
492 symbol must be matched by pending undents.
493 Any indents that are on the first symbol must be carried to parent
495 This is getting there, but doesn't work for
500 as shifting the last '}' will be an error as an undent is
501 pending.... No, that is the old rule.
504 - If we see an unexpected INDENT, we hold it and continue.
505 Next SHIFTed symbol gets it attached. i.e. we read the next
506 symbol and mark it as indented.
507 - When we see an unexpected UNDENT, we keep count of it and continue.
508 It is not attached (yet).
509 - When we reduce a production, Any indents attached to the first
510 token are moved to the created token. Any indents on other tokens
511 must match saved undents, and are canceled. If there aren't
512 enough saved undents, then it is an error. This means that if
513 a nonterminal continues onto more lines, it must end the last line.
523 That seems to make sense.
525 Only I want to revise the above again.
526 - If we see an INDENT or UNDENT which has no action, we count it,
527 keeping a single count which is +ve or -ve or 0.
528 - When we shift in a terminal, it collects the current count.
529 - When we reduce a production, the count on the first token gets
530 attached to the resulting nonterminal. The counts on the
531 remaining tokens must be non-negative, and there must be enough
532 excess UNDENTS to counter them.
534 But what about NEWLINEs?
535 If there is an action for it, we just accept it and eventually
537 But what if no action?
545 Actually, that isn't an UNDENT, it is an UNDENT INDENT if
546 anything. But in any case the NEWLINE is allowed and ignored
554 First newline is confusing and should be illegal. But why?
555 However is that different to '1'?
556 Maybe because NEWLINE is expected in a little while?
564 is even more like 1, and even worse than 2. Even with ';'s so no
565 reason to expect newlines soon, it looks bad. But here it is the
566 indent which is bad, not the newline.
568 Is it because '1' has "a -> a b" (first token is common) while
569 bads is "a -> b c d" - different token.
571 Maybe we need to flag non-terminals which are, or are not, allowed
572 to have unexpected newlines.
573 So a statement-list may not, but other lists may.
575 If a production contains a newline, then the head may not contain
576 unexpected newlines. But how would we know? And indents are still
577 allowed. So if the stack contains indents....
592 When we see an INDENT, UNDENT, or NEWLINE which would otherwise be
593 an error, we record it. We should never end up recording an indent
594 and undent at the same time. Probably not a new line either.
595 When we SHIFT a terminal, we attach to it in the current recorded
596 state, which is a simple number: 1 for NEWLINE, 2 for INDENT, -N for
597 UNDENT, 0 for none of the above.
598 When we REDUCE a production:
599 Any state on the first symbol is carried up to the resulting
601 Any other INDENTS must cancel with following UNDENTs. They can
602 follow either in subsequent symbols, or in the current (unshifted)
604 There must be no uncanceled UNDENTs on any symbol, but there can
605 be in the state. These are carried forward.
606 Any NEWLINEs between an INDENT and UNDENT are canceled together with
608 If there are any extra NEWLINES, and a NEWLINE has an ACTION for
609 the current state, then that is an error - stray newline, indent
612 That seems good, though will need to be tested. Remaining question:
613 Is an undent less than the previous indent allows? How is it parsed
614 and what does it mean?
619 This could be an UNDENT followed by an INDENT.
620 However this fails the above rules. We would reduce
621 "cond 2 or cond 3" which contains an INDENT before we see an UNDENT.
623 '(' Cond-with-indent ')' [undent pending] ['and' is lookahead]
625 Maybe each symbol gets a 'leading indent' and 'internal indent'.
626 Internal indents can be canceled against undents and hide newlines.
627 Leading indents must wait until they become internal. Only then can
628 they cancel newlines and undents.
631 I'm not sure about combining these things. Might need 2 numbers.
632 Any newline after an indent is ignored. So there is no point
633 keeping both and indent and a newline on a symbol - just the indent
635 Undents can only ever be 'internal'. Intents can be leading or
637 So we have 'leading indent', 'internal undent', 'internal indent'.
638 where indent can also be newline.
639 When we shift in a terminal, it can have undent or indent. It
640 cannot have both. If there is an indent, the undent must have been
642 When we reduce, indent/newline/undent is canceled. If there is an
643 undent followed by an indent, that is an error. so a symbol can
644 never carry both an indent and an undent.
645 If it carries and undent, it is probably a closing bracket.
649 '}' carries an undent which balances the indent on 'b'.
656 The undent doesn't get attached to 'else' as indenting syntax is
668 'else' sees an undent and a newline. The undent balances the
669 indent in the 'block' so it only carries the newline.
671 But ... the 'else' - in both cases - gets a newline, and in the first
672 it isn't allowed because it is at the same level as newlines being
673 recognised terminators. But in this case the 'then' *must* be on
674 the next line because we just had a line-oriented block.
675 Maybe we just have to 'expect' that newline.
677 Anyway, a symbol never carries both an indent and an undent.
678 Can it carry a newline with either of those? And newline 'inside'
679 the dent will get ignored, so the interesting possibilities are:
684 I think 'newline' can be leading while 'indent' is internal.
685 However I don't think the other possibility is valid.
688 'leading_nl' is one of "nothing", "newline" "indent"
689 There can only be a single indent.
691 'internal_nl' can be a number of undents, or a number of indents,
692 or nothing. There are no 'internal newlines'. If they were
693 in a dent, they were ignored. If they weren't then based on
694 production they were either ignored or caused an error.
696 So to process a production, we first move the leading nl out
697 of the way. Then we look at leading, then internal for each node.
700 indent: increment current indent
701 newline: if indented, ignore. if outdented, error.
702 If nonterm allows, ignore else error
703 Then add 'internal' to current indent. However once a negative
704 indent has been seen, no further positives are allowed.
708 Damn - I've broken something.
709 My grammar has "NEWLINE" at end of statement, but parser returns
710 "UNDENT" which isn't recognised so it is swallowed.
711 So I think I really want every NEWLINE to be visible.
712 So as same-level newline is "NEWLINE"
713 A great-level newline is "NEWLINE INDENT"
714 A lesser-level newline is "NEWLINE UNDENT+ INDENT?"
716 If a grammer rule can be ended by a newline it just puts NEWLINE at
718 If a grammer rule requires an undent to close it, then it must
720 IF -> if EXPR : INDENT commandlist UNDENT
722 The newline before the INDENT gets held, then destroyed by the INDENT.
723 The newline before the UNDENT is probably match by the commandlist,
724 otherwise it is help, then destroyed by the UNDENT.
725 So we are moving NEWLINE-destruction from lexer to parser.
727 So the pending dents are NEWLINE then UNDENT then INDENT or NEWLINE
728 The first NEWLINE is destroyed by following INDENT or UNDENT. And
729 UNDENT cancels a previous INDENT, but an INDENT doesn't cancel an
730 UNDENT. A newline following an INDENT is impossible for a terminal.
731 So we can have UNDENT INDENT-or-NEWLINE where UNDENT can have a count.
733 These are then attached to the next thing shifted, except that there
734 cannot be both an UNDENT and an INDENT when the SHIFT happens, else
735 error unexpected INDENT.
737 So a shifted terminal has undent count and possible newline, or no
738 undents and possible indent or newline
740 When we reduce a series of symbols to a nonterminal:
741 - the tags on the first symbol are carried to the non-terminal
742 because they come completely before both.
743 - remaining indents and undents must match, or must be able to
744 cancel against pending undents, including the look-ahead symbol.
745 - and newlines between indent and undent are ignored.
749 There is a thing which is one of
756 Each symbol has 2 of these: leading and internal.
757 A Terminal always has -nothing- for the internal. These are
758 collected from previous reductions and shifted ignored symbols.
759 A non terminal gain them in a reduction.
760 The leading thing of a new nonterminal is the leading thing of the
761 first reduced symbol.
762 The internal thing of a new nonterminal is a combination of the
763 leading and internal things (in that order) of the remaining
764 symbols, together with some undents that are pending, or in the
766 As we merge the remaining things:
767 - indents may not follow undents
768 - newlines following an indent and before next undent are
770 - indent then undent are discarded.
771 - newlines with no indent are either discarded or cause an error
773 So result for internal of new nonterminal is either
774 - some number of undents
775 - some number of indents
777 When we see an unexpected newline or indent it goes on the next
779 When we see an unexpected undent it goes on the current
780 top-of-stack as internal.
787 internal: count, -ve undents, +ve indents
791 Stray indents must be disallowed in the same places where stray
792 newlines are.... or something.
793 And how do I know where these stray things are allowed?
795 Hypothesis: There are two sorts of nonterminals - horizontal
796 and vertical. Horizontal nonterminals are expected to be all on
797 one line but can use indents for line continuation. Their
798 productions do not contain a NEWLINE symbol.
799 Vertical nonterminals are expected to run vertically. Their
800 productions do often contain newlines, possibly indirectly.
802 A horizontal nonterminal may contain an indent, but not an
804 A vertical nonterminal may contain newlines but no indents.
812 The list if number is horizontal and starts with an indent, so
814 The [list] structure is also horizontal and contains an indent
815 The second indent is legal because it in horizontal.
821 FOUND A MISTAKE. First token in production does not necessarily hold
822 first terminal. It could have an empty production. That is easy to
823 handle: reducing an empty production collects the pending stuff.
826 double-damn! I do need internal newlines after all - maybe.
833 That indent is used by statementlist which is vertical so the newline
834 in binding, which is horizontal, isn't protected by the indent and so
836 i.e. the indent at the start doesn't protect things.
842 More thoughts next morning:
843 - Newline before indent needs to be invisible, else the indent
844 cannot effect line continuation DONE
845 - hidden undents might blur with expected undents. Need to decide if
846 care is needed: almost certainly is
847 - I think the horiz/vert distinction is on productions, not nonterms.
848 A production "A -> A ... " is vertical, others are horizontal.
849 Maybe 'headed' and 'headless' are better term.
850 A 'headless' production can contain unindented newlines. but not
851 unexpected indents. A 'headed' production can be indented but
852 must not have an unindented newline.
854 And that afternoon....
857 is now complaining because it sees an indent in "1+2" which is
858 Term -> Term + Factor
859 so is vertical and doesn't like indent. But the indent is really
860 in "a = 1 + 2" which is horizontal so permitted.
861 How to we express that?
863 A production that starts with an indent or newline can contain
864 with a sequence of newlines (if vertical) or an indent and a sequence
865 of newlines (if horizontal).
867 A production that does not start with Newline or Indent is deemed to
868 be horizontal, so can contain an indent and some newlines.
870 A production which contains or starts with an indent which is not
871 canceled by a trailing undent passes that indent up to it's parent.
873 So an indent must ultimately be on the first symbol of a recursive
874 (vertical) production, or it must be the first indented symbol in a
875 non-recursive (horizontal) production.
877 Do I really need internal indents?
880 - on a newline (same indent as something earlier)
881 - on an indented line
882 - in the middle of a line
883 Based on this and recursive nature we classify as horizontal or
884 vertical (h if middle or non-recursive. v if non-middle and recursive)
885 The first non-middle token in the body must start with indent or
886 newline corresponding to horiz or vert status. Subsequent non-middle
887 tokens may only start with a newline
888 Any non-initial indent in a body must be closed before the production
889 is reduced. So a nonterm can only carry an initial indent.
901 Once we see an unexpected indent, future newlines are
902 ignored. Sometimes. Until we see the indent.
904 Can we have multiple unexpected indents? Yes - if never expecting
906 So we have a count of unexpected indents. When we see one we expected
907 we push the counter with the indent and restore it when the undent
910 If the pending production is horizontal, newlines are ignored. If
911 vertical, newlines are recognised.
913 (next morning - 11may2013)
915 So when we see a NEWLINE which the grammar allows, how do we know
916 whether to SHIFT it or SKIP it?
917 We SKIP if the it is or will be part of a production that contains an
918 indent line continuation. But it might be in lots of productions
919 concurrently, and the LALR engine doesn't really tell us about
920 productions until they are ready to reduce.
923 Require NEWLINEs to appear in grammer following a nonterminal which
924 is the entire object that they complete
925 So if any command can be terminated by newline or semi, then
926 Termcommand -> Command NEWLINE
928 Commandlist -> CommandlistN
930 CommandlistN -> Command
931 | CommandlistN Command
933 Alternately if the newline is only needed for some commands:
941 Then when we see a NEWLINE that is expected we trace-ahead how
942 many stack frames would be absorbed before it is shifted.
943 i.e. while action(state, NEWLINE) == Reduce(production):
944 if production.len > 0:
945 tos -= production.len
947 state = goto(state, production.head)
950 Now if any symbol from 'tos' to the real tos contains an Indent,
951 we Skip the NEWLINE, otherwise we process all those reductions and
952 then SHIFT the NEWLINE.
954 Can we deduce this ahead of time so the matcher just does a
956 The result needed is stack depth to examine, and that cannot be
958 So we need the 'production->head' mapping to be more accessible,
959 but still need to search.
961 How does this relate to 'recursive' and 'horiz'v'vert' ?
962 Is the symbol before a NEWLINE there-fore horizontal?
963 It shouldn't come up.
964 We could flag every nonterm followed by NEWLINE as 'oneline',
965 and flag all nonterms which produce recursion as 'recursive' and
966 warn if any nonterm gets both flags.
968 When we 'Skip' a newline, do we still record it in m.nl?
969 I think not, it would be irrelevant.
971 Do we want to search ahead every time we see a newline?
972 We would need to skip to the next expected symbol, then see how
973 many symbols are about to be reduced, and examine those symbols.
975 Maybe this leads to a very different approach.
976 INDENT and UNDENT are still the same: attached to next symbol and
978 But NEWLINEs are not. When we see a newline, we optionally skip
979 forward to the next expected symbol (or ERROR), then examine
980 how much of the stack we would reduce.
981 If there is a net indent, or a vertical production, we ignore the
982 newline otherwise we remember the decision and start reducing.
990 The first newline appear only as INDENT and is recorded on the +
991 The second would trigger a reduction back to "Assignment" "a = b +
992 c" which contains an indent (on the '+') so the new is ignored and
993 The '+' is seen and shifted.
994 The third newline is likewise ignored. The UNDENT gets queued
995 and the NEWLINE now sees no net indent and so it isn't ignored.
996 So the Assignment is reduced, the NEWLINE is SHIFTED and we reduce
1005 First newline is only seen as indent, attached to 'b'.
1006 Second newline is not expected so we peek at 'd'.
1007 'd' is shiftable as current production is
1008 "listN -> listN , IDENTIFIER"
1009 This 'listN' contains the indent on 'b' as a leading indent,
1010 but as it is a recursive production, that is accepted as an
1011 indent and the newline is ignored. This continues until we have
1012 the comma after 'g' shifted.
1013 Look-ahead is now ']' which triggers a reduction to 'List'
1016 Which is not recursive. Though it contains a recursive which
1017 has an indent, so that might be cool
1018 So we only go back until we find a suitable indent.
1025 First newline gets attached to start of 'thing' as indent
1026 next terminates 'thing' as it is the next to be reduced
1027 Third reduces a=b to 'assignment' which has no indent, so is not
1036 After the '}' we don't expect a newline because we have complete
1037 command. We see a newline so are tempted to ignore it but need to
1038 check the indents. look-ahead of 'a' (like everything else)
1040 commandlistN -> commandlistN command
1041 which is recursive and has a leading indent, so we are good.
1049 This is weird and probably should be illegal.
1050 The NEWLINE after 'c' is not expected so we peek at ','.
1051 This would be shifted so there is no indent to permit the 'ignore'
1052 and we trigger an error. Probably good.
1054 So to sum it all up.
1057 INDENT for a newline with an increased indent
1058 NEWLINE for a newline with the same indent
1059 NEWLINE UNDENT+ {NEWLINE | INDENT} for a newline with lesser
1062 - depth of lexed "ignored_indent" When an INDENT or UNDENT
1063 isn't expect by the grammar
1064 - number of "unparsed_undents" which is incremented when we
1065 accept an unexpected indent, and decremented when we reduce
1067 - When an INDENT is shifted, "ignored_indent" is attached to is,
1068 and then set to zero. "unparsed_undents" must be zero.
1069 - When an INDENT leads to ERROR, we tag the next terminal as
1070 indented and count the indent in "ignored_indent"
1071 - When an UNDENT is seen while the "ignored_indent is +ve, we
1072 decrement the UNDENT count and increment "unparsed_undents".
1073 However if there is a pending indent waiting to be shifted, we
1074 cancel that instead of incrementing "unparsed_undents".
1075 - When an UNDENT is seen while "ignored_indents" is zero, we either
1076 SHIFT it or trigger an ERROR. On SHIFT we find the previous
1077 UNDENT (which must exist) and restore the indent count.
1078 - When we reduce a production we check that it has only one
1079 indented symbol which may be internal or at the start, and
1080 this must correspond to whether it is horizontal or vertical.
1081 If it was at the start, the resulting symbol gets the indent.
1082 If it is internal, "unparsed_undents" is decremented if it is
1083 +ve, otherwise the indent is 'carried forward' to the next
1087 - When we see a NEWLINE that the grammar expects, we each
1088 production that will be reduced before it is shifted, and if any
1089 contain a legal indent we discard the newline, else we accept
1091 - When we see a NEWLINE that the grammar does not expect we skip
1092 it till we find an expected symbol and repeat the 'search
1093 reduceable productions' process and throw and error if the right
1094 indent isn't present.
1103 New idea. end-of-line is different from start-of-line.
1104 End of line may or may not terminate a nonterminal
1105 Start of line is one of 'indent' or 'same'.
1106 a new line is coded as one of:
1112 Eol, Indent and Undent can be in the grammar, Same cannot.
1113 Eol is ignored if not expected or a reduce production would contain
1115 NO that doesn't work - what would terminate that line? I was using
1116 the newline after the indent, but that doesn't exist any more.
1119 When a newline is ignored by an indent, we count that and ignore
1120 all indent/undents and newlines until the matching undent arrives.
1121 That is parsed as a Newline. This implies no nesting. What
1130 The first newline is an indent, so it is ignored but indent is
1132 The second newline?? The third shouldn't be ignored.
1133 Maybe the second is in a state where a recursive symbol is
1135 But so is the first.
1143 Why is newline after 'b' different to the one after (x); ??
1144 The one after the 'b' isn't expected. The one after '(x);'
1145 isn't either. So neither newlines are semantic.
1146 If the ';' was missing, the second newline would reduce to a
1147 vertical production, so is ignored.
1154 The first newline is followed by indent, so ignored.
1155 The second is at end if indented thing, so ignored.
1156 The third is still indented, so still ignored
1157 After the next undent a newline will fix it.
1158 So maybe we want: { newline undent}* [ newline | indent ] ??
1161 OK, I have a new problem with indents.
1163 if condition : INDENT statementlist UNDENT else : INDENT statementlist UNDENT
1165 doesn't work, because it is a horizontal production, but the 'else' is not indented.
1167 If_Statement -> if Expression Block else Block
1169 Block -> : INDENT Statementlist UNDENT
1172 Also with same productions:
1178 doesn't work because the '{' is not indented.
1180 Can I find a clean way to allow these automatically, or do I need some explicit
1181 tagging in the grammar?
1182 Tagging with OptionalNewline seems to work and probably makes sense. It would be
1188 at the same time as allowing '\n{'
1192 Next steps (02May2013)
1193 - Get indenting and newlines worked out.
1194 - create basic structure of grammar. This will be a long way from
1195 final, but creates room to try things out.
1197 + A file contains declarations:
1198 types, functions, methods, constants, implementations
1200 interface - between modules. import/export etc
1203 + Types: array, struct, enum, union? pointer function
1204 parameterized, string int float char Bool
1207 + Code block: loops, conditionals, break/continue/return
1209 actions: assign (= := += *= ...), funcall, methodcall
1211 + expression: conditions, boolean, arithmetics, binary
1214 + bindings, variable declarations, local variables etc.
1215 local functions? are they just closures?
1220 DONE - (" scanner doesn't notice quote
1221 DONE - grammar cannot say "no type of these nonterms
1222 - grammar to synthesise A' symbols
1223 DONE - better tracing - show full stack info
1224 - store 'indents' better.
1225 - need lots of test code.
1226 - use a 'tos' pointer rather than counter.
1228 - easier to write 'main' function
1229 - make it easier to handle plain-text files.
1231 shift-symbol, record-indents, handle linebreaks