1 How to write up two-dimensional parsing?
5 - If an indent is ignored the matching undent is ignored
6 except to decrement the 'ignored' stack and to turn
7 a pending indent into a pending newline.
8 or to count the pending undents.
10 - If a newline is found and cannot be shifted yet then we check
11 each production that would close before we shift, and if any still
12 have pending undents, we ignore the newline
14 - when we reduce a production, we 'resolve' indents wich may leave
15 something pending. Either an indent or a number of undents?
17 - productions can be recursive if first body symbol matches head.
18 A recursive production may be horizontal if it starts in the middle
19 of a line. non-recursive productions are always horizontal.
20 A horizontal production must be all on one line or have its body
22 A vertical production must not be indented.
25 Maybe I don't want to mention indents in the grammar ?
27 if CONDITION : STATEMENTLIST
29 doesn't have to be indented but we will only reduce it when we
30 see something that doesn't belong.... which could be an UNDENT
32 So: when look-ahead is UNDENT we cannot shift so we reduce until
33 the top state has an indent attached, at which point we consume the
42 the undent will reduce until we just have the ifstatement which
44 So any indent is pushed onto the stack and a reduction sums all the
47 What about newlines? A newline will reduce any production that
48 doesn't contain an indent. It gets discard when top of stack
49 was at start of line...
51 So each 'state' also encodes
52 start-of-line or middle-of-line
56 When an INDENT is seen the current state increments count of indents.
57 When an UNDENT is seen we must reduce until tos state has an indent,
59 When a NEWLINE is seen, we reduce until tos state has indents or was
63 UNDENT shouldn't force a reduce. Only newline should do that. But as
64 an undent is always followed by a newline, it still results in a
65 reduce happening where I want it.
67 So an undent shifts in line an indent(?). Reduction nets the indents.
68 A newline forces the reduction until we find excess indents.
70 But what does grammar look like?
71 statement -> assignment ';'
73 So a newline will force the reduction to 'statement', but this will
77 statement -> assignment ';'
79 Then the NEWLINE can be shifted normally.
80 So newline forces reduction until it can be shifted or until ToS
81 contains an indent or started on a new line.
82 If we hit an irreducible state that does not accept NEWLINE when we
83 have to shift in an ERROR instead which might then be reducible.
84 When popping states we need to collect the indents.
87 Where do we attach the indent/undent info? To the state or to the
88 symbol? They are kept together but mean different things.
89 indent/undent are between terminals, just like states. But they
90 can be inside non-terminals which make them unlike states.
92 What "is" a state? It is a point between two symbols in a sentential
93 form. It doesn't carry information. So I think the indent has to go
95 An 'indent' should attach to the symbol *before* it as that is the
96 thing that is being continued. An 'undent' should also attach before
97 itself. Eventually those two terminals become included in a single
98 non-terminal and they collapse.
99 You can never have more undents than indents as a newline will force
100 reductions until top symbol has no undents.
101 In fact we reduce until the top symbol has a new indent, or is a
102 vertical production, or is explicitly allowed.
104 If we see an INDENT, we attach to symbol at ToS
105 If we see an UNDENT, we attach to symbol at ToS (possibly canceling indent)
106 If we see a Newline and current ToS holds +ve indent,
108 If look-ahead is expected, we shift it.
109 If we can reduce, we do so.
110 Else we trigger error handling:
111 pop states until one accepted ERROR - collecting indent
112 shift ERROR with those indents
113 Discard input, but process indent/undent, until next symbol is
118 Does this work? If I have
123 Then the NEWLINE after "c +" cannot be handled, but should not be an
127 where 'expr' contains an indent.
128 Maybe NEWLINE is ignored unless ToS has negative indent.
132 That must be an error.
135 1/ accepting NL must depend on what it will reduce.
136 2/ If UNDENT/INDENT appear together they must be kept separate.
138 So maybe UNDENT does block shifts and so forces reduction. If not
139 reduction can be made, that is an error
145 The NL UD NL before '}':
146 NL will ensure "b;" is reduced.
147 UD will reduce to command-list so we will have
149 if ( expr ) { commandlist
150 where '{' holds the indent. So '}' needs to bring in the undent,
151 which is after another NL.
153 So Undent waits for a symbol to collect it when shifted.
161 When we try to reduce "foo;while(b)bar;" to "statementlist" the
162 negative indent triggers an error. This a bit late though. I want it
163 when I see the 'while'. I want an error there which can be reduced a
165 The undent should close the statementlist so that '}' is the only
166 shiftable symbol. But after 'statementlist', 'while' is still
169 If we just waited for the 'while' statement to reduce and then noticed
170 the error, we could still report it against the 'while' token, but
171 recover would be a lot harder as we have already parsed the 'while'
172 into the 'wrong' place.
174 So when I see that token after NL,UD,NL, I need to be able to say
176 Could require help from grammar.
177 e.g. if a token is allowed to have a lesser indent than previous
178 token, then there must be a literal UNDENT token.
182 BlockA -> { Statementlist
184 That looks ugly, but the undent would close BlockA and allow the } to
188 Block -> { Statementlist OptNewline }
190 so the newline after the indent is shifted and doesn't
193 Then the UNDENT would ensure that the Statementlist were closed, the
194 Newline would be shifted, only a '}' would be allowed.
196 So: each symbol in a production must be in the same line as the
197 previous, or in an indented line.
199 No. OptNewline there wouldn't work as a blank line would take us to
202 I think I need to go back to what works... only it doesn't?
205 Indent is attached to previous symbol. I think that is easier that
206 keeping it at the 'start' of the 'next' symbol as I was doing.
208 Newline closes any productions that don't include an indent, then can
209 be shifted if wanted, or ignored.
211 Undent reduces things until ToS has an indent that can be canceled.
212 This means that if part of a production can have a negative relative
213 indent, it must be separate:
215 Blockhead -> { Statementlist
217 So the Undent before the '}' will reduce Blockhead and can then be
220 Block -> Blockhead ERROR
222 It would be nice if that can be auto-handled. e.g.
224 Block -> { Statementlist $undent }
228 That would be messy for the reduce action though.
229 Could each state "know"
232 A state can have different items with different offsets, so how much
233 of the stack is part of the 'current' production is not well defined.
234 It is not completely random either.
236 We can ignore non-kernel items as those are just potential productions
237 - they haven't started yet.
238 Of the remainder, many states have only one length.
239 Many others will have some offset and '1' for recursive
240 productions: A -> A . X
241 These can be called "potential" as well as we have no commitment.
243 Funcall -> Factor . ( ExprList )
246 Funcall -> Factor . ( ExprList )
247 Term -> Term * Factor .
250 i.e. just the bits that make the grammar distinctively LR.
251 In my case one is always reducible, is that important?
253 So when we see an Undent we find the max-index of the current state
254 and determine if there is a matching indent within that distance.
255 If there isn't then we must reduce or trigger an error.
257 Alternately: Each 'goto' from this state has a max-history.
258 We exclude any that don't have enough history to include balanced
260 If the next symbol would trigger one of those, it is an error.
263 For early error detection we record Indents against the previous
264 symbol, ignore newlines, and when we see an undent the next shift
265 must lead from a core item with enough history to keep balance.
267 If that is well defined it might catch early undents, but what about
272 I expect that is less likely, but still an error.
273 When we see the first ';' we will reduce to a statement and then a
274 statementlist. The second 'c' would lead us from
275 statementlist -> statementlist . statement
276 which is recursive and so doesn't allow indents (unless it started in
277 the middle of a line).
279 Intends are allowed anywhere except after the first symbol of a
280 left-recursive production which started at beginning-of-line.
286 is not allowed as 'Term->Term*Factor' is left-recursive. But why not?
288 So maybe an indent has to be allowed anywhere. Maybe a warning but
289 there is no real recovery option.
291 An undent is allowed in a state with a history back to the matching
293 If we see an undent, we check if the state can consume it. If not we
294 look to the next symbol which is not an undent (or newline) and reduce
295 until that symbol can be shifted. At this point the undent must have
298 Or easier: We must reduce to the matching indent before we shift,
299 unless a NEWLINE is explicitly allowed...
302 An indent is between two symbols, or within one symbol.
303 We hold a pending indent which gets attached to the front of the next
304 symbol that we shift.
305 Undents and newlines collect until we see a "real" symbol.
306 Any reductions that are then indicated happen and if the ToS symbol
307 is found to hold an indent, that cancels a pending undent.
308 Once we get to a state that can shift the lookahead we examine the
309 number of pending undents. It should be zero. i.e. we shouldn't
310 still be in the middle of a non-terminal that started at a greater
313 If it isn't zero we trigger an error because a close-bracket or
314 similar is missing. This might be different to "unrecognised-error"
315 The error should lead to reductions which cancel out the remaining
317 I think the parser needs to incorporate error handling into states.
318 Each state knows what symbol to synthesis to get closer to a
320 i.e. on 'no close' error, we reduce what we can, and synthesis when
323 So that seems to handle early undent.
324 We also have 'unexpected indent', 'missing indent' and 'late undent'.
327 A 'missing indent' happens when a production that we expect to be
328 one-line sees a newline rather than an indent.
333 The + after the newline only reduces 'b' to Expression which starts
334 in the middle of the line. A newline should see a reduction to a
335 nonterm which started at the beginning of a line, or which contains an
338 We trigger a 'no close' error here so that '+' is the start of a new
339 statement. Maybe if the next symbol is shiftable, we just report an
340 error and allow parsing to continue;
342 An 'unexpected indent' happens when the fully reduced non-terminal
343 started at the beginning of a line and is a recursive non-terminal.
344 Or maybe we are in a recursive state. This triggers an error
346 A 'late undent' is like an unexpected indent. The previous
347 fully-reduced non-terminal is recursive, started at the beginning of a
348 line, and still contains an indent.
351 Each symbol can carry a 'start-of-line' flag and an 'indent' count.
352 Each state may be flagged as 'recursive' if "A -> A . etc", and
353 records the symbol to synthesise to resolve a 'missing' error.
355 Current state includes an undent count, a nothing/newline/indent flag.
356 When look ahead contains undent, we increment that count
357 When look ahead contains newline we set newline flag
358 When look ahead contains indent, we set flag to indent.
359 When we reduce we combine indent counts in symbols and keep
360 start-of-line from first symbol.
361 If the ToS after a reduce contains indents and current state contain
362 undents, they are reduce so that one becomes zero (min is subtracted
364 Before we shift we must test:
365 - if undents are non-zero, we synthesis the symbol expected by the
366 state, we keep synthesising and reducing until the undent count
367 reaches zero. We report that the synthesised symbols were missing.
369 - If newline flag is set, the ToS symbol should start at SOL or
370 contain an indent. If it doesn't we just report a missing indent
373 - If indent flag is set and ToS symbol started at SOL and state is recursive
374 we report an unexpected indent.
376 - If state is recursive and ToS started a SoL and contains an indent,
377 we report an extra indent.
380 ---------------------
381 (I think I'm getting there).
383 An indent is allowed in a non-recursive that started on a newline,
384 or in a recursive (at the recursion point) that didn't.
386 The matching undent must be at the end of the production.
388 A newline is only allowed in a production that started at the start
389 of a line, but not always. - only if recursive or annotated.
391 An undent is required at the end of a production that contains an
395 An indent or newline found where not "allowed" generates an error
397 An undent not found where required generates an error and emergency
401 If we have an ambiguous grammar then the undent could simply
402 force a reduction. i.e. if the 'emergency reduction' just involves
403 choosing REDUCE of SHIFT, then it isn't an error or even a warning.
406 But where exactly is an indent or newline allowed - I don't think I
408 - a newline is definitely allowed in a list that started at the
409 beginning of a line. We need to structure the production right
410 so we don't end up with a separator at the start of a line.
411 LIST -> LIST SEP ELEMENT
413 would need to allow the newline before ELEMENT. The logic there
414 would be that if the first ELEMENT can be at start-of-line, other
418 Is appropriate if no separate is needed, and an empty list is OK.
419 Here ... does the indent go after the LIST or before the ELEMENT?
420 i.e. if there are more symbols, how do we know.
422 If the list already has an indent, then from there on it "started
423 at the beginning of a line"
424 This is treating it a lot like LL(1) which would be:
428 LIST -> ELEMENT SEP LIST
431 - a newline is definitely allowed in a non-list which has already
441 So: whatever holds the indent can also hold a newline.
443 i.e. a production can have a newline between symbols if an earlier
444 symbol has an indent before it.
449 which is probably OK.
451 What about indents? Can I have two indents in the one production?
452 No, I don't think so.
454 If an indent occurs *inside* a production, newlines can appear
456 If an indent occurs *in front* of a list production, newlines can
462 a symbol (on stack) can have an indent at the start, and can have a
463 number of indents within.
464 When we shift the first symbol after an indent, it gets an indent at
466 When we reduce the result inherit and indent at the start, and other
467 indents (within and at start) become within.
468 The indent at the start of a list can be treated as being within.
469 i.e. as list has an empty symbol at the from before the indent.
471 A newline is permitted if the max history for the current state
472 contains an indent. Indent at start of non-list doesn't count. At
473 start of list it does.
474 An indent is permitted if the max history does not contain an
477 An undent cancels indents within, or possibly before, the
478 immediately preceeding symbol.
481 Can I do all lists with empty start?
488 LIST -> LIST0 ELEMENT
492 // non-empty with separator
493 LIST -> LIST0 ELEMENT
497 // can be empty, sep is needed
503 Just stick "IsList -> " in front of anything that is a list.
506 For newline to be allowed, do indent have to be between symbols in
507 same production, or is within symbol ok?
510 b ) // now I need another indent.. or an undent?
513 The undent resets, then only an indent is allowed.
518 No reset, so extra indent. Anything is probably OK except newline.
519 So the indent cannot be inside a previous symbol.
520 But for lists, the indent will be exactly inside LIST.
521 So maybe inside first symbol is OK ??
531 I think I might be trying too hard. I should allow newlines
532 and indents anywhere, and give a clear interpretation for undents.
533 However my real goal is to know exactly what newlines and undents mean
534 and how to parse them.
535 i.e. does the grammar need to mention all of the, some of them, or
536 just be ambiguous and allow line break to disambiguate?
538 That last options doesn't really work as we cannot allow a separator
539 to be dropped anywhere by at end-of-line or before-close.
540 So grammar needs newlines. Various separators can be "sep or newline".
541 However I don't think indents need to be mentioned.
542 if condition : statementlist
543 needs undents to disambiguate I don't need to say that.
546 - indents get shifted with token
547 - indents are grouped by reduce
548 - and undent reduces until a matching indent in the top
549 state can be cancelled.
550 - newlines are messy.
551 We look ahead through possible reductions until we
552 either shift the newline or find an indent or cannot reduce.
553 If the shifted newline would not contain a newline in the production,
554 we choose that path. If we find an indent, we discard the newline.
555 If we cannot reduce we again ignore the newline
556 We could also discard a newline early if not in LA set.
564 Looks to the compiler like an addition and to the human like two assignments.
569 ?? Only allow newline if production started with indent.
571 If not expected and current production contains indent, discard.
572 If not expected and current production has no indent - error, recover.
573 If expected, test-reduce and discard if indent found else accept.
576 What if we just test-recover until we hit an indent or want to shift the newline?
578 Difference between statements and expressions is that a statement can
579 be terminated by a newline.
580 So if we find a newline and test-recover would use it, that is an
582 If we find a newline and test-recover would not use it before an
583 indent, it isn't an error.
585 So each state which cannot reduce has a link to the best state to shift to.
586 When we want to recovery we follow these links until they run
587 out. Then we can reduce which leads to a new state.
589 If we find a newline, we do this until we find a state which will take
590 a newline, or we cross an indent.
591 If we crossed an indent, we discard the newline.
592 If not and we has to take any shift steps, we call an error.
593 This requires that newlines aren't buried to deep.
595 If we find an undent, we do this until we have cancelled the undent
596 against an indent. If anything was shifted, we report an error.
598 No - using indent just to disambiguate an ambiguous grammar doesn't
599 work as it still allows confusing constructs.
603 Should be wrong as the 'else' looks to be part of the statement but
604 cannot be. It should be "unexpected". That really means that
605 "INDENT" "UNDENT" must be part of the grammar.
607 if COND : INDENT STATEMENTLIST UNDENT
608 which is where I started.
610 So an indent which is not expected protects all newlines and
611 indent/undent pairs until the matching undent.... not quite.
613 Indents can be expected in a state (if we compute lookahead).
614 If an indent is not expected we simply discard it and keep count.
615 When we shift an expected indent we push the count with it. When we
616 push the matching undent, we restore the count.
617 When we get an undent, we discard it if indent count is non-zero.
619 When we get a newline we look ahead to see if it would reduce across
620 an indent. If so we discard the newline. If not we keep it.
622 But an undent still needs to force reductions doesn't it?
624 i.e. if an undent doesn't cancel against top of stack we count it.
625 If we every try to shift while an undent is present, we throw an error
626 and shift other stuff until the undent is gone.
629 Indents must affect error handling. We no longer just discard tokens
630 left and right. The 'error' state for any indented token is the
631 matching undent. So we discard stack tokens until we get an error
632 state or an indented state.
633 If error, we shift the error and discard input until a token
634 matches. While discarding input we track indent/undent and
635 if we come to an unmatched undent we discard stack tokens again
637 If we hit an indented state we discard input until the matching undent
640 But do we really want to discard tokens, or can we synthesise tokens?
642 Discarding is standard. It gets to a known position. It allows any
643 location in a production to accept errors.
644 However discarding does discard data that cannot get further analysis.
645 If we use very fine grained error productions we might not miss much.
647 Come back to this later. For now just discard.
649 If we find an Indent on stack, an error "must" be expected.
651 block -> : INDENT Statementlist UNDENT
652 Statementlist -> ERROR
655 if we can shift ERROR, we do,
660 Indents and newlines.
662 Indents and newlines can be taken as strong syntactic marker, useful
663 both for regular parsing and error recovery.
664 Unlike other bracketing markers such as parentheses or paired key
665 words (BEGIN and END or DO and OD) indents cannot be lost. Every line
666 displays what the current indent level is and for every indent there
667 is precisely one location where the matching undent is (though
668 of course one location can have undents for multiple indents).
670 Similarly start-of-line and end-of-line can be strong bracketing
671 markers which cannot be missed.
673 The two can complement each other to provide a very effective
674 indication of the grouping that the programmer intends. Often the
675 programmer will want more explicit marking as well (e.g. "{" and "}" )
676 which is perfectly fine as redundancy of expression is an important
677 part of effective communication and this is equally true in writing a
680 The connection between the two is simply that indentation effects a
681 continuation of the previous line. For example in
686 Any bracketing implied by sol/eol should not be imposed around "A B
687 C" but only around "A ... E" - the indent effectively hides the line
689 So this could be tokenised as
690 sol A B C indent sol D E eol undent eol
692 Note that every SOL has a matching EOL and every INDENT a matching
695 As these markers are strongly paired they can be useful for error
696 detection and recovery. This is particularly true for indents.
698 It seems universally true that any syntactic unit (such a declaration,
699 statement, or expression) which starts indented will finish while
700 still indented. That means it will complete before the undent which
701 matches the most recent indent when it was started.
703 This fact is helpful but not quite as strong as we might like
711 "F G" could be in the same syntactic unit as "A..E". however if "D"
712 started a new subunit, "F G" certainly aren't part of that.
713 Assuming that "A" started a syntactic unit, "H" is the first symbol
714 that is clearly not part of it.
715 This isn't really a problem for
721 as the statementlist is the last part of the if-statement, so
722 there is no room in it for baz - baz must be separate.
730 This looks wrong but cannot be ruled out on the grounds of
732 Here the expression of
733 statement -> var "=" expression ";"
735 starts on the same line as the statement so there is no indent.
737 We could argue that a syntactic element (expression) that stated other
738 that at the start of a line may only continue to the end of that line
739 (when adjusted for indents). This can hit problems with:
747 block -> "{" statementlist "}"
748 starts other than at the beginning of a line, but continues on the
751 There are a few different ways we could try to deal with this:
753 a/ some special-case marking in the grammar to say that a newline is
754 permitted before the '}'
755 b/ Don't make a separate syntax element for "block":
756 ifstatement -> if condition { statementlist }
757 would mean that the '}' is at the same indent as the 'if'
758 which starts the element.
759 c/ decide that the last terminal of a production is a special
760 case and can go on a new line.
762 The last is almost certainly a bad ideas it would encourage
770 Another construct that is often used is:
778 where the "block" is allowed to start on a new line.
779 This potentially encourages 'block' to be a separate syntactic unit,
780 so 'b' above doesn't seem encouraging. That leave 'a' and marking
781 both the '{' and the '}' as being allowed to start a new line
784 If we assume these markings, then an undent can close any syntactic
785 unit that started at a greater indent, or at the same indent but not
788 - to detect errors by reducing early and then ensuring the next
790 e.g. the indent after 'bar;' would close the statementlist after
791 which only a '}' is expected. If it isn't there, that is an error.
792 - to handle ambiguous grammars. In the classic
793 statement -> if ( expression ) statement
794 | if ( expression ) statement else statement
795 the normal default is to treat 'else' as right-associative.
803 can be parsed with the 'else' applying to the 'if(x)' as you would
807 Now we turn to line markings - sol and eol.
808 The above suggested that any unit should be "on one line" (allowing
809 for indents) unless specially marked. This is bit excessive.
815 "b + c + d" is all one expression, but appear on two lines. Might
817 a[some_long_expression_with_function_call_etc] =
818 verylongling + otherverylongthing
821 might not seem so unlikely. This could easily be made correct by
822 moving the '=' to the start of the second line, or indenting the third
823 line slightly. It isn't certain that we want to do that, and
824 annotating that a newline is allowed before the 'plus' seems clumsy.
826 An alternate approach involves noting that 'statements' are rather
827 special. Not uniquely special but still special.
828 That is, statements are normally on one line. When they are two long
829 for a line the remainder is indented (make it part of the previous
831 When a statement appears on multiple lines it is a notable exception.
832 For example the Python
841 has three lines in one statement. This is a one-line statement which
842 has newlines before the 'elif' and 'else', or maybe 3 one-line objects
843 which can come together.
845 While a statement is line-based, and expression probably isn't. They
846 tend to appear all on one line, but that is because the statement they
849 In some sense the line-nature of the statement "captures" the
850 expression. An expression can only be on multiple lines if they are
851 all indented with respect to the containing statement.
853 So rather than marking some things as being allowed to continue to the
854 next line, we can mark some things as being line-shaped. These are
855 sometimes whole units and sometimes parts. e.g.
861 Note the missing '}' above - it is not part of the line shape.
863 if condition : statementlist
867 function name(argumentlist)
873 Being line-shaped just means that it must be "all on one line", but
874 not that it must be a line by self. Several line shaped things can
879 However it would be nice to allow a syntax modification when separate
884 The trailing semicolon is not needed. As the assignment is
885 line-shaped it must finish at the end of the line.
886 Declaring the semicolon to be optional would allow
888 which looks more like an error than a correct construct. Here the
891 A simple solution is to allow the NEWLINE to be part of the syntax
892 listed for the grammar:
897 Assignment -> lvalue = expression OptSemi
899 as a line-shaped item should work.
903 1/ If a point is 'newline allowed' then what follows can go
905 2/ If something is line-shaped then linebreaks may only be intended.
907 So in a production we can have 'optnewline' or 'optsemi'
908 anywhere but at the start. Or maybe 'at the start' means something a
911 statement -> if condition optnewline block
912 | if condition optnewline block optnewline else optnewline block
914 block -> : statementlist
915 | { statementlist optnewline }
918 If something is line-shaped, then the EOL that ends the line on which
919 it starts, must end the thing.
920 If something is not line-shaped, then an EOL can be safely ignored.
922 An LR state can expect NEWLINE if it has a goto state for it.
923 And it has an indent.
925 If we look up the stack and find an indent before a newline-expected
926 then a newline is ignored.
927 If we find a newline-expected before an indent, the newline is used.
928 If a state expects a newline and was indented then the newline
929 expectation comes last, is seen first when looking back, so newline is
932 Separately, an undent reduces until it can cancel with an indent.
943 This isn't so much about whether an indent is present, as how much of
944 an indent there should be. This sort of indent doesn't affect parsing
945 or help with error recovery so it is probably best left to esthetics.
946 We could possibly require that everything in the brackets be indented
947 w.r.t. line on which the brackets started so that
953 would not be allowed. However we would need to ensure that
957 were allowed (I think)
960 I've got the UNDENT reduce a bit wrong.
961 Anything that started with an indent must be reduced, not just one
962 thing that started there and everything after.
963 This is awkward as reducing back to 'statementlist' doesn't stop
964 adding to that statementlist, but it should. So we want e.g.
965 statementlist -> statements
966 statements -> statements statement
968 Then we reduce back to 'statementlist' which cannot accept another
969 statement. The parser generator could insert these extra non-terms
971 No that's not quite it either. If I have
976 then the undent after 'c;' needs to reduce the statement (which
977 is already reduced) but not the 'statementlist'. But there is
978 no extra state there. We aren't reducing back to a state, but back to
979 a symbol. The indent happened after a statementlist
981 A recursive symbol A is a problem if:
985 for non-empty Y. This results in a state:
989 If the Y is empty, X will reduce and B becomes unshiftable.
990 If this pattern is found we can create A'
995 Now 'A' only appears at the end of a production so there is no
999 Got it wrong again ...
1006 after 'bat', the recent indent is in Statements which makes it look
1007 like we need to close all of Statements. But we don't. As the indent
1008 stated in the middle of Statements we only need to close to it.
1009 So we really do need to know if an indent is at the start or in the
1011 We can cancel an 'in the middle' as soon as we find it. We can only
1012 cancel an at-the-start when reduce_size > 1 (as it is now 'in the
1017 Now back to newlines. It doesn't work very well to include newlines
1018 in the grammar wherever one is allowed as it conflicts with where they
1019 are required. E.g. for ';' to be optional we sometimes require one at
1020 the end of a statement, but for statements to be line-like we need to
1021 allow one at the start
1023 I think we need to allow NEWLINE in the grammar for EOL markings.
1024 Maybe we want SOL too?
1026 The important cases are:
1032 Statement is line-like so return to SOL not allowed, but expression is
1039 Both 'if ..' and 'else..' are line-like
1042 Newline at end of this line is allowed in place of ';'.
1044 If we have the grammar recognise all newlines that are not protected
1045 we need to support blank lines in lists.
1046 Statements -> Statements OptNL Statement
1049 No, I don't like that. EOL should be at the end.
1051 Some things can have an EOL at end, either optionally or required.
1052 Or elsewhere in the production.
1054 When we see an EOL we need to ask "Do I ignore this or use this?".
1055 The answer relates to indents and expected newlines.
1057 If a symbol can derive a sequence with an EOL, then the symbol is
1059 If that symbols after DOT in any item in a state are flagged
1060 'can-eol', then the state is flagged 'can-eol'.
1061 If a 'can-eol' state appear on the stack closer than a 'was indented'
1062 state, the we use the EOL token.
1063 If no states up to the first 'was indented' state are flagged
1064 'can-eol' then we discard the EOL token.
1067 That might work, but have I thought of all the issues?
1068 - blank lines. The grammar will need to allow for these explicitly
1069 in any context where EOL is expected. Not a big burden I guess.
1070 - force reductions ? If EOL closes things, does it close anything
1071 where not explicit? Probably not - that is sort of the point.
1072 If it appears where not expected but in line context, that is an
1073 error just like an "unexpected" error.
1075 - is the nesting 1-way? i.e. can I have line-oriented things in
1076 noline boxes. or "vertical stacks in horizontal boxes"?
1078 A possible example is a function call. It might be nice to allow
1083 There are two problem here.
1084 Firstly there is no EOL after 'first' so that cannot mark the end
1086 Secondly it forces an EOL interpretation on surrounding expressions.
1093 I would like to be able to drop the ',' after the '3' but it isn't
1094 clear that it is safe.
1100 Is a bit forced but with clear intent. I think it could be deemed
1101 as wrong because the think that wraps is not the first list element
1102 on that line. Is that a well defined concept?
1114 Recording the line structure allows for a number of different options. As mentioned already, a "NEWLINE" at the end of each line might do it. However as "IN" and "OUT" already imply a line break, we could just have "SAME" on breaks which don't have "IN" or "OUT". However this doesn't capture the idea that a line sometimes corresponds to a syntactic unit.
1116 Another approach is to follow the bracketing nature of "IN" and "OUT"
1117 and have "SOL" and "EOL" for start-of-line and end-of-line. As we
1118 noted already an indented section implies continuation and so should
1119 delay the end of line. If we follow this idea, the above structure
1120 could be tokenize as:
1125 Here every "SOL" has a matching "EOL" but it can be delayed until after an indented block. So the "EOL" that matches the first "SOL" comes after the "OUT". It is worth noting here and a single line break can have multiple "OUT"s and even an "IN" as well.
1133 After the "G" we would have "EOL OUT EOL OUT EOL IN SOL" before the "H". The three "EOL"s terminate the lines that start "F", "D", and "A" in that order, and a new "IN" is needed as "H" is indented with respect to "A".
1135 This tokenization captures the structure very effectively and is what
1136 I will be using with one simplification. The "SOL" really isn't
1137 needed. It is easily deduced and doesn't have a strong effect on the
1138 parser. The "EOL" may for a syntactic element to be closed. The
1139 "SOL" is simply recorded for reference, and that can be recorded when
1140 "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so
1141 "OUT" doesn't need to imply "SOL").
1143 This tokenization captures the structure very effectively and is what I will be using with one simplification. The "SOL" really isn't needed. It is easily deduced and doesn't have a strong effect on the parser. The "EOL" may for a syntactic element to be closed. The "SOL" is simply recorded for reference, and that can be recorded when "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so "OUT" doesn't need to imply "SOL").
1145 Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.
1149 When does an EOL actually mark and END.
1151 Now that we have a good understanding of IN and OUT we can turn to EOL. Intuitively EOL should mark the end of something much like OUT does. But what. Let's look at some cases:
1157 Here there are 3 EOL tokens, one after "Hello", one after "World" and one after the OUT token right at then end. The first two should close the statement on that line. The last probably shouldn't as there might be an "else:" still to come.
1163 There are 3 EOL token here in the same locations. Now the first two should not close anything, while the last one should close the statement.
1165 These two examples have very similar structure and are largely indistinguishable to the LR parser, and yet must be handled very differently. This implies that we need to provide the parser with extra information. We need to tell it something about the expected line structure of the language.
1167 To understand what to tell the parse we need to look into our (or my) intuition about why the newlines are or are not wanted in the different locations.
1169 I think the key observation is that statements are generally expected to be one-per-line, while there is no similar expectation for expressions. The earlier observation that an "else:" might still be expected shows that statements are always one-per-line, however that is a fairly well defined exception. i.e. there are certain placed in a statement where a newline is expected, or at least not a surprise. The start/end is one place. Before an "else" or "elif" is another. In C we might have:
1177 so an EOL is expected before a '{' or a '}' in that language.
1179 If we define all the places where an EOL is expected then we should be able to work out when an EOL should close a syntactic unit and when it should be ignore. However it isn't as simple as ignoring an EOL where ever it isn't expected. We need to include our understanding that indents protect EOL in some sense. In the second EOL example above (a = b + c ...) the first two EOL tokens were ignored in part because they were indent. If the second they were not ignore because despite being indented, they were expected. So the interplay between indents and EOL expectations needs to be understood.
1181 If we go back to our very first example:
1188 you might remember I suggested that if something started at "B C..." it should finish at "F". The "OUT" should close that as well. Our process described ..... not sure ..
1196 Also you need to know how the scanner can report indent information to the parser. I discuss that in some detail when introducing my lexical scanner but you probably don't need to read that now. It is sufficient to know that the token stream will have paired "INDENT" and "UNDENT" tokens, and that the "NEWLINE token that you might expect to come before an "INDENT" is delayed until after the matching "UNDENT".
1198 With this in place, it seems natural to allow the parser to absorb INDENT and UNDENT, but to require that when it reduces and production, the net indent absorbed while shifting the tokens which make up that production is zero. i.e for every INDENT there must be exactly one UNDENT. So:
1202 It might help to start with some examples.
1204 1. Lorem ipsum dolor sit amet, consectetur adipisicing
1205 2. elit, sed do eiusmod tempor
1206 3. incididunt ut labore et dolore
1207 4. magna aliqua. Ut enim ad minim
1208 5. veniam, quis nostrud exercitation ullamco
1209 6. laboris nisi ut aliquip ex ea commodo consequat.
1210 7. Duis aute irure dolor in reprehenderit in voluptate
1211 8. velit esse cillum dolore eu fugiat nulla pariatur.
1213 Here are 8 lines of random text - the line number are just for reference and should not be considered part of the text. Similarly any punctuation or capitals should be ignored when trying to understand the structure.
1215 Looking at just the newlines and indents it appears that line 1-5, line6, and lines 7-8 form three separate and equal units. Within the first of these, lines 2 and 3-4 are again to separate and equal units. Line 5 is not part of whatever 2,3-4 are but is something else that is part of 1-5.
1217 Where that structural analysis is actually correct cannot be determined without knowledge of the meaning, so lets try something more meaningful.
1224 Now we can clearly see that the continued lines are all part of what was started on the first line. The last line is less indented than the others much like line 5 in the first example. We took that to mean it is not a continuation 2nd and 3rd lines but only of the first, and this example bears that out. The lack of a comma at the end of the first 3 lines are not lead to any ambiguity. It is clear from the structure that something is completed. It cannot be the whole list that is completed as that started on the first line and we are still indented with respect to there, so it must be a "smaller" or more local syntactic unit that has completed. i.e. the list entry.
1226 Here is a different example using the same structure.
1233 This looks very wrong. Breaking the first assignment looks bad, and having the second assignment start with the same indent as the continuation of the first is also wrong. Yet the list begun, but not finished, of the first line of the first example looks fine. Maybe it is the amount of indenting that makes a difference. We can try adjusting that and see:
1244 certainly the amount of indenting helped in the first example, but fixing it doesn't really fix the second example.
1249 I seem to be stuck here...
1252 When I see a newline, I must either IGNORE it or SHIFT it or REDUCE
1253 The key question is: when to ignore it?
1255 We ignore newlines when they are protected by an indent.
1256 What does "protected" mean? It means there is already an internal indent
1257 in any symbol which can derive a newline.
1259 So in the above examples, only newlines are expected at end of
1260 statement (with or without ';') and after '{'.
1262 So for all the newlines seen in a = [..] the closest symbol that derives
1263 newline is 'statement' and it has an internal indent. So they are ignored.
1265 In the "if a = 1 { ..." The first newline is after "+ c;". We are in
1266 statement (or statementlist) which derives newline and that has internal
1267 indent so newline is ignored. But I don't want it ignored.
1270 When we see a newline, we find the closest reducible thing that
1271 can contain a newline. If it started at the start of a line and
1272 contains an internal indent, then we can ignore the newline.
1274 That has potential: a line-like thing can only be broken if it
1275 starts at the beginning of a line.
1277 But can we implement this in an LR grammar? We don't know what context
1278 we are in until we reduce into that context. So how do we determine if
1279 we are in a line-like thing?
1281 Each symbol has a 'can_eol' flag. EOL gets that flag, and any
1282 non-terminal which can produce an 'can_eol' symbol gets the flag too.
1283 So 'statement' has it but 'expression' does not.
1285 Each state has an 'expect_eol' flag. If is set if the 'next' symbol of
1286 any item in the state has can_eol set.
1287 OR : It is set if any the head of any production in the state has can_eol set.
1289 Now recall that the state stack is conceptually:
1290 STATE1 SYM1 STATE2 SYM2 STATE3 SYM3
1291 Each STATEn is expecting all the following syms to eventually be reduce
1292 to something which can then advance that state.
1293 So if STATEn has 'expect_eol', then everything following is part of
1294 something that maybe could have a newline in it.
1296 So if STATEk has expect_eol then we ignore newline precisely if
1297 SYMk was at start-of-line and somewhere from SYMk on there is an
1298 internal indent. So we need a new start-of-line flag in the frame.
1300 Is expect_eol precise enough? I'm wondering about a production
1301 that has a newline in the middle. What happens then?
1302 After the newline has been shifted the whole won't be reduced so that
1303 expect_eol state will still be there so we can still be ignoring
1306 I think we declare that a production which can contain a newline but
1307 cannot end with a newline is an oddity which we don't support.
1309 Similarly if any production from a symbol can contain a newline,
1310 then every production is assumed to be line-like and so require an
1311 indent for newlines to be ignored.
1314 "Newlines are ignored only inside line-like symbols which started at
1315 the beginning of a line and have since had an indent"
1317 Possibly good, but a grammar with no newlines would not allow
1318 newlines. Maybe that doesn't matter - the lexer can ignore them.
1319 Can we turn it around though
1321 "Newlines are only shifted if the nearest line-like thing does not
1322 have an internal indent."
1323 so "They are ignored if there is no linelike thing, or it holds an indent
1325 Also "An internal indent in a linelike thing that started midline is
1329 That didn't work - 'expect_eol' definition doesn't make sense in that way.
1332 expect_eol is set if *all* of the *next* symbols in core item
1334 When a core item does not have a 'next' symbol we ignore that item.
1336 Looks good, but now a new issue appears.
1337 If a production is reduced by an OUT indent, the following NEWLINE
1338 doesn't have a home. If expect an optional newline there it might
1339 use the newline before the outdent (which could have been ignored
1340 otherwise??). and worse, it can create a reduce-reduce conflict
1341 because the optional newline could terminate inner or outer statement.
1343 This is particularly a problem with
1344 Block -> : Statementlist
1345 Statement -> if Expression Block else Block
1347 After an OUT reduces ": Statementlist" to Block there is a NEWLINE
1348 but we need to see the "else". If we indent the 'else' it works nicely.
1349 I want to tell that a NEWLINE is needed there. That means I must put one
1350 after "Statement -> if Expression Block" to avoid a reduce/reduce conflict.
1353 After trying to describe this I need to refine it again.
1354 - A symbol is "can_eol" if any symbol in any production from this symbol
1355 is followed only by nullable symbols.
1356 - A state is "starts_line" if the previous symbol in any item "can_eol",
1357 or is nullable and previous to that "can_eol", etc.
1358 - The start state is "starts_line" if the start symbol "can_eol".