1 How to write up two-dimensional parsing?
5 - If an indent is ignored the matching undent is ignored
6 except to decrement the 'ignored' stack and to turn
7 a pending indent into a pending newline.
8 or to count the pending undents.
10 - If a newline is found and cannot be shifted yet then we check
11 each production that would close before we shift, and if any still
12 have pending undents, we ignore the newline
14 - when we reduce a production, we 'resolve' indents wich may leave
15 something pending. Either an indent or a number of undents?
17 - productions can be recursive if first body symbol matches head.
18 A recursive production may be horizontal if it starts in the middle
19 of a line. non-recursive productions are always horizontal.
20 A horizontal production must be all on one line or have its body
22 A vertical production must not be indented.
25 Maybe I don't want to mention indents in the grammar ?
27 if CONDITION : STATEMENTLIST
29 doesn't have to be indented but we will only reduce it when we
30 see something that doesn't belong.... which could be an UNDENT
32 So: when look-ahead is UNDENT we cannot shift so we reduce until
33 the top state has an indent attached, at which point we consume the
42 the undent will reduce until we just have the ifstatement which
44 So any indent is pushed onto the stack and a reduction sums all the
47 What about newlines? A newline will reduce any production that
48 doesn't contain an indent. It gets discard when top of stack
49 was at start of line...
51 So each 'state' also encodes
52 start-of-line or middle-of-line
56 When an INDENT is seen the current state increments count of indents.
57 When an UNDENT is seen we must reduce until tos state has an indent,
59 When a NEWLINE is seen, we reduce until tos state has indents or was
63 UNDENT shouldn't force a reduce. Only newline should do that. But as
64 an undent is always followed by a newline, it still results in a
65 reduce happening where I want it.
67 So an undent shifts in line an indent(?). Reduction nets the indents.
68 A newline forces the reduction until we find excess indents.
70 But what does grammar look like?
71 statement -> assignment ';'
73 So a newline will force the reduction to 'statement', but this will
77 statement -> assignment ';'
79 Then the NEWLINE can be shifted normally.
80 So newline forces reduction until it can be shifted or until ToS
81 contains an indent or started on a new line.
82 If we hit an irreducible state that does not accept NEWLINE when we
83 have to shift in an ERROR instead which might then be reducible.
84 When popping states we need to collect the indents.
87 Where do we attach the indent/undent info? To the state or to the
88 symbol? They are kept together but mean different things.
89 indent/undent are between terminals, just like states. But they
90 can be inside non-terminals which make them unlike states.
92 What "is" a state? It is a point between two symbols in a sentential
93 form. It doesn't carry information. So I think the indent has to go
95 An 'indent' should attach to the symbol *before* it as that is the
96 thing that is being continued. An 'undent' should also attach before
97 itself. Eventually those two terminals become included in a single
98 non-terminal and they collapse.
99 You can never have more undents than indents as a newline will force
100 reductions until top symbol has no undents.
101 In fact we reduce until the top symbol has a new indent, or is a
102 vertical production, or is explicitly allowed.
104 If we see an INDENT, we attach to symbol at ToS
105 If we see an UNDENT, we attach to symbol at ToS (possibly canceling indent)
106 If we see a Newline and current ToS holds +ve indent,
108 If look-ahead is expected, we shift it.
109 If we can reduce, we do so.
110 Else we trigger error handling:
111 pop states until one accepted ERROR - collecting indent
112 shift ERROR with those indents
113 Discard input, but process indent/undent, until next symbol is
118 Does this work? If I have
123 Then the NEWLINE after "c +" cannot be handled, but should not be an
127 where 'expr' contains an indent.
128 Maybe NEWLINE is ignored unless ToS has negative indent.
132 That must be an error.
135 1/ accepting NL must depend on what it will reduce.
136 2/ If UNDENT/INDENT appear together they must be kept separate.
138 So maybe UNDENT does block shifts and so forces reduction. If not
139 reduction can be made, that is an error
145 The NL UD NL before '}':
146 NL will ensure "b;" is reduced.
147 UD will reduce to command-list so we will have
149 if ( expr ) { commandlist
150 where '{' holds the indent. So '}' needs to bring in the undent,
151 which is after another NL.
153 So Undent waits for a symbol to collect it when shifted.
161 When we try to reduce "foo;while(b)bar;" to "statementlist" the
162 negative indent triggers an error. This a bit late though. I want it
163 when I see the 'while'. I want an error there which can be reduced a
165 The undent should close the statementlist so that '}' is the only
166 shiftable symbol. But after 'statementlist', 'while' is still
169 If we just waited for the 'while' statement to reduce and then noticed
170 the error, we could still report it against the 'while' token, but
171 recover would be a lot harder as we have already parsed the 'while'
172 into the 'wrong' place.
174 So when I see that token after NL,UD,NL, I need to be able to say
176 Could require help from grammar.
177 e.g. if a token is allowed to have a lesser indent than previous
178 token, then there must be a literal UNDENT token.
182 BlockA -> { Statementlist
184 That looks ugly, but the undent would close BlockA and allow the } to
188 Block -> { Statementlist OptNewline }
190 so the newline after the indent is shifted and doesn't
193 Then the UNDENT would ensure that the Statementlist were closed, the
194 Newline would be shifted, only a '}' would be allowed.
196 So: each symbol in a production must be in the same line as the
197 previous, or in an indented line.
199 No. OptNewline there wouldn't work as a blank line would take us to
202 I think I need to go back to what works... only it doesn't?
205 Indent is attached to previous symbol. I think that is easier that
206 keeping it at the 'start' of the 'next' symbol as I was doing.
208 Newline closes any productions that don't include an indent, then can
209 be shifted if wanted, or ignored.
211 Undent reduces things until ToS has an indent that can be canceled.
212 This means that if part of a production can have a negative relative
213 indent, it must be separate:
215 Blockhead -> { Statementlist
217 So the Undent before the '}' will reduce Blockhead and can then be
220 Block -> Blockhead ERROR
222 It would be nice if that can be auto-handled. e.g.
224 Block -> { Statementlist $undent }
228 That would be messy for the reduce action though.
229 Could each state "know"
232 A state can have different items with different offsets, so how much
233 of the stack is part of the 'current' production is not well defined.
234 It is not completely random either.
236 We can ignore non-kernel items as those are just potential productions
237 - they haven't started yet.
238 Of the remainder, many states have only one length.
239 Many others will have some offset and '1' for recursive
240 productions: A -> A . X
241 These can be called "potential" as well as we have no commitment.
243 Funcall -> Factor . ( ExprList )
246 Funcall -> Factor . ( ExprList )
247 Term -> Term * Factor .
250 i.e. just the bits that make the grammar distinctively LR.
251 In my case one is always reducible, is that important?
253 So when we see an Undent we find the max-index of the current state
254 and determine if there is a matching indent within that distance.
255 If there isn't then we must reduce or trigger an error.
257 Alternately: Each 'goto' from this state has a max-history.
258 We exclude any that don't have enough history to include balanced
260 If the next symbol would trigger one of those, it is an error.
263 For early error detection we record Indents against the previous
264 symbol, ignore newlines, and when we see an undent the next shift
265 must lead from a core item with enough history to keep balance.
267 If that is well defined it might catch early undents, but what about
272 I expect that is less likely, but still an error.
273 When we see the first ';' we will reduce to a statement and then a
274 statementlist. The second 'c' would lead us from
275 statementlist -> statementlist . statement
276 which is recursive and so doesn't allow indents (unless it started in
277 the middle of a line).
279 Intends are allowed anywhere except after the first symbol of a
280 left-recursive production which started at beginning-of-line.
286 is not allowed as 'Term->Term*Factor' is left-recursive. But why not?
288 So maybe an indent has to be allowed anywhere. Maybe a warning but
289 there is no real recovery option.
291 An undent is allowed in a state with a history back to the matching
293 If we see an undent, we check if the state can consume it. If not we
294 look to the next symbol which is not an undent (or newline) and reduce
295 until that symbol can be shifted. At this point the undent must have
298 Or easier: We must reduce to the matching indent before we shift,
299 unless a NEWLINE is explicitly allowed...
302 An indent is between two symbols, or within one symbol.
303 We hold a pending indent which gets attached to the front of the next
304 symbol that we shift.
305 Undents and newlines collect until we see a "real" symbol.
306 Any reductions that are then indicated happen and if the ToS symbol
307 is found to hold an indent, that cancels a pending undent.
308 Once we get to a state that can shift the lookahead we examine the
309 number of pending undents. It should be zero. i.e. we shouldn't
310 still be in the middle of a non-terminal that started at a greater
313 If it isn't zero we trigger an error because a close-bracket or
314 similar is missing. This might be different to "unrecognised-error"
315 The error should lead to reductions which cancel out the remaining
317 I think the parser needs to incorporate error handling into states.
318 Each state knows what symbol to synthesis to get closer to a
320 i.e. on 'no close' error, we reduce what we can, and synthesis when
323 So that seems to handle early undent.
324 We also have 'unexpected indent', 'missing indent' and 'late undent'.
327 A 'missing indent' happens when a production that we expect to be
328 one-line sees a newline rather than an indent.
333 The + after the newline only reduces 'b' to Expression which starts
334 in the middle of the line. A newline should see a reduction to a
335 nonterm which started at the beginning of a line, or which contains an
338 We trigger a 'no close' error here so that '+' is the start of a new
339 statement. Maybe if the next symbol is shiftable, we just report an
340 error and allow parsing to continue;
342 An 'unexpected indent' happens when the fully reduced non-terminal
343 started at the beginning of a line and is a recursive non-terminal.
344 Or maybe we are in a recursive state. This triggers an error
346 A 'late undent' is like an unexpected indent. The previous
347 fully-reduced non-terminal is recursive, started at the beginning of a
348 line, and still contains an indent.
351 Each symbol can carry a 'start-of-line' flag and an 'indent' count.
352 Each state may be flagged as 'recursive' if "A -> A . etc", and
353 records the symbol to synthesise to resolve a 'missing' error.
355 Current state includes an undent count, a nothing/newline/indent flag.
356 When look ahead contains undent, we increment that count
357 When look ahead contains newline we set newline flag
358 When look ahead contains indent, we set flag to indent.
359 When we reduce we combine indent counts in symbols and keep
360 start-of-line from first symbol.
361 If the ToS after a reduce contains indents and current state contain
362 undents, they are reduce so that one becomes zero (min is subtracted
364 Before we shift we must test:
365 - if undents are non-zero, we synthesis the symbol expected by the
366 state, we keep synthesising and reducing until the undent count
367 reaches zero. We report that the synthesised symbols were missing.
369 - If newline flag is set, the ToS symbol should start at SOL or
370 contain an indent. If it doesn't we just report a missing indent
373 - If indent flag is set and ToS symbol started at SOL and state is recursive
374 we report an unexpected indent.
376 - If state is recursive and ToS started a SoL and contains an indent,
377 we report an extra indent.
380 ---------------------
381 (I think I'm getting there).
383 An indent is allowed in a non-recursive that started on a newline,
384 or in a recursive (at the recursion point) that didn't.
386 The matching undent must be at the end of the production.
388 A newline is only allowed in a production that started at the start
389 of a line, but not always. - only if recursive or annotated.
391 An undent is required at the end of a production that contains an
395 An indent or newline found where not "allowed" generates an error
397 An undent not found where required generates an error and emergency
401 If we have an ambiguous grammar then the undent could simply
402 force a reduction. i.e. if the 'emergency reduction' just involves
403 choosing REDUCE of SHIFT, then it isn't an error or even a warning.
406 But where exactly is an indent or newline allowed - I don't think I
408 - a newline is definitely allowed in a list that started at the
409 beginning of a line. We need to structure the production right
410 so we don't end up with a separator at the start of a line.
411 LIST -> LIST SEP ELEMENT
413 would need to allow the newline before ELEMENT. The logic there
414 would be that if the first ELEMENT can be at start-of-line, other
418 Is appropriate if no separate is needed, and an empty list is OK.
419 Here ... does the indent go after the LIST or before the ELEMENT?
420 i.e. if there are more symbols, how do we know.
422 If the list already has an indent, then from there on it "started
423 at the beginning of a line"
424 This is treating it a lot like LL(1) which would be:
428 LIST -> ELEMENT SEP LIST
431 - a newline is definitely allowed in a non-list which has already
441 So: whatever holds the indent can also hold a newline.
443 i.e. a production can have a newline between symbols if an earlier
444 symbol has an indent before it.
449 which is probably OK.
451 What about indents? Can I have two indents in the one production?
452 No, I don't think so.
454 If an indent occurs *inside* a production, newlines can appear
456 If an indent occurs *in front* of a list production, newlines can
462 a symbol (on stack) can have an indent at the start, and can have a
463 number of indents within.
464 When we shift the first symbol after an indent, it gets an indent at
466 When we reduce the result inherit and indent at the start, and other
467 indents (within and at start) become within.
468 The indent at the start of a list can be treated as being within.
469 i.e. as list has an empty symbol at the from before the indent.
471 A newline is permitted if the max history for the current state
472 contains an indent. Indent at start of non-list doesn't count. At
473 start of list it does.
474 An indent is permitted if the max history does not contain an
477 An undent cancels indents within, or possibly before, the
478 immediately preceeding symbol.
481 Can I do all lists with empty start?
488 LIST -> LIST0 ELEMENT
492 // non-empty with separator
493 LIST -> LIST0 ELEMENT
497 // can be empty, sep is needed
503 Just stick "IsList -> " in front of anything that is a list.
506 For newline to be allowed, do indent have to be between symbols in
507 same production, or is within symbol ok?
510 b ) // now I need another indent.. or an undent?
513 The undent resets, then only an indent is allowed.
518 No reset, so extra indent. Anything is probably OK except newline.
519 So the indent cannot be inside a previous symbol.
520 But for lists, the indent will be exactly inside LIST.
521 So maybe inside first symbol is OK ??
531 I think I might be trying too hard. I should allow newlines
532 and indents anywhere, and give a clear interpretation for undents.
533 However my real goal is to know exactly what newlines and undents mean
534 and how to parse them.
535 i.e. does the grammar need to mention all of the, some of them, or
536 just be ambiguous and allow line break to disambiguate?
538 That last options doesn't really work as we cannot allow a separator
539 to be dropped anywhere by at end-of-line or before-close.
540 So grammar needs newlines. Various separators can be "sep or newline".
541 However I don't think indents need to be mentioned.
542 if condition : statementlist
543 needs undents to disambiguate I don't need to say that.
546 - indents get shifted with token
547 - indents are grouped by reduce
548 - and undent reduces until a matching indent in the top
549 state can be cancelled.
550 - newlines are messy.
551 We look ahead through possible reductions until we
552 either shift the newline or find an indent or cannot reduce.
553 If the shifted newline would not contain a newline in the production,
554 we choose that path. If we find an indent, we discard the newline.
555 If we cannot reduce we again ignore the newline
556 We could also discard a newline early if not in LA set.
564 Looks to the compiler like an addition and to the human like two assignments.
569 ?? Only allow newline if production started with indent.
571 If not expected and current production contains indent, discard.
572 If not expected and current production has no indent - error, recover.
573 If expected, test-reduce and discard if indent found else accept.
576 What if we just test-recover until we hit an indent or want to shift the newline?
578 Difference between statements and expressions is that a statement can
579 be terminated by a newline.
580 So if we find a newline and test-recover would use it, that is an
582 If we find a newline and test-recover would not use it before an
583 indent, it isn't an error.
585 So each state which cannot reduce has a link to the best state to shift to.
586 When we want to recovery we follow these links until they run
587 out. Then we can reduce which leads to a new state.
589 If we find a newline, we do this until we find a state which will take
590 a newline, or we cross an indent.
591 If we crossed an indent, we discard the newline.
592 If not and we has to take any shift steps, we call an error.
593 This requires that newlines aren't buried to deep.
595 If we find an undent, we do this until we have cancelled the undent
596 against an indent. If anything was shifted, we report an error.
598 No - using indent just to disambiguate an ambiguous grammar doesn't
599 work as it still allows confusing constructs.
603 Should be wrong as the 'else' looks to be part of the statement but
604 cannot be. It should be "unexpected". That really means that
605 "INDENT" "UNDENT" must be part of the grammar.
607 if COND : INDENT STATEMENTLIST UNDENT
608 which is where I started.
610 So an indent which is not expected protects all newlines and
611 indent/undent pairs until the matching undent.... not quite.
613 Indents can be expected in a state (if we compute lookahead).
614 If an indent is not expected we simply discard it and keep count.
615 When we shift an expected indent we push the count with it. When we
616 push the matching undent, we restore the count.
617 When we get an undent, we discard it if indent count is non-zero.
619 When we get a newline we look ahead to see if it would reduce across
620 an indent. If so we discard the newline. If not we keep it.
622 But an undent still needs to force reductions doesn't it?
624 i.e. if an undent doesn't cancel against top of stack we count it.
625 If we every try to shift while an undent is present, we throw an error
626 and shift other stuff until the undent is gone.
629 Indents must affect error handling. We no longer just discard tokens
630 left and right. The 'error' state for any indented token is the
631 matching undent. So we discard stack tokens until we get an error
632 state or an indented state.
633 If error, we shift the error and discard input until a token
634 matches. While discarding input we track indent/undent and
635 if we come to an unmatched undent we discard stack tokens again
637 If we hit an indented state we discard input until the matching undent
640 But do we really want to discard tokens, or can we synthesise tokens?
642 Discarding is standard. It gets to a known position. It allows any
643 location in a production to accept errors.
644 However discarding does discard data that cannot get further analysis.
645 If we use very fine grained error productions we might not miss much.
647 Come back to this later. For now just discard.
649 If we find an Indent on stack, an error "must" be expected.
651 block -> : INDENT Statementlist UNDENT
652 Statementlist -> ERROR
655 if we can shift ERROR, we do,
660 Indents and newlines.
662 Indents and newlines can be taken as strong syntactic marker, useful
663 both for regular parsing and error recovery.
664 Unlike other bracketing markers such as parentheses or paired key
665 words (BEGIN and END or DO and OD) indents cannot be lost. Every line
666 displays what the current indent level is and for every indent there
667 is precisely one location where the matching undent is (though
668 of course one location can have undents for multiple indents).
670 Similarly start-of-line and end-of-line can be strong bracketing
671 markers which cannot be missed.
673 The two can complement each other to provide a very effective
674 indication of the grouping that the programmer intends. Often the
675 programmer will want more explicit marking as well (e.g. "{" and "}" )
676 which is perfectly fine as redundancy of expression is an important
677 part of effective communication and this is equally true in writing a
680 The connection between the two is simply that indentation effects a
681 continuation of the previous line. For example in
686 Any bracketing implied by sol/eol should not be imposed around "A B
687 C" but only around "A ... E" - the indent effectively hides the line
689 So this could be tokenised as
690 sol A B C indent sol D E eol undent eol
692 Note that every SOL has a matching EOL and every INDENT a matching
695 As these markers are strongly paired they can be useful for error
696 detection and recovery. This is particularly true for indents.
698 It seems universally true that any syntactic unit (such a declaration,
699 statement, or expression) which starts indented will finish while
700 still indented. That means it will complete before the undent which
701 matches the most recent indent when it was started.
703 This fact is helpful but not quite as strong as we might like
711 "F G" could be in the same syntactic unit as "A..E". however if "D"
712 started a new subunit, "F G" certainly aren't part of that.
713 Assuming that "A" started a syntactic unit, "H" is the first symbol
714 that is clearly not part of it.
715 This isn't really a problem for
721 as the statementlist is the last part of the if-statement, so
722 there is no room in it for baz - baz must be separate.
730 This looks wrong but cannot be ruled out on the grounds of
732 Here the expression of
733 statement -> var "=" expression ";"
735 starts on the same line as the statement so there is no indent.
737 We could argue that a syntactic element (expression) that stated other
738 that at the start of a line may only continue to the end of that line
739 (when adjusted for indents). This can hit problems with:
747 block -> "{" statementlist "}"
748 starts other than at the beginning of a line, but continues on the
751 There are a few different ways we could try to deal with this:
753 a/ some special-case marking in the grammar to say that a newline is
754 permitted before the '}'
755 b/ Don't make a separate syntax element for "block":
756 ifstatement -> if condition { statementlist }
757 would mean that the '}' is at the same indent as the 'if'
758 which starts the element.
759 c/ decide that the last terminal of a production is a special
760 case and can go on a new line.
762 The last is almost certainly a bad ideas it would encourage
770 Another construct that is often used is:
778 where the "block" is allowed to start on a new line.
779 This potentially encourages 'block' to be a separate syntactic unit,
780 so 'b' above doesn't seem encouraging. That leave 'a' and marking
781 both the '{' and the '}' as being allowed to start a new line
784 If we assume these markings, then an undent can close any syntactic
785 unit that started at a greater indent, or at the same indent but not
788 - to detect errors by reducing early and then ensuring the next
790 e.g. the indent after 'bar;' would close the statementlist after
791 which only a '}' is expected. If it isn't there, that is an error.
792 - to handle ambiguous grammars. In the classic
793 statement -> if ( expression ) statement
794 | if ( expression ) statement else statement
795 the normal default is to treat 'else' as right-associative.
803 can be parsed with the 'else' applying to the 'if(x)' as you would
807 Now we turn to line markings - sol and eol.
808 The above suggested that any unit should be "on one line" (allowing
809 for indents) unless specially marked. This is bit excessive.
815 "b + c + d" is all one expression, but appear on two lines. Might
817 a[some_long_expression_with_function_call_etc] =
818 verylongling + otherverylongthing
821 might not seem so unlikely. This could easily be made correct by
822 moving the '=' to the start of the second line, or indenting the third
823 line slightly. It isn't certain that we want to do that, and
824 annotating that a newline is allowed before the 'plus' seems clumsy.
826 An alternate approach involves noting that 'statements' are rather
827 special. Not uniquely special but still special.
828 That is, statements are normally on one line. When they are two long
829 for a line the remainder is indented (make it part of the previous
831 When a statement appears on multiple lines it is a notable exception.
832 For example the Python
841 has three lines in one statement. This is a one-line statement which
842 has newlines before the 'elif' and 'else', or maybe 3 one-line objects
843 which can come together.
845 While a statement is line-based, and expression probably isn't. They
846 tend to appear all on one line, but that is because the statement they
849 In some sense the line-nature of the statement "captures" the
850 expression. An expression can only be on multiple lines if they are
851 all indented with respect to the containing statement.
853 So rather than marking some things as being allowed to continue to the
854 next line, we can mark some things as being line-shaped. These are
855 sometimes whole units and sometimes parts. e.g.
861 Note the missing '}' above - it is not part of the line shape.
863 if condition : statementlist
867 function name(argumentlist)
873 Being line-shaped just means that it must be "all on one line", but
874 not that it must be a line by self. Several line shaped things can
879 However it would be nice to allow a syntax modification when separate
884 The trailing semicolon is not needed. As the assignment is
885 line-shaped it must finish at the end of the line.
886 Declaring the semicolon to be optional would allow
888 which looks more like an error than a correct construct. Here the
891 A simple solution is to allow the NEWLINE to be part of the syntax
892 listed for the grammar:
897 Assignment -> lvalue = expression OptSemi
899 as a line-shaped item should work.
903 1/ If a point is 'newline allowed' then what follows can go
905 2/ If something is line-shaped then linebreaks may only be intended.
907 So in a production we can have 'optnewline' or 'optsemi'
908 anywhere but at the start. Or maybe 'at the start' means something a
911 statement -> if condition optnewline block
912 | if condition optnewline block optnewline else optnewline block
914 block -> : statementlist
915 | { statementlist optnewline }
918 If something is line-shaped, then the EOL that ends the line on which
919 it starts, must end the thing.
920 If something is not line-shaped, then an EOL can be safely ignored.
922 An LR state can expect NEWLINE if it has a goto state for it.
923 And it has an indent.
925 If we look up the stack and find an indent before a newline-expected
926 then a newline is ignored.
927 If we find a newline-expected before an indent, the newline is used.
928 If a state expects a newline and was indented then the newline
929 expectation comes last, is seen first when looking back, so newline is
932 Separately, an undent reduces until it can cancel with an indent.
943 This isn't so much about whether an indent is present, as how much of
944 an indent there should be. This sort of indent doesn't affect parsing
945 or help with error recovery so it is probably best left to esthetics.
946 We could possibly require that everything in the brackets be indented
947 w.r.t. line on which the brackets started so that
953 would not be allowed. However we would need to ensure that
957 were allowed (I think)
960 I've got the UNDENT reduce a bit wrong.
961 Anything that started with an indent must be reduced, not just one
962 thing that started there and everything after.
963 This is awkward as reducing back to 'statementlist' doesn't stop
964 adding to that statementlist, but it should. So we want e.g.
965 statementlist -> statements
966 statements -> statements statement
968 Then we reduce back to 'statementlist' which cannot accept another
969 statement. The parser generator could insert these extra non-terms
971 No that's not quite it either. If I have
976 then the undent after 'c;' needs to reduce the statement (which
977 is already reduced) but not the 'statementlist'. But there is
978 no extra state there. We aren't reducing back to a state, but back to
979 a symbol. The indent happened after a statementlist
981 A recursive symbol A is a problem if:
985 for non-empty Y. This results in a state:
989 If the Y is empty, X will reduce and B becomes unshiftable.
990 If this pattern is found we can create A'
995 Now 'A' only appears at the end of a production so there is no
999 Got it wrong again ...
1006 after 'bat', the recent indent is in Statements which makes it look
1007 like we need to close all of Statements. But we don't. As the indent
1008 stated in the middle of Statements we only need to close to it.
1009 So we really do need to know if an indent is at the start or in the
1011 We can cancel an 'in the middle' as soon as we find it. We can only
1012 cancel an at-the-start when reduce_size > 1 (as it is now 'in the
1017 Now back to newlines. It doesn't work very well to include newlines
1018 in the grammar wherever one is allowed as it conflicts with where they
1019 are required. E.g. for ';' to be optional we sometimes require one at
1020 the end of a statement, but for statements to be line-like we need to
1021 allow one at the start
1023 I think we need to allow NEWLINE in the grammar for EOL markings.
1024 Maybe we want SOL too?
1026 The important cases are:
1032 Statement is line-like so return to SOL not allowed, but expression is
1039 Both 'if ..' and 'else..' are line-like
1042 Newline at end of this line is allowed in place of ';'.
1044 If we have the grammar recognise all newlines that are not protected
1045 we need to support blank lines in lists.
1046 Statements -> Statements OptNL Statement
1049 No, I don't like that. EOL should be at the end.
1051 Some things can have an EOL at end, either optionally or required.
1052 Or elsewhere in the production.
1054 When we see an EOL we need to ask "Do I ignore this or use this?".
1055 The answer relates to indents and expected newlines.
1057 If a symbol can derive a sequence with an EOL, then the symbol is
1059 If that symbols after DOT in any item in a state are flagged
1060 'can-eol', then the state is flagged 'can-eol'.
1061 If a 'can-eol' state appear on the stack closer than a 'was indented'
1062 state, the we use the EOL token.
1063 If no states up to the first 'was indented' state are flagged
1064 'can-eol' then we discard the EOL token.
1067 That might work, but have I thought of all the issues?
1068 - blank lines. The grammar will need to allow for these explicitly
1069 in any context where EOL is expected. Not a big burden I guess.
1070 - force reductions ? If EOL closes things, does it close anything
1071 where not explicit? Probably not - that is sort of the point.
1072 If it appears where not expected but in line context, that is an
1073 error just like an "unexpected" error.
1075 - is the nesting 1-way? i.e. can I have line-oriented things in
1076 noline boxes. or "vertical stacks in horizontal boxes"?
1078 A possible example is a function call. It might be nice to allow
1083 There are two problem here.
1084 Firstly there is no EOL after 'first' so that cannot mark the end
1086 Secondly it forces an EOL interpretation on surrounding expressions.
1093 I would like to be able to drop the ',' after the '3' but it isn't
1094 clear that it is safe.
1100 Is a bit forced but with clear intent. I think it could be deemed
1101 as wrong because the think that wraps is not the first list element
1102 on that line. Is that a well defined concept?
1114 Recording the line structure allows for a number of different options. As mentioned already, a "NEWLINE" at the end of each line might do it. However as "IN" and "OUT" already imply a line break, we could just have "SAME" on breaks which don't have "IN" or "OUT". However this doesn't capture the idea that a line sometimes corresponds to a syntactic unit.
1116 Another approach is to follow the bracketing nature of "IN" and "OUT"
1117 and have "SOL" and "EOL" for start-of-line and end-of-line. As we
1118 noted already an indented section implies continuation and so should
1119 delay the end of line. If we follow this idea, the above structure
1120 could be tokenize as:
1125 Here every "SOL" has a matching "EOL" but it can be delayed until after an indented block. So the "EOL" that matches the first "SOL" comes after the "OUT". It is worth noting here and a single line break can have multiple "OUT"s and even an "IN" as well.
1133 After the "G" we would have "EOL OUT EOL OUT EOL IN SOL" before the "H". The three "EOL"s terminate the lines that start "F", "D", and "A" in that order, and a new "IN" is needed as "H" is indented with respect to "A".
1135 This tokenization captures the structure very effectively and is what
1136 I will be using with one simplification. The "SOL" really isn't
1137 needed. It is easily deduced and doesn't have a strong effect on the
1138 parser. The "EOL" may for a syntactic element to be closed. The
1139 "SOL" is simply recorded for reference, and that can be recorded when
1140 "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so
1141 "OUT" doesn't need to imply "SOL").
1143 This tokenization captures the structure very effectively and is what I will be using with one simplification. The "SOL" really isn't needed. It is easily deduced and doesn't have a strong effect on the parser. The "EOL" may for a syntactic element to be closed. The "SOL" is simply recorded for reference, and that can be recorded when "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so "OUT" doesn't need to imply "SOL").
1145 Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.
1149 When does an EOL actually mark and END.
1151 Now that we have a good understanding of IN and OUT we can turn to EOL. Intuitively EOL should mark the end of something much like OUT does. But what. Let's look at some cases:
1157 Here there are 3 EOL tokens, one after "Hello", one after "World" and one after the OUT token right at then end. The first two should close the statement on that line. The last probably shouldn't as there might be an "else:" still to come.
1163 There are 3 EOL token here in the same locations. Now the first two should not close anything, while the last one should close the statement.
1165 These two examples have very similar structure and are largely indistinguishable to the LR parser, and yet must be handled very differently. This implies that we need to provide the parser with extra information. We need to tell it something about the expected line structure of the language.
1167 To understand what to tell the parse we need to look into our (or my) intuition about why the newlines are or are not wanted in the different locations.
1169 I think the key observation is that statements are generally expected to be one-per-line, while there is no similar expectation for expressions. The earlier observation that an "else:" might still be expected shows that statements are always one-per-line, however that is a fairly well defined exception. i.e. there are certain placed in a statement where a newline is expected, or at least not a surprise. The start/end is one place. Before an "else" or "elif" is another. In C we might have:
1177 so an EOL is expected before a '{' or a '}' in that language.
1179 If we define all the places where an EOL is expected then we should be able to work out when an EOL should close a syntactic unit and when it should be ignore. However it isn't as simple as ignoring an EOL where ever it isn't expected. We need to include our understanding that indents protect EOL in some sense. In the second EOL example above (a = b + c ...) the first two EOL tokens were ignored in part because they were indent. If the second they were not ignore because despite being indented, they were expected. So the interplay between indents and EOL expectations needs to be understood.
1181 If we go back to our very first example:
1188 you might remember I suggested that if something started at "B C..." it should finish at "F". The "OUT" should close that as well. Our process described ..... not sure ..
1196 Also you need to know how the scanner can report indent information to the parser. I discuss that in some detail when introducing my lexical scanner but you probably don't need to read that now. It is sufficient to know that the token stream will have paired "INDENT" and "UNDENT" tokens, and that the "NEWLINE" token that you might expect to come before an "INDENT" is delayed until after the matching "UNDENT".
1198 With this in place, it seems natural to allow the parser to absorb INDENT and UNDENT, but to require that when it reduces and production, the net indent absorbed while shifting the tokens which make up that production is zero. i.e for every INDENT there must be exactly one UNDENT. So:
1202 It might help to start with some examples.
1204 1. Lorem ipsum dolor sit amet, consectetur adipisicing
1205 2. elit, sed do eiusmod tempor
1206 3. incididunt ut labore et dolore
1207 4. magna aliqua. Ut enim ad minim
1208 5. veniam, quis nostrud exercitation ullamco
1209 6. laboris nisi ut aliquip ex ea commodo consequat.
1210 7. Duis aute irure dolor in reprehenderit in voluptate
1211 8. velit esse cillum dolore eu fugiat nulla pariatur.
1213 Here are 8 lines of random text - the line number are just for reference and should not be considered part of the text. Similarly any punctuation or capitals should be ignored when trying to understand the structure.
1215 Looking at just the newlines and indents it appears that line 1-5, line6, and lines 7-8 form three separate and equal units. Within the first of these, lines 2 and 3-4 are again to separate and equal units. Line 5 is not part of whatever 2,3-4 are but is something else that is part of 1-5.
1217 Where that structural analysis is actually correct cannot be determined without knowledge of the meaning, so lets try something more meaningful.
1224 Now we can clearly see that the continued lines are all part of what was started on the first line. The last line is less indented than the others much like line 5 in the first example. We took that to mean it is not a continuation 2nd and 3rd lines but only of the first, and this example bears that out. The lack of a comma at the end of the first 3 lines are not lead to any ambiguity. It is clear from the structure that something is completed. It cannot be the whole list that is completed as that started on the first line and we are still indented with respect to there, so it must be a "smaller" or more local syntactic unit that has completed. i.e. the list entry.
1226 Here is a different example using the same structure.
1233 This looks very wrong. Breaking the first assignment looks bad, and having the second assignment start with the same indent as the continuation of the first is also wrong. Yet the list begun, but not finished, of the first line of the first example looks fine. Maybe it is the amount of indenting that makes a difference. We can try adjusting that and see:
1244 certainly the amount of indenting helped in the first example, but fixing it doesn't really fix the second example.
1249 I seem to be stuck here...
1252 When I see a newline, I must either IGNORE it or SHIFT it or REDUCE
1253 The key question is: when to ignore it?
1255 We ignore newlines when they are protected by an indent.
1256 What does "protected" mean? It means there is already an internal indent
1257 in any symbol which can derive a newline.
1259 So in the above examples, only newlines are expected at end of
1260 statement (with or without ';') and after '{'.
1262 So for all the newlines seen in a = [..] the closest symbol that derives
1263 newline is 'statement' and it has an internal indent. So they are ignored.
1265 In the "if a = 1 { ..." The first newline is after "+ c;". We are in
1266 statement (or statementlist) which derives newline and that has internal
1267 indent so newline is ignored. But I don't want it ignored.
1270 When we see a newline, we find the closest reducible thing that
1271 can contain a newline. If it started at the start of a line and
1272 contains an internal indent, then we can ignore the newline.
1274 That has potential: a line-like thing can only be broken if it
1275 starts at the beginning of a line.
1277 But can we implement this in an LR grammar? We don't know what context
1278 we are in until we reduce into that context. So how do we determine if
1279 we are in a line-like thing?
1281 Each symbol has a 'can_eol' flag. EOL gets that flag, and any
1282 non-terminal which can produce an 'can_eol' symbol gets the flag too.
1283 So 'statement' has it but 'expression' does not.
1285 Each state has an 'expect_eol' flag. If is set if the 'next' symbol of
1286 any item in the state has can_eol set.
1287 OR : It is set if any the head of any production in the state has can_eol set.
1289 Now recall that the state stack is conceptually:
1290 STATE1 SYM1 STATE2 SYM2 STATE3 SYM3
1291 Each STATEn is expecting all the following syms to eventually be reduce
1292 to something which can then advance that state.
1293 So if STATEn has 'expect_eol', then everything following is part of
1294 something that maybe could have a newline in it.
1296 So if STATEk has expect_eol then we ignore newline precisely if
1297 SYMk was at start-of-line and somewhere from SYMk on there is an
1298 internal indent. So we need a new start-of-line flag in the frame.
1300 Is expect_eol precise enough? I'm wondering about a production
1301 that has a newline in the middle. What happens then?
1302 After the newline has been shifted the whole won't be reduced so that
1303 expect_eol state will still be there so we can still be ignoring
1306 I think we declare that a production which can contain a newline but
1307 cannot end with a newline is an oddity which we don't support.
1309 Similarly if any production from a symbol can contain a newline,
1310 then every production is assumed to be line-like and so require an
1311 indent for newlines to be ignored.
1314 "Newlines are ignored only inside line-like symbols which started at
1315 the beginning of a line and have since had an indent"
1317 Possibly good, but a grammar with no newlines would not allow
1318 newlines. Maybe that doesn't matter - the lexer can ignore them.
1319 Can we turn it around though
1321 "Newlines are only shifted if the nearest line-like thing does not
1322 have an internal indent."
1323 so "They are ignored if there is no linelike thing, or it holds an indent
1325 Also "An internal indent in a linelike thing that started midline is
1329 That didn't work - 'expect_eol' definition doesn't make sense in that way.
1332 expect_eol is set if *all* of the *next* symbols in core item
1334 When a core item does not have a 'next' symbol we ignore that item.
1336 Looks good, but now a new issue appears.
1337 If a production is reduced by an OUT indent, the following NEWLINE
1338 doesn't have a home. If expect an optional newline there it might
1339 use the newline before the outdent (which could have been ignored
1340 otherwise??). and worse, it can create a reduce-reduce conflict
1341 because the optional newline could terminate inner or outer statement.
1343 This is particularly a problem with
1344 Block -> : Statementlist
1345 Statement -> if Expression Block else Block
1347 After an OUT reduces ": Statementlist" to Block there is a NEWLINE
1348 but we need to see the "else". If we indent the 'else' it works nicely.
1349 I want to tell that a NEWLINE is needed there. That means I must put one
1350 after "Statement -> if Expression Block" to avoid a reduce/reduce conflict.
1353 After trying to describe this I need to refine it again.
1354 - A symbol is "can_eol" if any symbol in any production from this symbol
1355 is followed only by nullable symbols.
1356 - A state is "starts_line" if the previous symbol in any item "can_eol",)
1357 or is nullable and previous to that "can_eol", etc.
1358 - The start state is "starts_line" if the start symbol "can_eol".
1360 19may2019 - in Texas!
1363 Because anything that ends at a NEWLINE is considered LineLike, an
1364 Expression is LineLike. I didn't want that.
1371 The NEWLINE after b is not Ignored in the expression,
1372 only once it has reduced the simplestatement because....
1374 A NEWLINE is discarded when the tos is NOT newline_permitted.
1375 A frame gets newline_permitted when the state "starts_line"
1376 which happens if a starting sym is "line_like", meaning it
1377 can usefully be followed by a newline.
1381 I've re-read most of my notes about NEWLINE parsing and I'm not sure....
1382 I originally wanted "linelike" to be anything containing a newline. I apparently
1383 found problems and now it is anything followed by a new line.
1384 But that seems weird, because in
1385 assignment -> name = expression
1386 statement -> assignment NEWLINE
1387 expression is followed by a newline, so is linelike. I think.
1389 Can I return to my original idea?
1390 Any symbol containing a newline is lineline, and newlines shouldn't be ignored.
1391 If the symbol started since an indent. If it started before an indent, then it
1392 doesn't affect newlines.
1394 During parsing, we don't know which symbol we are working on (yet),
1395 only which states we have, which can be working on multiple symbols.
1396 The decision we need to make when we see a newline is which of these to choose:
1397 - ignore - if states since an indent don't start a line
1398 - reduce - if there is start-of-line state since start of line.
1399 - shift - if we can, and not ignored
1400 - reduce - if we can
1401 - error - if nothing else
1403 Maybe every item in an itemset must be followed by a newline if any are.
1404 That cannot work as we always add prefixed of 'next' symbol.
1410 Should the NEWLINE be ignored? No! It isn't indented.
1411 So if I want NEWLINEs to be allowed, they need to be explicit.
1412 This makes the state startsline.
1414 So: we mark can_eol on any symbol that can derive NEWLINE
1415 we mark a state as starts_line if any item with dot-at-start can derive a can_eol.
1417 Then if the most recent starts_line is further back than an indent, we ignore newlines.
1418 So I don't need the lineline flag.
1421 I'm making progress with newlines...
1422 Newlines *are* shifted if there is only one symbol since the start of the line..
1423 The current conundrum is what happens to several blank lines after:
1427 The next thing could be
1429 - a new complex statement
1430 - the end of the block
1432 The first two are currently handle with newlines before things, so we stack up
1433 several newlines, then when we see a thing, we reduce the newlines out from under it.
1434 But that doesn't work for the end of the block.
1435 A Newlines symbol cannot insert an optional linebreak, but can absorb blank lines.
1436 So places where a line break is optional, we have
1439 Where we want to allow blank lines, we have
1443 Then if we see a newline, not at start of line, we shift if it is optional
1444 When we see a newline at start of line, we shift if blank lines are allowed.
1445 So if grammar allows
1447 Then .... that doesn't work. Newlines won't get any lines.
1450 | NEWLINE Newlines {
1458 to parse the same as
1463 and for the last newline to close the elsepart.
1464 But the latter has 2 newlines while the former only has one
1465 and I don't have any obvious justification for ignoring either.
1466 I think it is in the Newline before the OUT that is extra.
1468 I could drop the newline before the OUT, assuming the newline
1469 separate things, and the OUT will force any reductions needed.
1470 But then we have fewer newlines reported than actual.
1471 (Same imbalance happens with multiline comments and strings, so maybe
1473 Another way to look at it is that the newline following an IN is discarded
1474 (or always ignored) and not moved to after the OUT.
1475 So (maybe) the newline at an IN or OUT is reported *after* the IN or OUT.
1482 Would be A IN NL B NL C OUT IN NL D OUT NL
1484 The parser always ignores the NL after an IN but uses other
1485 NL to reduce to a single symbol (if possible)
1486 OR maybe it doesn't ignore (unless not line-like context) and
1487 lines are preceeded by NL, not followed by them...
1488 No, followed is usually good.. though separated is better... so preceed!!
1490 Ok, this isn't working.
1495 cannot be reduced to a Statement until we know what comes next, and it
1496 might be separated by several newlines.
1497 So the newlines need to be part of the Statement.
1498 But that means we cannot have newlines at the front of a statement.
1499 But that was the point...
1501 Maybe a Statementlist is a series of StatementNL followed by a Statement
1504 StatementNL -> Statement Newlines
1505 as a general catch-all, but when we have something like if, or anything
1506 with an optional tail "else:" or "case:"
1508 StatementNL -> if Expression Block Newlines
1509 But that would produce a conflict with
1510 Statement -> if Expression Block
1511 As a newline could either trigger a reduce to Statement, or a shift.
1512 Obviously we shift, but maybe we use precedence to force the point.
1514 Can we handle 'else if' ...
1515 IfStatementNL -> if Expression Block Newlines
1516 | if Expression Block else IfStatementNL
1518 ... I'm contemplating having the parser duplicate NL as necessary, so
1521 can appear to be followed by 2 NL, one to terminate the 'action' statement
1522 and one to terminate the whole 'if'.
1523 This might mean I need to extend when NL are discarded - to ensure they
1524 don't get duplicated too much.
1525 1/ if state does not permit newlines, discard
1526 2/ else if I can reduce symbols all since start of line do that.
1527 3/ else if can shift, do that
1528 4/ else if only one symbol since newline, discard.
1531 This means that we cannot recognise multiple newlines
1533 If we shift a Newline, that is since_newline=0;
1534 If we reduce that to Newlines, that is still since_newline=0
1535 If 4/discard only applies when since_newline==1 -- we win.
1537 Currently since_newline essentially means the symbol contains a newline.
1538 So 'statements' usually does, but 'statement' doesn't.
1539 When we shift the newline and reduce, it all becomes since_newline=0.
1540 That is when we want to ignore newlines.
1542 15jun2019 - still working this through..
1544 Normally the parser does
1545 shift or else reduce or else error
1546 exceptions are TK_in which is simply recorded and
1547 TK_out: reduce until there is a TK_in in scope, then cancel, else error
1549 if not newline_permitted (indent since last starts_line state)
1551 if can Reduce to at most start-of-line, reduce
1552 if can Shift, duplicate and Shift
1553 if can Reduce, do so
1554 if 0 since newline, Discard
1556 since_newline needs to be changed a bit.
1557 A TK_newline token *isn't* zero, it is N+1. The token *after*
1558 the NEWLINE is zero - so that
1560 Arg. I'm struggle with that fact that having shifted a newline,
1561 we are both at the end of a line, and at the start of the next.
1562 When I see a newline, I want to reduce until the end of line
1563 is in the same state as the start of that line.
1565 Maybe I do want newline to be a separator.
1566 What if I don't actually include the newline in the grammar, just like in/out.
1567 Instead we mark select productions as lines. This is like marking
1569 A marked production is reduced when a newline is seen providing it won't
1570 contain any indents.
1571 So: if the reducable item in a state is marked, the start gets marked.
1572 When we see a newline, if the state is marked and the reduce size does not
1573 exceed since_indent, we reduce. Otherwise we discard.
1574 No... I need an error condition too.
1575 So I need the state to have a starts_line marking, when a new item is marked.
1578 productions can be marked $$NEWlINE which flags the production as line-like
1579 a state with an item with DOT at start of a line-like production is starts_line
1580 a state with an item with DOT at the end of a line-line product is ends_line
1581 We track indents as before.
1582 When we process an indent or newline, we set since_newline to 0
1583 When we see a newline we do one of:
1584 if not newline_permitted, we discard
1585 if top state starts line, we discard
1586 else reduce or else error
1589 A production -> { statements }
1590 needs to ignore newlines either side of statements.
1591 It is a multi-line production - newlines don't matter.
1592 Maybe there are several sorts of symbols:
1593 - in-line: must be broken across lines unless indented
1594 - line-like: is terminated (reduced) by a newline
1595 - multi-line: newlines are ignored
1597 We tag symbols which are line-like.
1598 Any symbol which can derive a line-like symbol is multi-line
1599 Any other symbol is in-line.
1601 So SimpleStatements, ifhead, elsepart, casepart etc are linelike
1603 $line SimpleStatements IfHead ....
1605 A state that is at the start of a linelike symbol starts_line
1606 Any state in a multi-line production starts_line
1608 if tos starts_line, newlines are ignored.
1609 else if there is an indent since the starts_line, newlines are ignored.
1610 But if there are symbols since starts_line, we have to reduce until are
1611 are in a starts_line start (or can see an indent).
1614 Block -> : Statementlist
1615 is none of thse. It must be reduced by a newline, but isn't entirely line-like.
1616 Block -> { Statementlist }
1619 but maybe Block here is neither. It only becomes linelike when it
1620 terminates a Statement, which is linelike. Or terminates an ifHead
1622 Should this be legal?
1623 a:=b;pass if something
1624 probably not. I want to require at least ';' or NEWLINE.
1625 That means I need to include NEWLINE in the grammar.
1626 Statements -> SimpleStatements ; Statement
1627 | Statements Statement
1630 if cond: if cond: a:=b
1631 NEWLINE reduces this down to IfHead
1632 IfHead -> IfHead NEWLINE
1636 ElsePart -> else BLOCK
1638 | else IfHead ElsePart
1642 Statements -> Statements Statement
1646 if cond { statments } else { statements}
1648 if cond: statements else: statements
1650 so :statements must expect a NEWLINE but then
1651 if cond: if cond: statement
1654 Block -> : IN statments NEWLINE
1655 if there is no indent, we synth one which triggers an OUT NEWLINE pair.
1656 This could be automatic.
1657 If a linelike is followed by a newline, we synthesis an IN before it.
1659 That requires a hack to the scanner: Synth Indent
1663 Block -> { Statements }
1664 | : Statements NEWLINE
1665 | : SimpleStatements
1670 might not end in a NEWLINE so else could come immediately. Is that OK?
1671 if expr: a:=0 if expr
1672 must be forbidden. That requires a newline.
1674 Block -> : IN statements OUT NEWLINE
1676 In marks state as 'needs_indent' If an indent arrives, fine.
1677 If something else, we record that next newline (After balanced in/out)
1678 must synth extra out/newline.
1680 Block -> { Statements }
1683 statementblock -> Statements $$line
1685 $$line means it must be reduced by a newline. If something else tries,
1686 it is an error and we skip to newline.
1687 It also strips everything but NEWLINE from the (effective) lookahead
1688 to avoid reporting conficts, as those things will never be shifted.
1690 IfHead -> if Expr Block
1692 Block -> { statements }
1693 | : statements $$line
1695 IfStatement -> IfHead
1697 | IfHead else IfStatement
1698 IfTail -> else Block
1703 20jun2019 Happy 87th birthday Dad.
1705 I'm not convinced about $$NEWLINE
1707 else: simplestatementlist
1709 should be able to parse simplestatementlist without a newline, and
1710 use the newline to close the if/else.
1715 has a newline to close the statementlist and another to close the if/else.
1716 But can the LR parser tell the difference?
1717 It only sees that newlines don't forcibly reduce the else:
1718 So when it sees the newline at the end of simplestatementlist,
1719 it cannto shift because there is a sub-line thing that can be reduced.
1720 So this becomes elsepart before the newline is absorbed.
1721 Whereas in statementlist, the newline can be shifted creating simplestatementline.
1724 if cond: if cond: statement
1726 Again, the newline cannot be shifted while we can reduce
1728 But.... how does conflict analysis know that an 'if', for example, is not
1729 permitted after simplestatementlist?
1731 Ahh.. This is exactly what $$NEWLINE is for. Maybe it should be $$OUT.
1732 Either way, the grammar is ambiguous and relies on newlines or indentation to
1733 close the production, and this fact needs to be explicit.
1734 Requiring OUT is probably best as it means
1740 works even though there is no newline after the first statements.
1741 Here I want the 'statements' to be closed by OUT, but the whole to
1742 be closed by NEWLINE.
1743 So maybe I need both $$NEWLINE and $$OUT ??
1746 $$OUT makes lots of sense. It is exactly how we expect :statements to
1747 be closed - where we allow NEWLINE to have the same effect.
1749 $$NEWLINE is good for closing an complex if or for etc. It means that
1750 nothing else can be on the same line - allowing for indents.
1751 How do we implement that?
1752 Any production in the grammar that represents a full line but doesn't
1753 end with a newline should be marked $$NEWLINE
1754 This head of that production should recursively absorb NEWLINEs.
1756 I'm not yet clear on exactly the difference bwetween $$OUT and $$NEWLINE.
1757 I would put $$OUT after
1758 block -> : Statements $$OUT
1759 and $$NEWLINE at the end of a statement that must end a line
1760 condstatement -> Ifpart IfSuffix $$NEWLINE
1761 | constatement NEWLINE
1763 Maybe I need a worked example.
1765 if condb: if condc: action
1768 So after action there is NL OUT NL pass
1769 The NL sees that it can reduce, and the if allows the NL to reduce it so
1770 while COND : IN if COND : statement [ NL OUT NL ]
1771 again the NL can reduce. Note that we *don'* absorb the NL in the statement
1772 while COND : in statement [ NL OUT NL ]
1773 Now we can shift the NL
1774 while COND : IN statment [ OUT NL ]
1775 Now the OUT forces a reduction
1776 while COND BLOCK(in) [ OUT NL ]
1777 Now the out is cancelled
1778 while COND BLOCK [NL]
1779 and the while is reduced.
1781 So the $$NEWLINE must always see a newline (or $$EOF)
1782 An $$OUT must see an OUT or a NEWLINE (if there was no IN)
1784 $$OUT causes the LA set for items with the production to be empty.
1785 It is never credible that anything will be shifted so any apparent LA
1786 contents can be ignored.
1787 The state when a $$OUT is reducible has a recedence higher than any terminal, so
1788 nothing can be shifted and no completion should be possible.
1789 The state when a $$NEWLINE is reducible is much the same.
1791 Maybe I don't want NEWLINE in the grammar, only $$NEWLINE??
1792 How would we recognize a blank line?
1793 command -> $$NEWLINE
1795 We would need a new rule for discarding newlines.
1796 e.g. when the top-but-one state is start-of-line we discard and mark the top
1797 state s-o-l. That stops us discarding a newline until it reduces something that is at the start of a line....
1799 1/ if there is an indent since the last start-of-line state, discard NEWLINEs
1802 Q: When is a NEWLINE an error?
1803 A: when it isn't ignored and we cannot reduce and
1804 top or top-but-one state isn't starts_line.??
1806 So we need extra state info and extra frame info.
1809 - starts-line - is at start of a $$NEWLINE production
1810 - ends-line - is at end of unreduced $NEWLINE production
1811 - ends-indent - is at the end of a $$OUT production
1812 - min-prefix - how far back a 'in' can be and still cancel
1815 - indents - count in or after that sym
1816 - line_start - is the was a line start (IN or NEWLINE) immediately after
1818 - newline_permitted: no indent since start-line
1819 - since_indent: number of frames where indents==0
1820 - since_newline: number of frames where line_start==0
1822 If we see a NEWLINE then:
1823 if ! newline_permitted, discard
1824 elseif can reduce and reduce_count <= since_newline - reduce
1825 elseif since_newline <= 1, and state.starts-line, discard and record line_start
1829 increment indents, set line_start
1832 if reduce_size <= since_indent, reduce
1833 if min_prefix >= since_indent, cancel
1836 How does error handling work?
1837 Normally we pop states until we can shift ERROR
1838 Then we discard tokens until we can shift one.
1840 However we need to do something different for IN OUT NEWLINE.
1841 For IN, we simply increment a counter
1842 For OUT we decrement if it is positive.
1843 If it is zero and the state ends-indent, then we are synced.
1844 If it doesn't, we need to pop more states until we have an indent to cancel.
1845 For NEWLINE if the state ends-line or ends-indent and ...something... we are synced.
1848 ... no, that doesn't work because I cannot see a way to describe an optional newline.
1850 Let's try with just $$OUT which requires OUT or NEWLINE...
1851 We put $$OUT on productions that must be closed in a 2-d obvious way.
1852 So they can be at the end of a line or at the end of an indente block.
1855 means the next line after the : cannot be indented.
1857 Block -> : statementline | : statementblock | { statements }
1858 statementblock -> statements $NEWLINE
1859 means I can have else indented, or on same line as single statement
1861 if cond: a = b; else: b = a
1867 The whole 'if' needs a $NEWLINE marking to ensure a following statement isn't
1869 So implementation is almost exactly what I have:
1870 - if anything else is lookahead when reducing that production, it is an error.
1871 - remove non-newlines from lookahead in items
1873 But I don't think $$OUT is quite what I want to call it.
1874 That doesn't quote cover end-of-line possibilities.
1875 Maybe allow $$NEWLINE or $$OUT but with same behaviour.
1877 .... still not there.
1878 Another way to satisfy a $$OUT reduction is for it to already look right.
1879 So: No indents and at start-of-line
1881 But that upsets the modification to look-ahead as we can no longer assume
1883 I think this might be more like a precedence thing??
1884 Without look-ahead modification, the first token of a statement can be shifted
1885 before a newline forces a reduction....
1887 Maybe I do need two sorts productions.
1888 $$OUT requires an out/newline to reduce it.
1889 $$NEWLINE follows either $$OUT or NEWLINE and requires start-of-line and no indents.
1890 or $$OUT or $$NEWLINE
1892 Or does it matter. Over-modification of the look-ahead suppresses warnings, but
1893 doesn't affect the parse.
1894 Will we get warnings anyway?
1898 Are left-recursive symbols in a non-final position always bad?
1900 Left-recursive symbols cannot be closed by forcing a reduction.
1901 So if one starts in an indented region (in which newlines are ignored)
1902 it could continue afterwards - unless we make that an explicit error somehow.
1903 If they appear at the end of some other production, that one will (maybe)
1904 be reduced as well so (maybe) no problem...
1911 is weird and I want to forbid it. Al that is between b() and c() is
1912 NL OUT IN. NL closed b(), so it is just OUT IN
1913 So I do want the statementlist to close.
1919 is very wrong. How much can I help?
1920 The OUT will reduce "1 + 2" which will then become
1922 which would be highly confusing.
1923 So something about this must be disallowed.
1924 Maybe when newlines are ignored, OUT doesn't force a reduce??
1925 I can make it an error by having Expression reduce to something else.
1927 Do I want an error even for
1932 I could achieve that by adding extra checks when we SHIFT at
1934 If we could reduce tokens since previous SOL, then we have 2D ambiguity.
1939 That is just as ambiguous, but we cannot reduce anything.
1940 When we see the second '+', the reduction crosses a line-start but doesn't
1941 result in a line-start.
1943 So: a reduce that doesn't contain an indent, but does contain a start-of-line
1944 must reduce to that start of a line.
1946 This means we need to keep the start-of-line when we "IGNORE" a newline.
1948 Can I use this sort of logic to avoid the need for the extra reduction,
1949 or for the $$OUT markings??
1951 1/ The point of extra reduction is to avoid consuming more after an OUT or
1958 must be an error. The OUT reduced a()b(). The stack is then
1959 if cond : statements(n1) . IN Ident
1960 The first indent is gone. . There is no error until we see all of c() so
1961 if cond : statements(n1) simplestatement(i) . NL OUT NL
1963 Is it problen that the simplestatement is indented?
1964 if cond : statements(n1) statement(i) . OUT NL
1966 Q: is it an error to reduce a sequence containing an (uncancelled) indent?
1968 2/ The $$OUT markings guard against exactly a reduction containing an uncancelled IN.
1970 So maybe I have two new rules.
1972 1/ a reduction must not include any uncancelled indent. pop() must return 0.
1973 2/ a reduction the contains an unindented start-of-line must begin with start-of-line.
1974 So when we cancel an indent, we also cancel line starts since there.
1976 One other value of $$OUT is that is avoided conflicts - most symbols could not
1977 be shifted. That should have only applied to $$NEWLINE(!) and doesn't apply
1978 at all if I drop the marking and use internal rules instead.
1979 So how do I avoid reporting conflicts?
1981 Really, there shouldn't be any conflict as NEWLINE should be expected.
1982 Let's go back to that idea.
1984 1/ A linelike thing MAY start with Newlines and MUST end with a NEWLINE
1985 2/ A SimpleStatement is not linelike and doesn't include and Newlines
1986 3/ if condition : SimpleStatement
1987 is a SimpleStatement.
1989 4/ When we see the NEWLINE after "if condition : SimpleStatement" we have a shift/reduce
1990 conflict as we could SHIFT to make a complex statement, or reduce the whole thing
1991 to a SimpleStatement.
1992 Default action is SHIFT but in this case we want REDUCE - due to precedence?
1994 However when we see the NEWLINE after "if condition :IN Simplestatement"
1995 we cannot REDUCE as there is an cancelled indent, so we have to shift.
1997 But when we reduce, we only want to Reduce to IfHead so that an 'else' can appear
2000 If we see IN .. just continue.
2002 What do I need to do:
2004 1/ Change grammer to expect blank-lines before and to have a NEWLINE at the end
2005 of any line-like thing.
2006 This requires IfHeadNL and IfHead. ditto for switch, while, then ...
2007 This get complex with
2008 for a:=0; then a += 1; while a < 10:
2009 which could have several newlines
2011 then a+=1 ; while a < 10:
2013 ForPart -> for simplestatements ;
2014 | for simplestatements NEWLINE
2018 ThenPart -> then SimpleStatements ;
2019 | then SimpleStatements NEWLINE
2023 2/ disallow Reduce when embedded indents - report ERROR
2024 3/ disallow Reduce when embedded start-of-line.
2025 4/ TK_newline uses these rules to decide when to force a reduce.
2027 A/ A parser symbol that starts after an IN must end before the OUT
2028 B/ A parser symbol that starts before an IN must end at-or-after the OUT
2029 only if if the symbol is not line-like ???
2031 C/ A parser symbol that starts after a line-start and before an indent must end
2033 D/ A parser symbol that starts at a line-start must end before the end-of-line,
2034 or at a subsequent end-of-line.
2036 A is satisfied by forcing a reduce on OUT and reporting error if IN cannot be cancelled
2037 B is satisfied if we report an error if we try to reduce an uncancelled IN
2038 C is satisfied by forcing a reduce *after* shifting NL and reporting ERROR if
2039 min_prefix exceeds the line
2040 D is satisfied if we report an error when reducing at eol crosses a NL and doesn't start
2043 C is interesting - do we reduce *after* shifting NL?? I think we do, yes.
2045 So: when can I suppress conflicts, and how do I handle reduce/reduce conflicts?
2047 I need to be sure that a line-like ends with an unindented newline.
2048 I can trigger an error when that doesn't happen, but I want more.
2049 I want to encourage it to happen. So if the grammar allows a NEWLINE it
2050 will be shifted in, but if we have already seen an OUT, we ignore the NEWLINE
2051 rather than trigger an error.
2052 Also, I need to not report shift/reduce conflicts on whatever comes next.
2055 and c can end with a, then both c and a can be followed by the same things.
2056 This is a conflict. If c (and a) end with NEWLINE we declare the conflict
2059 An IfHead might not end with a NEWLINE. So to make a statement we need
2060 to follow it with an optional NEWLINE. Let's see if we can make that work.
2062 What is a SimpleCondStatement? It has no blank lines or unindented breaks..
2064 ForPart ThenPart WhilePart CondSuffix OptNL
2066 We don't need a distinct Simple class!! If it didn't start at SOL,
2067 then unindented NEWLINEs must be terminal Wooho!!
2069 This requires that "Statements" doesn't insist on following NEWLINE
2071 A SimpleStatement can be followed by a ';', a Statement cannot. That
2072 different is still needed.
2074 SimpleStatements -> SimpleStatement | SimpleStatements ; SimpleStatement
2076 SimpleStatements can end with a ComplexStatement and no NEWLINE.
2077 ComplexStatements must end with a NEWLINE after each statement except the last
2079 Each Part (including SSlist) end with arbitrary NEWLINEs. These will
2080 only ever be at the same indent level.
2081 A ComplexStatement must be separated from next by a NEWLINE.
2082 So if the final non-empty Part does not end with NEWLINEs, how do we require one?
2085 What if a Part doesn't end with NEWLINE ever, but can start with them
2087 CondStatement -> IfPart Newlines
2090 I think I need a CondStatement which doesn't end with a newline and a
2091 CondStatementNL which does. Then anything that can end a cond statement
2092 must come in two versions.
2093 IfPart ElsePart CasePart WhilePart CondSuffix
2095 If we expect the non-NL, we accept the NL but not vice-versa.
2104 the 'if' that started before the indent must finish at/after the indent.
2109 The Expr that started before the first indent may finish well before the indent finishes.
2110 I think this is because Expr is not linelike but 'if' is.
2112 So I don't want an error when reducing if there is an indent, unless the new top start
2116 OK, I'm up to the part where I need to hide conflicts that I can automatically resolve.
2119 State 7 has 2 (or more) reducible items
2120 IfHead -> IfHeadNL . [25]
2121 IfStatementNL -> IfHeadNL . [27] (2Right)
2122 State 35 has 2 (or more) reducible items
2123 IfTailNL -> else IfStatementNL . [32] (2Right)
2124 IfStatement -> IfStatementNL . [30]
2126 I need to clarify the rules that I'm working with.
2128 1/ Statements might not end with a NL but as it is linelike...
2130 2/ IfHeadNL IfStatementNL IfTailNL all end with a NEWLINE
2131 IfHead IfStatement IfTail might not (but they may)
2133 Why did I want IfHead -> IfHeadNL??
2134 Because I might have
2141 No, that is still an IfStatementNL. Once there is any NL, we cannot fit it
2144 Hmm... the FooNL pattern is getting out of control.
2145 Why do I need this again?
2147 Because when "if cond : statements" is followed by a NEWLINE I need to
2148 hang on to the parser state - not reducing to statement - until I see an 'else' or don't.
2150 If I do see the else, the difference doesn't matter. If I don't then I need
2151 to know if I have a NEWLINE.
2154 IfHead -> if Expression : Statements
2155 IfHeadNL -> IfHead Newlines
2157 IfElse -> IfHead else
2160 Statement -> IfHeadNL
2162 But I want "if Expr then SimpleStatements"
2163 where SimpleStatements can end with "IfHead"
2166 if foo : if bar : baz NEWLINE
2168 That NEWLINE mustn't turn "if bar: baz" into an IfHeadNL
2169 We need to first turn "if bar: baz" into a SimpleStatement, then
2170 "if foo : SimpleStatement" into an "IfHead", but NOT into a SimpleStatement.
2172 Arg. I might not know to reduce something until I've seen an IN. It is the
2175 What if an Statement *always* ends with a newline.
2178 also ends with a newline and can be a statement
2179 But if there is an IN after ':' the newline is hidden.
2180 So that doesn't work.
2182 What if a NEWLINE absolutely has to be at the top level.
2183 If a symbol contains a NEWLINE, then it must be at the start of a line,
2185 So if it isn't indented, it mustn't contain a NEWLINE - no NEWLINE will get shifted in.
2187 Statement -> if Expr : Statements
2190 What does Statements look like? It must end a NEWLINE
2191 Statements -> Statements Statement NEWLINE
2193 | Statements NEWLINE
2196 SimpleStatements -> SimplePrefix ; Statement
2197 SimplePrefix -> SimpleStatement
2198 | SimplePrefix ; SimpleStatement
2202 I think I have a very different approach - it incorporates a lot of
2203 the ideas so far and is maybe better.
2207 We have a simplified SLR(1) grammar where each state has at most one
2208 reducible production. We don't have an action table, but use the goto table
2209 to decide if a terminal can be shifted. If it can, we do. If it cannot,
2210 we reduce or trigger an error.
2212 Onto this we add handling for IN/OUT and NL. NL can appear int the grammar,
2215 Any non-terminal which can derive a NL is deemed "line-like".
2216 Such non-terminals will normally appear at the start of a line - possibly indented.
2217 These non-terminals can have some productions that have a NL (usually at the end)
2218 and some that contain no NL.
2219 If a non-terminal appears other than at the start of a line then no NL will ever
2220 be shifted into it, so a production without NL will be used.
2221 If it does appear at the start of a line, then any production can be used, though
2222 it must end up ending in a NL.
2224 The above paints an incorrect picture of how LR parsing works. At any given
2225 time you don't know what non-terminal is being matched, so we cannot exclude
2226 NL based on the non-terminal. We only know what parser state(s) we are in.
2227 So: any parser state which is at the start of a line-like non-terminal is
2228 flagged as "starts_line".
2229 Also, in each start we store a "min prefix" which is the minimum non-zero number of
2230 symbols before "dot" in any item in the set. This given a sense of where we are
2233 If min_prefix if the top state is less than the number of symbols since start-of-line,
2234 then we will not SHIFT a newline.
2236 Indents (IN) are recorded after the symbol they follow. If there is an IN
2237 since the most recent starts_line state, the any NL is ignored.
2238 An OUT will cancel the most recent IN, providing is in the top min_prefix symbols.
2239 If not, we need to reduce something first.
2241 So when we see an OUT, we reduce until we can cancel.
2242 When we see a NL, we reduce until the min_prefix reaches at least to the
2243 start of the line. Then we can shift the NL.
2244 After shifting the NL, the whole line should be reduced.
2246 When a line-like non-terminal produces a sequence that *doesn't*
2247 end with an explicit NEWLINE, the grammar analysis ensures that
2248 nothing can be shifted in after the end of the production. This
2249 forces it to be reduced into the non-terminal.
2252 simplestatement -> var = expr | print expr
2253 simplestatements -> simplestatement | simplestatements simplestatement
2254 SSline -> simplestatements NEWLINE | simplestatements ; condstatement NEWLINE
2257 | simplestatements NEWLINE
2258 | simplestatements ; statement
2259 | if expr then statements NEWLINE
2260 | if expr then statements Newlines else statements NEWLINE
2262 When a non-terminal is explicitly followed by a NEWLINE, it is line-like
2263 also if it contains a NEWLINE or linelike, it is linelike.
2266 ifhead -> if expr : statements
2269 iftail -> else : statements
2272 ifstatement -> ifhead
2275 statement -> simplelist | ifstatement | Newlines simplelist | Newlines ifstatement
2277 statements -> statement | statements statement
2279 A line-like that contains newlines must be reduced by OUT or NEWLINE.
2281 How can I know that a statements can be followed by else in
2282 if cond : statements else: statements
2286 statements statement
2288 Maybe I could have a $sol token.
2289 If at $sol, and cannot shift then try shifting $sol..
2291 statements -> statements $sol statement
2292 or maybe $eol is better, then we can have NEWLINEs start start of statement.
2293 OR maybe either... $linebreak is shifted if previous or next is NEWLINE
2294 An IN doesn't allow a LINEBREAK.
2296 Just to repeat myself:
2297 Arg. I might not know to reduce something until I've seen an IN. It is the
2300 After above, new approach: IfHead and Statement have NL versions, nothing else does.
2301 MAybe fixed the above...
2305 StatementNL -> Statement NEWLINE | IfHeadNL
2306 Statements -> StatmentNL | Statements StatementNL
2307 StatementList -> Statement | Statements
2309 Block -> : Statements | Open Statements Close
2311 IfStatement -> IfHead | IfHead IfTail | IfHeadNL IfTail
2312 IfHead -> if Expr Block
2313 IfHeadNL -> IfHead NEWLINE | IfHeadNL NEWLINE
2314 IfTail -> else Block | else IfStatement
2316 Close, but the IfHeadNL in "else IfStatement" cannot accept newlines.
2318 IfHead -> if Expr Block | IfHeadNL else IfHead | IfHead else IfHead
2320 No, I think I have to take a totally different approach.
2322 IfPart elsepart switchpart whilepart etc are all syntactically valid
2323 as stand-alone statements in the base grammar.
2324 We use the code to fail a stand-alone elsepart the isn't preceeded by an ifpart, whilepart or casepart.
2326 So statements never contain newlines, only the statement-list does.
2327 If puts a NEWLINE at the end of each statement
2328 statementlist -> statement | statementlist NEWLINE statement
2329 statement can be empty string, thus allowing blank lines and a NEWLINE at the end,
2330 which the parser will require.
2333 [[ thought experiment - interestibg, but gets unwieldy with
2334 more complex statements
2335 statements -> simpleline NL statements
2336 | ifhead NL iftail NL statements
2337 | ifhead NL statements
2338 | ifstatement NL statements
2340 iftail -> else block
2341 | else ifhead iftail
2344 ifhead -> Newlines if expr block
2345 ifstatement -> ifhead iftail
2349 13oct2020 - much later.
2351 Maybe a different approach to newlines. Never shift them, only use
2352 them like indent to guide parsing.
2355 OUT must reduce until top symbol contains an IN, which is cancelled.
2357 Some productions are flagged as $$NEWLINE (or similar) meaning that they can
2358 only be reduced as a whole number of lines.
2359 When we see a newline, we record that until a non-newline is seen.
2360 If we cannot shift and so choose to reduce:
2361 if the production doesn't require a newline - we just reduce.
2362 If it does, and we have recently seen a newline, again we reduce.
2363 Otherwise, we signal an error and (maybe) shift until we see a newline.
2364 NOTE the newline only counts if there are no indents.
2370 would be parsed as a statement, which I don't want.
2375 to be a single statement.
2376 So a new line in some context must force a reduction, like OUT.
2377 When exactly does a newline force?
2378 An OUT forces a reduction until the top symbol has the matching indent.
2379 A NEWLINE ?? Must depend on the grammar.
2383 Q: Can this be only about resolving ambiguity and reporting errors?
2384 If the default is to shift if we can, reduce otherwise, then
2385 2D causing an early reduce can resolve ambiguity.
2387 If certain productions must end at EOL, and error is thrown if they don't.
2389 That latter really must be a gramatically need to shift a newline.
2391 Maybe I'm wrong about reporting errors. Maybe they should always come from
2392 a failure to shift or reduce.
2394 One of my problems was
2395 if foo: if bar: dostuff
2396 where the newline at the end needs to reduce everything, so I can
2397 require at most one newline in the grammar.
2398 It probably should be required for the first 'if'
2400 Stmt -> if Cond : StLstNL
2411 NONO. I don't want some statements that end with NL and some that don't.
2413 - Record indent with preceding token and line-start with following
2414 - when reducing all indents are accumulated into the new symbol,
2415 and the line-start of first symbol is preserved. Other line-starts are dropped
2416 - When OUT in lookahead, there is an indent within 'min_prefix' of top state,
2417 decrement closest indent count and discard the OUT
2418 - When OUT in lookahead and cannot be discarded and top state can be reduced,
2420 - When OUT canot be discarded or reduce, and error is triggered.
2421 - No production can be reduced if it contains unmatched indents. If cannot shift,
2423 - Some productions are marked as $LINE
2424 - any state that contains an item at the start of a $LINE is marked 'starts line'
2425 - if there is an indent more recent than a starts-line state, or the top
2426 state starts_Line, newlines are ignored.
2427 - if a newline is seen but not ignored, and the top state can be reduced,
2429 - A $LINE production can only be reduced by a newline, or, if it didn't start
2430 at start-of-line, by an OUT
2433 Block -> : statementlist $LINE
2435 I want to be sure that
2442 is OK. Here the "IfHead" doesn't end on a newline, but the state that is
2443 expecting 'Block' 'starts-line'. I guess as "Block -> { statementslist }"
2444 isn't a $LINE, it can be reduced by 'else'.
2448 Maybe I don't want TK_out to force a reduce (mostly). Maybe the grammar
2449 should be in control. So when we see an OUT we record it off-stack.
2450 If we reduce for some other REASON and the top state contains an indent,
2452 When we decide to SHIFT, if the top of stack has -ve indent, that is an
2460 would show an error when s2 was shifted.
2464 needs to show an error too. The "if(Cond)s1;" statement would have an
2467 So maybe the requirement is that no symbol has unbalanced internal
2469 When we shift we require that top balances first.
2470 So we count up outs and ins. There can be several OUTs and at most on
2471 IN before a terminal.
2472 If the terminal can be shifted, the top symbol must have no internal
2473 indents after cancelling with out, and there must be no outstanding
2475 If the terminal cannot be shifted we reduce and we must have enough
2476 outs to balance anything within the reduction.
2478 So indents on the stack are always between symbols. Each frame has
2479 a symbol, the indent after it etc.
2481 If we don't have enough outs to complete a reduction we raise an error
2482 and need to discard (or shift/reduce) until we find an 'out'.
2483 If there are too many outs to shift we again need an error which will
2484 hopefully let us reduce and cancel some outs - just discarding states
2485 will help with that.
2487 So: outs must cancel, following token must envourage reductions
2489 Some productions end with newline, their start state is 'starts-line',
2490 and if there is an indent since a starts-line, we ignore newlines.
2492 If we don't ignore newlines, then there is a starts-line state that we
2493 should reduce to. If we don't have enough symbols yet, we error.
2494 So if we see a newline, we reduce until the newline would be ignored.
2496 If that state expects a newline and we don't have one, we error.
2497 Except.. I sometimes want OUT to work like a newline.
2498 If I need a newline and reduction isn't to start of line
2500 If state expects newline and we don't have one but do have OUTs
2501 we reduce as long as there are enough outs and raise an error if
2502 we end up at start of line.
2505 Do I want to require that a newline be shifted? We can synthesize
2506 one when minprefix is less than since newline and we have an out.
2510 A shift requires there be no pending OUT ??
2511 A reduce requires there be no pending indent, i.e that
2512 indents within the reduction don't exceed outs.
2513 NO this doesn't work.
2514 A IN increments the indent on the top frame - never causes error
2515 An OUT decrements indent in a frame within min_prefix, or error
2516 A NEWLINE is either ignored or shifted, but is duplicated when
2518 After an OUT is handled, if a NEWLINE is expected but will only
2519 reduce a partial line, it is provided.
2525 This effectively pretends there was a IN at the start of the "line",
2526 and we are not doing two OUTs with a NEWLINE between.
2533 Do I really want to shift newlines:
2534 yes: simpler grammer
2535 no: don't need to record min-prefix.
2540 Place NEWLINE in the grammar
2541 When completing a state, if any production contains NEWLINE, record as
2546 My idea of ignoring newline when the top state is a starts_line state isn't
2548 The idea was that reducing
2549 statement -> assignment NEWLINE
2550 would leave at a starts_line start, so NEWLINEs would be ignored.
2551 But this leaves us at a ' ... statement . ' state
2552 e.g. " Statementlist -> Statement . "
2553 which reduces to "Statementlist -> Statementlist . Statement"
2559 Review: because it isn't working (as usual).
2561 Indenting have 2 effects:
2562 - it triggers errors when something looks wrong
2563 - it causes newlines to be sometimes ignored.
2566 - A symbol that starts indented must finish while still indented.
2567 So an OUT cannot cancel an IN except in the top phrase, and an attempt to
2568 shift while there are pending OUTs is an error.
2569 - A symbol that begins at the start of line and contains an indent may only
2570 be the top symbol - shifting before an OUT cancels the indent is an error.
2572 - Newlines are ignored when there is an indent since a starts-line state
2575 - There can be one IN and several OUTs pending - we update the counters
2577 - If there are pending OUTs and their is an indent in the top phrase,
2578 we can cancel an indent.
2579 - a SHIFT becomes an error if the top symbol has an indent and started
2580 at beginning of line
2581 - a SHIFT becomes an error if there are pending OUTs
2582 - a SHIFT that is not an error first imposes the IN on the top of stack
2583 - REDUCE may happen at any time.
2585 Newlines, when not ignored, are repeating terminals. An indefinite series
2586 is generated until a state is reached where they are ignored.
2587 The expectation is that they will be shifted (possibly after some reductions)
2588 and then a reduction will happen which changes the state appropriately.
2590 It is vital that no recursive production keeps consuming newlines
2592 as that will be fed indefinitely, or if newlines are ignored, will never reduce.
2594 NEWLINEs are recognized if there is a starts_line state since the most recent
2595 indent, but not when the current state is starts_line. So once we reach
2596 starts_line, we ignore newlines until something else is shifted.
2597 Or maybe only a starts_line state at the start of a line?
2599 A starts_line state has an item "head -> . stuff NEWLINE"
2600 So if were just reduce 'foo -> foo NEWLINE' then we are at the end of foo,
2601 so an earlier state with have ".foo" and so "foo -> . foo NEWLINE" and so will
2602 be starts_line. So we will always accept a NEWLINE. So these must be disallowed.
2604 How does "if cond : if cond : print NEWLINE" parse?
2605 "block -> : slist NEWLINE"
2606 "slist -> statment | slist statement"
2607 "statement -> print NEWLINE | if cond block NEWLINE"
2609 so starts_line are before statement or ':' - I don't want that as it would allow a newline
2610 before ':' to be ignored.
2611 Maybe I want starts AFTER a NEWLINE?? Did I do that before?
2612 So after block, after statement, after slist.. but we expect NEWLINE there.
2614 Maybe the important states are "foo -> baz . stuff NEWLINE".
2615 Once we've entered a path to a NEWLINE, we need to be ready for it.
2616 So: after :, after print after if and cond and block
2618 The repeating newlines are hurting now. The rule to start ignoring them
2620 ifhead else if expr : statementlist [NEWLINE]
2621 the NEWLINE should be ignored.
2622 There is an indent after ':' but final state is startsline because this could
2623 be the end of a block -> : statementlist NEWLINE
2624 But I want that newline to be ignored because of the indent.
2629 why is that bad? How can we know? Because we reduce to start-of-line
2630 and still have an indent.
2637 I need some good examples to remind myself.
2638 Why do I want min_prefix? I allow indents to be cancels within min_prefix
2639 cmd -> { statements }
2643 ARRRGGHHH I keep going down rabbit holes. I need to take a more mathematical
2644 approach and present clearly what I have and what I need.
2645 The big issue, for the moment at least, is when to ignore newlines.
2646 I REALLY like the idea of newlines being repeating until they are ignored.
2647 I definitely like that they are ignored after an indent unless they are explicitly
2648 expected. I need to ensure they are not expected at awkward places, like the
2651 I want to be careful of making LL assumptions about the LR grammar. i.e. I cannot
2652 assume the end from the beginning - that is was LL does. I want the appearance
2653 of a newline to only make a difference once it appears. But at the point that
2654 it appears we need to know if it is to be ignored. In
2656 it isn't ignored event though it isn't expected exactly here. In
2659 it is ignored and the difference is that no state since the IN expects a
2662 To acknowledge LR approach we need to take the latest state. The start of
2663 line, before the 'a' is tempting but we hardly know anything there. After the 'a'
2664 is tempting, but that would mean after a[b+c].thing(foo) as well which seems to
2666 Maybe something earlier sets the expectation?
2668 The ':' says newline-separated things are coming. So it is the state
2669 after the ':' which says 'newlines matter'. This will be after the indent.
2672 then maybe newlines don't matter (depending on grammar)
2674 But a newline is ignored *at* that state, so that blank lines can be skipped.
2675 I think there is a 'statementlist' symbol and at the start or end of that symbol
2676 newlines are ignored. Within that symbol unindented newlines are significant.
2677 statementlist -> statement NEWLINE
2678 | statementlist statement NEWLINE
2682 Can I drop the 'min-prefix' concept?
2683 The means that OUT can only cancel against an indent 'in' or immediately before
2684 the top symbol. This means we go back to keeping a 'starts_indented' flag and
2685 storing each indent with the following symbol.
2687 Then the 'ignore newline' test is since_starts_line > since_indent
2688 but since_indent is 0 fir there are internal indents and 1 if there is only a
2689 starts_indented indent.
2691 So a TK_in simply sets starts_indented for next
2693 Whenever top symbol has indents and outs are recorded, we cancel.
2694 If we want to shift with pending outs, that causes an error
2696 I need a reason to signal an error for
2700 based on indents. equally for
2703 If indent is a continuation, then 'c=1;' but be part of 'if...' before
2704 it is part of what comes before?
2705 This means the OUT will not be able to see the IN ... unless the whole
2706 statement list ends here....
2707 In "a = IN b + c ; " the ';' causes a reduction to Assignment with indent
2708 so I must be happy for the 'c=' to reduce to 'ifstatement' with an indent,
2709 but reducing to the recursive 'statementlist' seems wrong. Maybe these
2710 left-recursive productions really are special.
2711 But what about 'expr -> expr + expr' ??
2713 The rule I'm depending on is that I cannot shift with a pending OUT.
2714 So everything since the IN needs to compress to a single symbol.
2720 then the '}' will reduce to "{ statementlist" before shifting, so the OUT will be
2721 cancelled. The question is: how can we prevent that in the grammar?
2722 There are 2 OUTs (in this case) does that help?
2723 A comparable case is
2729 In that case the two OUTs will be cancelled at clearly different times, but not real
2739 Now the 'bar' will be blocked from shifting.... no it won't.
2741 I need a new rule - this *must* be wrong, even when we ignore newlines.
2742 But why? Because "b=1;c=1;" looks like a statementlist, but isn't.
2743 It isn't because a statementlist doesn't belong here, ony a block or a statement.
2744 How do I know that indent is continuing the 'if', not the statementlist.
2745 I need some marking on statementlist to say "Cannot be continued by indent"
2746 I guess that is what NEWLINE is for.
2748 So the rules would be "a suitably marked production cannot be reduced with
2749 outstanding indents. If nothing can be shifted, an error must be raised.
2750 These will typically be productions that end in a NEWLINE, but maybe any
2751 production from a symbol that can produce a NEWLINE, or that is marked $$something
2753 Back to TK_newline. These are duplicated on shift, but discarded when not wanted.
2754 They are not wanted unless there is a starts-line start below the current state,
2755 and more recent than the most recent indent.
2756 A starts-line state is any state which contains an item with a NEWLINE and dot
2759 statementlist -> statement NEWLINE
2760 | statementlist statement NEWLINE
2763 Q: how do nested 'if's resolve?
2770 statement -> if cond block
2771 | if cond block else block
2773 The tokenization would be
2774 if cond : IN if cond2 : IN print NL OUT NL OUT IN else : IN die NL OUT NL OUT NL
2775 1 2 A 2 B 1 3 4 C 4 D 3 E
2777 A makes 'print' a statment,
2778 B makes 'if cond2' a statment,
2779 C makes 'die' a statement
2780 D is protected by IN/3 and is ignored.
2781 E makes if cond .. else.. a statement
2785 | if cond block NEWLINE else block
2787 Then on seeing the NEWLINE I wouldn't know whether to shift it,
2788 or reduce to a statement.
2789 To fix that I would need:
2790 statementlist -> statement | statementlist statement
2791 statement -> simplestatements NEWLINE
2792 | if cond block NEWLINE
2793 | if cond block else block NEWLINE
2794 | if cond block else statement
2795 | if cond block NEWLINE else block
2796 | if cond block NEWLINE else statement
2798 But wait.... Allowing NEWLINE before 'else' confuses the parse of 'if cond:...' above.
2799 NL-B makes "if cond block NEWLINE" and then the else can be shifted, but that is illegal
2800 because of the negative indent. That means the negative indent must prevent SHIFT.
2803 Do I still need to know which symbols are 'line-like' ??
2804 Yes, to make it easier to detect line-line productions which must contain unmatched indents.
2805 Do I need recursive line-like detection?
2806 For no-indent productions, it probably makes sense.
2807 For starts-line detection? I don't think so. 'block' isn't the start of a line.
2808 And not really needed for no-indent
2811 Do I need to worry about left-recursive symbols?
2812 I don't think so. There should always be some terminal - often NEWLINE -
2813 which will ensure the symbol isn't extended??.. or if it did we would get
2814 a parse error due to uncancelled OUT
2816 Do I need to split indents?
2817 A stack frame holds "symbol state". Indents are within or before the symbol.
2818 For cancelling with OUT, indents within and before the top state are equivalent.
2819 For hiding newlines the indent before the symbol is too distant. f->next_indented ? "/" : "",
2822 Should I store them in the stack?
2823 "symbol+indents indent state"
2824 So when we see an indent, we mini-push that
2825 Any indent since state protects newline
2826 Out can be cancelled if top state has indents, or previous has trailing indent.
2828 Rather than an ignore_newline flag we need a outs_needed flag. This is how
2829 many outs are needed before we process newlines.
2830 If state is startsline, then outs_needed == infinity.
2831 Else it is a count of indents since a starts-line state.
2832 So maybe it is an indents_since_starts_line and we test state and indents.
2835 1/ rearrange frame to include 'indented' in the top frame
2836 2/ replace ignore_newline with indents_on_line
2839 I'm getting close...
2840 Problem is correctly ignoring of newlines.
2841 I ignore them after an indent - that bit is easy.
2842 But I also ignore them at a start-line symbol - sometimes.
2844 INDENT statementlist NEWLINE should ignore the newline
2846 INDENT statementlist if expr : stl if expr : stl NEWLINE should not
2848 (/0s) Statementlist (11s) IfHead (5) else (22) if (3) Expression (19) : (29s) Statementlist (40s) [NEWLINE:22:0] - Ignore
2850 This shouldn't Ignore. Is that because there is a non-'s' since the '/' ??
2852 i.e Ignore newlines if all states since indent are startsline, or if none are.
2853 Don't count current state if it is startsline
2854 indents_on_line > outs || all_startsline(...)
2856 No - better, but no banana. if expr { statement; newline
2857 need to ignore even without the indent.
2858 So I need a new type of state - one that is at the start of Statementlist but
2859 not at the end. State 29 in the above, but not 40
2861 (/0s) Statementlist (11s) if (3) Expression (19) { (30s) Statementlist (39s) [NEWLINE:34:0] - ERROR
2865 I introduce "startsline" and "endsline" states.
2866 A state before a linelike symbol is startsline, a state after such a
2867 symbol is endsline. A state can be both, in which case we consider it 'endsline'.
2868 So 11 endsline, 30 startsline 39 endsline
2870 If there is an indent since the last startsline or endsline state, we ignore NEWLINE.
2871 If there are only start/end line states since indent, ignore newline
2872 if there are only endsline states since startsline, ignore newline - NO.
2875 I wonder if maybe I should have IN and OUT in the grammar.
2876 The tokens can still appear anywhere, but the production can only be reduced of there is an IN there.
2877 Block -> : IN Statementlist OUT | : SimpleStatements
2878 might get messy. But it already is messy.
2882 The big questions seems to be: when can/must I ignore NEWLINEs?
2883 Conversely when do they REDUCE or get SHIFTed?
2884 The answer must lie in what appears between the most recent INDENT and the NEWLINE.
2885 - if there are no startsline states, we can ignore.
2887 It seems that I need to be able to see the start of a block, either an indent, or a {..
2888 But if there is a statementlist, surely a block started? No, that misses a point.
2889 INDENT ignores newlines because there will be a matching OUT
2890 Maybe { causes NEWLINEs to be ignored because there is a matching } expected?
2892 If a startesline state is at the end of a productions, it plays no role in NEWLINEs,
2893 but if it is followed by something
2897 We track indents and newlines. The goal is resolve ambiguities and detect errors.
2898 Ambiguities are resolved by forcing a REDUCE in some circumstances when an OUT or NEWLINE is seen.
2899 Errors happen when there are too many OUTs.
2900 NEWLINEs are a normal part of a grammar, except that they get ignored sometimes when they are not relevant.
2903 1/ a production from a lineline symbol cannot be reduced while it contains an unbalanced indent,
2904 and so must be reduced before an excess OUT token is processed.
2905 2/ NEWLINES are ignored if there is an IN since a start-of-line state
2907 Otherwise NEWLINEs must be in the grammar. This can be awkward where they are optional
2908 such as before an 'else'. To handle the fact that a structured statement can be multiple lines
2909 or few we need to build it up from the possible line-like parts.
2910 Maybe this means that ifpart, elsepart, whilepart, casepart etc need to be statements which
2911 are combined above the grammar level??
2912 The reason has something to do with reducing to Statement when the newline is shifted.
2913 I wonder if that is really needed.
2914 if we have "ifpart -> if cond block" then that can be followed by elsepart, but we
2915 need a separate "ifpartNL -> ifpart NEWLINE | ifpartNL NEWLINE" which can also be followed
2916 by an elsepart, or can reduce to a statement
2917 ..but no. That isn't sufficient if NEWLINEs are indefinitely duplicated.
2918 So if we don't ignore newlines where they aren't needed, we cannot duplicate them.
2920 The need for duplicating was constructs like
2921 if cond : if cond2 : action
2922 which needed 2 NEWLINEs, one for each statement.
2923 Maybe the "if cond2 : action" needs to be a simplestatement.
2924 When we see the NEWLINE we can (must?) reduce anything that started since
2925 a start-of-line, which creates the simplestatement.
2927 So let's try dropping the infinite repeat and require newlines in the grammar
2930 - a production from lineline symbol cannot be reduced while it contains unbalanced
2932 - a NEWLINE is ignored if no startsline state since indent
2933 - a NEWLINE cannot be shifted unless top startsline is at start of line.
2936 - ignored if not startsline since indent
2937 - forces REDUCE if top startsline is not at start of line !!!!. i.e. don't shift
2938 - else can be SHIFTed.
2940 26jan2021 - happy australia/invasion day
2941 I want ": simplestatements" to be a valid block, but not when there is an
2942 indent in the middle
2943 Previously this was resolved by having an SSLine -> simplestatements NEWLINE
2944 and the SHIFT of the NEWLINE outweights the reduction to block.
2945 And I have that now with Statement -> SimpleStatements NEWLINE...
2946 Ahh.. it is because I put Newlines before Statementlist. I guess a NEWLINE
2947 needs to be a Statement
2949 I currently suppress shift unless "outs == 0". That means
2953 doesn't shift the 'else' because the "out" isn't resolved.
2954 Without the leading space, the NEWLINE reduces to ifstatement, and I sort-of know
2956 So: why suppress SHIFT when there is an outstanding OUT... or better Question,
2957 why is this OUT still outstanding?
2958 Ahhh.. I misunderstood. I DO want to reduce that 'if' because it is nested.
2959 I have an IfStatement what needs to reduce to a Statement and merge with a
2960 Statementlist, so the OUT can be cancelled and then the 'else' shifted.
2961 BUT as 'else' is the lookahead.... I need NEWLINE to reduce IfStatement to
2962 Statement. But even with a NEWLINE we don't SHIFT that because we still have
2963 that OUT to clear. So how do I clear an OUT ?? It can only be cancelled when
2964 the matching IN reaches TOS, which requires the NEWLINE to be shifted.
2965 We cannot shift the newline early, partly because that would be wrong, and
2966 partly because it gets reduced to a statement and disappears.
2968 Hmmm. After blocking SHIFT(NL) when we don't have a line I see
2969 progress but not there yet.
2974 After the OK is NL OUT NL
2975 The first NL is shifted and reduces down to Statementlist and the OUT cancels.
2976 But now the NL doesn't belong. Can I explicitly allow NL before } ??
2977 No, because it doesn't even SHIFT. It cannot because the Statementlist doesn't
2978 appear to be sol - the OUT hide that and it now looks like
2979 if expr { statementlist
2981 Arggh. I think I need to hide the starts-line state in this context.
2982 More thinking needed.
2984 I introduced the shift-block to make sure the NEWLINE paired with the right line.
2985 Here the NEWLINE should probably be ignored. But why? It isn't indented, but it
2986 is inside a multiline thingy.
2987 Might be better to explicitly allow it and find some way to let it be SHIFTed.
2988 Problem is that the Statementlist is linelike, but I don't want that here.
2989 Hard to avoid when the statements are and must be linelike.
2990 Maybe I need a way to hide linelike-ness.
2992 Maybe a state should only be startline if the core item has dot followed by
2993 a single symbol (which can derive a newline) ??
2996 I need a new idea concerning starts-line states. I need some refinement somehow
2998 block -> { statementlist . }
2999 should ignore newlines - providing statementlist isn't recursive - but doesn't
3001 block -> { . statementlist }
3002 is further up the stack, and that is a startsline state
3004 Maybe the thing is that the latter is startsline only because of statementlist, and
3005 now that statementlist is gone, the startsline-ness lapses.
3007 So in the former state, it is not startsline, and it is not terminal, so it
3008 suppresses a startlines state 2 levels up.
3010 But does that help? We would suppress the startsline-ness but there are
3011 no remaining indents to ignore the newline.
3012 Why can I ignore a newline in "if cond { st }" but not in "a = ( x )"
3014 Ahhh. This helps because the new top startsline would be at the start
3015 of a line, so newlines can be shifted. The grammar can explicitly
3016 allow a newline there... only then the state becomes a startsline
3017 state?? or does it? But it is the top state, so it doesn't matter.
3019 Rule: a NEWLINE cannot be SHIFTed if the topmost active startlines state
3020 is not at the start of a line non-indented. This is because newline
3021 must be meant to end a line started earlier - where starts-line was at
3022 the beginning of a line.
3023 The stop state is never "active" as the line it would start hasn't
3024 actually started. If the shifted newline reduces immediately, the
3025 grammar is probably broken.
3026 Also a state is inactive if a subsequent state declares it to be. This
3027 happens when a state is non-terminal (not reducable), and is not startsline.
3028 The smallest prefix length of all core items indicates how many
3029 preceding states are deactivated. If min-prefix is N, then N-1 starts
3033 So what do I need to code:
3034 - I need to record with each state how far back it suppresses
3036 - enhance test for shifting newline
3039 OK, the parsing code seems to do what I want, now I need to fix the grammar.
3040 The context is structure statements which contain lines. e.g.
3046 The "if cond: statements" is a while line so it looks like a statement.
3047 But then we see "else" which isn't the start of a statement.
3048 I've considered two avenues.
3049 1/ decide that "else: statements" is a valid statement and generate errors
3050 in the semantics analysis if the preceeding statement doesn't like the else.
3051 2/ enumerate all the possibilities to the grammar as 1 or more lines.
3052 ifstatement -> ifline | ifheadline elseline ...
3053 But that seems problematic with cascaded "else if"
3055 So let's try avenue 1. "else block" and "else ifstatement" are statements.
3058 indent_test seems to work, now trying to convert ocean.
3059 My plan is that the various parts of a condstatement can either be
3060 all on one "line", or some of them on their own lines.
3063 for then while do case* else
3067 a for,while,switch,if,do can start a statement
3068 and this determines what other parts are allowed.
3069 So we need to allow continuations of
3072 then? while case* else?
3083 But wait... what happens with "else"?
3084 I want to allow "else" to be followed by a CondStatement so
3090 works. I guess there is not much of an issue there the 'else' becomes an
3091 option prefix to a condstatement
3092 Callinfg var_block_close at the right time might be awkward as we don't
3093 know when we are parsing the end of a CondStatement.
3095 Pause and reflect: what is the problem we are trying to solve, and does
3098 The problem is newlines. When we see one we don't know whether to
3099 reduce to a Statement or just to an (e.g.) IfPart.
3100 We would need to allow several Newlines while staying at IfPart.
3101 Then if we see 'else' we shift that, otherwise reduce to Statement
3103 ifstatement -> ifhead elsepart
3108 But wait... indent_test is broken!!
3109 If I indent the 'else' one space, it looks like an ElseStatement after
3110 the Statementlist that should be closed - but is recursive.
3111 I can change it to a BStatementlist, but there is nothing to force that
3112 to reduce. We prevent shifting until the outdent is cleared, but that
3113 happens with the Statementlist. Maybe don't clear the outdent if the
3114 top symbol state had a reduce-length of 1.??
3116 OK.. that's fixed. Let's get back to the bigger problem.
3120 | simplestatements NEWLINEs
3123 | IfHeadNL IfSuffixBL
3124 | SwitchPart CondSuffixNL
3125 | SwitchPartNL CondSuffixNL
3126 | WhilePart CondSuffixNL
3127 | WhilePartNL CondSuffixNL
3128 | ForPart WhilePart CondSuffixNL
3129 | ForPart WhilePartNL CondSuffixNL
3130 | ForPartNL WhilePart CondSuffixNL
3131 | ForPartNL WhilePartNL CondSuffixNL
3133 ... and some for ThenPart and ThenPartNL
3135 ForPart -> for simplestatements
3137 ForPartNL -> ForPart NEWLINE
3139 IfHeadNL -> IfHead NEWLINE
3141 IfSuffixNL -> IfSuffix NEWLINE
3142 | else Block NEWLINE
3144 SwitchPart -> switch Expr
3146 SwitchPartNL -> SwitchPart NEWLINE
3147 | SwitchPartNL NEWLINE
3148 CondSuffixNL -> IfSuffixNL
3149 | CasePart CondSuffixNL
3150 | CasePartNL CondSuffixNL
3152 CasePart -> case Expr Block
3153 CasePartNL -> CasePart NEWLINE
3154 | CarePartNL NEWLINE
3158 Above looks promising but doesn't quite work.
3159 The "statement" after an "else" must be "statementNONL" because no
3160 further newline is expected, but even then it isn't quite right
3167 scans as: if expr1 : IN stat1 NL OUT IN else if cond2 : IN stat2 NL OUT NL OUT NL
3172 else if cond2: stat2
3174 scans as: if expr1 : IN stat1 NL OUT IN else if cond2 : stat2 NL OUT NL
3176 In both cases there are more NLs than things that need to be ended.
3177 We always was a NL for the starting 'if', and in the first case we need a NL
3178 for 'stat2'. I wonder what that means.
3182 if cond block else block NL
3184 because the state before 'else' is startsline the NEWLINE cannot be shifted.
3185 That seems to mean the NEWLINE must be in the production that starts the line,
3186 so "CasePartNL" etc cannot be used.....
3188 Bingo(??) I change each statement type to be a FooNL, or list thereof, with
3189 FooNL -> stuff and nonsense NEWLINE
3192 But what about that extra NL .... which now seems not to be a problem
3194 Ah-ha. The second (of 3) is ignored because it is indented. All good (for now).
3197 The longest multi-line thing is
3198 For Then While Do Case... Else
3200 Each can be on a new line, or on previous line.
3201 How can Case be handled? I guess they all need to be the same.
3210 ??? That looks awkward.
3216 I should test and see. ... I don't think so. At least not without more
3217 smarts for newline handling.
3220 For Then While Do Case... Else NEWLINE
3224 ForNL Then While Do Case... Else
3225 ForNL ThenNL While Do Case... Else
3226 ForNL ThenNL WhileNL Do Case... Else
3227 For Then While Do Case... Else
3228 For Then While Do Case... Else
3229 For Then While Do Case... Else
3230 For Then While Do Case... Else
3232 more than 64 combinations....
3234 First line is one of:
3240 For Then While Do Case
3241 For Then While Do Case Else
3248 Cases should be easy. A list of caselines, each as list of case parts.
3249 Followed by an elseline which has zero or more caseparts and an elsepart.
3251 I think I need to change how NEWLINE is handled, do minprefix differently.
3252 It is used to ignore stuff when deciding which startsline starts can prevent a
3253 newline from shifting. Review exactly what is wanted there.
3255 What exactly do I do with newlines?
3256 - If a production contains a literal NEWLINE, the head is marked line-like
3257 - forbid shifting NEWLINE when recent starts_line state is not at actual
3258 start of line... but ignore intermediate states based on min_prefix
3259 - record where lines actually start
3260 - ignore if indent since starts-line state
3263 Note that any state where an item starts with a line-like symbol is a
3265 Any state that can reduce to a line-like symbol requires indents to be
3267 starts_line states only affect ignoring newlines and choosing when to
3268 allow shift, as described above.
3271 I could extend 'line-like' to any production containing a symbol that
3272 starts with NEWLINE. The Newlines would work.
3273 Rather than 'min_prefix' I could store "since-newline-or-start' so
3274 that multiple newlines in a production would make sense,
3277 New thoughts. I wonder if they will work.
3279 Change the scanner to produce paired SOL and EOL tokens, where EOL is
3280 much link NEWLINE currently and is delayed by paired IN/OUT.
3281 Also skip blank line, so only get a SOL if there is text on the line.
3283 Now a production needs to be explicit about being at the start of a
3285 Maybe we can even do
3290 statement -> SOL SimpleStatements EOL
3291 | SOL CondStatement EOL
3293 If the grammar requires an EOL followed by an EOL, there must be an
3296 in "if cond block else"
3297 how do we know when the "block" is finished so that the "else" can be
3299 The expansion of 'block' will (possibly) end with a EOL. For "else" to
3300 follow EOL without a SOL, there must be an OUT.
3303 I need to clarify how the scanner must work for SOL/EOL so that I can
3304 write code that works.
3306 SOL needs to be generated when we see a non-space character on a new line.
3307 This is the same time that we need to possibly generate IN, which is in
3309 So at start of line we scan for non-space, then unget and set check_indent.
3310 In check_indent we assume start-of-line and generate SOL after any IN.
3312 EOL needs to be generated after we see a NEWLINE (or maybe EOF) on a
3313 non-empty line. It may be delayed until after indents, so we need to store
3314 it. We delay it until after multiple blank lines, so we always need to
3315 store it. So ->indent_eol[->indent_level] is a delayed EOL, if ->num
3318 I think we need a flag for 'at start of line' which means the line
3319 seen so far is empty. So much like my "non_empty"
3321 OK - much easier to get it right once I've thought it through :-)
3325 This isn't quite working how I had hoped :-(
3326 The "EOL SOL" pair, or more the "SOL else" pair suggests I need a look-ahead
3327 for 2 to recognise if I have an IfSuffix or not.
3328 But I know and LR(2) can be re-written as LR(1) (Did I learn that in uni?)
3331 Statementlist -> SOL SimpleStatements EOL Statementlist
3332 | SOL Ifhead EOL Statementlist
3333 | SOL Ifhead IfSuffix Statementlist
3334 | SOL IfHead EOL SOL IfSuffix Statementlist
3336 So if we see EOL SOL we can wait for else, which leads to IfSuffix, or
3337 something else for StatementList.
3338 But I don't want to allow StatementList to be empty. I can achieve this
3339 but duplicating the above for a StatementList_nonempty. A bit ugly.
3341 Also, this is right-recursive which uses a lot of stack.
3342 I can compress it a bit. By making an IfStat include the following statement.
3343 SL -> Stat | SL Stat
3345 Stat -> SOL SimpList EOL
3348 | SOL IfHead IfSuffix
3350 IfX -> SOL IfHead EOL
3351 IfHead -> if Expr Block
3352 IfSuffix -> else Block
3354 | else IfHead IfSuffix
3355 | else IfHead EOL SOL IfSuffix
3356 | else IfHead EOL Stat
3359 Getting there... (again).
3366 The 'else' pairs with cond2.
3367 There is an EOL after "if cond2: stat1" and then "SOL else"
3368 which looks just the same as
3374 The only difference is an extra OUT IN which we currently ignore.
3376 How can I use the OUT?
3378 SOL IFHead EOL .... OUT IN SOL
3379 and I need the OUT to tell me to Reduce, or to block the Shift of SOL.
3380 But if I simply block Shift when I have an OUT, the SOL IfHead EOL
3381 becomes a Statement which is merged into the StatementList and then
3382 the SOL is Shifted. I need to go all the way to make that Statementlist
3384 If I hold out with the OUT longer until reduce_size!=1
3386 IfHead else IfHead .... EOL
3387 cannot shift the EOL
3389 Maybe I need to use min_prefix, but I really don't like that.
3390 Need to think this through.
3392 Well, I have it working.
3394 If suppress shift if there are outs EXCEPT for TK_eol. Why?
3395 Also I use the Bstatementlist indirection
3396 and don't cancel the out if reduce_size==1
3398 It's a bit clunky. Can I justify it?
3400 I'd like the tokens to be different. With
3405 The SOL before the else is ignored becuause we don't expect SOL there.
3406 Trouble is in the problem case, SOL doesn't get ignored until later.
3408 Can I *only* prevent a shift of SOL when it is unbalanced?
3410 So: prevent shift of SOL if there is an uncancelled out, otherwise it will
3411 be assumed to be at the wrong level.
3412 Better, but not completely happy...
3414 14feb2021 valentines day
3416 What if the rule for cancelling indents was that the cancel couldn't cross
3417 a starts-line state. How would that work out?
3420 I didn't have time to pursue that, and now I'm a lot less convinced.
3422 New idea: Allow IN and OUT in the grammar, and selectively ignore them
3423 like we do with SOL EOL.
3424 That was, OUT could force a reduce which could not them be extended, so that
3425 whole issue of recursive productions becomes moot.
3427 When are indents relevant? Maybe we have starts-block states which
3428 expect IN, and with ignore IN if there is an indent since the last
3431 block -> : IN statementlist OUT
3432 | : simplestatements
3433 would ignore IN until we hit the :, then IN becomes relevant.
3434 If we don't see and IN it must be simplestatements. Do we allow IN
3435 there-in? Probably not. It would look confusing.
3436 But if we get an IN, then we start ignoring INs again.
3438 The OUT absolutely must balance the IN, so we ignore OUT whenever the matching
3441 We still refuse to skip OUT if the matching IN is too far away. Must be in top
3444 Clarify handling of OUT when the IN was ignored...
3445 A linelike production that started before the IN must not reduce until
3448 Any production that started after the IN must reduce before the OUT.
3449 We don't force it to reduce, we flag an error.
3450 So if we reduce some symbols which contain more OUT than IN, that is
3454 I need to track in/out carefully so they match properly and I ignore the right
3456 IN is ignored whenever SOL/EOL would be. OUT is ignored precisely when the matching
3458 I also want to track all ins and outs until they cancel in a reduction.
3459 It is only at the reduction step that we can determine if an error occured.
3460 An error is when a symbol contains nett negative indent.
3461 So we can just count indents in each symbol.
3462 Some in/out are within symbols, possibly IN and OUT. Others which are ignored
3463 exist between symbols. A frame holds (symbol+internal indents),(state+pending indents).
3464 To track which OUT to ignore we need a depth count and a bit-set.
3465 If a bit is set, then the IN was ignored so the OUT must be too.
3466 If clear, the IN was shifted, so the OUT must be too.
3468 I need to get indents_on_line right.
3469 Previously I tracked them before this frame. I don't know why...
3470 I want 0 when starts_line
3473 OK, new approach is looking really good. Need to make sure it isn't too hard
3475 Tricky area is multi-line statements that don't *have* to be multi-line.
3477 We cannot reduce "SOL IfHead EOL" to a statement as we cannot tell if it
3478 is complete until we shift the SOL and look for an "else".
3479 One option is "statement -> SOL IfHead EOL statement | SOL IfHead EOL IfTail"
3480 So "statement" is really a sublit of statements.
3481 Easy in indent_test, what about in ocean?
3483 There are lots of parts that can be on a line:
3484 if, else, for, then, while, do, switch, case
3486 if and while can be "expr block" or "block" and the thenpart/dopart
3487 else can be "block" or "statement"
3488 then is optional in for, request if some if
3490 ifpart -> if expr block | if block then block | if block EOL SOL then block
3494 ifpart -> if expr block EOL SOL | if block then block EOL SOL...
3496 What if I support backtracking over terminals? So if I cannot shift
3497 and cannot reduce, I back up until I can reduce, then do so?
3499 Then I can shift the SOL and if there is an else, I'm good. If not I back up
3500 and reduce the statement
3502 statement -> SOL simple EOL
3504 | SOL ifhead EOL SOL elsepart EOL
3505 | SOL ifhead elsepart EOL
3509 statement -> simple EOL
3511 | ifhead EOL SOL statement
3512 | ifhead EOL SOL iftail
3515 | switchead casepart
3518 ifhead -> if block then block | if expr block | if block EOL SOL then block
3519 iftail -> else block | else statement
3521 whilehead -> while expr block | while block EOL SOL do block | while block do block
3522 whilepart -> whilehead EOL
3523 | whilehead EOL SOL statement
3524 | whilehead casepart
3525 | whilehead EOL SOL casepart
3527 casepart -> casehead casepart
3528 | casehead EOL SOL casepart
3529 | casehead EOL SOL statement
3531 casehead -> case expr block
3534 I've had a new idea - let's drop SOL! Now that I have IN, it isn't really needed.
3535 We can assume SOL follows EOL or IN .... maybe.
3536 Problem is if we want to require IN/OUT around something that is not line-oriented.
3537 Might that ever matter?
3538 No, I don't think so.
3541 Maybe this make it really really easy.
3542 We don't mark different sorts of states, and we only track which indents were
3546 IN never causes a reduction, it is either shifted or ignored.
3547 An EOL is ignored if the most recent IN was ignored, otherwise it is a normal
3549 An OUT is similarly ignored if the matching indent was ignored. It also
3550 cancels that indent.
3554 .... no, it seems to work.
3556 So: back to the ocean grammar
3558 statement -> simple EOL
3563 | switchead casepart
3566 ifhead -> if block then block | if expr block | if block EOL then block
3567 iftail -> else block EOL | else statement
3569 whilehead -> while expr block | while block EOL do block | while block do block
3570 whilepart -> whilehead EOL
3571 | whilehead casepart
3572 | whilehead EOL casepart
3574 casepart -> casehead casepart
3575 | casehead EOL casepart
3578 casehead -> case expr block
3583 An ifpart can be "if expr then simple ;"... no it cannot...
3584 But the problem was that some forms for a head with an optional tail
3585 must end EOL, other forms need not.
3586 But the whole must end EOL.
3588 So: do we put EOL at end of 'statement' or end of IfSuffix
3590 Let's try assuming it is at the end of 'statement'
3591 So IfSuffix can assume an EOL follows
3592 So CondStatement can too
3593 So an ifhead either 'may' or 'must' be followed by an EOL.
3594 If may, it is followed by IfSuffix which is empty, or starts OptEOL
3595 If must, it is followed by empty or
3596 No.. this isn't working for me.
3598 Let's try assuming that a CondStatement ends with an EOL.
3599 So an IfSuffix must too. and it cannot be just EOL
3600 If an ifhead that must be followed by EOL, it is either EOL or EOL IfSuffix
3601 If it may be, then EOL or IfSuffix
3604 ForPart ThenPart SwitchPart are ALWAYS followed by something, so can end
3605 EOL or not, as suits
3606 WhilePart IfPart CasePart might be the last thing so each option must
3607 end with a SuffixEOL which ends with EOL or SuffixOpt which might not
3609 What do I want to do about
3613 case value : statement
3616 though for the latter I can and use 'then'.
3617 For 'else' I don't need the ':', but it wouldn't hurt.
3619 Problem is: do I insist on a trailing newline or ';'
3621 case foo: bar case bar: baz
3622 would be legal, but hard to read, as would
3623 if cond : stat1 else stat2
3624 which is probbly error prone.
3632 That looks like an indented block, but is really indented lines.
3633 So it is probably a mistake.
3634 So allow switch expr : or ';' at the end
3636 Whatever happens after "switch expr" must work after "while expr block"
3639 If first case is not indented, none of them may be
3640 If first is: it happens in an IN/OUT block, so again all the same
3642 Can I implement that? Can I have IN after a non-terminal somehow?
3643 When I see an IN, I could reduce as long as go_to_cnt == 0.
3644 That might help after an OUT, but not after EXPR,,
3646 Or: look at next symbol. If it can be shifted, we ignore the IN.
3647 If not, we reduce and try to shift the IN again.
3649 Also: need to mark IN as ignored when popped off during error recovery,
3650 and maintain stack when discarding during error recovery
3654 { IN statements OUT }
3655 { simplestatements }
3659 : simplestatements NL .... or ';'
3661 In other contexts I have
3662 for simple; statements; then simple ; statements ; while expr:
3664 I currently require a ';' or newline before "then" or "while"
3666 Interesting other cases are:
3668 case expr : simplestatements
3669 while expr : simplestatements
3671 For 'if' I currently have "if expr then simplestatements"
3673 Because of 'for' and 'then' I don't want to require ':' before simplestatements.
3675 while expr do simplestatements
3676 But what do I do for 'case' ??? I really want the ':' there.
3677 So I should use it for 'if' and 'while'
3678 'for' could be followed immediately by IN, as could then and even if/while
3679 So the ':' comes after an expression.
3682 Problems with the idea of only using : to come after an expression.
3683 1/ "else" looks wrong compared to Python, but may I can get used to that
3684 2/ with "for" it would be simple statements, with "while" it would be expr
3685 if there was no indent. Do I need different things to look different?
3686 If statements always follow ':', the "for" and "then" always need a ':'
3687 for: a=1; then: a = a+1; while a < 10:
3689 In C there is no difference, but I want a difference..
3692 Arg... I'm not struggle with parsing concepts this time, I'm struggling with code.
3693 I want to add an "EOL" symbol to the grammar as a special terminal.
3694 It is like "NEWLINE", but handled a bit differently.
3696 In parsergen it is just another terminal symbol, but it mustn't get added
3697 to the "known" list. Currently all terminals from TK_reserved are added
3698 to "known". Maybe if I give it a number that is after the virtual symbols