X-Git-Url: https://ocean-lang.org/code/?p=ocean-D;a=blobdiff_plain;f=twod;h=251f794b95d73de42239d48cfb34903d28e1efdc;hp=de0e302f196ea0b582dc81c82916524261b5fa53;hb=fb71b26483853b4e8198c3b5ca1c3f2ec554a800;hpb=5a245ff2b74c48d2a50cc8d6f4529ef5abb84837 diff --git a/twod b/twod index de0e302..251f794 100644 --- a/twod +++ b/twod @@ -1448,3 +1448,2251 @@ The NEWLINE after b is not Ignored in the expression, Maybe we want: Open -> { | NEWLINE Newlines { + + + I have a problem. + I want + + else: a := b + + to parse the same as + + else: + a := b + + and for the last newline to close the elsepart. + But the latter has 2 newlines while the former only has one + and I don't have any obvious justification for ignoring either. + I think it is in the Newline before the OUT that is extra. + + I could drop the newline before the OUT, assuming the newline + separate things, and the OUT will force any reductions needed. + But then we have fewer newlines reported than actual. + (Same imbalance happens with multiline comments and strings, so maybe + that is OK). + Another way to look at it is that the newline following an IN is discarded + (or always ignored) and not moved to after the OUT. + So (maybe) the newline at an IN or OUT is reported *after* the IN or OUT. + so + A + B + C + D + + Would be A IN NL B NL C OUT IN NL D OUT NL + + The parser always ignores the NL after an IN but uses other + NL to reduce to a single symbol (if possible) + OR maybe it doesn't ignore (unless not line-like context) and + lines are preceeded by NL, not followed by them... + No, followed is usually good.. though separated is better... so preceed!! + + Ok, this isn't working. + A construct + if cond: + pass + + cannot be reduced to a Statement until we know what comes next, and it + might be separated by several newlines. + So the newlines need to be part of the Statement. + But that means we cannot have newlines at the front of a statement. + But that was the point... + + Maybe a Statementlist is a series of StatementNL followed by a Statement + + We allow + StatementNL -> Statement Newlines + as a general catch-all, but when we have something like if, or anything + with an optional tail "else:" or "case:" + We say: + StatementNL -> if Expression Block Newlines + But that would produce a conflict with + Statement -> if Expression Block + As a newline could either trigger a reduce to Statement, or a shift. + Obviously we shift, but maybe we use precedence to force the point. + + Can we handle 'else if' ... + IfStatementNL -> if Expression Block Newlines + | if Expression Block else IfStatementNL + + ... I'm contemplating having the parser duplicate NL as necessary, so + that + if test: action + can appear to be followed by 2 NL, one to terminate the 'action' statement + and one to terminate the whole 'if'. + This might mean I need to extend when NL are discarded - to ensure they + don't get duplicated too much. + 1/ if state does not permit newlines, discard + 2/ else if I can reduce symbols all since start of line do that. + 3/ else if can shift, do that + 4/ else if only one symbol since newline, discard. + 5/ else ERROR + + This means that we cannot recognise multiple newlines + or does it. + If we shift a Newline, that is since_newline=0; + If we reduce that to Newlines, that is still since_newline=0 + If 4/discard only applies when since_newline==1 -- we win. + + Currently since_newline essentially means the symbol contains a newline. + So 'statements' usually does, but 'statement' doesn't. + When we shift the newline and reduce, it all becomes since_newline=0. + That is when we want to ignore newlines. + + 15jun2019 - still working this through.. + + Normally the parser does + shift or else reduce or else error + exceptions are TK_in which is simply recorded and + TK_out: reduce until there is a TK_in in scope, then cancel, else error + TK_newline: + if not newline_permitted (indent since last starts_line state) + Discard + if can Reduce to at most start-of-line, reduce + if can Shift, duplicate and Shift + if can Reduce, do so + if 0 since newline, Discard + + since_newline needs to be changed a bit. + A TK_newline token *isn't* zero, it is N+1. The token *after* + the NEWLINE is zero - so that + + Arg. I'm struggle with that fact that having shifted a newline, + we are both at the end of a line, and at the start of the next. + When I see a newline, I want to reduce until the end of line + is in the same state as the start of that line. + + Maybe I do want newline to be a separator. + What if I don't actually include the newline in the grammar, just like in/out. + Instead we mark select productions as lines. This is like marking + for precedence. + A marked production is reduced when a newline is seen providing it won't + contain any indents. + So: if the reducable item in a state is marked, the start gets marked. + When we see a newline, if the state is marked and the reduce size does not + exceed since_indent, we reduce. Otherwise we discard. + No... I need an error condition too. + So I need the state to have a starts_line marking, when a new item is marked. + + So: + productions can be marked $$NEWlINE which flags the production as line-like + a state with an item with DOT at start of a line-like production is starts_line + a state with an item with DOT at the end of a line-line product is ends_line + We track indents as before. + When we process an indent or newline, we set since_newline to 0 + When we see a newline we do one of: + if not newline_permitted, we discard + if top state starts line, we discard + else reduce or else error + +No..... + A production -> { statements } + needs to ignore newlines either side of statements. + It is a multi-line production - newlines don't matter. + Maybe there are several sorts of symbols: + - in-line: must be broken across lines unless indented + - line-like: is terminated (reduced) by a newline + - multi-line: newlines are ignored + + We tag symbols which are line-like. + Any symbol which can derive a line-like symbol is multi-line + Any other symbol is in-line. + + So SimpleStatements, ifhead, elsepart, casepart etc are linelike + +$line SimpleStatements IfHead .... + + A state that is at the start of a linelike symbol starts_line + Any state in a multi-line production starts_line + + if tos starts_line, newlines are ignored. + else if there is an indent since the starts_line, newlines are ignored. + But if there are symbols since starts_line, we have to reduce until are + are in a starts_line start (or can see an indent). + +No.... + Block -> : Statementlist +is none of thse. It must be reduced by a newline, but isn't entirely line-like. + Block -> { Statementlist } +is multi-line + +but maybe Block here is neither. It only becomes linelike when it +terminates a Statement, which is linelike. Or terminates an ifHead + +Should this be legal? + a:=b;pass if something +probably not. I want to require at least ';' or NEWLINE. +That means I need to include NEWLINE in the grammar. + Statements -> SimpleStatements ; Statement + | Statements Statement + | Statement + + if cond: if cond: a:=b +NEWLINE reduces this down to IfHead +IfHead -> IfHead NEWLINE +Statement -> IfHead + | IfHead ElsePart + +ElsePart -> else BLOCK + | else IfHead + | else IfHead ElsePart + | ElsePart NEWLINE + + +Statements -> Statements Statement + | Statement + + +if cond { statments } else { statements} +but not +if cond: statements else: statements + +so :statements must expect a NEWLINE but then + if cond: if cond: statement +expects 2 NEWLINEs. +Maybe + Block -> : IN statments NEWLINE +if there is no indent, we synth one which triggers an OUT NEWLINE pair. +This could be automatic. + If a linelike is followed by a newline, we synthesis an IN before it. + +That requires a hack to the scanner: Synth Indent + +What if + + Block -> { Statements } + | : Statements NEWLINE + | : SimpleStatements + +Then + if Expr Block + +might not end in a NEWLINE so else could come immediately. Is that OK? + if expr: a:=0 if expr +must be forbidden. That requires a newline. + + Block -> : IN statements OUT NEWLINE + +In marks state as 'needs_indent' If an indent arrives, fine. +If something else, we record that next newline (After balanced in/out) +must synth extra out/newline. + + Block -> { Statements } + | : statementblock + + statementblock -> Statements $$line + +$$line means it must be reduced by a newline. If something else tries, +it is an error and we skip to newline. +It also strips everything but NEWLINE from the (effective) lookahead +to avoid reporting conficts, as those things will never be shifted. + + IfHead -> if Expr Block + | IFHead NEWLINE + Block -> { statements } + | : statements $$line + + IfStatement -> IfHead + | IfHead IfTail + | IfHead else IfStatement + IfTail -> else Block + | IfTail NEWLINE + + +------ +20jun2019 Happy 87th birthday Dad. + +I'm not convinced about $$NEWLINE + + else: simplestatementlist + +should be able to parse simplestatementlist without a newline, and +use the newline to close the if/else. +where as + else: + statementlist + +has a newline to close the statementlist and another to close the if/else. +But can the LR parser tell the difference? +It only sees that newlines don't forcibly reduce the else: +So when it sees the newline at the end of simplestatementlist, +it cannto shift because there is a sub-line thing that can be reduced. +So this becomes elsepart before the newline is absorbed. +Whereas in statementlist, the newline can be shifted creating simplestatementline. + +What about + if cond: if cond: statement + +Again, the newline cannot be shifted while we can reduce + +But.... how does conflict analysis know that an 'if', for example, is not +permitted after simplestatementlist? + +Ahh.. This is exactly what $$NEWLINE is for. Maybe it should be $$OUT. +Either way, the grammar is ambiguous and relies on newlines or indentation to +close the production, and this fact needs to be explicit. +Requiring OUT is probably best as it means + if cond: + statements + else: + statements + +works even though there is no newline after the first statements. +Here I want the 'statements' to be closed by OUT, but the whole to +be closed by NEWLINE. +So maybe I need both $$NEWLINE and $$OUT ?? + + +$$OUT makes lots of sense. It is exactly how we expect :statements to +be closed - where we allow NEWLINE to have the same effect. + +$$NEWLINE is good for closing an complex if or for etc. It means that +nothing else can be on the same line - allowing for indents. +How do we implement that? +Any production in the grammar that represents a full line but doesn't +end with a newline should be marked $$NEWLINE +This head of that production should recursively absorb NEWLINEs. + +I'm not yet clear on exactly the difference bwetween $$OUT and $$NEWLINE. +I would put $$OUT after + block -> : Statements $$OUT +and $$NEWLINE at the end of a statement that must end a line + condstatement -> Ifpart IfSuffix $$NEWLINE + | constatement NEWLINE + +Maybe I need a worked example. + while conda: + if condb: if condc: action + pass + +So after action there is NL OUT NL pass +The NL sees that it can reduce, and the if allows the NL to reduce it so + while COND : IN if COND : statement [ NL OUT NL ] +again the NL can reduce. Note that we *don'* absorb the NL in the statement + while COND : in statement [ NL OUT NL ] +Now we can shift the NL + while COND : IN statment [ OUT NL ] +Now the OUT forces a reduction + while COND BLOCK(in) [ OUT NL ] +Now the out is cancelled + while COND BLOCK [NL] +and the while is reduced. + +So the $$NEWLINE must always see a newline (or $$EOF) +An $$OUT must see an OUT or a NEWLINE (if there was no IN) + +$$OUT causes the LA set for items with the production to be empty. +It is never credible that anything will be shifted so any apparent LA +contents can be ignored. +The state when a $$OUT is reducible has a recedence higher than any terminal, so +nothing can be shifted and no completion should be possible. +The state when a $$NEWLINE is reducible is much the same. + +Maybe I don't want NEWLINE in the grammar, only $$NEWLINE?? +How would we recognize a blank line? + command -> $$NEWLINE +?? +We would need a new rule for discarding newlines. +e.g. when the top-but-one state is start-of-line we discard and mark the top +state s-o-l. That stops us discarding a newline until it reduces something that is at the start of a line.... + +1/ if there is an indent since the last start-of-line state, discard NEWLINEs +2/ if .... + +Q: When is a NEWLINE an error? +A: when it isn't ignored and we cannot reduce and + top or top-but-one state isn't starts_line.?? + +So we need extra state info and extra frame info. + +State has: + - starts-line - is at start of a $$NEWLINE production + - ends-line - is at end of unreduced $NEWLINE production + - ends-indent - is at the end of a $$OUT production + - min-prefix - how far back a 'in' can be and still cancel + +Frame has: + - indents - count in or after that sym + - line_start - is the was a line start (IN or NEWLINE) immediately after + the symbol + - newline_permitted: no indent since start-line + - since_indent: number of frames where indents==0 + - since_newline: number of frames where line_start==0 + +If we see a NEWLINE then: + if ! newline_permitted, discard + elseif can reduce and reduce_count <= since_newline - reduce + elseif since_newline <= 1, and state.starts-line, discard and record line_start + else error + +If we see an IN + increment indents, set line_start + +If we see an OUT + if reduce_size <= since_indent, reduce + if min_prefix >= since_indent, cancel + else error + +How does error handling work? +Normally we pop states until we can shift ERROR +Then we discard tokens until we can shift one. + +However we need to do something different for IN OUT NEWLINE. +For IN, we simply increment a counter +For OUT we decrement if it is positive. + If it is zero and the state ends-indent, then we are synced. + If it doesn't, we need to pop more states until we have an indent to cancel. +For NEWLINE if the state ends-line or ends-indent and ...something... we are synced. + else we skip it?? + +... no, that doesn't work because I cannot see a way to describe an optional newline. + +Let's try with just $$OUT which requires OUT or NEWLINE... +We put $$OUT on productions that must be closed in a 2-d obvious way. +So they can be at the end of a line or at the end of an indente block. +So + : statements $OUT +means the next line after the : cannot be indented. +However + Block -> : statementline | : statementblock | { statements } + statementblock -> statements $NEWLINE +means I can have else indented, or on same line as single statement + + if cond: a = b; else: b = a + if cond: + a = b + else: + b = a + +The whole 'if' needs a $NEWLINE marking to ensure a following statement isn't +indented. +So implementation is almost exactly what I have: + - if anything else is lookahead when reducing that production, it is an error. + - remove non-newlines from lookahead in items + +But I don't think $$OUT is quite what I want to call it. +That doesn't quote cover end-of-line possibilities. +Maybe allow $$NEWLINE or $$OUT but with same behaviour. + +.... still not there. +Another way to satisfy a $$OUT reduction is for it to already look right. +So: No indents and at start-of-line + +But that upsets the modification to look-ahead as we can no longer assume +the next token. +I think this might be more like a precedence thing?? +Without look-ahead modification, the first token of a statement can be shifted + before a newline forces a reduction.... + +Maybe I do need two sorts productions. + $$OUT requires an out/newline to reduce it. + $$NEWLINE follows either $$OUT or NEWLINE and requires start-of-line and no indents. + or $$OUT or $$NEWLINE + +Or does it matter. Over-modification of the look-ahead suppresses warnings, but +doesn't affect the parse. +Will we get warnings anyway? + + +-------- +Are left-recursive symbols in a non-final position always bad? + +Left-recursive symbols cannot be closed by forcing a reduction. +So if one starts in an indented region (in which newlines are ignored) +it could continue afterwards - unless we make that an explicit error somehow. +If they appear at the end of some other production, that one will (maybe) +be reduced as well so (maybe) no problem... + +if cond: + a() + b() + c() + +is weird and I want to forbid it. Al that is between b() and c() is +NL OUT IN. NL closed b(), so it is just OUT IN +So I do want the statementlist to close. + + a := + 1 + 2 + * 3 + 4 + +is very wrong. How much can I help? +The OUT will reduce "1 + 2" which will then become + ((1 + 2) * 3) + 4 +which would be highly confusing. +So something about this must be disallowed. +Maybe when newlines are ignored, OUT doesn't force a reduce?? +I can make it an error by having Expression reduce to something else. + +Do I want an error even for + 1 + 2 + * 3 + 4 +?? + +I could achieve that by adding extra checks when we SHIFT at +start of line. +If we could reduce tokens since previous SOL, then we have 2D ambiguity. + + 1 + 2 * + 3 + 4 + +That is just as ambiguous, but we cannot reduce anything. +When we see the second '+', the reduction crosses a line-start but doesn't +result in a line-start. + +So: a reduce that doesn't contain an indent, but does contain a start-of-line +must reduce to that start of a line. + +This means we need to keep the start-of-line when we "IGNORE" a newline. + +Can I use this sort of logic to avoid the need for the extra reduction, +or for the $$OUT markings?? + +1/ The point of extra reduction is to avoid consuming more after an OUT or + ignored NL. + if cond: + a() + b() + c() + + must be an error. The OUT reduced a()b(). The stack is then + if cond : statements(n1) . IN Ident + The first indent is gone. . There is no error until we see all of c() so + if cond : statements(n1) simplestatement(i) . NL OUT NL + + Is it problen that the simplestatement is indented? + if cond : statements(n1) statement(i) . OUT NL + + Q: is it an error to reduce a sequence containing an (uncancelled) indent? + +2/ The $$OUT markings guard against exactly a reduction containing an uncancelled IN. + +So maybe I have two new rules. + + 1/ a reduction must not include any uncancelled indent. pop() must return 0. + 2/ a reduction the contains an unindented start-of-line must begin with start-of-line. + So when we cancel an indent, we also cancel line starts since there. + +One other value of $$OUT is that is avoided conflicts - most symbols could not +be shifted. That should have only applied to $$NEWLINE(!) and doesn't apply +at all if I drop the marking and use internal rules instead. +So how do I avoid reporting conflicts? + +Really, there shouldn't be any conflict as NEWLINE should be expected. +Let's go back to that idea. + +1/ A linelike thing MAY start with Newlines and MUST end with a NEWLINE +2/ A SimpleStatement is not linelike and doesn't include and Newlines +3/ if condition : SimpleStatement + is a SimpleStatement. + +4/ When we see the NEWLINE after "if condition : SimpleStatement" we have a shift/reduce + conflict as we could SHIFT to make a complex statement, or reduce the whole thing + to a SimpleStatement. + Default action is SHIFT but in this case we want REDUCE - due to precedence? + + However when we see the NEWLINE after "if condition :IN Simplestatement" + we cannot REDUCE as there is an cancelled indent, so we have to shift. + + But when we reduce, we only want to Reduce to IfHead so that an 'else' can appear + on the next line. + + If we see IN .. just continue. + +What do I need to do: + + 1/ Change grammer to expect blank-lines before and to have a NEWLINE at the end + of any line-like thing. + This requires IfHeadNL and IfHead. ditto for switch, while, then ... + This get complex with + for a:=0; then a += 1; while a < 10: + which could have several newlines + for a:=0 + then a+=1 ; while a < 10: + + ForPart -> for simplestatements ; + | for simplestatements NEWLINE + | for Block + | Newlines ForPart + + ThenPart -> then SimpleStatements ; + | then SimpleStatements NEWLINE + | then BLOCK + | Newlines ThenPart + + 2/ disallow Reduce when embedded indents - report ERROR + 3/ disallow Reduce when embedded start-of-line. + 4/ TK_newline uses these rules to decide when to force a reduce. + +A/ A parser symbol that starts after an IN must end before the OUT +B/ A parser symbol that starts before an IN must end at-or-after the OUT + only if if the symbol is not line-like ??? + +C/ A parser symbol that starts after a line-start and before an indent must end + by the end of line +D/ A parser symbol that starts at a line-start must end before the end-of-line, + or at a subsequent end-of-line. + +A is satisfied by forcing a reduce on OUT and reporting error if IN cannot be cancelled +B is satisfied if we report an error if we try to reduce an uncancelled IN +C is satisfied by forcing a reduce *after* shifting NL and reporting ERROR if + min_prefix exceeds the line +D is satisfied if we report an error when reducing at eol crosses a NL and doesn't start + at start-of-line. + +C is interesting - do we reduce *after* shifting NL?? I think we do, yes. + +So: when can I suppress conflicts, and how do I handle reduce/reduce conflicts? + +I need to be sure that a line-like ends with an unindented newline. +I can trigger an error when that doesn't happen, but I want more. +I want to encourage it to happen. So if the grammar allows a NEWLINE it +will be shifted in, but if we have already seen an OUT, we ignore the NEWLINE +rather than trigger an error. +Also, I need to not report shift/reduce conflicts on whatever comes next. +i.e. if + a -> b c +and c can end with a, then both c and a can be followed by the same things. +This is a conflict. If c (and a) end with NEWLINE we declare the conflict +resolved. + +An IfHead might not end with a NEWLINE. So to make a statement we need +to follow it with an optional NEWLINE. Let's see if we can make that work. + +What is a SimpleCondStatement? It has no blank lines or unindented breaks.. + + ForPart ThenPart WhilePart CondSuffix OptNL + +We don't need a distinct Simple class!! If it didn't start at SOL, +then unindented NEWLINEs must be terminal Wooho!! + +This requires that "Statements" doesn't insist on following NEWLINE + +A SimpleStatement can be followed by a ';', a Statement cannot. That +different is still needed. +So + SimpleStatements -> SimpleStatement | SimpleStatements ; SimpleStatement + +SimpleStatements can end with a ComplexStatement and no NEWLINE. +ComplexStatements must end with a NEWLINE after each statement except the last + +Each Part (including SSlist) end with arbitrary NEWLINEs. These will +only ever be at the same indent level. +A ComplexStatement must be separated from next by a NEWLINE. +So if the final non-empty Part does not end with NEWLINEs, how do we require one? +Maybe not.. + +What if a Part doesn't end with NEWLINE ever, but can start with them + +CondStatement -> IfPart Newlines + | IfPart IfTail... + +I think I need a CondStatement which doesn't end with a newline and a +CondStatementNL which does. Then anything that can end a cond statement +must come in two versions. + IfPart ElsePart CasePart WhilePart CondSuffix + +If we expect the non-NL, we accept the NL but not vice-versa. + +------ +problem. +in + if cond: + cmd1 + cmd2 + +the 'if' that started before the indent must finish at/after the indent. +But in + if a = b or + c = d : + do something +The Expr that started before the first indent may finish well before the indent finishes. +I think this is because Expr is not linelike but 'if' is. + +So I don't want an error when reducing if there is an indent, unless the new top start +starts_line +... + +OK, I'm up to the part where I need to hide conflicts that I can automatically resolve. +I have: + + State 7 has 2 (or more) reducible items + IfHead -> IfHeadNL . [25] + IfStatementNL -> IfHeadNL . [27] (2Right) + State 35 has 2 (or more) reducible items + IfTailNL -> else IfStatementNL . [32] (2Right) + IfStatement -> IfStatementNL . [30] + +I need to clarify the rules that I'm working with. + +1/ Statements might not end with a NL but as it is linelike... + +2/ IfHeadNL IfStatementNL IfTailNL all end with a NEWLINE + IfHead IfStatement IfTail might not (but they may) + +Why did I want IfHead -> IfHeadNL?? +Because I might have + if cond: + action + + + else: bar + +No, that is still an IfStatementNL. Once there is any NL, we cannot fit it +on a line. + +Hmm... the FooNL pattern is getting out of control. +Why do I need this again? + +Because when "if cond : statements" is followed by a NEWLINE I need to +hang on to the parser state - not reducing to statement - until I see an 'else' or don't. + +If I do see the else, the difference doesn't matter. If I don't then I need +to know if I have a NEWLINE. + +So I could have + IfHead -> if Expression : Statements + IfHeadNL -> IfHead Newlines + + IfElse -> IfHead else + | IfHeadNL else + + Statement -> IfHeadNL + + But I want "if Expr then SimpleStatements" + where SimpleStatements can end with "IfHead" +So when I see: + + if foo : if bar : baz NEWLINE + +That NEWLINE mustn't turn "if bar: baz" into an IfHeadNL + We need to first turn "if bar: baz" into a SimpleStatement, then + "if foo : SimpleStatement" into an "IfHead", but NOT into a SimpleStatement. + +Arg. I might not know to reduce something until I've seen an IN. It is the +'else' + +What if an Statement *always* ends with a newline. +So + if Expr : Statement +also ends with a newline and can be a statement +But if there is an IN after ':' the newline is hidden. +So that doesn't work. + +What if a NEWLINE absolutely has to be at the top level. +If a symbol contains a NEWLINE, then it must be at the start of a line, +possibly indented. +So if it isn't indented, it mustn't contain a NEWLINE - no NEWLINE will get shifted in. + + Statement -> if Expr : Statements + | SimpleStatements + +What does Statements look like? It must end a NEWLINE + Statements -> Statements Statement NEWLINE + | Statement NEWLINE + | Statements NEWLINE + | SimpleStatements + + SimpleStatements -> SimplePrefix ; Statement + SimplePrefix -> SimpleStatement + | SimplePrefix ; SimpleStatement + + +01July2019 + I think I have a very different approach - it incorporates a lot of + the ideas so far and is maybe better. + + From the top: + + We have a simplified SLR(1) grammar where each state has at most one + reducible production. We don't have an action table, but use the goto table + to decide if a terminal can be shifted. If it can, we do. If it cannot, + we reduce or trigger an error. + + Onto this we add handling for IN/OUT and NL. NL can appear int the grammar, + IN/OUT cannot. + + Any non-terminal which can derive a NL is deemed "line-like". + Such non-terminals will normally appear at the start of a line - possibly indented. + These non-terminals can have some productions that have a NL (usually at the end) + and some that contain no NL. + If a non-terminal appears other than at the start of a line then no NL will ever + be shifted into it, so a production without NL will be used. + If it does appear at the start of a line, then any production can be used, though + it must end up ending in a NL. + + The above paints an incorrect picture of how LR parsing works. At any given + time you don't know what non-terminal is being matched, so we cannot exclude + NL based on the non-terminal. We only know what parser state(s) we are in. + So: any parser state which is at the start of a line-like non-terminal is + flagged as "starts_line". + Also, in each start we store a "min prefix" which is the minimum non-zero number of + symbols before "dot" in any item in the set. This given a sense of where we are + in the parse. + + If min_prefix if the top state is less than the number of symbols since start-of-line, + then we will not SHIFT a newline. + + Indents (IN) are recorded after the symbol they follow. If there is an IN + since the most recent starts_line state, the any NL is ignored. + An OUT will cancel the most recent IN, providing is in the top min_prefix symbols. + If not, we need to reduce something first. + + So when we see an OUT, we reduce until we can cancel. + When we see a NL, we reduce until the min_prefix reaches at least to the + start of the line. Then we can shift the NL. + After shifting the NL, the whole line should be reduced. + + When a line-like non-terminal produces a sequence that *doesn't* + end with an explicit NEWLINE, the grammar analysis ensures that + nothing can be shifted in after the end of the production. This + forces it to be reduced into the non-terminal. + + For example + simplestatement -> var = expr | print expr + simplestatements -> simplestatement | simplestatements simplestatement + SSline -> simplestatements NEWLINE | simplestatements ; condstatement NEWLINE + + statement ->.. + | simplestatements NEWLINE + | simplestatements ; statement + | if expr then statements NEWLINE + | if expr then statements Newlines else statements NEWLINE + + When a non-terminal is explicitly followed by a NEWLINE, it is line-like + also if it contains a NEWLINE or linelike, it is linelike. + + + ifhead -> if expr : statements + | ifhead NEWLINE + + iftail -> else : statements + | else ifstatement + + ifstatement -> ifhead + | ifhead iftail + + statement -> simplelist | ifstatement | Newlines simplelist | Newlines ifstatement + + statements -> statement | statements statement + + A line-like that contains newlines must be reduced by OUT or NEWLINE. + + How can I know that a statements can be followed by else in + if cond : statements else: statements + or + IfHead IfTail + bit not by 'if' in + statements statement + + Maybe I could have a $sol token. + If at $sol, and cannot shift then try shifting $sol.. + then + statements -> statements $sol statement + or maybe $eol is better, then we can have NEWLINEs start start of statement. + OR maybe either... $linebreak is shifted if previous or next is NEWLINE + An IN doesn't allow a LINEBREAK. + +Just to repeat myself: + Arg. I might not know to reduce something until I've seen an IN. It is the + 'else' + +After above, new approach: IfHead and Statement have NL versions, nothing else does. +MAybe fixed the above... + +So + + StatementNL -> Statement NEWLINE | IfHeadNL + Statements -> StatmentNL | Statements StatementNL + StatementList -> Statement | Statements + + Block -> : Statements | Open Statements Close + + IfStatement -> IfHead | IfHead IfTail | IfHeadNL IfTail + IfHead -> if Expr Block + IfHeadNL -> IfHead NEWLINE | IfHeadNL NEWLINE + IfTail -> else Block | else IfStatement + +Close, but the IfHeadNL in "else IfStatement" cannot accept newlines. +What if + IfHead -> if Expr Block | IfHeadNL else IfHead | IfHead else IfHead + +No, I think I have to take a totally different approach. + +IfPart elsepart switchpart whilepart etc are all syntactically valid +as stand-alone statements in the base grammar. +We use the code to fail a stand-alone elsepart the isn't preceeded by an ifpart, whilepart or casepart. + +So statements never contain newlines, only the statement-list does. +If puts a NEWLINE at the end of each statement + statementlist -> statement | statementlist NEWLINE statement +statement can be empty string, thus allowing blank lines and a NEWLINE at the end, +which the parser will require. + + +[[ thought experiment - interestibg, but gets unwieldy with + more complex statements +statements -> simpleline NL statements + | ifhead NL iftail NL statements + | ifhead NL statements + | ifstatement NL statements + +iftail -> else block + | else ifhead iftail + | NL iftail + +ifhead -> Newlines if expr block +ifstatement -> ifhead iftail +]] + + +13oct2020 - much later. + + Maybe a different approach to newlines. Never shift them, only use + them like indent to guide parsing. + + IN is just recorded + OUT must reduce until top symbol contains an IN, which is cancelled. + + Some productions are flagged as $$NEWLINE (or similar) meaning that they can + only be reduced as a whole number of lines. + When we see a newline, we record that until a non-newline is seen. + If we cannot shift and so choose to reduce: + if the production doesn't require a newline - we just reduce. + If it does, and we have recently seen a newline, again we reduce. + Otherwise, we signal an error and (maybe) shift until we see a newline. + NOTE the newline only counts if there are no indents. + + This means that + a = b + + c + + would be parsed as a statement, which I don't want. + But I do want + if foo: bar + else: baz + + to be a single statement. + So a new line in some context must force a reduction, like OUT. + When exactly does a newline force? + An OUT forces a reduction until the top symbol has the matching indent. + A NEWLINE ?? Must depend on the grammar. + + + + Q: Can this be only about resolving ambiguity and reporting errors? + If the default is to shift if we can, reduce otherwise, then + 2D causing an early reduce can resolve ambiguity. + + If certain productions must end at EOL, and error is thrown if they don't. + + That latter really must be a gramatically need to shift a newline. + + Maybe I'm wrong about reporting errors. Maybe they should always come from + a failure to shift or reduce. + + One of my problems was + if foo: if bar: dostuff + where the newline at the end needs to reduce everything, so I can + require at most one newline in the grammar. + It probably should be required for the first 'if' + + Stmt -> if Cond : StLstNL + StLstNL -> StLst NL + + StLst -> Stmt + | StLst ; Stmt + | SlLst NLs Stmt + | NLs StLst + + IfSt -> IfHead + + + NONO. I don't want some statements that end with NL and some that don't. + + - Record indent with preceding token and line-start with following + - when reducing all indents are accumulated into the new symbol, + and the line-start of first symbol is preserved. Other line-starts are dropped + - When OUT in lookahead, there is an indent within 'min_prefix' of top state, + decrement closest indent count and discard the OUT + - When OUT in lookahead and cannot be discarded and top state can be reduced, + reduce it + - When OUT canot be discarded or reduce, and error is triggered. + - No production can be reduced if it contains unmatched indents. If cannot shift, + then error. + - Some productions are marked as $LINE + - any state that contains an item at the start of a $LINE is marked 'starts line' + - if there is an indent more recent than a starts-line state, or the top + state starts_Line, newlines are ignored. + - if a newline is seen but not ignored, and the top state can be reduced, + it is reduced. + - A $LINE production can only be reduced by a newline, or, if it didn't start + at start-of-line, by an OUT + + so + Block -> : statementlist $LINE + + I want to be sure that + if cond { + statements + } else { + more + } + + is OK. Here the "IfHead" doesn't end on a newline, but the state that is + expecting 'Block' 'starts-line'. I guess as "Block -> { statementslist }" + isn't a $LINE, it can be reduced by 'else'. + +24oct2020 + + Maybe I don't want TK_out to force a reduce (mostly). Maybe the grammar + should be in control. So when we see an OUT we record it off-stack. + If we reduce for some other REASON and the top state contains an indent, + we cancel. + When we decide to SHIFT, if the top of stack has -ve indent, that is an + error. + So with C, + if (cond) { + s1; + s2; + s2; + + would show an error when s2 was shifted. + if (cond) + s1; + s2; + needs to show an error too. The "if(Cond)s1;" statement would have an + internal indent. + + So maybe the requirement is that no symbol has unbalanced internal + indents. + When we shift we require that top balances first. + So we count up outs and ins. There can be several OUTs and at most on + IN before a terminal. + If the terminal can be shifted, the top symbol must have no internal + indents after cancelling with out, and there must be no outstanding + TK_outs + If the terminal cannot be shifted we reduce and we must have enough + outs to balance anything within the reduction. + + So indents on the stack are always between symbols. Each frame has + a symbol, the indent after it etc. + + If we don't have enough outs to complete a reduction we raise an error + and need to discard (or shift/reduce) until we find an 'out'. + If there are too many outs to shift we again need an error which will + hopefully let us reduce and cancel some outs - just discarding states + will help with that. + + So: outs must cancel, following token must envourage reductions + + Some productions end with newline, their start state is 'starts-line', + and if there is an indent since a starts-line, we ignore newlines. + + If we don't ignore newlines, then there is a starts-line state that we + should reduce to. If we don't have enough symbols yet, we error. + So if we see a newline, we reduce until the newline would be ignored. + + If that state expects a newline and we don't have one, we error. + Except.. I sometimes want OUT to work like a newline. + If I need a newline and reduction isn't to start of line + + If state expects newline and we don't have one but do have OUTs + we reduce as long as there are enough outs and raise an error if + we end up at start of line. + + + Do I want to require that a newline be shifted? We can synthesize + one when minprefix is less than since newline and we have an out. + + + Summary of stuff. + A shift requires there be no pending OUT ?? + A reduce requires there be no pending indent, i.e that + indents within the reduction don't exceed outs. + NO this doesn't work. + A IN increments the indent on the top frame - never causes error + An OUT decrements indent in a frame within min_prefix, or error + A NEWLINE is either ignored or shifted, but is duplicated when + shifted. + After an OUT is handled, if a NEWLINE is expected but will only + reduce a partial line, it is provided. + + if foo: if bar: + stuff + bazz + + This effectively pretends there was a IN at the start of the "line", + and we are not doing two OUTs with a NEWLINE between. + if foo: + if bar: + stuff + baz + + + Do I really want to shift newlines: + yes: simpler grammer + no: don't need to record min-prefix. + + I think 'yes'. + + So discard $NEWLINE + Place NEWLINE in the grammar + When completing a state, if any production contains NEWLINE, record as + StartsLine + + + --- + My idea of ignoring newline when the top state is a starts_line state isn't + working. + The idea was that reducing + statement -> assignment NEWLINE + would leave at a starts_line start, so NEWLINEs would be ignored. + But this leaves us at a ' ... statement . ' state + e.g. " Statementlist -> Statement . " + which reduces to "Statementlist -> Statementlist . Statement" + + + +30oct2020 + + Review: because it isn't working (as usual). + + Indenting have 2 effects: + - it triggers errors when something looks wrong + - it causes newlines to be sometimes ignored. + + Specifically: + - A symbol that starts indented must finish while still indented. + So an OUT cannot cancel an IN except in the top phrase, and an attempt to + shift while there are pending OUTs is an error. + - A symbol that begins at the start of line and contains an indent may only + be the top symbol - shifting before an OUT cancels the indent is an error. + + - Newlines are ignored when there is an indent since a starts-line state + + Procedurally: + - There can be one IN and several OUTs pending - we update the counters + as they are seen. + - If there are pending OUTs and their is an indent in the top phrase, + we can cancel an indent. + - a SHIFT becomes an error if the top symbol has an indent and started + at beginning of line + - a SHIFT becomes an error if there are pending OUTs + - a SHIFT that is not an error first imposes the IN on the top of stack + - REDUCE may happen at any time. + + Newlines, when not ignored, are repeating terminals. An indefinite series + is generated until a state is reached where they are ignored. + The expectation is that they will be shifted (possibly after some reductions) + and then a reduction will happen which changes the state appropriately. + + It is vital that no recursive production keeps consuming newlines + foo -> foo NEWLINE + as that will be fed indefinitely, or if newlines are ignored, will never reduce. + + NEWLINEs are recognized if there is a starts_line state since the most recent + indent, but not when the current state is starts_line. So once we reach + starts_line, we ignore newlines until something else is shifted. + Or maybe only a starts_line state at the start of a line? + + A starts_line state has an item "head -> . stuff NEWLINE" + So if were just reduce 'foo -> foo NEWLINE' then we are at the end of foo, + so an earlier state with have ".foo" and so "foo -> . foo NEWLINE" and so will + be starts_line. So we will always accept a NEWLINE. So these must be disallowed. + + How does "if cond : if cond : print NEWLINE" parse? + "block -> : slist NEWLINE" + "slist -> statment | slist statement" + "statement -> print NEWLINE | if cond block NEWLINE" + + so starts_line are before statement or ':' - I don't want that as it would allow a newline + before ':' to be ignored. + Maybe I want starts AFTER a NEWLINE?? Did I do that before? + So after block, after statement, after slist.. but we expect NEWLINE there. + + Maybe the important states are "foo -> baz . stuff NEWLINE". + Once we've entered a path to a NEWLINE, we need to be ready for it. + So: after :, after print after if and cond and block + + The repeating newlines are hurting now. The rule to start ignoring them + isn't working. + ifhead else if expr : statementlist [NEWLINE] + the NEWLINE should be ignored. + There is an indent after ':' but final state is startsline because this could + be the end of a block -> : statementlist NEWLINE + But I want that newline to be ignored because of the indent. + + if (A) + b; + c; + why is that bad? How can we know? Because we reduce to start-of-line + and still have an indent. + + a = + b + c + + d + * + + I need some good examples to remind myself. + Why do I want min_prefix? I allow indents to be cancels within min_prefix + cmd -> { statements } + + +01nov2020 + ARRRGGHHH I keep going down rabbit holes. I need to take a more mathematical + approach and present clearly what I have and what I need. + The big issue, for the moment at least, is when to ignore newlines. + I REALLY like the idea of newlines being repeating until they are ignored. + I definitely like that they are ignored after an indent unless they are explicitly + expected. I need to ensure they are not expected at awkward places, like the + start of a line. + + I want to be careful of making LL assumptions about the LR grammar. i.e. I cannot + assume the end from the beginning - that is was LL does. I want the appearance + of a newline to only make a difference once it appears. But at the point that + it appears we need to know if it is to be ignored. In + a = b + NEWLINE + it isn't ignored event though it isn't expected exactly here. In + a = b + (IN) + c + NEWLINE + it is ignored and the difference is that no state since the IN expects a + newline. + + To acknowledge LR approach we need to take the latest state. The start of + line, before the 'a' is tempting but we hardly know anything there. After the 'a' + is tempting, but that would mean after a[b+c].thing(foo) as well which seems to + miss the point. + Maybe something earlier sets the expectation? + if cond : + The ':' says newline-separated things are coming. So it is the state + after the ':' which says 'newlines matter'. This will be after the indent. + So if we saw + if cond { + then maybe newlines don't matter (depending on grammar) + + But a newline is ignored *at* that state, so that blank lines can be skipped. + I think there is a 'statementlist' symbol and at the start or end of that symbol + newlines are ignored. Within that symbol unindented newlines are significant. + statementlist -> statement NEWLINE + | statementlist statement NEWLINE + + +02nov2020 + Can I drop the 'min-prefix' concept? + The means that OUT can only cancel against an indent 'in' or immediately before + the top symbol. This means we go back to keeping a 'starts_indented' flag and + storing each indent with the following symbol. + + Then the 'ignore newline' test is since_starts_line > since_indent + but since_indent is 0 fir there are internal indents and 1 if there is only a + starts_indented indent. + + So a TK_in simply sets starts_indented for next + TK_out is recorded + Whenever top symbol has indents and outs are recorded, we cancel. + If we want to shift with pending outs, that causes an error + + I need a reason to signal an error for + if (a) + b=1; + c=1; + based on indents. equally for + if (c) + b=1;c=1; + If indent is a continuation, then 'c=1;' but be part of 'if...' before + it is part of what comes before? + This means the OUT will not be able to see the IN ... unless the whole + statement list ends here.... + In "a = IN b + c ; " the ';' causes a reduction to Assignment with indent + so I must be happy for the 'c=' to reduce to 'ifstatement' with an indent, + but reducing to the recursive 'statementlist' seems wrong. Maybe these + left-recursive productions really are special. + But what about 'expr -> expr + expr' ?? + + The rule I'm depending on is that I cannot shift with a pending OUT. + So everything since the IN needs to compress to a single symbol. + If the code is + { + if (a) + b=1;c=1; + } + then the '}' will reduce to "{ statementlist" before shifting, so the OUT will be + cancelled. The question is: how can we prevent that in the grammar? + There are 2 OUTs (in this case) does that help? + A comparable case is + { + foo; + if (a) + b=1;c=1; + } + In that case the two OUTs will be cancelled at clearly different times, but not real + difference. + + { + foo; + if (a) + b=1;c=1; + bar; + } + + Now the 'bar' will be blocked from shifting.... no it won't. + + I need a new rule - this *must* be wrong, even when we ignore newlines. + But why? Because "b=1;c=1;" looks like a statementlist, but isn't. + It isn't because a statementlist doesn't belong here, ony a block or a statement. + How do I know that indent is continuing the 'if', not the statementlist. + I need some marking on statementlist to say "Cannot be continued by indent" + I guess that is what NEWLINE is for. + + So the rules would be "a suitably marked production cannot be reduced with + outstanding indents. If nothing can be shifted, an error must be raised. + These will typically be productions that end in a NEWLINE, but maybe any + production from a symbol that can produce a NEWLINE, or that is marked $$something + + Back to TK_newline. These are duplicated on shift, but discarded when not wanted. + They are not wanted unless there is a starts-line start below the current state, + and more recent than the most recent indent. + A starts-line state is any state which contains an item with a NEWLINE and dot + at the start. + + statementlist -> statement NEWLINE + | statementlist statement NEWLINE + + + Q: how do nested 'if's resolve? + if cond: + if cond2: + print + else: + die + + statement -> if cond block + | if cond block else block + + The tokenization would be + if cond : IN if cond2 : IN print NL OUT NL OUT IN else : IN die NL OUT NL OUT NL + 1 2 A 2 B 1 3 4 C 4 D 3 E + so NLs: + A makes 'print' a statment, + B makes 'if cond2' a statment, + C makes 'die' a statement + D is protected by IN/3 and is ignored. + E makes if cond .. else.. a statement + + What if I also had + + | if cond block NEWLINE else block + + Then on seeing the NEWLINE I wouldn't know whether to shift it, + or reduce to a statement. + To fix that I would need: + statementlist -> statement | statementlist statement + statement -> simplestatements NEWLINE + | if cond block NEWLINE + | if cond block else block NEWLINE + | if cond block else statement + | if cond block NEWLINE else block + | if cond block NEWLINE else statement + + But wait.... Allowing NEWLINE before 'else' confuses the parse of 'if cond:...' above. + NL-B makes "if cond block NEWLINE" and then the else can be shifted, but that is illegal + because of the negative indent. That means the negative indent must prevent SHIFT. + + + Do I still need to know which symbols are 'line-like' ?? + Yes, to make it easier to detect line-line productions which must contain unmatched indents. + Do I need recursive line-like detection? + For no-indent productions, it probably makes sense. + For starts-line detection? I don't think so. 'block' isn't the start of a line. + And not really needed for no-indent + + + Do I need to worry about left-recursive symbols? + I don't think so. There should always be some terminal - often NEWLINE - + which will ensure the symbol isn't extended??.. or if it did we would get + a parse error due to uncancelled OUT + + Do I need to split indents? + A stack frame holds "symbol state". Indents are within or before the symbol. + For cancelling with OUT, indents within and before the top state are equivalent. + For hiding newlines the indent before the symbol is too distant. f->next_indented ? "/" : "", + + So: yes. + Should I store them in the stack? + "symbol+indents indent state" + So when we see an indent, we mini-push that + Any indent since state protects newline + Out can be cancelled if top state has indents, or previous has trailing indent. + + Rather than an ignore_newline flag we need a outs_needed flag. This is how + many outs are needed before we process newlines. + If state is startsline, then outs_needed == infinity. + Else it is a count of indents since a starts-line state. + So maybe it is an indents_since_starts_line and we test state and indents. + + So: + 1/ rearrange frame to include 'indented' in the top frame + 2/ replace ignore_newline with indents_on_line + +14nov2020 + I'm getting close... + Problem is correctly ignoring of newlines. + I ignore them after an indent - that bit is easy. + But I also ignore them at a start-line symbol - sometimes. + + INDENT statementlist NEWLINE should ignore the newline + but + INDENT statementlist if expr : stl if expr : stl NEWLINE should not + + (/0s) Statementlist (11s) IfHead (5) else (22) if (3) Expression (19) : (29s) Statementlist (40s) [NEWLINE:22:0] - Ignore + + This shouldn't Ignore. Is that because there is a non-'s' since the '/' ?? + + i.e Ignore newlines if all states since indent are startsline, or if none are. + Don't count current state if it is startsline + indents_on_line > outs || all_startsline(...) + + No - better, but no banana. if expr { statement; newline + need to ignore even without the indent. + So I need a new type of state - one that is at the start of Statementlist but + not at the end. State 29 in the above, but not 40 + +(/0s) Statementlist (11s) if (3) Expression (19) { (30s) Statementlist (39s) [NEWLINE:34:0] - ERROR + +18nov2020 + So: + I introduce "startsline" and "endsline" states. + A state before a linelike symbol is startsline, a state after such a + symbol is endsline. A state can be both, in which case we consider it 'endsline'. + So 11 endsline, 30 startsline 39 endsline + + If there is an indent since the last startsline or endsline state, we ignore NEWLINE. + If there are only start/end line states since indent, ignore newline + if there are only endsline states since startsline, ignore newline - NO. + + + I wonder if maybe I should have IN and OUT in the grammar. + The tokens can still appear anywhere, but the production can only be reduced of there is an IN there. + Block -> : IN Statementlist OUT | : SimpleStatements + might get messy. But it already is messy. + +20jan2021 + Where am I? + The big questions seems to be: when can/must I ignore NEWLINEs? + Conversely when do they REDUCE or get SHIFTed? + The answer must lie in what appears between the most recent INDENT and the NEWLINE. + - if there are no startsline states, we can ignore. + + It seems that I need to be able to see the start of a block, either an indent, or a {.. + But if there is a statementlist, surely a block started? No, that misses a point. + INDENT ignores newlines because there will be a matching OUT + Maybe { causes NEWLINEs to be ignored because there is a matching } expected? + + If a startesline state is at the end of a productions, it plays no role in NEWLINEs, + but if it is followed by something + +23jan2021 + review/summary. + We track indents and newlines. The goal is resolve ambiguities and detect errors. + Ambiguities are resolved by forcing a REDUCE in some circumstances when an OUT or NEWLINE is seen. + Errors happen when there are too many OUTs. + NEWLINEs are a normal part of a grammar, except that they get ignored sometimes when they are not relevant. + and are protected. + + 1/ a production from a lineline symbol cannot be reduced while it contains an unbalanced indent, + and so must be reduced before an excess OUT token is processed. + 2/ NEWLINES are ignored if there is an IN since a start-of-line state + + Otherwise NEWLINEs must be in the grammar. This can be awkward where they are optional + such as before an 'else'. To handle the fact that a structured statement can be multiple lines + or few we need to build it up from the possible line-like parts. + Maybe this means that ifpart, elsepart, whilepart, casepart etc need to be statements which + are combined above the grammar level?? + The reason has something to do with reducing to Statement when the newline is shifted. + I wonder if that is really needed. + if we have "ifpart -> if cond block" then that can be followed by elsepart, but we + need a separate "ifpartNL -> ifpart NEWLINE | ifpartNL NEWLINE" which can also be followed + by an elsepart, or can reduce to a statement + ..but no. That isn't sufficient if NEWLINEs are indefinitely duplicated. + So if we don't ignore newlines where they aren't needed, we cannot duplicate them. + + The need for duplicating was constructs like + if cond : if cond2 : action + which needed 2 NEWLINEs, one for each statement. + Maybe the "if cond2 : action" needs to be a simplestatement. + When we see the NEWLINE we can (must?) reduce anything that started since + a start-of-line, which creates the simplestatement. + + So let's try dropping the infinite repeat and require newlines in the grammar + + Extra rules: + - a production from lineline symbol cannot be reduced while it contains unbalanced + indent + - a NEWLINE is ignored if no startsline state since indent + - a NEWLINE cannot be shifted unless top startsline is at start of line. + + So a NEWLINE is: + - ignored if not startsline since indent + - forces REDUCE if top startsline is not at start of line !!!!. i.e. don't shift + - else can be SHIFTed. + +26jan2021 - happy australia/invasion day + I want ": simplestatements" to be a valid block, but not when there is an + indent in the middle + Previously this was resolved by having an SSLine -> simplestatements NEWLINE + and the SHIFT of the NEWLINE outweights the reduction to block. + And I have that now with Statement -> SimpleStatements NEWLINE... + Ahh.. it is because I put Newlines before Statementlist. I guess a NEWLINE + needs to be a Statement + + I currently suppress shift unless "outs == 0". That means + if cond: + statement + else: + doesn't shift the 'else' because the "out" isn't resolved. + Without the leading space, the NEWLINE reduces to ifstatement, and I sort-of know + that is an issue. + So: why suppress SHIFT when there is an outstanding OUT... or better Question, + why is this OUT still outstanding? + Ahhh.. I misunderstood. I DO want to reduce that 'if' because it is nested. + I have an IfStatement what needs to reduce to a Statement and merge with a + Statementlist, so the OUT can be cancelled and then the 'else' shifted. + BUT as 'else' is the lookahead.... I need NEWLINE to reduce IfStatement to + Statement. But even with a NEWLINE we don't SHIFT that because we still have + that OUT to clear. So how do I clear an OUT ?? It can only be cancelled when + the matching IN reaches TOS, which requires the NEWLINE to be shifted. + We cannot shift the newline early, partly because that would be wrong, and + partly because it gets reduced to a statement and disappears. + + Hmmm. After blocking SHIFT(NL) when we don't have a line I see + progress but not there yet. + if false { + print = OK + } else { + + After the OK is NL OUT NL + The first NL is shifted and reduces down to Statementlist and the OUT cancels. + But now the NL doesn't belong. Can I explicitly allow NL before } ?? + No, because it doesn't even SHIFT. It cannot because the Statementlist doesn't + appear to be sol - the OUT hide that and it now looks like + if expr { statementlist + + Arggh. I think I need to hide the starts-line state in this context. + More thinking needed. + + I introduced the shift-block to make sure the NEWLINE paired with the right line. + Here the NEWLINE should probably be ignored. But why? It isn't indented, but it + is inside a multiline thingy. + Might be better to explicitly allow it and find some way to let it be SHIFTed. + Problem is that the Statementlist is linelike, but I don't want that here. + Hard to avoid when the statements are and must be linelike. + Maybe I need a way to hide linelike-ness. + + Maybe a state should only be startline if the core item has dot followed by + a single symbol (which can derive a newline) ?? + +27jan2021 + I need a new idea concerning starts-line states. I need some refinement somehow + The state + block -> { statementlist . } + should ignore newlines - providing statementlist isn't recursive - but doesn't + because + block -> { . statementlist } + is further up the stack, and that is a startsline state + + Maybe the thing is that the latter is startsline only because of statementlist, and + now that statementlist is gone, the startsline-ness lapses. + + So in the former state, it is not startsline, and it is not terminal, so it + suppresses a startlines state 2 levels up. + + But does that help? We would suppress the startsline-ness but there are + no remaining indents to ignore the newline. + Why can I ignore a newline in "if cond { st }" but not in "a = ( x )" + ?? + Ahhh. This helps because the new top startsline would be at the start + of a line, so newlines can be shifted. The grammar can explicitly + allow a newline there... only then the state becomes a startsline + state?? or does it? But it is the top state, so it doesn't matter. + + Rule: a NEWLINE cannot be SHIFTed if the topmost active startlines state + is not at the start of a line non-indented. This is because newline + must be meant to end a line started earlier - where starts-line was at + the beginning of a line. + The stop state is never "active" as the line it would start hasn't + actually started. If the shifted newline reduces immediately, the + grammar is probably broken. + Also a state is inactive if a subsequent state declares it to be. This + happens when a state is non-terminal (not reducable), and is not startsline. + The smallest prefix length of all core items indicates how many + preceding states are deactivated. If min-prefix is N, then N-1 starts + are deactivated. + + + So what do I need to code: + - I need to record with each state how far back it suppresses + start-line states. + - enhance test for shifting newline + +30jan2021 + OK, the parsing code seems to do what I want, now I need to fix the grammar. + The context is structure statements which contain lines. e.g. + if cond: + statements + else: + statements + + The "if cond: statements" is a while line so it looks like a statement. + But then we see "else" which isn't the start of a statement. + I've considered two avenues. + 1/ decide that "else: statements" is a valid statement and generate errors + in the semantics analysis if the preceeding statement doesn't like the else. + 2/ enumerate all the possibilities to the grammar as 1 or more lines. + ifstatement -> ifline | ifheadline elseline ... + But that seems problematic with cascaded "else if" + + So let's try avenue 1. "else block" and "else ifstatement" are statements. + +03feb2021 + indent_test seems to work, now trying to convert ocean. + My plan is that the various parts of a condstatement can either be + all on one "line", or some of them on their own lines. + The parts are: + + for then while do case* else + switch case* else + if then else + + a for,while,switch,if,do can start a statement + and this determines what other parts are allowed. + So we need to allow continuations of + + after for + then? while case* else? + after while + do* case* else? + after switch + case* else? + after if + then? else? + after do + -nothing + + + But wait... what happens with "else"? + I want to allow "else" to be followed by a CondStatement so + if cond: + stuff + else if cond: + sufff + + works. I guess there is not much of an issue there the 'else' becomes an + option prefix to a condstatement + Callinfg var_block_close at the right time might be awkward as we don't + know when we are parsing the end of a CondStatement. + + Pause and reflect: what is the problem we are trying to solve, and does + it still apply? + + The problem is newlines. When we see one we don't know whether to + reduce to a Statement or just to an (e.g.) IfPart. + We would need to allow several Newlines while staying at IfPart. + Then if we see 'else' we shift that, otherwise reduce to Statement + + ifstatement -> ifhead elsepart + | ifheadnl elsepart + | ifheadnl + + + But wait... indent_test is broken!! + If I indent the 'else' one space, it looks like an ElseStatement after + the Statementlist that should be closed - but is recursive. + I can change it to a BStatementlist, but there is nothing to force that + to reduce. We prevent shifting until the outdent is cleared, but that + happens with the Statementlist. Maybe don't clear the outdent if the + top symbol state had a reduce-length of 1.?? + + OK.. that's fixed. Let's get back to the bigger problem. + + A statement can be: + -> + | simplestatements NEWLINEs + | IfHeadNL + | IfHead IfSuffixNL + | IfHeadNL IfSuffixBL + | SwitchPart CondSuffixNL + | SwitchPartNL CondSuffixNL + | WhilePart CondSuffixNL + | WhilePartNL CondSuffixNL + | ForPart WhilePart CondSuffixNL + | ForPart WhilePartNL CondSuffixNL + | ForPartNL WhilePart CondSuffixNL + | ForPartNL WhilePartNL CondSuffixNL + + ... and some for ThenPart and ThenPartNL + + ForPart -> for simplestatements + | for Block + ForPartNL -> ForPart NEWLINE + | ForPartNL NEWLINE + IfHeadNL -> IfHead NEWLINE + | IfHeadNL NEWLINE + IfSuffixNL -> IfSuffix NEWLINE + | else Block NEWLINE + | else statement + SwitchPart -> switch Expr + | switch Block + SwitchPartNL -> SwitchPart NEWLINE + | SwitchPartNL NEWLINE + CondSuffixNL -> IfSuffixNL + | CasePart CondSuffixNL + | CasePartNL CondSuffixNL + + CasePart -> case Expr Block + CasePartNL -> CasePart NEWLINE + | CarePartNL NEWLINE + +05feb2021 + + Above looks promising but doesn't quite work. + The "statement" after an "else" must be "statementNONL" because no + further newline is expected, but even then it isn't quite right + + if expr1: + stat1 + else if cond2: + stat2 + + scans as: if expr1 : IN stat1 NL OUT IN else if cond2 : IN stat2 NL OUT NL OUT NL + + whereas + if expr1 : + stat1 + else if cond2: stat2 + + scans as: if expr1 : IN stat1 NL OUT IN else if cond2 : stat2 NL OUT NL + + In both cases there are more NLs than things that need to be ended. + We always was a NL for the starting 'if', and in the first case we need a NL + for 'stat2'. I wonder what that means. + + Separately + + if cond block else block NL + + because the state before 'else' is startsline the NEWLINE cannot be shifted. + That seems to mean the NEWLINE must be in the production that starts the line, + so "CasePartNL" etc cannot be used..... + + Bingo(??) I change each statement type to be a FooNL, or list thereof, with + FooNL -> stuff and nonsense NEWLINE + | FooNL NEWLINE + + But what about that extra NL .... which now seems not to be a problem + + Ah-ha. The second (of 3) is ignored because it is indented. All good (for now). + +06feb2021 + The longest multi-line thing is + For Then While Do Case... Else + + Each can be on a new line, or on previous line. + How can Case be handled? I guess they all need to be the same. + + What about + if cond1: + stat1 + else if cond2: + stat2 + else if cond3.... + + ??? That looks awkward. + + Can I have + For -> ForPart + | For NEWLINE + ?? + I should test and see. ... I don't think so. At least not without more + smarts for newline handling. + + So back to + For Then While Do Case... Else NEWLINE + + Other forms are + + ForNL Then While Do Case... Else + ForNL ThenNL While Do Case... Else + ForNL ThenNL WhileNL Do Case... Else + For Then While Do Case... Else + For Then While Do Case... Else + For Then While Do Case... Else + For Then While Do Case... Else + + more than 64 combinations.... + + First line is one of: + + For + For Then + For Then While + For Then While Do + For Then While Do Case + For Then While Do Case Else + + Then + Then.. 5 options + then 3, 2, 1 + Maybe only 21 parts + + Cases should be easy. A list of caselines, each as list of case parts. + Followed by an elseline which has zero or more caseparts and an elsepart. + + I think I need to change how NEWLINE is handled, do minprefix differently. + It is used to ignore stuff when deciding which startsline starts can prevent a + newline from shifting. Review exactly what is wanted there. + + What exactly do I do with newlines? + - If a production contains a literal NEWLINE, the head is marked line-like + - forbid shifting NEWLINE when recent starts_line state is not at actual + start of line... but ignore intermediate states based on min_prefix + - record where lines actually start + - ignore if indent since starts-line state + and that is all. + + Note that any state where an item starts with a line-like symbol is a + starts-line state. + Any state that can reduce to a line-like symbol requires indents to be + balanced. + starts_line states only affect ignoring newlines and choosing when to + allow shift, as described above. + + Thoughts: + I could extend 'line-like' to any production containing a symbol that + starts with NEWLINE. The Newlines would work. + Rather than 'min_prefix' I could store "since-newline-or-start' so + that multiple newlines in a production would make sense, + +10feb2021 + New thoughts. I wonder if they will work. + + Change the scanner to produce paired SOL and EOL tokens, where EOL is + much link NEWLINE currently and is delayed by paired IN/OUT. + Also skip blank line, so only get a SOL if there is text on the line. + + Now a production needs to be explicit about being at the start of a + line. + Maybe we can even do + OptNL -> + | EOL SOL + + So: + statement -> SOL SimpleStatements EOL + | SOL CondStatement EOL + + If the grammar requires an EOL followed by an EOL, there must be an + implied OUT. + + in "if cond block else" + how do we know when the "block" is finished so that the "else" can be + shifted? + The expansion of 'block' will (possibly) end with a EOL. For "else" to + follow EOL without a SOL, there must be an OUT. + +12feb2021 + I need to clarify how the scanner must work for SOL/EOL so that I can + write code that works. + + SOL needs to be generated when we see a non-space character on a new line. + This is the same time that we need to possibly generate IN, which is in + check_indent. + So at start of line we scan for non-space, then unget and set check_indent. + In check_indent we assume start-of-line and generate SOL after any IN. + + EOL needs to be generated after we see a NEWLINE (or maybe EOF) on a + non-empty line. It may be delayed until after indents, so we need to store + it. We delay it until after multiple blank lines, so we always need to + store it. So ->indent_eol[->indent_level] is a delayed EOL, if ->num + is not TK_error. + + I think we need a flag for 'at start of line' which means the line + seen so far is empty. So much like my "non_empty" + + OK - much easier to get it right once I've thought it through :-) + +13feb2021 + + This isn't quite working how I had hoped :-( + The "EOL SOL" pair, or more the "SOL else" pair suggests I need a look-ahead + for 2 to recognise if I have an IfSuffix or not. + But I know and LR(2) can be re-written as LR(1) (Did I learn that in uni?) + How can I do that? + + Statementlist -> SOL SimpleStatements EOL Statementlist + | SOL Ifhead EOL Statementlist + | SOL Ifhead IfSuffix Statementlist + | SOL IfHead EOL SOL IfSuffix Statementlist + | + So if we see EOL SOL we can wait for else, which leads to IfSuffix, or + something else for StatementList. + But I don't want to allow StatementList to be empty. I can achieve this + but duplicating the above for a StatementList_nonempty. A bit ugly. + + Also, this is right-recursive which uses a lot of stack. + I can compress it a bit. By making an IfStat include the following statement. + SL -> Stat | SL Stat + + Stat -> SOL SimpList EOL + | IfX Stat + | IfX SOL IfSuffix + | SOL IfHead IfSuffix + + IfX -> SOL IfHead EOL + IfHead -> if Expr Block + IfSuffix -> else Block + | else IfHead + | else IfHead IfSuffix + | else IfHead EOL SOL IfSuffix + | else IfHead EOL Stat + + + Getting there... (again). + Problem: + if cond1: + if cond2: + stat1 + else: + + The 'else' pairs with cond2. + There is an EOL after "if cond2: stat1" and then "SOL else" + which looks just the same as + if cond1: + if cond2: + stat1 + else: + + The only difference is an extra OUT IN which we currently ignore. + + How can I use the OUT? + I have + SOL IFHead EOL .... OUT IN SOL + and I need the OUT to tell me to Reduce, or to block the Shift of SOL. + But if I simply block Shift when I have an OUT, the SOL IfHead EOL + becomes a Statement which is merged into the StatementList and then + the SOL is Shifted. I need to go all the way to make that Statementlist + a Block and IfHead. + If I hold out with the OUT longer until reduce_size!=1 + I get further but + IfHead else IfHead .... EOL + cannot shift the EOL + + Maybe I need to use min_prefix, but I really don't like that. + Need to think this through. + + Well, I have it working. + + If suppress shift if there are outs EXCEPT for TK_eol. Why? + Also I use the Bstatementlist indirection + and don't cancel the out if reduce_size==1 + + It's a bit clunky. Can I justify it? + + I'd like the tokens to be different. With + if cond: + st + else: + + The SOL before the else is ignored becuause we don't expect SOL there. + Trouble is in the problem case, SOL doesn't get ignored until later. + + Can I *only* prevent a shift of SOL when it is unbalanced? + + So: prevent shift of SOL if there is an uncancelled out, otherwise it will + be assumed to be at the wrong level. + Better, but not completely happy... + +14feb2021 valentines day + + What if the rule for cancelling indents was that the cancel couldn't cross + a starts-line state. How would that work out? + +15feb2021 + I didn't have time to pursue that, and now I'm a lot less convinced. + + New idea: Allow IN and OUT in the grammar, and selectively ignore them + like we do with SOL EOL. + That was, OUT could force a reduce which could not them be extended, so that + whole issue of recursive productions becomes moot. + + When are indents relevant? Maybe we have starts-block states which + expect IN, and with ignore IN if there is an indent since the last + starts-block state. + So + block -> : IN statementlist OUT + | : simplestatements + would ignore IN until we hit the :, then IN becomes relevant. + If we don't see and IN it must be simplestatements. Do we allow IN + there-in? Probably not. It would look confusing. + But if we get an IN, then we start ignoring INs again. + + The OUT absolutely must balance the IN, so we ignore OUT whenever the matching + IN was ignored. + + We still refuse to skip OUT if the matching IN is too far away. Must be in top + frame. + + Clarify handling of OUT when the IN was ignored... + A linelike production that started before the IN must not reduce until + after the OUT??? + + Any production that started after the IN must reduce before the OUT. + We don't force it to reduce, we flag an error. + So if we reduce some symbols which contain more OUT than IN, that is + an error + +17feb2021 + I need to track in/out carefully so they match properly and I ignore the right + OUTs. + IN is ignored whenever SOL/EOL would be. OUT is ignored precisely when the matching + IN was ignored. + I also want to track all ins and outs until they cancel in a reduction. + It is only at the reduction step that we can determine if an error occured. + An error is when a symbol contains nett negative indent. + So we can just count indents in each symbol. + Some in/out are within symbols, possibly IN and OUT. Others which are ignored + exist between symbols. A frame holds (symbol+internal indents),(state+pending indents). + To track which OUT to ignore we need a depth count and a bit-set. + If a bit is set, then the IN was ignored so the OUT must be too. + If clear, the IN was shifted, so the OUT must be too. + + I need to get indents_on_line right. + Previously I tracked them before this frame. I don't know why... + I want 0 when starts_line + +19feb2021 + OK, new approach is looking really good. Need to make sure it isn't too hard + to use. + Tricky area is multi-line statements that don't *have* to be multi-line. + + We cannot reduce "SOL IfHead EOL" to a statement as we cannot tell if it + is complete until we shift the SOL and look for an "else". + One option is "statement -> SOL IfHead EOL statement | SOL IfHead EOL IfTail" + So "statement" is really a sublit of statements. + Easy in indent_test, what about in ocean? + + There are lots of parts that can be on a line: + if, else, for, then, while, do, switch, case + + if and while can be "expr block" or "block" and the thenpart/dopart + else can be "block" or "statement" + then is optional in for, request if some if + + ifpart -> if expr block | if block then block | if block EOL SOL then block + + OR?? + + ifpart -> if expr block EOL SOL | if block then block EOL SOL... + + What if I support backtracking over terminals? So if I cannot shift + and cannot reduce, I back up until I can reduce, then do so? + + Then I can shift the SOL and if there is an else, I'm good. If not I back up + and reduce the statement + So + statement -> SOL simple EOL + | SOL ifhead EOL + | SOL ifhead EOL SOL elsepart EOL + | SOL ifhead elsepart EOL + would work. + But do I need it? + + statement -> simple EOL + | ifhead EOL + | ifhead EOL SOL statement + | ifhead EOL SOL iftail + | whilepart + | forhead whilepart + | switchead casepart + + + ifhead -> if block then block | if expr block | if block EOL SOL then block + iftail -> else block | else statement + + whilehead -> while expr block | while block EOL SOL do block | while block do block + whilepart -> whilehead EOL + | whilehead EOL SOL statement + | whilehead casepart + | whilehead EOL SOL casepart + + casepart -> casehead casepart + | casehead EOL SOL casepart + | casehead EOL SOL statement + | iftail + casehead -> case expr block + +22feb2021 + I've had a new idea - let's drop SOL! Now that I have IN, it isn't really needed. + We can assume SOL follows EOL or IN .... maybe. + Problem is if we want to require IN/OUT around something that is not line-oriented. + Might that ever matter? + No, I don't think so. + +23feb2021 + Maybe this make it really really easy. + We don't mark different sorts of states, and we only track which indents were + 'ignored'. + + Then: + IN never causes a reduction, it is either shifted or ignored. + An EOL is ignored if the most recent IN was ignored, otherwise it is a normal + token. + An OUT is similarly ignored if the matching indent was ignored. It also + cancels that indent. + + Is thats too easy? + + .... no, it seems to work. + + So: back to the ocean grammar + + statement -> simple EOL + | ifhead EOL + | ifhead EOL iftail + | whilepart + | forhead whilepart + | switchead casepart + + + ifhead -> if block then block | if expr block | if block EOL then block + iftail -> else block EOL | else statement + + whilehead -> while expr block | while block EOL do block | while block do block + whilepart -> whilehead EOL + | whilehead casepart + | whilehead EOL casepart + + casepart -> casehead casepart + | casehead EOL casepart + | casehead EOL + | iftail + casehead -> case expr block + + +24feb + Hmmm. awkwardness. + An ifpart can be "if expr then simple ;"... no it cannot... + But the problem was that some forms for a head with an optional tail + must end EOL, other forms need not. + But the whole must end EOL. + + So: do we put EOL at end of 'statement' or end of IfSuffix + + Let's try assuming it is at the end of 'statement' + So IfSuffix can assume an EOL follows + So CondStatement can too + So an ifhead either 'may' or 'must' be followed by an EOL. + If may, it is followed by IfSuffix which is empty, or starts OptEOL + If must, it is followed by empty or + No.. this isn't working for me. + + Let's try assuming that a CondStatement ends with an EOL. + So an IfSuffix must too. and it cannot be just EOL + If an ifhead that must be followed by EOL, it is either EOL or EOL IfSuffix + If it may be, then EOL or IfSuffix + + + ForPart ThenPart SwitchPart are ALWAYS followed by something, so can end + EOL or not, as suits + WhilePart IfPart CasePart might be the last thing so each option must + end with a SuffixEOL which ends with EOL or SuffixOpt which might not + + What do I want to do about + : SimpleStatements + + It is useful for + case value : statement + and maybe even + if cond : statement + though for the latter I can and use 'then'. + For 'else' I don't need the ':', but it wouldn't hurt. + + Problem is: do I insist on a trailing newline or ';' + If I don't then + case foo: bar case bar: baz + would be legal, but hard to read, as would + if cond : stat1 else stat2 + which is probbly error prone. + + But do I want + switch expr + case val1: st1 + case val2: st2 + else: st3 + + That looks like an indented block, but is really indented lines. + So it is probably a mistake. + So allow switch expr : or ';' at the end + + Whatever happens after "switch expr" must work after "while expr block" + + So.... + If first case is not indented, none of them may be + If first is: it happens in an IN/OUT block, so again all the same + + Can I implement that? Can I have IN after a non-terminal somehow? + When I see an IN, I could reduce as long as go_to_cnt == 0. + That might help after an OUT, but not after EXPR,, + + Or: look at next symbol. If it can be shifted, we ignore the IN. + If not, we reduce and try to shift the IN again. + + Also: need to mark IN as ignored when popped off during error recovery, + and maintain stack when discarding during error recovery + +26feb2021 + Syntax for blocks? + { IN statements OUT } + { simplestatements } + : IN Statements OUT + + but what about + : simplestatements NL .... or ';' + + In other contexts I have + for simple; statements; then simple ; statements ; while expr: + + I currently require a ';' or newline before "then" or "while" + + Interesting other cases are: + + case expr : simplestatements + while expr : simplestatements + + For 'if' I currently have "if expr then simplestatements" + + Because of 'for' and 'then' I don't want to require ':' before simplestatements. + I could have + while expr do simplestatements + But what do I do for 'case' ??? I really want the ':' there. + So I should use it for 'if' and 'while' + 'for' could be followed immediately by IN, as could then and even if/while + So the ':' comes after an expression. + +27feb2021 + Problems with the idea of only using : to come after an expression. + 1/ "else" looks wrong compared to Python, but may I can get used to that + 2/ with "for" it would be simple statements, with "while" it would be expr + if there was no indent. Do I need different things to look different? + If statements always follow ':', the "for" and "then" always need a ':' + for: a=1; then: a = a+1; while a < 10: + + In C there is no difference, but I want a difference.. + +03mar2021 + Arg... I'm not struggle with parsing concepts this time, I'm struggling with code. + I want to add an "EOL" symbol to the grammar as a special terminal. + It is like "NEWLINE", but handled a bit differently. + + In parsergen it is just another terminal symbol, but it mustn't get added + to the "known" list. Currently all terminals from TK_reserved are added + to "known". Maybe if I give it a number that is after the virtual symbols