02oct2014

I don't know how to create a grammar for cond statements.  The newlines
are confusing.

When I have

   if cond:
      statements

this ends with NL OUT NL.
The first NL is included in "statements", which is allowed to end with an NL.
The "OUT" reduces us down to "if Expression Block" and then cancels.
The second NL is the problem.  It could end the whole statement, or
it could be followed by an 'else', and I don't know which.

I currently have complex statements separated by newlines.  This means
the second NL must close the complex statement.  So we cannot shift it
until we have reduced a ComplexStatement.  So we cannot see if an 'else'
is coming.

If we require ComplexStatement to end with a newline, then we have:

   ComplexStatement -> IfPart Newlines
                     | IfPart OptNL ElsePart NEWLINE
   IfPart -> if Expression Block
   ElsePart -> else Block

Then the NEWLINE can be shifted.  If we see an 'else' we reduce it to
OptNL.  If we don't, we reduce to ComplexStatement.
But if we see another newline ....  Need both to allow a list.

If

   ComplexStatement -> WhilePart CaseList ElsePart
   WhilePart -> while Expression Block
   CaseList -> Newlines | OptNL CasePart | OptNL ElsePart
   ElsePart -> Newlines | OptNL else Block Newlines

OR

   WhilePart -> while Expression Block
   CaseList -> OptNL CasePart |
   ElsePart -> OptNL else Block |

   ComplexStatement -> WhilePart WhileSuffix
   WhileSuffix -> Newlines | OptNL CasePart WhileSuffix | OptNL ElsePart Newlines

   ComplexStatement -> ForPart WhilePart WhileSuffix
                     | ForPart WhileSuffix
                     | WhilePart WhileSuffix
                     | SwitchPart WhileSuffix
                     | IfPart IfSuffix

24sep2014

Another problem.  If I have

   statementlist -> statementlist NEWLINE statement

then the newline will completely close the statementlist, which I don't
want.

   if a: stuff else if b: stuff HERE thing

At 'HERE' there is a newline before and after the OUT, which need to be
shifted into two different things...  It isn't working for some reason.

------

I have a problem.

   if cond : a=b; d=f

parses badly.
The NEWLINE at the end could be shifted to turn the "simplestatements"
into a "statements", or it could trigger a reduce and then be shifted
for the IfPart.
So in neither case is it ignored, and that is all the previous logic involved.

Both the state at 'if' and the state at 'a=b' are starts_line states, and
there are no outstanding indents.
So the NEWLINE should reduce anything that started with starts-line and
that doesn't contain an indent or a newline.

Actually, 'a=b' doesn't start_line.

So when we see a NEWLINE when it is allowed, we could reduce anything that
lies completely after the last start.  Maybe....

If newlines are ignored, obviously we ignore any we find.
If not, there must be a starts_line since the last indent.
We really want to reduce everything since there to a single non-recursive
symbol.  But maybe we need to SHIFT before we can REDUCE that far.
So we just reduce as far as we can.

Hmmm... I've made a mess of this.  How embarrassing.
My top-of-stack and 'next' handling gets confused.
The indent on the 'next' token gets stolen when I reduce.

So:
The stack alternates tokens, which can hold indent, and states, which can
allow newlines.  The top and bottom are states.
Each frame contains a state (with newline flag) and the following token
(with indent information).
The final (topmost) state (with newline flag) is stored in 'next', as is the
look-ahead token (together with indent info).

When we reduce() we remove several (0 or more) frames and replace them with
a single frame.
The information we remove is actually a token and its following state, N times,
including the state in 'next' but not the token in 'next'.
The new frame is the new reduced-to token with the old state, either from the
frame or from 'next'.  'next' gets a new state.

This suggests I did the frame the wrong way.
A frame should have a token (with ast and indent) and then a state
(with newline flag).
The bottom-of-stack can have a null token.
'next' just has the look-ahead token with indent state.
Reduce discards N frames (never the bottom frame) and pushes the resulting
token (with ast) and the 'goto' state from the previous frame.
'shift' pushes the 'next' token and the 'goto' state.

What a mess.
 - fixed now (I hope)

Back to where I was.  I want to parse:

   if a==b: print a; print b

and when I see the NEWLINE I want to reduce "print a; print b" to
Statementlist without shifting the NEWLINE.  Then the NEWLINE is shifted
to make a ComplexStatement.

The state at the start of 'print' doesn't expect a newline, but the state
at the start of 'if' does (I hope)... only it doesn't if it is at the start
of a block.  But in that case it is indented.
So we look backwards for an indent or a starts-line state.  If we can
reduce without discarding the state or absorbing the indent, we do.

Only .... now that 'print' isn't in a starts_line state, it also isn't
after an indent, so we are ignoring newlines.
I think I want my cake, and am eating it.

How to parse: if a == b : print a; print b
This should parse like the above.  So "Block" isn't a nicely reducible
element.  Only Statementlist is.

   if a == b { stuff }

The state after the ':' is important for reducing back to.
Is that because it is a recursion point?
No.  It is because the next thing can be preceded by a newline.  i.e.
ComplexStatement can follow a newline.

So we find symbols that end with a newline, and thus symbols which follow
a newline in a production, but only immediately.  So ComplexStatement.
Then any state where a newline-following symbol follows DOT is declared a
line-starting state.
These make newlines visible until the next indent and .... and what?
If a newline appears (before an indent) it reduces everything since that
state while it can, or until there is just one thing which is not recursive.

After a line-starting state, an *internal* indent disables newlines.
An initial indent reduces the reductions that a newline can cause?

No, that doesn't work.  I think I need to track if a symbol is at
start-of-line.
When we get a newline, we reduce anything we can since that start-of-line
until we get one symbol?  i.e. if we get NEWLINE and the top symbol didn't
start a line, we reduce if that reduction won't swallow the start of line.

Summary:
 - IN, OUT, NEWLINE tokens.  A NEWLINE that would be before IN is moved to
   after OUT.
 - INs are recorded against the 'next' symbol.  Each symbol records the
   indents (INs minus OUTs) it contains, and whether it started with an IN.
 - In parsergen, each symbol is tagged if it can end with a newline.
 - Each symbol following a can-eol symbol is 'starts-line'.
 - Each state where a starts-line symbol follows DOT is a starts-line state.
 - When parsing we record which symbols followed IN or NEWLINE.
 - If there are net indents after a starts-line state, other than
   immediately after, then NEWLINE tokens are ignored.
 - If we see a NEWLINE which is not ignored then we must reduce any
   production which started after the most recent start-of-line.  So if
   we can reduce and length is less than length-to-start, reduce.
 - If we see an OUT we must reduce any production which started indented.

We end up with lots of starts-line states that aren't interesting, as they
are very close to a newline anyway, or only a terminal away from the next
starts-line state.

Optional NEWLINES are awkward.  When we see a newline, we are prone to
reduce early, so we end up with a newline to be shifted when it isn't
wanted any more.

Optional newlines are even more awkward.
An optional newline in "block -> OptNL { statementlist }" messes up because
the NEWLINE forces the reduction of OptNL from 'empty' before the NEWLINE
is shifted.  So we can never achieve "OptNL -> NEWLINE" except at the start
of a line, after an OUT.

The purpose of reducing early is to ensure a symbol never includes a
newline unless it started at start-of-line, or explicitly allows newline.
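The first item of the summary (a NEWLINE that would come before an IN is
moved to after the matching OUT) can be sketched as a small scanner.  This
is only an illustration, not the real tokenizer: the function name
indent_tokens, the whitespace-only word splitting, and the assumption that
every outdent returns to an earlier indent width are all made up for the
example.

```python
def indent_tokens(text):
    """Return word tokens interleaved with IN, OUT and NEWLINE markers.

    Hypothetical sketch: words are split on whitespace, and every
    outdent is assumed to return to a previously seen indent width.
    """
    tokens = []
    indents = []            # stack of currently open indent widths
    for line in text.splitlines():
        if not line.strip():
            continue        # blank lines carry no indent information
        width = len(line) - len(line.lstrip(" "))
        if not indents:
            indents.append(width)       # first line sets the base indent
        elif width > indents[-1]:
            # The NEWLINE that ended the previous line is held back here;
            # it is emitted after the OUT that matches this IN.
            indents.append(width)
            tokens.append("IN")
        else:
            tokens.append("NEWLINE")    # ends the previous line
            while width < indents[-1]:
                indents.pop()
                tokens += ["OUT", "NEWLINE"]   # OUT, then the held NEWLINE
        tokens += line.split()
    if indents:
        tokens.append("NEWLINE")        # ends the final line
        while len(indents) > 1:
            indents.pop()
            tokens += ["OUT", "NEWLINE"]
    return tokens


sample = "if cond :\n    stmt\nnext\n"
print(" ".join(indent_tokens(sample)))
# if cond : IN stmt NEWLINE OUT NEWLINE next NEWLINE
```

The indented block ends with NEWLINE OUT NEWLINE, which is the sequence
these notes keep running into: the first NEWLINE belongs to the statement
inside the block, and the second is the one held back from the 'if' line.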
How is

   if cond : st1 ; st2 st3

(where newline is permitted between 'st1;st2' and 'st3') different from

   if: cond then: st

(where newline is permitted between 'cond' and 'then')?
In the first instance I want to reduce further.  In the second instance
I cannot.
In the first case, the new thing started midline.  In the second it didn't.

ARG.  Still not sure I have this right.  Though maybe my indent grammar is
broken....

We definitely need to know if a state "starts lines", i.e. it is a state
where we are expecting a 'line like' thing.
A 'line like' thing should be a thing, i.e. a non-terminal.
A non-terminal which ends with a newline is a perfect candidate for
'linelike'.
So any state which is followed by can_eol is linelike?

The grammar needs to be carefully constructed.  Anywhere a NEWLINE appears,
we definitely don't ignore newlines.  So they should only appear after
things that we want to be line-like.
So

   variable = expression NEWLINE

is no good, because we don't want 'expression' to be linelike.
Is

   for SimpleStatements NEWLINE

ok?  Not really, because SimpleStatements is recursive.

1/ outdents and newlines must "close" any productions which started
   at-or-after the matching indent, or after the matching start-line.

   The key idea is that the total set of tokens for any given symbol must:
    - not include an OUT without the matching IN.  If the IN was at the
      start, the OUT must be at the end.
    - not include a NEWLINE unless it started at or before start-of-line,
      unless NEWLINEs are being ignored.
      (unless the symbol includes only the newline)

   So when we see an 'OUT' we reduce anything we can until we can cancel
   against an IN.  If the IN we would cancel against is at the start,
   reduce again if length==1.
   When we see a NEWLINE, reduce if we can, as long as length doesn't go
   beyond start-of-line.

2/ NEWLINES are ignored after an indent if they are not "expected" since
   the indent.

   A symbol is 'linelike' if it is ever followed by a NEWLINE.
   i.e. the symbol after it in some production begins with a NEWLINE.
   (if "a -> b c" and "x -> a NEWLINE", then a is linelike, but c isn't).
   If a state is followed by a linelike symbol, then it is a starts_line
   state.
   Newlines are expected in starts_line states.

   SimpleStatements
   Block
   ComplexStatements

so:
 - track which symbols *start* with a newline
 - deduce which symbols are linelike - they are followed by a newline
 - deduce which states are starts_line - a linelike symbol follows DOT
 - Make sure the grammar handles newlines properly.

The "shift if you can, reduce if you cannot" rule means that an unexpected
symbol effectively terminates everything.  But we don't want to terminate
an indent before we see the outdent, so that needs fixing.
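
The first two deductions (which symbols start with a newline, and which
symbols are therefore linelike) can be sketched as a small fixed-point
computation over the productions.  This is a rough illustration rather
than what parsergen does: the dict-of-lists grammar representation and the
function names are invented here, and empty (nullable) symbols at the
front of a production are ignored for simplicity.

```python
def starts_with_newline(grammar):
    """Symbols whose expansion can begin with a NEWLINE token.

    `grammar` maps each non-terminal to a list of productions, each a
    list of symbol names (a hypothetical representation).  Nullable
    leading symbols are not handled in this sketch.
    """
    starts = {"NEWLINE"}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for head, bodies in grammar.items():
            if head in starts:
                continue
            if any(body and body[0] in starts for body in bodies):
                starts.add(head)
                changed = True
    return starts


def linelike_symbols(grammar):
    """Symbols that are followed, in some production, by a symbol that
    starts with a NEWLINE."""
    starts = starts_with_newline(grammar)
    linelike = set()
    for bodies in grammar.values():
        for body in bodies:
            for sym, following in zip(body, body[1:]):
                if following in starts:
                    linelike.add(sym)
    return linelike


# The example from the notes: "a -> b c" and "x -> a NEWLINE".
grammar = {
    "a": [["b", "c"]],
    "x": [["a", "NEWLINE"]],
}
print(linelike_symbols(grammar))        # {'a'}
```

Only 'a' comes out as linelike: 'c' ends 'a' but is never itself directly
followed by something that starts with a NEWLINE.  A starts_line state
would then be any state containing an item whose symbol after DOT is in
this set.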