How to write up two-dimensional parsing?

What does the code do?

- If an indent is ignored the matching undent is ignored
  except to decrement the 'ignored' stack and to turn
  a pending indent into a pending newline.
  or to count the pending undents.

- If a newline is found and cannot be shifted yet then we check
  each production that would close before we shift, and if any still
  have pending undents, we ignore the newline

- when we reduce a production, we 'resolve' indents wich may leave
  something pending. Either an indent or a number of undents?

- productions can be recursive if first body symbol matches head.
  A recursive production may be horizontal if it starts in the middle
  of a line. non-recursive productions are always horizontal.
  A horizontal production must be all on one line or have its body
  indented.
  A vertical production must not be indented.


Maybe I don't want to mention indents in the grammar ?

 if CONDITION : STATEMENTLIST

doesn't have to be indented but we will only reduce it when we
see something that doesn't belong.... which could be an UNDENT

So: when look-ahead is UNDENT we cannot shift so we reduce until
the top state has an indent attached, at which point we consume the
UNDENT.

 if hello :
    foo
    bar
    bar
 UNDENT

the undent will reduce until we just have the ifstatement which
contains an indent.
So any indent is pushed onto the stack and a reduction sums all the
indents.

What about newlines?  A newline will reduce any production that
doesn't contain an indent.  It gets discard when top of stack
was at start of line...

So each 'state' also encodes
   start-of-line or middle-of-line
 and
   count of indents.

When an INDENT is seen the current state increments count of indents.
When an UNDENT is seen we must reduce until tos state has an indent,
which we decrement.
When a NEWLINE is seen, we reduce until tos state has indents or was
at start-of-line.

nononono
UNDENT shouldn't force a reduce.  Only newline should do that.  But as
an undent is always followed by a newline, it still results in a
reduce happening where I want it.

So an undent shifts in line an indent(?).  Reduction nets the indents.
A newline forces the reduction until we find excess indents.

But what does grammar look like?
 statement -> assignment ';'
           | assignment
So a newline will force the reduction to 'statement', but this will
allow
    a = b c = d
which looks bad.  So:
 statement -> assignment ';'
           | assignment NEWLINE
Then the NEWLINE can be shifted normally.
So newline forces reduction until it can be shifted or until ToS
contains an indent or started on a new line.
If we hit an irreducible state that does not accept NEWLINE when we
have to shift in an ERROR instead which might then be reducible.
When popping states we need to collect the indents.


Where do we attach the indent/undent info?  To the state or to the
symbol?  They are kept together but mean different things.
indent/undent are between terminals, just like states.  But they
can be inside non-terminals which make them unlike states.

What "is" a state?  It is a point between two symbols in a sentential
form.  It doesn't carry information.  So I think the indent has to go
with a symbol.
An 'indent' should attach to the symbol *before* it as that is the
thing that is being continued.  An 'undent' should also attach before
itself.  Eventually those two terminals become included in a single
non-terminal and they collapse.
You can never have more undents than indents as a newline will force
reductions until top symbol has no undents.
In fact we reduce until the top symbol has a new indent, or is a
vertical production, or is explicitly allowed.

If we see an INDENT, we attach to symbol at ToS
If we see an UNDENT, we attach to symbol at ToS (possibly canceling indent)
If we see a Newline and current ToS holds +ve indent,
   we ignore it
If look-ahead is expected, we shift it.
If we can reduce, we do so.
Else we trigger error handling:
 pop states until one accepted ERROR - collecting indent
 shift ERROR with those indents
 Discard input, but process indent/undent, until next symbol is
 expected.
 continue.


Does this work?  If I have
  a = b +
      c +
      d

Then the NEWLINE after "c +" cannot be handled, but should not be an
error.
Stack would hold
   var = expr +
where 'expr' contains an indent.
Maybe NEWLINE is ignored unless ToS has negative indent.
  a = b +
      c +
    d
That must be an error.

Problems:
 1/ accepting NL must depend on what it will reduce.
 2/ If UNDENT/INDENT appear together they must be kept separate.

So maybe UNDENT does block shifts and so forces reduction.  If not
reduction can be made, that is an error

if (a) {
  b;
}

The NL UD NL before '}':
  NL will ensure "b;" is reduced.
  UD will reduce to command-list so we will have

  if ( expr ) { commandlist
where '{' holds the indent.  So '}' needs to bring in the undent,
which is after another NL.

So Undent waits for a symbol to collect it when shifted.

What when I have:
   if (a) {
       foo;
   while (b)
       bar;

When we try to reduce "foo;while(b)bar;" to "statementlist" the
negative indent triggers an error.  This a bit late though.  I want it
when I see the 'while'.  I want an error there which can be reduced a
closed statement.
The undent should close the statementlist so that '}' is the only
shiftable symbol.  But after 'statementlist', 'while' is still
permitted.

If we just waited for the 'while' statement to reduce and then noticed
the error, we could still report it against the 'while' token, but
recover would be a lot harder as we have already parsed the 'while'
into the 'wrong' place.

So when I see that token after NL,UD,NL, I need to be able to say
'yes' or 'no'.
Could require help from grammar.
e.g. if a token is allowed to have a lesser indent than previous
token, then there must be a literal UNDENT token.
Or we could require:

 Block -> BlockA }
 BlockA -> { Statementlist

That looks ugly, but the undent would close BlockA and allow the } to
be expected.
Or maybe

 Block -> { Statementlist OptNewline }

so the newline after the indent is shifted and doesn't
reduce-or-error.

Then the UNDENT would ensure that the Statementlist were closed, the
Newline would be shifted, only a '}' would be allowed.

So: each symbol in a production must be in the same line as the
previous, or in an indented line.

No.  OptNewline there wouldn't work as a blank line would take us to
expecting a '}'.

I think I need to go back to what works... only it doesn't?


Indent is attached to previous symbol. I think that is easier that
keeping it at the 'start' of the 'next' symbol as I was doing.

Newline closes any productions that don't include an indent, then can
be shifted if wanted, or ignored.

Undent reduces things until ToS has an indent that can be canceled.
This means that if part of a production can have a negative relative
indent, it must be separate:
Block -> Blockhead }
Blockhead -> { Statementlist

So the Undent before the '}' will reduce Blockhead and can then be
discarded.
We would add
Block -> Blockhead ERROR

It would be nice if that can be auto-handled. e.g.

Block -> { Statementlist $undent }


leads to the above.
That would be messy for the reduce action though.
Could each state "know"

--------------
A state can have different items with different offsets, so how much
of the stack is part of the 'current' production is not well defined.
It is not completely random either.

We can ignore non-kernel items as those are just potential productions
- they haven't started yet.
Of the remainder, many states have only one length.
Many others will have some offset and '1' for recursive
 productions: A -> A . X
These can be called "potential" as well as we have no commitment.
What is left?
  Funcall -> Factor . ( ExprList )
  Term -> ~ Factor .

  Funcall -> Factor . ( ExprList )
  Term -> Term * Factor .
etc

i.e. just the bits that make the grammar distinctively LR.
In my case one is always reducible, is that important?

So when we see an Undent we find the max-index of the current state
and determine if there is a matching indent within that distance.
If there isn't then we must reduce or trigger an error.

Alternately:  Each 'goto' from this state has a max-history.
We exclude any that don't have enough history to include balanced
indents.
If the next symbol would trigger one of those, it is an error.

So:
 For early error detection we record Indents against the previous
 symbol, ignore newlines, and when we see an undent the next shift
 must lead from a core item with enough history to keep balance.

If that is well defined it might catch early undents, but what about
unneeded indents?
   a = b + c;
       c = d + e;

I expect that is less likely, but still an error.
When we see the first ';' we will reduce to a statement and then a
statementlist.  The second 'c' would lead us from
  statementlist -> statementlist . statement
which is recursive and so doesn't allow indents (unless it started in
the middle of a line).

Intends are allowed anywhere except after the first symbol of a
left-recursive production which started at beginning-of-line.

a = b +
    c
      * d;

is not allowed as 'Term->Term*Factor' is left-recursive.  But why not?

So maybe an indent has to be allowed anywhere.  Maybe a warning but
there is no real recovery option.

An undent is allowed in a state with a history back to the matching
indent.
If we see an undent, we check if the state can consume it.  If not we
look to the next symbol which is not an undent (or newline) and reduce
until that symbol can be shifted.  At this point the undent must have
been consumed.

Or easier:  We must reduce to the matching indent before we shift,
unless a NEWLINE is explicitly allowed...

Take 1:
 An indent is between two symbols, or within one symbol.
 We hold a pending indent which gets attached to the front of the next
 symbol that we shift.
 Undents and newlines collect until we see a "real" symbol.
 Any reductions that are then indicated happen and if the ToS symbol
 is found to hold an indent, that cancels a pending undent.
 Once we get to a state that can shift the lookahead we examine the
 number of pending undents.  It should be zero. i.e. we shouldn't
 still be in the middle of a non-terminal that started at a greater
 level of indent.

 If it isn't zero we trigger an error because a close-bracket or
 similar is missing.  This might be different to "unrecognised-error"
 The error should lead to reductions which cancel out the remaining
 undent.
 I think the parser needs to incorporate error handling into states.
 Each state knows what symbol to synthesis to get closer to a
 reduction.
 i.e. on 'no close' error, we reduce what we can, and synthesis when
 we cannot.

So that seems to handle early undent.
We also have 'unexpected indent', 'missing indent' and 'late undent'.


A 'missing indent' happens when a production that we expect to be
one-line sees a newline rather than an indent.
i.e.
     a = b
     + c;

The + after the newline only reduces 'b' to Expression which starts
in the middle of the line.  A newline should see a reduction to a
nonterm which started at the beginning of a line, or which contains an
indent.

We trigger a 'no close' error here so that '+' is the start of a new
statement.  Maybe if the next symbol is shiftable, we just report an
error and allow parsing to continue;

An 'unexpected indent' happens when the fully reduced non-terminal
started at the beginning of a line and is a recursive non-terminal.
Or maybe we are in a recursive state.  This triggers an error

A 'late undent' is like an unexpected indent.  The previous
fully-reduced non-terminal is recursive, started at the beginning of a
line, and still contains an indent.

------------------
Each symbol can carry a 'start-of-line' flag and an 'indent' count.
Each state may be flagged as 'recursive' if "A -> A . etc", and
records the symbol to synthesise to resolve a 'missing' error.

Current state includes an undent count, a nothing/newline/indent flag.
When look ahead contains undent, we increment that count
When look ahead contains newline we set newline flag
When look ahead contains indent, we set flag to indent.
When we reduce we combine indent counts in symbols and keep
start-of-line from first symbol.
If the ToS after a reduce contains indents and current state contain
undents, they are reduce so that one becomes zero (min is subtracted
from both).
Before we shift we must test:
- if undents are non-zero, we synthesis the symbol expected by the
  state, we keep synthesising and reducing until the undent count
  reaches zero.  We report that the synthesised symbols were missing.

- If newline flag is set, the ToS symbol should start at SOL or
  contain an indent.  If it doesn't we just report a missing indent
  error.

- If indent flag is set and ToS symbol started at SOL and state is recursive
  we report an unexpected indent.

- If state is recursive and ToS started a SoL and contains an indent,
  we report an extra indent.


---------------------
(I think I'm getting there).
Hypothesis:
  An indent is allowed in a non-recursive that started on a newline,
  or in a recursive (at the recursion point) that didn't.

  The matching undent must be at the end of the production.

  A newline is only allowed in a production that started at the start
  of a line, but not always. - only if recursive or annotated.

  An undent is required at the end of a production that contains an
  indent.


  An indent or newline found where not "allowed" generates an error
  but no recovery.
  An undent not found where required generates an error and emergency
  reductions.


If we have an ambiguous grammar then the undent could simply
force a reduction. i.e. if the 'emergency reduction' just involves
choosing REDUCE of SHIFT, then it isn't an error or even a warning.


But where exactly is an indent or newline allowed - I don't think I
have this right.
 - a newline is definitely allowed in a list that started at the
   beginning of a line.  We need to structure the production right
   so we don't end up with a separator at the start of a line.
      LIST -> LIST SEP ELEMENT
           -> ELEMENT
   would need to allow the newline before ELEMENT.  The logic there
   would be that if the first ELEMENT can be at start-of-line, other
   ELEMENTS can too.
      LIST -> LIST ELEMENT
           -> empty
   Is appropriate if no separate is needed, and an empty list is OK.
   Here ... does the indent go after the LIST or before the ELEMENT?
   i.e. if there are more symbols, how do we know.

   If the list already has an indent, then from there on it "started
   at the beginning of a line"
   This is treating it a lot like LL(1) which would be:
      LIST -> ELEMENT LIST
            | empty

      LIST -> ELEMENT SEP LIST
            | ELEMENT

 - a newline is definitely allowed in a non-list which has already
   been indented.
      for (i=0;
           i<10;
	   i++)

      a = b +
          c +
	  d;

   So: whatever holds the indent can also hold a newline.

   i.e. a production can have a newline between symbols if an earlier
   symbol has an indent before it.
   This would allow
    a = [ b
        , c
        , d ]
    which is probably OK.

    What about indents?  Can I have two indents in the one production?
    No, I don't think so.

    If an indent occurs *inside* a production, newlines can appear
    later.
    If an indent occurs *in front* of a list production, newlines can
    appear later.


So (again!)
  a symbol (on stack) can have an indent at the start, and can have a
  number of indents within.
  When we shift the first symbol after an indent, it gets an indent at
  the start.
  When we reduce the result inherit and indent at the start, and other
  indents (within and at start) become within.
  The indent at the start of a list can be treated as being within.
  i.e. as list has an empty symbol at the from before the indent.

  A newline is permitted if the max history for the current state
  contains an indent.  Indent at start of non-list doesn't count.  At
  start of list it does.
  An indent is permitted if the max history does not contain an
  indent.

  An undent cancels indents within, or possibly before, the
  immediately preceeding symbol.


Can I do all lists with empty start?

  // simple list
  LIST -> LIST ELEMENT
        | empty

  // non-empty list
  LIST -> LIST0 ELEMENT
  LIST0 -> LIST
        |

  // non-empty with separator
  LIST -> LIST0 ELEMENT
  LIST0 -> LIST SEP
         |

  // can be empty, sep is needed
  LIST ->
       | LIST1
  LIST1 -> ELEMENT
       | LIST1 SEP ELEMENT

  Just stick "IsList -> " in front of anything that is a list.


For newline to be allowed, do indent have to be between symbols in
same production, or is within symbol ok?

if ( a =
     b )   // now I need another indent.. or an undent?
  x = x + 2;

The undent resets, then only an indent is allowed.
if ( a =
     b )
        x = x + 2;

No reset, so extra indent.  Anything is probably OK except newline.
So the indent cannot be inside a previous symbol.
But for lists, the indent will be exactly inside LIST.
So maybe inside first symbol is OK ??
 x[a+
   b] =
   c

 x = [ a, b, c,
   d, e, f
 ]


I think I might be trying too hard.  I should allow newlines
and indents anywhere, and give a clear interpretation for undents.
However my real goal is to know exactly what newlines and undents mean
and how to parse them.
i.e. does the grammar need to mention all of the, some of them, or
just be ambiguous and allow line break to disambiguate?

That last options doesn't really work as we cannot allow a separator
to be dropped anywhere by at end-of-line or before-close.
So grammar needs newlines.  Various separators can be "sep or newline".
However I don't think indents need to be mentioned.
  if condition : statementlist
needs undents to disambiguate I don't need to say that.

So:
 - indents get shifted with token
 - indents are grouped by reduce
 - and undent reduces until a matching indent in the top
   state can be cancelled.
 - newlines are messy.
   We look ahead through possible reductions until we
   either shift the newline or find an indent or cannot reduce.
   If the shifted newline would not contain a newline in the production,
   we choose that path.  If we find an indent, we discard the newline.
   If we cannot reduce we again ignore the newline
   We could also discard a newline early if not in LA set.


But then:
   if x :
     a = b +
     c = d

Looks to the compiler like an addition and to the human like two assignments.
     a =
         b +
         c +
         d +
?? Only allow newline if production started with indent.
So:
   If not expected and current production contains indent, discard.
   If not expected and current production has no indent - error, recover.
   If expected, test-reduce and discard if indent found else accept.
Not quite.

What if we just test-recover until we hit an indent or want to shift the newline?

Difference between statements and expressions is that a statement can
be terminated by a newline.
So if we find a newline and test-recover would use it, that is an
error.
If we find a newline and test-recover would not use it before an
indent, it isn't an error.

So each state which cannot reduce has a link to the best state to shift to.
When we want to recovery we follow these links until they run
out. Then we can reduce which leads to a new state.

If we find a newline, we do this until we find a state which will take
a newline, or we cross an indent.
If we crossed an indent, we discard the newline.
If not and we has to take any shift steps, we call an error.
This requires that newlines aren't buried to deep.

If we find an undent, we do this until we have cancelled the undent
against an indent.  If anything was shifted, we report an error.

No - using indent just to disambiguate an ambiguous grammar doesn't
work as it still allows confusing constructs.
   if foo:
     bar
     else: baz
Should be wrong as the 'else' looks to be part of the statement but
cannot be.  It should be "unexpected".  That really means that
"INDENT" "UNDENT" must be part of the grammar.
So:
   if COND : INDENT STATEMENTLIST UNDENT
which is where I started.

So an indent which is not expected protects all newlines and
indent/undent pairs until the matching undent.... not quite.

Indents can be expected in a state (if we compute lookahead).
If an indent is not expected we simply discard it and keep count.
When we shift an expected indent we push the count with it.  When we
push the matching undent, we restore the count.
When we get an undent, we discard it if indent count is non-zero.

When we get a newline we look ahead to see if it would reduce across
an indent.  If so we discard the newline.  If not we keep it.

But an undent still needs to force reductions doesn't it?

i.e. if an undent doesn't cancel against top of stack we count it.
If we every try to shift while an undent is present, we throw an error
and shift other stuff until the undent is gone.


Indents must affect error handling.  We no longer just discard tokens
left and right.  The 'error' state for any indented token is the
matching undent.  So we discard stack tokens until we get an error
state or an indented state.
If error, we shift the error and discard input until a token
  matches.  While discarding input we track indent/undent and
  if we come to an unmatched undent we discard stack tokens again

If we hit an indented state we discard input until the matching undent
arrives.

But do we really want to discard tokens, or can we synthesise tokens?

Discarding is standard.  It gets to a known position.  It allows any
location in a production to accept errors.
However discarding does discard data that cannot get further analysis.
If we use very fine grained error productions we might not miss much.

Come back to this later.  For now just discard.

If we find an Indent on stack, an error "must" be expected.
i.e.
  block -> : INDENT Statementlist UNDENT
  Statementlist -> ERROR

So:
  if we can shift ERROR, we do,
  else if


Indents and newlines.

Indents and newlines can be taken as strong syntactic marker, useful
both for regular parsing and error recovery.
Unlike other bracketing markers such as parentheses or paired key
words (BEGIN and END or DO and OD) indents cannot be lost.  Every line
displays what the current indent level is and for every indent there
is precisely one location where the matching undent is (though
of course one location can have undents for multiple indents).

Similarly start-of-line and end-of-line can be strong bracketing
markers which cannot be missed.

The two can complement each other to provide a very effective
indication of the grouping that the programmer intends.  Often the
programmer will want more explicit marking as well (e.g. "{" and "}" )
which is perfectly fine as redundancy of expression is an important
part of effective communication and this is equally true in writing a
compute program.

The connection between the two is simply that indentation effects a
continuation of the previous line. For example in

   A B C
      D E

Any bracketing implied by sol/eol should  not be imposed around "A B
C" but only around "A ... E" - the indent effectively hides the line
break.
So this could be tokenised as
  sol A B C indent sol D E eol undent eol

Note that every SOL has a matching EOL and every INDENT a matching
UNDENT.

As these markers are strongly paired they can be useful for error
detection and recovery.  This is particularly true for indents.

It seems universally true that any syntactic unit (such a declaration,
statement, or expression) which starts indented will finish while
still indented. That means it will complete before the undent which
matches the most recent indent when it was started.

This fact is helpful but not quite as strong as we might like
For example, in

  A B C
     D E
  F G
 H

"F G" could be in the same syntactic unit as "A..E".  however if "D"
started a new subunit, "F G" certainly aren't part of that.
Assuming that "A" started a syntactic unit, "H" is the first symbol
that is clearly not part of it.
This isn't really a problem for
   if condition :
          foo;
	  bar;
   baz;

as the statementlist is the last part of the if-statement, so
there is no room in it for baz - baz must be separate.
However in

    a = b + c
        + d + e
	+ f
    + g + h;

This looks wrong but cannot be ruled out on the grounds of
indentation.
Here the expression of
     statement -> var "=" expression ";"

starts on the same line as the statement so there is no indent.

We could argue that a syntactic element (expression) that stated other
that at the start of a line may only continue to the end of that line
(when adjusted for indents).  This can hit problems with:

  if condition {
      foo;
      bar;
  }

Here the "block" of
   block -> "{" statementlist "}"
starts other than at the beginning of a line, but continues on the
next line.

There are a few different ways we could try to deal with this:

a/ some special-case marking in the grammar to say that a newline is
   permitted before the '}'
b/ Don't make a separate syntax element for "block":
     ifstatement -> if condition { statementlist }
    would mean that the '}' is at the same indent as the 'if'
    which starts the element.
c/ decide that the last terminal of a production is a special
   case and can go on a new line.

The last is almost certainly a bad ideas it would encourage
    a = b
    ;
and discourage
    if foo then
       stuff
    end if

Another construct that is often used is:

  if condition
  {
       foo;
       bar;
  }

where the "block" is allowed to start on a new line.
This potentially encourages 'block' to be a separate syntactic unit,
so 'b' above doesn't seem encouraging.  That leave 'a' and marking
both the '{' and the '}' as being allowed to start a new line


If we assume these markings, then an undent can close any syntactic
unit that started at a greater indent, or at the same indent but not
at start of line.
We can use this:
 - to detect errors by reducing early and then ensuring the next
   token is expected.
   e.g. the indent after 'bar;' would close the statementlist after
   which only a '}' is expected.  If it isn't there, that is an error.
 - to handle ambiguous grammars.  In the classic
        statement -> if ( expression ) statement
	           | if ( expression ) statement else statement
   the normal default is to treat 'else' as right-associative.
   Using indents:
      if (x)
         if (y)
            foo();
      else
         bar();

   can be parsed with the 'else' applying to the 'if(x)' as you would
   visually expect.


Now we turn to line markings - sol and eol.
The above suggested that any unit should be "on one line" (allowing
for indents) unless specially marked.  This is bit excessive.
For example
   a =
       b + c
       + d;

"b + c + d" is all one expression, but appear on two lines.  Might
seem unlikely, but
   a[some_long_expression_with_function_call_etc] =
      verylongling + otherverylongthing
      + moreofthesame;

might not seem so unlikely.  This could easily be made correct by
moving the '=' to the start of the second line, or indenting the third
line slightly.  It isn't certain that we want to do that, and
annotating that a newline is allowed before the 'plus' seems clumsy.

An alternate approach involves noting that 'statements' are rather
special.  Not uniquely special but still special.
That is, statements are normally on one line.  When they are two long
for a line the remainder is indented (make it part of the previous
line).
When a statement appears on multiple lines it is a notable exception.
For example the  Python

   if condition:
      statementlist
   elif condition:
      statementlist
   else:
      statementlist

has three lines in one statement.  This is a one-line statement which
has newlines before the 'elif' and 'else', or maybe 3 one-line objects
which can come together.

While a statement is line-based, and expression probably isn't.  They
tend to appear all on one line, but that is because the statement they
are in does.

In some sense the line-nature of the statement "captures" the
expression.  An expression can only be on multiple lines if they are
all indented with respect to the containing statement.

So rather than marking some things as being allowed to continue to the
next line, we can mark some things as being line-shaped.  These are
sometimes whole units and sometimes parts. e.g.

  if ( condition )

  {  statementlist

Note the missing '}' above - it is not part of the line shape.

  if condition : statementlist

  else: statementlist

  function name(argumentlist)

  lval = expression ;

and so forth.

Being line-shaped just means that it must be "all on one line", but
not that it must be a line by self.  Several line shaped things can
appear on a line

   a = a+1; b= b-2;

However it would be nice to allow a syntax modification when separate
lines are used. i.e.
   a = a + 1
   b = b - 2

The trailing semicolon is not needed.  As the assignment is
line-shaped it must finish at the end of the line.
Declaring the semicolon to be optional would allow
   a = a + 1 b = b - 2
which looks more like an error than a correct construct.  Here the
semicolon is need.

A simple solution is to allow the NEWLINE to be part of the syntax
listed for the grammar:
   OptSemi -> ;
           | NEWLINE

Then
   Assignment -> lvalue = expression OptSemi

as a line-shaped item should work.


1/ If a point is 'newline allowed' then what follows can go
   on a new line
2/ If something is line-shaped then linebreaks may only be intended.

So in a production we can have 'optnewline' or 'optsemi'
anywhere but at the start.  Or maybe 'at the start' means something a
bit different.

  statement -> if condition optnewline block
             | if condition optnewline block optnewline else optnewline block

  block ->  : statementlist
        |  { statementlist optnewline }


If something is line-shaped, then the EOL that ends the line on which
it starts, must end the thing.
If something is not line-shaped, then an EOL can be safely ignored.

An LR state can expect NEWLINE if it has a goto state for it.
And it has an indent.

If we look up the stack and find an indent before a newline-expected
then a newline is ignored.
If we find a newline-expected before an indent, the newline is used.
If a state expects a newline and was indented then the newline
expectation comes last, is seen first when looking back, so newline is
used.

Separately, an undent reduces until it can cancel with an indent.


but what about
   a = 1 + (3 *
            4 + 2)
   b = 1 +
       3 * func(a, b,
                c, d)

This isn't so much about whether an indent is present, as how much of
an indent there should be.  This sort of indent doesn't affect parsing
or help with error recovery so it is probably best left to esthetics.
We could possibly require that everything in the brackets be indented
w.r.t. line on which the brackets started so that

   b = 1 +
       3 * func(a, b,
       c, d)

would not be allowed.  However we would need to ensure that
   b = 1 +
       3 * func(
       a,b,c,d)
were allowed (I think)


I've got the UNDENT reduce a bit wrong.
Anything that started with an indent must be reduced, not just one
thing that started there and everything after.
This is awkward as reducing back to 'statementlist' doesn't stop
adding to that statementlist, but it should.  So we want e.g.
  statementlist -> statements
   statements -> statements statement
              |
Then we reduce back to 'statementlist' which cannot accept another
statement.   The parser generator could insert these extra non-terms
automagically.
No that's not quite it either.  If I have
   a;
   b;
    c;
   d;
then the undent after 'c;' needs to reduce the statement (which
is already reduced) but not the 'statementlist'.  But there is
no extra state there.  We aren't reducing back to a state, but back to
a symbol.  The indent happened after a statementlist

A recursive symbol A is a problem if:
  A -> A B
and
  X -> ?? A Y
for non-empty Y.  This results in a state:
   X -> ?? A . Y
   A -> A . B

If the Y is empty, X will reduce and B becomes unshiftable.
If this pattern is found we can create A'

  X -> ?? A' Y
  A' -> A

Now 'A' only appears at the end of a production so there is no
problem.


Got it wrong again ...

   foo
   bar
    baz
    bat

after 'bat', the recent indent is in Statements which makes it look
like we need to close all of Statements.  But we don't.  As the indent
stated in the middle of Statements we only need to close to it.
So we really do need to know if an indent is at the start or in the
middle.
We can cancel an 'in the middle' as soon as we find it.  We can only
cancel an at-the-start when reduce_size > 1 (as it is now 'in the
middle').


Now back to newlines.  It doesn't work very well to include newlines
in the grammar wherever one is allowed as it conflicts with where they
are required. E.g. for ';' to be optional we sometimes require one at
the end of a statement, but for statements to be line-like we need to
allow one at the start

I think we need to allow NEWLINE in the grammar for EOL markings.
Maybe we want SOL too?

The important cases are:

  a = b *
      c +
      d;

Statement is line-like so return to SOL not allowed, but expression is
not to return is.
  if condition:
     statement
  else:
     statement

Both 'if ..' and 'else..' are line-like
  a = b + c

Newline at end of this line is allowed in place of ';'.

If we have the grammar recognise all newlines that are not protected
we need to support blank lines in lists.
 Statements -> Statements OptNL Statement


No, I don't like that.  EOL should be at the end.

Some things can have an EOL at end, either optionally or required.
Or elsewhere in the production.

When we see an EOL we need to ask "Do I ignore this or use this?".
The answer relates to indents and expected newlines.

If a symbol can derive a sequence with an EOL, then the symbol is
flagged can-eol.
If that symbols after DOT in any item in a state are flagged
'can-eol', then the state is flagged 'can-eol'.
If a 'can-eol' state appear on the stack closer than a 'was indented'
state, the we use the EOL token.
If no states up to the first 'was indented' state are flagged
'can-eol' then we discard the EOL token.


That might work, but have I thought of all the issues?
 - blank lines.  The grammar will need to allow for these explicitly
      in any context where EOL is expected.  Not a big burden I guess.
 - force reductions ?  If EOL closes things, does it close anything
     where not explicit?  Probably not - that is sort of the point.
     If it appears where not expected but in line context, that is an
     error just like an "unexpected" error.

 - is the nesting 1-way?  i.e. can I have line-oriented things in
   noline boxes.  or "vertical stacks in horizontal boxes"?

   A possible example is a function call.  It might be nice to allow
      a = b + func(first
                   second
		   third) * 2

   There are two problem here.
   Firstly there is no EOL after 'first' so that cannot mark the end
   of an argument.
   Secondly it forces an EOL interpretation on surrounding expressions.

   And what about
     a = [1, 2, 3
          4, 5, 6
	  ]

   I would like to be able to drop the ',' after the '3' but it isn't
   clear that it is safe.
     a = [1, 2, 3
                +4
		+2
	  8, 9]

   Is a bit forced but with clear intent.  I think it could be deemed
   as wrong because the think that wraps is not the first list element
   on that line.  Is that a well defined concept?


   Recording the line structure allows for a number of different options.  As mentioned already, a "NEWLINE" at the end of each line might do it.  However as "IN" and "OUT" already imply a line break, we could just have "SAME" on breaks which don't have "IN" or "OUT".  However this doesn't capture the idea that a line sometimes corresponds to a syntactic unit.

Another approach is to follow the bracketing nature of "IN" and "OUT"
and have "SOL" and "EOL" for start-of-line and  end-of-line.  As we
noted already an indented section implies continuation and so should
delay the end of line.  If we follow this idea, the above structure
could be tokenize as:


Here every "SOL" has a matching "EOL" but it can be delayed until after an indented block. So the  "EOL" that matches the first "SOL" comes after the "OUT". It is worth noting here and a single line break can have multiple "OUT"s and  even an "IN" as well.

A B C
  D E
   F G
 H
I

After the "G" we would have "EOL OUT EOL OUT EOL IN SOL" before the "H".  The three "EOL"s terminate the lines that start "F", "D", and "A" in that order,  and a new "IN" is needed as "H" is indented with respect to "A".

This tokenization captures the structure very effectively and is what
I will be using with one simplification.  The  "SOL" really isn't
needed.  It is easily deduced and doesn't have a strong effect on the
parser.  The "EOL" may for a syntactic element to be closed.  The
"SOL" is simply recorded for reference, and that can be recorded when
"IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so
"OUT" doesn't need to imply "SOL").

This tokenization captures the structure very effectively and is what I will be using with one simplification.  The  "SOL" really isn't needed.  It is easily deduced and doesn't have a strong effect on the parser.  The "EOL" may for a syntactic element to be closed.  The "SOL" is simply recorded for reference, and that can be recorded when "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so "OUT" doesn't need to imply "SOL").

Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.


When does an EOL actually mark and END.

Now that we have a good understanding of IN and  OUT we can turn to EOL.  Intuitively EOL should mark the end of something much like OUT does.  But what.  Let's look at some cases:

if a == b:
    print "Hello"
    print "World"

Here there are 3 EOL tokens, one after "Hello", one after "World" and one after the OUT token right at then end.  The first two should close the statement on that line.  The last probably shouldn't as there might be an "else:" still to come.

a = b +
    c * d +
    e

There are 3 EOL token here in the same locations.  Now the first two should not close anything, while the last one should close the statement.

These two examples have very similar structure and are largely indistinguishable to the LR parser, and yet must be handled very differently. This implies that we need to provide the parser with extra information. We need to tell it something about the expected line structure of the language.

To understand what to tell the parse we need to look into our (or my) intuition about why the newlines are or are not wanted in the different locations.

I think the key observation is that statements are generally expected to be one-per-line, while there is no similar expectation for expressions.  The earlier observation that an "else:" might still be expected shows that statements are always one-per-line, however that is a fairly well defined exception. i.e. there are certain placed in a statement where a newline is expected, or at least not a surprise.  The start/end is one place.  Before an "else" or "elif" is another.  In C we might have:

if (a == b)
{
    print "Hello"
    print "World"
}

so an EOL is expected before a '{' or a '}' in that language.

If we define all the places where an EOL is expected then we should be able to work out when an EOL should close a syntactic unit and when it should be ignore.  However it isn't as simple as ignoring an EOL where ever it isn't expected.  We need to include our understanding that indents protect EOL in some sense. In the second EOL example above (a = b + c ...) the first two EOL tokens were ignored in part because they were indent.  If the second they were not ignore because despite being indented, they were expected.  So the interplay between indents and EOL expectations needs to be understood.

If we go back to our very first example:

A B C
   D E
   F
G H

you might remember I suggested that if something started at "B C..." it should finish at "F".  The "OUT" should close that as well.  Our process described ..... not sure ..

 
Also you need to know how the scanner can report indent information to the parser.  I discuss that in some detail when introducing my lexical scanner but you probably don't need to read that now.  It is sufficient to know that the token stream will have paired "INDENT" and "UNDENT" tokens, and that the "NEWLINE" token that you might expect to come before an "INDENT" is delayed until after the matching "UNDENT".

With this in place, it seems natural to allow the parser to absorb INDENT and UNDENT, but to require that when it reduces and production, the net indent absorbed while shifting the tokens which make up that production is zero.  i.e for every INDENT there must be exactly one UNDENT.  So:

if condition {

It might help to start with some examples.

1. Lorem ipsum dolor sit amet, consectetur adipisicing
2.         elit, sed do eiusmod tempor
3.         incididunt ut labore et dolore
4.            magna aliqua. Ut enim ad minim
5.    veniam, quis nostrud exercitation ullamco
6. laboris nisi ut aliquip ex ea commodo consequat.
7. Duis aute irure dolor in reprehenderit in voluptate
8.      velit esse cillum dolore eu fugiat nulla pariatur.

Here are 8 lines of random text - the line number are just for reference and should not be considered part of the text.   Similarly any punctuation or capitals should be ignored when trying to understand the structure.

Looking at just the newlines and indents it appears that line 1-5, line6, and lines 7-8 form three separate and equal units.  Within the first of these, lines 2 and 3-4 are again to separate and equal units.  Line 5 is not part of whatever 2,3-4 are but is something else that is part of 1-5.

Where that structural analysis is actually correct cannot be determined without knowledge of the meaning, so lets try something more meaningful.

a = [ 1, 2
      3, 4
      5, 6
    ]

Now we can clearly see that the continued lines are all part of what was started on the first line.  The last line is less indented than the others much like line 5 in the first example.  We took that to mean it is not a continuation 2nd and 3rd lines but only of the first, and this example bears that out.  The lack of a comma at the end of the first 3 lines are not lead to any ambiguity.  It is clear from the structure that something is completed.  It cannot be the whole list that is completed as that started on the first line and we are still indented with respect to there, so it must be a "smaller" or more local syntactic unit that has completed.  i.e. the list entry.

Here is a different example using the same structure.

if a = 1 { a =
    b + c;
    d = e;
  }

This looks very wrong.  Breaking the first assignment looks bad, and having the second assignment start with the same indent as the continuation of the first is also wrong.  Yet the list begun, but not finished, of the first line of the first example looks fine.  Maybe it is the amount of indenting that makes a difference.  We can try adjusting that and see:

a = [ 1, 2
   3, 4
   5, 6
  ]
if a = 1 { a =
               b + c;
   d = e;
 }

certainly the amount of indenting helped in the first example, but fixing it doesn't really fix the second example.

hello

30apr2014
 I seem to be stuck here...
 What do I know?

 When I see a newline, I must either IGNORE it or SHIFT it or REDUCE
 The key question is: when to ignore it?

 We ignore newlines when they are protected by an indent.
 What does "protected" mean?  It means there is already an internal indent
 in any symbol which can derive a newline.

 So in the above examples, only newlines are expected at end of
 statement (with or without ';') and after '{'.

 So for all the newlines seen in a = [..] the closest symbol that derives
 newline is 'statement' and it has an internal indent.  So they are ignored.

 In the "if a = 1 { ..."  The first newline is after "+ c;". We are in
 statement (or statementlist) which derives newline and that has internal
 indent so newline is ignored.  But I don't want it ignored.

 How about:
   When we see a newline, we find the closest reducible thing that
   can contain a newline.  If it started at the start of a line and
   contains an internal indent, then we can ignore the newline.

 That has potential: a line-like thing can only be broken if it
 starts at the beginning of a line.

 But can we implement this in an LR grammar?  We don't know what context
 we are in until we reduce into that context.  So how do we determine if
 we are in a line-like thing?

 Each symbol has a 'can_eol' flag.  EOL gets that flag, and  any
 non-terminal which can produce an 'can_eol' symbol gets the flag too.
 So 'statement' has it but 'expression' does not.

 Each state has an 'expect_eol' flag.  If is set if the 'next' symbol of
 any item in the state has can_eol set.
 OR : It is set if any the head of any production in the state has can_eol set.

 Now recall that the state stack is conceptually:
   STATE1 SYM1 STATE2 SYM2 STATE3 SYM3
 Each STATEn is expecting all the following syms to eventually be reduce
 to something which can then advance that state.
 So if STATEn has 'expect_eol', then everything following is part of
 something that maybe could have a newline in it.

 So if STATEk has expect_eol then we ignore newline precisely if
 SYMk was at start-of-line and somewhere from SYMk on there is an
 internal indent.  So we need a new start-of-line flag in the frame.

 Is expect_eol precise enough?  I'm wondering about a production
 that has a newline in the middle.  What happens then?
 After the newline has been shifted the whole won't be reduced so that
 expect_eol state will still be there so we can still be ignoring
 newlines.

 I think we declare that a production which can contain a newline but
 cannot end with a newline is an oddity which we don't support.

 Similarly if any production from a symbol can contain a newline,
 then every production is assumed to be line-like and so require an
 indent for newlines to be ignored.


 "Newlines are ignored only inside line-like symbols which started at
  the beginning of a line and  have since had an indent"

 Possibly good, but a grammar with no newlines would not allow
 newlines.  Maybe that doesn't matter - the lexer can ignore them.
 Can we turn it around  though

 "Newlines are only shifted if the nearest line-like thing does not
  have an internal indent."
 so "They are ignored if there is no linelike thing, or it holds an indent

 Also "An internal indent in a linelike thing that started midline is
  not permitted.


 That didn't work - 'expect_eol' definition doesn't make sense in that way.

 New attempt:
  expect_eol is set if *all* of the *next* symbols in core item
  have cal-eol set.
  When a core item does not have a 'next' symbol we ignore that item.

 Looks good, but now a new issue appears.
 If a production is reduced by an OUT indent, the following NEWLINE
 doesn't have a home.  If expect an optional newline there it might
 use the newline before the outdent (which could have been ignored
 otherwise??). and worse, it can create a reduce-reduce conflict
 because the optional newline could terminate inner or outer statement.

 This is particularly a problem with
    Block -> : Statementlist
    Statement -> if Expression Block else Block

 After an OUT reduces ": Statementlist" to Block there is a NEWLINE
 but we need to see the "else".  If we indent the 'else' it works nicely.
 I want to tell that a NEWLINE is needed there.  That means I must put one
 after "Statement -> if Expression Block" to avoid a reduce/reduce conflict.
 Each.

 After trying to describe this I need to refine it again.
 - A symbol is "can_eol" if any symbol in any production from this symbol
   is followed only by nullable symbols.
 - A state is "starts_line" if the previous symbol in any item "can_eol",)
   or is nullable and previous to that "can_eol", etc.
 - The start state is "starts_line" if the start symbol "can_eol".

19may2019 - in Texas!
I hit a problem.

Because anything that ends at a NEWLINE is considered LineLike, an
Expression is LineLike.  I didn't want that.
So if I have

   print
     a + b
     + c

The NEWLINE after b is not Ignored in the expression,
 only once it has reduced the simplestatement because....

 A NEWLINE is discarded when the tos is NOT newline_permitted.
 A frame gets newline_permitted when the state "starts_line"
  which happens if a starting sym is "line_like", meaning it
   can usefully be followed by a newline.


25may2019
 I've re-read most of my notes about NEWLINE parsing and I'm not sure....
 I originally wanted "linelike" to be anything containing a newline.  I apparently
 found problems and now it is anything followed by a new line.
 But that seems weird, because in
   assignment -> name = expression
   statement -> assignment NEWLINE
 expression is followed by a newline, so is linelike.  I think.

 Can I return to my original idea?
 Any symbol containing a newline is lineline, and newlines shouldn't be ignored.
 If the symbol started since an indent.  If it started before an indent, then it
 doesn't affect newlines.

 During parsing, we don't know which symbol we are working on (yet),
 only which states we have, which can be working on multiple symbols.
 The decision we need to make when we see a newline is which of these to choose:
  - ignore - if states since an indent don't start a line
  - reduce - if there is start-of-line state since start of line.
  - shift  - if we can, and not ignored
  - reduce - if we can
  - error  - if nothing else

 Maybe every item in an itemset must be followed by a newline if any are.
 That cannot work as we always add prefixed of 'next' symbol.


 if expr { NEWLINE
 if expr NEWLINE {

 Should the NEWLINE be ignored?  No!  It isn't indented.
 So if I want NEWLINEs to be allowed, they need to be explicit.
 This makes the state startsline.

 So: we mark can_eol on any symbol that can derive NEWLINE
  we mark a state as starts_line if any item with dot-at-start can derive a can_eol.

 Then if the most recent starts_line is further back than an indent, we ignore newlines.
 So I don't need the lineline flag.

03jun2019
 I'm making progress with newlines...
 Newlines *are* shifted if there is only one symbol since the start of the line..
 The current conundrum is what happens to several blank lines after:
   if cond:
      action

 The next thing could be
   - an else:
   - a new complex statement
   - the end of the block

 The first two are currently handle with newlines before things, so we stack up
 several newlines, then when we see a thing, we reduce the newlines out from under it.
 But that doesn't work for the end of the block.
 A Newlines symbol cannot insert an optional linebreak, but can absorb blank lines.
 So places where a line break is optional, we have
    Open -> {
	 | Newline {
 Where we want to allow blank lines, we have
    Newlines ->
		NEWLINE Newlines

    Then if we see a newline, not at start of line, we shift if it is optional
    When we see a newline at start of line, we shift if blank lines are allowed.
    So if grammar allows
           Newlines Open
    Then .... that doesn't work.  Newlines won't get any lines.
    Maybe we want:
        Open -> {
	     | NEWLINE Newlines {


 I have a problem.
 I want

    else: a := b

 to parse the same as

    else:
          a := b

 and for the last newline to close the elsepart.
 But the latter has 2 newlines while the former only has one
 and I don't have any obvious justification for ignoring either.
 I think it is in the Newline before the OUT that is extra.

 I could drop the newline before the OUT, assuming the newline
 separate things, and the OUT will force any reductions needed.
 But then we have fewer newlines reported than actual.
 (Same imbalance happens with multiline comments and strings, so maybe
  that is OK).
 Another way to look at it is that the newline following an IN is discarded
 (or always ignored) and not moved to after the OUT.
 So (maybe) the newline at an IN or OUT is reported *after* the IN or OUT.
 so
  A
    B
    C
   D

 Would be A IN NL B NL C OUT IN NL D OUT NL

 The parser always ignores the NL after an IN but uses other
 NL to reduce to a single symbol (if possible)
 OR maybe it doesn't ignore (unless not line-like context) and
 lines are preceeded by NL, not followed by them...
 No, followed is usually good.. though separated is better... so preceed!!

 Ok, this isn't working.
 A construct
   if cond:
      pass

 cannot be reduced to a Statement until we know what comes next, and it
 might be separated by several newlines.
 So the newlines need to be part of the Statement.
 But that means we cannot have newlines at the front of a statement.
 But that was the point...

 Maybe a Statementlist is a series of StatementNL followed by a Statement

 We allow
   StatementNL -> Statement Newlines
 as a general catch-all, but when we have something like if, or anything
 with an optional tail "else:" or "case:"
 We say:
   StatementNL -> if Expression Block Newlines
 But that would produce a conflict with
   Statement -> if Expression Block
 As a newline could either trigger a reduce to Statement, or a shift.
 Obviously we shift, but maybe we use precedence to force the point.

 Can we handle 'else if' ...
 IfStatementNL -> if Expression Block Newlines
             | if Expression Block else IfStatementNL

 ... I'm contemplating having the parser duplicate NL as necessary, so
 that
    if test: action
 can appear to be followed by 2 NL, one to terminate the 'action' statement
 and one to terminate the whole 'if'.
 This might mean I need to extend when NL are discarded - to ensure they
 don't get duplicated too much.
 1/ if state does not permit newlines, discard
 2/ else if I can reduce symbols all since start of line do that.
 3/ else if can shift, do that
 4/ else if only one symbol since newline, discard.
 5/ else ERROR

 This means that we cannot recognise multiple newlines
 or does it.
 If we shift a Newline, that is since_newline=0;
 If we reduce that to Newlines, that is still since_newline=0
 If 4/discard only applies when since_newline==1 -- we win.

 Currently since_newline essentially means the symbol contains a newline.
 So 'statements' usually does, but 'statement' doesn't.
 When we shift the newline and reduce, it all becomes since_newline=0.
 That is when we want to ignore newlines.

 15jun2019 - still working this through..

 Normally the parser does
   shift or else reduce or else error
 exceptions are TK_in which is simply recorded and
   TK_out: reduce until there is a TK_in in scope, then cancel, else error
   TK_newline:
       if not newline_permitted (indent since last starts_line state)
                 Discard
       if can Reduce to at most start-of-line, reduce
       if can Shift, duplicate and Shift
       if can Reduce, do so
       if 0 since newline, Discard

  since_newline needs to be changed a bit.
   A TK_newline token *isn't* zero, it is N+1.  The token *after*
   the NEWLINE is zero - so that

 Arg.  I'm struggle with that fact that having shifted a newline,
 we are both at the end of a line, and at the start of the next.
 When I see a newline, I want to reduce until the end of line
 is in the same state as the start of that line.

 Maybe I do want newline to be a separator.
 What if I don't actually include the newline in the grammar, just like in/out.
 Instead we mark select productions as lines.  This is like marking
  for precedence.
 A marked production is reduced when a newline is seen providing it won't
 contain any indents.
 So: if the reducable item in a state is marked, the start gets marked.
 When we see a newline, if the state is marked and the reduce size does not
 exceed since_indent, we reduce.  Otherwise we discard.
 No... I need an error condition too.
 So I need the state to have a starts_line marking, when a new item is marked.

 So:
   productions can be marked $$NEWlINE which flags the production as line-like
   a state with an item with DOT at start of a line-like production is starts_line
   a state with an item with DOT at the end of a line-line product is ends_line
   We track indents as before.
   When we process an indent or newline, we set since_newline to 0
   When we see a newline we do one of:
     if not newline_permitted, we discard
     if top state starts line, we discard
     else reduce or else error

No.....
 A production -> { statements }
 needs to ignore newlines either side of statements.
 It is a multi-line production - newlines don't matter.
 Maybe there are several sorts of symbols:
  - in-line:  must be broken across lines unless indented
  - line-like: is terminated (reduced) by a newline
  - multi-line: newlines are ignored

 We tag symbols which are line-like.
   Any symbol which can derive a line-like symbol is multi-line
   Any other symbol is in-line.

 So SimpleStatements, ifhead, elsepart, casepart etc are linelike

$line SimpleStatements IfHead ....

 A state that is at the start of a linelike symbol starts_line
 Any state in a multi-line production starts_line

 if tos starts_line, newlines are ignored.
 else if there is an indent since the starts_line, newlines are ignored.
 But if there are symbols since starts_line, we have to reduce until are
 are in a starts_line start (or can see an indent).

No....
    Block -> : Statementlist
is none of thse.  It must be reduced by a newline, but isn't entirely line-like.
    Block -> { Statementlist }
is multi-line

but maybe Block here is neither.  It only becomes linelike when it
terminates a Statement, which is linelike.  Or terminates an ifHead

Should this be legal?
  a:=b;pass if something
probably not. I want to require at least ';' or NEWLINE.
That means I need to include NEWLINE in the grammar.
 Statements -> SimpleStatements ; Statement
      | Statements  Statement
      | Statement

 if cond: if cond: a:=b
NEWLINE reduces this down to IfHead
IfHead -> IfHead NEWLINE
Statement -> IfHead
   | IfHead ElsePart

ElsePart -> else BLOCK
  | else IfHead
  | else IfHead ElsePart
  | ElsePart NEWLINE


Statements -> Statements Statement
       | Statement


if cond { statments } else { statements}
but not
if cond: statements else: statements

so :statements must expect a NEWLINE but then
  if cond: if cond: statement
expects 2 NEWLINEs.
Maybe
   Block -> : IN statments NEWLINE
if there is no indent, we synth one which triggers an OUT NEWLINE pair.
This could be automatic.
 If a linelike is followed by a newline, we synthesis an IN before it.

That requires a hack to the scanner: Synth Indent

What if

 Block -> { Statements }
   | : Statements NEWLINE
   | : SimpleStatements

Then
   if Expr Block

might not end in a NEWLINE so else could come immediately. Is that OK?
   if expr: a:=0 if expr
must be forbidden.  That requires a newline.

 Block -> : IN statements OUT NEWLINE

In marks state as 'needs_indent'  If an indent arrives, fine.
If something else, we record that next newline (After balanced in/out)
must synth extra out/newline.

 Block -> { Statements }
     | : statementblock

 statementblock -> Statements $$line

$$line means it must be reduced by a newline.  If something else tries,
it is an error and we skip to newline.
It also strips everything but NEWLINE from the (effective) lookahead
to avoid reporting conficts, as those things will never be shifted.

 IfHead -> if Expr Block
    | IFHead NEWLINE
 Block -> { statements }
     | : statements $$line

 IfStatement -> IfHead
      | IfHead IfTail
      | IfHead else IfStatement
 IfTail -> else Block
      | IfTail NEWLINE


------
20jun2019  Happy 87th birthday Dad.

I'm not convinced about $$NEWLINE

   else: simplestatementlist

should be able to parse simplestatementlist without a newline, and
use the newline to close the if/else.
where as
   else:
      statementlist

has a newline to close the statementlist and another to close the if/else.
But can the LR parser tell the difference?
It only sees that newlines don't forcibly reduce the else:
So when it sees the newline at the end of simplestatementlist,
it cannto shift because there is a sub-line thing that can be reduced.
So this becomes elsepart before the newline is absorbed.
Whereas in statementlist, the newline can be shifted creating simplestatementline.

What about
   if cond: if cond: statement

Again, the newline cannot be shifted while we can reduce

But.... how does conflict analysis know that an 'if', for example, is not
permitted after simplestatementlist?

Ahh.. This is exactly what $$NEWLINE is for. Maybe it should be $$OUT.
Either way, the grammar is ambiguous and relies on newlines or indentation to
close the production, and this fact needs to be explicit.
Requiring OUT is probably best as it means
  if cond:
       statements
    else:
       statements

works even though there is no newline after the first statements.
Here I want the 'statements' to be closed by OUT, but the whole to
be closed by NEWLINE.
So maybe I need both $$NEWLINE and $$OUT ??


$$OUT makes lots of sense.  It is exactly how we expect :statements to
be closed - where we allow NEWLINE to have the same effect.

$$NEWLINE is good for closing an complex if or for etc.  It means that
nothing else can be on the same line - allowing for indents.
How do we implement that?
Any production in the grammar that represents a full line but doesn't
end with a newline should be marked $$NEWLINE
This head of that production should recursively absorb NEWLINEs.

I'm not yet clear on exactly the difference bwetween $$OUT and $$NEWLINE.
I would put $$OUT after
   block -> : Statements $$OUT
and $$NEWLINE at the end of a statement that must end a line
   condstatement -> Ifpart IfSuffix $$NEWLINE
         | constatement NEWLINE

Maybe I need a worked example.
   while conda:
       if condb: if condc: action
   pass

So after action there is NL OUT NL pass
The NL sees that it can reduce, and the if allows the NL to reduce it so
      while COND : IN if COND : statement   [ NL OUT NL ]
again the NL can reduce.  Note that we *don'* absorb the NL in the statement
      while COND : in statement [ NL OUT NL ]
Now we can shift the NL
      while COND : IN statment [ OUT NL ]
Now the OUT forces a reduction
      while COND BLOCK(in)  [ OUT NL ]
Now the out is cancelled
      while COND BLOCK [NL]
and the while is reduced.

So the $$NEWLINE must always see a newline (or $$EOF)
An $$OUT must see an OUT or a NEWLINE (if there was no IN)

$$OUT causes the LA set for items with the production to be empty.
It is never credible that anything will be shifted so any apparent LA
contents can be ignored.
The state when a $$OUT is reducible has a recedence higher than any terminal, so
nothing can be shifted and no completion should be possible.
The state when a $$NEWLINE is reducible is much the same.

Maybe I don't want NEWLINE in the grammar, only $$NEWLINE??
How would we recognize a blank line?
  command -> $$NEWLINE
??
We would need a new rule for discarding newlines.
e.g. when the top-but-one state is start-of-line we discard and mark the top
state s-o-l.  That stops us discarding a newline until it reduces something that is at the start of a line....

1/ if there is an indent since the last start-of-line state, discard NEWLINEs
2/ if ....

Q: When is a NEWLINE an error?
A: when it isn't ignored and we cannot reduce and
   top or  top-but-one state isn't starts_line.??

So we need extra state info and extra frame info.

State has:
  - starts-line - is at start of a $$NEWLINE production
  - ends-line - is at end of unreduced $NEWLINE production
  - ends-indent - is at the end of a $$OUT production
  - min-prefix - how far back a 'in' can be and still cancel

Frame has:
  - indents - count in or after that sym
  - line_start - is the was a line start (IN or NEWLINE) immediately after
      the symbol
  - newline_permitted: no indent since start-line
  - since_indent: number of frames where indents==0
  - since_newline: number of frames where line_start==0

If we see a NEWLINE then:
  if ! newline_permitted, discard
  elseif can reduce and reduce_count <= since_newline - reduce
  elseif since_newline <= 1, and state.starts-line, discard and record line_start
  else error

If we see an IN
  increment indents, set line_start

If we see an OUT
  if reduce_size <= since_indent, reduce
  if min_prefix >= since_indent, cancel
  else error

How does error handling work?
Normally we pop states until we can shift ERROR
Then we discard tokens until we can shift one.

However we need to do something different for IN OUT NEWLINE.
For IN, we simply increment a counter
For OUT we decrement if it is positive.
   If it is zero and the state ends-indent, then we are synced.
   If it doesn't, we need to pop more states until we have an indent to cancel.
For NEWLINE if the state ends-line or ends-indent and ...something... we are synced.
   else we skip it??

... no, that doesn't work because I cannot see a way to describe an optional newline.

Let's try with just $$OUT which requires OUT or NEWLINE...
We put $$OUT on productions that must be closed in a 2-d obvious way.
So they can be at the end of a line or at the end of an indente block.
So
   : statements $OUT
means the next line after the : cannot be indented.
However
  Block -> : statementline | : statementblock | { statements }
  statementblock -> statements $NEWLINE
means I can have else indented, or on same line as single statement

  if cond: a = b; else: b = a
  if cond:
        a = b
     else:
        b = a

The whole 'if' needs a $NEWLINE marking to ensure a following statement isn't
indented.
So implementation is almost exactly what I have:
 - if anything else is lookahead when reducing that production, it is an error.
 - remove non-newlines from lookahead in items

But I don't think $$OUT is quite what I want to call it.
That doesn't quote cover end-of-line possibilities.
Maybe allow $$NEWLINE or $$OUT but with same behaviour.

.... still not there.
Another way to satisfy a $$OUT reduction is for it to already look right.
So: No indents and at start-of-line

But that upsets the modification to look-ahead as we can no longer assume
the next token.
I think this might be more like a precedence thing??
Without look-ahead modification, the first token of a statement can be shifted
 before a newline forces a reduction....

Maybe I do need two sorts productions.
 $$OUT requires an out/newline to reduce it.
 $$NEWLINE follows either $$OUT or NEWLINE and requires start-of-line and no indents.
        or $$OUT or $$NEWLINE

Or does it matter. Over-modification of the look-ahead suppresses warnings, but
doesn't affect the parse.
Will we get warnings anyway?


--------
Are left-recursive symbols in a non-final position always bad?

Left-recursive symbols cannot be closed by forcing a reduction.
So if one starts in an indented region (in which newlines are ignored)
it could continue afterwards - unless we make that an explicit error somehow.
If they appear at the end of some other production, that one will (maybe)
be reduced as well so (maybe) no problem...

if cond:
   a()
   b()
 c()

is weird and I want to forbid it.  Al that is between b() and c() is
NL OUT IN.  NL closed b(), so it is just OUT IN
So I do want the statementlist to close.

 a :=
     1 + 2
   * 3 + 4

is very wrong.  How much can I help?
The OUT will reduce "1 + 2" which will then become
    ((1 + 2) * 3) + 4
which would be highly confusing.
So something about this must be disallowed.
Maybe when newlines are ignored, OUT doesn't force a reduce??
I can make it an error by having Expression reduce to something else.

Do I want an error even for
    1 + 2
    * 3 + 4
??

I could achieve that by adding extra checks when we SHIFT at
start of line.
If we could reduce tokens since previous SOL, then we have 2D ambiguity.

   1 + 2 *
   3 + 4

That is just as ambiguous, but we cannot reduce anything.
When we see the second '+', the reduction crosses a line-start but doesn't
result in a line-start.

So: a reduce that doesn't contain an indent, but does contain a start-of-line
must reduce to that start of a line.

This means we need to keep the start-of-line when we "IGNORE" a newline.

Can I use this sort of logic to avoid the need for the extra reduction,
or for the $$OUT markings??

1/ The point of extra reduction is to avoid consuming more after an OUT or
   ignored NL.
    if cond:
         a()
         b()
       c()

    must be an error.  The OUT reduced a()b().  The stack is then
        if cond : statements(n1) . IN Ident
    The first indent is gone. . There is no error until we see all of c() so
     if cond : statements(n1) simplestatement(i) . NL OUT NL

    Is it  problen that the simplestatement is indented?
     if cond : statements(n1) statement(i) . OUT NL

   Q: is it an error to reduce a sequence containing an (uncancelled) indent?

2/ The $$OUT markings guard against exactly a reduction containing an uncancelled IN.

So maybe I have two new rules.

 1/ a reduction must not include any uncancelled indent.  pop() must return 0.
 2/ a reduction the contains an unindented start-of-line must begin with start-of-line.
     So when we cancel an indent, we also cancel line starts since there.

One other value of $$OUT is that is avoided conflicts - most symbols could not
be shifted.  That should have only applied to $$NEWLINE(!) and doesn't apply
at all if I drop the marking and use internal rules instead.
So how do I avoid reporting conflicts?

Really, there shouldn't be any conflict as NEWLINE should be expected.
Let's go back to that idea.

1/ A linelike thing MAY start with Newlines and MUST end with a NEWLINE
2/ A SimpleStatement is not linelike and doesn't include and Newlines
3/    if condition : SimpleStatement
   is a SimpleStatement.

4/ When we see the NEWLINE after "if condition : SimpleStatement" we have a shift/reduce
   conflict as we could SHIFT to make a complex statement, or reduce the whole thing
   to a SimpleStatement.
   Default action is SHIFT but in this case we want REDUCE - due to precedence?

   However when we see the NEWLINE after "if condition :IN Simplestatement"
   we cannot REDUCE as there is an cancelled indent, so we have to shift.

   But when we reduce, we only want to Reduce to IfHead so that an 'else' can appear
   on the next line.

   If we see IN .. just continue.

What do I need to do:

 1/ Change grammer to expect blank-lines before and to have a NEWLINE at the end
    of any line-like thing.
    This requires IfHeadNL and IfHead.  ditto for switch, while, then ...
    This get complex with
      for a:=0; then a += 1; while a < 10:
    which could have several newlines
      for a:=0
      then a+=1 ; while a < 10:

    ForPart -> for simplestatements ;
            | for simplestatements NEWLINE
            | for Block
            | Newlines ForPart

    ThenPart -> then SimpleStatements ;
            | then SimpleStatements NEWLINE
            | then BLOCK
            | Newlines ThenPart

 2/ disallow Reduce when embedded indents - report ERROR
 3/ disallow Reduce when embedded start-of-line.
 4/ TK_newline uses these rules to decide when to force a reduce.

A/ A parser symbol that starts after an IN must end before the OUT
B/ A parser symbol that starts before an IN must end at-or-after the OUT
    only if if the symbol is not line-like ???

C/ A parser symbol that starts after a line-start and before an indent must end
    by the end of line
D/ A parser symbol that starts at a line-start must end before the end-of-line,
    or at a subsequent end-of-line.

A is satisfied by forcing a reduce on OUT and reporting error if IN cannot be cancelled
B is satisfied if we report an error if we try to reduce an uncancelled IN
C is satisfied by forcing a reduce *after* shifting NL and reporting ERROR if
   min_prefix exceeds the line
D is satisfied if we report an error when reducing at eol crosses a NL and doesn't start
   at start-of-line.

C is interesting - do we reduce *after* shifting NL??  I think we do, yes.

So: when can I suppress conflicts, and how do I handle reduce/reduce conflicts?

I need to be sure that a line-like ends with an unindented newline.
I can trigger an error when that doesn't happen, but I want more.
I want to encourage it to happen.  So if the grammar allows a NEWLINE it
will be shifted in, but if we have already seen an OUT, we ignore the NEWLINE
rather than trigger an error.
Also, I need to not report shift/reduce conflicts on whatever comes next.
i.e. if
   a -> b c
and c can end with a, then both c and a can be followed by the same things.
This is a conflict.  If c (and a) end with NEWLINE we declare the conflict
resolved.

An IfHead might not end with a NEWLINE.  So to make a statement we need
to follow it with an optional NEWLINE.  Let's see if we can make that work.

What is a SimpleCondStatement?  It has no blank lines or unindented breaks..

  ForPart ThenPart WhilePart CondSuffix OptNL

We don't need a distinct Simple class!! If it didn't start at SOL,
then unindented NEWLINEs must be terminal  Wooho!!

This requires that "Statements" doesn't insist on following NEWLINE

A SimpleStatement can be followed by a ';', a Statement cannot.  That
different is still needed.
So
   SimpleStatements -> SimpleStatement | SimpleStatements ; SimpleStatement

SimpleStatements can end with a ComplexStatement and no NEWLINE.
ComplexStatements must end with a NEWLINE after each statement except the last

Each Part (including SSlist) end with arbitrary NEWLINEs.  These will
only ever be at the same indent level.
A ComplexStatement must be separated from next by a NEWLINE.
So if the final non-empty Part does not end with NEWLINEs, how do we require one?
Maybe not..

What if a Part doesn't end with NEWLINE ever, but can start with them

CondStatement -> IfPart Newlines
  | IfPart IfTail...

I think I need a CondStatement which doesn't end with a newline and a
CondStatementNL which does.  Then anything that can end a cond statement
must come in two versions.
 IfPart ElsePart CasePart WhilePart  CondSuffix

If we expect the non-NL, we accept the NL but not vice-versa.

------
problem.
in
   if cond:
       cmd1
       cmd2

the 'if' that started before the indent must finish at/after the indent.
But in
   if a = b or
      c = d :
          do something
The Expr that started before the first indent may finish well before the indent finishes.
I think this is because Expr is not linelike but 'if' is.

So I don't want an error when reducing if there is an indent, unless the new top start
starts_line
...

OK, I'm up to the part where I need to hide conflicts that I can automatically resolve.
I have:

  State 7 has 2 (or more) reducible items
    IfHead -> IfHeadNL . [25]
    IfStatementNL -> IfHeadNL . [27] (2Right)
  State 35 has 2 (or more) reducible items
    IfTailNL -> else IfStatementNL . [32] (2Right)
    IfStatement -> IfStatementNL . [30]

I need to clarify the rules that I'm working with.

1/ Statements might not end with a NL but as it is linelike...

2/ IfHeadNL IfStatementNL IfTailNL all end with a NEWLINE
   IfHead IfStatement IfTail might not (but they may)

Why did I want IfHead -> IfHeadNL??
Because I might have
   if cond:
        action


     else: bar

No, that is still an IfStatementNL.  Once there is any NL, we cannot fit it
on a line.

Hmm... the FooNL pattern is getting out of control.
Why do I need this again?

Because when "if cond : statements" is followed by a NEWLINE I need to
hang on to the parser state - not reducing to statement - until I see an 'else' or don't.

If I do see the else, the difference doesn't matter.  If I don't then I need
to know if I have a NEWLINE.

So I could have
  IfHead -> if Expression : Statements
  IfHeadNL -> IfHead Newlines

  IfElse -> IfHead else
          | IfHeadNL else

  Statement -> IfHeadNL

 But I want "if Expr then SimpleStatements"
 where SimpleStatements can end with "IfHead"
So when I see:

 if foo : if bar : baz NEWLINE

That NEWLINE mustn't turn "if bar: baz" into an IfHeadNL
  We need to first turn "if bar: baz" into a SimpleStatement, then
  "if foo : SimpleStatement" into an "IfHead", but NOT into a SimpleStatement.

Arg.  I might not know to reduce something until I've seen an IN.  It is the
'else'

What if an Statement *always* ends with a newline.
So
   if Expr : Statement
also ends with a newline and can be a statement
But if there is an IN after ':' the newline is hidden.
So that doesn't work.

What if a NEWLINE absolutely has to be at the top level.
If a symbol contains a NEWLINE, then it must be at the start of a line,
possibly indented.
So if it isn't indented, it mustn't contain a NEWLINE - no NEWLINE will get shifted in.

 Statement -> if Expr : Statements
            | SimpleStatements

What does Statements look like?  It must end a NEWLINE
  Statements -> Statements Statement NEWLINE
	| Statement NEWLINE
	| Statements NEWLINE
	| SimpleStatements

  SimpleStatements -> SimplePrefix ; Statement
  SimplePrefix -> SimpleStatement
	| SimplePrefix ; SimpleStatement


01July2019
 I think I have a very different approach - it incorporates a lot of
 the ideas so far and is maybe better.

 From the top:

   We have a simplified SLR(1) grammar where each state has at most one
   reducible production.  We don't have an action table, but use the goto table
   to decide if a terminal can be shifted.  If it can, we do.  If it cannot,
   we reduce or trigger an error.

   Onto this we add handling for IN/OUT and NL.  NL can appear int the grammar,
   IN/OUT cannot.

   Any non-terminal which can derive a NL is deemed "line-like".
   Such non-terminals will normally appear at the start of a line - possibly indented.
   These non-terminals can have some productions that have a NL (usually at the end)
   and some that contain no NL.
   If a non-terminal appears other than at the start of a line then no NL will ever
   be shifted into it, so a production without NL will be used.
   If it does appear at the start of a line, then any production can be used, though
   it must end up ending in a NL.

   The above paints an incorrect picture of how LR parsing works.  At any given
   time you don't know what non-terminal is being matched, so we cannot exclude
   NL based on the non-terminal. We only know what parser state(s) we are in.
   So: any parser state which is at the start of a line-like non-terminal is
   flagged as "starts_line".
   Also, in each start we store a "min prefix" which is the minimum non-zero number of
   symbols before "dot" in any item in the set.  This given a sense of where we are
   in the parse.

   If min_prefix if the top state is less than the number of symbols since start-of-line,
   then we will not SHIFT a newline.

   Indents (IN) are recorded after the symbol they follow.  If there is an IN
   since the most recent starts_line state, the any NL is ignored.
   An OUT will cancel the most recent IN, providing is in the top min_prefix symbols.
   If not, we need to reduce something first.

   So when we see an OUT, we reduce until we can cancel.
   When we see a NL, we reduce until the min_prefix reaches at least to the
   start of the line.  Then we can shift the NL.
   After shifting the NL, the whole line should be reduced.

   When a line-like non-terminal produces a sequence that *doesn't*
   end with an explicit NEWLINE, the grammar analysis ensures that
   nothing can be shifted in after the end of the production.  This
   forces it to be reduced into the non-terminal.

   For example
      simplestatement -> var = expr | print expr
      simplestatements -> simplestatement | simplestatements simplestatement
      SSline -> simplestatements NEWLINE | simplestatements ; condstatement NEWLINE

      statement ->..
             | simplestatements NEWLINE
             | simplestatements ; statement
             | if expr then statements NEWLINE
             | if expr then statements Newlines else statements NEWLINE

   When a non-terminal is explicitly followed by a NEWLINE, it is line-like
   also if it contains a NEWLINE or linelike, it is linelike.


 ifhead -> if expr : statements
   | ifhead NEWLINE

 iftail -> else : statements
   | else ifstatement

 ifstatement -> ifhead
   | ifhead iftail

 statement -> simplelist | ifstatement | Newlines simplelist | Newlines ifstatement

 statements -> statement | statements statement

 A line-like that contains newlines must be reduced by OUT or NEWLINE.

 How can I know that a statements can be followed by else in
      if cond : statements else: statements
   or
      IfHead IfTail
 bit not by 'if' in
      statements statement

 Maybe I could have a $sol token.
  If at $sol, and cannot shift then try shifting $sol..
  then
    statements -> statements $sol statement
  or maybe $eol is better, then we can have NEWLINEs start start of statement.
  OR maybe either... $linebreak is shifted if previous or next is NEWLINE
  An IN doesn't allow a LINEBREAK.

Just to repeat myself:
  Arg.  I might not know to reduce something until I've seen an IN.  It is the
 'else'

After above, new approach:  IfHead and Statement have NL versions, nothing else does.
MAybe fixed the above...

So

  StatementNL -> Statement NEWLINE | IfHeadNL
  Statements -> StatmentNL | Statements StatementNL
  StatementList -> Statement | Statements

  Block -> : Statements | Open Statements Close

  IfStatement -> IfHead | IfHead IfTail | IfHeadNL IfTail
  IfHead -> if Expr Block
  IfHeadNL -> IfHead NEWLINE | IfHeadNL NEWLINE
  IfTail -> else Block | else IfStatement

Close, but the IfHeadNL in "else IfStatement" cannot accept newlines.
What if
   IfHead -> if Expr Block | IfHeadNL else IfHead | IfHead else IfHead

No, I think I have to take a totally different approach.

IfPart elsepart switchpart whilepart etc are all syntactically valid
as stand-alone statements in the base grammar.
We use the code to fail a stand-alone elsepart the isn't preceeded by an ifpart, whilepart or casepart.

So statements never contain newlines, only the statement-list does.
If puts a NEWLINE at the end of each statement
 statementlist -> statement | statementlist NEWLINE statement
statement can be empty string, thus allowing blank lines and a NEWLINE at the end,
which the parser will require.


[[ thought experiment - interestibg, but gets unwieldy with
   more complex statements
statements -> simpleline NL statements
 | ifhead NL iftail NL statements
 | ifhead NL statements
 | ifstatement NL statements

iftail -> else block
 | else ifhead iftail
 | NL iftail

ifhead -> Newlines if expr block
ifstatement -> ifhead iftail
]]