ocean-lang.org Git - ocean-D/blob - twod

   1 How to write up two-dimensional parsing?
   2
   3 What does the code do?
   4
   5 - If an indent is ignored the matching undent is ignored
   6   except to decrement the 'ignored' stack and to turn
   7   a pending indent into a pending newline.
   8   or to count the pending undents.
   9
  10 - If a newline is found and cannot be shifted yet then we check
  11   each production that would close before we shift, and if any still
  12   have pending undents, we ignore the newline
  13
  14 - when we reduce a production, we 'resolve' indents wich may leave
  15   something pending. Either an indent or a number of undents?
  16
  17 - productions can be recursive if first body symbol matches head.
  18   A recursive production may be horizontal if it starts in the middle
  19   of a line. non-recursive productions are always horizontal.
  20   A horizontal production must be all on one line or have its body
  21   indented.
  22   A vertical production must not be indented.
  23
  24
  25 Maybe I don't want to mention indents in the grammar ?
  26
  27  if CONDITION : STATEMENTLIST
  28
  29 doesn't have to be indented but we will only reduce it when we
  30 see something that doesn't belong.... which could be an UNDENT
  31
  32 So: when look-ahead is UNDENT we cannot shift so we reduce until
  33 the top state has an indent attached, at which point we consume the
  34 UNDENT.
  35
  36  if hello :
  37     foo
  38     bar
  39     bar
  40  UNDENT
  41
  42 the undent will reduce until we just have the ifstatement which
  43 contains an indent.
  44 So any indent is pushed onto the stack and a reduction sums all the
  45 indents.
  46
  47 What about newlines?  A newline will reduce any production that
  48 doesn't contain an indent.  It gets discard when top of stack
  49 was at start of line...
  50
  51 So each 'state' also encodes
  52    start-of-line or middle-of-line
  53  and
  54    count of indents.
  55
  56 When an INDENT is seen the current state increments count of indents.
  57 When an UNDENT is seen we must reduce until tos state has an indent,
  58 which we decrement.
  59 When a NEWLINE is seen, we reduce until tos state has indents or was
  60 at start-of-line.
  61
  62 nononono
  63 UNDENT shouldn't force a reduce.  Only newline should do that.  But as
  64 an undent is always followed by a newline, it still results in a
  65 reduce happening where I want it.
  66
  67 So an undent shifts in line an indent(?).  Reduction nets the indents.
  68 A newline forces the reduction until we find excess indents.
  69
  70 But what does grammar look like?
  71  statement -> assignment ';'
  72            | assignment
  73 So a newline will force the reduction to 'statement', but this will
  74 allow
  75     a = b c = d
  76 which looks bad.  So:
  77  statement -> assignment ';'
  78            | assignment NEWLINE
  79 Then the NEWLINE can be shifted normally.
  80 So newline forces reduction until it can be shifted or until ToS
  81 contains an indent or started on a new line.
  82 If we hit an irreducible state that does not accept NEWLINE when we
  83 have to shift in an ERROR instead which might then be reducible.
  84 When popping states we need to collect the indents.
  85
  86
  87 Where do we attach the indent/undent info?  To the state or to the
  88 symbol?  They are kept together but mean different things.
  89 indent/undent are between terminals, just like states.  But they
  90 can be inside non-terminals which make them unlike states.
  91
  92 What "is" a state?  It is a point between two symbols in a sentential
  93 form.  It doesn't carry information.  So I think the indent has to go
  94 with a symbol.
  95 An 'indent' should attach to the symbol *before* it as that is the
  96 thing that is being continued.  An 'undent' should also attach before
  97 itself.  Eventually those two terminals become included in a single
  98 non-terminal and they collapse.
  99 You can never have more undents than indents as a newline will force
 100 reductions until top symbol has no undents.
 101 In fact we reduce until the top symbol has a new indent, or is a
 102 vertical production, or is explicitly allowed.
 103
 104 If we see an INDENT, we attach to symbol at ToS
 105 If we see an UNDENT, we attach to symbol at ToS (possibly canceling indent)
 106 If we see a Newline and current ToS holds +ve indent,
 107    we ignore it
 108 If look-ahead is expected, we shift it.
 109 If we can reduce, we do so.
 110 Else we trigger error handling:
 111  pop states until one accepted ERROR - collecting indent
 112  shift ERROR with those indents
 113  Discard input, but process indent/undent, until next symbol is
 114  expected.
 115  continue.
 116
 117
 118 Does this work?  If I have
 119   a = b +
 120       c +
 121       d
 122
 123 Then the NEWLINE after "c +" cannot be handled, but should not be an
 124 error.
 125 Stack would hold
 126    var = expr +
 127 where 'expr' contains an indent.
 128 Maybe NEWLINE is ignored unless ToS has negative indent.
 129   a = b +
 130       c +
 131     d
 132 That must be an error.
 133
 134 Problems:
 135  1/ accepting NL must depend on what it will reduce.
 136  2/ If UNDENT/INDENT appear together they must be kept separate.
 137
 138 So maybe UNDENT does block shifts and so forces reduction.  If not
 139 reduction can be made, that is an error
 140
 141 if (a) {
 142   b;
 143 }
 144
 145 The NL UD NL before '}':
 146   NL will ensure "b;" is reduced.
 147   UD will reduce to command-list so we will have
 148
 149   if ( expr ) { commandlist
 150 where '{' holds the indent.  So '}' needs to bring in the undent,
 151 which is after another NL.
 152
 153 So Undent waits for a symbol to collect it when shifted.
 154
 155 What when I have:
 156    if (a) {
 157        foo;
 158    while (b)
 159        bar;
 160
 161 When we try to reduce "foo;while(b)bar;" to "statementlist" the
 162 negative indent triggers an error.  This a bit late though.  I want it
 163 when I see the 'while'.  I want an error there which can be reduced a
 164 closed statement.
 165 The undent should close the statementlist so that '}' is the only
 166 shiftable symbol.  But after 'statementlist', 'while' is still
 167 permitted.
 168
 169 If we just waited for the 'while' statement to reduce and then noticed
 170 the error, we could still report it against the 'while' token, but
 171 recover would be a lot harder as we have already parsed the 'while'
 172 into the 'wrong' place.
 173
 174 So when I see that token after NL,UD,NL, I need to be able to say
 175 'yes' or 'no'.
 176 Could require help from grammar.
 177 e.g. if a token is allowed to have a lesser indent than previous
 178 token, then there must be a literal UNDENT token.
 179 Or we could require:
 180
 181  Block -> BlockA }
 182  BlockA -> { Statementlist
 183
 184 That looks ugly, but the undent would close BlockA and allow the } to
 185 be expected.
 186 Or maybe
 187
 188  Block -> { Statementlist OptNewline }
 189
 190 so the newline after the indent is shifted and doesn't
 191 reduce-or-error.
 192
 193 Then the UNDENT would ensure that the Statementlist were closed, the
 194 Newline would be shifted, only a '}' would be allowed.
 195
 196 So: each symbol in a production must be in the same line as the
 197 previous, or in an indented line.
 198
 199 No.  OptNewline there wouldn't work as a blank line would take us to
 200 expecting a '}'.
 201
 202 I think I need to go back to what works... only it doesn't?
 203
 204
 205 Indent is attached to previous symbol. I think that is easier that
 206 keeping it at the 'start' of the 'next' symbol as I was doing.
 207
 208 Newline closes any productions that don't include an indent, then can
 209 be shifted if wanted, or ignored.
 210
 211 Undent reduces things until ToS has an indent that can be canceled.
 212 This means that if part of a production can have a negative relative
 213 indent, it must be separate:
 214 Block -> Blockhead }
 215 Blockhead -> { Statementlist
 216
 217 So the Undent before the '}' will reduce Blockhead and can then be
 218 discarded.
 219 We would add
 220 Block -> Blockhead ERROR
 221
 222 It would be nice if that can be auto-handled. e.g.
 223
 224 Block -> { Statementlist $undent }
 225
 226
 227 leads to the above.
 228 That would be messy for the reduce action though.
 229 Could each state "know"
 230
 231 --------------
 232 A state can have different items with different offsets, so how much
 233 of the stack is part of the 'current' production is not well defined.
 234 It is not completely random either.
 235
 236 We can ignore non-kernel items as those are just potential productions
 237 - they haven't started yet.
 238 Of the remainder, many states have only one length.
 239 Many others will have some offset and '1' for recursive
 240  productions: A -> A . X
 241 These can be called "potential" as well as we have no commitment.
 242 What is left?
 243   Funcall -> Factor . ( ExprList )
 244   Term -> ~ Factor .
 245
 246   Funcall -> Factor . ( ExprList )
 247   Term -> Term * Factor .
 248 etc
 249
 250 i.e. just the bits that make the grammar distinctively LR.
 251 In my case one is always reducible, is that important?
 252
 253 So when we see an Undent we find the max-index of the current state
 254 and determine if there is a matching indent within that distance.
 255 If there isn't then we must reduce or trigger an error.
 256
 257 Alternately:  Each 'goto' from this state has a max-history.
 258 We exclude any that don't have enough history to include balanced
 259 indents.
 260 If the next symbol would trigger one of those, it is an error.
 261
 262 So:
 263  For early error detection we record Indents against the previous
 264  symbol, ignore newlines, and when we see an undent the next shift
 265  must lead from a core item with enough history to keep balance.
 266
 267 If that is well defined it might catch early undents, but what about
 268 unneeded indents?
 269    a = b + c;
 270        c = d + e;
 271
 272 I expect that is less likely, but still an error.
 273 When we see the first ';' we will reduce to a statement and then a
 274 statementlist.  The second 'c' would lead us from
 275   statementlist -> statementlist . statement
 276 which is recursive and so doesn't allow indents (unless it started in
 277 the middle of a line).
 278
 279 Intends are allowed anywhere except after the first symbol of a
 280 left-recursive production which started at beginning-of-line.
 281
 282 a = b +
 283     c
 284       * d;
 285
 286 is not allowed as 'Term->Term*Factor' is left-recursive.  But why not?
 287
 288 So maybe an indent has to be allowed anywhere.  Maybe a warning but
 289 there is no real recovery option.
 290
 291 An undent is allowed in a state with a history back to the matching
 292 indent.
 293 If we see an undent, we check if the state can consume it.  If not we
 294 look to the next symbol which is not an undent (or newline) and reduce
 295 until that symbol can be shifted.  At this point the undent must have
 296 been consumed.
 297
 298 Or easier:  We must reduce to the matching indent before we shift,
 299 unless a NEWLINE is explicitly allowed...
 300
 301 Take 1:
 302  An indent is between two symbols, or within one symbol.
 303  We hold a pending indent which gets attached to the front of the next
 304  symbol that we shift.
 305  Undents and newlines collect until we see a "real" symbol.
 306  Any reductions that are then indicated happen and if the ToS symbol
 307  is found to hold an indent, that cancels a pending undent.
 308  Once we get to a state that can shift the lookahead we examine the
 309  number of pending undents.  It should be zero. i.e. we shouldn't
 310  still be in the middle of a non-terminal that started at a greater
 311  level of indent.
 312
 313  If it isn't zero we trigger an error because a close-bracket or
 314  similar is missing.  This might be different to "unrecognised-error"
 315  The error should lead to reductions which cancel out the remaining
 316  undent.
 317  I think the parser needs to incorporate error handling into states.
 318  Each state knows what symbol to synthesis to get closer to a
 319  reduction.
 320  i.e. on 'no close' error, we reduce what we can, and synthesis when
 321  we cannot.
 322
 323 So that seems to handle early undent.
 324 We also have 'unexpected indent', 'missing indent' and 'late undent'.
 325
 326
 327 A 'missing indent' happens when a production that we expect to be
 328 one-line sees a newline rather than an indent.
 329 i.e.
 330      a = b
 331      + c;
 332
 333 The + after the newline only reduces 'b' to Expression which starts
 334 in the middle of the line.  A newline should see a reduction to a
 335 nonterm which started at the beginning of a line, or which contains an
 336 indent.
 337
 338 We trigger a 'no close' error here so that '+' is the start of a new
 339 statement.  Maybe if the next symbol is shiftable, we just report an
 340 error and allow parsing to continue;
 341
 342 An 'unexpected indent' happens when the fully reduced non-terminal
 343 started at the beginning of a line and is a recursive non-terminal.
 344 Or maybe we are in a recursive state.  This triggers an error
 345
 346 A 'late undent' is like an unexpected indent.  The previous
 347 fully-reduced non-terminal is recursive, started at the beginning of a
 348 line, and still contains an indent.
 349
 350 ------------------
 351 Each symbol can carry a 'start-of-line' flag and an 'indent' count.
 352 Each state may be flagged as 'recursive' if "A -> A . etc", and
 353 records the symbol to synthesise to resolve a 'missing' error.
 354
 355 Current state includes an undent count, a nothing/newline/indent flag.
 356 When look ahead contains undent, we increment that count
 357 When look ahead contains newline we set newline flag
 358 When look ahead contains indent, we set flag to indent.
 359 When we reduce we combine indent counts in symbols and keep
 360 start-of-line from first symbol.
 361 If the ToS after a reduce contains indents and current state contain
 362 undents, they are reduce so that one becomes zero (min is subtracted
 363 from both).
 364 Before we shift we must test:
 365 - if undents are non-zero, we synthesis the symbol expected by the
 366   state, we keep synthesising and reducing until the undent count
 367   reaches zero.  We report that the synthesised symbols were missing.
 368
 369 - If newline flag is set, the ToS symbol should start at SOL or
 370   contain an indent.  If it doesn't we just report a missing indent
 371   error.
 372
 373 - If indent flag is set and ToS symbol started at SOL and state is recursive
 374   we report an unexpected indent.
 375
 376 - If state is recursive and ToS started a SoL and contains an indent,
 377   we report an extra indent.
 378
 379
 380 ---------------------
 381 (I think I'm getting there).
 382 Hypothesis:
 383   An indent is allowed in a non-recursive that started on a newline,
 384   or in a recursive (at the recursion point) that didn't.
 385
 386   The matching undent must be at the end of the production.
 387
 388   A newline is only allowed in a production that started at the start
 389   of a line, but not always. - only if recursive or annotated.
 390
 391   An undent is required at the end of a production that contains an
 392   indent.
 393
 394
 395   An indent or newline found where not "allowed" generates an error
 396   but no recovery.
 397   An undent not found where required generates an error and emergency
 398   reductions.
 399
 400
 401 If we have an ambiguous grammar then the undent could simply
 402 force a reduction. i.e. if the 'emergency reduction' just involves
 403 choosing REDUCE of SHIFT, then it isn't an error or even a warning.
 404
 405
 406 But where exactly is an indent or newline allowed - I don't think I
 407 have this right.
 408  - a newline is definitely allowed in a list that started at the
 409    beginning of a line.  We need to structure the production right
 410    so we don't end up with a separator at the start of a line.
 411       LIST -> LIST SEP ELEMENT
 412            -> ELEMENT
 413    would need to allow the newline before ELEMENT.  The logic there
 414    would be that if the first ELEMENT can be at start-of-line, other
 415    ELEMENTS can too.
 416       LIST -> LIST ELEMENT
 417            -> empty
 418    Is appropriate if no separate is needed, and an empty list is OK.
 419    Here ... does the indent go after the LIST or before the ELEMENT?
 420    i.e. if there are more symbols, how do we know.
 421
 422    If the list already has an indent, then from there on it "started
 423    at the beginning of a line"
 424    This is treating it a lot like LL(1) which would be:
 425       LIST -> ELEMENT LIST
 426             | empty
 427
 428       LIST -> ELEMENT SEP LIST
 429             | ELEMENT
 430
 431  - a newline is definitely allowed in a non-list which has already
 432    been indented.
 433       for (i=0;
 434            i<10;
 435            i++)
 436
 437       a = b +
 438           c +
 439           d;
 440
 441    So: whatever holds the indent can also hold a newline.
 442
 443    i.e. a production can have a newline between symbols if an earlier
 444    symbol has an indent before it.
 445    This would allow
 446     a = [ b
 447         , c
 448         , d ]
 449     which is probably OK.
 450
 451     What about indents?  Can I have two indents in the one production?
 452     No, I don't think so.
 453
 454     If an indent occurs *inside* a production, newlines can appear
 455     later.
 456     If an indent occurs *in front* of a list production, newlines can
 457     appear later.
 458
 459
 460
 461 So (again!)
 462   a symbol (on stack) can have an indent at the start, and can have a
 463   number of indents within.
 464   When we shift the first symbol after an indent, it gets an indent at
 465   the start.
 466   When we reduce the result inherit and indent at the start, and other
 467   indents (within and at start) become within.
 468   The indent at the start of a list can be treated as being within.
 469   i.e. as list has an empty symbol at the from before the indent.
 470
 471   A newline is permitted if the max history for the current state
 472   contains an indent.  Indent at start of non-list doesn't count.  At
 473   start of list it does.
 474   An indent is permitted if the max history does not contain an
 475   indent.
 476
 477   An undent cancels indents within, or possibly before, the
 478   immediately preceeding symbol.
 479
 480
 481 Can I do all lists with empty start?
 482
 483   // simple list
 484   LIST -> LIST ELEMENT
 485         | empty
 486
 487   // non-empty list
 488   LIST -> LIST0 ELEMENT
 489   LIST0 -> LIST
 490         |
 491
 492   // non-empty with separator
 493   LIST -> LIST0 ELEMENT
 494   LIST0 -> LIST SEP
 495          |
 496
 497   // can be empty, sep is needed
 498   LIST ->
 499        | LIST1
 500   LIST1 -> ELEMENT
 501        | LIST1 SEP ELEMENT
 502
 503   Just stick "IsList -> " in front of anything that is a list.
 504
 505
 506 For newline to be allowed, do indent have to be between symbols in
 507 same production, or is within symbol ok?
 508
 509 if ( a =
 510      b )   // now I need another indent.. or an undent?
 511   x = x + 2;
 512
 513 The undent resets, then only an indent is allowed.
 514 if ( a =
 515      b )
 516         x = x + 2;
 517
 518 No reset, so extra indent.  Anything is probably OK except newline.
 519 So the indent cannot be inside a previous symbol.
 520 But for lists, the indent will be exactly inside LIST.
 521 So maybe inside first symbol is OK ??
 522  x[a+
 523    b] =
 524    c
 525
 526  x = [ a, b, c,
 527    d, e, f
 528  ]
 529
 530
 531 I think I might be trying too hard.  I should allow newlines
 532 and indents anywhere, and give a clear interpretation for undents.
 533 However my real goal is to know exactly what newlines and undents mean
 534 and how to parse them.
 535 i.e. does the grammar need to mention all of the, some of them, or
 536 just be ambiguous and allow line break to disambiguate?
 537
 538 That last options doesn't really work as we cannot allow a separator
 539 to be dropped anywhere by at end-of-line or before-close.
 540 So grammar needs newlines.  Various separators can be "sep or newline".
 541 However I don't think indents need to be mentioned.
 542   if condition : statementlist
 543 needs undents to disambiguate I don't need to say that.
 544
 545 So:
 546  - indents get shifted with token
 547  - indents are grouped by reduce
 548  - and undent reduces until a matching indent in the top
 549    state can be cancelled.
 550  - newlines are messy.
 551    We look ahead through possible reductions until we
 552    either shift the newline or find an indent or cannot reduce.
 553    If the shifted newline would not contain a newline in the production,
 554    we choose that path.  If we find an indent, we discard the newline.
 555    If we cannot reduce we again ignore the newline
 556    We could also discard a newline early if not in LA set.
 557
 558
 559 But then:
 560    if x :
 561      a = b +
 562      c = d
 563
 564 Looks to the compiler like an addition and to the human like two assignments.
 565      a =
 566          b +
 567          c +
 568          d +
 569 ?? Only allow newline if production started with indent.
 570 So:
 571    If not expected and current production contains indent, discard.
 572    If not expected and current production has no indent - error, recover.
 573    If expected, test-reduce and discard if indent found else accept.
 574 Not quite.
 575
 576 What if we just test-recover until we hit an indent or want to shift the newline?
 577
 578 Difference between statements and expressions is that a statement can
 579 be terminated by a newline.
 580 So if we find a newline and test-recover would use it, that is an
 581 error.
 582 If we find a newline and test-recover would not use it before an
 583 indent, it isn't an error.
 584
 585 So each state which cannot reduce has a link to the best state to shift to.
 586 When we want to recovery we follow these links until they run
 587 out. Then we can reduce which leads to a new state.
 588
 589 If we find a newline, we do this until we find a state which will take
 590 a newline, or we cross an indent.
 591 If we crossed an indent, we discard the newline.
 592 If not and we has to take any shift steps, we call an error.
 593 This requires that newlines aren't buried to deep.
 594
 595 If we find an undent, we do this until we have cancelled the undent
 596 against an indent.  If anything was shifted, we report an error.
 597
 598 No - using indent just to disambiguate an ambiguous grammar doesn't
 599 work as it still allows confusing constructs.
 600    if foo:
 601      bar
 602      else: baz
 603 Should be wrong as the 'else' looks to be part of the statement but
 604 cannot be.  It should be "unexpected".  That really means that
 605 "INDENT" "UNDENT" must be part of the grammar.
 606 So:
 607    if COND : INDENT STATEMENTLIST UNDENT
 608 which is where I started.
 609
 610 So an indent which is not expected protects all newlines and
 611 indent/undent pairs until the matching undent.... not quite.
 612
 613 Indents can be expected in a state (if we compute lookahead).
 614 If an indent is not expected we simply discard it and keep count.
 615 When we shift an expected indent we push the count with it.  When we
 616 push the matching undent, we restore the count.
 617 When we get an undent, we discard it if indent count is non-zero.
 618
 619 When we get a newline we look ahead to see if it would reduce across
 620 an indent.  If so we discard the newline.  If not we keep it.
 621
 622 But an undent still needs to force reductions doesn't it?
 623
 624 i.e. if an undent doesn't cancel against top of stack we count it.
 625 If we every try to shift while an undent is present, we throw an error
 626 and shift other stuff until the undent is gone.
 627
 628
 629 Indents must affect error handling.  We no longer just discard tokens
 630 left and right.  The 'error' state for any indented token is the
 631 matching undent.  So we discard stack tokens until we get an error
 632 state or an indented state.
 633 If error, we shift the error and discard input until a token
 634   matches.  While discarding input we track indent/undent and
 635   if we come to an unmatched undent we discard stack tokens again
 636
 637 If we hit an indented state we discard input until the matching undent
 638 arrives.
 639
 640 But do we really want to discard tokens, or can we synthesise tokens?
 641
 642 Discarding is standard.  It gets to a known position.  It allows any
 643 location in a production to accept errors.
 644 However discarding does discard data that cannot get further analysis.
 645 If we use very fine grained error productions we might not miss much.
 646
 647 Come back to this later.  For now just discard.
 648
 649 If we find an Indent on stack, an error "must" be expected.
 650 i.e.
 651   block -> : INDENT Statementlist UNDENT
 652   Statementlist -> ERROR
 653
 654 So:
 655   if we can shift ERROR, we do,
 656   else if
 657
 658
 659
 660 Indents and newlines.
 661
 662 Indents and newlines can be taken as strong syntactic marker, useful
 663 both for regular parsing and error recovery.
 664 Unlike other bracketing markers such as parentheses or paired key
 665 words (BEGIN and END or DO and OD) indents cannot be lost.  Every line
 666 displays what the current indent level is and for every indent there
 667 is precisely one location where the matching undent is (though
 668 of course one location can have undents for multiple indents).
 669
 670 Similarly start-of-line and end-of-line can be strong bracketing
 671 markers which cannot be missed.
 672
 673 The two can complement each other to provide a very effective
 674 indication of the grouping that the programmer intends.  Often the
 675 programmer will want more explicit marking as well (e.g. "{" and "}" )
 676 which is perfectly fine as redundancy of expression is an important
 677 part of effective communication and this is equally true in writing a
 678 compute program.
 679
 680 The connection between the two is simply that indentation effects a
 681 continuation of the previous line. For example in
 682
 683    A B C
 684       D E
 685
 686 Any bracketing implied by sol/eol should  not be imposed around "A B
 687 C" but only around "A ... E" - the indent effectively hides the line
 688 break.
 689 So this could be tokenised as
 690   sol A B C indent sol D E eol undent eol
 691
 692 Note that every SOL has a matching EOL and every INDENT a matching
 693 UNDENT.
 694
 695 As these markers are strongly paired they can be useful for error
 696 detection and recovery.  This is particularly true for indents.
 697
 698 It seems universally true that any syntactic unit (such a declaration,
 699 statement, or expression) which starts indented will finish while
 700 still indented. That means it will complete before the undent which
 701 matches the most recent indent when it was started.
 702
 703 This fact is helpful but not quite as strong as we might like
 704 For example, in
 705
 706   A B C
 707      D E
 708   F G
 709  H
 710
 711 "F G" could be in the same syntactic unit as "A..E".  however if "D"
 712 started a new subunit, "F G" certainly aren't part of that.
 713 Assuming that "A" started a syntactic unit, "H" is the first symbol
 714 that is clearly not part of it.
 715 This isn't really a problem for
 716    if condition :
 717           foo;
 718           bar;
 719    baz;
 720
 721 as the statementlist is the last part of the if-statement, so
 722 there is no room in it for baz - baz must be separate.
 723 However in
 724
 725     a = b + c
 726         + d + e
 727         + f
 728     + g + h;
 729
 730 This looks wrong but cannot be ruled out on the grounds of
 731 indentation.
 732 Here the expression of
 733      statement -> var "=" expression ";"
 734
 735 starts on the same line as the statement so there is no indent.
 736
 737 We could argue that a syntactic element (expression) that stated other
 738 that at the start of a line may only continue to the end of that line
 739 (when adjusted for indents).  This can hit problems with:
 740
 741   if condition {
 742       foo;
 743       bar;
 744   }
 745
 746 Here the "block" of
 747    block -> "{" statementlist "}"
 748 starts other than at the beginning of a line, but continues on the
 749 next line.
 750
 751 There are a few different ways we could try to deal with this:
 752
 753 a/ some special-case marking in the grammar to say that a newline is
 754    permitted before the '}'
 755 b/ Don't make a separate syntax element for "block":
 756      ifstatement -> if condition { statementlist }
 757     would mean that the '}' is at the same indent as the 'if'
 758     which starts the element.
 759 c/ decide that the last terminal of a production is a special
 760    case and can go on a new line.
 761
 762 The last is almost certainly a bad ideas it would encourage
 763     a = b
 764     ;
 765 and discourage
 766     if foo then
 767        stuff
 768     end if
 769
 770 Another construct that is often used is:
 771
 772   if condition
 773   {
 774        foo;
 775        bar;
 776   }
 777
 778 where the "block" is allowed to start on a new line.
 779 This potentially encourages 'block' to be a separate syntactic unit,
 780 so 'b' above doesn't seem encouraging.  That leave 'a' and marking
 781 both the '{' and the '}' as being allowed to start a new line
 782
 783
 784 If we assume these markings, then an undent can close any syntactic
 785 unit that started at a greater indent, or at the same indent but not
 786 at start of line.
 787 We can use this:
 788  - to detect errors by reducing early and then ensuring the next
 789    token is expected.
 790    e.g. the indent after 'bar;' would close the statementlist after
 791    which only a '}' is expected.  If it isn't there, that is an error.
 792  - to handle ambiguous grammars.  In the classic
 793         statement -> if ( expression ) statement
 794                    | if ( expression ) statement else statement
 795    the normal default is to treat 'else' as right-associative.
 796    Using indents:
 797       if (x)
 798          if (y)
 799             foo();
 800       else
 801          bar();
 802
 803    can be parsed with the 'else' applying to the 'if(x)' as you would
 804    visually expect.
 805
 806
 807 Now we turn to line markings - sol and eol.
 808 The above suggested that any unit should be "on one line" (allowing
 809 for indents) unless specially marked.  This is bit excessive.
 810 For example
 811    a =
 812        b + c
 813        + d;
 814
 815 "b + c + d" is all one expression, but appear on two lines.  Might
 816 seem unlikely, but
 817    a[some_long_expression_with_function_call_etc] =
 818       verylongling + otherverylongthing
 819       + moreofthesame;
 820
 821 might not seem so unlikely.  This could easily be made correct by
 822 moving the '=' to the start of the second line, or indenting the third
 823 line slightly.  It isn't certain that we want to do that, and
 824 annotating that a newline is allowed before the 'plus' seems clumsy.
 825
 826 An alternate approach involves noting that 'statements' are rather
 827 special.  Not uniquely special but still special.
 828 That is, statements are normally on one line.  When they are two long
 829 for a line the remainder is indented (make it part of the previous
 830 line).
 831 When a statement appears on multiple lines it is a notable exception.
 832 For example the  Python
 833
 834    if condition:
 835       statementlist
 836    elif condition:
 837       statementlist
 838    else:
 839       statementlist
 840
 841 has three lines in one statement.  This is a one-line statement which
 842 has newlines before the 'elif' and 'else', or maybe 3 one-line objects
 843 which can come together.
 844
 845 While a statement is line-based, and expression probably isn't.  They
 846 tend to appear all on one line, but that is because the statement they
 847 are in does.
 848
 849 In some sense the line-nature of the statement "captures" the
 850 expression.  An expression can only be on multiple lines if they are
 851 all indented with respect to the containing statement.
 852
 853 So rather than marking some things as being allowed to continue to the
 854 next line, we can mark some things as being line-shaped.  These are
 855 sometimes whole units and sometimes parts. e.g.
 856
 857   if ( condition )
 858
 859   {  statementlist
 860
 861 Note the missing '}' above - it is not part of the line shape.
 862
 863   if condition : statementlist
 864
 865   else: statementlist
 866
 867   function name(argumentlist)
 868
 869   lval = expression ;
 870
 871 and so forth.
 872
 873 Being line-shaped just means that it must be "all on one line", but
 874 not that it must be a line by self.  Several line shaped things can
 875 appear on a line
 876
 877    a = a+1; b= b-2;
 878
 879 However it would be nice to allow a syntax modification when separate
 880 lines are used. i.e.
 881    a = a + 1
 882    b = b - 2
 883
 884 The trailing semicolon is not needed.  As the assignment is
 885 line-shaped it must finish at the end of the line.
 886 Declaring the semicolon to be optional would allow
 887    a = a + 1 b = b - 2
 888 which looks more like an error than a correct construct.  Here the
 889 semicolon is need.
 890
 891 A simple solution is to allow the NEWLINE to be part of the syntax
 892 listed for the grammar:
 893    OptSemi -> ;
 894            | NEWLINE
 895
 896 Then
 897    Assignment -> lvalue = expression OptSemi
 898
 899 as a line-shaped item should work.
 900
 901
 902
 903 1/ If a point is 'newline allowed' then what follows can go
 904    on a new line
 905 2/ If something is line-shaped then linebreaks may only be intended.
 906
 907 So in a production we can have 'optnewline' or 'optsemi'
 908 anywhere but at the start.  Or maybe 'at the start' means something a
 909 bit different.
 910
 911   statement -> if condition optnewline block
 912              | if condition optnewline block optnewline else optnewline block
 913
 914   block ->  : statementlist
 915         |  { statementlist optnewline }
 916
 917
 918 If something is line-shaped, then the EOL that ends the line on which
 919 it starts, must end the thing.
 920 If something is not line-shaped, then an EOL can be safely ignored.
 921
 922 An LR state can expect NEWLINE if it has a goto state for it.
 923 And it has an indent.
 924
 925 If we look up the stack and find an indent before a newline-expected
 926 then a newline is ignored.
 927 If we find a newline-expected before an indent, the newline is used.
 928 If a state expects a newline and was indented then the newline
 929 expectation comes last, is seen first when looking back, so newline is
 930 used.
 931
 932 Separately, an undent reduces until it can cancel with an indent.
 933
 934
 935
 936 but what about
 937    a = 1 + (3 *
 938             4 + 2)
 939    b = 1 +
 940        3 * func(a, b,
 941                 c, d)
 942
 943 This isn't so much about whether an indent is present, as how much of
 944 an indent there should be.  This sort of indent doesn't affect parsing
 945 or help with error recovery so it is probably best left to esthetics.
 946 We could possibly require that everything in the brackets be indented
 947 w.r.t. line on which the brackets started so that
 948
 949    b = 1 +
 950        3 * func(a, b,
 951        c, d)
 952
 953 would not be allowed.  However we would need to ensure that
 954    b = 1 +
 955        3 * func(
 956        a,b,c,d)
 957 were allowed (I think)
 958
 959
 960 I've got the UNDENT reduce a bit wrong.
 961 Anything that started with an indent must be reduced, not just one
 962 thing that started there and everything after.
 963 This is awkward as reducing back to 'statementlist' doesn't stop
 964 adding to that statementlist, but it should.  So we want e.g.
 965   statementlist -> statements
 966    statements -> statements statement
 967               |
 968 Then we reduce back to 'statementlist' which cannot accept another
 969 statement.   The parser generator could insert these extra non-terms
 970 automagically.
 971 No that's not quite it either.  If I have
 972    a;
 973    b;
 974     c;
 975    d;
 976 then the undent after 'c;' needs to reduce the statement (which
 977 is already reduced) but not the 'statementlist'.  But there is
 978 no extra state there.  We aren't reducing back to a state, but back to
 979 a symbol.  The indent happened after a statementlist
 980
 981 A recursive symbol A is a problem if:
 982   A -> A B
 983 and
 984   X -> ?? A Y
 985 for non-empty Y.  This results in a state:
 986    X -> ?? A . Y
 987    A -> A . B
 988
 989 If the Y is empty, X will reduce and B becomes unshiftable.
 990 If this pattern is found we can create A'
 991
 992   X -> ?? A' Y
 993   A' -> A
 994
 995 Now 'A' only appears at the end of a production so there is no
 996 problem.
 997
 998
 999 Got it wrong again ...
1000
1001    foo
1002    bar
1003     baz
1004     bat
1005
1006 after 'bat', the recent indent is in Statements which makes it look
1007 like we need to close all of Statements.  But we don't.  As the indent
1008 stated in the middle of Statements we only need to close to it.
1009 So we really do need to know if an indent is at the start or in the
1010 middle.
1011 We can cancel an 'in the middle' as soon as we find it.  We can only
1012 cancel an at-the-start when reduce_size > 1 (as it is now 'in the
1013 middle').
1014
1015
1016
1017 Now back to newlines.  It doesn't work very well to include newlines
1018 in the grammar wherever one is allowed as it conflicts with where they
1019 are required. E.g. for ';' to be optional we sometimes require one at
1020 the end of a statement, but for statements to be line-like we need to
1021 allow one at the start
1022
1023 I think we need to allow NEWLINE in the grammar for EOL markings.
1024 Maybe we want SOL too?
1025
1026 The important cases are:
1027
1028   a = b *
1029       c +
1030       d;
1031
1032 Statement is line-like so return to SOL not allowed, but expression is
1033 not to return is.
1034   if condition:
1035      statement
1036   else:
1037      statement
1038
1039 Both 'if ..' and 'else..' are line-like
1040   a = b + c
1041
1042 Newline at end of this line is allowed in place of ';'.
1043
1044 If we have the grammar recognise all newlines that are not protected
1045 we need to support blank lines in lists.
1046  Statements -> Statements OptNL Statement
1047
1048
1049 No, I don't like that.  EOL should be at the end.
1050
1051 Some things can have an EOL at end, either optionally or required.
1052 Or elsewhere in the production.
1053
1054 When we see an EOL we need to ask "Do I ignore this or use this?".
1055 The answer relates to indents and expected newlines.
1056
1057 If a symbol can derive a sequence with an EOL, then the symbol is
1058 flagged can-eol.
1059 If that symbols after DOT in any item in a state are flagged
1060 'can-eol', then the state is flagged 'can-eol'.
1061 If a 'can-eol' state appear on the stack closer than a 'was indented'
1062 state, the we use the EOL token.
1063 If no states up to the first 'was indented' state are flagged
1064 'can-eol' then we discard the EOL token.
1065
1066
1067 That might work, but have I thought of all the issues?
1068  - blank lines.  The grammar will need to allow for these explicitly
1069       in any context where EOL is expected.  Not a big burden I guess.
1070  - force reductions ?  If EOL closes things, does it close anything
1071      where not explicit?  Probably not - that is sort of the point.
1072      If it appears where not expected but in line context, that is an
1073      error just like an "unexpected" error.
1074
1075  - is the nesting 1-way?  i.e. can I have line-oriented things in
1076    noline boxes.  or "vertical stacks in horizontal boxes"?
1077
1078    A possible example is a function call.  It might be nice to allow
1079       a = b + func(first
1080                    second
1081                    third) * 2
1082
1083    There are two problem here.
1084    Firstly there is no EOL after 'first' so that cannot mark the end
1085    of an argument.
1086    Secondly it forces an EOL interpretation on surrounding expressions.
1087
1088    And what about
1089      a = [1, 2, 3
1090           4, 5, 6
1091           ]
1092
1093    I would like to be able to drop the ',' after the '3' but it isn't
1094    clear that it is safe.
1095      a = [1, 2, 3
1096                 +4
1097                 +2
1098           8, 9]
1099
1100    Is a bit forced but with clear intent.  I think it could be deemed
1101    as wrong because the think that wraps is not the first list element
1102    on that line.  Is that a well defined concept?
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114    Recording the line structure allows for a number of different options.  As mentioned already, a "NEWLINE" at the end of each line might do it.  However as "IN" and "OUT" already imply a line break, we could just have "SAME" on breaks which don't have "IN" or "OUT".  However this doesn't capture the idea that a line sometimes corresponds to a syntactic unit.
1115
1116 Another approach is to follow the bracketing nature of "IN" and "OUT"
1117 and have "SOL" and "EOL" for start-of-line and  end-of-line.  As we
1118 noted already an indented section implies continuation and so should
1119 delay the end of line.  If we follow this idea, the above structure
1120 could be tokenize as:
1121
1122
1123
1124
1125 Here every "SOL" has a matching "EOL" but it can be delayed until after an indented block. So the  "EOL" that matches the first "SOL" comes after the "OUT". It is worth noting here and a single line break can have multiple "OUT"s and  even an "IN" as well.
1126
1127 A B C
1128   D E
1129    F G
1130  H
1131 I
1132
1133 After the "G" we would have "EOL OUT EOL OUT EOL IN SOL" before the "H".  The three "EOL"s terminate the lines that start "F", "D", and "A" in that order,  and a new "IN" is needed as "H" is indented with respect to "A".
1134
1135 This tokenization captures the structure very effectively and is what
1136 I will be using with one simplification.  The  "SOL" really isn't
1137 needed.  It is easily deduced and doesn't have a strong effect on the
1138 parser.  The "EOL" may for a syntactic element to be closed.  The
1139 "SOL" is simply recorded for reference, and that can be recorded when
1140 "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so
1141 "OUT" doesn't need to imply "SOL").
1142
1143 This tokenization captures the structure very effectively and is what I will be using with one simplification.  The  "SOL" really isn't needed.  It is easily deduced and doesn't have a strong effect on the parser.  The "EOL" may for a syntactic element to be closed.  The "SOL" is simply recorded for reference, and that can be recorded when "IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so "OUT" doesn't need to imply "SOL").
1144
1145 Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.
1146
1147
1148
1149 When does an EOL actually mark and END.
1150
1151 Now that we have a good understanding of IN and  OUT we can turn to EOL.  Intuitively EOL should mark the end of something much like OUT does.  But what.  Let's look at some cases:
1152
1153 if a == b:
1154     print "Hello"
1155     print "World"
1156
1157 Here there are 3 EOL tokens, one after "Hello", one after "World" and one after the OUT token right at then end.  The first two should close the statement on that line.  The last probably shouldn't as there might be an "else:" still to come.
1158
1159 a = b +
1160     c * d +
1161     e
1162
1163 There are 3 EOL token here in the same locations.  Now the first two should not close anything, while the last one should close the statement.
1164
1165 These two examples have very similar structure and are largely indistinguishable to the LR parser, and yet must be handled very differently. This implies that we need to provide the parser with extra information. We need to tell it something about the expected line structure of the language.
1166
1167 To understand what to tell the parse we need to look into our (or my) intuition about why the newlines are or are not wanted in the different locations.
1168
1169 I think the key observation is that statements are generally expected to be one-per-line, while there is no similar expectation for expressions.  The earlier observation that an "else:" might still be expected shows that statements are always one-per-line, however that is a fairly well defined exception. i.e. there are certain placed in a statement where a newline is expected, or at least not a surprise.  The start/end is one place.  Before an "else" or "elif" is another.  In C we might have:
1170
1171 if (a == b)
1172 {
1173     print "Hello"
1174     print "World"
1175 }
1176
1177 so an EOL is expected before a '{' or a '}' in that language.
1178
1179 If we define all the places where an EOL is expected then we should be able to work out when an EOL should close a syntactic unit and when it should be ignore.  However it isn't as simple as ignoring an EOL where ever it isn't expected.  We need to include our understanding that indents protect EOL in some sense. In the second EOL example above (a = b + c ...) the first two EOL tokens were ignored in part because they were indent.  If the second they were not ignore because despite being indented, they were expected.  So the interplay between indents and EOL expectations needs to be understood.
1180
1181 If we go back to our very first example:
1182
1183 A B C
1184    D E
1185    F
1186 G H
1187
1188 you might remember I suggested that if something started at "B C..." it should finish at "F".  The "OUT" should close that as well.  Our process described ..... not sure ..
1189
1190
1191
1192
1193
1194
1195
1196 Also you need to know how the scanner can report indent information to the parser.  I discuss that in some detail when introducing my lexical scanner but you probably don't need to read that now.  It is sufficient to know that the token stream will have paired "INDENT" and "UNDENT" tokens, and that the "NEWLINE token that you might expect to come before an "INDENT" is delayed until after the matching "UNDENT".
1197
1198 With this in place, it seems natural to allow the parser to absorb INDENT and UNDENT, but to require that when it reduces and production, the net indent absorbed while shifting the tokens which make up that production is zero.  i.e for every INDENT there must be exactly one UNDENT.  So:
1199
1200 if condition {
1201
1202 It might help to start with some examples.
1203
1204 1. Lorem ipsum dolor sit amet, consectetur adipisicing
1205 2.         elit, sed do eiusmod tempor
1206 3.         incididunt ut labore et dolore
1207 4.            magna aliqua. Ut enim ad minim
1208 5.    veniam, quis nostrud exercitation ullamco
1209 6. laboris nisi ut aliquip ex ea commodo consequat.
1210 7. Duis aute irure dolor in reprehenderit in voluptate
1211 8.      velit esse cillum dolore eu fugiat nulla pariatur.
1212
1213 Here are 8 lines of random text - the line number are just for reference and should not be considered part of the text.   Similarly any punctuation or capitals should be ignored when trying to understand the structure.
1214
1215 Looking at just the newlines and indents it appears that line 1-5, line6, and lines 7-8 form three separate and equal units.  Within the first of these, lines 2 and 3-4 are again to separate and equal units.  Line 5 is not part of whatever 2,3-4 are but is something else that is part of 1-5.
1216
1217 Where that structural analysis is actually correct cannot be determined without knowledge of the meaning, so lets try something more meaningful.
1218
1219 a = [ 1, 2
1220       3, 4
1221       5, 6
1222     ]
1223
1224 Now we can clearly see that the continued lines are all part of what was started on the first line.  The last line is less indented than the others much like line 5 in the first example.  We took that to mean it is not a continuation 2nd and 3rd lines but only of the first, and this example bears that out.  The lack of a comma at the end of the first 3 lines are not lead to any ambiguity.  It is clear from the structure that something is completed.  It cannot be the whole list that is completed as that started on the first line and we are still indented with respect to there, so it must be a "smaller" or more local syntactic unit that has completed.  i.e. the list entry.
1225
1226 Here is a different example using the same structure.
1227
1228 if a = 1 { a =
1229     b + c;
1230     d = e;
1231   }
1232
1233 This looks very wrong.  Breaking the first assignment looks bad, and having the second assignment start with the same indent as the continuation of the first is also wrong.  Yet the list begun, but not finished, of the first line of the first example looks fine.  Maybe it is the amount of indenting that makes a difference.  We can try adjusting that and see:
1234
1235 a = [ 1, 2
1236    3, 4
1237    5, 6
1238   ]
1239 if a = 1 { a =
1240                b + c;
1241    d = e;
1242  }
1243
1244 certainly the amount of indenting helped in the first example, but fixing it doesn't really fix the second example.
1245
1246 hello
1247
1248 30apr2014
1249  I seem to be stuck here...
1250  What do I know?
1251
1252  When I see a newline, I must either IGNORE it or SHIFT it or REDUCE
1253  The key question is: when to ignore it?
1254
1255  We ignore newlines when they are protected by an indent.
1256  What does "protected" mean?  It means there is already an internal indent
1257  in any symbol which can derive a newline.
1258
1259  So in the above examples, only newlines are expected at end of
1260  statement (with or without ';') and after '{'.
1261
1262  So for all the newlines seen in a = [..] the closest symbol that derives
1263  newline is 'statement' and it has an internal indent.  So they are ignored.
1264
1265  In the "if a = 1 { ..."  The first newline is after "+ c;". We are in
1266  statement (or statementlist) which derives newline and that has internal
1267  indent so newline is ignored.  But I don't want it ignored.
1268
1269  How about:
1270    When we see a newline, we find the closest reducible thing that
1271    can contain a newline.  If it started at the start of a line and
1272    contains an internal indent, then we can ignore the newline.
1273
1274  That has potential: a line-like thing can only be broken if it
1275  starts at the beginning of a line.
1276
1277  But can we implement this in an LR grammar?  We don't know what context
1278  we are in until we reduce into that context.  So how do we determine if
1279  we are in a line-like thing?
1280
1281  Each symbol has a 'can_eol' flag.  EOL gets that flag, and  any
1282  non-terminal which can produce an 'can_eol' symbol gets the flag too.
1283  So 'statement' has it but 'expression' does not.
1284
1285  Each state has an 'expect_eol' flag.  If is set if the 'next' symbol of
1286  any item in the state has can_eol set.
1287  OR : It is set if any the head of any production in the state has can_eol set.
1288
1289  Now recall that the state stack is conceptually:
1290    STATE1 SYM1 STATE2 SYM2 STATE3 SYM3
1291  Each STATEn is expecting all the following syms to eventually be reduce
1292  to something which can then advance that state.
1293  So if STATEn has 'expect_eol', then everything following is part of
1294  something that maybe could have a newline in it.
1295
1296  So if STATEk has expect_eol then we ignore newline precisely if
1297  SYMk was at start-of-line and somewhere from SYMk on there is an
1298  internal indent.  So we need a new start-of-line flag in the frame.
1299
1300  Is expect_eol precise enough?  I'm wondering about a production
1301  that has a newline in the middle.  What happens then?
1302  After the newline has been shifted the whole won't be reduced so that
1303  expect_eol state will still be there so we can still be ignoring
1304  newlines.
1305
1306  I think we declare that a production which can contain a newline but
1307  cannot end with a newline is an oddity which we don't support.
1308
1309  Similarly if any production from a symbol can contain a newline,
1310  then every production is assumed to be line-like and so require an
1311  indent for newlines to be ignored.
1312
1313
1314  "Newlines are ignored only inside line-like symbols which started at
1315   the beginning of a line and  have since had an indent"
1316
1317  Possibly good, but a grammar with no newlines would not allow
1318  newlines.  Maybe that doesn't matter - the lexer can ignore them.
1319  Can we turn it around  though
1320
1321  "Newlines are only shifted if the nearest line-like thing does not
1322   have an internal indent."
1323  so "They are ignored if there is no linelike thing, or it holds an indent
1324
1325  Also "An internal indent in a linelike thing that started midline is
1326   not permitted.
1327
1328
1329  That didn't work - 'expect_eol' definition doesn't make sense in that way.
1330
1331  New attempt:
1332   expect_eol is set if *all* of the *next* symbols in core item
1333   have cal-eol set.
1334   When a core item does not have a 'next' symbol we ignore that item.
1335
1336  Looks good, but now a new issue appears.
1337  If a production is reduced by an OUT indent, the following NEWLINE
1338  doesn't have a home.  If expect an optional newline there it might
1339  use the newline before the outdent (which could have been ignored
1340  otherwise??). and worse, it can create a reduce-reduce conflict
1341  because the optional newline could terminate inner or outer statement.
1342
1343  This is particularly a problem with
1344     Block -> : Statementlist
1345     Statement -> if Expression Block else Block
1346
1347  After an OUT reduces ": Statementlist" to Block there is a NEWLINE
1348  but we need to see the "else".  If we indent the 'else' it works nicely.
1349  I want to tell that a NEWLINE is needed there.  That means I must put one
1350  after "Statement -> if Expression Block" to avoid a reduce/reduce conflict.
1351  Each.
1352
1353  After trying to describe this I need to refine it again.
1354  - A symbol is "can_eol" if any symbol in any production from this symbol
1355    is followed only by nullable symbols.
1356  - A state is "starts_line" if the previous symbol in any item "can_eol",
1357    or is nullable and previous to that "can_eol", etc.
1358  - The start state is "starts_line" if the start symbol "can_eol".