ocean-lang.org Git - ocean-D/blob - Parsing

   1 Apr 2013
   2
   3 What do I remember about how parsing works?
   4 And why don't I just use yacc?
   5 A: because I didn't invent that :-)
   6
   7 We have productions from a non-terminal to a list of
   8 terminals and  non-terminals.
   9
  10 We have an input stream of terminals.
  11 Parsing involves a stack of symbols and two operations: shift and reduce.
  12
  13 "shift" take the next terminal from input and pushes it onto the stack.
  14 "reduce" pops a sequence of symbol off the stack which match the
  15 right hand side of a production, and replace it with the non-terminal.
  16
  17 On every "reduce" action some parsing code is run which typically
  18 gathers important info from the symbol list and attaches it to the
  19 non-terminal which is pushed.
  20
  21 We finish with "program" as the only non-terminal on the stack,
  22 and empty input stream, and a full abstract syntax tree attached to the
  23 "program".
  24
  25 OR an error.
  26
  27 To choose between "shift" and "reduce" we use a state machine.
  28 Each state corresponds to a point between two symbols in a production,
  29 though multiple state may well get merged.
  30 For each state there is a list of terminals which should be shifted
  31 and a list of terminal that should trigger a 'reduce', together with
  32 the production that should be reduced.  Each of these also determines
  33 the next state.
  34
  35 Actually not.
  36 We have: a stack of "state" + anything the actions might want.
  37
  38 Actions: given a state and a lookahead determine action:
  39   - shift:  newstate
  40   - reduce: non-terminal, count, action
  41   - accept: all done
  42
  43 When a reduce happens, we look up a separate "goto" table
  44  goto: state + non-terminal -> state
  45
  46 We create these tables in 4 steps using some intermediate tables.
  47 build_first
  48    create the 'first' map which lists all the possible 'first' terminals
  49         for each non-terminal
  50 build_follow
  51    create the 'follow' map which lists all possible following terminals
  52         for each non-terminal
  53 build_itemsets
  54    create itemsets which maps an 'itemset' to closures.
  55    Each 'itemset' corresponds to a 'state' above.
  56    An 'item' is a production plus an index, 0 for the start, 1 after the first
  57    symbol etc.
  58    An 'itemset' is a collection of 'items' that can all exist at the same point.
  59    The 'closure' is a list of nonterminals and terminals which may follow
  60    this point.  i.e. all that follow immediately, plus the first for each,
  61    plus the first of each of those, etc.
  62
  63    This involves building the 'goto' table
  64
  65 build_actions
  66
  67  - if [A -> alpha . a beta] is in Ii and goto(Ii, a)==Ij,
  68           action[i,a] is "shift j"
  69  - if [A -> alpha . ] is in Ii then
  70           action[i,a] is "reduce A -> alpha" for all a if FOLLOW(a)
  71
  72 build_first:
  73   Each terminal may only start with that terminal.
  74   Each non-terminal that can produce an empty string gets 'Empty' added.
  75      This is repeated until no more can be added (As setting Empty might
  76      allow other non-terms to produce empty)
  77   Then for each production, and for each symbol until a non-empty is found
  78     add the first of each symbol to the first for this symbol.
  79   Then repeat until no change.
  80
  81 build_follow:
  82   build_follow requires the first() mapping from symbol to terminal sets.
  83   Initially, follow() everything is empty, except follow(Start) is End
  84   Then for each production
  85     for each symbol but the last, all of first(next-symbol) except Empty into
  86             follow(this)
  87     for last symbol, if first(last) contains empty, then all for follow(last)
  88      go to follow(this)
  89   or something like that
  90
  91 build_itemsets
  92   first itemset has one item; "START 0"
  93    for each new itemset, we add closure and build_goto.
  94    if build_goto added itemsets, we add their closure and build their goto.
  95
  96   To close an itemset:
  97      - collect lists of terminals and nonterminal which are 'next'
  98          in each production
  99      - for each non-terminal, add 'first()' ???
 100   the code in LRparse.itcl looks confused - or I am.
 101
 102   Having created the closure, we create the goto for the itemset plus
 103   each symbol in the closure.
 104   For an itemset+symbol, the 'goto' is the set of all items which are
 105   one step forward. over *this* symbol.  This might be a new itemset,
 106   which will need a closure and gotos.
 107
 108
 109 So we assign a number to each terminal and a higher number to each non-terminal.
 110 We need to be able to store sets of these numbers. This would be an auto-grow array
 111 for which the first part is sorted.  Every time (say) 8 entries are added, we
 112 sort the new entries in.
 113
 114 We also assign an index to each production, which is a list of a head and several
 115 body parts
 116 An 'item' then is 2 numbers: production+offset.
 117 An 'itemset' is just like the terminal sets, but fatter.
 118 Each itemset gets assigned a number.
 119 A 'goto' is a map from itemset plus symbol (2 numbers) to another itemset.
 120
 121 'first' and 'follow' are arrays pointing to termsets.
 122 Infact there is probably one array for symbols where each has
 123 name, first, follow, first_production
 124 'itemsets' needs content-lookup so could just use linear search, or occasional sort.
 125
 126 'goto' is a separate small table for each itemset which can be created sorted.
 127
 128 'actions' is a per-itemset list which is created sorted.
 129
 130 =================================
 131 Errors:
 132
 133 grammar errors are "shift-reduce" conflicts and "reduce-reduce" conflicts.
 134
 135 shift-reduce are resolved by shifting, but cause a warning
 136 reduce-reduce are resolved arbitrarily I guess.  Best to avoid them.
 137
 138 However we can use precedence to resolve these conflicts.... nearly
 139 as easy to be explicit about different nonterminals
 140
 141 Parse errors happen where we can neither 'shift' or 'reduce'.
 142
 143 We can discard unexpected symbols until we recognise something, but
 144 that may be never and might discard too much.
 145 Or we can reduce prematurely so we are likely to expect more significant
 146 symbols.
 147 What we actually do is replace an unexpected symbol by 'error' which should let
 148 us reduce.... don't really know how this is meant to work.  We want to shift
 149 some and reduce some but how much of each?  Maybe replace each bad symbol
 150 by 'error' ... but then we lose stuff.
 151 Maybe check each state on the stack to see if anything recognises the next symbol.
 152 If not, discard it and try again?  Or something.
 153
 154 So an error production is:
 155    A -> error term
 156 meaning that in the context of A, if we find an error, we look for 'term'
 157 and decide that is a (bad) A.
 158 So if we get an unexpected terminal, we pop states until one
 159 allows us to shift and error, then shift the error, then discard terminals
 160 until one is recognised.
 161
 162
 163 ================================
 164 Lexer.
 165 hand crafted which "knows" about
 166  identifiers/reserved words
 167  numbers
 168  symbols
 169  newlines
 170  end of file.
 171  indentation
 172  strings
 173  comments
 174
 175 Gets list of reserved words from grammer, and list of symbols
 176
 177 ident/reserve:  _alpha followed by _alphanum
 178 numbers: digit followed by digits, periods, a-z and +- immediately following
 179      an 'e' or 'E'
 180 newlines are .. newlines
 181 strings are: ' upto next' on same line, " upto next ", 3 of these upto next
 182     3 on same line, or some multiline thing with no indent.
 183 comments are: // to end of line, or /* to */ but without another /*
 184
 185 symbols: any ASCII after space and before delete other than digits,
 186     alpha and quotes.  i.e. anything left.  Unknown symbols are the
 187     "unknown" token.
 188
 189 Grammer is:
 190
 191 nonterm -> nonterm + foo += bar  "$1 $2 - args for HEAD struct"
 192         |  something else ;
 193
 194 Any identifer that does appear on the LHS is a literal.
 195 Any symbol sequence is a symbol.
 196 So as we parse we add each word into symbol table.  As we
 197 find proctions we mark some of them as non-terminals.
 198 First non-terminal is the 'start' symbol.
 199
 200 This builds an abstract syntax tree.
 201
 202 Each node in the tree has a name (the symbol) and either:
 203   for non-terminals: a list of other nodes
 204   for terminals: the id and value
 205
 206 Each symbol on the rhs can be followed by a number which indicates where
 207 in the symbol list of the lhs this symbol should go. 1 is first, 2 is second.
 208 0 has a special meaning. The list of symbols in parent is set to the list
 209 of symbols in this child, then others are appended.
 210
 211 ===============================
 212 How do we handle fine indents in parsing?
 213
 214 Each token carries not only the token name/value, but also
 215 the line-num, char-num of where it starts, and the indent size
 216 of that line.  A 'reduce' ensures all tokens have compatible indents.
 217 If the first token is a terminal, it is ignored.(maybe)
 218
 219 Indentation changes result in 'CLOSE' tokens
 220    statementlist -> statementlist statement statement_end
 221                  |
 222    statementend  -> ;
 223                 | ; CLOSE
 224                 | CLOSE
 225 No, that doesn't work - it could swallow 'CLOSE's.
 226 I suspect we need OPEN and CLOSE to be paired.
 227 Or we want 'SAME INDENT' and 'SMALLER INDENT' to be distinct.
 228 Python uses INDENT and DETENT and NEWLINE, which is sometimes ignored.
 229
 230 Maybe:
 231   If a newline is found were it isn't expected it is quietly ignored.
 232   If it is followed by an indent, it and following newlines to a dedent
 233   are also ignored.
 234 Then
 235   block -> : NEWLINE INDENT statementlist DEDENT
 236         | { statementlist }
 237   statementlist -> statementlist statement ;
 238         | statementlist statement NEWLINE
 239         | statementlist statement ; NEWLINE
 240
 241 ==================================
 242 Just clarifying how the states in the parser work.
 243
 244 A 'state' is a collection of points in different productions that are
 245 all equivalent in some way.
 246
 247 We have a stack.  Every item on the stack is a pair of a 'state' and a
 248 'frame'.  A 'frame' is effectively the abstract syntax tree for some
 249 symbol which lead to that state.
 250
 251 The two main parsing operations are:
 252  SHIFT(state):  This pushes the given state onto the stack together
 253      with a frame built for the terminal.
 254      The given state is one step further along one or more productions from the
 255      state that was on top of the stack (and is now second from top).
 256
 257  REDUCE(production): This pops one entry from the stack for each body
 258      part in the production and combines the frames as determined by the
 259      production to product a new frame.
 260      It also looks up the head in the goto table for the top-of-stack
 261      state and makes that the new state
 262
 263 So in each case, a state plus a symbol founds goes to a new state.
 264
 265 So given a state we examine each production and look at each following
 266 symbol.
 267 For each unique symbol found, we create a state which is one step
 268 forward in each production over precisely that symbol.
 269 To this state we add the '0' start for all productions for all
 270 non-terminals which could come next
 271
 272 Once we get a new state, we create lists of the terms and  non-terms
 273 which come next in all the different productions.  Then for each
 274 nonterm we add all terms and non-terms at start of productions
 275 for that nonterm.  Isn't this just 'first' all over again?  No,
 276 because 'first' skips over 'empty' nonterminals, but we don't want to here.
 277 This pair of symsets then lives with the itemset and are used to build
 278 the goto table and to expand the itemsets to a full state.
 279
 280 So each itemset needs a set of items and (computed from that) a set of
 281 following symbols.  We need to be able to quick check if a given
 282 itemset is know.
 283 The set of following symbols is used partly to help complete the
 284 itemset, partly to build the goto table from that itemset.  So we
 285 don't need to hold on to it if we build the complete itemset.
 286 So after we complete the itemset and build the goto table for it,
 287 an itemset is just a list of items.
 288
 289 However: completing an itemset is a lot of work, and if we have
 290 already found it, that is pointless.  So we want to check against
 291 known itemsets before completion.  So we need to be able to see the
 292 non-completed form of existing itemsets.
 293 The completion of an itemset involves adding "X,0" items,  All other
 294 items are non-zero.  So as long as we get the encoding right, it
 295 should be easy to just compare the prefix.
 296
 297 itemsets need some sort of stable identity to be stored in the goto
 298 tables, so we need to store a number in each itemset.
 299 How do we provide lookup?  I could use a skip list... yes.
 300
 301
 302
 303 ---------------------------------------------
 304 I've changed my mind ... I think.
 305 I don't want the grammar processing to be in the same program as the
 306 grammar matching.  It is kind-a neat having just one program and being
 307 able to fiddle the grammar and re-run it.  However there are impedance
 308 mismatches.
 309 1/ symbol numbering.  The matcher needs to 'know' the numbers of
 310    symbols as compile time constants.  The could be done by having the
 311    lexer assign known numbers to known names while parsing the grammar.
 312 2/ AST types.  It requires a single data type with a simple list of
 313    children.  This is overly simplistic and makes the matchers
 314    unnecessarily messy
 315 3/ AST management code.  This probably has to change with the grammar
 316    anyway and keeping it separate doesn't gain anything but loses
 317    much.
 318
 319 If we had an interpreted language with introspection and 'eval', then
 320 it might make sense, but I don't really want that sort of language.
 321
 322 So: Have the grammar processing output some C (or later 'plato') code
 323 to do the parsing.  The grammar file can then define the ast data
 324 types and the per-production code.
 325 So we need some syntax to embed code and grammar in the same file.  It
 326 would be best if it looks like C to emacs so I can edit it easily.
 327
 328 Each production needs to know the type (always a struct I think) of
 329 the frame for the head and some code to run.
 330
 331 The code has '$N' where variables can be substituted. $0 means HEAD
 332 frame,
 333 Old frames are (probably) freed so we need to move whatever is
 334 needed out.
 335
 336 The generated code needs to include the go_to table which maps
 337 state to token to state, and the action table which maps
 338 state to terminal to action, where action is ACCEPT, SHIFT or
 339 REDUCE(n) where n is a production.  The table of productions lists
 340 body size, token, (to feed to go_to) and code fragment.
 341
 342 gram file contains:
 343
 344 #include lines
 345 #define lines
 346 enum lines
 347 struct lines
 348 static lines
 349
 350 $$$$$ to mark start of grammar
 351
 352 head -> token token ${  code  }$
 353
 354 #include lines go in .c
 355 #define lines go in .h
 356 struct lines go in .h
 357 other lines go in .c
 358
 359 #define TOK_token N
 360    gets put in .h
 361
 362 func define for parser goes in .h
 363
 364 "known" table goes in .c
 365 go_to as an array of pointers to arrays of pairs of integer with
 366 length.  To we index state into go_to then binary search token.
 367
 368 action table is array of numbers
 369  -1 for accept
 370  -2 for reduce
 371  (N*256 + body_size) for production.
 372
 373 production table is a switch statement which includes the code
 374 and assigned the result token.
 375
 376 Wait - do I care about token names in the C code?  I probably don't.
 377 If I want to use similar things I can create my own defines.
 378 So drop the TOK_token stuff... though it might make the generated
 379 C file more readable.
 380
 381
 382 ------------
 383 I need to be able to capture raw input text that matches some
 384 sequence of symbols.  This will be used to copy code out of the
 385 grammar file and into the code or include files.
 386 It is important to capture spacing and comments as well as the
 387 functional bits.
 388 Probably best to turn on capture when we see a newline, then if we
 389 decide we don't want that, we can turn off and discard it.
 390 So 'capture' collects every character processed which will typically
 391 be leading spaces and the target token.
 392
 393 When capturing C code we count '{}' and if we find something we don't
 394 recognised, we add it to the previous stanza.
 395 Or maybe have '$$$$$' tokens to differentiate include from C code.
 396
 397
 398 -------------------------------------
 399
 400 What about indents and newlines?
 401 If I find a newline that I'm not expecting, I can ignore it but
 402 require future  tokens to be indented more than the token which
 403 started the current production(s).  But I don't really know which
 404 are the current productions until we reduce, by which time it might be
 405 a little late to report an error.
 406
 407 With each frame on the stack we record the indent of the line
 408 which started that nonterminal.  On each 'shift', we ensure
 409 the new symbol has at least the same indent.
 410 On reduce, we set the new indent based on first token reduced.
 411
 412 But that isn't the whole picture.
 413 I think there are two different sorts of regions.
 414 In one, we don't really expect newline, we are happy for them to
 415 appear anywhere, as long as indenting never goes backwards.
 416 In the other, we do expect newlines, they terminate something
 417 but we allow extras as long as they are *more* indented than before.
 418 In effect the extra indent in a line continuation and hides the
 419 newline.
 420
 421   if a and b :
 422      then C
 423
 424 that was expected and the indent is structural
 425
 426   if a and
 427     b:
 428       C
 429
 430 First one not expected but due to indent, is ignored.
 431 Second one is expected and indent required and structural.
 432 Intent relative to 'if' line
 433
 434   if a and b : C
 435      d e
 436
 437 The newline was possible as a terminator, so the next indent
 438 becomes a line continuation.  That could be a bit confusing,
 439 so we probably don't want to allow that?
 440
 441 Each newline gets parsed as either
 442   INDENT - if it is followed by an increasing indent
 443   NEWLINE- if it is followed by the same indent
 444   UNDENT+- if it is followed by a reduced indent.
 445
 446 If an indent is not expected, it and all newlines and indents
 447 and matching undents are ignored.
 448 If newline or undent is not ignored or expected, it is an error.
 449
 450 That would make:
 451    if asdad
 452    {
 453      stuff
 454    }
 455 an error because that first NEWLINE isn't expected or ignored.
 456 I guess it should be optional before a '{'.
 457
 458  IF -> if Expression NEWLINE { Statementlist }
 459     |  if Expression { Statementlist }
 460     |  if Expression : INDENT Statementlist UNDENT
 461
 462
 463 Maybe this:
 464  - If we see an unexpected INDENT, we hold it and continue.
 465         Next SHIFTed symbol gets it attached.
 466  - when we reduce all the attached indents get collected and
 467    then attached to the nonterm when it is 'shifted' in.
 468  - If we see an unexpected UNDENT, we hold it and continue.
 469      Any shift while we hold an UNDENT is an error (how to handle?)
 470      unless it holds matching INDENTS (i.e. is a reduced nonterm).
 471
 472 That is promising but doesn't explain newlines and allows:
 473   a = b;
 474     c = d;
 475 Why is this a problem?  It could be a single line, but is wrapped.
 476 I think it is a problem because a NEWLINE was allowed after the ';'.
 477 So it doesn't really work as one line.
 478    a = b
 479      ; c = d;
 480 is more messy.  Newline is allowed there too.
 481    a =
 482      b ; c = d
 483 That makes sense.  newline wasn't permitted.
 484    a = b
 485      + c ; d = f;
 486 That is OK.  Why the difference?  INDENT is attached to '+' which
 487 is not the start of the production that swallows it... actually it is.
 488  statementlist -> statementlist statement
 489 the 'statement' is 'd = f;', + was swallowed into statementlist.
 490 Maybe that means it is bad.
 491 So: When we reduce a production, and indents that are not on the first
 492   symbol must be matched by pending undents.
 493   Any indents that are on the first symbol must be carried to parent
 494
 495 This is getting there, but doesn't work for
 496 {
 497   foo;
 498 }
 499
 500 as shifting the last '}' will be an error as an undent is
 501 pending.... No, that is the old rule.
 502 Current rule is:
 503
 504  - If we see an unexpected INDENT, we hold it and continue.
 505    Next SHIFTed symbol gets it attached.  i.e. we  read the next
 506    symbol and mark it as indented.
 507  - When we see an unexpected UNDENT, we keep count of it and continue.
 508    It is not attached (yet).
 509  - When we reduce a production, Any indents attached to the first
 510    token are moved to the created token.  Any indents on other tokens
 511    must match saved undents, and are canceled.  If there aren't
 512    enough saved undents, then it is an error.  This means that if
 513    a nonterminal continues onto more lines, it must end the last line.
 514    So a a ; b b ; c
 515         c ; d d
 516    is not ok, but
 517       a a ; b b ; c c ;
 518         d d
 519    is as is
 520       a a ; b b ; c
 521          c ;
 522        d d
 523    That seems to make sense.
 524
 525  Only I want to revise the above again.
 526   - If we see an INDENT or UNDENT which has no action, we count it,
 527     keeping a single count which is +ve or -ve or 0.
 528   - When we shift in a terminal, it collects the current count.
 529   - When we reduce a production, the count on the first token gets
 530     attached to the resulting nonterminal.  The counts on the
 531     remaining tokens must be non-negative, and there must be enough
 532     excess UNDENTS to counter them.
 533
 534  But what about NEWLINEs?
 535    If there is an action for it, we just accept it and eventually
 536    shift it.
 537    But what if no action?
 538
 539    1/
 540      a = [ 1, 2,  INDENT
 541            3, 4,  NEWLINE
 542            5, 6,  UNDENT
 543          ]
 544
 545      Actually, that isn't an UNDENT, it is an UNDENT INDENT if
 546      anything.  But in any case the NEWLINE is allowed and ignored
 547
 548    2/
 549      if true {  INDENT
 550         a =     NEWLINE
 551         b + c   UNDENT
 552      }
 553
 554      First newline is confusing and should be illegal.  But why?
 555      However is that different to '1'?
 556      Maybe because NEWLINE is expected in a little while?
 557
 558    3/
 559      if true { a =
 560         b + c;
 561         d = e;
 562      }
 563
 564      is even more like 1, and even worse than 2.  Even with ';'s so no
 565      reason to expect newlines soon, it looks bad.  But here it is the
 566      indent which is bad, not the newline.
 567
 568      Is it because '1' has "a -> a b" (first token is common) while
 569      bads is "a -> b c d" - different token.
 570
 571   Maybe we need to flag non-terminals which are, or are not, allowed
 572   to have unexpected newlines.
 573   So a statement-list may not, but other lists may.
 574
 575   If a production contains a newline, then the head may not contain
 576   unexpected newlines.  But how would we know?  And indents are still
 577   allowed.  So if the stack contains indents....
 578
 579    a =
 580    b + c
 581   is bad, but
 582    a =
 583     b + c
 584   is fine.
 585    a =
 586     b +
 587     c
 588   is also fine.
 589
 590   New attempt.
 591
 592   When we see an INDENT, UNDENT, or NEWLINE which would otherwise be
 593   an error, we record it.  We should never end up recording an indent
 594   and undent at the same time.  Probably not a new line either.
 595   When we SHIFT a terminal, we attach to it in the current recorded
 596   state, which is a simple number: 1 for NEWLINE, 2 for INDENT, -N for
 597   UNDENT, 0 for none of the  above.
 598   When we REDUCE a production:
 599     Any state on the first symbol is carried up to the resulting
 600     nonterminal.
 601     Any other INDENTS must cancel with following UNDENTs.  They can
 602     follow either in subsequent symbols, or in the current (unshifted)
 603     state.
 604     There must be no uncanceled UNDENTs on any symbol, but there can
 605     be in the state.  These are carried forward.
 606     Any NEWLINEs between an INDENT and UNDENT are canceled together with
 607     the INDENTs.
 608     If there are any extra NEWLINES, and a NEWLINE has an ACTION for
 609     the current state, then that is an error - stray newline, indent
 610     expected.
 611
 612   That seems good, though will need to be tested.  Remaining question:
 613   Is an undent less than the previous indent allows?  How is it parsed
 614   and what does it mean?
 615     if cond1 and (cond 2
 616                   or cond 3)
 617        and cond 4
 618
 619   This could be an UNDENT followed by an INDENT.
 620   However this fails the above rules.  We would reduce
 621   "cond 2 or cond 3" which contains an INDENT before we see an UNDENT.
 622   So we would have
 623      '(' Cond-with-indent ')'  [undent pending] ['and' is lookahead]
 624
 625   Maybe each symbol gets a 'leading indent' and  'internal indent'.
 626   Internal indents can be canceled against undents and hide newlines.
 627   Leading indents must wait until they become internal.  Only then can
 628   they cancel newlines and undents.
 629
 630
 631   I'm not sure about combining these things.  Might need 2 numbers.
 632   Any newline after an indent is ignored.  So there is no point
 633   keeping both and indent and a newline on a symbol - just the indent
 634   is enough.
 635   Undents can only ever be 'internal'.  Intents can be leading or
 636   internal.
 637   So we have 'leading indent', 'internal undent', 'internal indent'.
 638   where indent can also be newline.
 639   When we shift in a terminal, it can have undent or indent.  It
 640   cannot have both.  If there is an indent, the undent must have been
 641   canceled.
 642   When we reduce, indent/newline/undent is canceled.  If there is an
 643   undent followed by an indent, that is an error.  so a symbol can
 644   never carry both an indent and an undent.
 645   If it carries and undent, it is probably a closing bracket.
 646    if a {
 647       b
 648    }
 649   '}' carries an undent which balances the indent on 'b'.
 650   What about:
 651     if a:
 652        foo
 653     else:
 654        bar
 655
 656    The undent doesn't get attached to 'else' as indenting syntax is
 657    explict here.
 658    How about:
 659      if (a)
 660       {
 661         foo;
 662       }
 663      else
 664       {
 665         bar;
 666       }
 667
 668     'else' sees an undent and a newline.  The undent balances the
 669     indent in the 'block' so it only carries the newline.
 670
 671     But ... the 'else' - in both cases - gets a newline, and in the first
 672     it isn't allowed because it is at the same level as newlines being
 673     recognised terminators.  But in this case the 'then' *must* be on
 674     the next line because we just had a line-oriented block.
 675     Maybe we just have to 'expect' that newline.
 676
 677     Anyway, a symbol never carries both an indent and an undent.
 678     Can it carry a newline with either of those?  And newline 'inside'
 679     the dent will get ignored, so the interesting possibilities are:
 680       newline indent
 681     and
 682       undent newline
 683
 684     I think 'newline' can be leading while 'indent' is internal.
 685     However I don't think the other possibility is valid.
 686
 687
 688     'leading_nl' is one of "nothing", "newline" "indent"
 689       There can only be a single indent.
 690
 691     'internal_nl' can be a number of undents, or a number of indents,
 692       or nothing.  There are no 'internal newlines'.  If they were
 693       in a dent, they were ignored.  If they weren't then based on
 694       production they were either ignored or caused an error.
 695
 696     So to process a production, we first move the leading nl out
 697     of the way.  Then we look at leading, then internal for each node.
 698     If leading is:
 699        nothing: do nothing
 700        indent: increment current indent
 701        newline: if indented, ignore.  if outdented, error.
 702           If nonterm allows, ignore  else error
 703     Then add 'internal' to current indent.  However once a negative
 704     indent has been seen, no further positives are allowed.
 705
 706
 707 (09may2013)
 708 Damn - I've broken something.
 709 My grammar has "NEWLINE" at end of statement, but parser returns
 710 "UNDENT" which isn't recognised so it is swallowed.
 711 So I think I really want every NEWLINE to be visible.
 712 So as same-level newline is "NEWLINE"
 713 A great-level newline is "NEWLINE INDENT"
 714 A lesser-level newline is "NEWLINE UNDENT+ INDENT?"
 715
 716 If a grammer rule can be ended by a newline it just puts NEWLINE at
 717 the end.
 718 If a grammer rule requires an undent to close it, then it must
 719 also list an indent:
 720    IF -> if EXPR : INDENT commandlist UNDENT
 721
 722 The newline before the INDENT gets held, then destroyed by the INDENT.
 723 The newline before the UNDENT is probably match by the commandlist,
 724 otherwise it is help, then destroyed by the UNDENT.
 725 So we are moving NEWLINE-destruction from lexer to parser.
 726
 727 So the pending dents are NEWLINE then UNDENT then INDENT or NEWLINE
 728 The first NEWLINE is destroyed by following INDENT or UNDENT.  And
 729 UNDENT cancels a previous INDENT, but an INDENT doesn't cancel an
 730 UNDENT.  A newline following an INDENT is impossible for a terminal.
 731 So we can have UNDENT INDENT-or-NEWLINE where UNDENT can have a count.
 732
 733 These are then attached to the next thing shifted, except that there
 734 cannot be both an UNDENT and an INDENT when the SHIFT happens, else
 735 error unexpected INDENT.
 736
 737 So a shifted terminal has undent count and possible newline, or no
 738 undents and possible indent or newline
 739
 740 When we reduce a series of symbols to a nonterminal:
 741   - the tags on the first symbol are carried to the non-terminal
 742       because they come completely before both.
 743   - remaining indents and  undents must match, or must be able to
 744     cancel against pending undents, including the look-ahead symbol.
 745   - and newlines between indent and undent are ignored.
 746   -
 747
 748   Arg.
 749   There is a thing which is one of
 750         undents
 751         undents newline
 752         indents
 753         newline
 754         -nothing-
 755
 756   Each symbol has 2 of these: leading and internal.
 757   A Terminal always has -nothing- for the internal.  These are
 758   collected from previous reductions and shifted ignored symbols.
 759   A non terminal gain them in a reduction.
 760   The leading thing of a new nonterminal is the leading thing of the
 761   first reduced symbol.
 762   The internal thing of a new nonterminal is a combination of the
 763   leading and internal things (in that order) of the remaining
 764   symbols, together with some undents that are pending, or in the
 765   lookahead.
 766   As we merge the remaining things:
 767    - indents may not follow undents
 768    - newlines following an indent and before next undent are
 769      discarded.
 770    - indent then undent are discarded.
 771    - newlines with no indent are either discarded or cause an error
 772       depending on context
 773   So result for internal of new nonterminal is either
 774    - some number of undents
 775    - some number of indents
 776
 777   When we see an unexpected newline or indent it goes on the next
 778   terminal.
 779   When we see an unexpected undent it goes on the current
 780   top-of-stack as internal.
 781
 782   So:
 783    pending: 0 = nothing
 784             1 = newline
 785             2 = indent
 786    leading: same
 787    internal: count, -ve undents, +ve indents
 788
 789
 790 I'm getting close.
 791   Stray indents must be disallowed in the same places where stray
 792   newlines are.... or something.
 793   And how do I know where these stray things are allowed?
 794
 795   Hypothesis:  There are two sorts of nonterminals - horizontal
 796      and vertical.  Horizontal nonterminals are expected to be all on
 797      one line but can use indents for line continuation.  Their
 798      productions do not contain a NEWLINE symbol.
 799      Vertical nonterminals are expected to run vertically.  Their
 800      productions do often contain newlines, possibly indirectly.
 801
 802      A horizontal nonterminal may contain an indent, but not an
 803      unindented newline.
 804      A vertical nonterminal may contain newlines but no indents.
 805
 806      a = [
 807           1, 2,
 808           3, 4,
 809             5, 6,
 810          ]
 811
 812      The  list if number is horizontal and  starts with an indent, so
 813      safe.
 814      The [list] structure is also horizontal and contains an indent
 815      The second indent is legal because  it in horizontal.
 816      if a {
 817         a; b;
 818         c; d;
 819      }
 820
 821 FOUND A MISTAKE.  First token in production does not necessarily hold
 822 first terminal.  It could have an empty production.  That is easy to
 823 handle: reducing an empty production collects the pending stuff.
 824
 825
 826 double-damn!  I do need internal newlines after all - maybe.
 827
 828  {
 829     a =
 830     1
 831  }
 832
 833 That indent is used by statementlist which is vertical so the newline
 834 in binding, which is horizontal, isn't protected by the indent and so
 835 is not allowed.
 836 i.e. the indent at the start doesn't protect things.
 837  type str = {
 838         a: int;
 839         b: char;
 840  }
 841
 842 More thoughts next morning:
 843  - Newline before indent needs to be invisible, else the indent
 844    cannot effect line continuation DONE
 845  - hidden undents might blur with expected undents.  Need to decide if
 846    care is needed: almost certainly is
 847  - I think the horiz/vert distinction is on productions, not nonterms.
 848    A production  "A -> A ... " is vertical, others are horizontal.
 849    Maybe 'headed' and 'headless' are better term.
 850    A 'headless' production can contain unindented newlines. but not
 851    unexpected indents.  A 'headed' production can be indented but
 852    must not have an unindented newline.
 853
 854 And that afternoon....
 855    a = 1
 856      + 2
 857 is now complaining because it sees an indent in "1+2" which is
 858   Term -> Term + Factor
 859 so is vertical and doesn't like indent.  But the indent is really
 860 in "a = 1 + 2" which is horizontal so permitted.
 861 How to we express that?
 862
 863 A production that starts with an indent or newline can contain
 864 with a sequence of newlines (if vertical) or an indent and a sequence
 865 of newlines (if horizontal).
 866
 867 A production that does not start with Newline or Indent is deemed to
 868 be horizontal, so can contain an indent and some newlines.
 869
 870 A production which contains or starts with an indent which is not
 871 canceled by a trailing undent passes that indent up to it's parent.
 872
 873 So an indent must ultimately be on the first symbol of a recursive
 874 (vertical) production, or it must be the first indented symbol in a
 875 non-recursive (horizontal) production.
 876
 877 Do I really need internal indents?
 878
 879 A production starts:
 880   - on a newline (same indent as something earlier)
 881   - on an indented line
 882   - in the middle of a line
 883 Based on this and recursive nature we classify as horizontal or
 884 vertical (h if middle or non-recursive.  v if non-middle and recursive)
 885 The first non-middle token in the body must start with indent or
 886 newline corresponding to horiz or vert status.  Subsequent non-middle
 887 tokens may only start with a newline
 888 Any non-initial indent in a body must be closed before the production
 889 is reduced.  So a nonterm can only carry an initial indent.
 890
 891
 892  a = 1
 893    + 2 + 3
 894    + 4
 895
 896  a =
 897    1 + 2
 898
 899
 900 Wrong again!!!!
 901 Once we see an unexpected indent, future newlines are
 902 ignored. Sometimes.  Until we see the indent.
 903
 904 Can we have multiple unexpected indents?  Yes - if never expecting
 905 any.
 906 So we have a count of unexpected indents.  When we see one we expected
 907 we push the counter with the indent and restore it when the undent
 908 arrives.
 909
 910 If the pending production is horizontal, newlines are ignored.  If
 911 vertical, newlines are recognised.
 912
 913 (next morning - 11may2013)
 914
 915 So when we see a NEWLINE which the grammar allows, how do we know
 916 whether to SHIFT it or SKIP it?
 917 We SKIP if the it is or will be part of a production that contains an
 918 indent line continuation.  But it might be in lots of productions
 919 concurrently, and the LALR engine doesn't really tell us about
 920 productions until they are ready to reduce.
 921
 922 Proposal:
 923   Require NEWLINEs to appear in grammer following a nonterminal which
 924   is the entire object that they complete
 925     So if any command can be terminated by newline or semi, then
 926       Termcommand -> Command NEWLINE
 927                 | Command ;
 928       Commandlist -> CommandlistN
 929                 |
 930       CommandlistN -> Command
 931                 | CommandlistN Command
 932
 933     Alternately if the newline is only needed for some commands:
 934        Command -> IfCommand
 935                 | WhileCommand
 936                 | Assignment ;
 937                 | Assignment NEWLINE
 938                 | FunCall ;
 939                 | FunCall NEWLINE
 940
 941     Then when we see a NEWLINE that is expected we trace-ahead how
 942     many stack frames would be absorbed before it is shifted.
 943     i.e. while action(state, NEWLINE) == Reduce(production):
 944                 if production.len > 0:
 945                         tos -= production.len
 946                         state = tos.state
 947                 state = goto(state, production.head)
 948                 tos++
 949
 950     Now if any symbol from 'tos' to the real tos contains an Indent,
 951     we Skip the NEWLINE, otherwise we process all those reductions and
 952     then SHIFT the NEWLINE.
 953
 954     Can we deduce this ahead of time so the matcher just does a
 955     lookup?
 956     The result needed is stack depth to examine, and that cannot be
 957     known ahead of time.
 958     So we need the 'production->head' mapping to be more accessible,
 959     but still need to search.
 960
 961     How does this relate to 'recursive' and 'horiz'v'vert' ?
 962     Is the symbol before a NEWLINE there-fore horizontal?
 963     It shouldn't come up.
 964     We could flag every nonterm followed by NEWLINE as 'oneline',
 965     and flag all nonterms which produce recursion as 'recursive' and
 966     warn if any nonterm gets both flags.
 967
 968     When we 'Skip' a newline, do we still record it in m.nl?
 969     I think not, it would be irrelevant.
 970
 971     Do we want to search ahead every time we see a newline?
 972     We would need to skip to the next expected symbol, then see how
 973     many symbols are about to be reduced, and examine those symbols.
 974
 975     Maybe this leads to a very different approach.
 976     INDENT and UNDENT are still the same: attached to next symbol and
 977     passed back.
 978     But NEWLINEs are not.  When we see a newline, we optionally skip
 979     forward to the next expected symbol (or ERROR), then examine
 980     how much of the stack we would reduce.
 981     If there is a net indent, or a vertical production, we ignore the
 982     newline otherwise we remember the decision and start reducing.
 983
 984     How does that work?
 985       a = b
 986         + c
 987         + d
 988       return
 989
 990     The first newline appear only as INDENT and is recorded on the +
 991     The second would trigger a reduction back to "Assignment" "a = b +
 992     c" which contains an indent (on the '+') so the new is ignored and
 993     The '+' is seen and shifted.
 994     The third newline is likewise ignored.  The UNDENT gets queued
 995     and the NEWLINE now sees no net indent and so it isn't ignored.
 996     So the Assignment is reduced, the NEWLINE is SHIFTED and we reduce
 997     a command.
 998     a = [
 999         b, c,
1000         d, e,
1001         f, g,
1002         ]
1003     x = ...
1004
1005     First newline is only seen as indent, attached to 'b'.
1006     Second newline is not expected so we peek at 'd'.
1007     'd' is shiftable as current production is
1008        "listN -> listN , IDENTIFIER"
1009     This 'listN' contains the indent on 'b' as a leading indent,
1010     but as it is a recursive production, that is accepted as an
1011     indent and the newline is ignored.  This continues until we have
1012     the comma after 'g' shifted.
1013     Look-ahead is now ']' which triggers a reduction to 'List'
1014        List -> listN
1015                 |
1016     Which is not recursive.  Though it contains a recursive which
1017     has an indent, so that might be cool
1018     So we only go back until we find a suitable indent.
1019
1020     {
1021       thing
1022       a = b
1023       (c) + d
1024
1025     First newline gets attached to start of 'thing' as indent
1026     next terminates 'thing' as it is the next to be reduced
1027     Third reduces a=b to 'assignment' which has no indent, so is not
1028     ignored.
1029
1030     {
1031       if a {
1032          stuff
1033       }
1034       a = b
1035
1036     After the '}' we don't expect a newline because we have complete
1037     command.  We see a newline so are tempted to ignore it but need to
1038     check the indents.  look-ahead of 'a' (like everything else)
1039     reduces
1040          commandlistN -> commandlistN command
1041     which is recursive and has a leading indent, so we are good.
1042
1043     a = [
1044         b, c
1045         ,d, e
1046         ,f, g,
1047         ]
1048
1049     This is weird and probably should be illegal.
1050     The NEWLINE after 'c' is not expected so we peek at ','.
1051     This would be shifted so there is no indent to permit the 'ignore'
1052     and we trigger an error.  Probably good.
1053
1054     So to sum it all up.
1055
1056     - Lexer returns:
1057          INDENT  for a newline with an increased indent
1058          NEWLINE for a newline with the same indent
1059          NEWLINE UNDENT+ {NEWLINE | INDENT} for a newline with lesser
1060               indent.
1061     - Parser tracks:
1062          - depth of lexed "ignored_indent"  When an INDENT or UNDENT
1063            isn't expect by the grammar
1064          - number of "unparsed_undents" which is incremented when we
1065            accept an unexpected indent, and decremented when we reduce
1066            an indent away.
1067     - When an INDENT is shifted, "ignored_indent" is attached to is,
1068          and then set to zero.  "unparsed_undents" must be zero.
1069     - When an INDENT leads to ERROR, we tag the next terminal as
1070          indented and count the indent in "ignored_indent"
1071     - When an UNDENT is seen while the "ignored_indent is +ve, we
1072       decrement the UNDENT count and increment "unparsed_undents".
1073       However if there is a pending indent waiting to be shifted, we
1074       cancel that instead of incrementing "unparsed_undents".
1075     - When an UNDENT is seen while "ignored_indents" is zero, we either
1076       SHIFT it or trigger an ERROR.  On SHIFT we find the previous
1077       UNDENT (which must exist) and restore the indent count.
1078     - When we reduce a production we check that it has only one
1079       indented symbol which may be internal or at the start, and
1080       this must correspond to whether it is horizontal or vertical.
1081       If it was at the start, the resulting symbol gets the indent.
1082       If it is internal, "unparsed_undents" is decremented if it is
1083       +ve, otherwise the indent is 'carried forward' to the next
1084       symbol shifted.
1085
1086
1087     - When we see a NEWLINE that the grammar expects, we each
1088       production that will be reduced before it is shifted, and if any
1089       contain a legal indent we discard the newline, else we accept
1090       it.
1091     - When we see a NEWLINE that the grammar does not expect we skip
1092       it till we find an expected symbol and repeat the 'search
1093       reduceable productions' process and throw and error if the right
1094       indent isn't present.
1095
1096 if a and
1097    b:
1098      c
1099      d
1100    while true :
1101
1102
1103  New idea.  end-of-line is different from start-of-line.
1104  End of line may or may not terminate a nonterminal
1105  Start of line is one of 'indent' or 'same'.
1106  a new line is coded as one of:
1107     Indent
1108     Eol Same
1109     Eol Undent+ Same
1110     Eol Undent+ Indent
1111
1112  Eol, Indent and Undent can be in the grammar, Same cannot.
1113  Eol is ignored if not expected or a reduce production would contain
1114  an indent.
1115  NO that doesn't work - what would terminate that line? I was using
1116  the newline after the indent, but that doesn't exist any more.
1117
1118  Try again:
1119    When a newline is ignored by an indent, we count that and ignore
1120    all indent/undents and newlines until the matching undent arrives.
1121    That is parsed as a Newline.   This implies no nesting.  What
1122    about:
1123
1124       if a == b &&
1125          c == d {
1126                 print("Hello");
1127                 print("world");
1128       }
1129
1130    The first newline is an indent, so it is ignored but indent is
1131    remembered.
1132    The second newline??  The third shouldn't be ignored.
1133    Maybe the second is in a state where a recursive symbol is
1134    expected.
1135    But so is the first.
1136      if
1137         a == b
1138         && c == d {
1139                 print(x);
1140                 print(y);
1141      }
1142
1143     Why is newline after 'b' different to the one after (x); ??
1144     The one after the 'b' isn't expected.  The one after '(x);'
1145     isn't either.  So neither newlines are semantic.
1146     If the ';' was missing, the second newline would reduce to a
1147     vertical production, so is ignored.
1148
1149     What about
1150       a = b + c
1151               * d
1152           +e
1153     ??
1154     The first newline is followed by indent, so ignored.
1155     The second is at end if indented thing, so ignored.
1156     The third is still indented, so still ignored
1157     After the next undent a newline will fix it.
1158     So maybe we want:  { newline undent}* [ newline | indent ] ??
1159
1160
1161 OK, I have a new problem with indents.
1162
1163   if condition : INDENT statementlist UNDENT else : INDENT statementlist UNDENT
1164
1165 doesn't work, because it is a horizontal production, but the 'else' is not indented.
1166 Actually I have:
1167    If_Statement -> if Expression Block else Block
1168 and
1169    Block        -> : INDENT Statementlist UNDENT
1170         | { Statementlist }
1171
1172 Also with same productions:
1173
1174    if expression
1175    {
1176       stuff;
1177    }
1178 doesn't work because the '{' is not indented.
1179
1180 Can I find a clean way to allow these automatically, or do I need some explicit
1181 tagging in the grammar?
1182 Tagging with OptionalNewline seems to work and probably makes sense.  It would be
1183 hard to prevent:
1184
1185   function foo
1186   (thing) { ....
1187
1188 at the same time as allowing '\n{'
1189
1190
1191 ------------------
1192 Next steps (02May2013)
1193  - Get indenting and newlines worked out.
1194  - create basic structure of grammar.  This will be a long way from
1195    final, but creates room to try things out.
1196
1197    + A file contains declarations:
1198       types, functions, methods, constants, implementations
1199       maybe variables
1200       interface - between modules.  import/export etc
1201          init function?
1202
1203    + Types:  array, struct, enum, union? pointer function
1204          parameterized, string int float char Bool
1205          closures
1206
1207    + Code block:  loops, conditionals, break/continue/return
1208          catch, defer?
1209          actions: assign (= := += *= ...), funcall, methodcall
1210
1211    + expression: conditions, boolean, arithmetics, binary
1212         'in'
1213
1214    + bindings, variable declarations,  local variables etc.
1215        local functions?  are they just closures?
1216
1217
1218 -----------
1219 TO FIX
1220 DONE - ("  scanner doesn't notice quote
1221 DONE - grammar cannot say "no type of these nonterms
1222  - grammar to synthesise A' symbols
1223 DONE - better tracing - show full stack info
1224  - store 'indents' better.
1225  - need lots of test code.
1226  - use a 'tos' pointer rather than counter.
1227  - GPL notice?
1228  - easier to write 'main' function
1229  - make it easier to handle plain-text files.
1230
1231 shift-symbol, record-indents, handle linebreaks