ocean-lang.org Git - ocean-D/blob - lr

   1
   2 Now that I've implemented an LR parser which handles indents nicely I
   3 want to describe it.
   4
   5 First the basic LR parser.  This is nothing new.
   6
   7 There are a set of tokens - primarily numbers.  They can have content
   8 attached but that does not affect the parsing.
   9 Any difference that affects parsing must make a different token.
  10 So each reserved word is a token, each special symbol is too.
  11 Then 'identifier', 'number', 'string'. Are all tokens.
  12
  13 These are "terminals" - anything produced by lexical analysis aka
  14 scanning is a terminal token.
  15
  16 There are also "non-terminal" token.  These are the "head" of
  17 productions.
  18 For each nonterminal there must be at least one production, possibly
  19 several.  Each production consists of a "head" and a "body".  The body
  20 is simply a list of token, whether terminals or non-terminals.
  21
  22   If_statement -> if Expression then Statementlist else Statementlist fi
  23
  24 here I have capitalised the non-terminals.
  25 What this means is that if the head token is permitted somewhere, then
  26 the body sequence can be used in it's stead.  Alternately, if the body
  27 sequence is found where the head might be expected, then it can be
  28 treated as if it *was* the head.
  29
  30
  31 Parsing a text (sequence of terminals) against a grammar like this
  32 involves a sequence of "shift" and "reduce" steps.  This happens in
  33 the context of a stack.  Each entry on the stack represents a single
  34 token, though it may have other state attached.
  35 It also happens in the context of a 1-token lookahead.  As well as
  36 knowing what is in the stack, we know what the next terminal token is.
  37
  38 A "shift" operation pushes the look-ahead terminal onto the stack.
  39 This then allows us to see what the next terminal is.
  40 A "reduce" operation can happen when the last few tokens on the stack
  41 exactly match the body of some production (and when all the earlier
  42 tokens on the stack provides a valid preamble to the head of that
  43 production).
  44 When this happens we pop off the tokens representing the body, and
  45 push on the head token.
  46 So terminals get on to the stack from 'shift' and non-terminals get on
  47 the stack from 'reduce'.
  48
  49 Eventually a reduce state will push the special 'start' non-terminal
  50 on to the stack.  When that happens, we have finished.
  51
  52 So the first interesting question is: how do we know when to shift and
  53 when to reduce - and what production to reduce.  This is the fun bit.
  54
  55 First we need to introduce the idea of a 'state'.  A state if
  56 effectively some position in some production.  so one state would
  57 represent that state in the above production where 'then' has been
  58 shifted.  In this state we are expecting a statementlist.
  59 another state might be after the statementlist when we are expecting
  60 'else'.  In any state, we know there are a bunch of tokens on the
  61 stack, and  there is something we are expecting.
  62
  63 Every entry on the stack already mentioned has a 'state' along with
  64 the token. It is the state *after* that token was shifted or reduced.
  65 Based on the state of the top element of the stack, we know what to
  66 expect.  Based on the state and the look-ahead terminal, we know what
  67 to do.
  68
  69 To encode this knowledge we need a few different data structures.
  70 1/ An action table.  This maps from a state plus a look-ahead terminal
  71   to an action.  The action is one of "Shift", "Accept" (which means
  72   that Start symbol has been reduced) or "Reduce-N", where N is the
  73   number of the production to be reduced.
  74 2/ A go_to table, or state-transition table.  This maps a state plus a
  75   token to a new state.  Of the two states mentions above, the 'go_to'
  76   entry for the first state combined with the token "statementlist"
  77   would be the second state.
  78 3/ A production table.  For each production we need to know how many
  79   token to pop, which token to push, and maybe what action to perform
  80   when that happens.  Often this will extract information from the
  81   popped tokens and store it in the pushed token to build up an
  82   abstract syntax tree.
  83
  84
  85 In simple terms the action of parsing them becomes:
  86
  87   1- Set current state to '0'
  88   2- look up current state together with lookahead token in
  89      the action table.
  90   2a- if "Accept" - stop.
  91   2b- if shift, push 'state' and 'look-ahead' onto stack
  92   2c- if reduce, find production, perform action, pop appropriate
  93       number of tokens, keeping state from first token as current
  94       state.
  95       Then push head token and current state to stack
  96   3- look up current state and top-of-stack token in go-to table,
  97      and set current state to the value found.
  98   4- go to 2.
  99
 100 All we need now at those tables.  The production table is easily
 101 extracted from whatever form the grammar is presented in.  The action
 102 and go-to tables take a bit of work.  First we will need a few tools.
 103
 104 Firstly we need to store a set of small numbers.  This will be used
 105 both to hold a set of tokens, and a set of 'items' which will be
 106 described later.  Items are actually pairs of small numbers.  These
 107 sets need to support test for membership, insertion, union, iteration,
 108 and an equality test.
 109
 110 Next we need a search table that these sets can be stored in. The
 111 sets of items needs to be indexed so we can check if a particular
 112 itemset already exists.
 113
 114 Finally we will need some small lookup tables from token numbers to
 115 small numbers.  These will usually be fairly sparsely populated.
 116
 117 Given these we can start to find the states and create the goto table,
 118 and it is here that we need to understand what an item is.
 119
 120 An 'item' is a production plus a number of body tokens that have
 121 already been shifted.  So "5:0" refers to production 5 with no tokens
 122 shifted yet.  If production 5 had 4 body tokens, then "5:0", "5:1",
 123 "5:2", "5:3",  "5:4" would all be valid items.
 124
 125 "5:0" is a state where we are expecting the body of production 5.
 126 If the second body token in production 3 was the head of production 5:
 127
 128     3:   A -> X Y Z
 129     5:   Y -> B1 B2 B2 B4
 130
 131 Then "3:1" would also be a state when we are expecting the body of
 132 production '5'.  So "3:1" and "5:0" are really the same state.
 133
 134 A "State" then is a collection of "items".  Some of the items will
 135 have a non-zero token count, and some will have a zero token count.
 136 We need a data structure that contains a set of "items" - much like a
 137 set of token, so probably the same data structure can be used.  We
 138 need to be see if we already have a particular state so they will need
 139 to be stored in some sort of lookup table with the ability to test two
 140 states to see if they are equal.  It is best if the lookup only
 141 compares the non-zero items as they imply the zero items, but
 142 computing them for a new set is non-trivial.  If we can avoid that
 143 step when the state is already known, that is a useful optimisation.
 144
 145 We build the set of states effectively by recursing down the
 146 production tree.  We start with a state containing the item "0:0",
 147 This is assigned the next free state number, which is "0".
 148
 149 For each state we need to produce all the states that it can reach.
 150 These are states which we reach after shifting a token, possible via
 151 reduction, so there could be one for each token.  We will need a set
 152 of tokens we have dealt with already.
 153
 154 First we need to "complete" the itemset.  This involves adding all the
 155 ":0" items that are relevant.
 156
 157   a/ Initialise a set of possible 'next' tokens.
 158   b/ For each item in the state, consider the 'next' token in the
 159      body, if there is one.
 160   c/ if this token is already in our 'next' set, move on to the next item at
 161      b/
 162   d/ for every production from this token, add an item with a token
 163      count of zero
 164   e/ We've modified the itemset we are iterating, so start again at
 165   the beginning at 'b'.
 166
 167 Now we have a complete itemset and a set of possible 'next' we can
 168 look for states to go-to.
 169
 170 For each token in the 'next' set we potentially create a new itemset.
 171 To do that we look at each item in our completed set and see if the
 172 productions next token is the current next token.  If it is, we add
 173 the item to the new itemset with the token count incremented.
 174 Once we have this new itemset we check to see if it is already in the
 175 search table.  If not, we:
 176   - assign a number
 177   - add it to the table
 178   - add it to the list of states still to process
 179   - add 'current_state + current_token -> new_state' to the goto
 180     table.
 181
 182 Once we have completed and created goto tables for every state, we are
 183 ready to move on to creating the action table.
 184 Part of this is fairly straight forward, part is somewhat laborious
 185 but nor particularly difficult.  We'll do the straight forward bit
 186 first.
 187
 188 For each state we need to create a lookup table to map from a terminal
 189 to an action.  So we iterate through the states, allocate a lookup
 190 table, and then iterate through the items in that state.
 191
 192 If we find an item where the next token is a terminal, then we add a
 193 'SHIFT' action for that terminal in the current state.
 194 If we find an item where there is no next token  - the token count
 195 matches the body count for the production, then we want to add a
 196 REDUCE action for that production, but what lookahead tokens should we
 197 add it for?
 198
 199 In many cases it won't matter.  There is usually only one reducible
 200 production, so if we arrange to reduce it whenever there is no
 201 SHIFTable token, the parse will proceed correctly.  But sometimes it
 202 isn't that easy.  Consider the rather contrived grammar:
 203
 204         A -> B c
 205            | C d
 206         B -> x y
 207         C -> x y
 208
 209 If we see 'x' and then 'y', it would seem valid to reduce either of
 210 the last two productions.  What we choose depends on the next
 211 terminal.  If it is 'c' we want to reduce to B, if 'd' we want to
 212 reduce to 'C'.  To make that decision we need some idea of what the
 213 'next' token might be in different circumstances.
 214
 215 This is the somewhat laborious part, but thankfully we have a
 216 computer.  We need to complete 3 things for every non-terminal.
 217
 218 1/ Whether it can produce an empty string or not.  If there is a
 219    production:
 220
 221         A ->
 222
 223    Then clearly A can product the empty string.  Given that, if there
 224    is another production:
 225
 226        B -> A A
 227
 228    then B can also produce the empty string.  Rather then recursively
 229    walking from each nonterminal, we will use a dynamic programming
 230    approach were we repeatedly examine all productions flagging those
 231    non-terminals which can produce 'empty', until we don't find any
 232    more non-terminals to flag.
 233
 234 2/ The set of terminals which can come 'first' in the expansion of
 235    each non-terminal.  This clearly included the first token from each
 236    production of that nonterminal if the token is a terminal, but also
 237    all the 'first' terminals for the first token of each production
 238    when that is a non-terminal.  Further if some of the first tokens
 239    in a production can produce the empty string, then the 'first'
 240    terminals of later tokens may also been needed.
 241    Again we will use a dynamic programming approach to build up these
 242    lists until they stop changing.
 243
 244 3/ The set of terminals that can 'follow' a given non-terminal.  This
 245    is what we are really after.  Given this, we will know which
 246    lookahead terminal should trigger which reduction.
 247
 248
 249 Empty:
 250   To correctly set the 'empty' flags we first clear them all (assume
 251   nothing can produce 'empty' and then we walk the list of production
 252   and if we find one where the body is empty, or only contains symbols
 253   which are nonterminals that can produce empty, then we flag the head
 254   non-terminal as "can produce empty".
 255
 256   If while doing this we set the flag on any non-terminal, we need to
 257   loop back to the start and check every production again, as some
 258   more production might now be able to produce "empty".  When we have
 259   made a full pass with now changes we can move on.
 260
 261 First
 262   The "first" set for each terminal is trivially the set which
 263   contains just that terminal.  We might choose to create all of those
 264   for simplicity, or we might choose to handle terminals specially
 265   when creating unions later in this process.
 266
 267   For each non-terminal we start out assuming the set of 'first'
 268   terminals is the empty set.  Then for every production we examine
 269   all the token in the body up to and including (but not beyond) the
 270   first token which cannot produce 'empty'.   For each of these we union
 271   the 'first' terminals for the token to the set of 'first' terminals
 272   for the head of the production.
 273
 274   Once we have examined every production, if any 'first' set was
 275   modified we loop back to the start and examine every production
 276   again.  We keep going until a full pass has not added any terminals
 277   to any set.
 278
 279   Setting the 'empty' flag and computing 'first' can be combined.
 280   This should reduce the total number of passes by 1 as there will
 281   only be a single pass that does nothing.  It might reduce the number
 282   of passes a little more as more happens in each.
 283
 284 Follow
 285   Now we get to use the "first" sets to produce "follow" sets.
 286   We initially assume each 'follow' set is empty, and repeatedly scan
 287   the production list adding things to the 'follow' sets until one
 288   scan has made no changes (once again, a dynamic programming approach).
 289
 290   For each production, and for each token in that production, we add to
 291   the tokens 'follow' set, the entire 'first' set of the next token,
 292   and every subsequent token until one is found that does not produce
 293   'empty'.  As this part of calculating 'follow' is only based on
 294   'first' which does not change any more we do not need to repeat it
 295   until stable.  One the next step requires that.
 296
 297   For the last token in each production, and for every previous token
 298   until one if found that does not produce 'empty', the 'follow' set
 299   of the head token is added to the follow set of the body token.
 300
 301   Once these sets are stable, we will know every terminal which can
 302   possibly come after each non-terminal.
 303
 304
 305 Now that we have 'follow' we can examine each item in each itemset and
 306 decide on which SHIFT or REDUCE actions are appropriate for each
 307 look-ahead terminal.
 308 It is still possible that some conflict might occur - two different
 309 productions might suggest actions for the same terminal.
 310
 311 If two production both suggest 'SHIFT' for the same terminal, then
 312 that isn't a conflict.  We just SHIFT and don't worry which production
 313 it belongs to.
 314
 315 If one production suggests 'SHIFT' and one suggests 'REDUCE', then
 316 that is technically a conflict, but it is usually safe to resolve it
 317 in favour of the SHIFT.  This implies that we take the longest valid
 318 match which is often what is desired.
 319
 320 And example that can cause this sort of conflict happens with a
 321 grammar containing:
 322
 323    Statement -> if ( Expression ) Statement
 324              |  if ( Expression ) Statement else Statement
 325
 326 If we have:
 327
 328    if ( Expression ) if ( Expression ) Statement
 329
 330 on the stack, and we see an 'else', we could reduce to
 331
 332    if ( Expression ) Statement
 333
 334 or shift to
 335
 336    if ( Expression ) if ( Expression ) Statement else
 337
 338 In the 'shift' case the else clause would apply to the most recent
 339 'if', in the 'reduce' case it would end up applying to the more
 340 distant 'if'.  This 'shift' is probably preferred.
 341
 342 If two different productions suggest a 'REDUCE' for the same
 343 look-ahead character then that is a genuine conflict that cannot be
 344 resolved.  It means that the grammar is not LALR(1).  It might still
 345 be LR(1), but the simplifications of LALR(1) are over-simplification
 346 in this case.  In many cases though, a REDUCE/REDUCE conflict signals
 347 a genuine ambiguity in the grammar.
 348
 349 Expr -> Expr + Expr
 350      |  Val  + Expr
 351      |  Number
 352
 353 Val -> Number
 354
 355
 356 When we have a Number and we see a '+', do we reduce the number to a
 357 Val or to an Expr?
 358
 359
 360 ------------------------------------------
 361 I read the LALR paper:
 362
 363 http://3e8.org/pub/scheme/doc/parsing/Efficient%20Computation%20of%20LALR(1)%20Look-Ahead%20Sets.pdf
 364
 365 It doesn't help.
 366 So I'll figure it out myself.
 367
 368 The simple "FOLLOW" function find the 'next' terminal for each
 369 non-terminal in any context at all.  This can be too broad.
 370 LALR find a more focused FOLLOW function that is potentially different
 371 for each state.
 372
 373 For a start we can consider 'follow' for each specific instance of
 374 each non-terminal in the grammar.
 375 This is 'first' of the next symbol and if that symbol can be empty,
 376 then first of the following symbol etc.
 377 But what is 'follow' of the non-terminal at the end of a production?
 378 It is exactly FOLLOW of the non-terminal at the head of the
 379 production.  No, that is too simple.  We've discarded info about the
 380 state we are in.
 381
 382 Different approach.  Calculate the follow sets as we calculate the
 383 item sets.
 384 When we add an item to an itemset to complete it, we also add a
 385 'follow' set for that item, which is 'first' of the next token (after
 386 dot) possibly combined with 'first' of following tokens and possibly
 387 'follow' of the item.
 388 This 'follow' set remains with the item as it gets duplicated into
 389 other itemsets.  Once 'dot' gets to the end of the item, the 'follow'
 390 set becomes active for that itemset and that item.  i.e. if the
 391 look-ahead is in that follow set, the given item is reduced.
 392
 393 That seems to be a completely general rule that doesn't merge
 394 anything...
 395 Yes it does. It merged the itemsets earlier.
 396
 397 i.e. when we have built an itemset with attached follow sets for each
 398 time, we can only merge if the follow sets are the same, or don't
 399 conflict.
 400
 401 So: we need 'empty' and 'first'.
 402
 403 An itemset contains a list of
 404   production + offset + follow-set
 405
 406 An itemset if consistent if the follow-sets are disjoint and don't
 407 contain any terminal which is 'next' on and item.
 408 So while building an itemset we keep a set of all 'next' terminals and
 409 check for membership before adding.
 410
 411 We only merge two itemsets if merging the follow sets doesn't create
 412 an inconsistent itemset.
 413
 414 Precedence can help avoid inconsistent set.
 415 Each production can potentially be given a precedence and an
 416 associativity: left or right.
 417 When considering two productions that can use a symbol, if one has a
 418 higher precedence, it gets the symbol.
 419 If they have the same precedence they and have different associativity
 420 or no associativity there is a conflict.
 421 If they have associativity and one is a shift and one a reduce, then
 422 left-associative prefers reduce and right associative prefers shift.
 423
 424
 425 We really need the 'go to' table but the 'action' might be optional.
 426
 427 For LR(0), each state can identify a production to reduce.
 428 If there is no such production, we shift and follow the goto.
 429 If there is no 'goto' state it is an error.
 430
 431 For a simple LR(1), we try 'go to' first.  If that works we shift.  If
 432 it doesn't we apply the one reduction.  If there is no reduction we
 433 error.
 434
 435 Firstly we create the LR(0) states and impose precedence and see if
 436 there are any conflicts.  If not we finish.
 437
 438 Then we create first/follow and see if that resolves all conflicts.
 439 If yes, we finish.
 440 Can we do this just for conflicts?   A conflict it two or more
 441 productions in an itemset which can act.  Each conflicted state
 442 indicates a single non-terminal.  So we only need 'follow' for those.
 443 But to create that we need 'follow' for the head of any 'reduce'
 444 production.  So to limit the set of follows needed, we would need to
 445 know all the nonterminals which can be an ancestor of each state.
 446
 447 We can short-circuit the (SLR) FOLLOW step if a conflict is between
 448 two production with same head ... is that even possible?  Probably.
 449 But unlikely?
 450
 451 If creating 'follow' doesn't resolve conflicts we need to try LALR.
 452 This means, for each conflicted state, finding the 'follow' for the
 453 heads of the items in 'this' state.
 454 This involves following the 'goto' links backwards to all previous
 455 states and asking them ... something.
 456
 457 If this (conflicted) state has a completed production from X, we must
 458 have got here eventually from a state which eventually lead to this
 459 state, and has a 'goto' for 'X'.
 460
 461
 462 To find 'follow' for a given item X->a, we scan all states and see if
 463 following the body of the production (a) from each state T leads to the
 464 current state (S).
 465 If it does, we need to find FOLLOW(X, T) as the look-ahead tokens to
 466 choose this item.
 467 FOLLOW(X,T)  is IFOLLOW(X,T) unioned with other stuff.
 468 IFOLLOW(X,T), is the 'first' of whatever comes after X in items in T,
 469 If empty can come after X, we need to union in the follow of this item
 470 too.
 471
 472 IT is all too hard.  Let's not bother.
 473
 474 Just a simple LR(0) plus precedence and assuming SHIFT over REDUCE if
 475 no precedence info.
 476
 477 So we read a token and if cannot goto and we can reduce, we reduce.
 478 If we cannot reduce we have an error.
 479
 480
 481 However: I want to know about conflicts, not have them hidden.  So
 482 while I don't want to use a separate action table and don't want the
 483 parser to know which look ahead tokens are for which reduction,
 484 I want the grammar processor to know.
 485
 486 So:
 487
 488 Each itemset has a 'next' symset as well as 'items' and 'go_to'.
 489 This maps from non-terminal to magic number.
 490 The magic number is an index into a list of symset (which might be shared). These are pure sets of terminals.
 491
 492 These tell us the possible 'next' terminals after any for the
 493 productions form the given non-terminal in this state.
 494
 495 In state zero, there is one symset for 'start' and it contains only 'EOF'.
 496
 497 When we create a new symset, we do so given an old symset and a
 498 non-terminal.
 499
 500 As we add each advanced item from the old symset to the new, we copy
 501 the 'next' itemsef for the head non-terminal.
 502 As we add new itemsets we calculate the 'next' for the new
 503 non-terminal as the 'first' for the following symbols and, if they all
 504 can produce 'empty', we union with the 'next' for the head of the
 505 production we are basing on.
 506
 507 A finished itemset if consistent if all the 'next' itemsets for
 508 products that can reduce are disjoint, and are also disjoint with the
 509 set of terminals which can come 'next'.
 510 If it is inconsistent, that is an error.
 511 It can be made consistent by applying precedence rules.  This assigns
 512 a number to each production, and a left/right to each number.
 513 We don't record actions for items which have lower precedence than
 514 items we do record actions for.
 515
 516 When we create an itemset we might find that it already exists.  If it
 517 does we merge the new 'next' symbols (if any) into the old.  These
 518 changes potentially need to be carried forward, so we flag that
 519 itemset for review.
 520
 521 Precedence is set in the .gram file by lines that start with a lone $
 522 The first word is one of 'left' 'right' 'nonassoc' which creates a
 523 precedence level and sets associativity.
 524 After that come terminal.  A second '$' precedes a virtual-terminal
 525 name which can be linked to the level.
 526
 527 At the end of a production, '$$' before the '${' introduces a
 528 virtual-terminal which sets the precedence for this production,
 529 over-riding any defaults.
 530
 531 Precedence setting must come before the first use of a terminal.
 532
 533
 534 -------
 535 It's not works.  What exactly do I store again?
 536
 537 In each state we need the 'follow' set for each non-terminal.
 538 As we add non-terminals we calculate by looking at 'first' of stuff in
 539 production we started from and possibly the follow for the head of
 540 that production.
 541 For old non-terminals they come from the previous state.
 542
 543 So when we 'complete' a set by adding productions, we add new sets.
 544 When we create a new set, we copy what we have.
 545
 546
 547 It works.  But can I do better?
 548 - avoid multiple copies of 'next' sets using refcount.
 549 - even more by using a lookup table.
 550 Currently have 756 follows with 186 itemsets
 551 Only 73 unique next sets.
 552 346 calls to 'build' the 186 sets.  Some called 4 times.  Most called twice.
 553
 554 After adding refcount:
 555 487 follow sets, still 73 Unique.
 556 345 calls to 'build'
 557
 558
 559
 560 -----
 561 My LR code is wrong.
 562 With the itemset we just compare core, which is easily found.
 563 With the LA sets we also want to just compare the "core", but that is
 564 that?
 565 I'd be happy for equality to be "new is subset of found" but how then
 566 do we impose an ordering.
 567 As it is a sequential search, we could as the new is always >= old.
 568 That might do for a start...
 569
 570 No all wrong.
 571 I need an LA set for each item, not for each nonterminal.
 572
 573 In the core, LA is inherited from items in previous state.
 574 In the completion, LA is calculated from core and completion items and
 575 *may* be affected by changes to the LA of the core.
 576 However changes to LA of core will propagate to subsequent states so
 577 we need to do recalculation, so recording that 'may' probably isn't
 578 helpful.
 579
 580 So when we visit an item we always compute the completion so that we
 581 can update the look-ahead sets.