From 663f971dfab3f07a5153139cc7ca80c373dedd63 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Fri, 5 Mar 2021 20:07:55 +1100
Subject: [PATCH] parsergen: assorted updates to descriptive text.

Some of these are fixes for typos or poor grammar. Others update the
text to more closely match the code.

Signed-off-by: NeilBrown
---
 csrc/parsergen.mdc | 440 ++++++++++++++++++++++++---------------------
 1 file changed, 233 insertions(+), 207 deletions(-)

diff --git a/csrc/parsergen.mdc b/csrc/parsergen.mdc
index 2da1d65..e0a2167 100644
--- a/csrc/parsergen.mdc
+++ b/csrc/parsergen.mdc
@@ -1,9 +1,11 @@
-# LR(1) Parser Generator #
+# LR(1) Parser Generator with 2D support #
 
 This parser generator takes a grammar description combined with code
 fragments, analyses it, and can report details about the analysis and
 write out C code files which can be compiled to make a parser.
 
+"2D support" means that indentation and line breaks can be significant.
+
 There are several distinct sections.
 
 1. `grammar_read` will read a grammar from a literate-code file and
@@ -52,18 +54,20 @@ There are several distinct sections.
 
 ## Reading the grammar
 
-The grammar must be stored in a markdown literate code file as parsed
-by "[mdcode][]". It must have three top level (i.e. unreferenced)
-sections: `header`, `code`, and `grammar`. The first two will be
-literally copied into the generated `.c` and `.h`. files. The last
-contains the grammar. This is tokenised with "[scanner][]".
+The grammar must be stored in a markdown literate code file as parsed by
+"[mdcode][]". It may have up to four top level (i.e. unreferenced)
+sections: `header`, `code`, `grammar` and `reduce`. The first two, if
+present, will be literally copied into the generated `.c` and `.h`
+files. The third is required and contains the grammar which is
+tokenised with "[scanner][]". `reduce` can provide variable
+declarations and common initialization code for all the reduction code
+fragments in the grammar.
 If the `--tag` option is given, then any top level header that doesn't
 start with the tag is ignored, and the tag is striped from the rest. So
-`--tag Foo`
-means that the three needed sections must be `Foo: header`, `Foo: code`,
-and `Foo: grammar`. The tag `calc` is used to extract the same calculator
-from this file.
+`--tag Foo` means that the four sections expected are `Foo: header`,
+`Foo: code`, `Foo: grammar` and `Foo: reduce`. The tag `calc` is used
+to extract the sample calculator from this file.
 
 [mdcode]: mdcode.html
 [scanner]: scanner.html
@@ -77,10 +81,11 @@ from this file.
 	#include "scanner.h"
 
 The grammar contains production sets, precedence/associativity
-declarations, and data type declarations. These are all parsed with
-_ad hoc_ parsing as we don't have a parser generator yet.
+declarations, general terminal declarations, and data type declarations.
+These are all parsed with _ad hoc_ parsing as we don't have a parser
+generator yet.
 
-The precedence and associativity can be set for each production, but
+The precedence and associativity can be set for each production, and
 can be inherited from symbols. The data type (either structure or a
 reference to a structure) is potentially defined for each non-terminal
 and describes what C structure is used to store information about each
@@ -107,11 +112,11 @@ symbol.
 
 ###### grammar fields
 	};
 
-The strings reported by `mdcode` and `scanner` are `struct text` which have
-length rather than being null terminated. To help with printing and
+The strings reported by `mdcode` and `scanner` are `struct text` which
+have length rather than being nul terminated. To help with printing and
 comparing we define `text_is` and `prtxt`, which should possibly go in
-`mdcode`. `scanner` does provide `text_dump` which is useful for strings
-which might contain control characters.
+`mdcode`. `scanner` does provide `text_dump` which is useful for
+strings which might contain control characters.
 `strip_tag` is a bit like `strncmp`, but adds a test for a colon,
 because that is what we need to detect tags.
 
@@ -153,7 +158,8 @@ is yet.
 
 To help with code safety it is possible to declare the terminal symbols.
 If this is done, then any symbol used in a production that does not
-appear in a head and is not declared is treated as an error.
+appear in the head of any production and is not declared as a terminal
+is treated as an error.
 
 ###### forward declarations
 	enum symtype { Unknown, Virtual, Terminal, Nonterminal };
@@ -249,7 +255,7 @@ symbol, but its type might be `Unknown`.
 
 Data type specification, precedence specification, and declaration of
 terminals are all introduced by a dollar sign at the start of the line.
 If the next word is `LEFT`, `RIGHT` or `NON`, then the line specifies a
-precedence, if it is `TERM` the the line declares terminals without
+precedence, if it is `TERM` then the line declares terminals without
 precedence, otherwise it specifies a data type.
 
 The data type name is simply stored and applied to the head of all
@@ -262,18 +268,18 @@ Any productions given before the first data type declaration will have
 no data type associated with them and can carry no information. In
 order to allow other non-terminals to have no type, the data type
 `$void` can be given. This does *not* mean that `struct void` will be
-used, but rather than no type will be associated with future
+used, but rather that no type will be associated with subsequent
 non-terminals.
 
-The precedence line must contain a list of symbols - typically
-terminal symbols, but not necessarily. It can only contain symbols
-that have not been seen yet, so precedence declaration must precede
-the productions that they affect.
+The precedence line must contain a list of symbols, either terminal
+symbols or virtual symbols. It can only contain symbols that have not
+been seen yet, so precedence declarations must precede the productions
+that they affect.
 
-A precedence line may also contain a "virtual" symbol which is an
-identifier preceded by `$$`. Virtual terminals carry precedence
-information but are not included in the grammar. A production can
-declare that it inherits the precedence of a given virtual symbol.
+A "virtual" symbol is an identifier preceded by `$$`. Virtual symbols
+carry precedence information but are not included in the grammar. A
+production can declare that it inherits the precedence of a given
+virtual symbol.
 
 This use for `$$` precludes it from being used as a symbol in the
 described language. Two other symbols: `${` and `}$` are also
@@ -285,7 +291,12 @@ listed and may be inherited by any production which uses the symbol.
 
 A production inherits from the last symbol which has a precedence.
 The symbols on the first precedence line have the lowest precedence.
-Subsequent lines introduce symbols with higher precedence.
+Subsequent lines introduce symbols with higher precedence and so bind
+more tightly.
+
+Note that a declaration line (or "dollar line") can start with either of
+two different marks: "$" or "$*". The `dollar_line()` function defined
+here is told which was found in the `isref` argument.
 
 ###### grammar fields
 	struct text current_type;
@@ -415,7 +426,7 @@ for a terminal symbol is `struct token`. For a non-terminal, it is
 whatever has been declared for that symbol. The `<` may be included and
 means that the value (usually a reference) is being moved out, so it
 will not automatically be freed. The effect of using '<' is that the
-variable is cleareed to all-zeros.
+variable is cleared to all-zeros.
 
 While building productions we will need to add to an array which needs
 to grow dynamically.
@@ -553,15 +564,15 @@ Now we have all the bits we need to parse a full production.
 	return err;
 }
 
-With the ability to parse production and dollar-lines, we have nearly all
+With the ability to parse productions and dollar-lines, we have nearly all
 that we need to parse a grammar from a `code_node`.
 
-The head of the first production will effectively be the `start` symbol of
-the grammar. However it won't _actually_ be so. Processing the grammar is
-greatly simplified if the real start symbol only has a single production,
-and expects `$eof` as the final terminal. So when we find the first
-explicit production we insert an extra production as production zero which
-looks like
+The head of the first production will effectively be the `start` symbol
+of the grammar. However it won't _actually_ be so. Processing the
+grammar is greatly simplified if the real start symbol only has a single
+production, and expects `$eof` as the final terminal. So when we find
+the first explicit production we insert an extra implicit production as
+production zero which looks like
 
 ###### Example: production 0
 	$start -> START $eof
 
@@ -589,6 +600,11 @@ and strings so that the marks which introduce them can be used as
 terminals in the grammar. We don't ignore numbers even though we don't
 allow them as that causes the scanner
 to produce errors that the parser is better positioned to handle.
+We also ignore indentation, but we expect and use newlines.
+
+To allow comments in the grammar, we explicitly check for "//" as the
+first token on the line and those lines are skipped. "//" can still be
+used as a terminal anywhere that a terminal is expected.
 
 ###### grammar_read
 	static struct grammar *grammar_read(struct code_node *code)
@@ -695,19 +711,18 @@ to produce errors that the parser is better positioned to handle.
 
 ## Analysing the grammar
 
 The central task in analysing the grammar is to determine a set of
-states to drive the parsing process. This will require finding
-various sets of symbols and of "items". Some of these sets will need
-to attach information to each element in the set, so they are more
-like maps.
+states to drive the parsing process. This will require finding various
+sets of symbols and of "items". Some of these sets will need to attach
+information to each element in the set, so they are more like maps.
 
-Each "item" may have a 'look-ahead' or `LA` set associated with
-it. Many of these will be the same as each other. To avoid memory
-wastage, and to simplify some comparisons of sets, these sets will be
-stored in a list of unique sets, each assigned a number.
+Each "item" may have a 'look-ahead' or `LA` set associated with it.
+Many of these will be the same as each other. To avoid memory wastage,
+and to simplify some comparisons of sets, these sets will be stored in a
+list of unique sets, each assigned a number.
 
-Once we have the data structures in hand to manage these sets and
-lists, we can start setting the 'nullable' flag, build the 'FIRST'
-sets, and then create the item sets which define the various states.
+Once we have the data structures in hand to manage these sets and lists,
+we can start setting the 'nullable' flag, build the 'FIRST' sets, and
+then create the item sets which define the various states.
 
 ### Symbol sets.
 
 Though we don't only store symbols in these sets, they are the main
 things we store, so they are called symbol sets or "symsets".
 
 A symset has a size, an array of shorts, and an optional array of data
-values, which are also shorts. If the array of data is not present, we
+store an impossible pointer, as `NULL` is used to indicate that no
 memory has been allocated yet;
 
 ###### declarations
 	const struct symset INIT_SYMSET = { 0, NULL, NO_DATA };
 	const struct symset INIT_DATASET = { 0, NULL, NULL };
 
-The arrays of shorts are allocated in blocks of 8 and are kept sorted
-by using an insertion sort. We don't explicitly record the amount of
-allocated space as it can be derived directly from the current `cnt` using
-`((cnt - 1) | 7) + 1`.
+The arrays of shorts are allocated in blocks of 8 and are kept sorted by
+using an insertion sort. We don't explicitly record the amount of
+allocated space as it can be derived directly from the current `cnt`
+using `((cnt - 1) | 7) + 1`.
 
 ###### functions
 	static void symset_add(struct symset *s, unsigned short key, unsigned short val)
@@ -781,11 +796,11 @@ or `-1` to indicate failure.
 	}
 
 We will often want to form the union of two symsets. When we do, we
-will often want to know if anything changed (as that might mean there
-is more work to do). So `symset_union` reports whether anything was
-added to the first set. We use a slow quadratic approach as these
-sets don't really get very big. If profiles shows this to be a problem it
-can be optimised later.
+will often want to know if anything changed (as that might mean there is
+more work to do). So `symset_union` reports whether anything was added
+to the first set. We use a slow quadratic approach as these sets are
+rarely large. If profiling shows this to be a problem it can be
+optimised later.
 
 	static int symset_union(struct symset *a, struct symset *b)
 	{
@@ -816,9 +831,9 @@ And of course we must be able to free a symset.
 
 Some symsets are simply stored somewhere appropriate in the `grammar`
 data structure, others needs a bit of indirection. This applies
 particularly to "LA" sets. These will be paired with "items" in an
 "itemset". As itemsets will be stored in a symset, the "LA" set needs to be
-stored in the `data` field, so we need a mapping from a small (short)
-number to an LA `symset`.
+"itemset". As itemsets will be stored in a symset, the "LA" set needs
+to be stored in the `data` field, so we need a mapping from a small
+(short) number to an LA `symset`.
 
 As mentioned earlier this involves creating a list of unique symsets.
 
@@ -875,9 +890,9 @@ would be more efficient and may be added later.
 	return s->num;
 }
 
-Finding a set by number is currently performed by a simple linear search.
-If this turns out to hurt performance, we can store the sets in a dynamic
-array like the productions.
+Finding a set by number is currently performed by a simple linear
+search. If this turns out to hurt performance, we can store the sets in
+a dynamic array like the productions.
 
 	static struct symset set_find(struct grammar *g, int num)
 	{
@@ -974,20 +989,22 @@ attribute of symbols, and compute it in a repetitive manner similar to
 
 ### Building the `first` sets
 
-When calculating what can follow a particular non-terminal, we will need to
-know what the "first" terminal in any subsequent non-terminal might be. So
-we calculate the `first` set for every non-terminal and store them in an
-array. We don't bother recording the "first" set for terminals as they are
-trivial.
+When calculating what can follow a particular non-terminal, we will need
+to know what the "first" terminal in any subsequent non-terminal might
+be. So we calculate the `first` set for every non-terminal and store
+them in an array. We don't bother recording the "first" set for
+terminals as they are trivial.
 
-As the "first" for one symbol might depend on the "first" of another,
-we repeat the calculations until no changes happen, just like with
-`set_nullable`. This makes use of the fact that `symset_union`
-reports if any change happens.
+As the "first" for one symbol might depend on the "first" of another, we
+repeat the calculations until no changes happen, just like with
+`set_nullable`. This makes use of the fact that `symset_union` reports
+if any change happens.
 
 The core of this, which finds the "first" of part of a production body,
-will be reused for computing the "follow" sets, so we split it out
-into a separate function.
+will be reused for computing the "follow" sets, so we split that out
+into a separate function, `add_first`, which also reports if it got all
+the way to the end of the production without finding a non-nullable
+symbol.
 
 ###### grammar fields
 	struct symset *first;
 
@@ -1041,18 +1058,18 @@
 
 ### Building the `follow` sets.
 
-There are two different situations when we will want to generate "follow"
-sets. If we are doing an SLR analysis, we want to generate a single
-"follow" set for each non-terminal in the grammar. That is what is
-happening here. If we are doing an LALR or LR analysis we will want
-to generate a separate "LA" set for each item. We do that later
-in state generation.
+There are two different situations when we will want to generate
+"follow" sets. If we are doing an SLR analysis, we want to generate a
+single "follow" set for each non-terminal in the grammar. That is what
+is happening here. If we are doing an LALR or LR analysis we will want
+to generate a separate "LA" set for each item. We do that later in
+state generation.
 
 There are two parts to generating a "follow" set. Firstly we look at
-every place that any non-terminal appears in the body of any
-production, and we find the set of possible "first" symbols after
-there. This is done using `add_first` above and only needs to be done
-once as the "first" sets are now stable and will not change.
+every place that any non-terminal appears in the body of any production,
+and we find the set of possible "first" symbols after there. This is
+done using `add_first` above and only needs to be done once as the
+"first" sets are now stable and will not change.
 
 ###### follow code
 
@@ -1074,6 +1091,10 @@ other symbol which is followed only by `nullable` symbols. As this
 depends on "follow" itself we need to repeatedly perform the process
 until no further changes happen.
 
+Rather than two nested loops, one that repeats the whole process until
+there is no change, and one that iterates over the set of productions,
+we combine these two loops into a single loop.
+
 ###### follow code
 
 	for (again = 0, p = 0;
@@ -1117,11 +1138,13 @@ complete function.
 
 There are three different levels of detail that can be chosen for
 building the itemsets and states for the LR grammar. They are:
 
-1. LR(0) or SLR(1), where no look-ahead is considered.
+1. LR(0), LR(0.5), or SLR(1), where no look-ahead is included in the
+   itemsets - look-ahead, if used at all, is simply the 'follow' sets
+   already computed,
 2. LALR(1) where we build look-ahead sets with each item and merge
    the LA sets when we find two paths to the same "kernel" set of items.
 3. LR(1) where different look-ahead for any item in the set means
-   a different state must be created.
+   a different item set must be created.
 
 ###### forward declarations
 	enum grammar_type { LR0, LR05, SLR, LALR, LR1 };
@@ -1131,10 +1154,10 @@ generated state already exists. For now we use a simple sorted linked
 list.
 
 An item is a pair of numbers: the production number and the position of
-"DOT", which is an index into the body of the production.
-As the numbers are not enormous we can combine them into a single "short"
-and store them in a `symset` - 4 bits for "DOT" and 12 bits for the
-production number (so 4000 productions with maximum of 15 symbols in the
+"DOT", which is an index into the body of the production. As the
+numbers are not enormous we can combine them into a single "short" and
+store them in a `symset` - 5 bits for "DOT" and 11 bits for the
+production number (so 2000 productions with maximum of 31 symbols in the
 body).
 
 Comparing the itemsets will be a little different to comparing symsets
@@ -1203,7 +1226,8 @@ can just compare the symset and the data values together.
 
 It will be helpful to know if an itemset has been "completed" or not,
 particularly for LALR where itemsets get merged, at which point they
-need to be consider for completion again. So a `completed` flag is needed.
+need to be considered for completion again. So a `completed` flag is
+needed.
 
 For correct handling of `TK_newline` when parsing, we will need to
 know which states (itemsets) can occur at the start of a line, so we
@@ -1222,8 +1246,8 @@ indented section must have ended in the middle of a syntactic unit, so
 an error must be signaled.
 
 And now we can build the list of itemsets. The lookup routine returns
-both a success flag and a pointer to where in the list an insert
-should happen, so we don't need to search a second time.
+both a success flag and a pointer to where in the list an insert should
+happen, so we don't need to search a second time.
 
 ###### declarations
 	struct itemset {
@@ -1265,9 +1289,9 @@ should happen, so we don't need to search a second time.
 	}
 
 Adding an itemset may require merging the LA sets if LALR analysis is
-happening. If any new LA set adds any symbols that weren't in the old LA set, we
-clear the `completed` flag so that the dependants of this itemset will be
-recalculated and their LA sets updated.
+happening. If any new LA set adds any symbols that weren't in the old
+LA set, we clear the `completed` flag so that the dependants of this
+itemset will be recalculated and their LA sets updated.
 
 `add_itemset` must consume the symsets it is passed, either by adding
 them to a data structure, of freeing them.
@@ -1315,32 +1339,36 @@ them to a data structure, of freeing them.
 
 #### The build
 
-To build all the itemsets, we first insert the initial itemset made
-from production zero, complete each itemset, and then generate new
-itemsets from old until no new ones can be made.
+To build all the itemsets, we first insert the initial itemset made from
+production zero, complete each itemset, and then generate new itemsets
+from old until no new ones can be made.
 
-Completing an itemset means finding all the items where "DOT" is followed by
-a nonterminal and adding "DOT=0" items for every production from that
-non-terminal - providing each item hasn't already been added.
+Completing an itemset means finding all the items where "DOT" is
+followed by a nonterminal and adding "DOT=0" items for every production
+from that non-terminal - providing each item hasn't already been added.
+When we add such an item it might get inserted before the current
+one, so to make sure we process it we reset the loop counter to the
+start.
 
 If LA sets are needed, the LA set for each new item is found using
-`add_first` which was developed earlier for `FIRST` and `FOLLOW`. This may
-be supplemented by the LA set for the item which produce the new item.
-
-We also collect a set of all symbols which follow "DOT" (in `done`) as this
-is used in the next stage.
-If any of these symbols are flagged as `line_like`, then this
-state must be a `starts_line` state so now is a good time to record that.
-
-When itemsets are created we assign a precedence to the itemset from
-the complete item, if there is one. We ignore the possibility of
-there being two and don't (currently) handle precedence in such
-grammars. When completing a grammar we ignore any item where DOT is
-followed by a terminal with a precedence lower than that for the
-itemset. Unless the terminal has right associativity, we also ignore
-items where the terminal has the same precedence. The result is that
-unwanted items are still in the itemset, but the terminal doesn't get
-into the go to set, so the item is ineffective.
+`add_first` which was developed earlier for `FIRST` and `FOLLOW`. This
+may be supplemented by the LA set for the item which produced the new
+item.
+
+We also collect a set of all symbols which follow "DOT" (in `done`) as
+this is used in the next stage. If any of these symbols are flagged as
+`line_like`, then this state must be a `starts_line` state so now is a
+good time to record that.
+
+When itemsets are created we assign a precedence to the itemset from the
+complete item, if there is one. We ignore the possibility of there
+being two and don't (currently) handle precedence in such grammars.
+When completing a grammar we ignore any item where DOT is followed by a
+terminal with a precedence lower than that for the itemset. Unless the
+terminal has right associativity, we also ignore items where the
+terminal has the same precedence. The result is that unwanted items are
+still in the itemset, but the terminal doesn't get into the go to set,
+so the item is ineffective.
 
 ###### complete itemset
 	for (i = 0; i < is->items.cnt; i++) {
@@ -1428,10 +1456,10 @@
 		}
 	}
 
-For each symbol we found (and placed in `done`) we collect all the items for
-which this symbol is next, and create a new itemset with "DOT" advanced over
-the symbol. This is then added to the collection of itemsets (or merged
-with a pre-existing itemset).
+For each symbol we found (and placed in `done`) we collect all the items
+for which this symbol is next, and create a new itemset with "DOT"
+advanced over the symbol. This is then added to the collection of
+itemsets (or merged with a pre-existing itemset).
 
 ###### derive itemsets
 	// Now we have a completed itemset, so we need to
@@ -1537,9 +1565,9 @@ doesn't use them. They will be calculated to some extent if needed
 for a report.
 Once we have built everything we allocate arrays for the two lists:
-symbols and itemsets. This allows more efficient access during reporting.
-The symbols are grouped as terminals and non-terminals and we record the
-changeover point in `first_nonterm`.
+symbols and itemsets. This allows more efficient access during
+reporting. The symbols are grouped as terminals and then non-terminals,
+and we record the changeover point in `first_nonterm`.
 
 ###### grammar fields
 	struct symbol **symtab;
 	struct itemset **statetab;
 	int first_nonterm;
@@ -1607,8 +1635,7 @@ all the tables that have been generated, plus a description of any
 conflicts.
 
 Firstly we have the complete list of symbols, together with the "FIRST"
 set if that was generated. We add a mark to each symbol to
-show if it can end in a newline (`>`), if it is considered to be
-"line-like" (`<`), or if it is nullable (`.`).
+show if it is considered to be "line-like" (`<`), or if it is nullable (`.`).
 
 ###### functions
 
@@ -1762,7 +1789,7 @@ Now we can report all the item sets complete with items, LA sets, and
 GO TO.
 
 ### Reporting conflicts
 
 Conflict detection varies a lot among different analysis levels. However
-LR0 and LR0.5 are quite similar - having no follow sets, and SLR, LALR and
+LR0 and LR0.5 are quite similar - having no follow sets - and SLR, LALR and
 LR1 are also similar as they have FOLLOW or LA sets.
 
 ###### functions
 
@@ -1935,7 +1962,7 @@ known words added and then is used with the `code_node` to initialize
 the scanner.
 
 `parse_XX` then calls the library function `parser_run` to actually complete
-the parse. This needs the `states` table and function to call the various
+the parse. This needs the `states` table and functions to call the various
 pieces of code provided in the grammar file, so they are generated first.
 
 ###### parser_generate
 
@@ -1967,9 +1994,7 @@ pieces of code provided in the grammar file, so they are generated first.
 
 ### Known words table
 
 The known words table is simply an array of terminal symbols.
-The table of nonterminals used for tracing is a similar array. We
-include virtual symbols in the table of non_terminals to keep the
-numbers right.
+The table of nonterminals used for tracing is a similar array.
 
 ###### functions
 
@@ -2002,12 +2027,12 @@
 
 ### States and the goto tables.
 
-For each state we record the goto table, the reducible production if
-there is one, or a symbol to shift for error recovery.
+For each state we record the goto table and details of the reducible
+production if there is one.
 
 Some of the details of the reducible production are stored in the
-`do_reduce` function to come later. Here we store the production number,
-the body size (useful for stack management) and the resulting symbol (useful
-for knowing how to free data later).
+`do_reduce` function to come later. Here we store the production
+number, the body size (useful for stack management), and the resulting
+symbol (useful for knowing how to free data later).
 
 The go to table is stored in a simple array of `sym` and corresponding
 `state`.
 
@@ -2066,7 +2091,7 @@
 			if (bp < pr->body_size)
 				continue;
-			/* This is what we reduce */
+			/* This is what we reduce - choose longest */
 			if (prod < 0 || prod_len < pr->body_size) {
 				prod = p;
 				prod_len = pr->body_size;
@@ -2091,31 +2116,31 @@
 
 ### The `do_reduce` function and the code
 
-When the parser engine decides to reduce a production, it calls `do_reduce`.
-This has two functions.
+When the parser engine decides to reduce a production, it calls
+`do_reduce` which runs the code that was included with the production,
+if any.
 
-Firstly, if a non-NULL `trace` file is passed, it prints out details of the
-production being reduced. Secondly it runs the code that was included with
-the production if any.
-This code needs to be able to store data somewhere. Rather than requiring
-`do_reduce` to `malloc` that "somewhere", we pass in a large buffer and have
-`do_reduce` return the size to be saved.
+This code needs to be able to store data somewhere. Rather than
+requiring `do_reduce` to `malloc` that "somewhere", we pass in a large
+buffer and have `do_reduce` return the size to be saved.
 
 In order for the code to access "global" context, we pass in the
-"config" pointer that was passed to parser function. If the `struct
+"config" pointer that was passed to the parser function. If the `struct
 token_config` is embedded in some larger structure, the reducing code
-can access the larger structure using pointer manipulation.
-
-The code fragment requires translation when written out. Any `$N` needs to
-be converted to a reference either to that buffer (if `$0`) or to the
-structure returned by a previous reduction. These pointers need to be cast
-to the appropriate type for each access. All this is handled in
+can access the larger structure using pointer manipulation. Performing
+that pointer manipulation, and any other common code or variables for
+all reduce actions, can be provided in a code node called "reduce" which
+is passed around in `pre_reduce` which you might have already noticed.
+
+The code fragments require translation when written out. Any `$N` needs
+to be converted to a reference either to that buffer (if `$0`) or to the
+structure returned by a previous reduction. These pointers need to be
+cast to the appropriate type for each access. All this is handled in
 `gen_code`.
 
 `gen_code` also allows symbol references to contain a '`<`' as in
 '`$<2`'. This is particularly useful for references (or pointers), but
-can be used with structures too. The `<` implies that the value it
+can be used with structures too. The `<` implies that the value is
 being moved out, so the object will not be automatically freed. It is
 equivalent to assigning `NULL` to the pointer or filling a structure
 with zeros.
@@ -2123,13 +2148,13 @@ with zeros. Instead of a number `N`, the `$` or `$<` can be followed by some letters and possibly a number. A number by itself (other than zero) selects a symbol from the body of the production. A sequence of letters selects -the shortest symbol in the body which contains those letters in the given -order. If a number follows the letters, then a later occurrence of -that symbol is chosen. So "`$AB2`" will refer to the structure attached -to the second occurrence of the shortest symbol which contains an `A` -followed by a `B`. If there is no unique shortest system, or if the -number given is too large, then the symbol reference is not transformed, -and will cause an error when the code is compiled. +the shortest symbol in the body which contains those letters in the +given order. If a number follows the letters, then a later occurrence +of that symbol is chosen. So "`$AB2`" will refer to the structure +attached to the second occurrence of the shortest symbol which contains +an `A` followed by a `B`. If there is no unique shortest system, or if +the number given is too large, then the symbol reference is not +transformed, and will cause an error when the code is compiled. ###### functions @@ -2447,11 +2472,11 @@ To be able to run `mdcode` and `scanner` on the grammar we need to memory map it. Once we have extracted the code (with `mdcode`) we expect to find three -sections: header, code, and grammar. Anything else that is not -excluded by the `--tag` option is an error. +or four sections: header, code, grammar, reduce. +Anything else that is not excluded by the `--tag` option is an error. -"header" and "code" are optional, though it is hard to build a working -parser with neither. "grammar" must be provided. +"header", "code", and "reduce" are optional, though it is hard to build +a working parser without the first two. "grammar" must be provided. 
###### includes #include @@ -2605,8 +2630,8 @@ recognised properly, and link with `libicuuc` as `libmdcode` requires that. ## The SHIFT/REDUCE parser -Having analysed the grammar and generated all the tables, we only need the -shift/reduce engine to bring it all together. +Having analysed the grammar and generated all the tables, we only need +the shift/reduce engine to bring it all together. ### Goto table lookup @@ -2638,24 +2663,25 @@ table. ### The state stack. -The core data structure for the parser is the stack. This tracks all the -symbols that have been recognised or partially recognised. - -The stack usually won't grow very large - maybe a few tens of entries. So -we dynamically resize and array as required but never bother to shrink it -down again. - -We keep the stack as two separate allocations. One, `asn_stack` stores the -"abstract syntax nodes" which are created by each reduction. When we call -`do_reduce` we need to pass an array of the `asn`s of the body of the -production, and by keeping a separate `asn` stack, we can just pass a -pointer into this stack. - -The other allocation stores all other stack fields of which there are six. -The `state` is the most important one and guides the parsing process. The -`sym` is nearly unnecessary. However when we want to free entries from the -`asn_stack`, it helps to know what type they are so we can call the right -freeing function. The symbol leads us to the right free function through +The core data structure for the parser is the stack. This tracks all +the symbols that have been recognised or partially recognised. + +The stack usually won't grow very large - maybe a few tens of entries. +So we dynamically grow an array as required but never bother to +shrink it down again. + +We keep the stack as two separate allocations. One, `asn_stack` stores +the "abstract syntax nodes" which are created by each reduction. 
When +we call `do_reduce` we need to pass an array of the `asn`s of the body +of the production, and by keeping a separate `asn` stack, we can just +pass a pointer into this stack. + +The other allocation stores all other stack fields of which there are +several. The `state` is the most important one and guides the parsing +process. The `sym` is nearly unnecessary as it is implicit in the +state. However when we want to free entries from the `asn_stack`, it +helps to know what type they are so we can call the right freeing +function. The symbol leads us to the right free function through `do_free`. The `indents` count tracks the line indents with-in the symbol or @@ -2675,7 +2701,8 @@ The stack is most properly seen as alternating states and symbols - states, like the 'DOT' in items, are between symbols. Each frame in our stack holds a state and the symbol that was before it. The bottom of stack holds the start state but no symbol, as nothing came -before the beginning. +before the beginning. As we need to store some value, `TK_eof` is used +to mark the beginning of the file as well as the end. ###### parser functions @@ -2698,15 +2725,15 @@ before the beginning. Two operations are needed on the stack - shift (which is like push) and pop. -Shift applies not only to terminals but also to non-terminals. When -we reduce a production we will pop off entries corresponding to the -body symbols, then push on an item for the head of the production. -This last is exactly the same process as shifting in a terminal so we -use the same function for both. In both cases we provide the symbol, -the number of indents the symbol contains (which will be zero for a -terminal symbol) and a flag indicating the the symbol was at (or was -reduced from a symbol which was at) the start of a line. The state is -deduced from the current top-of-stack state and the new symbol. +Shift applies not only to terminals but also to non-terminals. 
When we
+reduce a production we will pop off frames corresponding to the body
+symbols, then push on a frame for the head of the production. This last
+is exactly the same process as shifting in a terminal so we use the same
+function for both. In both cases we provide the symbol, the number of
+indents the symbol contains (which will be zero for a terminal symbol)
+and a flag indicating that the symbol was at (or was reduced from a
+symbol which was at) the start of a line. The state is deduced from the
+current top-of-stack state and the new symbol.
 
 To simplify other code we arrange for `shift` to fail if there is no
 `goto` state for the symbol. This is useful in basic parsing due to our design
@@ -2837,10 +2864,9 @@ copying, hence `memdup` and `tokcopy`.
 
 ### The heart of the parser.
 
-Now we have the parser.  If we can shift we do, though newlines and
-reducing indenting may block that.  If not and we can reduce we do
-that.  If the production we reduced was production zero, then we have
-accepted the input and can finish.
+Now we have the parser. For each token we might shift it, trigger a
+reduction, or start error handling. 2D tokens (IN, OUT, EOL) also need
+to be handled.
 
 We return whatever `asn` was returned by reducing production zero.
 
@@ -3192,12 +3218,12 @@ end inside square brackets.
 
 The obvious example for a parser is a calculator.
 
-As `scanner` provides number parsing function using `libgmp` is it not much
+As `scanner` provides a number parsing function using `libgmp` it is not much
 work to perform arbitrary rational number calculations.
 
-This calculator takes one expression, or an equality test, per line.  The
-results are printed and if any equality test fails, the program exits with
-an error.
+This calculator takes one expression, or an equality test, per line.
+The results are printed and if any equality test fails, the program
+exits with an error.
 
 ###### File: parsergen.mk
 	calc.c calc.h : parsergen parsergen.mdc
-- 
2.43.0