From: NeilBrown Date: Sun, 11 May 2014 06:04:53 +0000 (+1000) Subject: parsergen: review and update text. X-Git-Tag: workingparser~38 X-Git-Url: https://ocean-lang.org/code/?p=ocean;a=commitdiff_plain;h=41683790fbbd6a2b9396271ac7851d575cca20ba parsergen: review and update text. Fix lots of typos, improve poor descriptions, and update some text to match recent changes. Signed-off-by: NeilBrown --- diff --git a/csrc/parsergen.mdc b/csrc/parsergen.mdc index 19d1b5a..ef01bf3 100644 --- a/csrc/parsergen.mdc +++ b/csrc/parsergen.mdc @@ -1194,7 +1194,7 @@ should happen, so we don't need to search a second time. } Adding an itemset may require merging the LA sets if LALR analysis is -happening. If any new LA set add symbol that weren't in the old LA set, we +happening. If any new LA set adds any symbols that weren't in the old LA set, we clear the `completed` flag so that the dependants of this itemset will be recalculated and their LA sets updated. @@ -1242,9 +1242,9 @@ them to a data structure, of freeing them. #### The build -To build all the itemsets, we first insert the initial itemset made from the -start symbol, complete each itemset, and then generate new itemsets from old -until no new ones can be made. +To build all the itemsets, we first insert the initial itemset made +from production zero, complete each itemset, and then generate new +itemsets from old until no new ones can be made. Completing an itemset means finding all the items where "DOT" is followed by a nonterminal and adding "DOT=0" items for every production from that @@ -1471,7 +1471,7 @@ changeover point in `first_nonterm`. The purpose of the report is to give the grammar developer insight into how the grammar parser will work. It is basically a structured dump of -all the tables that have been generated, plus an description of any conflicts. +all the tables that have been generated, plus a description of any conflicts. ###### grammar_report static int grammar_report(struct grammar *g, enum grammar_type type) @@ -1483,8 +1483,9 @@ all the tables that have been generated, plus an description of any conflicts. return report_conflicts(g, type); } -Firstly we have the complete list of symbols, together with the "FIRST" -set if that was generated. +Firstly we have the complete list of symbols, together with the +"FIRST" set if that was generated. We add a mark to each symbol to +show if it can end in a newline (`>`), or if it is nullable (`.`). ###### functions @@ -1523,7 +1524,7 @@ set if that was generated. } } -Then we have to follow sets if they were computed. +Then we have the follow sets if they were computed. static void report_follow(struct grammar *g) { @@ -1646,7 +1647,7 @@ LR1 are also similar as they have FOLLOW or LA sets. } LR0 conflicts are any state which have both a reducible item and -a shiftable item. +a shiftable item, or two reducible items. LR05 conflicts only occurs if two possibly reductions exist, as shifts always over-ride reductions. @@ -1695,12 +1696,13 @@ as shifts always over-ride reductions. } SLR, LALR, and LR1 conflicts happen if two reducible items have over-lapping -look ahead, or if a symbol in a look-ahead can be shifted. The differ only +look ahead, or if a symbol in a look-ahead can be shifted. They differ only in the source of the look ahead set. -We build a dataset mapping terminal to item for possible SHIFTs and then -another for possible REDUCE operations. We report when we get conflicts -between the two. +We build two datasets to reflect the "action" table: one which maps +terminals to items where that terminal could be shifted and another +which maps terminals to items that could be reduced when the terminal +is in look-ahead. We report when we get conflicts between the two. static int conflicts_slr(struct grammar *g, enum grammar_type type) { @@ -1777,7 +1779,7 @@ between the two. ## Generating the parser -The export part of the parser is the `parse_XX` function, where the name +The exported part of the parser is the `parse_XX` function, where the name `XX` is based on the name of the parser files. This takes a `code_node`, a partially initialized `token_config`, and an @@ -1785,7 +1787,7 @@ optional `FILE` to send tracing to. The `token_config` gets the list of known words added and then is used with the `code_node` to initialize the scanner. -`parse_XX` then call the library function `parser_run` to actually complete +`parse_XX` then calls the library function `parser_run` to actually complete the parse. This needs the `states` table and function to call the various pieces of code provided in the grammar file, so they are generated first. @@ -1815,9 +1817,9 @@ pieces of code provided in the grammar file, so they are generated first. fprintf(f, "}\n\n"); } -### Table words table +### Known words table -The know words is simply an array of terminal symbols. +The known words table is simply an array of terminal symbols. The table of nonterminals used for tracing is a similar array. ###### functions @@ -1959,7 +1961,7 @@ This code needs to be able to store data somewhere. Rather than requiring The code fragment requires translation when written out. Any `$N` needs to be converted to a reference either to that buffer (if `$0`) or to the -structure returned by a previous reduction. These pointer need to be cast +structure returned by a previous reduction. These pointers need to be cast to the appropriate type for each access. All this is handling in `gen_code`. @@ -2047,10 +2049,10 @@ done. It is particularly important to have fine control over freeing during error recovery where individual stack frames might need to be freed. -For this, the grammar author required to defined a `free_XX` function for -each structure that is used by a non-terminal. `do_free` all call whichever +For this, the grammar author is required to defined a `free_XX` function for +each structure that is used by a non-terminal. `do_free` will call whichever is appropriate given a symbol number, and will call `free` (as is -appropriate for tokens` on any terminal symbol. +appropriate for tokens on any terminal symbol. ###### functions @@ -2174,8 +2176,9 @@ grammar file). To be able to run `mdcode` and `scanner` on the grammar we need to memory map it. -One we have extracted the code (with `mdcode`) we expect to file three -sections: header, code, and grammar. Anything else is an error. +One we have extracted the code (with `mdcode`) we expect to find three +sections: header, code, and grammar. Anything else that is not +excluded by the `--tag` option is an error. "header" and "code" are optional, though it is hard to build a working parser with neither. "grammar" must be provided. @@ -2302,7 +2305,7 @@ file with the code section (if any) and the parser tables and function. } And that about wraps it up. We need to set the locale so that UTF-8 is -recognised properly, and link with `libicuuc` is `libmdcode` requires that. +recognised properly, and link with `libicuuc` as `libmdcode` requires that. ###### File: parsergen.mk parsergen : parsergen.o libscanner.o libmdcode.o @@ -2329,7 +2332,7 @@ recognised properly, and link with `libicuuc` is `libmdcode` requires that. ## The SHIFT/REDUCE parser -Having analysed the grammar and generated all the table, we only need the +Having analysed the grammar and generated all the tables, we only need the shift/reduce engine to bring it all together. ### Goto table lookup @@ -2375,7 +2378,7 @@ We keep the stack as two separate allocations. One, `asn_stack` stores the production, and by keeping a separate `asn` stack, we can just pass a pointer into this stack. -The other allocation stores all other stack fields of which there are four. +The other allocation stores all other stack fields of which there are six. The `state` is the most important one and guides the parsing process. The `sym` is nearly unnecessary. However when we want to free entries from the `asn_stack`, it helps to know what type they are so we can call the right @@ -2386,6 +2389,9 @@ The `indents` count and the `starts_indented` flag track the line indents in the symbol. These are used to allow indent information to guide parsing and error recovery. +`newline_permitted` keeps track of whether newlines should be ignored +or not, and `starts_line` records if this state stated on a newline. + As well as the stack of frames we have a `next` frame which is assembled from the incoming token and other information prior to pushing it onto the stack. @@ -2531,6 +2537,9 @@ an indent. or must force reductions until there is a pending indent which isn't at the start of a production. +`TK_newline` tokens are ignored precisely if there has been an indent +since the last state which could have been at the start of a line. + ###### parser includes #include "parser.h" ###### parser_run @@ -2746,13 +2755,9 @@ As `scanner` provides number parsing function using `libgmp` is it not much work to perform arbitrary rational number calculations. This calculator takes one expression, or an equality test per line. The -results are printed and in any equality test fails, the program exits with +results are printed and if any equality test fails, the program exits with an error. -Embedding mdcode inside mdcode is rather horrible. I'd like to find a -better approach, but as the grammar file must have 3 components I need -something like this. - ###### File: parsergen.mk calc.c calc.h : parsergen parsergen.mdc ./parsergen --tag calc -o calc parsergen.mdc