From abe144f0999c5576e18c74324c2436cd6bd65b2d Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 6 Oct 2020 17:02:22 +1100 Subject: [PATCH] parsegen: detect left-recursive symbols in non-final position. A left-recursive symbol that appear other than at the end of a production causes problem for indent-based parsing, as describe in the document. So teach parsergen to be able to report them. Ocean currently has several of these, which I'll need to look into at a later date. Signed-off-by: NeilBrown --- csrc/parsergen.mdc | 60 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 59 insertions(+), 1 deletion(-) diff --git a/csrc/parsergen.mdc b/csrc/parsergen.mdc index 3f36df9..5bf413d 100644 --- a/csrc/parsergen.mdc +++ b/csrc/parsergen.mdc @@ -400,6 +400,21 @@ The `<` may be included for symbols declared as storing a reference (not a structure) and means that the reference is being moved out, so it will not automatically be freed. +Symbols that are left-recursive are a little special. These are symbols +that both the head of a production and the first body symbol of the same +production. They are problematic when they appear in other productions +elsewhere than at the end, and when indenting is used to describe +structure. In this case there is no way for a smaller indent to ensure +the left-recursive symbol cannot be extended. When it appears at the +end of a production, that production can be reduced to ensure the symbol +isn't extended. So we record left-recursive symbols while reading the +grammar, and produce a warning when reporting the grammar if they are +found in an unsuitable place. + +A symbol that is only left recursive in a production where it is +followed by newline does not cause these problems because the newline +will effectively terminate it. + While building productions we will need to add to an array which needs to grow dynamically. @@ -460,6 +475,7 @@ Now we have all the bits we need to parse a full production. ###### symbol fields int first_production; + int left_recursive; ###### functions static char *parse_production(struct grammar *g, @@ -519,6 +535,11 @@ Now we have all the bits we need to parse a full production. } tk = token_next(state); } + if (p.body_size >= 2 && + p.body[0] == p.head && + p.body[1]->num != TK_newline) + p.head->left_recursive = 1; + if (tk.num != TK_newline && tk.num != TK_eof) { err = "stray tokens at end of line"; goto abort; @@ -1562,7 +1583,7 @@ all the tables that have been generated, plus a description of any conflicts. if (g->follow) report_follow(g); report_itemsets(g); - return report_conflicts(g, type); + return report_conflicts(g, type) + report_left_recursive(g); } Firstly we have the complete list of symbols, together with the @@ -1883,6 +1904,43 @@ but handled internally. return cnt; } + +### Reporting non-final left-recursive symbols. + +Left recursive symbols are a problem for parses that honour indentation +when they appear other than at the end of the production. So we need to +report these when asked. + +###### functions + + static int report_left_recursive(struct grammar *g) + { + int p; + int bad_left_recursive = 0; + + for (p = 0; p < g->production_count; p++) { + struct production *pr = g->productions[p]; + int sn; + + for (sn = 0; sn < pr->body_size - 1; sn++) { + struct symbol *s = pr->body[sn]; + + if (s->left_recursive == 1 && + s != pr->head) { + if (!bad_left_recursive) { + bad_left_recursive = 1; + printf("Misplaced left recursive symbols.\n"); + } + printf(" "); + prtxt(s->name); + printf(" in production [%d]\n", p); + s->left_recursive = 2; + } + } + } + return bad_left_recursive; + } + ## Generating the parser The exported part of the parser is the `parse_XX` function, where the name -- 2.43.0