From 1e17521c96f4116042cdaf3cf2aac40284f6dec8 Mon Sep 17 00:00:00 2001
From: NeilBrown <neil@brown.name>
Date: Sun, 11 Oct 2020 09:50:12 +1100
Subject: [PATCH] parsergen: avoid infinite loop on error.

If the grammar allows "ERROR" in a recursive location, error handling
can loop for every.
e.g.

 foo -> foo bar
 foo -> ERROR

Rather than detect and reject such grammars, detect the infinite loop
as it start, and discard an extra token.

i.e.  if error handling doesn't discard any tokens from the input
stream, and another error is triggered before anything is shifted, then
we force the next error handling phase to discard at least one token,
or to abort if that token is EOF.

Signed-off-by: NeilBrown <neil@brown.name>
---
 csrc/parsergen.mdc | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/csrc/parsergen.mdc b/csrc/parsergen.mdc
index 13b946f..cdb5f82 100644
--- a/csrc/parsergen.mdc
+++ b/csrc/parsergen.mdc
@@ -2779,8 +2779,12 @@ accepted the input and can finish.
 We return whatever `asn` was returned by reducing production zero.
 
 If we can neither shift nor reduce we have an error to handle.  We pop
-single entries off the stack until we can shift the `TK_error` symbol, then
-drop input tokens until we find one we can shift into the new error state.
+single entries off the stack until we can shift the `TK_error` symbol,
+then drop input tokens until we find one we can shift into the new error
+state.  We need to ensure that something is dropped or shifted after an
+error, or we could get into an infinite loop, only shifting in
+`TK_error`, then reducing.  So we track if there has been a shift since
+the last error, and if not the next error always discards one token.
 
 When we find `TK_in` and `TK_out` tokens which report indents we need
 to handle them directly as the grammar cannot express what we want to
@@ -2824,11 +2828,11 @@ in the thing that preceed:
 Here the NEWLINE will be shifted because nothing can be reduced until
 the `if` is seen.
 
-When, during error handling, we discard token read in, we want to keep
+When during error handling we discard tokens read in, we want to keep
 discarding until we see one that is recognised.  If we had a full set
-of LR(1) grammar states, this will mean looking in the look-ahead set,
+of LR(1) grammar states, this would mean looking in the look-ahead set,
 but we don't keep a full look-ahead set.  We only record the subset
-that leads to SHIFT.  We can, however, deduce the look-ahead set but
+that leads to SHIFT.  We can, however, deduce the look-ahead set by
 looking at the SHIFT subsets for all states that we can get to by
 reducing zero or more times.  So we need a little function which
 checks if a given token is in any of these look-ahead sets.
@@ -2860,6 +2864,7 @@ checks if a given token is in any of these look-ahead sets.
 		struct parser p = { 0 };
 		struct token *tk = NULL;
 		int accepted = 0;
+		int shift_since_err = 1;
 		void *ret = NULL;
 
 		shift(&p, TK_eof, 0, 1, NULL, states);
@@ -2931,6 +2936,7 @@ checks if a given token is in any of these look-ahead sets.
 					goto force_reduce;
 			}
 			if (shift(&p, tk->num, 0, tk->num == TK_newline, tk, states)) {
+				shift_since_err = 1;
 				tk = NULL;
 				parser_trace_action(trace, "Shift");
 				continue;
@@ -2995,11 +3001,22 @@ checks if a given token is in any of these look-ahead sets.
 				// no state accepted TK_error
 				break;
 			}
+			if (!shift_since_err) {
+				/* must discard at least one token to avoid
+				 * infinite loop.
+				 */
+				if (tk->num == TK_eof)
+					break;
+				free(tk);
+				tk = tok_copy(token_next(tokens));
+			}
+			shift_since_err = 0;
 			tos = &p.stack[p.tos-1];
 			while (!in_lookahead(tk, states, tos->state) &&
 			       tk->num != TK_eof) {
 				free(tk);
 				tk = tok_copy(token_next(tokens));
+				shift_since_err = 1;
 				if (tk->num == TK_in)
 					indents += 1;
 				if (tk->num == TK_out) {
-- 
2.43.0