write out C code files which can be compiled to make a parser.
"2D support" means that indentation and line breaks can be significant.
+Indent tokens (IN and OUT) and end-of-line (EOL) tokens can be used to
+describe the grammar and the parser can selectively ignore these where
+they aren't relevant.
There are several distinct sections.
Symbols can be either `TK_ident` or `TK_mark`. They are saved in a
table of known symbols and the resulting parser will report them as
`TK_reserved + N`. A small set of identifiers are reserved for the
-different token types that `scanner` can report.
+different token types that `scanner` can report, and an even smaller set
+are reserved for a special token that the parser can generate (`EOL`) as
+will be described later. This latter set cannot use predefined numbers,
+so they are marked as `isspecial` for now and will get assigned a number
+with the non-terminals later.
###### declarations
{ TK_out, "OUT" },
{ TK_newline, "NEWLINE" },
{ TK_eof, "$eof" },
+ { -1, "EOL" },
};
+
###### symbol fields
short num;
+ unsigned int isspecial:1;
Note that `TK_eof` and the two `TK_*_comment` tokens cannot be
recognised. The former is automatically expected at the end of the text
s = sym_find(g, t);
s->type = Terminal;
s->num = reserved_words[i].num;
+ s->isspecial = 1;
}
}
Once we have built everything we allocate arrays for the two lists:
symbols and itemsets. This allows more efficient access during
-reporting. The symbols are grouped as terminals and then non-terminals,
-and we record the changeover point in `first_nonterm`.
+reporting. The symbols are grouped as terminals, then non-terminals,
+then virtual, with the start of non-terminals recorded as `first_nonterm`.
+Special terminals -- meaning just EOL -- are included with the
+non-terminals so that they are not expected by the scanner.
###### grammar fields
struct symbol **symtab;
struct itemset *is;
int snum = TK_reserved;
for (s = g->syms; s; s = s->next)
- if (s->num < 0 && s->type == Terminal) {
+ if (s->num < 0 && s->type == Terminal && !s->isspecial) {
s->num = snum;
snum++;
}
which maps terminals to items that could be reduced when the terminal
is in look-ahead. We report when we get conflicts between the two.
-As a special case, if we find a SHIFT/REDUCE conflict, on the NEWLINE
-terminal, we ignore it. NEWLINES are handled specially with its own
-rules for when to shift and when to reduce. Conflicts are expected,
-but handled internally.
-
static int conflicts_slr(struct grammar *g, enum grammar_type type)
{
int i;
int k;
for (k = 0; k < la.cnt; k++) {
int pos = symset_find(&shifts, la.syms[k]);
- if (pos >= 0 && la.syms[k] != TK_newline) {
+ if (pos >= 0) {
printf(" State %d has SHIFT/REDUCE conflict on ", i);
cnt++;
prtxt(g->symtab[la.syms[k]]->name);
for (i = TK_reserved;
i < g->num_syms;
i++)
- if (g->symtab[i]->type == Nonterminal)
+ if (g->symtab[i]->type == Nonterminal ||
+ g->symtab[i]->isspecial)
fprintf(f, "\t\"%.*s\",\n", g->symtab[i]->name.len,
g->symtab[i]->name.txt);
fprintf(f, "};\n\n");
fprintf(f, "static void do_free(short sym, void *asn)\n");
fprintf(f, "{\n");
fprintf(f, "\tif (!asn) return;\n");
- fprintf(f, "\tif (sym < %d) {\n", g->first_nonterm);
+ fprintf(f, "\tif (sym < %d", g->first_nonterm);
+ /* Need to handle special terminals too */
+ for (i = 0; i < g->num_syms; i++) {
+ struct symbol *s = g->symtab[i];
+ if (i >= g->first_nonterm && s->type == Terminal &&
+ s->isspecial)
+ fprintf(f, " || sym == %d", s->num);
+ }
+ fprintf(f, ") {\n");
fprintf(f, "\t\tfree(asn);\n\t\treturn;\n\t}\n");
fprintf(f, "\tswitch(sym) {\n");
function. The symbol leads us to the right free function through
`do_free`.
-The `indents` count tracks the line indents with-in the symbol or
-immediately follow it. These are used to allow indent information to
-guide parsing and error recovery.
-
-`since_newline` tracks how many stack frames since the last
-start-of-line (whether indented or not). So if `since_newline` is
-zero, then this symbol is at the start of a line. Similarly
-`since_indent` counts the number of states since an indent, it is zero
-precisely when `indents` is not zero.
-
-`newline_permitted` keeps track of whether newlines should be ignored
-or not.
-
The stack is most properly seen as alternating states and symbols -
states, like the 'DOT' in items, are between symbols. Each frame in
our stack holds a state and the symbol that was before it. The
before the beginning. As we need to store some value, `TK_eof` is used
to mark the beginning of the file as well as the end.
+Indents (IN) are sometimes shifted and sometimes only accounted.
+Whatever decision is made must apply equally to the matching OUT. To
+ensure this we keep a stack of bits in `ignored_indents` and
+`indent_depth`. When we process an IN, we record if it was ignored.
+When we see an out, we pop of the relavant bit and use it to decide how
+to handle the OUT.
+
###### parser functions
struct parser {
struct frame {
short state;
- short newline_permitted;
-
short sym;
- short indents;
- short since_newline;
- short since_indent;
} *stack;
void **asn_stack;
int stack_size;
int tos;
+
+ ## parser state
};
#### Shift and pop
reduce a production we will pop off frames corresponding to the body
symbols, then push on a frame for the head of the production. This last
is exactly the same process as shifting in a terminal so we use the same
-function for both. In both cases we provide the symbol, the number of
-indents the symbol contains (which will be zero for a terminal symbol)
-and a flag indicating the the symbol was at (or was reduced from a
-symbol which was at) the start of a line. The state is deduced from the
-current top-of-stack state and the new symbol.
+function for both. In both cases we provide the symbol. The state is
+deduced from the current top-of-stack state and the new symbol.
To simplify other code we arrange for `shift` to fail if there is no `goto`
state for the symbol. This is useful in basic parsing due to our design
So `shift` finds the next state. If that succeeds it extends the
allocations if needed and pushes all the information onto the stacks.
+An extra complication is added to `shift` by the `EOL` token. This
+token must be generated when a `NEWLINE` is seen, but an `EOL` is
+expected. When this happens, the `NEWLINE` is NOT consumed, so multiple
+EOL can appear before a NEWLINE. To indicate that the token was shifted
+by not consumed, `shift` can return the special value `2`. The token
+number for `EOL` cannot be statically declared, so when the parser
+starts we need to look through the array of non-terminals to find the
+EOL.
+
+###### parser state
+ int tk_eol;
+
+###### find eol
+ p.tk_eol = 0;
+ while (strcmp(non_term[p.tk_eol], "EOL") != 0)
+ p.tk_eol += 1;
+ p.tk_eol += TK_reserved + config->known_count;
+
+
###### parser functions
static int shift(struct parser *p,
- short sym, short indents, short start_of_line,
- void *asn,
+ short sym, void *asn,
const struct state states[])
{
- // Push an entry onto the stack
struct frame next = {0};
+ int ret;
int newstate = p->tos
? search(&states[p->stack[p->tos-1].state],
sym)
: 0;
- if (newstate < 0)
+ if (newstate >= 0)
+ ret = 1;
+ else if (sym != TK_newline)
return 0;
+ else {
+ // have a NEWLINE, might need an EOL
+ sym = p->tk_eol;
+ newstate = p->tos
+ ? search(&states[p->stack[p->tos-1].state],
+ sym)
+ : 0;
+ if (newstate < 0)
+ return 0;
+ ret = 2;
+ asn = tok_copy(*(struct token*)asn);
+ }
+
if (p->tos >= p->stack_size) {
p->stack_size += 10;
p->stack = realloc(p->stack, p->stack_size
* sizeof(p->asn_stack[0]));
}
next.sym = sym;
- next.indents = indents;
next.state = newstate;
- if (!start_of_line) {
- if (p->tos)
- next.since_newline = p->stack[p->tos-1].since_newline + 1;
- else
- next.since_newline = 1;
- }
- if (indents)
- next.since_indent = 0;
- else if (p->tos)
- next.since_indent = p->stack[p->tos-1].since_indent + 1;
- else
- next.since_indent = 1;
-
p->stack[p->tos] = next;
p->asn_stack[p->tos] = asn;
p->tos++;
- return 1;
+ return ret;
}
`pop` primarily moves the top of stack (`tos`) back down the required
-amount and frees any `asn` entries that need to be freed. It also
-collects a summary of the indents and line starts in the symbols that
-are being removed. It is called _after_ we reduce a production, just
-before we `shift` the nonterminal in.
+amount and frees any `asn` entries that need to be freed. It is called
+_after_ we reduce a production, just before we `shift` the nonterminal
+in.
###### parser functions
- static int pop(struct parser *p, int num,
- short *start_of_line,
- void(*do_free)(short sym, void *asn))
+ static void pop(struct parser *p, int num,
+ void(*do_free)(short sym, void *asn))
{
int i;
- short indents = 0;
- int sol = 0;
p->tos -= num;
- for (i = 0; i < num; i++) {
- sol |= !p->stack[p->tos+i].since_newline;
- indents += p->stack[p->tos+i].indents;
- do_free(p->stack[p->tos+i].sym,
- p->asn_stack[p->tos+i]);
- }
- if (start_of_line)
- *start_of_line = sol;
- return indents;
+ for (i = 0; i < num; i++)
+ do_free(p->stack[p->tos+i].sym, p->asn_stack[p->tos+i]);
}
### The heart of the parser.
Now we have the parser. For each token we might shift it, trigger a
-reduction, or start error handling. 2D tokens (IN, OUT, EOL) also need
-to be handled.
-
-We return whatever `asn` was returned by reducing production zero.
-
-If we can neither shift nor reduce we have an error to handle. We pop
-single entries off the stack until we can shift the `TK_error` symbol,
-then drop input tokens until we find one we can shift into the new error
-state. We need to ensure that something is dropped or shifted after an
-error, or we could get into an infinite loop, only shifting in
-`TK_error`, then reducing. So we track if there has been a shift since
-the last error, and if not the next error always discards one token.
-
-When we find `TK_in` and `TK_out` tokens which report indents we need
-to handle them directly as the grammar cannot express what we want to
-do with them.
-
-`TK_in` tokens are easy: we simply update indent count in the top stack frame to
-record how many indents there are following the previous token.
-
-`TK_out` tokens must be canceled against an indent count
-within the stack. If we can reduce some symbols that are all since
-the most recent indent, then we do that first. If the minimum prefix
-of the current state then extends back before the most recent indent,
-that indent can be cancelled. If the minimum prefix is shorter then
-the indent had ended prematurely and we must start error handling, which
-is still a work-in-progress.
-
-`TK_newline` tokens are ignored unless the top stack frame records
-that they are permitted. In that case they will not be considered for
-shifting if it is possible to reduce some symbols that are all since
-the most recent start of line. This is how a newline forcibly
-terminates any line-like structure - we try to reduce down to at most
-one symbol for each line where newlines are allowed.
-A consequence of this is that a rule like
-
-###### Example: newlines - broken
-
- Newlines ->
- | NEWLINE Newlines
- IfStatement -> Newlines if ....
-
-cannot work, as the NEWLINE will never be shifted as the empty string
-will be reduced first. Optional sets of newlines need to be include
-in the thing that preceed:
-
-###### Example: newlines - works
-
- If -> if
- | NEWLINE If
- IfStatement -> If ....
-
-Here the NEWLINE will be shifted because nothing can be reduced until
-the `if` is seen.
-
-When during error handling we discard tokens read in, we want to keep
-discarding until we see one that is recognised. If we had a full set
-of LR(1) grammar states, this would mean looking in the look-ahead set,
-but we don't keep a full look-ahead set. We only record the subset
-that leads to SHIFT. We can, however, deduce the look-ahead set by
-looking at the SHIFT subsets for all states that we can get to by
-reducing zero or more times. So we need a little function which
-checks if a given token is in any of these look-ahead sets.
+reduction, or start error handling. 2D tokens (IN, OUT, NEWLINE, EOL)
+might also be ignored. Ignoring tokens is combined with shifting.
-###### parser includes
- #include "parser.h"
+###### parser vars
-###### parser_run
+ struct parser p = { 0 };
+ struct token *tk = NULL;
+ int accepted = 0;
- static int in_lookahead(struct token *tk, const struct state *states, int state)
- {
- while (state >= 0) {
- if (search(&states[state], tk->num) >= 0)
- return 1;
- if (states[state].reduce_prod < 0)
- return 0;
- state = search(&states[state], states[state].reduce_sym);
+###### heart of parser
+
+ shift(&p, TK_eof, NULL, states);
+ while (!accepted && p.tos > 0) {
+ struct frame *tos = &p.stack[p.tos-1];
+ if (!tk)
+ tk = tok_copy(token_next(tokens));
+ parser_trace(trace, &p,
+ tk, states, non_term, config->known_count);
+
+ ## try shift or ignore
+ ## try reduce
+ ## handle error
+ }
+
+Indents are ignored unless they can be shifted onto the stack
+immediately. The end of an indented section - the OUT token - is
+ignored precisely when the indent was ignored. To keep track of this we
+need a small stack of flags, which is easily stored as bits in an
+`unsigned long`. This will never overflow and the scanner only allows
+20 levels of indentation.
+
+###### parser state
+ unsigned long ignored_indents;
+ int indent_depth;
+
+NEWLINE/EOL is ignored when in an indented section of text which was not
+explicitly expected by the grammar. So if the most recent indent is
+ignored, so is any EOL token.
+
+For other tokens, we shift the next token if that is possible, otherwise
+we try to reduce a production.
+
+###### try shift or ignore
+
+ if ((tk->num == TK_newline || tk->num == TK_out) &&
+ (p.ignored_indents & (1 << p.indent_depth))) {
+ /* indented, so ignore OUT and NEWLINE */
+ if (tk->num == TK_out)
+ p.indent_depth -= 1;
+ free(tk);
+ tk = NULL;
+ parser_trace_action(trace, "Ignore");
+ continue;
+ }
+
+ switch (shift(&p, tk->num, tk, states)) {
+ case 1:
+ if (tk->num == TK_out)
+ p.indent_depth -= 1;
+ if (tk->num == TK_in) {
+ p.indent_depth += 1;
+ p.ignored_indents &= ~(1 << p.indent_depth);
}
- return 0;
+ tk = NULL;
+ /* fallthrough */
+ case 2:
+ parser_trace_action(trace, tk ? "ShiftEOL" : "Shift");
+ ## did shift
+ continue;
+ }
+
+ if (tk->num == TK_in) {
+ /* No indent expected here, so ignore IN */
+ free(tk);
+ tk = NULL;
+ p.indent_depth += 1;
+ p.ignored_indents |= (1 << p.indent_depth);
+ parser_trace_action(trace, "Ignore");
+ continue;
+ }
+
+We have already discussed the bulk of the handling of a "reduce" action,
+with the `pop()` and `shift()` functions doing much of the work. There
+is a little more complexity needed to manage storage for the asn (Abstract
+Syntax Node), and also a test of whether the reduction is permitted.
+
+When we try to shift the result of reducing production-zero, it will
+fail because there is no next state. In this case the asn will not have
+been stored on the stack, so it get stored in the `ret` variable, and we
+report that that input has been accepted.
+
+###### parser vars
+
+ void *ret = NULL;
+
+###### try reduce
+
+ if (states[tos->state].reduce_prod >= 0) {
+ void **body;
+ void *res;
+ const struct state *nextstate = &states[tos->state];
+ int prod = nextstate->reduce_prod;
+ int size = nextstate->reduce_size;
+ int res_size = nextstate->result_size;
+
+ body = p.asn_stack + (p.tos - size);
+ res = res_size ? calloc(1, res_size) : NULL;
+ res_size = do_reduce(prod, body, config, res);
+ if (res_size != nextstate->result_size)
+ abort();
+ pop(&p, size, do_free);
+ if (!shift(&p, nextstate->reduce_sym, res, states)) {
+ accepted = 1;
+ ret = res;
+ parser_trace_action(trace, "Accept");
+ } else
+ parser_trace_action(trace, "Reduce");
+ continue;
}
+If we can neither shift nor reduce we have an error to handle. There
+are two possible responses to an error: we can pop single frames off the
+stack until we find a frame that can shift `TK_error`, or we can discard
+the current look-ahead symbol.
+
+When we first see an error we do the first of these and set a flag to
+record that we are processing an error. If the normal shift/reduce
+tests fail to find that anything can be done from that state, we start
+dropping tokens until either we manage to shift one, or reach end-of-file.
+
+###### parser vars
+
+ int in_err = 0;
+
+###### did shift
+
+ in_err = 0;
+
+###### handle error
+
+ if (in_err) {
+ parser_trace_action(trace, "DISCARD");
+ if (tk->num == TK_eof)
+ break;
+ free(tk);
+ tk = NULL;
+ } else {
+ struct token *err_tk;
+
+ parser_trace_action(trace, "ERROR");
+
+ err_tk = tok_copy(*tk);
+ while (p.tos > 0 && shift(&p, TK_error, err_tk, states) == 0)
+ // discard this state
+ pop(&p, 1, do_free);
+ if (p.tos == 0) {
+ free(err_tk);
+ // no state accepted TK_error
+ break;
+ }
+ in_err = 1;
+ }
+
+###### parser includes
+ #include "parser.h"
+
+###### parser_run
+
void *parser_run(struct token_state *tokens,
const struct state states[],
int (*do_reduce)(int, void**, struct token_config*, void*),
FILE *trace, const char *non_term[],
struct token_config *config)
{
- struct parser p = { 0 };
- struct token *tk = NULL;
- int accepted = 0;
- int shift_since_err = 1;
- void *ret = NULL;
-
- shift(&p, TK_eof, 0, 1, NULL, states);
- while (!accepted && p.tos > 0) {
- struct token *err_tk;
- struct frame *tos = &p.stack[p.tos-1];
- if (!tk)
- tk = tok_copy(token_next(tokens));
- parser_trace(trace, &p,
- tk, states, non_term, config->known_count);
-
- if (tk->num == TK_in) {
- tos->indents += 1;
- tos->since_newline = 0;
- tos->since_indent = 0;
- free(tk);
- tk = NULL;
- parser_trace_action(trace, "Record");
- continue;
- }
- if (tk->num == TK_out) {
- if (states[tos->state].reduce_size >= 0 &&
- states[tos->state].reduce_size <= tos->since_indent)
- goto force_reduce;
- if (1) {
- // OK to cancel
- struct frame *in = tos - tos->since_indent;
- in->indents -= 1;
- if (in->indents == 0) {
- /* Reassess since_indent and newline_permitted */
- if (in > p.stack) {
- in->since_indent = in[-1].since_indent + 1;
- in->newline_permitted = in[-1].newline_permitted;
- } else {
- in->since_indent = 0;
- in->newline_permitted = 0;
- }
- while (in < tos) {
- in += 1;
- in->since_indent = in[-1].since_indent + 1;
- }
- }
- free(tk);
- tk = NULL;
- parser_trace_action(trace, "Cancel");
- continue;
- }
- // fall through to error handling as both SHIFT and REDUCE
- // will fail.
- }
- if (tk->num == TK_newline) {
- if (!tos->newline_permitted) {
- free(tk);
- tk = NULL;
- parser_trace_action(trace, "Discard");
- continue;
- }
- if (tos->since_newline > 1 &&
- states[tos->state].reduce_size >= 0 &&
- states[tos->state].reduce_size <= tos->since_newline)
- goto force_reduce;
- }
- if (shift(&p, tk->num, 0, tk->num == TK_newline, tk, states)) {
- shift_since_err = 1;
- tk = NULL;
- parser_trace_action(trace, "Shift");
- continue;
- }
- force_reduce:
- if (states[tos->state].reduce_prod >= 0 &&
- !(tk->num == TK_newline ||
- tk->num == TK_eof ||
- tk->num == TK_out ||
- (tos->indents == 0 && tos->since_newline == 0))) {
- /* Anything other than newline or out or eof
- * in an error unless we are already at start
- * of line, as this production must end at EOL.
- */
- } else if (states[tos->state].reduce_prod >= 0) {
- void **body;
- void *res;
- const struct state *nextstate = &states[tos->state];
- int prod = nextstate->reduce_prod;
- int size = nextstate->reduce_size;
- int res_size = nextstate->result_size;
- short indents, start_of_line;
-
- body = p.asn_stack + (p.tos - size);
- res = res_size ? calloc(1, res_size) : NULL;
- res_size = do_reduce(prod, body, config, res);
- if (res_size != nextstate->result_size)
- abort();
-
- indents = pop(&p, size, &start_of_line,
- do_free);
-
- if (!shift(&p, nextstate->reduce_sym,
- indents, start_of_line,
- res, states)) {
- if (prod != 0) abort();
- accepted = 1;
- ret = res;
- }
- parser_trace_action(trace, "Reduce");
- continue;
- }
- /* Error. We walk up the stack until we
- * find a state which will accept TK_error.
- * We then shift in TK_error and see what state
- * that takes us too.
- * Then we discard input tokens until
- * we find one that is acceptable.
- */
- parser_trace_action(trace, "ERROR");
- short indents = 0, start_of_line;
-
- err_tk = tok_copy(*tk);
- while (p.tos > 0 &&
- shift(&p, TK_error, 0, 0,
- err_tk, states) == 0)
- // discard this state
- indents += pop(&p, 1, &start_of_line, do_free);
- if (p.tos == 0) {
- free(err_tk);
- // no state accepted TK_error
- break;
- }
- if (!shift_since_err) {
- /* must discard at least one token to avoid
- * infinite loop.
- */
- if (tk->num == TK_eof)
- break;
- free(tk);
- tk = tok_copy(token_next(tokens));
- }
- shift_since_err = 0;
- tos = &p.stack[p.tos-1];
- while (!in_lookahead(tk, states, tos->state) &&
- tk->num != TK_eof) {
- free(tk);
- tk = tok_copy(token_next(tokens));
- shift_since_err = 1;
- if (tk->num == TK_in)
- indents += 1;
- if (tk->num == TK_out) {
- if (indents == 0)
- break;
- indents -= 1;
- // FIXME update since_indent here
- }
- }
- tos->indents += indents;
- }
+ ## parser vars
+
+ ## find eol
+
+ ## heart of parser
+
free(tk);
- pop(&p, p.tos, NULL, do_free);
+ pop(&p, p.tos, do_free);
free(p.asn_stack);
free(p.stack);
return ret;
} else
fputs(non_term[sym - TK_reserved - knowns],
trace);
- if (f->indents)
- fprintf(trace, ".%d", f->indents);
- if (f->since_newline == 0)
- fputs("/", trace);
fputs(" ", trace);
}
fprintf(trace, "(%d) ", f->state);