From ec6d416970d844f7bbe67fa66fcf23c2e681b199 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Fri, 3 Oct 2014 14:52:16 +1000
Subject: [PATCH] parsergen: update description to match current reality.

In particular, the handling of indents and newlines was a bit out-dated.

Signed-off-by: NeilBrown
---
 csrc/parsergen.mdc | 83 +++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 37 deletions(-)

diff --git a/csrc/parsergen.mdc b/csrc/parsergen.mdc
index 7ac1e0b..ae7087e 100644
--- a/csrc/parsergen.mdc
+++ b/csrc/parsergen.mdc
@@ -2437,20 +2437,23 @@ The `state` is the most important one and guides the parsing process. The
 freeing function. The symbol leads us to the right free function through
 `do_free`.
 
-The `indents` count tracks the line indents in the symbol. These are
-used to allow indent information to guide parsing and error recovery.
+The `indents` count tracks the line indents within the symbol or
+immediately following it. These are used to allow indent information to
+guide parsing and error recovery.
 
 `since_newline` tracks how many stack frames since the last
 start-of-line (whether indented or not). So if `since_newline` is
-zero, then this symbol is at the start of a line.
+zero, then this symbol is at the start of a line. Similarly
+`since_indent` counts the number of states since an indent; it is zero
+precisely when `indents` is not zero.
 
 `newline_permitted` keeps track of whether newlines should be ignored
-or not, and `starts_line` records if this state stated on a newline.
+or not.
 
 The stack is most properly seen as alternating states and symbols -
 states, like the 'DOT' in items, are between symbols. Each frame in
 our stack holds a state and the symbol that was before it. The
-bottom of stack holds the start state, but no symbol, as nothing came
+bottom of stack holds the start state but no symbol, as nothing came
 before the beginning.
 
 ###### parser functions
@@ -2474,12 +2477,15 @@ before the beginning.
 
 Two operations are needed on the stack - shift (which is like push) and pop.
 
-Shift applies not only to terminals but also to non-terminals. When we
-reduce a production we will pop off entries corresponding to the body
-symbols, then push on an item for the head of the production. This last is
-exactly the same process as shifting in a terminal so we use the same
-function for both. In both cases we provide a stack frame which
-contains the symbol to shift and related indent information.
+Shift applies not only to terminals but also to non-terminals. When
+we reduce a production we will pop off entries corresponding to the
+body symbols, then push on an item for the head of the production.
+This last is exactly the same process as shifting in a terminal so we
+use the same function for both. In both cases we provide the symbol,
+the number of indents the symbol contains (which will be zero for a
+terminal symbol) and a flag indicating that the symbol was at (or was
+reduced from a symbol which was at) the start of a line. The state is
+deduced from the current top-of-stack state and the new symbol.
 
 To simplify other code we arrange for `shift` to fail if there is no `goto`
 state for the symbol. This is useful in basic parsing due to our design
@@ -2489,17 +2495,13 @@ function reports if it could.
 `shift` is also used to push state zero onto the stack, so if the stack
 is empty, it always chooses zero as the next state.
 
-So `shift` finds the next state. If that succeed it extends the allocations
-if needed and pushes all the information onto the stacks.
+So `shift` finds the next state. If that succeeds it extends the
+allocations if needed and pushes all the information onto the stacks.
 
-Newlines are permitted after a starts_line state until an internal
-indent. So we need to find the topmost state which `starts_line` and
-see if there are any indents other than immediately after it.
-
-So we walk down:
-
-- if state starts_line, then newlines_permitted.
-- if any non-initial indents, newlines not permitted
+Newlines are permitted after a `starts_line` state until an internal
+indent. If the new frame has neither a `starts_line` state nor an
+indent, newlines are permitted if the previous stack frame permitted
+them.
 
 ###### parser functions
 
@@ -2557,9 +2559,9 @@ So we walk down:
 
 `pop` primarily moves the top of stack (`tos`) back down the required
 amount and frees any `asn` entries that need to be freed. It also
-collects a summary of the indents in the symbols that are being
-removed. It is called _after_ we reduce a production, just before we
-`shift` the nonterminal in.
+collects a summary of the indents and line starts in the symbols that
+are being removed. It is called _after_ we reduce a production, just
+before we `shift` the nonterminal in.
 
 ###### parser functions
 
@@ -2614,9 +2616,9 @@ copying, hence `memdup` and `tokcopy`.
 
 ### The heart of the parser.
 
-Now we have the parser. If we can shift, we do, though newlines and
-reducing indenting may block that. If not and we can reduce we do.
-If the production we reduced was production zero, then we have
+Now we have the parser. If we can shift we do, though newlines and
+reducing indenting may block that. If not and we can reduce we do
+that. If the production we reduced was production zero, then we have
 accepted the input and can finish. We return whatever `asn` was
 returned by reducing production zero.
 
@@ -2629,16 +2631,23 @@ When we find `TK_in` and `TK_out` tokens which report indents we need
 to handle them directly as the grammar cannot express what we want to
 do with them.
 
-`TK_in` tokens are easy: we simply update the `next` stack frame to
-record how many indents there are and that the next token started with
-an indent.
-
-`TK_out` tokens must either be counted off against any pending indent,
-or must force reductions until there is a pending indent which isn't
-at the start of a production.
-
-`TK_newline` tokens are ignored precisely if there has been an indent
-since the last state which could have been at the start of a line.
+`TK_in` tokens are easy: we simply update the indent count in the top stack
+frame to record how many indents there are following the previous token.
+
+`TK_out` tokens must be cancelled against an indent count
+within the stack. If we can reduce some symbols that are all since
+the most recent indent, then we do that first. If the minimum prefix
+of the current state then extends back before the most recent indent,
+that indent can be cancelled. If the minimum prefix is shorter, then
+the indent is premature and we must start error handling, which
+currently doesn't work at all.
+
+`TK_newline` tokens are ignored unless the top stack frame records
+that they are permitted. In that case they will not be considered for
+shifting if it is possible to reduce some symbols that are all since
+the most recent start of line. This is how a newline forcibly
+terminates any line-like structure - we try to reduce down to at most
+one symbol for each line where newlines are allowed.
 
 ###### parser includes
 	#include "parser.h"
-- 
2.43.0