From 18d02a36152f7ab7325e4b82a03d885a6a3bc56c Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 26 Jan 2021 17:24:21 +1100 Subject: [PATCH] devel - working on twod I'm trying a new approach to twod parsing which makes NEWLINEs explicit except when hidden by indents, so they aren't ignored at start-of-line. They cannot be shifted in some circumstances, and I'm still refining those cirumstances. Signed-off-by: NeilBrown --- 00-TODO | 18 +- Blog-pointers | 243 +++++++++++++++++ Ocean-introspection | 59 ++++ ooooo.tex | 0 thoughts | 0 twod | 647 ++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 958 insertions(+), 9 deletions(-) create mode 100644 Blog-pointers create mode 100644 Ocean-introspection mode change 100755 => 100644 ooooo.tex mode change 100755 => 100644 thoughts diff --git a/00-TODO b/00-TODO index 8c0efc7..5675fdb 100644 --- a/00-TODO +++ b/00-TODO @@ -1,24 +1,24 @@ This is a living document - delete things when done. Avoid discussion. Current version (Cataract Creek) -- Warn when left-recursive symbols appear elsewhere, other than at the end - of a production. Might have to special-case Newlines. -- parser not to get into ERROR infinite loop -- sort 'virtual' symbols to end -- allow $xy instead of $3. Chooses shortest bodysym with xy in that order - $xy_2 gives the second one -- allow $TERM terminals to be listed. If so, extras are errors - structs - - const fields + - const fields ... what does that mean? Assign once as initialization? + Can be used for array size? What else? - anonymous field - array or struct (or pointer to these) - multiple anon struct ar allowed if they don't conflict + multiple anon struct are allowed if they don't conflict - [] can apply to anon array field - anon struct field gets fields interpolated + - anon fields have no name before the :, so + struct foobar: + :content + size:number + - manifest values for arrays and structs [a,b,c] or [.foo=a, .bar=b] or [ [1]=a, [2]=b] That last doesn't parse easily, unless we require tags... not a good idea. [ .[1] = a, .[2] = b ] ?? Maybe. + or () to group, [] for index. To (.foo=a) ( [1] = b ) - yet more operators << >> # bit-ops & | ~ &~ diff --git a/Blog-pointers b/Blog-pointers new file mode 100644 index 0000000..59b943f --- /dev/null +++ b/Blog-pointers @@ -0,0 +1,243 @@ + +Pointers is one of the particular areas where I wanted to innovate in +the design of Ocean. + +Traditionally (in lanuages like C) pointers are both powerful and +dangerous. My goal is to control that power without extinguishing it, +so they become powerful with more clarity, and without the danger. + +Pointers have always been overloaded - the can store either a pointer +to an actual object, or a 'nil' (or 'NULL') pointer, which isn't an +object and cannot be dereferenced safely. This provides power (it +enables simple finite linked lists) and danger (it is too easy to try +to dereference the `nil` pointer). In Ocean, pointers that can be +`nil` are a different (though related) type to those which cannot. A +pointer which can be `nil` may only be dereferenced after a test +confirms that it isn't, in fact, `nil`. This requires more careful +notation of the types of all pointers, but is easily checked by the +compiler so notation errors are easily caught and corrected. + +As well as allowing `nil`, it is sometimes useful to allow other +non-pointer values to be stored in a pointer object. The Linux kernel +does this a lot, and the are two specific cases that are used. +In the first case, small integers (normally negative) can be stored in +a pointer variable. These are used as error codes. As the first page +of virtual memory and also the last page are not mapped, any pointer +with an absolute numeric value less than the page size cannot be a +valid pointers and can be treated as something else. Zero is used for +the `nil` pointer, negative numbers are used as errors, positive +numbers are largely unused. Ocean will make this pattern explicit in +the types: a pure pointer will not have one of these values and can be +dereferenced. A loaded pointer might have a reserved value and can +only be dereferenced after a test. If the test fails, the numeric +value can be extracted. + +The second case relies on the fact that on all supported +architectures, any pointer to a 16-byte (or larger) value must be (at +least) 2-byte aligned, so the numeric value of the pointer will be +even. This means that odd values cannot be valid pointers, so those +values can be used for something else. This provides a secord, or +third, level of "loading" that can be declared for a pointer. The +Linux kernel uses these in hashtables that support lockless lookups +and movement of entries between chains. The loaded value is treated +like a `nil` pointer, but records which hash chain has come to an +end. If a search find the wrong hash value at the end of the chain, +it know that it might have been diverted between chains and needs to +re-check. + +The odd pointer values can also be used in a different way. In each +of indicating a different interpretation of the whole pointer, the +least significant bit can be treated as a separate value that is +stored in the pointer - to optimize space usage. A particular use for +this is as a lock bit, again in a hash table though this time in the +head. Before a thread may change a hash chain, it must atomically +change the least significant bit from zero to one. After the change, +it can clear the bit, possibly as a sige effect of updating that first +pointer. + +The Ocean language shouldn't insist that the various non-pointer +values are error codes, hash values, or lock bits, but it will make +them available for the programmer to use in a controlled way, and will +ensure that a pointer that can contain non-pointer values is always +properly checked. + +Apart from `nil` pointer dereferences, the main danger with pointers +involves the interaction with memory allocation and freeing. If +memory is released (and then reused) while a pointer to the memory is +held, future dereferences of that pointer are likely to cause +problems. In order to avoid these errors, the language must ensure +that no such references remain when memory is freed. Once it does +that, it can help the programmer by actively freeing the memory once +the number of valid references reaches zero. + +This tracking of references is sometimes done using a process referred +to as "garbage collection". All pointer that are (or could be) +active, and all locations they point to are marked. Anything not-marked +can be freed. This has a degree of conceptual simplicity and a degree +of practical inelegance. The implementation can supposedly be made +quite efficient, but it isn't the approach that I want to take. + +The alternate approach is to enhance the type system so that the +language "knows" when a pointer is active or not, and how many active +pointers there are to any given location. When there is only one +pointer and it become inactive, the memory can be free. The simplest +implementation of this pattern is to store a reference-counter in each +allocated object, and to use it to keep count of all references. This +is a useful pattern and in some circumstances it is the best possible +pattern, but it is not the only valid pattern. Ocean will support +this natively by allowing a field on a structure to be identified as +reference counter, but it will allow other options as well. + +Points can be classified as either "owned" or "borrowed" references. +This will probably be determined dynamically from code analysis, +though in some cases such as function parameters and returns, it must +be declared. An owned reference is one that holds a reference count +or by some other mechanism is an independent reference which will +prevent the object from being freed, until some action happens on the +owned references to releasse it. A "borrowed" reference is a +temporary reference that exists under the umbrella of some +identifiable owned reference. The language definition will require that +the borrowed reference have limited scope, and that the related owned +reference will not be released while that scope is still active. + +A simple example of a borrowed reference happens when an owned +reference is passed as a parameter to a function which declares the +formal parameter to be a borrowed reference. Inside the function, the +borrowed reference cannot be used to release the reference, and the +owned reference much remain intact for the duration of the function +call. + +It may be that the owned reference cannot be uniquely identified. As an +example, there might be two owned references, and an conditional +statement assigns one or the other to a borrowed pointer. In this case +it isn't possible to know at time of analysis which owned pointer is the +primary, so both (or all) possible primaries must be preserved. + +The "Scope" of the borrowed reference that the owned reference must be +preserved for will often a a lexical scope, but may not always be. I +imaging that the language may at some stage be able describe other +temporal relationships between different objects. In that case, the +borrowed reference must have a life time contained within the +guaranteed lifetime of the matching borrowed pointer. + +I should say that "Garbage collection" will be an option, just not the +only or the default. Language can potentially identify exactly the +references that need to be checked. Maybe collectable pointers +reserve the lsb for mark, prior to sweep. +compiler would extract a description of places that gc refs can live, +and encourage them in specific domains. + +We might also make an allocation domain a well defined concept. + +## Other Ownership + +There are a variety of sorts of owned references. They include: + +- counted references. These are created by incrementing a + reference-count field, which is then decremented when the reference + is destroyed. + +- single references. Some objects are declared as only ever having a + single owned reference. This can be moved from pointer to pointer + but never duplicated. When the reference is destroyed, so is the + object. + +- inherited references. An .... + +## Rules for borrowing references. + +Borrowed references can live in automatic variables and in structures +(including arrays). They can only live here as long as the owned +reference is stable. +For automatic variables, we can (hopefully) deduce the relevant owned +references - it is what was copied to get a borrowed ref. +For refs in structures, we need to be explicit for borrowed refs +That means we need a language and a precise understanding. +We would need a parent or sibling or child object to have the ref.. +Maybe we just declare the name of the object, where it has to be +passed in as an arg to be part of a parent. + + +---------------------- +Later... + +How much of this could/should be implemented as Smart Pointers? +Rust uses smart pointers to implement + Nullable (Option<>), RefCount (rc<>) Atomic refcount (arc<>) and + others. +It even has Box<> to create on heap instead of stack. + +Why doesn't I just do that? Partly because I don't have classes yet!! + +I like a simple syntax to test if a pointer is over-loaded. + if pointer? +and to get the overload value + pointer! + +Rust would use pattern matching. + if let Some(p) = pointer { + use p + } + +in place of + if pointer? : + use pointer! + +Rc() uses Rc::clone() to take a reference. I would rather that were +transparent. + +Rc and ARc need to be different. I don't want the difference to be +that obvious. + +Box<> is interesting. It is the gateway to the heap. Maybe it is +just syntax. + +I want to be able to store an owned pointer in an ADT without telling +the ADT the precise type. Either it returns an object to be freed by +caller, or a 'free' function is passed in. +... NAH this can be done with normal type parameterization. + +So the ownership-type needs to be transparent to a degree. +i.e. I can have: + - a borrowed pointer + - an owned pointer of unknown disposition + - an owned pointer of specific discipline. + +Rust uses & to specify a borrowed reference. + +Maybe + ^type is a borrowed reference + @type is an owned reference with discipline defined by type. + discipline@type is an owned reference of a given discipline. + + +How do I want to handle mutability? +Function can usefully have 'const' markers. +What are the risks when single-threaded? We could extract a value, +accidentally change something, then depend on the value. +We could probably do that anyway... + +Pointers are normally (default) overloaded with a small integer. +/2047 +Pointers can have status attributes when stored in structures or +passed to or from functions. + - "pure" means there is no small integer + - "loaded" means there is large-integer overloading - if lsb set. + - "init" means the target is being initialised + - "exit" means it can be torn down + - other words can reflect stages in life cycle + - still other words might reflect locking status. +There can be several locking statuses, and several life-time statuses, +as these can affect different fields differently. + +In a struct each field can have a mapping from internal states to +external state. (init=foo, locked=foobar, rlocked=baz) +And internal state not listed in inheritet from external states (a=a). + +For a function, returned value attribute can be copied from args, as +can type I suspect + fun foo(a:thing, b:thing) : + +or better + + fun foo(a:x, b:x) : x + diff --git a/Ocean-introspection b/Ocean-introspection new file mode 100644 index 0000000..b809622 --- /dev/null +++ b/Ocean-introspection @@ -0,0 +1,59 @@ +Here I collect thoughts about introspection. + + +Introspection can be at compile-time as well as runtime, and can have +quite a different role. +In edlib/python I use the documentation string of class methods to +indicate what 'key' they respond to. I cannot easily do that in C, +but could if I have compile-time introspection. I could do it +with a sed script that extracts an include file ... + +The language would need to allow for extra stuff to be parsed into +an AST, then the introspection code would run over that AST before +it was further analysed. This a bit like gcc pluggins, but +is writting if the language being compiled, and compiled like +any other code. + +This would be part of a larger feature that any code could run at +compile time instead of later. Constant-folding is already a common +practice and is an example of this. + +I would need a simple and stable AST ADT to be manipulated. Most uses +would be relatively simple, like collection functions into an array, +or pre-parsing some internal data structure, possibly sorting it. + +It could be as simple as allowing strings in lots of places, and +expecting the introspection to parse them - along with various object +names - and do stuff. + + +Separately there is runtime introspection. +This would be needed to handle anything not known at compile time. + +Mostly this is probably underspecified type. An 'sprintf' +type function to be introspected and transformed at compile time +providing the types of everything could be determined. And if the +type isn't known... recording introspection info is probably not easier +that recording the appropriate dispatch function. + +Then there is link-time information. Compile-time introspection needs +to all be run at compile-time. If I load a foreign module at run time +(or link time), it cannot ask questions that weren't already asked. + +Maybe a "standard" compile-time introspector could encode any info +needed for link-time or runtime introspectors. + + +Introspection could be used for: + + - preparsing data structures + e.g. sorting, adding links, .. + - annotating for run-time + e.g. storing table of function, with names + providing serialization, de-serialization + - providing default values for missing args + - transforming n-ary args to arrays + - adding appropriate dispatch info when passing args to + a function that expects objects + - adding coverage tracking + - adding performance tracking diff --git a/ooooo.tex b/ooooo.tex old mode 100755 new mode 100644 diff --git a/thoughts b/thoughts old mode 100755 new mode 100644 diff --git a/twod b/twod index bde8aae..e6db5cd 100644 --- a/twod +++ b/twod @@ -2344,3 +2344,650 @@ iftail -> else block ifhead -> Newlines if expr block ifstatement -> ifhead iftail ]] + + +13oct2020 - much later. + + Maybe a different approach to newlines. Never shift them, only use + them like indent to guide parsing. + + IN is just recorded + OUT must reduce until top symbol contains an IN, which is cancelled. + + Some productions are flagged as $$NEWLINE (or similar) meaning that they can + only be reduced as a whole number of lines. + When we see a newline, we record that until a non-newline is seen. + If we cannot shift and so choose to reduce: + if the production doesn't require a newline - we just reduce. + If it does, and we have recently seen a newline, again we reduce. + Otherwise, we signal an error and (maybe) shift until we see a newline. + NOTE the newline only counts if there are no indents. + + This means that + a = b + + c + + would be parsed as a statement, which I don't want. + But I do want + if foo: bar + else: baz + + to be a single statement. + So a new line in some context must force a reduction, like OUT. + When exactly does a newline force? + An OUT forces a reduction until the top symbol has the matching indent. + A NEWLINE ?? Must depend on the grammar. + + + + Q: Can this be only about resolving ambiguity and reporting errors? + If the default is to shift if we can, reduce otherwise, then + 2D causing an early reduce can resolve ambiguity. + + If certain productions must end at EOL, and error is thrown if they don't. + + That latter really must be a gramatically need to shift a newline. + + Maybe I'm wrong about reporting errors. Maybe they should always come from + a failure to shift or reduce. + + One of my problems was + if foo: if bar: dostuff + where the newline at the end needs to reduce everything, so I can + require at most one newline in the grammar. + It probably should be required for the first 'if' + + Stmt -> if Cond : StLstNL + StLstNL -> StLst NL + + StLst -> Stmt + | StLst ; Stmt + | SlLst NLs Stmt + | NLs StLst + + IfSt -> IfHead + + + NONO. I don't want some statements that end with NL and some that don't. + + - Record indent with preceding token and line-start with following + - when reducing all indents are accumulated into the new symbol, + and the line-start of first symbol is preserved. Other line-starts are dropped + - When OUT in lookahead, there is an indent within 'min_prefix' of top state, + decrement closest indent count and discard the OUT + - When OUT in lookahead and cannot be discarded and top state can be reduced, + reduce it + - When OUT canot be discarded or reduce, and error is triggered. + - No production can be reduced if it contains unmatched indents. If cannot shift, + then error. + - Some productions are marked as $LINE + - any state that contains an item at the start of a $LINE is marked 'starts line' + - if there is an indent more recent than a starts-line state, or the top + state starts_Line, newlines are ignored. + - if a newline is seen but not ignored, and the top state can be reduced, + it is reduced. + - A $LINE production can only be reduced by a newline, or, if it didn't start + at start-of-line, by an OUT + + so + Block -> : statementlist $LINE + + I want to be sure that + if cond { + statements + } else { + more + } + + is OK. Here the "IfHead" doesn't end on a newline, but the state that is + expecting 'Block' 'starts-line'. I guess as "Block -> { statementslist }" + isn't a $LINE, it can be reduced by 'else'. + +24oct2020 + + Maybe I don't want TK_out to force a reduce (mostly). Maybe the grammar + should be in control. So when we see an OUT we record it off-stack. + If we reduce for some other REASON and the top state contains an indent, + we cancel. + When we decide to SHIFT, if the top of stack has -ve indent, that is an + error. + So with C, + if (cond) { + s1; + s2; + s2; + + would show an error when s2 was shifted. + if (cond) + s1; + s2; + needs to show an error too. The "if(Cond)s1;" statement would have an + internal indent. + + So maybe the requirement is that no symbol has unbalanced internal + indents. + When we shift we require that top balances first. + So we count up outs and ins. There can be several OUTs and at most on + IN before a terminal. + If the terminal can be shifted, the top symbol must have no internal + indents after cancelling with out, and there must be no outstanding + TK_outs + If the terminal cannot be shifted we reduce and we must have enough + outs to balance anything within the reduction. + + So indents on the stack are always between symbols. Each frame has + a symbol, the indent after it etc. + + If we don't have enough outs to complete a reduction we raise an error + and need to discard (or shift/reduce) until we find an 'out'. + If there are too many outs to shift we again need an error which will + hopefully let us reduce and cancel some outs - just discarding states + will help with that. + + So: outs must cancel, following token must envourage reductions + + Some productions end with newline, their start state is 'starts-line', + and if there is an indent since a starts-line, we ignore newlines. + + If we don't ignore newlines, then there is a starts-line state that we + should reduce to. If we don't have enough symbols yet, we error. + So if we see a newline, we reduce until the newline would be ignored. + + If that state expects a newline and we don't have one, we error. + Except.. I sometimes want OUT to work like a newline. + If I need a newline and reduction isn't to start of line + + If state expects newline and we don't have one but do have OUTs + we reduce as long as there are enough outs and raise an error if + we end up at start of line. + + + Do I want to require that a newline be shifted? We can synthesize + one when minprefix is less than since newline and we have an out. + + + Summary of stuff. + A shift requires there be no pending OUT ?? + A reduce requires there be no pending indent, i.e that + indents within the reduction don't exceed outs. + NO this doesn't work. + A IN increments the indent on the top frame - never causes error + An OUT decrements indent in a frame within min_prefix, or error + A NEWLINE is either ignored or shifted, but is duplicated when + shifted. + After an OUT is handled, if a NEWLINE is expected but will only + reduce a partial line, it is provided. + + if foo: if bar: + stuff + bazz + + This effectively pretends there was a IN at the start of the "line", + and we are not doing two OUTs with a NEWLINE between. + if foo: + if bar: + stuff + baz + + + Do I really want to shift newlines: + yes: simpler grammer + no: don't need to record min-prefix. + + I think 'yes'. + + So discard $NEWLINE + Place NEWLINE in the grammar + When completing a state, if any production contains NEWLINE, record as + StartsLine + + + --- + My idea of ignoring newline when the top state is a starts_line state isn't + working. + The idea was that reducing + statement -> assignment NEWLINE + would leave at a starts_line start, so NEWLINEs would be ignored. + But this leaves us at a ' ... statement . ' state + e.g. " Statementlist -> Statement . " + which reduces to "Statementlist -> Statementlist . Statement" + + + +30oct2020 + + Review: because it isn't working (as usual). + + Indenting have 2 effects: + - it triggers errors when something looks wrong + - it causes newlines to be sometimes ignored. + + Specifically: + - A symbol that starts indented must finish while still indented. + So an OUT cannot cancel an IN except in the top phrase, and an attempt to + shift while there are pending OUTs is an error. + - A symbol that begins at the start of line and contains an indent may only + be the top symbol - shifting before an OUT cancels the indent is an error. + + - Newlines are ignored when there is an indent since a starts-line state + + Procedurally: + - There can be one IN and several OUTs pending - we update the counters + as they are seen. + - If there are pending OUTs and their is an indent in the top phrase, + we can cancel an indent. + - a SHIFT becomes an error if the top symbol has an indent and started + at beginning of line + - a SHIFT becomes an error if there are pending OUTs + - a SHIFT that is not an error first imposes the IN on the top of stack + - REDUCE may happen at any time. + + Newlines, when not ignored, are repeating terminals. An indefinite series + is generated until a state is reached where they are ignored. + The expectation is that they will be shifted (possibly after some reductions) + and then a reduction will happen which changes the state appropriately. + + It is vital that no recursive production keeps consuming newlines + foo -> foo NEWLINE + as that will be fed indefinitely, or if newlines are ignored, will never reduce. + + NEWLINEs are recognized if there is a starts_line state since the most recent + indent, but not when the current state is starts_line. So once we reach + starts_line, we ignore newlines until something else is shifted. + Or maybe only a starts_line state at the start of a line? + + A starts_line state has an item "head -> . stuff NEWLINE" + So if were just reduce 'foo -> foo NEWLINE' then we are at the end of foo, + so an earlier state with have ".foo" and so "foo -> . foo NEWLINE" and so will + be starts_line. So we will always accept a NEWLINE. So these must be disallowed. + + How does "if cond : if cond : print NEWLINE" parse? + "block -> : slist NEWLINE" + "slist -> statment | slist statement" + "statement -> print NEWLINE | if cond block NEWLINE" + + so starts_line are before statement or ':' - I don't want that as it would allow a newline + before ':' to be ignored. + Maybe I want starts AFTER a NEWLINE?? Did I do that before? + So after block, after statement, after slist.. but we expect NEWLINE there. + + Maybe the important states are "foo -> baz . stuff NEWLINE". + Once we've entered a path to a NEWLINE, we need to be ready for it. + So: after :, after print after if and cond and block + + The repeating newlines are hurting now. The rule to start ignoring them + isn't working. + ifhead else if expr : statementlist [NEWLINE] + the NEWLINE should be ignored. + There is an indent after ':' but final state is startsline because this could + be the end of a block -> : statementlist NEWLINE + But I want that newline to be ignored because of the indent. + + if (A) + b; + c; + why is that bad? How can we know? Because we reduce to start-of-line + and still have an indent. + + a = + b + c + + d + * + + I need some good examples to remind myself. + Why do I want min_prefix? I allow indents to be cancels within min_prefix + cmd -> { statements } + + +01nov2020 + ARRRGGHHH I keep going down rabbit holes. I need to take a more mathematical + approach and present clearly what I have and what I need. + The big issue, for the moment at least, is when to ignore newlines. + I REALLY like the idea of newlines being repeating until they are ignored. + I definitely like that they are ignored after an indent unless they are explicitly + expected. I need to ensure they are not expected at awkward places, like the + start of a line. + + I want to be careful of making LL assumptions about the LR grammar. i.e. I cannot + assume the end from the beginning - that is was LL does. I want the appearance + of a newline to only make a difference once it appears. But at the point that + it appears we need to know if it is to be ignored. In + a = b + NEWLINE + it isn't ignored event though it isn't expected exactly here. In + a = b + (IN) + c + NEWLINE + it is ignored and the difference is that no state since the IN expects a + newline. + + To acknowledge LR approach we need to take the latest state. The start of + line, before the 'a' is tempting but we hardly know anything there. After the 'a' + is tempting, but that would mean after a[b+c].thing(foo) as well which seems to + miss the point. + Maybe something earlier sets the expectation? + if cond : + The ':' says newline-separated things are coming. So it is the state + after the ':' which says 'newlines matter'. This will be after the indent. + So if we saw + if cond { + then maybe newlines don't matter (depending on grammar) + + But a newline is ignored *at* that state, so that blank lines can be skipped. + I think there is a 'statementlist' symbol and at the start or end of that symbol + newlines are ignored. Within that symbol unindented newlines are significant. + statementlist -> statement NEWLINE + | statementlist statement NEWLINE + + +02nov2020 + Can I drop the 'min-prefix' concept? + The means that OUT can only cancel against an indent 'in' or immediately before + the top symbol. This means we go back to keeping a 'starts_indented' flag and + storing each indent with the following symbol. + + Then the 'ignore newline' test is since_starts_line > since_indent + but since_indent is 0 fir there are internal indents and 1 if there is only a + starts_indented indent. + + So a TK_in simply sets starts_indented for next + TK_out is recorded + Whenever top symbol has indents and outs are recorded, we cancel. + If we want to shift with pending outs, that causes an error + + I need a reason to signal an error for + if (a) + b=1; + c=1; + based on indents. equally for + if (c) + b=1;c=1; + If indent is a continuation, then 'c=1;' but be part of 'if...' before + it is part of what comes before? + This means the OUT will not be able to see the IN ... unless the whole + statement list ends here.... + In "a = IN b + c ; " the ';' causes a reduction to Assignment with indent + so I must be happy for the 'c=' to reduce to 'ifstatement' with an indent, + but reducing to the recursive 'statementlist' seems wrong. Maybe these + left-recursive productions really are special. + But what about 'expr -> expr + expr' ?? + + The rule I'm depending on is that I cannot shift with a pending OUT. + So everything since the IN needs to compress to a single symbol. + If the code is + { + if (a) + b=1;c=1; + } + then the '}' will reduce to "{ statementlist" before shifting, so the OUT will be + cancelled. The question is: how can we prevent that in the grammar? + There are 2 OUTs (in this case) does that help? + A comparable case is + { + foo; + if (a) + b=1;c=1; + } + In that case the two OUTs will be cancelled at clearly different times, but not real + difference. + + { + foo; + if (a) + b=1;c=1; + bar; + } + + Now the 'bar' will be blocked from shifting.... no it won't. + + I need a new rule - this *must* be wrong, even when we ignore newlines. + But why? Because "b=1;c=1;" looks like a statementlist, but isn't. + It isn't because a statementlist doesn't belong here, ony a block or a statement. + How do I know that indent is continuing the 'if', not the statementlist. + I need some marking on statementlist to say "Cannot be continued by indent" + I guess that is what NEWLINE is for. + + So the rules would be "a suitably marked production cannot be reduced with + outstanding indents. If nothing can be shifted, an error must be raised. + These will typically be productions that end in a NEWLINE, but maybe any + production from a symbol that can produce a NEWLINE, or that is marked $$something + + Back to TK_newline. These are duplicated on shift, but discarded when not wanted. + They are not wanted unless there is a starts-line start below the current state, + and more recent than the most recent indent. + A starts-line state is any state which contains an item with a NEWLINE and dot + at the start. + + statementlist -> statement NEWLINE + | statementlist statement NEWLINE + + + Q: how do nested 'if's resolve? + if cond: + if cond2: + print + else: + die + + statement -> if cond block + | if cond block else block + + The tokenization would be + if cond : IN if cond2 : IN print NL OUT NL OUT IN else : IN die NL OUT NL OUT NL + 1 2 A 2 B 1 3 4 C 4 D 3 E + so NLs: + A makes 'print' a statment, + B makes 'if cond2' a statment, + C makes 'die' a statement + D is protected by IN/3 and is ignored. + E makes if cond .. else.. a statement + + What if I also had + + | if cond block NEWLINE else block + + Then on seeing the NEWLINE I wouldn't know whether to shift it, + or reduce to a statement. + To fix that I would need: + statementlist -> statement | statementlist statement + statement -> simplestatements NEWLINE + | if cond block NEWLINE + | if cond block else block NEWLINE + | if cond block else statement + | if cond block NEWLINE else block + | if cond block NEWLINE else statement + + But wait.... Allowing NEWLINE before 'else' confuses the parse of 'if cond:...' above. + NL-B makes "if cond block NEWLINE" and then the else can be shifted, but that is illegal + because of the negative indent. That means the negative indent must prevent SHIFT. + + + Do I still need to know which symbols are 'line-like' ?? + Yes, to make it easier to detect line-line productions which must contain unmatched indents. + Do I need recursive line-like detection? + For no-indent productions, it probably makes sense. + For starts-line detection? I don't think so. 'block' isn't the start of a line. + And not really needed for no-indent + + + Do I need to worry about left-recursive symbols? + I don't think so. There should always be some terminal - often NEWLINE - + which will ensure the symbol isn't extended??.. or if it did we would get + a parse error due to uncancelled OUT + + Do I need to split indents? + A stack frame holds "symbol state". Indents are within or before the symbol. + For cancelling with OUT, indents within and before the top state are equivalent. + For hiding newlines the indent before the symbol is too distant. f->next_indented ? "/" : "", + + So: yes. + Should I store them in the stack? + "symbol+indents indent state" + So when we see an indent, we mini-push that + Any indent since state protects newline + Out can be cancelled if top state has indents, or previous has trailing indent. + + Rather than an ignore_newline flag we need a outs_needed flag. This is how + many outs are needed before we process newlines. + If state is startsline, then outs_needed == infinity. + Else it is a count of indents since a starts-line state. + So maybe it is an indents_since_starts_line and we test state and indents. + + So: + 1/ rearrange frame to include 'indented' in the top frame + 2/ replace ignore_newline with indents_on_line + +14nov2020 + I'm getting close... + Problem is correctly ignoring of newlines. + I ignore them after an indent - that bit is easy. + But I also ignore them at a start-line symbol - sometimes. + + INDENT statementlist NEWLINE should ignore the newline + but + INDENT statementlist if expr : stl if expr : stl NEWLINE should not + + (/0s) Statementlist (11s) IfHead (5) else (22) if (3) Expression (19) : (29s) Statementlist (40s) [NEWLINE:22:0] - Ignore + + This shouldn't Ignore. Is that because there is a non-'s' since the '/' ?? + + i.e Ignore newlines if all states since indent are startsline, or if none are. + Don't count current state if it is startsline + indents_on_line > outs || all_startsline(...) + + No - better, but no banana. if expr { statement; newline + need to ignore even without the indent. + So I need a new type of state - one that is at the start of Statementlist but + not at the end. State 29 in the above, but not 40 + +(/0s) Statementlist (11s) if (3) Expression (19) { (30s) Statementlist (39s) [NEWLINE:34:0] - ERROR + +18nov2020 + So: + I introduce "startsline" and "endsline" states. + A state before a linelike symbol is startsline, a state after such a + symbol is endsline. A state can be both, in which case we consider it 'endsline'. + So 11 endsline, 30 startsline 39 endsline + + If there is an indent since the last startsline or endsline state, we ignore NEWLINE. + If there are only start/end line states since indent, ignore newline + if there are only endsline states since startsline, ignore newline - NO. + + + I wonder if maybe I should have IN and OUT in the grammar. + The tokens can still appear anywhere, but the production can only be reduced of there is an IN there. + Block -> : IN Statementlist OUT | : SimpleStatements + might get messy. But it already is messy. + +20jan2021 + Where am I? + The big questions seems to be: when can/must I ignore NEWLINEs? + Conversely when do they REDUCE or get SHIFTed? + The answer must lie in what appears between the most recent INDENT and the NEWLINE. + - if there are no startsline states, we can ignore. + + It seems that I need to be able to see the start of a block, either an indent, or a {.. + But if there is a statementlist, surely a block started? No, that misses a point. + INDENT ignores newlines because there will be a matching OUT + Maybe { causes NEWLINEs to be ignored because there is a matching } expected? + + If a startesline state is at the end of a productions, it plays no role in NEWLINEs, + but if it is followed by something + +23jan2021 + review/summary. + We track indents and newlines. The goal is resolve ambiguities and detect errors. + Ambiguities are resolved by forcing a REDUCE in some circumstances when an OUT or NEWLINE is seen. + Errors happen when an there are too many OUTs. + NEWLINEs are a normal part of a grammar, except that they get ignored sometimes when they are not relevant. + and are protected. + + 1/ a production from a lineline symbol cannot be reduced while it contains an unbalanced indent, + and so must be reduced before an excess OUT token is processed. + 2/ NEWLINES are ignored if there is an IN since a start-of-line state + + Otherwise NEWLINEs must be in the grammar. This can be awkward where they are optional + such as before an 'else'. To handle the fact that a structured statement can be multiple lines + or few we need to build it up from the possible line-like parts. + Maybe this means that ifpart, elsepart, whilepart, casepart etc need to be statements which + are combined above the grammar level?? + The reason has something to do with reducing to Statement when the newline is shifted. + I wonder if that is really needed. + if we have "ifpart -> if cond block" then that can be followed by elsepart, but we + need a separate "ifpartNL -> ifpart NEWLINE | ifpartNL NEWLINE" which can also be followed + by an elsepart, or can reduce to a statement + ..but no. That isn't sufficient if NEWLINEs are indefinitely duplicated. + So if we don't ignore newlines where they aren't needed, we cannot duplicate them. + + The need for duplicating was constructs like + if cond : if cond2 : action + which needed 2 NEWLINEs, one for each statement. + Maybe the "if cond2 : action" needs to be a simplestatement. + When we see the NEWLINE we can (must?) reduce anything that started since + a start-of-line, which creates the simplestatement. + + So let's try dropping the infinite repeat and require newlines in the grammar + + Extra rules: + - a production from lineline symbol cannot be reduced while it contains unbalanced + indent + - a NEWLINE is ignored if no startsline state since indent + - a NEWLINE cannot be shifted unless top startsline is at start of line. + + So a NEWLINE is: + - ignored if not startsline since indent + - forces REDUCE if top startsline is not at start of line !!!!. i.e. don't shift + - else can be SHIFTed. + +26jan2021 - happy australia/invasion day + I want ": simplestatements" to be a valid block, but not when there is an + indent in the middle + Previously this was resolved by having an SSLine -> simplestatements NEWLINE + and the SHIFT of the NEWLINE outweights the reduction to block. + And I have that now with Statement -> SimpleStatements NEWLINE... + Ahh.. it is because I put Newlines before Statementlist. I guess a NEWLINE + needs to be a Statement + + I currently suppress shift unless "outs == 0". That means + if cond: + statement + else: + doesn't shift the 'else' because the "out" isn't resolved. + Without the leading space, the NEWLINE reduces to ifstatement, and I sort-of know + that is an issue. + So: why suppress SHIFT when there is an outstanding OUT... or better Question, + why is this OUT still outstanding? + Ahhh.. I misunderstood. I DO want to reduce that 'if' because it is nested. + I have an IfStatement what needs to reduce to a Statement and merge with a + Statementlist, so the OUT can be cancelled and then the 'else' shifted. + BUT as 'else' is the lookahead.... I need NEWLINE to reduce IfStatement to + Statement. But even with a NEWLINE we don't SHIFT that because we still have + that OUT to clear. So how do I clear an OUT ?? It can only be cancelled when + the matching IN reaches TOS, which requires the NEWLINE to be shifted. + We cannot shift the newline early, partly because that would be wrong, and + partly because it gets reduced to a statement and disappears. + + Hmmm. After blocking SHIFT(NL) when we don't have a line I see + progress but not there yet. + if false { + print = OK + } else { + + After the OK is NL OUT NL + The first NL is shifted and reduces down to Statementlist and the OUT cancels. + But now the NL doesn't belong. Can I explicitly allow NL before } ?? + No, because it doesn't even SHIFT. It cannot because the Statementlist doesn't + appear to be sol - the OUT hide that and it now looks like + if expr { statementlist + + Arggh. I think I need to hide the starts-line state in this context. + More thinking needed. + + I introduced the shift-block to make sure the NEWLINE paired with the right line. + Here the NEWLINE should probably be ignored. But why? It isn't indented, but it + is inside a multiline thingy. + Might be better to explicitly allow it and find some way to let it be SHIFTed. + Problem is that the Statementlist is linelike, but I don't want that here. + Hard to avoid when the statements are and must be linelike. + Maybe I need a way to hide linelike-ness. + + Maybe a state should only be startline if the core item has dot followed by + a single symbol (which can derive a newline) ?? -- 2.43.0