devel - working on twod

author NeilBrown <neil@brown.name>

Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)

committer NeilBrown <neil@brown.name>

Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)
author NeilBrown <neil@brown.name>
Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)
committer NeilBrown <neil@brown.name>
Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)
diff --git a/00-TODO b/00-TODO

index 8c0efc77bafe710ae709f0a582f47232ed876ca7..5675fdb226b41beb48c4ba1f0111494811589e7c 100644 (file)
--- a/00-TODO
+++ b/00-TODO
@@ -1,24 +1,24 @@
  This is a living document - delete things when done.  Avoid discussion.
  
  Current version (Cataract Creek)
-- Warn when left-recursive symbols appear elsewhere, other than at the end
-  of a production.  Might have to special-case Newlines.
-- parser not to get into ERROR infinite loop
-- sort 'virtual' symbols to end
-- allow $xy instead of $3.  Chooses shortest bodysym with xy in that order
-    $xy_2 gives the second one
-- allow $TERM terminals to be listed.  If so, extras are errors
  - structs
-   - const fields
+   - const fields ... what does that mean?  Assign once as initialization?
+      Can be used for array size? What else?
     - anonymous field - array or struct (or pointer to these)
-      multiple anon struct ar allowed if they don't conflict
+      multiple anon struct are allowed if they don't conflict
     - [] can apply to anon array field
     - anon struct field gets fields interpolated
  
+   - anon fields have no name before the :, so
+       struct foobar:
+               :content
+               size:number
+
  - manifest values for arrays and structs [a,b,c]
      or [.foo=a, .bar=b] or [ [1]=a, [2]=b]
     That last doesn't parse easily, unless we require tags... not a good idea.
     [ .[1] = a, .[2] = b ] ?? Maybe.
+   or () to group, [] for index. To (.foo=a) ( [1] = b )
  - yet more operators
       << >> #
       bit-ops & | ~ &~
diff --git a/Blog-pointers b/Blog-pointers

new file mode 100644 (file)

index 0000000..59b943f
--- /dev/null
+++ b/Blog-pointers
@@ -0,0 +1,243 @@
+
+Pointers is one of the particular areas where I wanted to innovate in
+the design of Ocean.
+
+Traditionally (in lanuages like C) pointers are both powerful and
+dangerous.  My goal is to control that power without extinguishing it,
+so they become powerful with more clarity, and without the danger.
+
+Pointers have always been overloaded - the can store either a pointer
+to an actual object, or a 'nil' (or 'NULL') pointer, which isn't an
+object and cannot be dereferenced safely.  This provides power (it
+enables simple finite linked lists) and danger (it is too easy to try
+to dereference the `nil` pointer).  In Ocean, pointers that can be
+`nil` are a different (though related) type to those which cannot.  A
+pointer which can be `nil` may only be dereferenced after a test
+confirms that it isn't, in fact, `nil`.  This requires more careful
+notation of the types of all pointers, but is easily checked by the
+compiler so notation errors are easily caught and corrected.
+
+As well as allowing `nil`, it is sometimes useful to allow other
+non-pointer values to be stored in a pointer object.  The Linux kernel
+does this a lot, and the are two specific cases that are used.
+In the first case, small integers (normally negative) can be stored in
+a pointer variable.  These are used as error codes.  As the first page
+of virtual memory and also the last page are not mapped, any pointer
+with an absolute numeric value less than the page size cannot be a
+valid pointers and can be treated as something else.  Zero is used for
+the `nil` pointer, negative numbers are used as errors, positive
+numbers are largely unused.  Ocean will make this pattern explicit in
+the types: a pure pointer will not have one of these values and can be
+dereferenced.  A loaded pointer might have a reserved value and can
+only be dereferenced after a test.  If the test fails, the numeric
+value can be extracted.
+
+The second case relies on the fact that on all supported
+architectures, any pointer to a 16-byte (or larger) value must be (at
+least) 2-byte aligned, so the numeric value of the pointer will be
+even.  This means that odd values cannot be valid pointers, so those
+values can be used for something else.  This provides a secord, or
+third, level of "loading" that can be declared for a pointer.  The
+Linux kernel uses these in hashtables that support lockless lookups
+and movement of entries between chains.  The loaded value is treated
+like a `nil` pointer, but records which hash chain has come to an
+end.  If a search find the wrong hash value at the end of the chain,
+it know that it might have been diverted between chains and needs to
+re-check.
+
+The odd pointer values can also be used in a different way.  In each
+of indicating a different interpretation of the whole pointer, the
+least significant bit can be treated as a separate value that is
+stored in the pointer - to optimize space usage.  A particular use for
+this is as a lock bit, again in a hash table though this time in the
+head.  Before a thread may change a hash chain, it must atomically
+change the least significant bit from zero to one. After the change,
+it can clear the bit, possibly as a sige effect of updating that first
+pointer.
+
+The Ocean language shouldn't insist that the various non-pointer
+values are error codes, hash values, or lock bits, but it will make
+them available for the programmer to use in a controlled way, and will
+ensure that a pointer that can contain non-pointer values is always
+properly checked.
+
+Apart from `nil` pointer dereferences, the main danger with pointers
+involves the interaction with memory allocation and freeing.  If
+memory is released (and then reused) while a pointer to the memory is
+held, future dereferences of that pointer are likely to cause
+problems.  In order to avoid these errors, the language must ensure
+that no such references remain when memory is freed.  Once it does
+that, it can help the programmer by actively freeing the memory once
+the number of valid references reaches zero.
+
+This tracking of references is sometimes done using a process referred
+to as "garbage collection".  All pointer that are (or could be)
+active, and all locations they point to are marked.  Anything not-marked
+can be freed.  This has a degree of conceptual simplicity and a degree
+of practical inelegance.  The implementation can supposedly be made
+quite efficient, but it isn't the approach that I want to take.
+
+The alternate approach is to enhance the type system so that the
+language "knows" when a pointer is active or not, and how many active
+pointers there are to any given location.  When there is only one
+pointer and it become inactive, the memory can be free.  The simplest
+implementation of this pattern is to store a reference-counter in each
+allocated object, and to use it to keep count of all references.  This
+is a useful pattern and in some circumstances it is the best possible
+pattern, but it is not the only valid pattern.  Ocean will support
+this natively by allowing a field on a structure to be identified as
+reference counter, but it will allow other options as well.
+
+Points can be classified as either "owned" or "borrowed" references.
+This will probably be determined dynamically from code analysis,
+though in some cases such as function parameters and returns, it must
+be declared.  An owned reference is one that holds a reference count
+or by some other mechanism is an independent reference which will
+prevent the object from being freed, until some action happens on the
+owned references to releasse it.  A "borrowed" reference is a
+temporary reference that exists under the umbrella of some
+identifiable owned reference.  The language definition will require that
+the borrowed reference have limited scope, and that the related owned
+reference will not be released while that scope is still active.
+
+A simple example of a borrowed reference happens when an owned
+reference is passed as a parameter to a function which declares the
+formal parameter to be a borrowed reference.  Inside the function, the
+borrowed reference cannot be used to release the reference, and the
+owned reference much remain intact for the duration of the function
+call.
+
+It may be that the owned reference cannot be uniquely identified.  As an
+example, there might be two owned references, and an conditional
+statement assigns one or the other to a borrowed pointer.  In this case
+it isn't possible to know at time of analysis which owned pointer is the
+primary, so both (or all) possible primaries must be preserved.
+
+The "Scope" of the borrowed reference that the owned reference must be
+preserved for will often a a lexical scope, but may not always be.  I
+imaging that the language may at some stage be able describe other
+temporal relationships between different objects.  In that case, the
+borrowed reference must have a life time contained within the
+guaranteed lifetime of the matching borrowed pointer.
+
+I should say that "Garbage collection" will be an option, just not the
+only or the default.  Language can potentially identify exactly the
+references that need to be checked.  Maybe collectable pointers
+reserve the lsb for mark, prior to sweep.
+compiler would extract a description of places that gc refs can live,
+and encourage them in specific domains.
+
+We might also make an allocation domain a well defined concept.
+
+## Other Ownership
+
+There are a variety of sorts of owned references.  They include:
+
+- counted references.  These are created by incrementing a
+  reference-count field, which is then decremented when the reference
+  is destroyed.
+
+- single references.  Some objects are declared as only ever having a
+  single owned reference.  This can be moved from pointer to pointer
+  but never duplicated.  When the reference is destroyed, so is the
+  object.
+
+- inherited references.  An ....
+
+## Rules for borrowing references.
+
+Borrowed references can live in automatic variables and in structures
+(including arrays).  They can only live here as long as the owned
+reference is stable.
+For automatic variables, we can (hopefully) deduce the relevant owned
+references - it is what was copied to get a borrowed ref.
+For refs in structures, we need to be explicit for borrowed refs
+That means we need a language and a precise understanding.
+We would need a parent or sibling or child object to have the ref..
+Maybe we just declare the name of the object, where it has to be
+passed in as an arg to be part of a parent.
+
+
+----------------------
+Later...
+
+How much of this could/should be implemented as Smart Pointers?
+Rust uses smart pointers to implement
+ Nullable (Option<>), RefCount (rc<>) Atomic refcount (arc<>) and
+ others.
+It even has Box<> to create on heap instead of stack.
+
+Why doesn't I just do that?  Partly because I don't have classes yet!!
+
+I like a simple syntax to test if a pointer is over-loaded.
+   if pointer?
+and to get the overload value
+   pointer!
+
+Rust would use pattern matching.
+   if let Some(p) = pointer {
+      use p
+   }
+
+in place of
+   if pointer? :
+      use pointer!
+
+Rc() uses Rc::clone() to take a reference.  I would rather that were
+transparent.
+
+Rc and ARc need to be different.  I don't want the difference to be
+that obvious.
+
+Box<> is interesting.  It is the gateway to the heap.  Maybe it is
+just syntax.
+
+I want to be able to store an owned pointer in an ADT without telling
+the ADT the precise type.  Either it returns an object to be freed by
+caller, or a 'free' function is passed in.
+... NAH this can be done with normal type parameterization.
+
+So the ownership-type needs to be transparent to a degree.
+i.e. I can have:
+ - a borrowed pointer
+ - an owned pointer of unknown disposition
+ - an owned pointer of specific discipline.
+
+Rust uses & to specify a borrowed reference.
+
+Maybe
+  ^type  is a borrowed reference
+  @type  is an owned reference with discipline defined by type.
+  discipline@type is an owned reference of a given discipline.
+
+
+How do I want to handle mutability?
+Function can usefully have 'const' markers.
+What are the risks when single-threaded?  We could extract a value,
+accidentally change something, then depend on the value.
+We could probably do that anyway...
+
+Pointers are normally (default) overloaded with a small integer. +/2047
+Pointers can have status attributes when stored in structures or
+passed to or from functions.
+ - "pure" means there is no small integer
+ - "loaded" means there is large-integer overloading - if lsb set.
+ - "init" means the target is being initialised
+ - "exit" means it can be torn down
+ - other words can reflect stages in life cycle
+ - still other words might reflect locking status.
+There can be several locking statuses, and several life-time statuses,
+as these can affect different fields differently.
+
+In a struct each field can have a mapping from internal states to
+external state. (init=foo, locked=foobar, rlocked=baz)
+And internal state not listed in inheritet from external states (a=a).
+
+For a function, returned value attribute can be copied from args, as
+can type I suspect
+  fun foo(a:thing, b:thing) : <a|b>
+
+or better
+
+  fun foo<x:thing>(a:x, b:x) : x
+
diff --git a/Ocean-introspection b/Ocean-introspection

new file mode 100644 (file)

index 0000000..b809622
--- /dev/null
+++ b/Ocean-introspection
@@ -0,0 +1,59 @@
+Here I collect thoughts about introspection.
+
+
+Introspection can be at compile-time as well as runtime, and can have
+quite a different role.
+In edlib/python I use the documentation string of class methods to
+indicate what 'key' they respond to.  I cannot easily do that in C,
+but could if I have compile-time introspection.  I could do it
+with a sed script that extracts an include file ...
+
+The language would need to allow for extra stuff to be parsed into
+an AST, then the introspection code would run over that AST before
+it was further analysed.  This a bit like gcc pluggins, but
+is writting if the language being compiled, and compiled like
+any other code.
+
+This would be part of a larger feature that any code could run at
+compile time instead of later.  Constant-folding is already a common
+practice and is an example of this.
+
+I would need a simple and stable AST ADT to be manipulated.  Most uses
+would be relatively simple, like collection functions into an array,
+or pre-parsing some internal data structure, possibly sorting it.
+
+It could be as simple as allowing strings in lots of places, and 
+expecting the introspection to parse them - along with various object
+names - and do stuff.
+
+
+Separately there is runtime introspection.
+This would be needed to handle anything not known at compile time.
+
+Mostly this is probably underspecified type.  An 'sprintf'
+type function to be introspected and transformed at compile time
+providing the types of everything could be determined.  And if the
+type isn't known... recording introspection info is probably not easier
+that recording the appropriate dispatch function.
+
+Then there is link-time information.  Compile-time introspection needs
+to all be run at compile-time.  If I load a foreign module at run time
+(or link time), it cannot ask questions that weren't already asked.
+
+Maybe a "standard" compile-time introspector could encode any info
+needed for link-time or runtime introspectors.
+
+
+Introspection could be used for:
+
+ - preparsing data structures
+       e.g. sorting, adding links, ..
+ - annotating for run-time 
+       e.g. storing table of function, with names
+       providing serialization, de-serialization
+ - providing default values for missing args
+ - transforming n-ary args to arrays
+ - adding appropriate dispatch info when passing args to
+      a function that expects objects
+ - adding coverage tracking
+ - adding performance tracking
diff --git a/ooooo.tex b/ooooo.tex

old mode 100755 (executable)

new mode 100644 (file)
diff --git a/thoughts b/thoughts

old mode 100755 (executable)

new mode 100644 (file)
diff --git a/twod b/twod

index bde8aae6cb3e886c7dc71c90c6a51b4aa55ae945..e6db5cd230a0f119ba4b9fdaac620acb901d6b5c 100644 (file)
--- a/twod
+++ b/twod
@@ -2344,3 +2344,650 @@ iftail -> else block
  ifhead -> Newlines if expr block
  ifstatement -> ifhead iftail
  ]]
+
+
+13oct2020 - much later.
+
+ Maybe a different approach to newlines.  Never shift them, only use
+ them like indent to guide parsing.
+
+ IN is just recorded
+ OUT must reduce until top symbol contains an IN, which is cancelled.
+
+ Some productions are flagged as $$NEWLINE (or similar) meaning that they can
+ only be reduced as a whole number of lines.
+ When we see a newline, we record that until a non-newline is seen.
+ If we cannot shift and so choose to reduce:
+    if the production doesn't require a newline - we just reduce.
+    If it does, and we have recently seen a newline, again we reduce.
+    Otherwise, we signal an error and (maybe) shift until we see a newline.
+    NOTE the newline only counts if there are no indents.
+
+ This means that
+    a = b
+    + c
+
+ would be parsed as a statement, which I don't want.
+ But I do want
+   if foo: bar
+   else: baz
+
+ to be a single statement.
+ So a new line in some context must force a reduction, like OUT.
+ When exactly does a newline force?
+ An OUT forces a reduction until the top symbol has the matching indent.
+ A NEWLINE ?? Must depend on the grammar.
+
+ 
+
+ Q: Can this be only about resolving ambiguity and reporting errors?
+   If the default is to shift if we can, reduce otherwise, then
+   2D causing an early reduce can resolve ambiguity.
+
+   If certain productions must end at EOL, and error is thrown if they don't.
+
+   That latter really must be a gramatically need to shift a newline.
+
+ Maybe I'm wrong about reporting errors.  Maybe they should always come from
+ a failure to shift or reduce.
+
+ One of my problems was
+   if foo: if bar: dostuff
+ where the newline at the end needs to reduce everything, so I can
+ require at most one newline in the grammar.
+ It probably should be required for the first 'if'
+
+  Stmt -> if Cond : StLstNL
+  StLstNL -> StLst NL
+
+  StLst -> Stmt
+       | StLst ; Stmt
+       | SlLst NLs Stmt
+       | NLs StLst
+ 
+  IfSt -> IfHead
+
+
+  NONO.  I don't want some statements that end with NL and some that don't.
+
+  - Record indent with preceding token and line-start with following
+  - when reducing all indents are accumulated into the new symbol,
+    and the line-start of first symbol is preserved. Other line-starts are dropped
+  - When OUT in lookahead, there is an indent within 'min_prefix' of top state,
+    decrement closest indent count and discard the OUT
+  - When OUT in lookahead and cannot be discarded and top state can be reduced,
+    reduce it
+  - When OUT canot be discarded or reduce, and error is triggered.
+  - No production can be reduced if it contains unmatched indents.  If cannot shift,
+    then error.
+  - Some productions are marked as $LINE
+  - any state that contains an item at the start of a $LINE is marked 'starts line'
+  - if there is an indent more recent than a starts-line state, or the top
+    state starts_Line, newlines are ignored.
+  - if a newline is seen but not ignored, and the top state can be reduced, 
+    it is reduced.
+  - A $LINE production can only be reduced by a newline, or, if it didn't start
+     at start-of-line, by an OUT
+
+  so 
+    Block -> : statementlist  $LINE
+   
+  I want to be sure that
+    if cond {
+          statements
+    } else {
+          more
+    }
+
+  is OK. Here the "IfHead" doesn't end on a newline, but the state that is
+  expecting 'Block' 'starts-line'.  I guess as "Block -> { statementslist }"
+  isn't a $LINE, it can be reduced by 'else'.
+
+24oct2020
+
+ Maybe I don't want TK_out to force a reduce (mostly).  Maybe the grammar
+ should be in control. So when we see an OUT we record it off-stack.
+ If we reduce for some other REASON and the top state contains an indent,
+ we cancel.
+ When we decide to SHIFT, if the top of stack has -ve indent, that is an
+ error.
+ So with C,
+  if (cond) {
+       s1;
+       s2;
+  s2;
+
+ would show an error when s2 was shifted.
+  if (cond)
+       s1;
+       s2;
+ needs to show an error too.  The "if(Cond)s1;" statement would have an
+ internal indent.
+
+ So maybe the requirement is that no symbol has unbalanced internal
+ indents.
+ When we shift we require that top balances first.
+ So we count up outs and ins.  There can be several OUTs and at most on
+ IN before a terminal.
+ If the terminal can be shifted, the top symbol must have no internal
+ indents after cancelling with out, and there must be no outstanding
+ TK_outs
+ If the terminal cannot be shifted we reduce and we must have enough
+ outs to balance anything within the reduction.
+
+ So indents on the stack are always between symbols. Each frame has
+ a symbol, the indent after it etc.  
+
+ If we don't have enough outs to complete a reduction we raise an error
+ and need to discard (or shift/reduce) until we find an 'out'.
+ If there are too many outs to shift we again need an error which will
+ hopefully let us reduce and cancel some outs - just discarding states
+ will help with that.
+
+ So: outs must cancel, following token must envourage reductions
+
+ Some productions end with newline, their start state is 'starts-line',
+ and if there is an indent since a starts-line, we ignore newlines.
+
+ If we don't ignore newlines, then there is a starts-line state that we
+ should reduce to.  If we don't have enough symbols yet, we error.
+ So if we see a newline, we reduce until the newline would be ignored.
+
+ If that state expects a newline and we don't have one, we error.
+ Except..  I sometimes want OUT to work like a newline.
+ If I need a newline and reduction isn't to start of line
+
+ If state expects newline and we don't have one but do have OUTs
+ we reduce as long as there are enough outs and raise an error if
+ we end up at start of line.
+
+
+ Do I want to require that a newline be shifted?  We can synthesize
+ one when minprefix is less than since newline and we have an out.
+
+
+ Summary of stuff.
+  A shift requires there be no pending OUT ??
+  A reduce requires there be no pending indent, i.e that
+       indents within the reduction don't exceed outs.
+    NO this doesn't work.
+  A IN increments the indent on the top frame - never causes error
+  An OUT decrements indent in a frame within min_prefix, or error
+  A NEWLINE is either ignored or shifted, but is duplicated when
+  shifted.
+  After an OUT is handled, if a NEWLINE is expected but will only
+  reduce a partial line, it is provided.
+
+  if foo: if bar:
+             stuff
+          bazz
+
+  This effectively pretends there was a IN at the start of the "line",
+  and we are not doing two OUTs with a NEWLINE between.
+   if foo:
+       if bar:
+          stuff
+       baz
+
+
+  Do I really want to shift newlines:
+    yes: simpler grammer
+    no: don't need to record min-prefix.
+
+ I think 'yes'.
+
+ So discard $NEWLINE
+ Place NEWLINE in the grammar
+ When completing a state, if any production contains NEWLINE, record as
+   StartsLine
+
+
+ ---
+ My idea of ignoring newline when the top state is a starts_line state isn't
+ working.
+   The idea was that reducing
+        statement -> assignment NEWLINE
+   would leave at a starts_line start, so NEWLINEs would be ignored.
+   But this leaves us at a  ' ... statement . ' state
+    e.g. " Statementlist -> Statement . "
+  which reduces to "Statementlist -> Statementlist . Statement"
+
+
+
+30oct2020
+
+  Review: because it isn't working (as usual).
+
+  Indenting have 2 effects:
+        - it triggers errors when something looks wrong
+        - it causes newlines to be sometimes ignored.
+
+  Specifically:
+    - A symbol that starts indented must finish while still indented.
+      So an OUT cannot cancel an IN except in the top phrase, and an attempt to
+      shift while there are pending OUTs is an error.
+    - A symbol that begins at the start of line and contains an indent may only
+      be the top symbol - shifting before an OUT cancels the indent is an error.
+    
+    - Newlines are ignored when there is an indent since a starts-line state
+
+  Procedurally:
+    - There can be one IN and several OUTs pending - we update the counters
+      as they are seen.
+    - If there are pending OUTs and their is an indent in the top phrase,
+      we can cancel an indent.
+    - a SHIFT becomes an error if the top symbol has an indent and started
+      at beginning of line
+    - a SHIFT becomes an error if there are pending OUTs
+    - a SHIFT that is not an error first imposes the IN on the top of stack
+    - REDUCE may happen at any time.
+
+   Newlines, when not ignored, are repeating terminals. An indefinite series
+   is generated until a state is reached where they are ignored.
+   The expectation is that they will be shifted (possibly after some reductions)
+   and then a reduction will happen which changes the state appropriately.
+
+   It is vital that no recursive production keeps consuming newlines
+     foo -> foo NEWLINE
+   as that will be fed indefinitely, or if newlines are ignored, will never reduce.
+
+   NEWLINEs are recognized if there is a starts_line state since the most recent
+   indent, but not when the current state is starts_line.  So once we reach
+   starts_line, we ignore newlines until something else is shifted.
+   Or maybe only a starts_line state at the start of a line?
+
+   A starts_line state has an item "head -> . stuff NEWLINE"
+   So if were just reduce 'foo -> foo NEWLINE' then we are at the end of foo,
+   so an earlier state with have ".foo" and so "foo -> . foo NEWLINE" and so will
+   be starts_line.  So we will always accept a NEWLINE. So these must be disallowed.
+
+   How does "if cond : if cond : print NEWLINE" parse?
+    "block -> : slist NEWLINE"
+    "slist -> statment | slist statement"
+    "statement -> print NEWLINE | if cond block NEWLINE"
+
+   so starts_line are before statement or ':' - I don't want that as it would allow a newline
+   before ':' to be ignored.
+   Maybe I want starts AFTER a NEWLINE??  Did I do that before?
+   So after block, after statement, after slist.. but we expect NEWLINE there.
+
+   Maybe the important states are "foo -> baz . stuff NEWLINE".
+   Once we've entered a path to a NEWLINE, we need to be ready for it.
+   So: after :, after print after if and cond and block
+   
+   The repeating newlines are hurting now.  The rule to start ignoring them
+   isn't working.
+    ifhead else if expr : statementlist [NEWLINE]
+   the NEWLINE should be ignored.
+   There is an indent after ':' but final state is startsline because this could
+   be the end of a block -> : statementlist NEWLINE
+   But I want that newline to be ignored because of the indent.
+
+   if (A)
+        b;
+        c;
+   why is that bad?  How can we know?  Because we reduce to start-of-line
+   and still have an indent.
+
+    a =
+      b + c
+        + d
+        *
+
+  I need some good examples to remind myself.
+   Why do I want min_prefix? I allow indents to be cancels within min_prefix
+    cmd -> { statements }
+
+
+01nov2020
+ ARRRGGHHH  I keep going down rabbit holes.  I need to take a more mathematical
+ approach and present clearly what I have and what I need.
+ The big issue, for the moment at least, is when to ignore newlines.
+ I REALLY like the idea of newlines being repeating until they are ignored.
+ I definitely like that they are ignored after an indent unless they are explicitly
+ expected.  I need to ensure they are not expected at awkward places, like the
+ start of a line.
+
+ I want to be careful of making LL assumptions about the LR grammar. i.e. I cannot
+ assume the end from the beginning - that is was LL does.  I want the appearance
+ of a newline to only make a difference once it appears.  But at the point that
+ it appears we need to know if it is to be ignored. In
+   a = b + NEWLINE
+ it isn't ignored event though it isn't expected exactly here. In
+   a = b + (IN)
+       c + NEWLINE
+ it is ignored and the difference is that no state since the IN expects a
+ newline.
+
+ To acknowledge LR approach we need to take the latest state.  The start of 
+ line, before the 'a' is tempting but we hardly know anything there.  After the 'a'
+ is tempting, but that would mean after a[b+c].thing(foo) as well which seems to
+ miss the point.
+ Maybe something earlier sets the expectation?
+  if cond :
+ The ':' says newline-separated things are coming. So it is the state
+ after the ':' which says 'newlines matter'.  This will be after the indent.
+ So if we saw
+   if cond {
+ then maybe newlines don't matter (depending on grammar)
+
+ But a newline is ignored *at* that state, so that blank lines can be skipped.
+ I think there is a 'statementlist' symbol and at the start or end of that symbol
+ newlines are ignored.  Within that symbol unindented newlines are significant.
+  statementlist -> statement NEWLINE
+                | statementlist statement NEWLINE
+
+ 
+02nov2020
+ Can I drop the 'min-prefix' concept?
+ The means that OUT can only cancel against an indent 'in' or immediately before
+ the top symbol.  This means we go back to keeping a 'starts_indented' flag and
+ storing each indent with the following symbol.
+
+ Then the 'ignore newline' test is since_starts_line > since_indent
+ but since_indent is 0 fir there are internal indents and 1 if there is only a
+ starts_indented indent.
+
+ So a TK_in simply sets starts_indented for next
+   TK_out is recorded
+  Whenever top symbol has indents and outs are recorded, we cancel.
+  If we want to shift with pending outs, that causes an error
+  
+  I need a reason to signal an error for
+       if (a)
+             b=1;
+             c=1;
+  based on indents.  equally for
+       if (c)
+             b=1;c=1;
+  If indent is a continuation, then  'c=1;' but be part of 'if...' before
+  it is part of what comes before?
+  This means the OUT will not be able to see the IN ... unless the whole
+  statement list ends here....
+  In  "a = IN b + c ; " the ';' causes a reduction to Assignment with indent
+  so I must be happy for the 'c=' to reduce to 'ifstatement' with an indent,
+  but reducing to the recursive 'statementlist' seems wrong.  Maybe these
+  left-recursive productions really are special.
+  But what about 'expr -> expr + expr' ??
+
+  The rule I'm depending on is that I cannot shift with a pending OUT.
+  So everything since the IN needs to compress to a single symbol.
+  If the code is
+   {
+       if (a)
+          b=1;c=1;
+   }
+  then the '}' will reduce to "{ statementlist" before shifting, so the OUT will be 
+  cancelled.  The question is: how can we prevent that in the grammar?
+  There are 2 OUTs (in this case) does that help?
+  A comparable case is
+   {
+      foo;
+      if (a)
+        b=1;c=1;
+   }
+  In that case the two OUTs will be cancelled at clearly different times, but not real
+  difference.
+
+  {
+    foo;
+    if (a)
+      b=1;c=1;
+    bar;
+  }
+
+  Now the 'bar' will be blocked from shifting.... no it won't.
+
+  I need a new rule - this *must* be wrong, even when we ignore newlines.
+  But why?  Because "b=1;c=1;" looks like a statementlist, but isn't.  
+  It isn't because a statementlist doesn't belong here, ony a block or a statement.
+  How do I know that indent is continuing the 'if', not the statementlist.
+  I need some marking on statementlist to say "Cannot be continued by indent"
+  I guess that is what NEWLINE is for. 
+
+  So the rules would be "a suitably marked production cannot be reduced with
+  outstanding indents.  If nothing can be shifted, an error must be raised.
+  These will typically be productions that end in a NEWLINE, but maybe any
+  production from a symbol that can produce a NEWLINE, or that is marked $$something
+
+ Back to TK_newline.  These are duplicated on shift, but discarded when not wanted.
+ They are not wanted unless there is a starts-line start below the current state,
+ and more recent than the most recent indent.
+ A starts-line state is any state which contains an item with a NEWLINE and dot
+ at the start.
+
+  statementlist -> statement NEWLINE
+                | statementlist statement NEWLINE
+
+
+ Q: how do nested 'if's resolve?
+     if cond:
+        if cond2:
+            print
+      else:
+         die
+
+   statement -> if cond block
+              | if cond block else block
+   
+   The tokenization would be 
+     if cond : IN if cond2 : IN print NL OUT NL OUT IN else : IN die NL OUT NL OUT NL
+                1             2       A   2  B   1   3         4     C   4  D   3  E
+   so NLs:
+      A  makes 'print' a statment,
+      B  makes 'if cond2' a statment,
+      C  makes 'die' a statement
+      D  is protected by IN/3 and is ignored.
+      E  makes if cond .. else.. a statement
+
+  What if I also had
+
+              | if cond block NEWLINE else block
+
+  Then on seeing the NEWLINE I wouldn't know whether to shift it,
+  or reduce to a statement.
+  To fix that I would need:
+   statementlist -> statement | statementlist statement
+   statement -> simplestatements NEWLINE
+      | if cond block NEWLINE
+      | if cond block else block NEWLINE
+      | if cond block else statement
+      | if cond block NEWLINE else block
+      | if cond block NEWLINE else statement
+
+  But wait.... Allowing NEWLINE before 'else' confuses the parse of 'if cond:...' above.
+  NL-B makes "if cond block NEWLINE" and then the else can be shifted, but that is illegal
+  because of the negative indent.  That means the negative indent must prevent SHIFT.
+
+
+ Do I still need to know which symbols are 'line-like' ??
+    Yes, to make it easier to detect line-line productions which must contain unmatched indents.
+ Do I need recursive line-like detection?
+   For no-indent productions, it probably makes sense.
+   For starts-line detection?  I don't think so. 'block' isn't the start of a line.
+   And not really needed for no-indent
+
+
+ Do I need to worry about left-recursive symbols?
+   I don't think so.  There should always be some terminal - often NEWLINE -
+   which will ensure the symbol isn't extended??.. or if it did we would get
+   a parse error due to uncancelled OUT
+
+ Do I need to split indents?
+ A stack frame holds "symbol state".  Indents are within or before the symbol.
+ For cancelling with OUT, indents within and before the top state are equivalent.
+ For hiding newlines the indent before the symbol is too distant.                      f->next_indented ? "/" : "",
+
+ So: yes.
+ Should I store them in the stack?
+     "symbol+indents  indent  state"
+ So when we see an indent, we mini-push that
+ Any indent since state protects newline
+ Out can be cancelled if top state has indents, or previous has trailing indent.
+
+ Rather than an ignore_newline flag we need a outs_needed flag.  This is how
+ many outs are needed before we process newlines.
+ If state is startsline, then outs_needed == infinity.
+ Else it is a count of indents since a starts-line state.
+ So maybe it is an indents_since_starts_line and we test state and indents.
+
+ So:
+  1/ rearrange frame to include 'indented' in the top frame
+  2/ replace ignore_newline with indents_on_line
+
+14nov2020
+ I'm getting close...
+ Problem is correctly ignoring of newlines.
+ I ignore them after an indent - that bit is easy.
+ But I also ignore them at a start-line symbol - sometimes.
+
+  INDENT statementlist  NEWLINE     should ignore the newline
+ but
+  INDENT statementlist if  expr : stl if expr : stl NEWLINE should not
+
+ (/0s) Statementlist (11s) IfHead (5) else (22) if (3) Expression (19) : (29s) Statementlist (40s) [NEWLINE:22:0] - Ignore
+
+  This shouldn't Ignore.  Is that because there is a non-'s' since the '/' ??
+
+  i.e Ignore newlines if all states since indent are startsline, or if none are.
+   Don't count current state if it is startsline
+   indents_on_line > outs || all_startsline(...)
+
+ No - better, but no banana.  if expr { statement; newline
+ need to ignore even without the indent.
+ So I need a new type of state - one that is at the start of Statementlist but
+ not at the end.  State 29 in the above, but not 40
+
+(/0s) Statementlist (11s) if (3) Expression (19) { (30s) Statementlist (39s) [NEWLINE:34:0] - ERROR
+
+18nov2020
+ So:
+   I introduce "startsline" and "endsline" states.
+   A state before a linelike symbol is startsline, a state after such a 
+   symbol is endsline.  A state can be both, in which case we consider it 'endsline'.
+   So 11 endsline, 30 startsline 39 endsline
+
+   If there is an indent since the last startsline or endsline state, we ignore NEWLINE.
+   If there are only start/end line states since indent, ignore newline
+   if there are only endsline states since startsline, ignore newline - NO.
+
+   
+ I wonder if maybe I should have IN and OUT in the grammar.
+ The tokens can still appear anywhere, but the production can only be reduced of there is an IN there.
+  Block -> : IN Statementlist OUT | : SimpleStatements
+ might get messy.  But it already is messy.
+
+20jan2021 
+ Where am I?
+ The big questions seems to be: when can/must I ignore NEWLINEs?
+ Conversely when do they REDUCE or get SHIFTed?
+ The answer must lie in what appears between the most recent INDENT and the NEWLINE.
+ - if there are no startsline states, we can ignore.
+
+ It seems that I need to be able to see the start of a block, either an indent, or a {..
+ But if there is a statementlist, surely a block started?  No, that misses a point.
+ INDENT ignores newlines because there will be a matching OUT
+ Maybe { causes NEWLINEs to be ignored because there is a matching } expected?
+
+ If a startesline state is at the end of a productions, it plays no role in NEWLINEs,
+ but if it is followed by something
+
+23jan2021
+ review/summary.
+ We track indents and newlines.  The goal is resolve ambiguities and detect errors.
+ Ambiguities are resolved by forcing a REDUCE in some circumstances when an OUT or NEWLINE is seen.
+ Errors happen when an there are too many OUTs.
+ NEWLINEs are a normal part of a grammar, except that they get ignored sometimes when they are not relevant. 
+ and are protected.
+
+ 1/ a production from a lineline symbol cannot be reduced while it contains an unbalanced indent,
+  and so must be reduced before an excess OUT token is processed.
+ 2/ NEWLINES are ignored if there is an IN since a start-of-line state
+
+ Otherwise NEWLINEs must be in the grammar.  This can be awkward where they are optional
+ such as before an 'else'.  To handle the fact that a structured statement can be multiple lines
+ or few we need to build it up from the possible line-like parts.
+ Maybe this means that ifpart, elsepart, whilepart, casepart etc need to be statements which
+ are combined above the grammar level??
+ The reason has something to do with reducing to Statement when the newline is shifted.  
+ I wonder if that is really needed. 
+ if we have "ifpart -> if cond block" then that can be followed by elsepart, but we 
+ need a separate "ifpartNL -> ifpart NEWLINE | ifpartNL NEWLINE" which can also be followed
+ by an elsepart, or can reduce to a statement
+ ..but no.  That isn't sufficient if NEWLINEs are indefinitely duplicated.
+ So if we don't ignore newlines where they aren't needed, we cannot duplicate them.
+
+ The need for duplicating was constructs like
+   if cond : if cond2 : action
+ which needed 2 NEWLINEs, one for each statement.
+ Maybe the "if cond2 : action" needs to be a simplestatement.
+ When we see the NEWLINE we can (must?) reduce anything that started since
+ a start-of-line, which creates the simplestatement.
+
+ So let's try dropping the infinite repeat and require newlines in the grammar
+
+ Extra rules:
+ - a production from lineline symbol cannot be reduced while it contains unbalanced
+   indent
+ - a NEWLINE is ignored if no startsline state since indent
+ - a NEWLINE cannot be shifted unless top startsline is at start of line.
+
+ So a NEWLINE is:
+   - ignored if not startsline since indent
+   - forces REDUCE if top startsline is not at start of line !!!!. i.e. don't shift
+   - else can be SHIFTed.
+
+26jan2021 - happy australia/invasion day
+ I want ": simplestatements" to be a valid block, but not when there is an
+ indent in the middle
+ Previously this was resolved by having an SSLine -> simplestatements NEWLINE
+ and the SHIFT of the NEWLINE outweights the reduction to block.
+ And I have that now with Statement -> SimpleStatements NEWLINE...
+ Ahh.. it is because I put Newlines before Statementlist.  I guess a NEWLINE
+ needs to be a Statement
+
+ I currently suppress shift unless "outs == 0".  That means
+   if cond:
+       statement
+    else:
+ doesn't shift the 'else' because the "out" isn't resolved.
+ Without the leading space, the NEWLINE reduces to ifstatement, and I sort-of know
+ that is an issue.
+ So: why suppress SHIFT when there is an outstanding OUT... or better Question,
+  why is this OUT still outstanding?
+ Ahhh.. I misunderstood.  I DO want to reduce that 'if' because it is nested.
+  I have an IfStatement what needs to reduce to a Statement and merge with a
+  Statementlist, so the OUT can be cancelled and then the 'else' shifted.
+  BUT as 'else' is the lookahead.... I need NEWLINE to reduce IfStatement to
+  Statement.  But even with a NEWLINE we don't SHIFT that because we still have
+  that OUT to clear.  So how do I clear an OUT ??  It can only be cancelled when
+  the matching IN reaches TOS, which requires the NEWLINE to be shifted.
+  We cannot shift the newline early, partly because that would be wrong, and
+  partly because it gets reduced to a statement and disappears.
+
+ Hmmm.  After blocking SHIFT(NL) when we don't have a line I see
+ progress but not there yet.
+   if false {
+       print = OK
+   } else {
+
+ After the OK is NL OUT NL
+ The first  NL is shifted and reduces down to Statementlist and the OUT cancels.
+ But now the NL doesn't belong.  Can I explicitly allow NL before } ??
+ No, because it doesn't even SHIFT.  It cannot because the Statementlist doesn't
+ appear to be sol - the OUT hide that and it now looks like
+   if expr { statementlist
+
+ Arggh.  I think I need to hide the starts-line state in this context.
+ More thinking needed.
+
+ I introduced the shift-block to make sure the NEWLINE paired with the right line.
+ Here the NEWLINE should probably be ignored.  But why?  It isn't indented, but it
+ is inside a multiline thingy.
+ Might be better to explicitly allow it and find some way to let it be SHIFTed.
+ Problem is that the Statementlist is linelike, but I don't want that here.
+ Hard to avoid when the statements are and must be linelike.
+ Maybe I need a way to hide linelike-ness.  
+
+ Maybe a state should only be startline if the core item has dot followed by
+ a single symbol (which can derive a newline) ??
author	NeilBrown <neil@brown.name>
	Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)
committer	NeilBrown <neil@brown.name>
	Tue, 26 Jan 2021 06:24:21 +0000 (17:24 +1100)
00-TODO		patch \| blob \| history
Blog-pointers	[new file with mode: 0644]	patch \| blob
Ocean-introspection	[new file with mode: 0644]	patch \| blob
ooooo.tex	[changed mode: 0755->0644]	patch \| blob \| history
thoughts	[changed mode: 0755->0644]	patch \| blob \| history
twod		patch \| blob \| history