From d56bb6bb5e57d17d7e2215118cf210fb97a425ff Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 20 Feb 2018 14:52:08 +1100 Subject: [PATCH] Collect random oo thoughts in a git manage directory. I like having somewhere to keep notes.. Signed-off-by: NeilBrown --- 2011 | 1056 ++++++++++++++++++ A Language | 72 ++ Abstract Syntax | 44 + Blog-outline | 585 ++++++++++ Byte Code | 135 +++ Design | 2803 +++++++++++++++++++++++++++++++++++++++++++++++ I Object | 473 ++++++++ Interpret | 63 ++ NOTES | 317 ++++++ Numbers | 95 ++ Numbers-impl | 20 + PLAN | 7 + Parsing | 1231 +++++++++++++++++++++ Rope-11 | 23 + WhyNot | 27 + harmful | 57 + lr | 581 ++++++++++ notes | 85 ++ ocean-steps | 39 + ooooo.tex | 121 ++ pony-thoughts | 47 + spaces | 73 ++ thoughts | 168 +++ twod | 1358 +++++++++++++++++++++++ x | 4 + 25 files changed, 9484 insertions(+) create mode 100644 2011 create mode 100644 A Language create mode 100644 Abstract Syntax create mode 100644 Blog-outline create mode 100644 Byte Code create mode 100644 Design create mode 100644 I Object create mode 100644 Interpret create mode 100644 NOTES create mode 100644 Numbers create mode 100644 Numbers-impl create mode 100644 PLAN create mode 100644 Parsing create mode 100644 Rope-11 create mode 100644 WhyNot create mode 100644 harmful create mode 100644 lr create mode 100644 notes create mode 100644 ocean-steps create mode 100755 ooooo.tex create mode 100644 pony-thoughts create mode 100644 spaces create mode 100755 thoughts create mode 100644 twod create mode 100644 x diff --git a/2011 b/2011 new file mode 100644 index 0000000..89293fc --- /dev/null +++ b/2011 @@ -0,0 +1,1056 @@ +*Intro +Over a decade later I am thinking about language design again. + +It is probably the release of GCC 4.6.0 mention on lwn.net and the +ensuing discussion which touched on 'go' from google. Particularly +see http://www.cowlark.com/2009-11-15-go/ + +So many thoughts swimming around, it is hard to know where to start... 
*Type system

parameterised interfaces for subtyping, but no inheritance

numbers
Type coercion is bad - it can hide errors too easily. But explicit casting can get very ugly.

So:
A value - literal or named, or an operation that must produce a precisely correct result - can be coerced into a value of greater range or precision automatically.
A value - the result of an operation that might overflow - cannot be coerced at all.
A cast must be used to restrict range or precision.
Thus "a = b + c" cannot lose anything.
If b and c are the same size but differ in sign, neither can be coerced into the other. They could both be coerced into a larger signed value, but then 'a' would have to encompass that value or an error would result.
So if b is s32 and c is u32, the addition would be s64, so a could be 's64' or bigger.

If b and c have the same type, overflow can still occur but is expected. In that case the result can only be assigned to a variable of the same size. So if b and c are s32, 'a' must also be s32. If 'a' is s64, then one of b and c must first be cast to 's64' before the addition.

tags/modifiers/whatever:

- 'nullable'/'nonnullable' on pointers; the default is the latter
- 'once' for reference counting
- function args can indicate if the function absorbs a reference, and return values might provide a reference?
- 'signed' ?? Can I assign 'signed' to 'unsigned' with no check?

immutable and sharable types...
Numbers and strings and bool are immutable. User defined types are normally sharable, but can be immutable or 'once'.
.. something to allow an object to be embedded in another...

'types' describe particular implementations. Each object is of a particular type. This includes int, bool, etc.
An object is created using the 'type' function against a set of values which might be manifest constants or might be other objects.
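As a point of comparison, Go refuses to mix signednesses implicitly: the programmer writes the widening conversions by hand. A minimal sketch of the s32 + u32 case above - the function name and its use are mine, not Go's:

```go
package main

import "fmt"

// WidenAdd adds a signed and an unsigned 32-bit value without loss by
// explicitly widening both to int64 first - the coercion the notes
// propose to do automatically. int64 holds every int32 and every
// uint32 value, so the sum can neither overflow nor change sign.
func WidenAdd(b int32, c uint32) int64 {
	// In Go, `b + c` would not even compile: mismatched types.
	return int64(b) + int64(c)
}

func main() {
	fmt.Println(WidenAdd(-1, 4294967295)) // 4294967294
}
```

The difference from the scheme above is only who writes the conversions: Go makes the programmer spell them out, whereas the notes would have the compiler insert exactly these loss-free widenings.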
Outside the module which defines a type, the internals of the object are not visible except through the interfaces..

'interfaces' describe the behaviour of an object. An interface identifies attributes and functions which the object provides.
Attributes may be read-only or read-write, but whether they are fields or managed attributes is not externally visible
.... except that would make implementation inefficient for the common 'field' case.
Probably field attributes need to be directly readable, but writing should always be handled (though a NULL pointer might imply direct update).
Of course if we did incremental compilation we could detect direct-access fields and optimise them(??).

An interface is a function name with input and output types. Then these can be gathered in structs - which unroll of course.
So an interface can be defined as a list of functions and interfaces.
The items listed in the input and output structs are names paired with either an interface or a type.

The function can also be an operator, or get_indexed, put_indexed - which handle name[index] references.

interface matching is strict inheritance matching.
i.e. if two interfaces both define 'foo' they don't match. Rather they must both inherit the *same* foo from somewhere.

An object implements multiple interfaces, possibly with code in several files...

virtual functions.. or templates or something... can be written for an interface. The function can then be called on any object that implements the interface. The function uses other functions of the interface to implement the result.
So an "array of comparable" might implement a qsort function.
What about a mergesort function on a list? What is the list?

Yes.. what is a list?
How can we implement linux/list.h lists in a fully type-safe way?
Possibly we cannot without tagging the list_head with a type indicator and requiring a check, which would be boring and pointless.
I think that the only times we do list_entry, we know where the head of the list is, and it isn't here.

So the type of a list_head must include the identity of the head of the list.
next is a pointer to a listhead of this type embedded in a foo, or at X

When we insert-after or delete, we don't need to care about X. In fact we only care when walking the list and calling list_entry.
We could tag a 'head' with a low bit in one of the addresses. Ugly but effective.
So everything on the list is either embedded in a foo, or is a load-list-head which is tagged.

If we allow "if X != Y" to change the type of X to exclude Y, as we might with "if X != NULL", we might be able to do something...
But I suspect not. The assertions we are making are simply too rich to make in a comprehensible type system.
So we probably want the 'list' module to be able to make 'unsafe' casts and assert that they are safe.

What about parameterised types etc.?
A 'max' or 'min' function can work on any two items that are mutually ordered, but we don't care what particular type it is. So the type is an unknown:
    max : (a,b:X < TotalOrdered) -> X

When we call 'max' we don't tell it what type X is; it has to deduce it from the passed args. So that is an implicit parameter.
But we could say
    intmax = max(X=int)
to create an explicitly typed 'intmax' function(??)

Containers are the classic case for parameterised types:
    stack(X) :
        push(entry:X);
        pop() -> X
        empty -> bool
this is an interface which could be provided by a type with an embedded linkage field, or by an expandable array

---------------------
So, some concretish thoughts.
We have two distinct things - interfaces and types.
An interface defines a set of function names and their types.
An interface can include another interface and may broaden the types of args or narrow the types of return values. Broadening may remove args. Narrowing may add extra return values.

A type may declare that it conforms to one or more interfaces, and by implication any included interface. There must be a strict connection. Just defining functions with the same names is not sufficient - the interface must be declared and the compiler will check it.
If a type conforms to two interfaces which both have functions with the same name, the type should provide two functions and will need to indicate which serves which interface.

A type includes a data structure and a set of functions, and provides a namespace to enclose them. The structure may include objects of other types, either by reference or more directly by 'embedding', sometimes called inheritance.
The functions of those types are available and may be used to implement a shared interface, either explicitly by declaring an interface function to be implemented by an imported function, or implicitly by marking a member object as a 'parent' - though a better name is needed.

When the member's function is called, a reference to the contained object is passed together with a set of function pointers that call into the relevant interface functions for the main type.

A 'set of functions' is a data structure which starts with a function pointer - the dispatching function. Args are set up with the identity of the called function first, the target object next, then any args. In the common case the remainder of the data structure is a sorted list of method identities and function pointers. The dispatching function does a binary search to find the target, possibly subtracts an offset from the target object, and jumps into the function.

But it would be best if not all functions had to go through this lookup.
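That dispatching structure - a sorted list of method identities, binary-searched at call time - can be sketched as follows. Everything here (the `method` struct, the integer id scheme, the fallback string) is invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// A minimal sketch of the dispatch structure described above: the
// table is a slice of (method id, function) pairs kept sorted by id,
// and the dispatcher binary-searches it to find the target.
type method struct {
	id int                  // method identity; the table is sorted on this
	fn func(obj any) string // the implementation
}

type vtable []method

// Dispatch looks up id by binary search and calls the function found.
func (v vtable) Dispatch(id int, obj any) string {
	i := sort.Search(len(v), func(i int) bool { return v[i].id >= id })
	if i < len(v) && v[i].id == id {
		return v[i].fn(obj)
	}
	return "no such method"
}

func main() {
	v := vtable{
		{id: 1, fn: func(obj any) string { return fmt.Sprintf("tostring(%v)", obj) }},
		{id: 4, fn: func(obj any) string { return "hash" }},
	}
	fmt.Println(v.Dispatch(1, 42)) // tostring(42)
}
```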
Of course the caller could cache the found function and not look it up again if it is called several times. But that is only part of the solution.

The standard approach is to declare functions as 'final' if we know they will not be over-ridden. In our case that means that calls in the parent which expect the parent's function will never call the child's function.
Possibly this should be the default, to encourage efficiency.
If a function is marked 'virtual', then any call to it must go through a lookup table. If it isn't, then it doesn't have to. This marking is in the .... interface?

When an external caller knows the type of an object, normal function calls are direct and virtual function calls are dispatched. When a caller only knows an interface of an object, all calls are dispatched.
So it is the type which determines whether functions are virtual or not.

Do we ever want to be able to make function calls based on an interface? i.e. could some feature of an interface be final?
This could make sense for a template-like function, e.g. qsort.
An interface 'sortable' requires an array with comparison and defines qsort. But why do that? Why not just define qsort as a function? I guess that is the thing. A final interface method is really just a function, possibly in some namespace.
So an interface defines methods or 'static' functions.
A type defines functions which can be 'virtual'ised.

An interface can be parameterised so that some types in args or return are taken from the parameters.
    interface comparable(base) {
        int lesseq(this:base, that:base);
    }
So numbers conform to comparable(number) and strings to comparable(string).
int conforms to comparable(number).

    interface arrayof(base) {
        base get_elem(this, index:int)
        void set_elem(this, index:int, elem:base)
    }

A function can be written to a subtype of an interface.
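Go's generics are one concrete rendering of this: the type parameter carries the constraint, X is deduced from the args, and it can also be supplied explicitly, like the earlier 'intmax = max(X=int)' idea. The `Ordered` constraint below is a hand-rolled approximation, and structural rather than a declared conformance, so it is only close to the scheme above:

```go
package main

import "fmt"

// Ordered approximates a 'comparable(base)' interface: any type whose
// underlying type is listed here supports <. (Invented for this
// sketch; Go's standard library has an equivalent in cmp.Ordered.)
type Ordered interface {
	~int | ~int64 | ~float64 | ~string
}

// Max is 'a function written to a subtype of an interface': it works
// for any X satisfying Ordered.
func Max[X Ordered](a, b X) X {
	if a < b {
		return b
	}
	return a
}

func main() {
	fmt.Println(Max(3, 7))       // X deduced as int from the args
	fmt.Println(Max("ab", "aa")) // and as string here

	intmax := Max[int] // explicit instantiation, like 'intmax = max(X=int)'
	fmt.Println(intmax(2, 9))
}
```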
    function base max(a,b:base) where base:comparable
    with X:object, base:comparable(X)
        function base median(a: array of base)

I want to think about closures - a function with some 'local' variables already set... a lot like an object, but with one method.
This is created if you can take the address of a chunk of code in a function - it holds on to the context as you export it.
I would either need to copy the stack frame, or have stack frames on the heap, which doesn't sound efficient. Copying the frame would be difficult if you had addresses of things, but maybe you wouldn't. Or maybe you only allocate a frame on the heap if it can possibly be part of a closure.

*statements, expressions, operators

Pascal made a strong distinction between statements and expressions. This seemed elegant but turned out to be somewhat clumsy. It created a strong distinction between procedures and functions which seems unnecessary.
It is sometimes nice to have expressions with side-effects such as 'variable++', but this can cause problems with macros and order-of-execution. So don't have macros! And don't allow side-effects to affect other values in the same expression....
Some languages have a 'where' modifier to an expression or statement which can contain arbitrary code to set up the context for the expression. Typically it defines a bunch of names that are used in the expression. It doesn't seem to have caught on and I wonder why...
    while not finished
        where { foo; bar; baz; finished := whatever }
    do { stuff }

I sort-of like 'after' instead of 'where' as it doesn't imply that the body has to explicitly affect the expression.
C allows ({ foo; bar; baz; not whatever }) to achieve a similar idea, but placing the 'whatever' at the end hides it a bit. With the expression at the start it is more clear where we are going.
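For comparison, Go's statement-before-condition forms already give a limited version of this 'with ... while ... do' idea: a small set-up statement sits with the condition, and names it declares are scoped to the statement. A sketch (the function and values are invented):

```go
package main

import "fmt"

// classify uses Go's if-with-init form: 'half' is set up alongside
// the condition and is only in scope for the if/else - a cut-down
// 'with initialisation ... do' from the notes.
func classify(n int) string {
	if half := n / 2; half*2 == n {
		return fmt.Sprintf("even, half is %d", half)
	}
	return "odd"
}

func main() {
	// And the three-part for loop is literally
	// 'initialise; while condition do increment':
	sum := 0
	for i := 1; i <= 4; i++ {
		sum += i
	}
	fmt.Println(classify(10), sum) // even, half is 5 10
}
```

The limitation relative to the notes' proposal is that the init clause is a single statement, not an arbitrary block of set-up code.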
A for could then be:
    initialise; while condition do increment after { body }
which is kind-of interesting, though having the initialise out the front is a little bit horrible.
Pascal's 'with' (which never quite was useful) could allow:
    with initialisation while condition do increment after {body}
    if condition after {context}
    do body; else other

I wonder what case/switch should look like.
I probably want implied 'break'. In fact the 'go' switch is quite nice.
    switch expression { case expression: statements ; case expression: statements; }
Part of me wants those statements inside { } so they are clearly bracketed - but that might be ugly:
    switch expression {
        case expr { code code }
        case expr { code code }
    }
The word 'case' becomes pointless...
I think I want 'continue' rather than 'fall-through', meaning 'find the next case that matches'. I'm still not happy with the switch syntax.

Operators... I'm not sure what to say about those. Having unbounded overloading seems like a bad idea; having none is too restrictive. I think there should be a defined set of operators with arity and precedence fixed. Possibly precedence should be undefined when different definitions of operators are used, so that parentheses are needed.

i.e. a + b * c binds OK when they are all in the same type family, but if a is a different type to b and c, then it must be a + (b * c).
I'm not really sure what that means - need to know more about how operators are overloaded.

Every operator which is infix could also be prefix.
Postfix... less clear:
    arg op1 op2 arg
Either op1 is postfix and op2 is infix, or op1 is infix and op2 is prefix.
Best to exclude postfix ops from being infix.

Hmm.. there is more I want to think about here.
It would be cool if function calls could use a richer syntax to separate args than just comma.
e.g.
    append $value to $list
    distance from $x1,$y1 to $x2,$y2
    distance from $x1,$y1 to $x2,$y2 via $x3,$y3
Cobol lives again?
There are lots of questions here...
As the function words are not predefined, we need them to either be kept distinct from local names, or be syntactically distinguishable. $name works in the above, but not in real life.
We probably want the function names to belong to a type, but the type isn't apparent until later, if at all.
In the 'append' case the key type is at the very end. In the distance case there is no key type.
We could require
    distance from point(x1,y1) to point(x2,y2)
though a name of type 'point' could also be used.
If I defined a variable 'distance', with a postfix operator 'from' it could get awfully confusing.
Disallowing variables named for function starters might be too restrictive, as you probably cannot control which function names get defined..
The alternative is for variables to over-ride functions... which is probably fine.

So each word introduces a temporary name space which over-rides what is there.
'distance' introduces 'from', 'to', 'via'.
But a variable called 'distance' hides all of that.
If the name is followed immediately (no space) by a parenthesis, then the new namespace only exists inside the parenthetic space.
That makes "point(x,y), z" different from "point (x,y), z". That would be bad.
We could go lisp-like: (point x,y)
    (distance from here to there via elsewhere)
cf. distance(here, there, elsewhere)

Not sure - the traditional version allows the functional bits to stand out, but doesn't make their relationship clear.

    number divided by number ???
    number div number

*Basic syntax

Lots of little questions here...
1/ are parentheses just to over-rule precedence? or do they have a real meaning? i.e. what is the function call syntax?
    name list
    print a,b,c -- that works
    me.append a -- looks a bit odd.
What is a list?
    a,b,c -- clearly a list
    a -- not so sure
    :a:b:c -- different syntax, so lists can be nested
    :a:b,c:d -- first and third are not lists.
    :a -- list of one element
    :(:a):b -- a list of a list of one element, and 'b'
    nil -- list with zero elements - empty list
    a,b,c, -- trailing , is allowed and sometimes assumed.
So a function takes a single argument, which might be a list... But how then do you call a function which takes no args?

    me.tostring -- Is that a function or a function call?
    me.tostring nil -- ugly
    me.tostring() -- but now parentheses mean something.

Maybe getting the function (or closure) requires different syntax?
    &me.tostring
    lambda me.tostring

2/ when are {} required to group commands? Do we separate commands?
I don't like too many {} - so only require them to disambiguate.
Don't separate anything. Separators are ugly. We terminate things with ',' or ';' when syntactically necessary.

*strings

String constants and char constants are not syntactically different. The type of a constant is determined from context.

They can be enclosed in "" or ''. In either case the terminal char can never appear in the string, not even quoted. This makes parsing simpler.
\escapes are recognised in "" but not in ''. \q is a quote.

Adjacent string constants are catenated, so: "'" '"' is a string of the two different quote characters.

A multi-line string is introduced by """. This must be the first and last thing on the line after initial blanks. Every subsequent line until another identical line must start with the same sequence of initial blanks. All the lines between are joined with the blank sequence removed.

The big question is: how are internal strings stored?
UTF-8, UTF-16, and UTF-32 are obvious candidates.
UTF-8 is best for European languages.
UTF-16 is better for many Asian languages.
UTF-32 is fairly pointless for general storage.
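For comparison, Go answers the storage question by committing to one internal format: a string is an immutable UTF-8 byte sequence, indexed by byte but iterated by code point. A small illustration:

```go
package main

import "fmt"

// Go's strings are immutable UTF-8 byte sequences. Indexing and len
// see bytes; range-iteration decodes code points one at a time.
func main() {
	s := "héllo"
	fmt.Println(len(s)) // 6 - a byte count: 'é' takes two bytes

	n := 0
	for i, r := range s { // one step per code point, not per byte
		fmt.Printf("%d:%c ", i, r) // i is a byte offset, not a char index
		n++
	}
	fmt.Println()
	fmt.Println(n) // 5 code points
}
```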
So we probably have two internal formats with auto-conversion like for numbers. A pragma determines how constants are stored??

Or better, we allow a string to be either UTF-8 or UTF-16 and self-describing, like any good object.
Each byte array is prefixed with a count that contains a bit flag which distinguishes between UTF-8 and a byte count, or UTF-16 and a word count.
The bit is the lsb and is zeroed before using the count - so the count must be even. If there are an odd number of bytes/words it is simply padded.... Or maybe it is set to 1 so that with the count it is closer to a round number of bytes total.

*Operator Overloading.

Operators in general...
It would be nice if modules could define operators instead of just functions. Then 'int' could be just a module.

But operators affect parsing at a fairly deep level, both by precedence and fix-order.

If two modules both import the same operator with different precedence, that would be bad. But ditto with global names.

Anyway - can an operator be infix and prefix and postfix?
C only has ++ and --. .name ->name [] () are also sort-of.
For prefix: & * + - ~ ! ++ -- cast

If we expect a term and see an op, it must be prefix.
If we expect an 'end' and see an op, it could be infix or postfix.
If after an operator we see another operator, the pair could be infix then prefix, or postfix then infix.
So an op cannot be both infix and postfix.

Operators like 'and' and 'or' and 'not' and 'or else' and 'and then' really need to be predefined names or parsing gets too complex.

Probably assert that all 'other' operators bind more tightly than pre-defined ops, are left->right, equal in precedence, and are infix or prefix but never postfix... or at least when postfix they cannot be followed by something - just a close.

But do we really need new operators? Probably yes. Not many, but sometimes an operator is just soooo much neater.
Look at &^ in 'go' - a really good idea I think.
<: :> ... I wonder...

Overloading
The issue here is how to determine which instance rules. It is simplest if the type of the first operand must accept the operator. With int/float, the type of 'int' is actually 'number', which defines division. int/array would not work, as 'number' does not know about arrays.

*Declarations and assignments

Lots of interpreted languages don't require, or even allow, variables to be declared before use. This makes writing quick code easy (is that good?) but also makes typos easy. Checking for unused values and uninitialised name references can help catch a lot of the errors, but I'm not sure it is generally a good thing.

Some C variants allow declarations to be mixed with code, but that seems to be unpopular and I cannot say that I like it much.
Still - having the declaration close to first use can be nice. The 'with' statement I suggested earlier could make this work nicely.
    with name = value do stuff
could serve as a declaration of 'name' with exactly the type of 'value', though
    with int name = value do stuff
is not much harder so should be preferred.

I wonder if
    :name = value
would be a good syntax to declare 'name'.
    int:name = value
would disambiguate if needed.
The same could be used in type matching for functions.
    max(:X:a,b) -> X
means X is a type variable determined from the args, and becomes the type of the result.

I wonder if it would be nice to allow
    :(a,b,c) = 1,2,3
to initialise all 3. It would be the same as
    :a,:b,:c = 1,2,3
Or do we want
    int:a = 1
    int:b = 2
Do we want multiple names per type? How?
    int:(a,b) = 1,2
    int:a,:b = 1,2
    int:a,b = 1,2

*Type labels:

How, syntactically, do we give a type to a variable?
Pascal uses
    name, name : type
which doesn't work very well with initialisation, though it could I guess.
C uses
    type name, name
or
    type modified-name
such as *name or name[num], which means some types don't have a nice closed name.
In Pascal, "array [0..36] of int" becomes "int _[37]" ?? Or use a typedef.
I think types should have simple closed names so they can be passed around, particularly to parameterise interfaces.
Also "int* a, b" looks like a and b are both int pointers, but they aren't. "int *a=b; *a=b;" do two very different things.
"array of" is very verbose.
    int[] array of ints
    int* reference to an int
    int& formal parameter which takes the address of the passed value?? - no, maybe not.
    [int,err] union of two possibilities?

*Pointers:

Pointers are good. We like pointers.
Pointer arithmetic is bad. It allows very silly things. It also allows cool things (container_of), but the language should provide those.

Pointers can be cast to integers - that can help with arbitrary ordering or equality checks (but what about memory compaction?). But integers can never be cast back to pointers.

Array slices are a good idea. It requires passing a size around with a pointer - though that could be optimised out in some cases, and would be needed anyway in others.
If strings are array slices then we don't need nul termination, but have the cost of a length... Probably not a big cost, and it gets rid of some horrible cut/paste issues with nul termination. We can pass a substring without copying or mutating.

Strings are very special. They are not arrays of characters. They cannot be indexed but can be iterated. Each element is a unicode point, which might have an ASCII value..
Strings are not mutable.

Byte arrays are normal arrays and are mutable. You might be able to convert a string to a byte array but in general you cannot.

The language has values and references. 'references' are ways to refer to a value, often a name (i.e.
a variable) or a field in an object, or a slot in an array.

Values can be mutable or immutable. The reference to an immutable object might be a copy rather than a pointer.
Values have a lifetime. When the lifetime ends the value is destroyed, which might run a destructor function.
Lifetime can be determined by:
- reference counting - when it hits 0 the value is destroyed
- single-reference. There is only one reference and when it is destroyed, the value is destroyed
- garbage collection. References can be found in memory and when there are no more, the value is destroyed.

A reference-counted value must contain a counter.
A collected value must contain a single bit usable in mark/sweep.

*error returns and exception handlers

Returning and handling errors is the bane of any programmer's life. Exception handling is essential but feels heavyweight and clumsy. It doesn't have to be.
I like the "set -e" functionality of the Bourne shell: if anything returns an error status that isn't used, then execution aborts.

A similar concept would be that any function can return a union of the normal return value, or an error. If the caller only accepts the normal return value, the error is raised up the stack. If the caller accepts the error, then it does whatever it likes.
So:
    answer, err := function_call(args)
    if (err) {
        do whatever, answer is undefined.
        return err;
    }

Obviously the function must be declared as possibly returning an error. We could make the error type a part of the language, or we could just generically allow unions to be returned. This would allow structured error objects that are specific to the context.

Exploring 'go' suggests that there could be more than just error returns to consider. There are also timeouts.
i.e. in some cases:
    answer := function
will wait for an answer while
    answer, ok := function
will return immediately, either with an answer or an error.
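The 'answer, err :=' sketch above is essentially the convention Go adopted: errors are ordinary values in a multi-value return, and the caller either handles one or returns it up the stack. A minimal example - `div` and its error text are invented placeholders:

```go
package main

import (
	"errors"
	"fmt"
)

// div returns the normal value or an error - a union of the two, in
// the notes' terms, spelled as a second return value in Go.
func div(a, b int) (int, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

func main() {
	answer, err := div(10, 2)
	if err != nil {
		fmt.Println("error:", err) // handle it here...
		return                     // ...or 'return err' to raise it up the stack
	}
	fmt.Println(answer) // 5
}
```

The difference from the notes' proposal is that Go makes propagation manual (the `if err != nil` at every call site), whereas the notes would have an unaccepted error raised automatically, as with "set -e".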
Does that make sense? Should the calling context determine if a delay is appropriate or not? That really sounds like a different argument.
This is very different to the 'err' return. The function always throws an error if that is what it has to do; the call site either catches it or lets it flow on up the stack.
An implementation might place a global setjmp, which is like passing an arg down, but it is a thread-global arg. So 'catch error' is a stacked thread-global value.
Could not 'timeout' be similar? Or probably 'nodelay', as timeouts would be implemented by killing off a thread that was no longer interesting...
No - nodelay is too global. Some delays are small, some are large. We cannot put them all in the one basket.
It makes more sense for the channel to be marked for delays or not, or rather a channel endpoint.
Or a channel could normally block, but channel.ndelay could be a channel that threw an error instead.

A problem with this approach is syntax.
It requires a 'catch' to be written like:
    rv, err := {
        lots of function code here;
    }
    if (err) {
        handle error
    }
Which is much worse than
    after_scope{handle error}

Or it could be:
    with error = stuff
    if error do handle error

*modularity

name space control - who owns namespaces and how are they made available?

Obviously any block - or at least any 'with' statement - introduces a namespace which over-rides existing names.... or should conflicts be errors? That is safer.

A structure variable holds a namespace with local interpretation. This is traditionally introduced by
    {value of the type}.{name in local namespace}

It isn't just structures; any type, even the humble 'int', could have names attached. "$int.abs" might find the absolute value. But it would be nice if 'abs' were more visible:
    abs of $int (???)
as that extends to e.g.
pop value off stack, push value on stack
Is that really better than stack.push(value); value = stack.pop;
Maybe not, but as the args are more complex and optional, something like:
    table.find(value, hash, compare-fn, ....)
loses something.. but then that is fairly contrived.
In mdadm I have 'force' and 'verbose' flags, then for e.g. create: level, chunk, size, raid_disks, spare_disks, etc..
Using lots of prepositions will quickly run out of steam here.
Requiring tags for each field would be boring, as it would look like:
    Create(dev=devname, level=level, raid_disks=raid_disks, ....)
I really should use some little structures to gather stuff.
It would be nice if I didn't have to have a common type. If
    var = val
would work when they are different types, and become:
    for each field in var:
        do var.field = val.field
At least this could be allowed for arg passing to a function.
Maybe something like "&val" expands to

A 'struct' is different from an object .. or a 'packed' layout for that matter.

A struct is simply a collection of distinct values or variables, each with a name and/or a position.
A struct does not comprise a type - different structs are simply different things. Structs are compatible based on name or position matching.
This is only relevant when a struct is assigned, either directly with '=' or when assigning actual parameters to formal parameters in a function call. In these cases every variable in the target must match a value from the source, either by name or position. The source may have extra values, but it may not be deficient.
Multiple structs can be combined by listing them.

A struct is really just a short-hand for a few names/values.
A struct does not have an address, cannot be stored in an object, or passed around.

A name can be defined as a struct in a local scope and the various fields can be assigned.
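One well-worn answer to the Create(dev=devname, ...) problem is to gather the optional arguments into a little structure, which is roughly what Go's named-field struct literals give: tags at the call site, order irrelevant, unnamed fields defaulting to zero. All the names below are invented for illustration:

```go
package main

import "fmt"

// CreateOpts gathers the optional arguments; fields left out of a
// literal get their zero value. (Names are made up, loosely after the
// mdadm example in the notes.)
type CreateOpts struct {
	Level     int
	Chunk     int
	RaidDisks int
	Verbose   bool
}

func Create(dev string, o CreateOpts) string {
	return fmt.Sprintf("%s: level=%d disks=%d", dev, o.Level, o.RaidDisks)
}

func main() {
	// Only the fields we care about are named, in any order -
	// tag-per-field without a long positional arg list.
	fmt.Println(Create("/dev/md0", CreateOpts{RaidDisks: 4, Level: 5}))
}
```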
The whole struct can be assigned to some other struct value, in which case the source must have matching names or positions.
But here is the problem: what if it has both matching names and positions? and what if they don't align?
    (x,y) = (y:1, x:3)
I think that names must take priority. Maybe only use positions if one or the other doesn't have names. If both have names, then they must match.
So:
    a,b = fncall()
a,b doesn't have names, so it is purely positional.
    (result:a, err:b) = fncall()
requires that fncall returns a result and an err.

So structs are primarily used for function call args and results.
Values can be just listed and are then positional.
If a struct is in the list, it is interpolated, not a separate object.

I guess positional parameters take precedence..
    x, y, foo
where foo is a struct, assigns first from x, second from y, and the rest from foo. If foo has names they are used, else positions.

A type name is given a struct and creates the object. There might be several different structs that can be used, in which case name matching is important.
    Imaginary(x:float,y:float) vs Imaginary(mag:float,direction:float)
If no labels are given, the first with enough matches wins.

Also a typed object can fill in a struct if it has all the right names. Some might be attributes and some might be computed attributes, but true functions are not allowed because there is nowhere for args to come from.

So type names are in the global namespace (unless they are prefixed by a module name).
So it must be OK for any name to be in the global namespace; it might be a type, it might just be a function.

Later....
After thinking (and writing) about object use in the kernel, one thing that really is missing from C is the variable namespace. C just has the fields in a struct but cannot have anything else. (The other big thing is a richer type system.)
+	So what if an 'object' (the basic abstraction) was about
+	defining a name space as much as a set of storage fields.
+	And are these really two different things? or are they
+	compatible?
+
+	So an 'object' creates a name space.  In it are names which
+	can be:
+	  - constant: a number or a string or even a function.  The
+	    value might be stored in 'text' space or might just be
+	    known to the compiler.
+	  - static: The name refers to a globally allocated variable
+	    which is known only through this object type.
+	    So like a 'class variable' ... which is probably a bad
+	    idea for exactly the same reason that global variables are
+	    a bad idea.
+	  - instance/auto: field that gets included in each instance
+	  - method: This is a function that gets stored in a vtable.
+	    The innermost object that has methods owns the vtable.
+	    Each object with any methods defines a vtable structure
+	    which embeds the vtable of the inner object.
+	    So there can only be one - multi-inheritance doesn't work.
+	    But an object can still store function pointers..
+
+	But maybe multi-inheritance is fine once we understand
+	inheritance properly.
+	Typically we embed a parent in a child and say that the
+	behaviours of the parent are applied to the child through the
+	parent.
+	So if you want to do X to the child, and X can be done to the
+	parent, then you do it to the parent.  But you might have
+	adjusted that parent through giving it a vtable that
+	understands the child.
+	So a second parent just means a second vtable pointer in there
+	somewhere.  Of course if both parents can respond to X, then
+	the child needs to explicitly disambiguate.
+
+	Which leads to over-rides and name resolution.
+	If I include a parent as 'p' then everything in it is
+	available as 'p.thing'.  We might want some or all things to be
+	available as 'thing' directly, or maybe 'pthing' - i.e. with a
+	different name.
+	So this is just standard importing.  We can import everything,
+	or specific things with renaming.
+	Do we want 'import if it doesn't cause a conflict'??  That is
+	probably reasonable.  So:
+	  import all, import none, import if unique, import X (as Y).
+
+	  struct foo bar {X, y as z, *unique}
+
+	Then X == bar.X, z == bar.y, $1 = bar.$1 if it is unique.
+
+	When a method is declared, its vtable location must be
+	identified.
+	It can be:
+	  - inline, so the method is just a function pointer
+	  - attached, so a vtable pointer is in the structure
+	  - attached through some parent
+	  - delegated to some referenced object.
+
+
+
+
+*parallelism
+	I've never really given this a lot of serious thought.
+	In the kernel, we simply use locks for protection and a fairly
+	heavy-weight mechanism for creating threads.  fork/wait.
+	'go' has 'go statements' to run those statements in parallel,
+	and uses channels (blocking or buffered) for synchronisation.
+	They assume the library might provide locking primitives.
+	I guess channels as a builtin rather than a library function
+	might allow some useful optimisations.  You would think locks
+	would too.
+
+	I feel that the language should not preferentially define any
+	synchronisation mechanisms.  Locks, queues, signals etc could
+	all be provided by the library, and if necessary the compiler
+	could 'know' about some and allow optimisations just like the
+	gcc C compiler 'knows' about strcpy from the library.
+
+	So the language just needs a way to start a new thread.  If
+	the termination of the thread is important, it can set some
+	flag or whatever when it ends.  It might be nice to be able to
+	abort a thread - kill it.  Rather than defining a naming
+	system we could allow a thread to be given a handle of some
+	sort which it passes to the runtime to say "exit when this flag
+	is set".  This is a bit like a signal handler which I decided
+	was a bad idea...
+	No, I think the 'thread' creation function should return a
+	handle which can be killed.
+
+	So
+	  handle := do { commands }
+	  handle.kill()
+	  handle.running()
+	etc.
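A rough Python approximation of this handle idea (the `Handle` class is invented for illustration; since threads cannot portably be killed outright, `kill()` here is the cooperative "exit when this flag is set" variant mentioned above):

```python
# Sketch: a thread 'handle' offering kill() and running(), built on a
# cooperative stop flag rather than forcible termination.
import threading
import time

class Handle:
    def __init__(self, body):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=body, args=(self._stop,))
        self._thread.start()

    def kill(self):
        self._stop.set()          # ask the thread to exit...
        self._thread.join()       # ...and wait until it has

    def running(self):
        return self._thread.is_alive()

def worker(stop):
    while not stop.is_set():      # 'exit when this flag is set'
        time.sleep(0.01)

h = Handle(worker)                # handle := do { commands }
assert h.running()
h.kill()
assert not h.running()
```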
+
+	This creates a closure that copies all visible local variables,
+	but shares references to allocated and global variables.
+
+	Concurrency management is very easy to get wrong.
+	Sometimes it is nice to be able to use lowlevel primitives
+	like RCU and write barriers and spinlocks, but in many cases
+	your needs are not so fine-grained and you want the language to
+	do it for you.
+	This could in part be through data structures that do the
+	synch for you (channels, bags, refcounts).  However it is
+	probably good to have some way to tag something to say that it
+	must run locked.
+
+	Locking is something that must be optional.  A structure needs
+	locking when shared, but if it is always used in a locked
+	context, then the locking is a waste.  And while it may only
+	be setting a bit, it has to be a bus-locked test-and-set which
+	is definitely slower than not doing it.
+
+	If we had 2 bits for each lock, one could say if locking is
+	active and the other if the lock is held.  (There would be a
+	completely separate wait-queue found through a hash on the
+	lock address).  If either bit is set, enter the spinlock protocol.
+
+	So when allocating an object we enable locking or not
+	depending on something.
+
+	If an object is lockable, then certain fields should only be
+	accessed while the lock is held.  So they get to be 'private'
+	or 'hidden'.
+
+	Granularity of language-imposed locking is probably object and
+	method.  i.e. a method takes a lock on an object.  Presumably
+	this is what the 'synchronised' type attribute means.
+
+
+
+
+*introspection
+	or reflection...
+
+	In 'go' this allows the actual type of an object to be
+	explored to some extent - the name and type etc of each field
+	of a struct can be extracted.  I suspect that is used.
+	I don't really like the typecase that it uses though.  Maybe
+	I need to understand how unions should work...
+
+
+*Memory management
+	Garbage collection seems to be popular.  I hate it.
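A sketch of that object-and-method locking granularity in Python (the `synchronised` decorator and `Counter` class are hypothetical; Java's `synchronized` methods work much this way):

```python
# Sketch: per-object, per-method locking -- roughly what a 'synchronised'
# attribute might expand to.
import threading
import functools

def synchronised(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:               # the method takes a lock on the object
            return method(self, *args, **kwargs)
    return wrapper

class Counter:
    def __init__(self):
        self._lock = threading.Lock()  # one lock per instance
        self.n = 0                     # field only touched under the lock

    @synchronised
    def bump(self):
        self.n += 1

c = Counter()
c.bump()
c.bump()
assert c.n == 2
```

This also illustrates the cost argument above: every `bump()` pays for an atomic acquire even when the object is never shared, which is why the note wants locking to be optional per object.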
+	It makes this stuff "just work" which is good, but it feels
+	very untidy.  Also there is no guarantee when it happens
+	so destructors which free other resources like fds aren't
+	always safe.
+	Reference counting is a real overhead, particularly on small
+	objects, but talloc has good results...
+	One problem with refcounting is that you need it to be atomic
+	in a multithread environment.  But the GC would need
+	to stop everything in multithread too.
+	A program which was known to not run for long could be compiled
+	with refcounting disabled and just never free anything...
+	Though that is bad for destructors too.
+
+	Probably should allow manual management with no refcounting,
+	and 'once' tagging on pointer types so the compiler can check
+	no refs are taken rather than counting them all.
+
+
+	If we had really good reference counting, then locks could
+	make use of it too.  A 'lock' function returns a 'locked'
+	object and when that object is released the lock is dropped
+	by the destructor.
+	So just going out-of-scope is enough to unlock, but
+	  var = get_lock(lock)
+	  do stuff
+	  var = None
+
+	would allow it to be explicit.
+	This also allows lock ownership to be passed around.
+	If a function type declares that it absorbs a reference, then
+	it is thereby allowed to unlock something that other code
+	locked.
+
+	Possibly the most interesting part of managing refcounts is
+	determining which references in heap objects are counted,
+	as references in the stack should be fairly easy to track.
+	e.g. a double-linked list won't want two counts - the 'back'
+	link should be ignored.
+	And possibly it won't want to count at all: if it is e.g. in a
+	hash table, then when the refcount becomes zero just remove
+	it from the list.
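The lock-dropped-by-destructor idea can be sketched in Python, relying on CPython's deterministic reference counting so that `var = None` runs the destructor immediately (the `Locked` class and `get_lock` are invented names for illustration; other Python implementations may delay the destructor):

```python
# Sketch: get_lock() returns a 'locked' object whose destructor drops
# the lock, so losing the last reference unlocks.
import threading

class Locked:
    def __init__(self, lock):
        self._lock = lock
        lock.acquire()            # taking the object takes the lock

    def __del__(self):
        self._lock.release()      # dropping the object drops the lock

def get_lock(lock):
    return Locked(lock)

l = threading.Lock()
var = get_lock(l)
assert l.locked()                 # held while 'var' keeps a reference
var = None                        # explicit release: refcount hits zero
assert not l.locked()
```

Passing the `Locked` object to a function that keeps it is exactly the "absorbs a reference" hand-off of lock ownership described above.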
+
+	So: a reference can be:
+	  - counted: meaning it contributes to the refcount
+	  - clearable: meaning that when the object is deleted, the
+	               reference can be found and destroyed, as with
+	               the back link in a double-link list
+	  - dependent: meaning that it depends on another reference,
+	               and will only be held while that other
+	               reference exists.
+
+	Getting the compiler to check these is probably too hard.
+	Dependent references might be do-able if we had a lock on the
+	holder of the prime link.. but that only makes sense for
+	transient dependent references.
+	Checking that the destructor actually destroys all clearable
+	references is asking too much.
+	Maybe the 'clearable' declaration could include a label for
+	the code in the destructor where the 'clear' happens - or
+	vice-versa.  i.e. annotate each clearing with the clearable
+	field that this clears.
+
+	The compiler could track movement of references and insert 'get' and
+	'put' where absolutely necessary based on annotations.
+
+	So the programmer could still make mistakes, but they are at a
+	higher level.
+
+*Higher order functions and first class functions
+
+	Functions are first class if they can be passed around like
+	other values, and of course called - given args and allowed to
+	return a value.
+	Higher order functions take functions as arguments, such as
+	'map' which takes a function and a list and applies the
+	function to each member of the list.
+	Then there is currying where you just pass one arg to a
+	function and it returns a function which can be applied to the
+	second arg.  The type of 'curry' would be rather subtle:
+	  curry : function(x,..y)->z, x -> function(..y)->z
+
+	Then there are closures.  This is a function completed with
+	some bound variables, just like an object.
+
+	i.e. a call frame *is* an object and any section of code in
+	the frame can be given a name and can then be called from
+	outside the frame.
After parsing the code in the function we can
+	determine if there are any code blocks that might be exported
+	and determine which variables are in there.
+	These get placed in an object allocated on the heap.  All
+	other variables remain on the stack.
+	i.e. we create an object to contain the closed functions.
+	But we don't pass the object around ... we pass the function
+	around.  That is a little odd.
+	Unless a function *is* an object.  But what does that mean?
+	It doesn't obviously have state, and it only has one
+	behaviour.  i.e. referencing a label in it makes no sense.
+	Only calling it does.  So that is its one behaviour -
+	callable.
+	So a "regular" object should be callable too?  Why not?
+	So foo.bar is an object which curries foo.
+	  x = foo.bar
+	  x(hello)
+	is the same as
+	  foo.bar(hello)
+	What is 'x'?  a pointer to an object and a function.
+	It is foo with the interface narrowed down to a single function.
+	The type doesn't have a name, though the interface might.
diff --git a/A Language b/A Language
new file mode 100644
index 0000000..2c3d71f
--- /dev/null
+++ b/A Language
@@ -0,0 +1,72 @@
+
+Tokenisation:
+
+	Different types of strings, separated by separators.
+	Types might be
+	  integer  e.g. 2345
+	  number   e.g. 3.14159e-3
+	  text     e.g. "hello there"
+	  name     e.g. Fibonacci
+	  symbol   e.g. <=>
+
+	In a string of symbol characters, how do we divide it up?
+	Could use the longest defined match, but then the list of defined symbols
+	must be processed quite early.
+	Anything else will be very arbitrary though, so longest-match is probably best.
+
+Simple Syntax:
+
+	There is a simple, all encompassing, syntax which is probably a bit clumsy
+	(though not too clumsy).
It basically looks like object method invocations:
+	  expr : expr ":" name expr
+	       | "(" expr ")"
+	       | name | text | number | integer
+
+	Should lists be in this simple syntax (for args to a function), or should
+	they be created with
+	  list : create 27 : append 12 : append 1 : append hello
+	gives a list  27, 12, 1, hello
+
+	No, I think that the language needs to "know" about lists more than this.
+	Then we need a distinct bracket for lists, so that we know the difference between
+	an expr and a list with one element.
+
+Nicer Syntax:
+
+	There is a mechanism for defining various syntactic forms to improve the syntax.
+	e.g.
+	  "if" A:boolean "then" B:block "else" C:block "fi"
+	transforms to
+	  A:select << B, C >>
+
+	  A:addable "+" B:addable
+	transforms to
+	  LUB(A,B):add <>
+
+Local variables:
+	Haven't really thought through how local binding works, especially for loops.
+	Maybe a local name binds to a single object, possibly created for the purpose.
+	This needs to be some sort of meta-syntactic form....
+	If there is only one object, do we need/want to declare a type for the name?
+	Is it not obvious from creation?
+
+Types:
+	A "type" is a set of objects having common behaviour.
+	A class is a means of creating new objects.
+	A "type" can "inherit from", or "be a subtype of" another type.
+	A class creates objects of a defined type.
+
+	An object has (optional) state and behaviour.
+	Each behaviour is a "class" in that it creates (or produces, returns) objects.
+	A class takes arguments to tell which sort of object to create.
+	A "package" is an object without state that pre-exists in some sense, and contains
+	a lot of classes, which create objects, each with their own behaviour.
+	How does this relate to parameterised classes which have parameterised types?
+	This simply requires that the type of (returned by) a class is allowed to be a type-function of
+	the known types of the arguments, and that classes (functions,methods) can be passed as arguments.
+
+	  stack(type)
+
+
+Execution blocks, and bindings, and scope thereof.
diff --git a/Abstract Syntax b/Abstract Syntax
new file mode 100644
index 0000000..7acbd32
--- /dev/null
+++ b/Abstract Syntax
@@ -0,0 +1,44 @@
+
+Hiding behind the concrete syntax there is an abstract
+syntax which embodies all the important concepts without
+being distracted by sugar.
+
+A core concept is object method calling.
+The actual function is determined from the concrete type of the first
+argument.
+So the method call takes a list of objects from local names and
+returns a list of objects which are associated with local names.
+
+	(a,b,c) = z.Function(x,y,z)
+
+The formal args of the function may end in @rest which causes extra args to
+be placed in a vector.  Alternatively a vector can be passed as @v.
+
+Maybe the list is just part of the concrete syntax.
+A function (like an object) has a set of internal names.
+A function call involves:
+	Setting some internal names
+	Executing the code
+	retrieving some internal names.
+So it is like
+	fn = New obj.method
+	fn.a=b; fn.b=a; ...
+	Call fn;
+	p=fn.z; q=fn.y;
+
+Note that 'fn' is a concrete object and we know how to
+access the various fields directly - they are not accessor functions.
+
+if/while/switch/etc become if/goto.  If the given name is not false,
+control is transferred to the label.  A distinguished label effects 'return'.
+Constants... get assigned names in the function preamble.
+
+
+Objects can be:
+	refcounted - freed when refcount hits zero
+	'once' - copied when refcount would become 2.
+
+Also, objects can be dependent on some other object.  In that case,
+references within the dependency group are not counted, and references from
+outside affect the whole group.
This is particularly useful when objects are attached to function calls
diff --git a/Blog-outline b/Blog-outline
new file mode 100644
index 0000000..93574de
--- /dev/null
+++ b/Blog-outline
@@ -0,0 +1,585 @@
+
+Supposing I were to write a series documenting my language experiments.
+What would I include:
+
+
+1/ Design rules / guidelines
+
+	- serve programmer, not compiler.
+	  Obviously the compiler serves the programmer, and many things
+	  make life easier for both.
+
+	- Allow programmer to tell the compiler what they are thinking.
+
+	- Builtin types should not be too special.  Whatever is available to
+	  them should be available to others.  So operators and literals
+	  should be available to all types.
+
+	- Similar things should look similar.  This encourages a uniform
+	  syntax which is easier to remember.  It can also encourage
+	  mnemonics.
+
+	- Different things should look different.  Just because two
+	  distinct concepts can be implemented with a single syntax doesn't
+	  mean they should be.  Code is read more than it is written, so it
+	  should be clear from the syntax what the intent is.
+
+	- Build on previous languages.  Programmers already know several
+	  languages.  They shouldn't have to re-learn everything.  Only do
+	  an old feature in a new way if it adds substantial value.
+
+	- Allow a range of verbosity - let the programmer choose.
+
+1a/ Ocean - better than C
+
+2/ literate programming and md2c
+
+2a/ lexical structure
+
+	- simple - we have lots of history - use it.
+	- general - use same scanner for multiple purposes
+	- newlines and indents
+
+	- comments and white space
+	- newlines and indents
+	- identifiers: ID_START ID_CONT plus 2 lists
+	- specials
+	- reserved words - must be a subset of identifiers
+	- strings ' " ` ''' "
+	- numbers 0 0x 0b e p
+
+3/ LR grammar basics with a calculator and error recovery
+
+4/ Adding indent/newline handling to LR Grammar
+
+5/ Statements
+	Structured control flow.
+ : INDENT UNDENT -or- { } + if EXPRESSION BLOCK + or + if BLOCK then BLOCK + ditto for 'while'. Means 'do{}while' not needed; + Do I want "while: then:" or "while: do:" ?? + + ditto for 'switch' but this has a twist. The case labels can + be a new enum. Also labels can appear multiple times! + including default??? + Also for/then/else and while/do/then/else + + 'next variable' instead of 'continue' or 'loop' + 'last variable' instead of 'break' + + 'return value' and 'use value' + + 'for ???' + The point of 'for' is to allow a co-routine. + So we want really simple heads just like go + for initialise; next + while condition + or + for varlist, iterator := value.each() then iterator.next() + calls iterator.next() + + funcall + assignment/binding + + defer? fallthrough? exceptions? + + + for while/if/switch then do case else + + 'for' requires 'while' + 'switch' excludes 'then' + + [for] while [then] do [case] [else] + if then [else] + switch {case} [else] + + 'then' is implied after 'if expression' + 'do' is implied after 'while expression' + + But how do I get 'then' after 'while expression:' ?? + + for a = 0; while a < 10; then a+= 1: + print a + + I could have 'for then while' ... + for SimpleStatements + then SimpleStatements + while Expression: + statements + else: + something + + + We don't need 'break' as we can 'use false' in the condition + block. + we probably don't need 'continue' as we can 'use true' and have + the 'but always execute this bit' in the statement block. + +5.0/ + variables and scope + + is hole-in-scope allowed? No + Can we import a scope + - from a module? - yes with version number or name list. No over-riding allowed. + - from a struct? - nice, but not nice enough. + might be useful in 'switch' Might be OK with explicit list. + - does a syntax like $foo help? Why not just use binding. + + We introduce a variable with(?) + name : type = value + but could that be ambiguous with ':' being used to start a block? Probably not. 
+ Multiple variables could then be: + name:type, name:type, name:type = val, val, val + but that looks rather silly. + name, name, name: type = val, val, val + Then we cannot assign a new and an old in the one asignment, to a tuple (from a function). + Could use "var!" to assert a new variable. + + a!, b = fred() + + a bit ugly? + + An issue is that we introduce new state incrementally through a block + of code, but we don't really want to keep indenting. + Yet it would be nice to mark where the variable's scope begins and ends. + a! could introduce with assignment and close with usage (?) + + a! = function() + if a >= 0: + x += a?; + + if a = function() + use a >= 0 + then: x += a; + + I guess passing 'a!' with pass-by-reference, or generaly taking a reference + would be bad. + + How much do we really need to declare variables? + In most cases types can be computed. In those cases we just need to guard + against typos. Also make it clear to human reader what intention is. + For the former, ensuring each name is both assigned and use is probably enough. + For the latter, we want to differentiate between "assign" and "re-assign" + + Maybe '=' normally declares a new var, and '!=' over-rides?. Or :=. + i.e. ?= for any '?' changes an existing value. + Just '=' defines a new name. + But what about tuple assignment? only newnames are allowed. += doesn't work, nor + does := + + Swap assignment? Move assignment. They require lvalue on both sides. + swap a, b (b becomes a) + move a, b (b becomes nil) + + move shouldn't be needed for locals as data flow will figure it out. + It is needed for members of structures. Can I use a symbol? <=?? <<= ?= =< == + Or do I mark the lvalue '@a' means "return the value of 'a' and set 'a' to NULL" + b := @a.sort() + + I like '=' for assigning an immutable binding, and 'x=' for some 'x' for + mutable bindings. So '+=' adds and '*=' multiplies. + ':=' replaces. + But what introduces a new name? + var x + is a little bit noisy. 
+ x .= value + means "has value now, but might change" The visual connection between + "." and ":" might help. + Maybe + x := value + does two things (two dots): declares the name and assigns the value. + x .= value + just does one thing - it assigns a value. + + To introduce a variable without giving it a value we assign '_' + + x := _ + if foo: + x .= thing + else: + x .= other + print x + + + Type can be given in <> + + x := 27 + + Though that might not be good as is otherwise unbound. + We could go with + x:int := 27 + + Though there are more colons than I would like. Probably <> is OK. + + I like the idea of binding a name to a field rather than to a value. + This is like by-reference function parameters. + It is different from a pointer because it is really just a syntactic shorthand. + i.e. the binding is constant. + I'm not sure if this is really useful though. + + I also like the idea of the Pascal "with". I have occasionally missed it. + It exposed fields from a structure into the namespace. + Unfortunate it isn't obvious how to expose two structures, particularly + of the same type. I guess + x = &thing + x.field y.field is good enough. + + + Do I really need multiple assignment? + It is useful for 'swap', but I think I prefer an explicit 'swap'. + It allows unbinding of tuples, but is not + a = tupple + a.1 a.2 a.3 + just as good? I guess names are better, but if names are important + maybe they should be declared in the tuple. + + One benefit of multiple assignment is that a "simple statement" + can declare multiple variables, useful in a "for" clause. + That could just as well be handled with a 'simple block' which + is 'simple_block ; simple_statement' + + + + Q: Do I really need "x:=_" in the above? + As the "print x" usage is not an initial usage, there must be + a prior assignment - or two. So I could make it work, but do I want to? + It would mean that when reading code I cannot easily tell the lifetime + of a name. 
+ + Maybe I use a 'var' statement to declare names + var a, b + + What if I allowed a suffix statement which maintained the scope + of the cond statement, but could affect the more general scope. + + if foo: + x := thing + else: + x := other + finally: + b := x + print b + + I could declare that a name is bound throughout the whole block in which + is appears and if it appears in multiple blocks, one of those must contain + the others. + On every path that leads to any usage, the name must be initially bound. + So if it is defined in one branch of an 'if' but not the other, then it must + be local to that if. + If it is bound with a do loop, it must be local. + If it is bound in all case and the 'else', then it could be more global. + + multiple assignment is useful to collect procedure return + a, x+, b: = myfunction() + + myfunction(): + a=$1; x+=$2; b:= $next; + + i.e. after evaluating a function, all the return values are + available as $N or $name until the next procedure call. + In that case $$ could be an error + + myfunction() + switch $$: + case filenotfound: whatever. + + What about functions called inside expressions? + The value cannot be used as there isn't one. So all code must be + dead until $$ is tested. + A function could identify return values as $xx names. A type + might be 'error' which has a special behaviour. + + I wonder about + some_expression ? $$+1 : other + no point + (some_expression ?: other-1) + 1 + +5.0a - decision time. + + A new local name (variable) can be introduced with: + name ::= value // binding is constant + name::type = + name := value // initial assignment + name:type = + name = value // name must already be defined and is being replaced. + + This binding extends at least until the end of the enclosing + statement. If the statement loops, the binding ends with the loop + If the statement is one of alternates, the binding continue only + if all branches introduced it, or only into parallel branches + in later conditionals. 
+ + After the minimal extent, a new binding will over-ride. + The name must have been used before it is over-ridden. + + Bindings can be changed with + name op= value + acts as + name = name op value + + Multiple assignments are not supported. Use + a=1; b:=2; c = 3; + if you want. To swap two bindings we have + swap lvalue, lvalue + More that 2 can be given and they are rotated with first value + landing in final lvalue + + + Assigning to record fields and array elements normally uses + =, though := can be used for record fields when initialising + a record before first use. + + Assigning to a reference normally makes the reference refer to + something else. If the reference is to a struct or array, then + foo.field or foo[index] can be used to assign to a member. + So assign to the whole thing being referenced, use + foo. = value + which is easily confused with + foo .= value // damn. + + For each name we need to identify places in the code where is + initialized and change, and then each usage needs to link back to + one of those. If a variable is changed conditionally: + if x: a = 1 + subsequent usage of 'a' link back to the end of that whole + statement. + + We can compare two expressions using the target of these links when + comparing bindings. If they match, then the values from the + expressions are the same. This can be used to determine where a + name is valid. + + Q: what about concurrent memory models. Need to study this. + + +X/ types-1 + + boolean. Needed for 'if' and comparisons etc. + Maybe places that might expect boolean actually expect object with 'test' + method + + 'order': Less, Eql, Greater + a ?= b or a name:type, name:type) + interfaces { name:functiontype, name:functiontype,...} + Each function has an implied(?) first argument + "self:self". The type 'self' can be used in other + args and return values. 
+ +X/ types 3 + + algebraic types + <: :> &(intersection) |(union) + parametric types - type or constant parameter + value-dependent types. Value can be quite distant + linear types: number of references is part of type and can depend on value + temporal types: linear progression depends of value. "clock" concept + needed. + parallel types(?) can be accessed in multiple threads. Maybe + atomic types(?). + dependent types that depend on an atomic can also be parallel + e.g. they becomes writable when an atomic has some value. + a refcnt atomic could interface with 'linear'... + + A borrowed reference might need to indicate where it is borrowed + from? There must be some strong reference which we "know" won't be + dropped. e.g. it could be read-only for the lifetime of the borrow. + + +5a/ functions, procedures and tuples. + + A function can return 1 value. A procedure can return + 0 or more - a tuple to be precise. + So where can a procedure be used? + - direct assignment + no, all return values are available in $N until next procedure call. + + or maybe proc(a,b,c) -> x,y,z or proc(a,b,c, out:x,y,z) + +Q/ + dynamic dispatch and polymorphism. + + Dispatching a method call against a reference of incomplete type can be handled + is various ways. We need to understand these to understand consequences of choices + about how methods are attaches to types. + + 1/ If only one, or may be 2, methods are needed for the apparent type, then they + can be passed around with the pointer. + This is exactly what qsort() allows. 'comparable' has a single method + 2/ If only one interface is needed, a pointer to that interface's implementation + This requires interfaces to be separate well defined things, which isn't + the case for 'go' I think + 3/ The object can contain a pointer to one of the above or to a function which + finds and returns a given interface or method. 
This requires each interface
+	or method to have a globally unique name, which isn't too hard to manage using
+	the relocating linker.
+
+	I think that every object which has interfaces needs to have that lookup function.
+	It may be in the object, or may need to be part of any reference.
+	Arrays etc can be parameterised by a concrete type so they can hold one function
+	for many references to small objects.
+
+	A module can define an interface to some other type, or to an interface.
+	So a module might define a "sort" interface to an "array of comparable" interface.
+	If a module imports several modules which all add different interfaces to an
+	external interface, then the importing module must define a lookup function
+	which finds all the different methods by 'name'.
+
+	Any data type can have a collection of methods.  Some of these might belong
+	to an interface.  The others can only be used when you have a reference of the
+	type, not of an interface.
+	A data type can declare that it contains a dispatch function, or the compiler
+	will use fat pointers.
+
+	A data type might instead contain an enum - which might be much smaller.
+	This assumes that all subtypes are in the one module and the compiler can
+	create switch statements to handle all interface methods.
+
+X/
+	Error returns from functions.  Exceptions?
+	-errno works surprisingly well with ERR_PTR().
+	But NaN works even better.
+
+	Sometimes we might want to handle errors in normal flow.
+	Sometimes we might want them to be exceptional.
+	go distinguishes by allowing "foo, err := function"
+	to catch the error.
+	I could allow 'foo' to 'hold' the error and so
+	  foo := function
+	  except foo == errortype:
+	     do stuff
+	Without an 'except', the block is aborted.
+
+	How does that work for procedures?  They explicitly return err if needed?
+	Or any return can be conditional.
+
+
+6/ Declarations
+
+7/ Expressions
+
+	- * / % & | ^ && || ! &^ &~
+	>> << < > <= >= !=
+	+= -= *= /= %= &= |= ^=  (one token or 2?)
and or not cand cor  "and then" "or else"
+	else then
+	?:  if/else
+	max min  (like 'and' 'or', or max() min())
+	lambda somehow?  fn(
+
+	x.y  x()  x[]  x{}  x<>
+	.1 .2 .3  for tuple access.
+
+	* as a prefix operator dereferences
+	< as a prefix operator dereferences and sets to nil.
+
+	precedence - how many levels
+
+	with tuple:
+	  $1 $2 $3 ...
+
+	Integer division?
+	  - overload '/'
+	  - 'div'
+	  - '\'
+	  - '//'  - no, that's a comment
+	  - /_  (divide to the floor)
+	  - /-  (divide and discard remainder).  then
+	    -/ could be 'keep remainder'.
+	    but /- could be 'divide by a negated value'  4/-5
+
+	String concatenation?
+	  - overload '+'
+	  - use '++' as general 'join' operator
+
+6/ Pointers are special
+	The type can carry refcount info and locking info
+
+7/ assignment and binding
+	patterns. i.e. destructuring.
+	If a structure is a namespace, then "with" might populate the active namespace...
+	That could trigger hole-in-scope though, which is bad.
+	Patterns are mostly used in switch/case.
+	  switch value: case pattern
+	and the point of 'pattern' is that it might not match.  e.g. it might assume
+	some element is not NULL, and has a particular structure.
+	e.g. if this is a cons-cell, do that, if it is nil, do the other.
+
+	Ahhh, no.  This is used for tagged structures.
+	The case determines that a given tag is active, and makes the relevant
+	fields easily available.  Is that syntactic sugar needed?  I don't think so.
+	A switch might be useful, but it doesn't need new syntax.
+	Maybe the name '_' could be bound to the recent 'use'd value so
+	  switch some_funct(...):
+	  case _.tag = foo : ??
+	that's a bit ugly.
+	  tagswitch some_funct(..):
+	  case foo: print _.foo_name
+	that is probably cleaner.
+	So 'tagswitch' is a shorthand for
+	  switch: _ = X; use _.tag:
+
+8/ Operator choices - is this part of Expressions?
+	a else b   // same as a ?: b
+	a if b else c  // b ? a : c
+
+9/ Type syntax
+
+	I'd like to use ':', but not sure if it is being over-used for
+	blocks.
+ However: + field : type + declares a field to have the type, so + fred:int = 27 + creates an integer called fred with value 27. + A type can be: + A name "int" + A struct "{field: type, field:type, ..}" + An array "[ length : type ]" + A procedure "(arg:type, arg:type -> result:type, result:type)" + A function "(arg:type, arg:type): result_type" + A tagged union in a struct + "{ struct fields, tag -> name:type, name:type; tag2->name:type ...}" + A borrowed pointer "*type". + An owner pointer "~type" + A counted pointer "+type" + A collected pointer "?type". + +xx/ output formatting + +10/ modules, packages, exports/imports and linkage. +.... + +11/ standard library + strings. hash, file, regexp, trees, channels/pipes + +12/ reflection + - useful for tracing written in-language +unit tests and mock objects? diff --git a/Byte Code b/Byte Code new file mode 100644 index 0000000..f4f50d0 --- /dev/null +++ b/Byte Code @@ -0,0 +1,135 @@ + +Thoughts on a bytecode level language to implement an OO language on. + +Omo - Object/Method Oriented. + +Everything is an object, and we do things by invoking methods on objects. +This invocation passes some objects and returns some objects. + +There are a small number of different implementation styles of +objects (styles of classes). + 1/ builtin implementations, such as boolean and code variable + 2/ external implementation - e.g. C code + 3/ explicit implementations, built using code + + +There are two different objects of type 'boolean' (class Boolean), one +we call 'true', one we call 'false'. +One of the methods that this type accepts is 'select', which takes two +objects and returns one of them. + +The class 'code' accepts the method 'execute', and is passed .... +something. + +So to implement an if statement, we create code objects for the two +forks, pass them with the select method to the condition, and invoke +the execute method on the result.
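The if-via-select scheme above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the Omo bytecode itself; the names Code, Boolean, select and execute follow the notes, everything else (thunks, TRUE/FALSE, NOP) is invented for the demonstration.

```python
# Sketch of "if" implemented purely by method dispatch, as described above.
# Names (Code, Boolean, select, execute) follow the notes; the rest is invented.

class Code:
    """A code object: 'execute' runs a stored thunk and returns its result."""
    def __init__(self, thunk):
        self.thunk = thunk
    def execute(self):
        return self.thunk()

class Boolean:
    def __init__(self, truth):
        self.truth = truth
    def select(self, if_true, if_false):
        # 'select' takes two objects and returns one of them.
        return if_true if self.truth else if_false

TRUE, FALSE = Boolean(True), Boolean(False)
NOP = Code(lambda: None)

# "if cond then A else B" becomes: cond.select(A, B).execute()
result = TRUE.select(Code(lambda: "then-branch"),
                     Code(lambda: "else-branch")).execute()
# result == "then-branch"
```

Note that no conditional branch instruction exists here: control flow falls out of which object 'select' hands back.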
+ +To implement 'while', we evaluate the condition, then invoke the +select method, passing the loop body and null, and then 'execute' +the result. + +Maybe 'execute' should be passed a 'continuation' code, but it has to +return sometime, so why not straight away. + +So, while X do Y done; becomes + +X.select(stuff, NOP).execute + +stuff: Y ; X.select(stuff, NOP).execute ; + +An explicit implementation provides a piece of 'code' for each method +needed for that type. An object of an explicit class contains a list +of objects. Some of these objects might be 'variables' but that is +not essential. They may well be externally implemented objects. + +A variable supports two methods, get and set. They return and +receive a reference to an object, respectively. + +A piece of code can contain object identifiers and method names +(numbers, whatever). +An object identifier can refer to + A subobject within the current object. + An object imported into the class of the current object. + An object passed with the method which called this code + A temporary object +After a method completes, the object(s) that it returns are assigned +to temporary objects so that other methods may access them. + +So a 'code' starts with + -> t1 t2 t3 ... +to assign 'names' to passed objects (references). +Then has multiple + Method (I1 I2 I3 I4 ) -> t6 t7 t8 +To call method on object I1 passing I2.. returning t6 ... +And finally + Return (I10 I11 I12) +to return some object references. + +One can imagine that the interpreter might do dependency analysis and +possibly invoke different methods in different threads if they are +independent. + +Question: What about when the dependency is in side-effects that cannot +be seen from the calling pattern? Maybe methods with side-effects need +to be flagged -- what about methods that don't have side effects, but +depend on the side effects of others...
+Probably just have sync points within the 'code' such that preceding +methods must complete before following ones commence. + +Some methods might return 'pending' objects that in some way refer to +a thread which is continuing to evaluate the method. The interpreter +would not then wait for that thread to finish until the object was +actually needed. e.g. until it is ready to invoke a method on it, or +pass it to another method. Some methods may allow pending objects to +be passed in for certain arguments. +This could allow easy parallelism. +When the last reference to a 'pending' object is dropped, it may be +appropriate to let it continue in its own right, or it may be +appropriate to terminate the thread(s) working to evaluate it. This +decision would probably depend on whether side effects are expected. + +There would be some standard external classes, such as string and +int32. +The distinction between builtin and standard external classes depends +on whether the interpreter needs to 'understand' the class, either for +proper implementation of explicit classes, or for effective +optimisation of code. + +Multi-threading can be achieved in two ways + 1/ with OS support, such as Pthreads + 2/ by round-robining a number of states in the interpreter. + +1 is probably fiddly as yet (witness the problems that java seems to +have - or is that just for JIT compiling) +2 does not naturally allow for multi-threading in external classes. +For 2, we would have to require that external methods either be 'fast' +or that they engage in co-operative multi-tasking, and never block. +Thus there needs to be a return status which says "I haven't finished +yet - call me again on *this* event", which might be selectable, or +time, or process-exit, or immediate, or semaphore release. This would +presumably cause a thread to be created, and a pending object returned. +The event would then be registered as blocking the thread. + +Semaphores probably need to be a builtin class.
Some methods will +require a semaphore within the object to be Ped before continuing. + +References to objects will probably be fat, including a reference to +the class, and maybe a context. Hopefully dynamic optimisations will +reduce the need to always carry a fat pointer, but what about pointers +that are embedded in objects? Only compile-time optimisations will be +able to help them. +Maybe some objects (types of) have the class pointer in the object, +others have it with the pointer, and still others are determined to +have a single implementation, and so need no class pointer. Or maybe +the context can determine the class ?? + + +A "method" is a name-space-change plus a 'code'. + +But sub-blocks like to allow local variables, so mini-methods are +required, which inherit context. In any context, we need a chain of +parent contexts. Many O-O langs have only "object" and "class" (and +static "package"). I think that a completely general scheme is +needed. Objects can often be 'part' of a larger object, from which +some context is needed, and so on recursively. \ No newline at end of file diff --git a/Design b/Design new file mode 100644 index 0000000..4009150 --- /dev/null +++ b/Design @@ -0,0 +1,2803 @@ +Design notes for a language - 2013 version +=========================== + +Design Philosophy +----------------- + +A large part of design is compromise. Once the good ideas are formed, +they must then be blended nicely, and this blending always involves +compromise. + +So we need a clear philosophy to guide this compromise. This +philosophy does not serve to provide simple answers, but rather +provides some unifying themes that can be used to motivate, or at +least to justify, the answer. + +1/ + The compiler needs maximal information so that it can detect errors + and optimise implementation. + + The syntax needs to be clean and non-distracting so that a human + reader can easily see what the code is trying to say.
+ + These two principles are clearly in conflict. One wants to add + more to the program, the other - less. Finding a balance between + these two for each design element is key to success. The best answer + will involve finding a syntax which provides information without + distraction. This will not always be possible (if ever) but is a + worthy goal. + +2/ + Language should enable a feature to be provided, rather than provide + it directly. + This means that the feature can be in a library, and can be changed + using the language itself. This is inherently more flexible. + However if the language doesn't "know" about a feature, it is less + able to type-check and optimise it. + + + 3/ Things that are different should look different. + This allows the compiler to correct silly mistakes more easily, and + reduces the chance of making them. + + So trying to use just a few core concepts is a mistake. It muddies + human thinking and makes life harder for the compiler. + + 4/ + familiar is better than novel + mnemonic is better than obscure + short is better than long + consistent is better than irregular + pronounceable is better than unpronounceable + + +Design Concerns +--------------- + +Name of language +token syntax +comments +expressions or statements + if/then/else -> ? : + But what form could we use for a case/switch? Maybe we shouldn't. + +syntax of structure blocks +How types are constructed and described + - scalar: int, float, char, Bool, enum + - vector: zero based. with size + - struct + - tuples - how first-class are they? + - union/variant/alternate + - non-discriminated unions? + - lookup table, stack, queue, + - functions and methods + - lambda forms + - closures + - references +Does a "pointer to a pointer" make sense in a strongly statically + typed language? +Should enum values be pervasive, or require typename: prefixes? + 'import' should be allowed to import an enum namespace. +Slices + - pointer + length. What is the role of 'capacity' in Go? 
+formal parameters - in/out/inout? Just a struct. + are the names of parameters part of the type, or not. + Probably not. But names in struct are part of the type. So not + really a struct. +return values from function: 'return', or name=value +closures for loops + what does 'break' mean in a closure? or 'continue'? or 'return'. + I guess the closure includes implicit labels. _loop _func + So "return label value,value" could return from enclosing block named + 'label'. +break/continue/goto +__reserved_names__ +case/typecase/match/switch +exceptions/errors +defer/finally/after... + how is this scoped? + - does it create a block: try BLOCK finally STUFF + - does it run on function exit + - can it attach to the lifetime of a value: + x = take_lock(semaphore) + unlock implicitly happens when x binding is broken. +printf - formatting, input/output in general +manifest constants of various types +What are strings? What operations are in the language? +What sorts of numbers are built in? + Sized ints, unlimited ints, float, fixed, imaginary, complex +Initialization and default values +Polymorphism: + subtype + parametric + If parameter is untyped, then it can still be useful if + any needed function (e.g. compare, add) are passed in as well. + These functions could be in a structure ... which could be a vtable.(virgil) +Scope of bindings +overloading of bindings. + - functions/methods with different arg types + - operators for different objects + - function with different first arg type is like a method + and can be useful for sin(), min() abs() etc. + Certain names could be declared as methods and converted to + method syntax when called. +subrange types + - useful for validating array references + - useful for avoiding 'default' case in switch + - can be deduced from some arithmetic: 'mod' particularly +overriding of bindings + - relates to inheritance. +Nature of bindings: immutable, const, var, computed +target: script, jit, pre-compile, ... 
+"eval" +memory allocation and management +static or dynamic typing +threads/tasks +co-routines, generators +macros +equality tests +assignment semantics - references, copies + binding versus assignment: particularly relevant with tuples +Type inferences + avoid redundancy, but redundancy can be good. +modules +foreign function interface +pattern matching for destructuring + This is the general case for "tuple = tuple" with the question of + which bits on the left are literal and which are formal names. + This is only an issue if the appearance of a value starts like + an identifier "Some(value)" "None". +'unsafe' at different levels + - bit manipulation + - memory allocation/casting + - concurrent access. +tracing/debugging + can the language/compiler make it easier to trace code execution? + flag every time a field changes + +dependent types +reflection + tags for reflection + +structure embedding using "import struct" or "import fields from + struct" ?? + +function attributes: optimise_options, inline, .... + +destructors? + - go doesn't have them - relies on 'after'. + +introspection + why is this so much fun? I never use it in C!!! or Python. + introspection in Go looks weird and is probably related to lame type + system. + +nullable pointer: builtin functionality or via a union? +array indexing: should checks be explicit where not resolved by type? +operators: type must match. A constant can be one of several types. + iterate through options until expression is type-correct. If + there are several constants .... might need to be careful. + a = 1 << n '1' must match 'a' + if (1< (return type) + + Do we want a name to introduce? "struct" "fun"/"fn"? If so we + should for consistency have array ... like Pascal. + + Without a name: + { name:type, name:type} is a struct + [length:type ] is an array ([5:int] reads "array of 5 integers").
+ (from:type,from:type -> to:type, to:type) is a function + { tag -> name:type, name:type; tag2 -> name:type, name:type } is + a union + + That is really rather elegant. May be a bit too terse though, don't + know. + + 'tags' get assigned globally unique numbers, so multiple tagged + sets can be embedded in another. + However if the first one is "tag = 0 -> ... ,", then zero etc + is assigned and embedding is more limited. + + Does that just leave pointers? + - closures have same apparent type as functions + - methods do too. + - tuples only have implicit types - unless they are structs or + function parameter/return + - vtables ? interfaces + + So: pointers. + Maybe just a leading '*'. + But do we want to distinguish among: + borrowed + owned + counted + collected + These are properties of the object in some cases, the pointer in + others. + An object can be: stack, owned, counted, collected + A pointer can be: strong or weak. + But a strong pointer needs to know what to do when dropped: + - nothing (so may as well be weak) + - destroy/free + - dec ref and maybe destroy/free + + So a pointer needs to be one of: + borrowed, owned, counted, collected + * ~ + ? + Pointers to stack are always borrowed. Raw pointers are borrowed + too, but can only exist in unsafe code. + + And vtables. These are usually created automatically from + method and interface definitions. So they don't need a syntax. + Interfaces do though. They are sets of named functions. So + they look just like structures but all the fields are functions. + In fact they *are* structures. They use embedding for + inheritance. + + Pointers to different values are quite different. + - pointer to scalar or record is just an address + - pointer to fixed-sized array is just an address + - pointer to var-sized array is address plus size + - pointer to implementation is a pointer as above plus a pointer + to the vtable. + This makes var-sized arrays with vtables a little awkward. 
+ + But is there really any value in allowing implementations on + var sized arrays? Yes, I think that would be nice. I guess + they get converted to struct for a vtable pointer, with the struct + stored ... in the 'once' space. + + Should it be possible to add to a struct after it is defined? + This sort-of makes sense for unions where extra cases could be + added. + + With structures we can embed old in new. Maybe that way around is + best. + This new union has all the values in the old one, plus some more + values. + A value from the old union can be assigned to a variable of the + new union, but not vice versa without range checking. + How does multiple embedding work with unions? With an offset? + + + Initialisers for different types: + Array: [ value, value, num: value, ] + struct: { name: value } + union: { tag -> name:value, name:value + | tag -> name:value + } + tuple: (value, value, value) - can map to several known types. + function: .... + + functions are more awkward. If the type is given explicitly as + def name : type = value + then we don't want to repeat any of it, but if we don't have an + explicit type, we need to at least list parameter names. + def foo = (arg, arg, arg) { + code + return rv,rv + } + + or assume '_' is the argument list + def foo = { + my, args, = _ + return args, my + } + Real asymmetry. What is the inverse of 'return'? + import? with? receive? + 'receive' might work with co-routines? Maybe not. + + + Another thought about structs. + There are 2 sorts: Those which are used for external + communication and must be in exact order with no padding, + and those which are only used internally and the compiler is free + to reorder as it chooses. The second sort could usefully be + extended later. This would allow a literate programming style + where elements are added to the structure at much the same time + as the code which uses it.
+ External communication structures could be flagged as big-endian + or little-endian (or host-endian). Or should each field have + its own endianness? Probably not. + + + Pointers: + I don't really like the '*' prefix operator in C, nor the need for + '->' after pointers where '.' is enough after a struct. + Pointers to arrays behave the same as the array. It would be nice + if pointers to all things behaved just like the things. + + There are some exceptions though: + 1/ p = foo + this could make the pointer point to foo, or copy the contents + of 'foo' into the thing pointed to by 'p'. + Both are valid so we need a simple syntax to differentiate. + 2/ if p < q: + Could compare the addresses of the boxes for e.g. double-lock + deadlock avoidance, or could compare the values themselves. + 3/ if p == q: + Does this say "the same box" or "the boxes hold the same + values"? Both are necessary. + + i.e. these are the things that you can do with a pointer: assign + it and compare it. + If p.foo and p[foo] and p(foo) automatically dereference, what + syntax do I want for the occasional explicit dereference? Or do I + want a syntax to inhibit auto-dereference? + p. = foo + could assign to the box rather than assign a new box. Or + p = foo + could assign to the box, while + p = &foo + could assign a new box. But pointers to pointers become + problematic... or do they? In general the language imposes + whatever referencing or dereferencing is needed to match types. + If & is present, then it skips up one level. + + But parameter passing should be like assignment, and the default + should be by-reference if possible. So assignment should be + by-reference. So we need + *p = *q + or + p. = q. + or decide that annotations on the right are deduced based on type, + *p = q + p. = q + p[] = q + + p < q should normally compare the values though.
+ p == q should compare pointers + + If p is a pointer to pointer to int, then + p = 5 + means a double dereference on the left, which I think is wrong. + p = + should always cause p to point to something new. + p. = + can change the thing p points to, so + p.. = 5 + would be needed. That's a bit ugly. + (p) = 5 + does that help? It shows something odd is happening. + **p = 5 + At least we are familiar with that. + + As pointers are distinct from other values (e.g. they cannot have + methods), they could have a different assignment operator. + p = value # assigns the pointer + p := value # copies the value into p. + + Maybe I'm trying too hard here. 'p' is the pointer. Any operation + on 'p' other than comparison dereferences it. That is ugly. + p === q + could be pointer comparison + or + p is q + like in 'crack': http://www.mindhog.net/~mmuller/projects/crack/Manual-0.3.html + + Maybe "p." means "dereference as many times as needed". + So it never receives a pointer, only a non-pointer value. + To change a pointer you could + newp = thing where newp :int* = p + + i.e. explicitly name the type of the value being copied. + p =(int*) q + + Thought: assignment is different from function parameter binding, + as the latter can never mean a 'copy' if the target is a pointer. + + + + There might be a bigger picture here. 'pointers' are very + different values to most others. They cannot be the receiver of + methods, only the thing they point to can. But what about pointers + to arrays aka slices? Can they be the receivers of methods? + Yes, in exactly the same way that arrays can. + Which is like saying that 'pointers' can receive methods, but only + the methods that their targets receive. + This is a little awkward for slices because they need to carry a + parameter: the size. Unless we support parameters more broadly and + decide that any type might have arbitrary runtime parameters. + So a slice becomes a hidden object with a size and a pointer. + Does that help???
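For comparison, Python's uniform reference model sits at one point in this design space: every name is a reference, `==` compares pointed-to values, `is` compares identity (the `===` idea above), plain assignment rebinds the reference, and slice assignment mutates through it. A small sketch:

```python
a = [1, 2, 3]
b = a          # 'assign the pointer': b refers to the same box as a
c = [1, 2, 3]  # a distinct box holding equal contents

assert a == c       # value comparison  ('a.cmp(b) == 0' in the notes)
assert a is b       # identity comparison (the 'a === b' idea)
assert a is not c   # equal values, different boxes

a[:] = [9]          # 'p. = value': mutate the box through the reference
assert b == [9]     # b sees the change through the shared box
assert c == [1, 2, 3]
```

Python resolves the ambiguity by giving rebinding and mutation different syntax, which is essentially the `p = value` vs `p. = value` distinction discussed above.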
+ + What I would like is: + a < b + compares the pointed-to values. It is exactly a.cmp(b) < 0 + a == b + does the same: a.cmp(b) == 0 + a = b + assigns the values + a = &b + assigns the pointer. + + + My biggest problem is that the mapping "x < y => x.lt(y)" combined + with the fact that field selection auto-dereferences, means that + "a < b" compares content. So "a == b" must also compare content. + So let's go for: 'a === b' asks 'are these as indistinguishable as + possible'. + And in 'unsafe-pointer' mode, a <<< b can compare pointers. + + Given that, what does "a = b" do? It really should assign + pointers. + So the remaining question is how to assign values. + a = .... // always changes a, not what a points to + a. = .... // always changes exactly what a points to + a.. = ....// changes the thing pointed to by what a points to. + + a = copy b ??? + This is like "swap a, b" in a way. + + Getting the 'type' of that would be interesting. owned? borrowed? + + + Structures can be internal (no implied order or padding) or + external (order and alignment rules apply, endian can be + specified). + Unions can define a local enum, or use a predefined enum (Boolean + might be useful) + For unions, the enum might be implicitly stored or a stage + parameter, or it might be an explicit structure member. + Do we want names for these things like 'struct' in C, or do we want + to use 'traits' to distinguish, or something implicit? + + Enums: + In C, the enum values become global names and the variable has one + of those values, which are integers. + In Go, enum values are local to the type: type.foo type.bar. + In Rust, the values are global function names which can return + a suitably qualified union value. + + I like namespace control. I also like enums being tied closely to + unions. So if an enum is actually a union, then ??
+ If 'v' is a variable, and 'e' is an enumerated value with members x,y + v can be compared with typename.e + v.e is true or false + v.e.x and v.e.y are only accessible if v is known to be e + + or v.e could be one of these things that is a value or nothing, and + can be used with 'else'. + If there is no ambiguity, v.x is valid and can be used if v.e is + known. Or can be used with an implicit test of v.e + if v.x == 3 + is implicitly "if v.x && v.x==3" + or if a := v.x + + + Pointer traits + The compiler understands 2 types of pointers: strong and weak. + Weak pointers are held in the context of some strong pointer. The + strong pointer may be to the same object, or it may be to some + container object or allocation domain which is known not to free + anything for the time being. + Weak pointers can only be stored in life-limited variables which + are subordinate in lifetime to the strong reference. This includes + local variables in a lower scope, and variables inside the + strong-referenced value. + + Strong pointers can be moved, copied, or destroyed. + When moved, the previous storage must become invalid. + When copied, the storage manager must be told. This might + copy the object, add a refcount, or do nothing and leave it + up to mark/sweep garbage collection. + When destroyed, the storage manager again must know. It + might decrement the reference count, free the object, or again + do nothing. + + "same object" is a fairly vague term which needs to be made + specific. + It could be a structure - holding two pointers, one strong and one + weak. If they were to the same type that would be pointless. If + the strong were to a container and the weak were to an object that + is somehow 'inside' that container, it would make sense. + It could also be a linked collection of structures such as a linked + list or a binary tree.
In order for the language to be able to + "reason" about such a thing it would be necessary for it to "know" + about such a thing, so there would need to be a way to describe + it. There need to be rules for when two structures (including + arrays) are part of the 'same' object and when they are part of + different objects. We also need to know if these things are + hierarchical: whether something can be a 'sub-object'. + + The important rule is that all weak references in an object must + be invalidated before or as the superior strong reference is + moved. + This has many interesting cases, one of which is the doubly linked + list. In such a list the forward link might be seen as strong, so + the backward link would need to be seen as weak. It is a + distinguished weak link as it can always be found from the object. + So to move the strong reference, we need to be sure all weak + references are gone. That doesn't mean we need to invalidate all + weak references in the whole object, as we know how to get to the + only one. + So it seems some weak references are 'findable' - probably in + various ways. The language needs to know if an object can + contain: + - no weak references (so moving strong references is trivial) + - findable weak references (so moving a strong reference just + requires focused removal), or + - general weak references (so moving a strong reference requires + invalidating all references) + + That case of doubly linked lists is also interesting as a structure + can be on multiple lists. This can be managed in a couple of ways: + - refcount the structure so it can have multiple strong links for + the forward links + - have one list subordinate to another so that the struct must be + removed from the subordinate one (or verified not present in it) + before being removed from the primary one.
This is probably the + sort of assertion that would need to be specified in the language + + + When we have a weak reference stored in a structure we need to be + able to explain why it is safe. And when we remove a strong + reference, we need to explain why we know that all dependent weak + references are known to be resolved. + +Attributes of things. + Different fields, locals, globals etc may have different + attributes beyond their type, or maybe as part of their type. + Some of these can change during the lifetime of the name or value. + I need to collect them all together to try to understand them. + + constant - a value assigned (or bound) once and never changed. + if it is scoped then it is bound once whenever the scope is + entered though possibly to a different value each time. + A name might get reused within a scope if it is + only assigned when dead. e.g. + a = foo + print a + if bar: + a = 1 + else: + a = 2 + print a + here 'a' has been reused as the 'foo' value dies + after the first print (it can never be seen again). + This 'a' could still be 'constant' + + There is a question here: when would 'foo' be + destroyed if we had destructors? It feels like it + should be destroyed as soon as it dies. However if + it holds a lock or other resource, we might want it + to only be destroyed when it goes out of scope. + + stable - This is like 'constant' but applies to a value + rather than a name. It means the value (which + might be a large data structure) will not change + for the moment. 'stable' needs to be asserted + while being the sole owner of a value. + add-only- This too applies to a value. It says that other + values can be added to the main value, but not + removed. + This only applies to values included by dependant + references. + readonly- This attribute of a borrowed reference means that + it will only be used to read. The value might + change, but this reference won't be used to change + it.
+ A value can only become stable while all borrowed + references are readonly. + mutable - This is maybe just the absence of the above. + A mutable field can be changed as long as we have a + mutable pointer to the structure. + A mutable local variable can be updated when not + dead and can possibly have its address taken. + + owned - This is a strong reference - when it dies the + object dies. + Maybe we can have 'indirect' references through + e.g. a refcounter. Then when the ref dies, the + refcounter is decremented. And if it dies, the + target dies. + borrowed - A weak reference. Must be in the context of some + strong (owned) reference to something. The strong + reference might be to the same thing, or it might + be to a container object which is add-only + dependant- A reference in a structure cannot be borrowed but + it can be 'dependant'. This means that the + referred value must exist somewhere else.... need + to clarify this. + pure function + + + I think we will need user-defined attributes. These record + important aspects of the state of a value. These can be + asserted by unsafe code, and required by functions etc and the + compiler can check that requirements are always met. + So a list_head might have an 'empty' flag which is cleared by + 'list_add' and has a known value after the "list_empty" test. + The 'stable' flag in one data structure might be implied by + the 'held' flag in a lock. + + Integers could have a 'non-zero' flag which allows them to be + the divisor. They might also have an upper bound - like + arrays? + I sometimes "know" that a number is a power of two, which is + easy to test for + + A similar flag would apply to pointers which may, or may not, be NULL. + +Names for types? + I don't like typedef in C. It hides stuff. + I see a type name and I don't immediately know if it is a + pointer or a structure. + Structures need names as do enums and unions. + Pointers don't need names.
+ Scalars (uint32 etc) have names and don't need more, unless + some sort of 'unit' concept were included. + What about arrays? They are like pointers so maybe not. + But an array contains a size. It is like a record in that it + combines two things - member type and size. Is that enough to + justify a name? I suspect not, so in the first instance, only + things with names can have names. + +Polymorphism: + + Parametric polymorphism is quite important. + Various types can be parameterised by other types. This + sets the types of fields in structs and members of arrays etc. + Also want values as parameters for the size of arrays. + And internal parameterisation so the size of an array can match + some int field, or an enum can select an option. + + Functions/methods are very important here. The type + parameters may be implied by the actuals. This then needs + to imply the types of the results. + Each result can be an intersection?? Or maybe it can be + a type-function of the parameters - but do we allow + intersections? + + A parametric type is like a function. It needs a name and + some parameters: + type foo<a,b> = { + thing : a + stuff:b + } + A method for a parametric type is then a parametric-typed + function + interface foo<a> = { + fn foodo(first:a, second:a -> bool) + } + methods might want an extra type parameter, 'self'. + Using this would require the compiler to know that two objects + didn't just conform to the same interfaces, but were of the + same type. Shouldn't be too hard. + + But we also want a function to be parameterised. + choose(which:Bool, first:a, second:b -> a|b) + which could be: + choose<a,b>(which:Bool, first:a, second:b -> + result:a|b) + + When this is called, the actual parameters have a list of types, + including any interfaces. Each use of a type-parm intersects + possible types until we have a minimum set. Then the result is + a possible list of types... needs to be the intersections..
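The parameterised type and the `choose` function above can be approximated with Python's typing machinery, where type parameters are likewise implied by the actuals. A minimal sketch; the names Foo and choose follow the notes, and Union[A, B] stands in for the `a|b` result type.

```python
# Sketch of 'type foo<a,b>' and a parameterised 'choose' using typing.
from typing import Generic, TypeVar, Union

A = TypeVar("A")
B = TypeVar("B")

class Foo(Generic[A, B]):
    def __init__(self, thing: A, stuff: B):
        self.thing = thing
        self.stuff = stuff

def choose(which: bool, first: A, second: B) -> Union[A, B]:
    # The result type is the union of the parameter types, as in
    # 'choose(which:Bool, first:a, second:b -> a|b)'.
    return first if which else second

f = Foo(1, "x")   # Foo[int, str], inferred from the actuals by a checker
assert choose(True, 1, "x") == 1
```

A static checker such as mypy performs the "intersect the possibilities implied by each use of a type parameter" step described above; at runtime the annotations are inert.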
Interfaces:
        Interfaces list methods that can be called on a value.
        Often we want an interface with a single method.  This is
        particularly useful for the methods that operators map to,
        but there are lots of others.  Constructor, destructor etc
        etc.
        Go effectively does this as you only need to declare the
        methods that you will use - the effective type is the sum
        of all these.
        One advantage is you don't need two names for everything:
        Comparable:compare, addable:add, destroyable:destroy.

        But it can be good documentation to collect related
        methods together into an 'interface'.  It can also provide
        namespace control.

        method foo(self:A, b:int -> bob:rune);

        This declares a method which can then be realised for
        different concrete types.

        interface bar {
                foo(self:bar, ...)
        }
        declares the same but "bar.foo", not "foo".
        Maybe syntax should be more similar.
        interface foo(self:Self, b:int -> bob:rune); ??

        But now we have a preference for polluting the global
        namespace.
        If I want a method that is just for my type, it should not
        require an interface.  I guess:
                method mytype.mymethod(thing, thing -> thong)
        is a local method:
                method foo.bar(...)
        is an interface method.
                impl foo for mytype {
                        method bar(....)
                }
        can save some typing..

        Every type effectively defines an interface which applies only
        to itself using its name.

        Syntax is still a bit ugly.  I don't want to assume 'self'
        is the parameter name, and I don't want it to appear as
        first arg.  So where do I put it?
        Go does
                func (name Type) methodname (parameters) (result)
        Which works because all methodnames are global ... sort of.
        I want two methodname namespaces:
         - local to the Type
         - inherited from an interface.  I could use
           .interface.method??
           func (name Type) .compare(other Type)
           matches
           interface .compare(other T)

        Another thought.  I sometimes want an object to respond to
        the same interface in two different ways.
A bit like being on two lists.
        I might have a table with byname and byid lookup methods which both
        use [].
           mytable.byname['fred'] = entry
           print mytable.byid[42].name
        Sometimes ".foo" is an attribute, sometimes a view...

        So an interface can be bound to the object, or to a name in the object.
Garbage collection:
        Supposing that I thought this was a good idea, how would it
        work?
        The general idea is you keep track of what memory is allocated
        in what sized blocks (as you have to for 'free' anyway) and
        keep one spare bit for each piece of memory.

        Then you:
         - clear all those bits.
         - scan all allocated memory looking for anything that
           might be a pointer into the target area, and
           set the relevant bit.
         - when you set a bit, you scan that memory as well
         - when finished, any memory which doesn't have the bit
           set is ipso-facto free.

        Variations involve breaking the heap up into sections,
        possibly with different age properties, and only cleaning
        one section at a time.  That might mean keeping a record of
        all pointers that cross a boundary from one section to
        another.

        Another variation involves copying boxes to make free space
        more compact.  This requires finding all inward pointers.


        It is really hard to see this as a valuable thing.  The
        programmer should know what is going on in general, and the
        language should be able to track and enforce proper handling.

        It seems useful for languages which encourage lots of
        short-lived objects, e.g. for string and list manipulations.
        The language should be able to track these, free them when
        no longer needed, and copy them to a more stable area when
        they are made more permanent.

        I think I'm going to have to write some code and see what
        happens.


Type Equality:
        When are two types "the same"?
        Some languages use "same name".  Others use "same structure".
        Both have their appeal.
        Go uses structure unless both types have names - then it requires
        'same name'.
        I wonder if "int" is a name.  Probably it is.
        I'm cautious of structural equivalence of structures.  The 'names'
        of the fields are important and I don't think two fields are the
        same just because they are spelt the same.
        'arrays' don't have names, so structure works there.
        Parameterised types should match structurally too - array is just a
        parameterised type.  Ditto for pointers.
        Two structures declared separately are different though, even if
        the names and types are the same.
        What about functions: Are the parameter/return names part of the
        type, or are they merely internal name bindings?  They look like
        structures but aren't usually used as structures.  Yet when arg
        lists are long it could make code more readable to tag some params.
        I guess we fall back on "unnamed things match named things".
        If both function declarations have names, they must be the same
        declaration.  If either doesn't, they match on structure.

        A manifest value doesn't have a type.  So if I say:

                a = { foo=1, bar=2 };
        then there is no type.  That value could be of any type for which
        it is a match.  If the type of 'a' is known, it must match.  If
        not, the binding is illegal.  There is some finite list of types
        which this could match and we effectively try them all until one is
        type-legal, or we run out.


Automatic casting:
        Casting numbers to larger types is probably safe??
        What about casting a Foo to a union which has Foo as one branch.
        This can be rewritten at compile time to a simple union creation.

Expressions or statements?
        i.e. do I want to be able to say:
                a = if x then b else c fi
        so a becomes 'b' or 'c'.
        I tend to think not.  if/then/else/fi is a statement and making
        it an expression as well can be confusing and make it hard for the
        compiler to determine what your mistake is.
        This also applies to 'switch/case' but not to while/for.
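For contrast, Rust is one language that does make if (and match) expressions; a small sketch of how that reads, not a proposal:

```rust
// In Rust, if/else is an expression: both arms must yield the same
// type, and the whole construct produces a value.
fn pick(x: i32) -> &'static str {
    if x > 3 { "b" } else { "c" }
}

// match generalises the two-way choice to multi-way, which is
// exactly where ?: stops working.
fn classify(x: i32) -> &'static str {
    match x {
        0 => "zero",
        1..=4 => "small",
        _ => "big",
    }
}

fn main() {
    println!("{} {}", pick(5), classify(5));
}
```

The cost is the one the notes predict: when the arms disagree in type, the compiler's error can point somewhere surprising.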
        For 'if', can use
                a = x ? b : c
        but that doesn't generalise to switch very well.  Does that matter?

        Another syntax is:
                a = b if x else c;
        which is quite different to the reader - both human and machine.
        For switch:
                a = b if x,
                    c if y,
                    d if z,
                    e otherwise
        using true

Case/switch:
        Should this just be syntactic sugar over if/then/else?
        I don't like the idea of typecase or Rust:match as it
        provides functionality that is 'only' in the case statement,
        not stand alone.  To that extent I like Go:type assertion as
        it is stand alone.

        But what about fall-through.  It is implicit in C and explicit
        in Go with yukky rules.
        Why do we find that we need effective 'gotos' there, but not
        so much elsewhere?
        Maybe it is like dropping an else.
                if a:
                        b
                else if c:
                        d

        or

                if a:
                        b
                if a or c:
                        d

        No, it isn't quite like that.

        switch (num_bits) { /* fall through... */
        case 8: if (byte & 128) num++;
        case 7: if (byte & 64) num++;
        case 6: if (byte & 32) num++;
        case 5: if (byte & 16) num++;
        case 4: if (byte & 8) num++;
        case 3: if (byte & 4) num++;
        case 2: if (byte & 2) num++;
        case 1: if (byte & 1) num++;
        default: break;
        }

        readdir uses it: switch is on 'state' and we might handle
        several consecutive states.

        case TypeSegmentMap:
        case TypeQuota:
                /* These only get merged while in a checkpoint. */
                if (fs->qphase == fs->phase)
                        break;
                printk("Suitably in-phase\n");
                /* FALL THROUGH */
        case TypeFile:
        case TypeSymlink:

        one case is a special-case of another.  The enum
        doesn't show full structure.  Really want to repeat
        the TypeSegmentMap and TypeQuota with TypeFile etc.

        switch(why) {
        case NewSpace:
                watermark += RELEASE_RESERVED * fs->max_segment;
                /* FALL THROUGH */
        case ReleaseSpace:
                watermark += ACCOUNT_RESERVED * fs->max_segment;
                /* FALL THROUGH */
        case CleanSpace:
        case AccountSpace:
                /* Definitely no water mark here.
*/
                break;
        }

        Again, there is a progression of values, like the first case.

        So two options:
         1/ state-machine and we handle a state and move on
         2/ values have a specific/general relationship.
        and probably other less general things.
        FIXME

        Maybe "switch" chooses just one branch, while "select" chooses
        all relevant branches?  Or something.


Break / continue / return
        Do these have labels?  I almost never need labels.  But I
        often use "continue" inside a 'switch' to break out and
        continue the surrounding loop.  A 'break' from nested 'for'
        loops can be good.

        But labels are ugly - how are they indented?  Or are they part
        of the 'loop' syntax?
        Sometimes I want to break out of an 'if' so I make it a
        'while', which looks odd.
        But breaking out of an 'if' would mean that to conditionally
        break a while would need a label.  But break from 'if' is
        also conditional, so maybe not needed.... gets ugly.

        The label could be implicit.  'break if' ???

        This is really a controlled 'goto'.  We can go to:
          - end of block.  'continue' for while, 'break' for switch
          - out of block.  'break' for while, 'break' for switch

        What if we tagged blocks as being 'breakable'?  Too
        surprising.

        What about "break variable" which leaves the scope of the
        variable?
        This would make it hard to break out of a while loop.
        Or "next i" would continue a for loop
        Or any variable bound once in the loop.
        then "last i" could break out.  or 'done' or 'have' or return!
          let v =:
             if asa
                return
             if sdfsfd
                have v2

Anonymous/blank variable.
        Go uses '_' as a 'blank' identifier.  It can be assigned or bound
        to anything, which evaluates the thing but ignores it.  This makes
        some sense.
        For 'switch' it might be nice to have a similar thing which holds
        the value which is being switched on.
        If the name is used, the value must be 'true', else it must match
        the value.  This is a rather odd semantic.  And what is the scope
        of the name?
Including blocks within switch would be useful, but
        then nested switches redefine a name, which is less good.

        So maybe it should be
                switch _1 = value:
                case _1 == 3: whatever
                case _1 == 4: otherthing


Structured blocks:
        { curly brackets are common, but a bit ugly }
        BEGIN begin/end is rather verbose END
        COMMENT ending with reverse-spelled open is cute but silly TNEMMOC
        proposal:
                indenting stuff after a colon is really cool
                and works rather well I think.
        It doesn't easily allow do { } while(); constructs, though
                do:
                        stuff
                until done
        would work.

        There is no harm in allowing a '{' instead of the ':', and then
        ending when a '}' is found.


        Indenting.... how far can I push this?
        - two consecutive non-empty lines must have compatible indents,
          meaning the strings of leading space/tabs are either identical,
          or one is a prefix of the other.  Anything else is an error.

          If the second is longer, then the second line continues the first.
          If they are the same, the first line is terminated if that makes
          sense.
          If the second is shorter it must match an earlier indent, and
          everything parsed back to there is terminated.
          This is all much like Python.

        - Can we require that smaller units must be indented correctly?
          i.e. if a nonterminal starts half way into a line and wraps,
          then the next line must indent the same depth:
                if a == b and a ==
                            d:
          for a fairly ugly example. i.e. the cfactor is still open, so
          the next line must be indented to the start of the cfactor.
          This isn't needed for syntax but avoids confusing-looking code.
          Need to be careful to allow an opening bracket at end of line.
          i.e. if the only part of the nonterminal is a literal, then that
          non-terminal doesn't count.

        A "pragma" or "safety feature" would select among:
          - ignore indent
          - indent for termination.
          - check all indents

        Go uses newlines without worrying about indents.  I don't think I
        like this.
Indents are important sometimes and newlines aren't
        always.

        So yes: options, which specify tab size for 'check all indents'.

Colons:
        I seem to be using the humble ':' too much:
        1/ The conditional expression a?b:c
        2/ The indented block opener
                if a == b:
                        do_something
        3/ path separator for modules
                io:printf()
        4/ To introduce a type:
                var foo : type = value;


        so if I wrote:
                if a == b ? io:printf() : print='yes':
        Then I could get confused.
        The first ':' can't be the block opener because the ?: isn't
        finished.
        But the second ':' could introduce the global module :print
        So ?: doesn't conflict with if:, but we can't use ':' for modules.
        So either use '::' like Rust and C++, or '.' like other namespaces.
        :: is very noisy.  Maybe that is good?
        . is very quiet and makes the lhs look like a value, not a module.
        Does that matter?  Different things should look different.

Large scale structure:
        Key elements of a program are:
           declarations
                these define types, methods, functions,
                and give names to modules .. and other things
           statements
                These always live in functions/methods I think.
           expressions
                These mostly live in statements or some declarations.

        So a program should be all declarations.  But that isn't much fun
        for "scripts".  Or would scripts look better if they were forced to
        keep code together with a low-syntax 'main' function?

        func main():
                code goes here

        declarations start with the thing being declared:
                func type mod const var impl
        each is followed by a name and a value - maybe.
        Not sure where imports fit in:
                import name-to-import 'path-to-module'
        ??


Assignments/bindings.
        There are 3 things one might want to say:
         1/ This new name is bound to this value for the rest of
            this scope
         2/ This new name is bound to this value now, but it
            might get changed later in this scope
         3/ This name is already present in this scope and I'm
            now changing the value
        We could use 'var' 'let' or 'set' or 'const' or 'let mut' or 'with'
        to introduce one or another.
        We could use "=" vs ":=" for different assignments.
        We could not care about the difference between 2 and 3.
        We could use a sigil to differentiate 'new' from 'old'.
        In a tuple "a,b,c = 1,2,3" we could conceivably have all 3.
        We could add :type to a name to introduce a new name.

          a = b   // only allowed if 'a' already defined and mutable
          a: = b  // defines new 'a'

        DON'T KNOW about constants....

        We could also distinguish "this is a default value" and "this
        is an initial value" where the former can be replaced without
        use.  Probably not useful though.

        Least typing should be "new constant name".  Next should be
        "reused mutable name".  Most typing for "new mutable name".

        += -= etc mean pre-existing mutable.  So := should too?

        I really don't like the
                a,b,c = function()
        thing.  Having different bindings has got to be ugly, whether
        we mark them so they are visually ugly, or assume them so
        they are conceptually ugly.
                a:=, b+=, c= function()
                a:, b+, c = function()

        Does
                a[xx] = b
        need to know if a[xx] has been assigned yet?
        If we don't want to pre-initialise arrays, then probably yes.
        For 'hash' arrays, there might be requirements about checking
        first.

        I like: ":=" means "replace current value".
        So '=' means 'assign initial value'.  But how do we indicate
        'const' or 'var'?
        Could require 'var' in front?

          var (a, b, c) = function()
          (var a), b+, c: = function()
          a:, b+, var c = function()


        New thought:  The matching idea from Rust could make
        interesting assignment.
           if 4*5 = fred
        could always be made true by binding 'fred' to 20.
           if action(state, NEWLINE) = REDUCE(production)
        will fail if the action is SHIFT, or bind production
        if the action is a REDUCE.
           if (a,b,c) = (1,2,3)
        if a and b are already bound but c is not, this tests
        a and b and if they match 1 and 2, binds c to 3 and continues.
        Probably need a sigil on the 'c' or 'production'.  '?'
           if 4 * 5 = fred?
           if action(state, NEWLINE) = REDUCE(production?)
           switch (action) {
               REDUCE(production?) :
               SHIFT:
               ACCEPT:

        This 'binding' is very different to assigning to an array
        element or a struct field.


        There could be quite a lot of different assignment-like
        things.

        - 'global' constants which have a fixed value, at least for
          the invocation of a program
        - 'fields', both in structures and stack frames, which can be
          updated and have their address taken, so are mutable
        - 'constants' which are assigned just once in their scope -
          though possibly in multiple code locations due to branches
             if foo() { a = 1} else {a = 2}
          'a' could be a constant
        - 'reused constants' - one name which is used repeatedly but
          each time it is really a different variable.  There are
          clear points before each 'assignment' where it can be
          declared 'invalid'.  e.g. a 'status' return value that is
          only used for immediate operations.
          These are never assigned while 'live'.  So
             a = 1; if b {a = 2}; print a
          is not an option as the value from 'a=1' could be printed
          and so is live during the if statement.
          These can even be 'updated'.
             a = 1; a = 2*a; print a;
          Semantics are identical to
             a = 1; A = 2*a; print A;
          so it is a reused-constant.
          A for-loop control variable should be a reused constant.
        - 'variable'.  This is a value that can be changed in
          different ways on different paths.  It is conceptually one
          value that changes.
Possibly an accumulator

        We can assign to:
         - a name (local variable)
         - a value.name (field)
         - a value[expression] (array slot)
        Can we also assign to a function call?
         - foo(a,b,c) = 42
        That would mean that 'foo' returns an lvalue.  That is not
        unreasonable.

        But I'm still not happy with how lvalues work.
            a = lvalue
        Does that make 'a' point to the thing or does it copy the
        thing?
        I think in that form it must make 'a' the lvalue.
            *a = lvalue
        would copy the content.
        If 'a' and 'b' are pointers to pointers to integers, then
            a = b
            *a = *b
            **a = **b
        would all do different things in C.  Is there any other way to
        achieve those same things with less syntax?
        Type enforcement can make the '*' on the right redundant.

        How do I move a reference, as opposed to borrow a reference?
        Or copy an object?  In Rust it depends on the type.  If there
        are mutable fields or owning pointers, it cannot be copied so
        it is moved.  '&' is used to borrow a reference.
        I'm not sold on the type-dependent answer.  I think the syntax
        should make it obvious.

        I think I want syntax for "take the value and replace with
        NULL".
            a <= foo
        then we could swap with
            a <=> foo
        But that is a comparison!  Could use "<-" but it doesn't look
        like '='.   a =< b ??   Then I could pass " for ''r {
                typify(n:''r -> regexp)
        }

        The interface 'n' is a parameterised interface
           interface new {
                typify(n:self -> r:result)
           }

        These need to be 'constant' or 'referentially transparent'
        or 'pure' functions that can be evaluated at compile time.

Loops:
        while loops are easy-ish.  What about list loops and more
        generic things.  'foreach' etc.
        Co-routines can have a function which returns successive
        values, via 'yield' or lazy evaluation.
        Or a closure can be passed to a looping function.
        Or an 'iteration object' can be returned on which we call 'next'
        repeatedly.
Loops: (repeating myself?)
        So many possibilities
            for  loop  while  repeat/until  foreach etc etc
        Really like exit in the middle
            do:
                preparation
            while condition:
                actual work


        For loops like to initialise and increment:
            with var = val
            then var++
            while var < last:
                print var

        foreach loops need a variable(set) and a list of values, which
        might be generated lazily.
        As an alternative to lazy generation, could pass the body as a closure
        to an 'each' function like Rust does.
        Lazy generation could involve a closure being returned from a
        function.  The closure would need some state, so the 'return'
        couldn't discard it completely.  These could be 'inout'
        parameters? or

        How to address the need for "search and perform special case if
        not found"?
            for each possibility
                if match:
                   break
            then
                Do this if loop completed normally
            else
                Do this if loop aborted.

            if:
                for each possibility:
                    if match:
                        use true
            else:
                handle not-found case.

        http://www.scribd.com/doc/38873257/Knuth-1974-Structured-Programming-With-Go-to-Statements
        talks about 'events'

            loop until event1, event2, ...
                stuff
                ... event1;
                ... event
            then event1 => ....
                 event2 => ....

        This is a lot like exceptions.  However 'until' has a different feel to it than 'try'.
        'use' is better than 'raise'.

        I like this as a generalisation of switch:

        switch:
           .... use foo;
           ... use bar;
        then:
           foo -> ....
           bar -> ....

        All 'use' values must be either unique identifiers or
        expressions with values that have a common type.
        If unknown identifiers, then they comprise values in a
        new anonymous enum.
        The labels in the 'then' part are values of that type.
        A value *can* appear twice with some annotation(?) in which
        case all relevant branches are taken.
        Maybe the notation is 'next' at the end of the statement
        block?
        If there is no 'next', then the labels cannot appear again.
        Actually we allow 'next' and 'last' if and only if at least
        one label occurs again.
If there is any next/last, then there
        must be one at the end of the block.

if statements:

        Go has a nice feature where you can perform an action before an 'if'
        test.
           if a = something(); a > 3 { use a }

        This is particularly useful in elseif constructs
           a = something()
           if a > 3 { use a }
           else if b = somethingelse(); b < 5 { use b }
           else if ....

        without the if action, we need an extra level of indenting.
           if a > 3 { use a}
           else {
                b = somethingelse();
                if b < 5 { }
                else ...

        But only a simple statement is allowed.
        Could I use "with"

           with: a = something()
           if a > 3 :
              use a
           else with b = somethingelse():
              if b < 5:
                  use b

        Looks a bit clumsy and probably ambiguous.  Maybe:

           if with:
               a = something
               a < 5

        no, that is much worse.

        We can make "with stuff" unambiguous by requiring a second body
           with:
                commands that can start new scope
           do/if/while/match


        Another idea.  We don't allow assignments in expressions, but
        we allow a "where" clause at the end which contains a list of
        bindings.  The expressions are only evaluated where needed:
           if a < 5 && b > 6 where b := n/2:

        will not divide n by 2 if a >= 5.
        This means that the list can only be bindings, not general statements.
        If they are new bindings, how are they scoped?  At least to the
        bodies of the if or while and the rest of the expression.
        If a binding were optional, it couldn't carry through:
           if a < 5 || b > 6 where b := n/2:
        if a<5, b would not be bound, but it would be bound in the 'else'.
        This lazy evaluation could work with a prefixed "with" clause too.
        Would assignments be OK?  I think they could be confusing.
        Lazily evaluated binding could allow two different paths which want
        to evaluate at different times to just use one piece of source code.


        Another idea:
        As well as
           if expression:
                statements
        We could have
           if:
                function block with 'return'
           then:
                statements
           else...
        Maybe 'return' looks a bit heavy and a synonym might be better.
        yield?  use?  done?  I like 'use' for now.
        The 'if' is one big scope, so bindings created in the function
        block remain in the 'then' statements.
        This allows:
           if:
              a = 4
              use a > 3
           then:
              ....
           else if:
              b = a*a
              use b < 0
              ....
        How would that look with {}?

        if {
           a = 4
           use a > 3
        } then {
           ....
        } else if {
           b = a*a
           use b < 0
        }

        The same syntax could allow conditional expressions:
           a = if a = 4:
                  use 4
               else:
                  use 3
        though that is rather verbose and part of the appeal of ?: is brevity.

        -----
        If we do want statements in the condition of 'if' and 'while', optionally:

        if cond BLOCK
        or
        if:
            BLOCK
        then:
            BLOCK

        An expression block may contain "use" directives: "use True".
        They act a bit like a "return" - we have the answer.
        For while, it means that we can avoid "do {} while" as a separate case
           while:
               BLOCK
           do:
               skip

        We may also be able to avoid 'break' in some cases with "use False"
        at arbitrary places.  'continue' isn't quite so easy, though a while-switch
        could do it:-)

        In a 'switch' we can "use" values other than True or False.
        If a name is used which isn't defined, then it is assumed to be a new
        enum, so all 'use' statements must use undefined names, and all
        cases must use the same set of names.

        switch:
            use do_it
        case do_it: print "Hello"
        case skip_it: abort()

        case names can appear multiple times with mcase(?)

        switch bit:
        case 8:    if byte & 128 {num += 1}
        case 7, 8: if byte & 64 {num += 1}
        case 6..8: if byte & 32 {num += 1}
        case 5..8: if byte & 16 {num += 1}
        case 4..8: if byte & 8 {num += 1}
        case 3..8: if byte & 4 {num += 1}
        case 2..8: if byte & 2 {num += 1}
        case 1..8: if byte & 1 {num += 1}


        break? or last?
        "last i" exits the enclosing loop that modifies 'i'
        "next i" retries

        'for' loops can be:

           for assignment:
           while condition:
           then statement:
           do:
                stuff
           else:

        or
           for assignment ; condition ; statement:

        or
           foreach varlist = object

        So a general while statement is
           for initialisation  (can introduce variables)
           while condition
           then block
           do block
           else block

        First do 'for', then 'condition'.
        if 'condition' is true, do 'do' and 'then' and try cond again
        if 'condition' ever false, do 'else' and stop.

        Maybe:
          'if' and 'switch' are much the same
          'then' and 'case true' ditto
          'else' and 'case false' as well
          'while' can be followed by cases as well as by 'else'.

        With a 'for' I want a clause to run if we didn't 'break'.
        I think 'else' does that.

        Do I want to use "else" for "default" in case?
        "then" happens if "use true" or no "use" value
        "case value" happens if "use value"
        "else" happens if "use false" or "use value" with no match.


Print formatting.
        Only values with Interface 'printable' can be printed.
        The function is passed a width, a precision, and stuff.
        So a string can just have '%v' for a value, or '%20l' for a
        left-justified value.
        But what about '*' parameters - and checking there are the right
        number?
        Could allow a function to have a compile-time preliminary section
        which can throw errors and possibly infer types.
        This probably requires there to be compile-time functions which
        run fully at compile time.  Though any functions that can, should,
        unless tagged run-time-only.
        Functions must be able to flag compile-time errors.
        At runtime this is ignored, so a runtime error or coping strategy
        is also needed.

        So: the function runs until a non-constant is needed, or heap change etc.
        If no compile error is triggered, it all runs again at run time.
        But for libraries to work, the compile-time part must be easily
        available in the library without the need to rebuild the object code.

        This doesn't really answer the problem..
            print "%20l", foo
        or
            print "%v", foo.fmt(20,0,'l')

        It would be nice to have curried functions here so that the 'fmt'
        function could take a stream from which it could get a buffer to fill.
        Or foo.fmt could return a function which acts on a stream ... that might be
        complicated.

Exceptions and errors.
        Errors need to be easy to catch, but even that isn't really enough.
        They need to be hard to not-catch.
        So any possible error must be caught or explicitly passed over.
        That suggests that the type of a method should include the list of
        exceptions that can be returned.

        "try" is really pointless.  An 'except' clause can simply be
        available to every statement.  Maybe spell it 'catch'.

        {
            fd = open();
            fd.write(thing);
        catch file_not_found:
            whatever
        }

        -or-

        {
            fd = open();
            fd.write(thing);
        } catch file_not_found {
            whatever
        }

        A 'catch' can occur anywhere.  It will catch any unhandled errors
        that happen before the location.  So:
            fd = open(file)
            catch file_not_found:
                return "fool"
            fd.write(thing)
            return "wise"

        One thing the 'try' does provide is a synch point.  In 'catch' we
        know that everything up to 'try' succeeded, so names bound there are
        certain to be bound.
        Without 'try' we only know that the current scope was entered, and
        that any code since the last equivalent catch which could not have
        thrown the error must have been executed.
        This is probably enough providing that 'equivalent catch' is well
        defined.
        At any point in the code there is a set of possible exceptions.
        A 'catch' subtracts from the set and any op that can raise adds to
        it.
        Within a 'catch' we know that anything that happened before these
        exceptions last became possible has happened.

        This requires that exceptions can live in a set.
Maybe there is a
        big discriminated union of exceptions?  Or a hierarchy of them?
        I think an inheritance tree .. though that is hard to merge. i.e.
        if two modules each declare some new exceptions ... not a big deal.
        We know the type of the exception set for each imported module.
        But wait:  If I import two modules and want to pass up exceptions
        inherited from either, what is my exception type?  I either need to
        offset enums when inheriting them, which is messy, or have
        globally unique enum values, which are like type names.


        So when I say:
            catch exception_name(parameters):
        that only applies to previous operations that could return an
        exception with that value


        Do I want an 'else' clause for 'catch', or should the 'catch' clause
        break/continue/return to get past the real work-to-do?

        Can destructors throw exceptions?
        How do exceptions work in constructors?  There needs to be a point
        where values transition from 'local' to 'in object' much like
        the 'return' in a function.


        Should break/continue/return be defined as exceptions?  The
        (implicit) label they go to is the expression 'type'




Scope:

        'scope' applies to various things:
        - how to have narrow scope for bindings in a block.  Just
          having declarations at the top which go out-of-scope at the
          bottom is too restrictive
        - which names in a module are visible outside that module.
          I think they must all be visible within the module, where that
          is meaningful
        - how to import names from modules.  "import xx from yy" is nice,
          "import * from yy" is good but risky if 'yy' changes, so maybe
          "import v1* from yy" which means "all names listed in v1".  Or
          "import * from yy.v1".
        - what is the scope for 'finally' or 'later', 'defer', 'after' ?


        let a = b in c
        or
        c where a = b

        or
        with: a = b
        do:
            c

        These introduce bindings but they are either for a single statement
        or for a block.
We can limit indents a bit by combining
        an if-list or loop with the 'with'.  I wonder if that is enough.

        Having completely free scoping would require
            let a := b
        anywhere, and "unlet a" anywhere too.  I think that is probably too
        free.


        For extra-module visibility there are options:
        - 'public' tag for all public names
        - 'private' tag for all non-public names
        - Some convention with capitals or underscores to direct which
          is which, but with UTF I don't trust capitals, and I hate
          imposing underscores except for "don't use this" cases.
        - 'export' lists (or 'hide' lists)
          'export' lists would be good for wild-card imports.
        - external/internal section of the module
          Would need a section in each 'struct' too.

        I think top-level symbols that might be imported should be
        explicitly exported, and exported to a list if a wild-card import
        is allowed.
        Names in a struct which should be hidden can be marked 'private'
        as hiding them is less critical.

        All symbols should be available even without import, using e.g.
            :module:symbol
        This allows instant access in simple scripts without pre-planning.
        If 'module' is imported, then just "module:symbol" is sufficient.
        Possibly ":module" searches current directory, then system path,
        while "::module" only searches system path.




        I think 'after' scope should be tied to a name so when the name
        goes out of scope, the action triggers.
        So it could be:
            after name:
                actions
        though most such actions are to free memory - which must be
        automagic, or to release resources (locks, files, etc) which
        should be the destructor of some value.
        i.e. 'open' returns an 'owned' value so it is auto-closed when
        forgotten.

        'name' could be a loop label.  That might be interesting.

        But should 'after' fire when the value changes, or when the name
        disappears?  I think it should be when the value changes, where
        that is meaningful.
        Maybe it only makes sense on 'owned' values?

+
+
+
+ Tuples:
+ Tuples are present as the parameters to a "function". But where
+ else?
+ Tuples are usually (a,b,c) but can be useful as a:b:c. This is
+ briefer. As long as 'tuple' isn't a first class type,
+ "a" can be a 1-element tuple or an expression, depending on
+ context.
+ This is useful for a '%' function for string conversion.
+ The args to '%' are tuples where the first value is sent 'stringify'
+ with the remaining bits. So 3:'x' will print in hex ... a bit ugly
+ though.
+ Maybe not needed then.
+
+ mapping between tuples and structures is probably useful.
+ Assigning tuple to tuple with matching
+ return a tuple from a function .. well, return a struct but allow
+ matching to a tuple of names.
+ Tuple could equally be an array if length matches.
+
+ virgil maps all tuples to lists of values, so nesting disappears.
+ This is pleasingly simple. It means that you cannot have tuples
+ as individual arguments or return values but if you cannot query
+ the size (which would make them arrays) that doesn't matter.
+
+ A tuple can be assigned (bound) to an array if the size is
+ flexible, but binding contents of an array to a tuple is only
+ possible if the array size is known at compile time.
+ So the type of printf could (almost) be
+ printf(string, printable[])
+ and it could still be called as
+ printf("I have %d and %d", foo, bar);
+
+
+ What about
+ if a in (1,2,4,8):
+ as a use of tuples. It expands to a disjunction of separate tests.
+
+
+ Expressions with side effects.
+ "functions" will often have side effects, but what about
+ a++
+ a += 5
+ etc
+
+ They are problematic because they interact badly with
+ order-of-operation rules.
+ They are useful because they allow brevity.
+ if x && (a += 5) < 7 ...
+ rather than
+ if x:
+ a += 5
+ if a < 7 ...
+ order doesn't apply immediately after && because && is a
+ serialisation point.

+ if with a=5: a < 6 and then with b=6:
+
+
+ Type specifiers:
+ Do we want:
+ type name, name
+ or
+ name, name type
+ possibly with punctuation?
+ Do we want annotations on the name:
+ char foo[]
+ or on the type
+ [char] foo
+ i.e. can we embed an identifier inside the type
+ struct foo *(*ID)();
+
+ I think the C experiment shows that to be a failure. You still need
+ the types without the identifiers, so it is best to always have them
+ separate.
+ i.e. a type should always be identified by a set phrase.
+ type name = phrase
+ var name: phrase = value
+
+
+
+ Dependent types:
+ http://beust.com/weblog/2011/08/18/a-gentle-introduction-to-dependent-types/
+ suggests they only work for immutable values and I don't doubt
+ they are easier there. With mutable values we would need to allow
+ the type of a name to change ... or be dependent on some variable.
+ So if "array of N integers" is the type of A, then
+ A[N++] = foo
+ doesn't change the type. i.e. if the parameter is variable, the
+ value can be mutable.
+ So this boils down to describing invariants which the compiler will
+ validate for us. i.e. a proof assistant.
+
+ An object can be in a number of different stages. Some of these
+ happen during initialisation, but others can be just part of the
+ overall life-cycle of the object. During different stages,
+ different properties can persist. For example an object may be
+ immutable for a certain well defined period of 'time'. If the type
+ system can know of this, then any references that only exist during
+ that time can be known to be immutable, while references that last
+ longer are not.
+
+ There is scope for more than 'immutable' and 'mutable'.
+ e.g. an object might be in a 'growth' state, where new components
+ are added, but none are removed.
+ The set of objects in a container may be unchanging, while the objects
+ themselves may change.

+ At some stage, the set of contained objects might transition from
+ one type to another type - or from one stage to another stage.
+ For the language to be able to affirm this it would need to know
+ that some operation visits all member elements of some type/stage.
+ This might need to be (unsafely) affirmed by the programmer rather
+ than mathematically proven by the compiler.
+
+
+ So there are two ideas here. One is more explicit statements
+ about lifelines. An object can identify several 'stages'
+ which can be transitioned by various operations.
+ Each component can have different type properties in different
+ stages.
+ The type system can restrict certain methods to certain stages.
+ The compiler must *know* the stage of a pointer. It is static
+ info stored in the lifeline analysis. When an object is known to
+ 'own' certain borrowed pointers, the lifetime of those borrowed
+ pointers is affected by changes in the object's stage.
+
+ stages can affect:
+ - mutability: shallow or deep. Addition or subtraction or both or neither.
+ - values of scalars
+ - NULL, nullable, or non-null for pointers
+ - size of arrays as related to other values - variable or
+ constant.
+ - 'once' or counted or collected for some pointers.
+
+ Some stages are exported, some are internal only.
+
+ It might be helpful to have different names for different aspects
+ of type variability.
+ We have
+ Parameterised types: The type depends on some type parameter
+ Dependent types: the type depends on some nearby value
+ Temporal types: the type depends on a state which changes over
+ time. Such a state might be represented in some variable,
+ but it shouldn't need to be
+ Linear types: The number of strong references is dependent on
+ some variable or age thing. So this is an aspect of a type
+ that can vary.
+
+ These need to be specified and then enforced.
+ Parameterised types make sense wherever a type has multiple
+ subtypes.
+ This includes a struct with multiple elements and a function with + arguments and returns. + struct : + next: f + others: array of f + func max(a,b:f):f + + Type checking this should be fairly easy. Implementing would be + easy if f was always a pointer... + + Dependent types work in a structure or a function call. + We probably want arbitrary computations on values(?). + If changing the value means changing the type we need types of + different precision e.g. mutable and immutable. + Optional and non-optional. + Or we need multi-assignment to change the value and the typed name + at the same time. + Or we can allow that the type is flexible between uses. + So the type assertions are only verified when the type is used. + So I can change inter-dependent fields in any order and it will + work. + + Temporal types might be tricky to describe. We need to identify a + clock. It might be an internal field or it might be an external + value. + By "external" we must mean with a super-ordinate object. + e.g. a 'container' might contain lots of 'members' and the type + of each member might be dependent on a field in the container. + The clock could even be the program counter. + + + Boot Strapping + An interpreter written in C, with a compiler written in target, + that links to llvm, seems like a good plan. + + + + Modules / packages / crates. + This is the primary mechanism for namespace control and for + libraries. + Within a package, all names defined in that package are visible + providing the right path is used. Outside the package, only + exported names are visible. Some names are exported by default + but can be marked "private". Others are private by default and but + can be marked "public". + + Packages can define: + - types: struct, enum, array, function, interface etc + - constants + - functions + - methods + - other packages + + Does it make sense for a package to export a method for a type that + the package imported. 
As the name of the method is in the
+ namespace of the type, it probably isn't polite to add something to
+ someone else's namespace.
+ But a new name for a type could be defined, added to, and exported.
+
+ How are name conflicts handled?
+ Certainly you cannot define the same name twice in the same scope.
+ However:
+ you might import a name with a wildcard, then define it.
+ You might embed a struct/enum in a local structure enum and
+ add a name which already exists.
+
+ This is interesting because the library that you import might get
+ a new revision. It would be a pain for old code to break when
+ that happens.
+
+ So importing a library should specify a version. Only names from
+ that version are imported.
+
+ You don't need to explicitly import if you don't want namespace
+ changes.
+ Just use ":modulename:symbol" to get any symbol from any module.
+ If you want to avoid using ":modulename:" you can
+ import name, name from modulename
+ or
+ import version_number from modulename
+
+ Exported symbols need to be version-tagged. That could get ugly.
+
+ private-by-default could be "public.2" rather than "public" which
+ isn't too bad, but what about public by default?
+ I guess I need to decide which symbols are which!
+ There are "items": type/constant/function/package
+ And there are fields, which include methods.
+
+ Go: exports only things that start with an upper case letter
+ Rust: exports all fields/methods but nothing else.
+
+ We could have an "export" clause for items:
+ export version item, item, item, item
+ This makes "static" the default which is safest, and makes the
+ "interface" easy to see.
+
+ For fields, we could have an "export version" block, or
+ "export version" exports all fields/methods listed *before* the
+ pseudo-statement.
+ So when you add more to the end, they aren't exported.
+ Though it is probably more consistent to have
+ export version field,field,field
+ inside the struct
+
+
+ Should modules have "variables", or any state?

+ If so, they can usefully have an 'init' routine.
+
+
+ Module naming:
+ modules live in files with a related name. If the file name is a valid
+ identifier, we can just use that name in the code.
+ If it isn't, or if we want to change it for some other reason, we
+ would need to 'import identifier "filename"'
+ Also, the file might be in a directory hierarchy, where '/' might
+ not be the right separator.
+ So we need to import a path, possibly into a module, and provide
+ a local name for that, if the default doesn't suit.
+ import "file" = name
+ import "file" { import "file" {import identifier=new, id, id}}
+
+
+ Overloading
+ Overloading seems to be largely disliked. Hard to be sure why.
+
+ Overloading + - * / % is already done for numbers.
+ Overloading [] could be very useful for hashes
+ Overloading () && || ?: would be very bad because they have
+ strong meanings. Ditto for =
+ Overloading -> . prefix-* prefix-& could be weird
+ Overloading & | ^ << >> &^ ~ might be OK
+ Overloading == would be bad, it has to apply to all types anyway?
+ Overloading < <= > >= ???
+
+ So:
+ some operators need boolean. Possibly a 'testable' interface
+ could allow anything to be tested. ?: && || !
+ some operators compare. So a 'comparable' interface?
+ < <= > >= == !=
+ some operators are limited and can freely be overloaded:
+ + - * / % & | ^ << >> &^ prefix-~ prefix-`-'
+ some operators are special:
+ []
+
+ [[ &^ is not really different from "& ^" except that "&^=" is
+ legal ]]
+ [[ what about bit operations without having to explicitly shift?
+ |< &^< ?? |+ &-
+
+ Maybe it would be cleaner to have a simple op to convert a
+ number to a bit. #1 Then we clear a bit with "a &^#= 5" ??
+
+ Or a 'bitset' type could define '+' and '-' to add/remove bits.
+ ]]
+
+ Overloading of a binary operator is a method-call on the first
+ argument passing the second.
+
+ So for binary operations like + - * / , left and right side must
+ be identical, like with Go.
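The rule that a binary operator is a method-call on the first argument, passing the second, can be mocked up in Python. The method names here (plus, test, compare) are only placeholders for whatever the real 'testable'/'comparable' interfaces would be:

```python
# Sketch of operator desugaring onto methods of the left operand:
#   a + b   ->  a.plus(b)
#   a < b   ->  a.compare(b) == LESS_THAN
#   &&/||/?:  use a 'testable' interface via a.test()

LESS_THAN, EQUAL, GREATER_THAN = -1, 0, 1

class Num:
    def __init__(self, v):
        self.v = v

    def plus(self, other):              # what "a + b" compiles to
        return Num(self.v + other.v)

    def test(self):                     # 'testable': used by && || ?:
        return self.v != 0

    def compare(self, other):           # 'comparable': returns -1, 0 or 1
        return (self.v > other.v) - (self.v < other.v)

def lt(a, b):                           # what "a < b" compiles to
    return a.compare(b) == LESS_THAN

a, b = Num(2), Num(3)
c = a.plus(b)                           # "a + b"
```

Note that plus has type (X,X)->X while compare comes from an interface — which is exactly the tension discussed below about 'plus' not belonging to any particular interface.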

+ But what about '<<' and '>>'. Go allows the RHS to be any unsigned
+ int type. This isn't nicely consistent. To be consistent we need
+ to require either a type or an interface. The type would need to
+ be unsigned int. The interface would need to be
+ to_unsigned_int_clipping which returns 0 or MAXINT if it cannot
+ give a faithful unsigned int. This is much the same type as the
+ index to an array.
+
+ So having an 'int' and 'uint' interface might be defensible
+
+ So each operator is defined in terms of a method call.
+ e.g. a + b => a.plus(b)
+ a && b => if a.test() then b else a fi
+ a < b => a.compare(b) == less_than
+ a[b] => *a.lookup(b, create:Bool)
+
+ All of the first set of operators can also be combined with '=',
+ e.g. +=
+ ++ -- : are they really needed? +=1 -=1 are just as good?
+ Though they could call a.incr() and a.decr()
+
+ Do we want a += b to be different from a = a.plus(b) ???
+ For a list it would be good to know that we don't need to duplicate.
+
+ Wait... for 'plus' to be a valid method it would need to define the
+ type of the second arg, and the return type. Don't like that.
+ So treat it not as a known method, but a simple textual
+ substitution.
+ a + b is translated to a.plus(b) which isn't in any particular
+ interface.
+
+ On the other hand, test and compare are from an interface.
+ What about lookup? Do parameterized methods help here?
+
+ Type of plus is (X,X)->X
+ Type of shiftleft is (X,uint)->X
+ Type of compare is (X,X)-> (-1,0,1)
+
+List of (possible) operators.
+ + - * / % & | ^
+ < <= > >= != ==
+ && || !
+ << >>
+ "in" "not in"
+ "is" "!is" "is<=" pointer comparison.
+ []
+ [x:y] ??
+
+ >>> signed right shift?
+
+ String concatenation? '+'?? '++' ??
+
+
+ "in" dispatches on the right-hand-op.
+ "string" does substring search
+ "int" does bit test.
+
+ if 'in' can do bit-test, how do I do bit set/clear?
+ |- |+ &- &+ Not good as +- can be prefix
+ -| +| ?? Hard to type...

+ What is good for set-addition and set-subtraction?
+ foo[] = bar
+ is niceish for adding an element to foo, at the end I guess.
+ But it doesn't generalise to subtraction.
+ :+ and :- ??
+
+ Could use ++ and -- except that I want ++ for concat.
+ What is -- ??
+
+ 'else' - like || or ?: "a else b" == a if !!a, or b if !a
+ 'then' ?? a then b is either b or NULL.
+ 'and then' == &&
+ 'or else' == ||
+ 'and' == &
+ 'or' == |
+ 'not' == !
+
+ 'as' for cast: "value as type".
+ I think I prefer "type(value)", though the 'as' allows
+ separate name spaces for types and functions
+
+ 'div' for integer division? Probably not.
+
+ 'if'. "a if b" is either 'a' or False.
+ This isn't always type safe, but
+ "a if b else c"
+ becomes 'a' if 'b' is true, or 'c' if not.
+ so "a else b" is short hand for "a if a else b"
+
+ Prefix operators:
+ - negate
+ + abs value?
+ / reciprocal?
+ ~ invert
+ ! not
+
+ & takes the address of a field?
+ [n:] takes address of an array member
+ Can we take the address of a local variable? Or do we define
+ the local variable as a pointer and allocate from the stack?
+ * dereference a pointer. This is normally automatic, but not
+ for assignment
+
+ < extract the value and set variable to nil.
+
+tasks/IPC
+
+ Co-operatively scheduled threads across N processes, with
+ all OS calls non-blocking, seems like the way to go.
+ Questions are:
+ - how to slice?
+ time?
+ on blocking action?
+ explicit yield?
+ - when to clone?
+ always?
+ - how to create new threads?
+ library or builtin?
+ function or closure?
+ - how to control threads
+ handle for 'kill'?
+ - how to wait for threads?
+ maybe don't. When they die, they die. Any waiting
+ uses IPC.
+ - What to do for IPC?
+ - message passing
+ - shared memory with locks
+ - monitor?
+ - synchronous interfaces?
+ - how much control of shared memory is needed?
+
+ Message passing is clearly a good tool but I'm not sure it should
+ be the only one. Something that abstracts locking could be good.
+ E.g.
a pointer that needs to be 'activated' before it can be used.
+ i.e. a refcounted object which has "lock" which returns a 'once'
+ pointer to a subtype which provides more access. The original
+ allows read access to some things.
+
+ Or a type where some/all methods are required to own a particular
+ lock. Actually it should be that some fields are only accessible
+ when some lock is held. So the type is parameterised by a lock.
+ By default fields are only accessible on a single thread. If
+ a lock is provided, then they become accessible when that lock is
+ held.
+
+
+ I wonder if we need more pointer types, or use dependent types..
+ - 'once' references
+ - 'one thread'
+ - 'ref counted' - needs locking.. so:
+ - counted-onethread
+ - managed-onethread
+ - borrowed references might be onethread or need-locks
+
+
+ ParaSail goes even further and parallelises everything. Any two
+ expressions that aren't interdependent are potentially run in
+ parallel. This requires the compiler to know about interdependence.
+ ParaSail puts in lots of requirements to avoid aliasing etc.
+ You can never (I think) have two pointers to the one thing. This
+ feels overly restrictive to me, but could often happen in practice.
+ If the compiler could detect this - or at least make good use of
+ owned pointers..
+
+
+ (may2014)
+ I really like the idea that locking is done with 'once' pointers.
+ The locks are released just like memory is freed.
+
+ Weird idea: I wonder if a hash table of spinlocks would work, so the same
+ data structs can be used with or without locks.
+ The obvious problem is that aliasing can cause stacked locks, and I deadlock
+ waiting for a lock that I hold.
+ With little effort I could tell if it was me.
+ What would I gain though? hashing pointers does take some time...
+ It would remove the need to store spinlocks in any data structure.
+
+
+ The base hardware locking primitive seems to be
+ load-linked/store-conditional, or cmp-swap.
If I were to expose + that in the language I would really need to understand all the + stuff about memory models, which might be fun to learn I guess.... + + +Starting tasks + go BLOCK + starts BLOCK in another task. When it completes, thread is done. + Any non-shareable values mentioned there become unavailable. + Any refcounted values that are also used after BLOCK are increfed first. + + +Scheduling + if we have multiple threads then we need to have a scheduler. + We could use OS threads and use the OS scheduler, but that is likely + to impose an overhead that we don't want. + So even if we schedule only when a thread blocks, we still need some + priority info to decide who goes next, and some way to share + threads across a pool of OS threads. + Explicitly assigning numbers is very boring and error prone. + Putting things in order somehow would be nicer. + With event-loop code we can have an ordered list of what + to do next. This is effectly an explicitly coded scheduler. + To do something like that means putting all routines into some + data structure in the language. That would be good for introspection + and killing etc. + Some threads are blocked, some are ready. A thread can be placed in + a group for round robin. A group can be ....(don't know what I was + going to say). + + +Memory allocation: + How do I get a new 'box' to hold a value? + For 'stack' boxes, it just happens automagically as needed. + For various heaps we need some syntax, like + new(type) + but we need to know which heap (per-thread, common, whatever) + so: + new(type, heapname) + + But I'm not very keen on builtin function - they pollute the + namespace. + var name : type = @heapname + I don't think I just want a sigil like Rust as I suspect there + could be multiple sources of allocation, or styles of + allocation - enough that names might be useful. 
+ + Whatever it is, we need to indicate: + - a type - could come from the variable being assigned + - parameters to the type such as array size or union + discriminant (does that make sense?) + - region to allocate in + - type of pointer: unsafe, counted, owned, managed, + never-free + These could be associated with the region. + 'Managed' (garbage collected) and 'never-free' + really need to be in their own regions, but + counted, owned, unsafe can be together. + + However they may well be managed separately. + The compiler only knows about 'owned' and + 'borrowed' references. When an owned reference + is dropped the manager knows whether to dec a ref + or explicit free or do nothing. + Allocating from a 'never-free' manager will return a + borrowed reference from the manager. + + Given that the type might need to be parameterised it is best + to have it explicit and not deduced from the var. Let the var + deduce from the allocated value. + + So we want some syntax that includes allocator and type and + says "allocate it". new(type, heap) is obvious, but we cannot + pass types to any other functions. + heap(type) has same problem. + typename.alloc(heap) is too noisy + We could include the heap as a parameter? + typename + when there are no parameters + + But.... in many cases an allocation goes straight into a + struct field, so the type is known and repeating it is boring + and while it can be type checked and so not error prone, it + shouldn't be necessary. + But we still need parameters...... + + +Global variables. + Opinions vary. They are like gotos : dangerous but useful. + Can we constrain them? + + 1/ total-global is a bad idea. module-global at most. This + avoids namespace pollution. Still accessible from other + modules as "that_module.globalvar". + 2/ makes referential transparency hard: so include 'read/writes + mod-static' and 'read/writes foreign static' properties of + a function + 3/ local variables must never 'hide' global variables ... 
or + anything in a greater scope. + 4/ it should be easy to pass lots of function parameters. + @struct means "take all the names in this struct and if + any are names of function arguments, use them as such. + + Globals are bad because very few things are 'once only' Even + a random number generator context could be per-thread. + + If two functions don't touch globals, and operate on + unconnected data structures, or are read-only on shared one, + then they can run in parallel. Detecting this would be good. + Warning when it cannot be done would be nice. Programmer + would need to suggest it. + + A global that is set once in an early 'stage' of the program + and never changed again does not break referential + transparency for most of the life of the program. + + Analysing these lifetime issues might be difficult. Static + analysis of all known types might be part of the problem. + Loaded code (.so) could not be known about at compile time so + types would need to be explicit, but they must always be in a + stage *after* they are loaded. + + Key lessons here: + - function need 'access globals' in their type + - program stages might help isolate these. + So global gets type 'read-only is stage X', and function + gets 'called in stage X'. But what is 'X'. + + +Stuff: + If a name is never changed but not declared constant,is that an error? + if a value is set but never used, is that an error? + What about: + a = 1; + a = 2; + ?? + probably. + + +Closures: + Closures are a cool idea - a little inline function with + access to the current stack frame - a bit like a method for + the current frame rather than for a separate object. + It can be used + - to create arbitrary code structures like maps + and threads bodys and innovative loops + - to pass compare functions to sorts code etc + - I wonder what else.. + - by holding state they can work as generators + - this requires the closure to outlive the frame! + - closures are good for 'after' functions. 
+ + We need to know what happens to names in the closure. + Are the values copied, or are references taken, or what? + The answers are probably obvious in various contexts, but need + to be clearly enumerated. + + A possible trouble with closures is that they can play tricks + with the lifetime of a variable. e.g. if we create a closure + inside a loop, using a variable that was declared in the loop, + then does each closure get the *same* variable, or do they + each get separate variables -- effectively duplicating the + variable. + I think that neither is really appropriate as either could be + confusing. If however the name binding was a constant, then + copying the value into the closure may well be appropriate. + + In general I don't think that closing over a variable should + extend its life time. Go does this and allows the address of + a local variable to be passed out - the variable is + automatically made into a heap variable. I think it best to + require an explicit heap allocation if you want to return a + pointer. It is less likely to be confusing. + A pointer obtained with '&' should only ever be a 'borrowed' + pointer. There must be a strong reference which protects it. + + + Syntax for closures? + Go uses func (args) result { body } + Rust uses |arg list| { body } + Python uses lambda args : body + + For small closures, listing the args is boring. + Using "$1 $2" is simple and brief. + Still need a syntax to say "this is a closure". + { use $1 < $2 } + where expression is expected could do it. + Then + { a,b: use a < b } + could be a more general syntax. Not sure it can be + differentiated on an LR basis though... + That is a similar syntax to list comprehensions + [ a : for a in range(50) if a % 2 ] + No: list comprehensions are in [] - the array syntax. + But {} is like a structure... or do we use tuples for + structures? + + I think I want in-line closures to be expression only. + I think burying statements inside expressions is wrong. 

+ This argues for "x if b else c".
+ If you want a more complex closure, you need a local
+ function declaration.
+ This means that you cannot pass a large closure directly
+ to a function like Rust does. Given issues with 'break'
+ and 'continue', this is probably a good thing. But are
+ there times when it would be good? If there were it should
+ probably be handled by something like 'do' which has
+ a closure as a distinct syntactic object. Does it need
+ args? All local variables are available, but they aren't
+ available to the function which is passed the closure.
+ Passing local vars in would be clumsy - really do need
+ the lambda concept.
+
+Type analysis.
+ I've never done anything like type analysis, so my first
+ thought is that it sounds easy, but it probably is not, else more
+ would be done.
+
+ The first and probably most difficult task is to create a good
+ type model. This will allow us to annotate every node in the
+ code graph with appropriate type information, and then make
+ correct deductions.
+ The deductions we want to be able to make are:
+ - is this correct? or at least consistent
+ - what are the unspecified or underspecified types.
+
+ Scalars, strings, Arrays, structs, unions, functions, closures
+ are different arrangements of data.
+ Scalars, strings, functions, closures are immutable, as can be
+ the others. Immutable values can be copied instead of
+ referenced.
+
+ Values can be:
+ mutable or immutable - can I change them
+ volatile or non-volatile - others can change them
+ once, borrowed(wrt some once)
+ compile-time, or runtime.
+
+ Every variable has a definite type, as does any function
+ result. The args to a function determine the types of the
+ results, though those types may be a complex function of the
+ arg types. e.g. intersection, union.
+
+ Constants have a list of possible types. We choose the first
+ type for which there is no type error.

+ If there are multiple constants, they are tested in
+ depth-first, left to right order.
+
+Literate programming
+ There should be more of this. The language should make it easy by
+ allowing the order of code to match the order of presentation. This
+ means we should be able to add to any namespace at a later
+ date by reintroducing it with some sort of additive syntax:
+ type foo += { .... stuff }
+ Maybe we could allow previous stuff to be repeated, which could
+ be filled in by a tool and formatted differently by the
+ presentation.
+
+ In literate mode, everything is a comment except between
+ [[[ and ]]] each all alone on a line.
+
+ It would be nice to allow the description to be processed as
+ markdown. For this, the compiler needs to extract precisely
+ the code blocks (not the span-level code though).
+ I think that any line indented 4 or more spaces following a
+ blank line starts a code block, which ends when a non-empty line
+ has fewer than 4 leading blanks.
+ The first line in a code block cannot start with a list char: - * + 1 >
+
+ Github markdown also allows ``` to start a code block, and ```
+ to end it. ```plato would turn on 'plato' syntax
+ highlighting.
+ stackoverflow would use
+
+ What about:
+ reStructuredText
+ Asciidoc http://www.methods.co.nz/asciidoc/ - nah, ugly
+ pandoc http://johnmacfarlane.net/pandoc/README.html#pandocs-markdown
+
+ reStructuredText is similar to markdown. It uses "::" at the
+ end of a line to suggest following lines until undent are
+ code.
+
+ pandoc is like markdown. One thing it adds is
+ ~~~~~~ { .plato }
+ Here is some code
+ ~~~~~~~
+ as well as ``` like github
+
+ When writing a program in literate style, is there a need to
+ write code out of order? I suspect that there is, as we might
+ want to add a second task for a loop to do, or refine a test.
+ Actually - that is a tricky one. Can we *change* previous
+ code?
+
+ We could deliberately leave ellipses everywhere with names?
+ The block would have a name as well as the ellipsis.

+
+ This really needs to be tried.
+
+
+Extension
+ If I want this to be a very general programming language, it
+ should be possible to write in various different styles.
+ This includes:
+ - easy access to regexps like perl
+ - dynamic typed programming
+ - lisp-like programming?
+
+ A string could have an 'r' suffix to make it behave as a
+ regexp:
+ if "foo.*bar"r < mystring
+ if (a,b) = "foo\([bar]*\)x\(.\)"r < mystring
+
+ Dynamic typing needs a universal type which must be
+ extensible.
+ This is like an extensible union which I already considered
+ with universal enums. But that doesn't really work...
+ Need to think more.
+
+ Lisp? That is really a totally different engine with a totally
+ different syntax. Hard to see how it could possibly fit.
+ Any 'eval' based operation is really impossible to fit within
+ the syntax.
+
+ List processing more generally? If we have arrays that can be
+ chopped up and joined, we start to have something a bit
+ general.
+ Add list comprehensions and map/reduce functions and it might get
+ interesting. But then memory management gets interesting too.
+ Could 'tuples' be claimed by a type like literals can?
+
+
+Random stuff - don't know where this fits.
+
+ If a variable is known to be of some interface type, and we assign
+ a value of a more refined type, then the variable can be known to
+ have that more refined type until the value could change.
+ This might be useful for struct fields if we have exclusive access,
+ or named function return values.
+
+Strings
+ Strings need to be a built-in type. But how should they look?
+ NUL termination has problems - counting.
+ Storing a count in front can be wasteful if we want 64bit counts
+ and short strings.
+ Could take the utf8 approach:
+
+ 0-127 : 1 byte 7bit string length. (128 bytes)
+ 128-191: 2 byte 14bit string length (16 Kilobytes)
+ 192-255: 8 byte 62bit string length (4 exabytes)
+
+ We always nul terminate so strings can be passed to 'C'.
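The utf8-style count above can be prototyped to check the arithmetic. This Python sketch assumes exactly the three ranges listed (7-bit, 14-bit and 62-bit lengths) and ignores the trailing NUL:

```python
# Variable-length string-count prefix, as described above:
#   first byte 0-127   -> that byte is the length (7 bits)
#   first byte 128-191 -> 6 bits here + 1 more byte  = 14-bit length
#   first byte 192-255 -> 6 bits here + 7 more bytes = 62-bit length

def encode_len(n):
    if n < 1 << 7:
        return bytes([n])
    if n < 1 << 14:
        return bytes([0x80 | (n >> 8), n & 0xFF])
    if n < 1 << 62:
        return bytes([0xC0 | (n >> 56)]) + (n & ((1 << 56) - 1)).to_bytes(7, "big")
    raise ValueError("length too large for 62 bits")

def decode_len(buf):
    """Return (length, number of prefix bytes consumed)."""
    b0 = buf[0]
    if b0 < 0x80:
        return b0, 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | buf[1], 2
    return ((b0 & 0x3F) << 56) | int.from_bytes(buf[1:8], "big"), 8
```

Short strings pay one byte, like the common case wants, and anything up to 4 exabytes still fits in a fixed 8-byte header.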

+
+Numbers:
+ I really don't like the fact that ints can overflow without me
+ knowing.
+ I think any possible overflow must be checked, unless the var is
+ explicitly declared as cyclic (modulus).
+ This means the language must know the range of values, or include
+ lots of tests.
+
+
+runtime assistance.
+ - profiling. collecting statistics of where time is spent
+ - coverage. Count the number of times each block is entered and allow
+ this to be accumulated over multiple runs.
+ Then format code with highlighting of un-executed and hot-spot code.
+
+ For the interpreter, this can all be internal.
+ For the compiler, we need an external data file. Probably a log that we append
+ on completion and process later, using uuid hashes of functions.
+
+ Unit-tests, with mocking. This needs lots of experimentation.
+
+
+Literate help.
+ Sometimes it might be nice to assert that the order of things mustn't matter.
+ When building a function up during literate programming you might want to
+ say that certain things happen, but you want to be sure their order isn't important.
+ i.e. they all appear to happen at once.
+ [Other times there are dependencies that must be honoured.
+ For these, "after foo" or "before foo" might be useful
+ ]
+ In particular, when updating a data structure a number of steps might
+ be needed. Each should be able to access the 'before' fields and assign the
+ 'after' fields.
+ Maybe an explicit syntax for 'before' or 'orig', and an error if there is
+ room for uncertainty.
+ We probably want a name for a block of code in which a value is varying.
+ Invariants apply at the start and end but not during. Values assigned are not
+ visible until the end.
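The "values assigned are not visible until the end" idea could look like this in a Python mock-up, where the block reads a 'before' snapshot and its assignments are committed together at the end (update_atomically is an invented name, not part of any design):

```python
# Sketch: inside the update block, reads see the original ('orig')
# fields and assignments are staged; everything becomes visible at
# once when the block ends, so invariants hold at start and end only.

def update_atomically(record, block):
    orig = dict(record)        # the 'before' snapshot the block reads
    staged = {}                # the 'after' fields the block assigns
    block(orig, staged)
    record.update(staged)      # all new values commit together

# Swap two inter-dependent fields without caring about step order:
point = {"x": 1, "y": 2}
update_atomically(point, lambda orig, new: new.update(x=orig["y"], y=orig["x"]))
```

Because every read goes to the snapshot, the individual steps inside the block genuinely appear to happen at once, which is exactly the "order mustn't matter" property wanted above.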
diff --git a/I Object b/I Object
new file mode 100644
index 0000000..1a8e47e
--- /dev/null
+++ b/I Object
@@ -0,0 +1,473 @@
+NOTE - where to put stuff about "is natural a type, is non-zero a type (for div)"
+
+Objects and polymorphism, and some standard challenges
+
+Some definitions:
+
+Object        a record of operations
+
+Operation     a function specific to, and depending on, and part of, an object
+
+Expectation   what an operation (in a high-level, abstract, programmer-based sense) expects of
+              its operands
+
+Behaviour     a specification of what an object ('s operations) does
+
+'' vs State   behaviour can be modeled as procedure plus state
+
+Types         subset of behaviour that satisfies expectation
+              also a set of objects which share this subset of a behaviour
+
+              sum of types of operations
+
+Subtypes      relation between behaviours
+              co-variance / contra-variance
+
+
+Polymorphism  two objects in the same type conform to the same expectation and so can
+              equally be arguments to an operation
+
+Examples....
+
+
+Behaviours(?) that don't fit the model
+	stack
+	total ordering
+
+Parameterised behaviours
+	stack of plates
+	ordering of numbers
+
+Class: that which creates objects - the set of those objects - hence like a type
+
+Other Issues
+	infix (mult, add)
+	  declared in parent (field)
+	  refined in various children
+	  e.g. float:  float a mult int b -> a.mult(floatfromint(b))
+
+	  Matching of infix to operator done at compile time.
+	  thus not usable in very generic functions
+
+Subjects
+	an object may have multiple orderings. i.e. subtype of ordered(thing) for various things.
+	may even be subtype of ordered(X) for specific X twice.
+
+	This subtyping is not so much what-it-is, but how-it-behaves and, like people, it may
+	behave differently in different circumstances.
+	This is almost a HAS-A link - it HAS-A behaviour.
+	So how do we tell a routine to sort an array of things based on a particular behaviour?
+	One could make an object with a compare fn which takes two things and compares them,
+	and have a different object (class) for each behaviour.  But that is rather gross.
+	What is really required is to tell the sort routine what message to send.
+
+Object creation
+	This is a really interesting issue.
+	Who creates an object: presumably the class...
+	when we convert an int to a float do we
+	1/ anint.convert-to-float - i.e. call an operation on the int object
+	or
+	2/ float.createfromint(anint) - i.e. call a create operation passing the int as arg.
+
+	For functional languages, where objects are immutable, create must be able to create all objects.
+
+Extending a class
+	it may be desirable to define new functionality for a given type of object, functions which
+	are defined in terms of the object's behaviour, not its implementation.
+	Adding functions normally requires subtyping, but what we are saying is that the new functions
+	are a logical extension of the object.  We are just elaborating what was already there.
+	e.g. int.isprime
+	to do this we define a new type which is a subtype and provide the implementation as a statement
+	that the new type elaborates, rather than refines, the old.
+	e.g. stat-real elaborates real
+	this would (almost) imply that
+	  GLB(stat-real, float) elaborates float
+
+
+Example:
+    integer:
+	integer is unimplementable as it is infinite, and some values will not fit in virtual memory.
+	However, many useful subsets of integer are implementable.
+	We want a supertype which means "integer, or useful subtype thereof"
+	As some ops will overflow, we define the set of possible values to be the integers plus "overflow".
+	define e.g. Z(m)  a+b -> overflow if a or b is overflow
+	                      -> (a+b) if a and b and a+b are < m in abs value
+	                      -> (a+b) or overflow if a or b or a+b > m in abs value
+	then if m > n then Z(m)+ is more specific than Z(n)+, so Z(m) is a subtype of Z(n)
+	yet we normally consider a smaller set to be the subtype...
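
The Z(m) definition above is precise enough to execute. A minimal sketch, with one liberty: where the notes allow "(a+b) or overflow" when an input exceeds the bound, this version deterministically picks a result rather than modelling the nondeterminism.

```python
OVERFLOW = "overflow"

def z_add(m, a, b):
    """Addition in Z(m): integers of absolute value below m, plus a
    distinguished, sticky 'overflow' value, as defined in the notes."""
    if a is OVERFLOW or b is OVERFLOW:
        return OVERFLOW          # overflow propagates
    s = a + b
    return s if abs(s) < m else OVERFLOW
```

This makes the upside-down subtyping visible: every addition that succeeds in Z(100) succeeds with the same answer in Z(1000), so Z(1000)'s behaviour refines nothing; it is Z(1000)'s plus that is *more* defined, which is why the notes conclude Z(m) is a subtype of Z(n) for m > n even though its value set is larger.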
+
+    integers and type subsumption:
+	one view of subtyping is that an object of a particular type can be used
+	when an object of a higher type is expected (because it IS of that type too).
+	If an 8bit integer is considered to be a subtype of a 16 bit integer, then
+	we would expect "if x.lessthan(200) then x.add(200)" to not overflow.
+	But no....
+	If 16bit is a subtype of 8bit, then it will work.
+	anything an 8bit can do, a 16bit can do..
+	But while the object may be capable, the value might not be suitable.
+	..but with "overflow" in our model, any value is suitable, though it might be overflow...
+
+	Alternatively, overflow might be modelled as an exception. i.e. the op does not return.
+	Z(m)  a+b -> (a+b) if a+b < m in abs value
+	          -> exception otherwise
+	then if 8bit is a subtype of 16bit... no real difference
+
+	Want to say:  16bit.add(a,b) where a,b are subtypes of 16bit, result is 16bit
+	may still overflow, but there won't be any surprises.
+
+	So: Z(n) is the type/set of integers less than n in abs value.
+	operations are limited - maybe decimalise, isprime, isodd
+	Various classes exist which can create Z(n) by adding, multiplying etc.
+	these require arguments which are of a subtype.
+	To facilitate this, we want to be able, given a variable of a type, automagically choose a
+	class which is of that type.
+	If V is Z(100), want to choose (maybe) class Z(127) == signed(8).
+	Ouch - we are upside down again...
+	values subset one way, types subtype the other
+
+Challenges:
+	output formatting
+	point/colour point??
+	min
+	- find that paper about open/closed self -
+	binary operators - arithmetic.
+	...
+
+output/printing
+	This is often one of the hairiest parts of a language because various types want to
+	be treated in similar ways, and people like to avoid extra syntax.
+
+	There seem to be three inputs to a formatting request:
+	 1/ the objects (values) to be formatted
+	 2/ the detailed shape of the formats for each object (hex/decimal, field width, etc)
+	 3/ the constant padding, and the ordering of the formatted values
+	    (which is only almost always constant)
+
+	some parts of 2 are closely tied to 1, some to 3.
+	e.g. field width is probably 3's domain.  But it is useful with 1 to print part of a string.
+	keeping the inputs separate hazards consistent update
+	keeping them together hazards abstraction and maybe readability.
+	Thus a choice is probably appropriate.  Let the programmer decide.
+
+	maybe
+	 stdout.print("now is the ", time, " for ", all," ", good, " men to come to the aid of the ", party)
+	   where time = now.asciitime(),
+	         all = people.count.decimal,
+	         good = if (random > 0.5) "good" else "bad",
+	         party = "labour";
+	But:
+	 hard to distill this format into a per-language file.  leaves too much unanswered.
+
+	different formats of a value are a bit like different subjects looking at an object.
+	"you think it is an integer, but I think it is 32 bit hex, uppercase"
+	so I might define "Address" which is an elaboration of int which defines
+	tostr as .hex(8,"0").toupper
+
+The numerical paradox
+	n-bit integers seem to subtype: 16bit is a subtype of 8bit
+	subranges subtype: 0-255 is a subtype of 0-65536
+
+	why?
+	A distinction between objects (concrete) and values (abstract).
+	look at casting in C - (int)float preserves value, (int)point doesn't really.
+	(int)char certainly doesn't.
+
+	or the 16bit type defn from above somewhere includes all 32bit and 64bit objects.
+	an 8bit 5 is different from a 32bit 5 etc.
+	but 5 IS 5.0 IS 5.00
+	if we start treating objects differently from values, we are losing abstraction.
+
+	So we want the compiler/language to determine which representation is appropriate
+	and how to perform an operation.
+	Probably the most difficult to specify is dealing with loss of precision, as it is
+	expected.  Overflow, and to a lesser extent underflow, can be deliberately avoided.
+
+Intrinsic types.
+	What types are INTRINSIC to a language.
+
+	1/ Boolean  essential for decisions.
+	2/ Code - may be third class, but must exist.
+	3/ lambda - again, often of a low class
+	4/ sequence - as in args to a procedure, statements in code, subobjects in an object
+	   nice to make this first class - very parameterised, anonymous type.
+	   also bits in an integer?
+	   Hmm <'cart,x,y>  <'polar,r,theta>
+	   or imaginary.cart(x,y)  imaginary.polar(r,theta)
+
+Precision in types
+	To what extent should the type of variables be declared?
+	for very local variables which are assigned once, e.g. a for loop variable,
+	it seems unnecessary to declare the type.  Just keep scope local and deduce type from usage.
+
+	even at a larger level, is "int" enough, or do we want "count" "label" "area" ...
+
+
+Multi-threading.
+	problem of locking - protecting against unwanted concurrent access.
+	2 scenarios - within and without.
+
+	within the methods of an object, the object may occasionally violate the invariant.
+	While the invariant is violated, the object must be single threaded.
+	Dangerous code segments could be bracketed somehow.
+
+	without an object, we may want to use an object assuming that we are the only user.
+	i.e. singlethread is a behaviour we want, so we include it in the required type
+	of argument objects.
+
+	How to specify fork/join.
+	don't want it to be too automatic or we cannot have fine control.
+	maybe model threads of control as objects - have arrays of them etc??
+
+Parsing.
+
+	Parsing is a very common sort of problem which must be very configurable.
+	Want some way to specify a parser - as a list of parsing objects?
+	const-string might be "extended" to have a parse method which matches the string.
+
+
+Executables as objects?
+
+	for uniformity, as well as expressivity, it would be nice if executable code
+	were treated like an object - was sufficiently first class to be passed around.
+	Then
+	   IF a THEN b ELSE c FI
+	would be a.choose(b,c)
+	where true.choose(x,y) == x.execute
+	and false.choose(x,y) == y.execute
+	or maybe it would be
+	   a.choose(b,c).execute
+	with obvious redefinition of true and false.
+	a;b would be a sequence object.
+	{sequence}.execute would be {sequence}.head.execute, then {sequence}.tail.execute
+	need some input to .execute, and some output.  This is looking very functional...
+
+	"normal" languages have code elements which access/modify "local variables".
+	such code elements would have limited scope and so could not be stored in variables
+	of greater scope.  If scope is a part of a variable's/value's "type" then this can be
+	a type constraint.
+	naturally, code elements could be created which have global scope.  This needs a lambda
+	abstraction.
+
+	methods have the scope of an object, and are assigned to fields in that object.
+	Is each procedure (block?) call a creation of an object? like Beta?
+	Calling a method creates an object with fields which are the arguments and local vars,
+	enstacks the object, and proceeds to execute the .code sequence.
+	The objects in a .code sequence are method calls.  Each identifies a target object, a
+	target method of that object, and a list of objects as arguments.
+	OR
+	an object maps a methodname to a method.
+	a method maps an object to a (more specific) method.
+	eventually (after all args are collected) a method maps an object to an object.
+	Hmmmm. an object maps a methodname to a method which is a new scoped execution object.
+	given an argument, this executes.
+	That sees procedure parameters as a single sequence.
+
+	This is a bit wishy-washy, but seems to tend toward a nice simple execution model...
+	What do I particularly want:
+	   scoped executables which can be passed down/around and optionally executed later.
+	   unscoped lambda expression - i.e. an independent object!
+	   combination - a scoped lambda expression.
+
+	BUT: What are the semantics of an atomic execution.  Is it value transformation a la functional
+	languages, or state change (of some scoped/stacked object) or both?
+
+
+Subtyping objects is an illusion.
+	Subtyping of values makes lots of sense.  Values don't change.
+	Subtyping of mutable objects just doesn't work.  The objects of the sub type
+	must behave like objects of the supertype, and hence be subject to any change
+	that the supertype may be subject to.
+	Thus the invariant can only become less restrictive, thus allowing more values.
+	This is the integer subrange vs n-bit int distinction again.
+
+	So with typing, we focus on sub-types of VALUES of objects, and
+	use type functions to express types of objects.
+	The value of an array of ints is a subtype of the value of an array of objects,
+	but an array-of-ints object is a restriction of an array-of-objects object.
+	Clearly parameterised classes will play a big role here.
+
+SubObject references.
+	It is often desirable to have a pointer into a bag object such as a
+	linked list, a skip list, etc.
+	This is more than just a handle on a subobject as it may be used to
+	navigate around the structure.
+	Particularly in a multi-thread environment, the bag object needs to
+	keep track of what active pointers there are to its sub-objects, so
+	it can guard against deleting/changing an active object, or maybe an
+	object on an active path.
+	Requiring the client to "release" a reference is as bad as using
+	malloc/free.  It is better to have language support - e.g. through
+	reference tracking.  References must be released when pointers are
+	destroyed, and it is best if the language system guarantees this.
+	This is like the memory allocation problem, except that a garbage
+	collect approach won't cut it.
+
+	Could possibly have several classes of references e.g. readonly and
+	read-write and ways to promote/demote them.
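
The reference-tracking idea can be mocked up with scoped references whose release is automatic. This is only a toy model: in Python the scoping comes from a context manager, where the notes want it guaranteed by the language itself; the class and method names are invented.

```python
class Bag:
    """A container that tracks live references into itself, so it can
    refuse to delete an element while some reference still points at it.
    A sketch of the notes' 'reference tracking', not a real design."""
    def __init__(self, items):
        self.items = list(items)
        self.pins = {}                    # index -> count of active refs

    def ref(self, i):
        bag = self
        class Ref:                        # scoped reference to items[i]
            def __enter__(self):
                bag.pins[i] = bag.pins.get(i, 0) + 1
                return bag.items[i]
            def __exit__(self, *exc):
                bag.pins[i] -= 1          # release is automatic at scope end
                return False
        return Ref()

    def delete(self, i):
        if self.pins.get(i):
            raise RuntimeError("element is still referenced")
        self.items[i] = None
```

The point the notes make survives even in this toy: deletion is guarded by the bag itself, and the client can never forget to release, because release is tied to the reference's scope rather than to client discipline.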
+	There is also a question of cutting an object up (e.g. a doubly linked
+	list) and how that is type checked.....
+	Also, I like having a pointer to a thing that could be in one of
+	several doubly linked lists, and I don't care which one.
+	If I don't know, the compiler won't know, so how is concurrency
+	guarded...
+
+
+Objects without inheritance.
+	I have a feeling that when separated from subtyping, object
+	inheritance is purely an implementation convenience.
+
+	This convenience is possibly confusing and might be better
+	discarded.  We can simply use object inclusion (has-a) and let the
+	compiler determine that a subobject can be inlined.
+	methods for the new class can be equated to methods on the sub
+	object or, quite naturally, a new method which calls a method on the
+	subobject.  This discards super. calls and removes any name confusion
+	with multiple inheritance.
+	A question is self. calls.  This is (after all) the most interesting
+	thing in oo programming.  The routine in the superclass calls a
+	method in the subclass (which may, or may not, be the same class).
+
+	One answer is my idea of extension classes which implement new
+	methods on objects of some known type in terms of the external
+	interface to that object. e.g. sine(x) in terms of plus/mult/div etc.
+	Q. to what extent will the object be mutable??
+
+Question. are safe downcasts needed.  Are they like typecase in
+	Modula-3? but there may be subthings you don't know about.
+	Should you not extend each possible subclass with the necessary
+	method?  Can this be done post-hoc?  Is this just needed because
+	programmers don't really understand subtyping??
+
+	Java has an example (probably many) of safe downcasting..
+	A Hashtable is defined which maps objects to objects.
+	When you get an object out, the programmer probably knows the
+	type from the content of the key, but the language doesn't.
+	Hence a safe downcast is needed.
+
+	Two ideas present themselves for making this more typesafe.
+	[ note a safe downcast is not "typesafe" because it can throw an exception ]
+	1/ The key to the hash table could be
+	     (key,type-of-content)
+	   So store(key:object, ctype:type, content:ctype)
+	   and fetch(key:object, ctype:type) : ctype
+
+	2/ The type can be explicitly stored with the content (instead of implicitly)
+	   Then the table entry would be
+
+	   It would be in either case.
+
+	This requires that types, or at least classes, exist as objects.
+	Q. are classes enough, or are all types potentially needed as objects?
+	   we would really want a partial ordering, so all types are needed.
+
+The next question is: who owns an operation like addition?
+	Addition of integers is a function of the set of integers (Z) more
+	than of any particular integer.  3.plus(4) always looked terribly asymmetric.
+	If we think of objects more like values (which is pleasing for
+	numbers), then there are possibly many things we could mean by
+	"plus", modular arith for example.
+	Real-plus is the *same* op as integer-plus (as an integer IS a
+	real), but what about float-plus?
+	float is a REAL problem (joke!) because people want to treat it like
+	a real, but it simply ISN'T real (more-so than int isn't integer).
+	Possibly the semantics could be that a+b yields "real" addition, but
+	this value - which cannot be stored - gets converted to whatever it
+	gets stored in. e.g. a float or an int, with an implicit but
+	clearly acknowledged possible loss of information.
+	This doesn't quite work with nested expressions, as a difference of
+	two very close large numbers loses a lot of precision.  We simply
+	cannot keep infinite precision (or even adequate precision)
+	intermediates.
+
+	I think I've lost the thread a bit though.  That is a question of
+	overloading '+' and how to resolve that overloading.
+	... but is it the same problem?
+
+	A problem with associating plus with "int" rather than "3", is that
+	if we have an object that we don't know is an int, how do we add to
+	it.
+	Well, we must know that it is of a type that supports add (is that
+	ring? - probably ring-of-X) i.e. generic addition must be through
+	parameterised genericity rather than subtype genericity.  (actually,
+	it's probably additive-group rather than ring).
+
+	Let's practice that:
+	We say "X == Additive-group of type X" is a type which supports the
+	binary operation "plus", thus if A and B are both of type X, then
+	"plus(A,B)" is of type X.  (Additive-group "extends" X)
+	Then we can define an additive group of type "int", "float",
+	"matrix(n,m,scalar)", "0..6", where we must specify how
+	addition works in each case.
+	We can say that "X is an Additive-group(X)"
+
+	But if we see "plus(3,4)", what type is that? is it int.plus or
+	float.plus or "0..6".plus?  If we want 3 to be simply the value "3",
+	equal in types 0..6 and int and float, then the plus must be decorated,
+	or the type must be propagated down from the receiving method(????)
+
+	If we use primarily one additive-group at a time, something like a
+	with-clause would suffice for briefness.  But we may well be mixing
+	ints with floats, or anything else, and an accidental expectation of
+	int rather than modulo semantics would be confusing.
+
+	Maybe a low-cost with-clause on a per-expression basis:
+	   int(3+4 div b)   float(3/b+4/f)  Zseven(4+5)
+	Without such a clause, the least surprising would be used - int.plus
+
+	For non-operator methods, this isn't an issue.  Just include the type in
+	the name.
+
+	Rewind... When we say "X is an Additive-group of X", we need to
+	include the "of X" because if Y is a subtype of X, "Y is an
+	Additive-group of X", but probably not "of Y".  Or more importantly,
+	"of X" is more useful than "of Y".
+
+
+An assertion: (did I make this earlier?)
+	subtyping mutable types doesn't make sense.  Only subtyping of
+	immutable values.
+	The rationale is that the important use of subtyping is: if A is a
+	subtype of B, then a member of A can be used wherever a member of
+	B can be used.
+	Thus any change that is legal to a member of B MUST be legal to a
+	member of A, and the B-visible change will be the same.
+	Thus A must maintain the same state, and can only extend it with
+	more information.  Yet B-acceptable changes will not usefully affect
+	the extra information......
+	Maybe not.  A could contain a change history.
+	Lots of other examples.  A shape that can change colour...
+
+	So we can reasonably subtype all types.
+
+Want to motivate result type being a function of arg types.
+	choose(a,b:object) -> lub(a,b)
+	choose(a,b)+1  -> lub(additive-group/a, ag/b)
+	maybe say union rather than lub.
+	double(a) -> type-of(a), but what does that mean?
+	must choose a significant point in the type lattice between root and 2*a.
+	this is probably what groups are about, but want more examples.
+	hmm. type info is static.  This is worth remembering.  It is exactly what
+	is important.
+
+
+HereWeGoAgain
+
+An "Object" is a "thing" which has "behaviours".
+By "thing" we mean that it is something that can be held (referenced, pointed to).
+It is a bit like a black box.
+
+Behaviours include (at least) responses (reactions).
+A response is elicited when we send an(other) object to the object.
+The response has three parts:
+	It produces an object.
+	It (may) change the original object into a different object
+	It may send other objects to other objects thereby causing other
+	objects to change into different objects.
+
+So far it is pretty confusing.  We need some structure.
+
+A "type" is a set of "objects".
+
+types will usually be prescriptive (is that the right word?) meaning that any object which
+fits the description is a member of the set.
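
The "who owns plus" discussion above is easy to make concrete: let a group object own the operation, so `int_group.plus(3, 4)` replaces the asymmetric `3.plus(4)`, and the per-expression with-clause (`int(3+4)`, `Zseven(4+5)`) becomes a choice of group. A hedged Python sketch, with invented names:

```python
class AdditiveGroup:
    """Addition is owned by the group, not by either operand.  The same
    expression means different things under different groups, which is
    the notes' Zseven-vs-int distinction."""
    def __init__(self, plus):
        self.plus = plus

int_group = AdditiveGroup(lambda a, b: a + b)        # ordinary int.plus
z7 = AdditiveGroup(lambda a, b: (a + b) % 7)         # the notes' Zseven

def total(group, values, zero):
    """Parameterised genericity: generic code takes the group as an
    argument instead of relying on a method of the values themselves."""
    out = zero
    for v in values:
        out = group.plus(out, v)
    return out
```

`total` is the payoff: it works for any additive group handed to it, which is exactly "generic addition through parameterised genericity rather than subtype genericity".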
+
diff --git a/Interpret b/Interpret
new file mode 100644
index 0000000..f9ade7a
--- /dev/null
+++ b/Interpret
@@ -0,0 +1,63 @@
+Interpreting - April 2013
+
+Supposing I had a language design, and a parser which created a basic
+abstract syntax graph.  What else is needed before I can have an
+interpreter which can run programs?
+
+Assume a simple source-only interpreter - no bytecode format.
+
+I need:
+ - symbol tables - lots of them.  One for every scope.  Each symbol can
+   have a type and a value.  The symbol table must complain about
+   re-definitions.
+
+ - process declarations by putting things in symbol tables.  This may
+   involve loading modules, but that can happen later too.
+   Also create proper internal representations for things like
+   structures and assign numbers for enums.  Though we need to be
+   able to evaluate constant expressions for that, which requires
+   type analysis.  So I guess that has to come later.
+
+
+ - type resolution.  Need to determine the type of each expression and
+   of each name and make sure the types all match - or at least are
+   compatible.
+
+ - Actually, we cannot separate type checking from expression
+   evaluation as they can depend on each other: e.g. if two arrays are
+   sized by expressions and need to be compatible.
+
+
+ - a virtual machine with a stack of frames and a heap.  Maybe several
+   stacks for tasks or co-routines.
+
+ - a 'standard library' of functions, methods, constants
+    - memory allocation
+    - io
+    - formatting
+    - threads / IPC
+    - string handling
+
+What errors are possible:
+ - parse errors: bad token, unexpected token (which might be EOF)
+ - type errors: access attempt doesn't match type, in lots of
+   different ways.  Is an ignored return value a type error?
+ - namespace errors: name not defined, or already defined.
+ - enum values being reused ( { a=1, b=1} ).
+ - some constructs only allowed when 'unsafe' - is that a type error
+   or something else?
If an 'unsafe' module provides those
+   constructs, then it could be a type/namespace error.  But that
+   hides them from the language.
+ - numeric value is badly formatted
+ - string contains an unknown escape
+
+Todo:
+ - generic number handling
+    use gmp - bignum
+ - string processing
+    \ b r t n a f q v 0-3 x u U \
+ - simple symbol table
+ - structure to hold type info
+ - expression structure
+ - statement structure
+ - stack machine
diff --git a/NOTES b/NOTES
new file mode 100644
index 0000000..a009e25
--- /dev/null
+++ b/NOTES
@@ -0,0 +1,317 @@
+02oct2014
+  I don't know how to create a grammar for cond-statements.
+  The newlines are confusing.
+  When I have
+     if cond:
+         statements
+
+  This ends with NL OUT NL.
+  The first NL is included in "statements" which is allowed to end with an NL.
+  The "OUT" reduces us down to "if Expression Block" and then cancels.
+  The second NL is the problem.  It could end the whole statement, or
+  it could be followed by an 'else' and I don't know which.
+
+  I currently have complex statements separated by newlines.  This means the
+  second NL must close the complex statement.  So we cannot shift it until
+  we have reduced a ComplexStatement.  So we cannot see if an 'else' is coming.
+
+  If we require ComplexStatement to end with a newline, then we have:
+     ComplexStatement ->
+             IfPart Newlines
+             IfPart OptNL Elsepart NEWLINE
+     IfPart -> if Expression Block
+     ElsePart -> else Block
+
+  Then the NEWLINE can be shifted.  If we see an 'else' we reduce it to OptNL.
+  If we don't we reduce to ComplexStatement.
+  But if we see another newline ....
Need both to allow a list
+
+  If
+     ComplexStatement -> WhilePart CaseList ElsePart
+     WhilePart -> while Expression Block
+     CaseList -> Newlines
+              |  OptNL CasePart
+              |  OptNL ElsePart
+     Elsepart -> Newlines
+              |  OptNL else Block Newlines
+
+  OR
+
+     WhilePart -> while Expression Block
+     CaseList -> OptNL CasePart
+              |
+     ElsePart -> OptNL else Block
+              |
+     ComplexStatement -> WhilePart WhileSuffix
+     WhileSuffix -> Newlines
+                 |  OptNL CasePart WhileSuffix
+                 |  OptNL ElsePart Newlines
+
+     ComplexStatement -> ForPart WhilePart WhileSuffix
+                      |  ForPart WhileSuffix
+                      |  WhilePart WhileSuffix
+                      |  SwitchPart WhileSuffix
+                      |  Ifpart IfSuffix
+
+
+
+24sep2014
+another problem.
+  If I have
+     statementlist -> statementlist NEWLINE statement
+  then the newline will completely close the statementlist which I don't want.
+
+    if a:
+        stuff
+    else if b:
+        stuff HERE
+    thing
+
+  at 'HERE' there is a newline before and after the OUT which
+  need to be shifted into two different things... it isn't working for
+  some reason.
+
+------
+
+I have a problem.
+
+   if cond : a=b; d=f
+
+parses badly.
+The NEWLINE at the end could be shifted to turn the "simplestatements"
+into a "statements", or it could trigger a reduce and then be shifted for
+the IfPart.
+So in neither case is it ignored, and that is all the previous logic involved.
+
+Both the state at 'if' and the state at 'a=b' are starts_line and there are no
+outstanding indents.
+So the NEWLINE should reduce anything that started with starts_line
+that doesn't contain an indent or a newline.
+
+Actually, 'a=b' doesn't start_line.
+
+So when we see a NEWLINE when it is allowed, we could reduce anything that is
+complete since the last start.  Maybe....
+
+If newlines are ignored, obviously we ignore any we find.
+If not, there must be a starts_line state since the last indent.
+We really want to reduce everything since there to a single non-recursive
+symbol.
+But maybe we need to SHIFT before we can REDUCE that far.
+So we just reduce as far as we can.
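
The IN/OUT/NEWLINE tokenisation these notes keep appealing to (the "NL OUT NL" shape, and the later rule that a NEWLINE which would land just before an IN is moved to after the matching OUT) can be sketched directly. This is my reconstruction from the notes, not the real parsergen code, and it assumes space-indented input starting at column 0.

```python
def tokenize(lines):
    """Emit the words of each line plus IN/OUT/NEWLINE tokens.
    Rule from the notes: the NEWLINE that would appear just before an
    IN is held back and emitted after the matching OUT instead."""
    toks, stack = [], [0]            # stack of active indent columns
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:
            if toks and toks[-1] == "NEWLINE":
                toks.pop()           # defer this NEWLINE past the OUT
            toks.append("IN")
            stack.append(indent)
        else:
            while indent < stack[-1]:
                stack.pop()
                toks.extend(["OUT", "NEWLINE"])   # the deferred NEWLINE
        toks.extend(line.split())
        toks.append("NEWLINE")
    while len(stack) > 1:
        stack.pop()
        toks.extend(["OUT", "NEWLINE"])
    return toks
```

On the running example, `if cond:` / indented `a` / `b`, this produces `... IN a NEWLINE OUT NEWLINE b NEWLINE`, i.e. the block ends with exactly the "NL OUT NL" sequence whose second NL causes all the trouble above.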
+
+
+
+Hmmm... I've made a mess of this.  How embarrassing.
+My top-of-stack and 'next' handling gets confused.  The indent
+on the 'next' token gets stolen when I reduce.
+
+So:
+  The stack alternates tokens, which can hold indent, and states,
+  which can allow newlines.  The top and bottom are states.
+  Each frame contains a state (with newline flag) and the following
+  token (with indent information).
+  The final (topmost) state (with newline flag) is stored in 'next',
+  as is the look-ahead token (together with indent info).
+
+  When we reduce() we remove several (0 or more) frames and replace
+  them with a single frame.  The information we remove is actually
+  a token and its following state, N times.  Including the state in 'next',
+  but not the token in 'next'.
+  The new frame is the new reduced-to token with the old state, either
+  from the frame or from 'next'.  'next' gets a new state.
+
+  This suggests I did the frame the wrong way.  A frame should have
+  a token (with ast and indent) and then a state (with newline flag).
+  The bottom-of-stack can have a null token.
+  'next' just has the look-ahead token with indent state.
+
+  Reduce discards N frames (never the bottom frame) and pushes
+  the resulting token (with ast) and the 'goto' state from the previous frame.
+  'shift' pushes the 'next' token and 'goto' state.
+
+  What a mess.
+   - fixed now (I hope)
+
+Back to where I was.
+
+  I want to parse:
+
+     if a==b: print a; print b
+
+  and when I see the NEWLINE I want to reduce "print a; print b" to Statementlist
+  without shifting the NEWLINE.  Then the NEWLINE is shifted to make a ComplexStatement.
+
+  The state at the start of 'print' doesn't expect a newline, but at the start of 'if'
+  it does (I hope)... only it doesn't if it is at the start of a block.  But in that
+  case it is indented.
+  So we look backwards for an indent or a starts-line state.
+  If we can reduce without discarding the state or absorbing the indent, we do.
+
+  Only ....
now that 'print' isn't in a starts_line state, it also
+  isn't after an indent, so we are ignoring newlines.
+  I think I want my cake, and am eating it.
+
+  How to parse:
+     if a == b
+        : print a; print b
+
+  This should parse like the above.  So "Block" isn't a nicely
+  reducible element.  Only Statementlist is.
+     if a == b
+        { stuff }
+
+  The state after the ':' is important for reducing back to.
+  Is that because it is a recursion point?  No.
+  It is because the next thing can be preceded by a newline.
+  i.e. ComplexStatement can follow a newline.
+  So we find symbols that end with a newline, and thus symbols
+  which follow a newline in a production, but only immediately.  So
+  ComplexStatement.
+
+  Then any state where a newline-following symbol follows DOT
+  is declared a line-starting state.
+  These make newlines visible until the next indent and ....
+
+  and what?  If a newline appears (before an indent) it reduces
+  everything since that state while it can or until there is just
+  one thing which is not recursive.
+
+  After a line-starting state, an *internal* indent disables newlines.
+  An initial indent reduces the reductions that a newline can cause?
+  No, that doesn't work.
+  I think I need to track if a symbol is at start-of-line.
+  When we get a newline, we reduce anything we can since that
+  start-of-line until we get one symbol?
+  i.e. if we get NEWLINE and the top symbol didn't start the line,
+  we reduce if that reduction won't swallow the start of line.
+
+
+Summary:
+ - IN, OUT, NEWLINE tokens.  A NEWLINE that would be before an IN is moved
+   to after the OUT
+ - IN is recorded against the 'next' symbol.  Each symbol records the indents
+   (INs minus OUTs) it contains, and whether it started with an IN
+ - in parsergen, each symbol is tagged if it can end with
+   a newline
+ - Each symbol following a can-eol symbol is 'starts-line'
+ - Each state where a starts-line symbol follows DOT is a starts-line
+   state.
+ - When parsing we record which symbols followed IN or NEWLINE.
+ - If there are net indents after a starts-line state, other than
+   immediately after, then NEWLINE tokens are ignored.
+ - If we see a NEWLINE which is not ignored then we must reduce any
+   production which started after the most recent start-of-line.
+   So if we can reduce and length is less than length-to-start,
+   reduce.
+ - If we see an OUT we must reduce any production which started
+   indented.
+
+
+We end up with lots of starts-line states that aren't interesting, as
+they are very close to a newline anyway or only a terminal away from
+the next starts-line state.
+
+Optional NEWLINES are awkward.  When we see a newline, we are prone
+to reduce early so we end up with a newline to be shifted when it
+isn't wanted any more.
+
+Optional newlines are even more awkward.  An optional newline
+in "block -> OptNL { statementlist }" messes up because the NEWLINE
+forces the reduction of OptNL from 'empty' before the NEWLINE is shifted.
+So we can never achieve "OptNL -> NEWLINE" except at the start of a
+line, after an OUT.
+The purpose of reducing early is to ensure a symbol never includes a
+newline unless it started at start-of-line, or explicitly allows
+newlines.
+
+How is
+    if cond : st1 ; st2
+    st3
+
+(where a newline is permitted between 'st1;st2' and 'st3') different
+from
+if:
+    cond
+then:
+    st
+(where a newline is permitted between 'cond' and 'then').
+In the first instance I want to reduce further.  In the second instance I
+cannot.
+In the first case, the new thing started midline.  In the second it didn't.
+
+
+ARG.  Still not sure I have this right.  Though maybe my indent
+grammar is broken....
+
+We definitely need to know if a state "starts lines". i.e. it is a
+state where we are expecting a 'line like' thing.
+A 'line like' thing should be a thing. i.e. a non-terminal.
+A non-terminal which ends with a newline is a perfect candidate for
+'linelike'.  So any state which is followed by can_eol is linelike?
+
+
+The grammar needs to be carefully constructed.
Anywhere a NEWLINE
+appears, we definitely don't ignore newlines.
+So they should only appear after things that we want to be line-like.
+So
+   variable = expression NEWLINE
+is no good, because we don't want 'expression' to be linelike.
+Is:
+   for SimpleStatements NEWLINE
+ok?  Not really, because SimpleStatements is recursive.
+
+
+1/ outdents and newlines must "close" any productions which started
+   at-or-after the matching indent, or after the matching start-line
+
+   The key idea is that the total set of tokens for any given symbol
+   must:
+   - not include an OUT without the matching IN.  If the IN was at the
+     start, the OUT must be at the end.
+   - not include a NEWLINE unless it started at or before start-of-line,
+     unless NEWLINEs are being ignored
+     (unless the symbol includes only the newline)
+
+   So when we see an 'OUT' we reduce anything we can until we can
+   cancel against an IN.  If the IN we would cancel against is at
+   the start, reduce again if length==1.
+
+   When we see a NEWLINE, reduce if we can as long as length doesn't
+   go beyond start-of-line.
+
+2/ NEWLINEs are ignored after an indent if they are not "expected"
+   since the indent.
+
+   A symbol is 'linelike' if it is ever followed by a NEWLINE,
+   i.e. the symbol after it in some production begins with a NEWLINE.
+   (if "a -> b c" and "x -> a NEWLINE", then a is linelike, but c
+   isn't).
+
+   If a state is followed by a linelike symbol, then it is a
+   starts_line state.
+   Newlines are expected in starts_line states.
+
+SimpleStatements Block ComplexStatements
+
+so:
+ - track which symbols *start* with a newline
+ - deduce which symbols are linelike - they are followed by a newline
+ - deduce which states are starts_line - a linelike symbol follows DOT
+ - Make sure the grammar handles newlines properly.
+
+
+The "shift if you can, reduce if you cannot" rule means that an
+unexpected symbol effectively terminates everything.
But we don't
+want to terminate an indent before we see the outdent, so that needs
+fixing.
diff --git a/Numbers b/Numbers
new file mode 100644
index 0000000..f56c9f7
--- /dev/null
+++ b/Numbers
@@ -0,0 +1,95 @@
+
+How do numbers fit into an OO framework?
+
+Numbers for which you can do
+   a = b + c
+   float = float + int
+   a = i++
+   a < b
+
+There are several options:
+
+A/ Number objects can be immutable
+B/ Number objects can be mutable and assignment makes a copy
+C/ Number objects can be mutable and assignment takes a reference.
+
+If numbers are immutable, then what does
+   i++
+
+mean?  It must mean that a new object is created which has a value
+one more than the value of the object stored in i, and this object is
+then stored in i.  So '++' operates on the name, not on the object.
+That seems a bit out of the spirit of OO, where we operate on objects.
+
+
+If assignment makes a copy, then that is a problem for complex
+objects.  We absolutely need by-reference handling for many things.
+So we would somehow need different assignment for different
+objects.  Some take references, some take copies.  The difference
+would have to be explicit e.g.
+   a = b (reference)
+   a := b (copy value)
+
+That is error-prone.
+
+
+If mutable objects assign by reference then
+   a = b
+   while (a++ < b+5)
+      ;
+would never stop, so the model is unexpected and error-prone.
+
+So... where does that leave us?
+
+It seems that operators like
+   a + b
+should really create new objects.
+Operators like
+   a += b
+really must change the left-hand object.
+
+So assignment of numbers really, really wants to be a value-copy.
+Maybe we use type tags to make it harder to go wrong.
+i.e.
+   a := b
+always makes 'a' a new object distinct from b.
+   a =& b
+always makes 'a' refer to the same object as b.
+   a = b
+is the same as =&, but is not allowed for 'bycopy' objects.
+
+Hmm.. what sort of assignment happens to formal variables in a
+function call.
They should be by reference, but maybe changing the
+reference of a formal parameter shouldn't be permitted, as that could
+lead to confusion?
+
+or maybe......
+
+Instead of two different sorts of assignment, we could have
+auto-dereference objects.
+If a type is strict-once, then there is guaranteed never to be any
+aliasing.  So in
+   a = b + c
+the '+' creates a new object with the sum of b and c, and it is
+assigned by reference to a.  As there is no risk of an alias, no
+'copy' is needed.
+But on
+   a = b
+the object has a reference already, so to avoid counting the reference
+we make a new copy.
+So what happens in a function call where we want a by-reference
+parameter?
+We must be able to override the auto-copy, but we still don't want
+reference counting.  So we need to know that one reference will
+outlive the other.  In that case we use '&' to make a reference.
+For a function call, it is clear that the actual parameter reference
+lasts longer.
+
+For a = &b
+we can probably use data-flow analysis to know when the object is
+destroyed.
+What about
+   a[5] = &b;
+   a[6] = &b;
+   a[n] = 4;
+
+Hmmmm..
diff --git a/Numbers-impl b/Numbers-impl
new file mode 100644
index 0000000..350ced8
--- /dev/null
+++ b/Numbers-impl
@@ -0,0 +1,20 @@
+
+Implementation of numbers in the language interp/compiler
+
+I really like Go's typeless constants, i.e. compile-time numbers are
+arbitrary precision.  It is only when we write them into the executable
+that a size must be set.
+
+So we need a data structure that can hold any number, whether integer
+or float, with complete decimal or binary precision.
+A large integer together with large binary and decimal exponents would
+do it.
+How large do we make the integer?  An array of integers?
+
+How precise do we make division?  Go doesn't try too hard.
+
+We could keep a rational number.  So large-int for numerator and
+denominator, and 'int' for exponent and power.
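The large-int rational idea can be tried out quickly.  Below is a toy Python sketch; `Fraction` stands in for the big-int numerator/denominator pair the notes call for (the names `const` and `fits_int` are mine, purely for illustration):

```python
# Toy model of Go-style "typeless" compile-time constants:
# keep every constant as an exact rational, and only check
# range when it is finally written into a sized type.
from fractions import Fraction

def const(text):
    """Parse a literal (integer or decimal) into an exact rational."""
    return Fraction(text)

def fits_int(value, bits, signed=True):
    """Can this exact constant be stored in an integer of 'bits' bits?"""
    if value.denominator != 1:
        return False          # not an integer at all
    if signed:
        return -(1 << (bits - 1)) <= value.numerator < (1 << (bits - 1))
    return 0 <= value.numerator < (1 << bits)

# Exact arithmetic: (0.1 + 0.2) * 10 really is 3, with no float noise.
x = (const("0.1") + const("0.2")) * 10
```

A real implementation would need its own big-int add/subtract/multiply/GCD, as noted below, but the interface - exact arithmetic until a sized store forces a range check - is the same.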
+
+Then we need add, subtract, multiply, GCD, divide for large-int,
+and add, subtract, multiply, divide for scaled rationals.
diff --git a/PLAN b/PLAN
new file mode 100644
index 0000000..54bc4cc
--- /dev/null
+++ b/PLAN
@@ -0,0 +1,7 @@
+
+learn erlang.  And haskell?  and write some code in Go
+
+   http://learnyouahaskell.com/chapters
+
+understand what a MONAD is.
+
diff --git a/Parsing b/Parsing
new file mode 100644
index 0000000..7e2878c
--- /dev/null
+++ b/Parsing
@@ -0,0 +1,1231 @@
+Apr 2013
+
+What do I remember about how parsing works?
+And why don't I just use yacc?
+A: because I didn't invent that :-)
+
+We have productions from a non-terminal to a list of
+terminals and non-terminals.
+
+We have an input stream of terminals.
+Parsing involves a stack of symbols and two operations: shift and reduce.
+
+"shift" takes the next terminal from input and pushes it onto the stack.
+"reduce" pops a sequence of symbols off the stack which match the
+right hand side of a production, and replaces them with the non-terminal.
+
+On every "reduce" action some parsing code is run which typically
+gathers important info from the symbol list and attaches it to the
+non-terminal which is pushed.
+
+We finish with "program" as the only non-terminal on the stack,
+an empty input stream, and a full abstract syntax tree attached to the
+"program".
+
+OR an error.
+
+To choose between "shift" and "reduce" we use a state machine.
+Each state corresponds to a point between two symbols in a production,
+though multiple states may well get merged.
+For each state there is a list of terminals which should be shifted
+and a list of terminals that should trigger a 'reduce', together with
+the production that should be reduced.  Each of these also determines
+the next state.
+
+Actually not.
+We have: a stack of "state" plus anything the actions might want.
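The shift/reduce mechanics can be shown on one tiny grammar.  A toy Python sketch (not the table-driven machine described next): it reduces greedily whenever the top of the stack matches a production body, which happens to be safe for this particular left-recursive grammar, though in general the state machine is needed to make the choice:

```python
def parse(tokens):
    """Shift-reduce parse for the toy grammar  E -> E '+' 'n' | 'n'.
    Returns the AST as nested tuples."""
    stack = []                        # list of (symbol, attached value)
    tokens = list(tokens) + ['$']     # '$' marks end of input
    i = 0
    while True:
        # reduce whenever the top of the stack matches a production body
        if len(stack) >= 3 and [s for s, _ in stack[-3:]] == ['E', '+', 'n']:
            (_, left), _, (_, right) = stack[-3:]
            del stack[-3:]
            stack.append(('E', ('+', left, right)))   # E -> E + n
        elif stack and stack[-1][0] == 'n':
            _, value = stack.pop()
            stack.append(('E', value))                # E -> n
        elif tokens[i] != '$':
            stack.append((tokens[i], tokens[i]))      # shift
            i += 1
        else:
            break
    assert [s for s, _ in stack] == ['E'], "parse error"
    return stack[0][1]
```

Note how the left recursion makes '+' associate to the left: `n + n + n` comes out as `('+', ('+', 'n', 'n'), 'n')`.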
+
+Actions: given a state and a lookahead, determine the action:
+  - shift: newstate
+  - reduce: non-terminal, count, action
+  - accept: all done
+
+When a reduce happens, we look up a separate "goto" table
+  goto: state + non-terminal -> state
+
+We create these tables in 4 steps using some intermediate tables.
+build_first
+  create the 'first' map which lists all the possible 'first' terminals
+  for each non-terminal
+build_follow
+  create the 'follow' map which lists all possible following terminals
+  for each non-terminal
+build_itemsets
+  create itemsets, mapping each 'itemset' to its closure.
+  Each 'itemset' corresponds to a 'state' above.
+  An 'item' is a production plus an index: 0 for the start, 1 after the
+  first symbol, etc.
+  An 'itemset' is a collection of 'items' that can all exist at the same
+  point.
+  The 'closure' is a list of nonterminals and terminals which may follow
+  this point, i.e. all that follow immediately, plus the first for each,
+  plus the first of each of those, etc.
+
+  This involves building the 'goto' table
+
+build_actions
+
+  - if [A -> alpha . a beta] is in Ii and goto(Ii, a)==Ij,
+    action[i,a] is "shift j"
+  - if [A -> alpha . ] is in Ii then
+    action[i,a] is "reduce A -> alpha" for all a in FOLLOW(A)
+
+build_first:
+  Each terminal may only start with that terminal.
+  Each non-terminal that can produce an empty string gets 'Empty' added.
+  This is repeated until no more can be added (as setting Empty might
+  allow other non-terms to produce empty).
+  Then for each production, and for each symbol until a non-empty one is
+  found, add the first of each symbol to the first for the head.
+  Then repeat until no change.
+
+build_follow:
+  build_follow requires the first() mapping from symbol to terminal sets.
+  Initially, follow() of everything is empty, except follow(Start) is End.
+  Then for each production:
+    for each symbol but the last, everything in first(next-symbol) except
+    Empty goes into follow(this symbol);
+    for the last symbol, everything in follow(head) goes into follow(last),
+    and if first(last) contains Empty, the same applies to the symbol
+    before it;
+  or something like that.
+
+build_itemsets
+  the first itemset has one item: "START 0".
+  for each new itemset, we add the closure and build_goto.
+  if build_goto added itemsets, we add their closure and build their goto.
+
+  To close an itemset:
+   - collect lists of the terminals and nonterminals which are 'next'
+     in each production
+   - for each non-terminal, add 'first()' ???
+     the code in LRparse.itcl looks confused - or I am.
+
+  Having created the closure, we create the goto for the itemset plus
+  each symbol in the closure.
+  For an itemset+symbol, the 'goto' is the set of all items which are
+  one step forward over *this* symbol.  This might be a new itemset,
+  which will need a closure and gotos.
+
+
+So we assign a number to each terminal and a higher number to each
+non-terminal.
+We need to be able to store sets of these numbers.  This would be an
+auto-grow array for which the first part is sorted.  Every time (say)
+8 entries are added, we sort the new entries in.
+
+We also assign an index to each production, which is a list of a head
+and several body parts.
+An 'item' then is 2 numbers: production+offset.
+An 'itemset' is just like the terminal sets, but fatter.
+Each itemset gets assigned a number.
+A 'goto' is a map from itemset plus symbol (2 numbers) to another itemset.
+
+'first' and 'follow' are arrays pointing to termsets.
+In fact there is probably one array for symbols where each has
+name, first, follow, first_production.
+'itemsets' needs content-lookup so could just use linear search, or
+occasional sort.
+
+'goto' is a separate small table for each itemset which can be created
+sorted.
+
+'actions' is a per-itemset list which is created sorted.
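The two fixed-point computations above are easy to get backwards, so here is a small Python sketch of both (grammar given as a dict from head to bodies; 'Empty' marks nullability, '$' is the end marker; names are mine):

```python
def build_first(grammar):
    """first() by fixed point: terminals start with themselves;
    a nullable body contributes 'Empty' to its head."""
    first = {}
    def f(sym):
        return first.setdefault(sym, set() if sym in grammar else {sym})
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                add = set()
                for sym in body:
                    add |= f(sym) - {'Empty'}
                    if 'Empty' not in f(sym):
                        break
                else:
                    add.add('Empty')     # every symbol was nullable
                if not add <= f(head):
                    first[head] |= add
                    changed = True
    return first

def build_follow(grammar, first, start):
    """follow() by fixed point: first(next) flows into follow(sym);
    follow(head) flows into follow() of the last (and nullable-tail)
    symbols."""
    follow = {head: set() for head in grammar}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue         # terminals need no follow set
                    rest, nullable = set(), True
                    for nxt in body[i + 1:]:
                        rest |= first.get(nxt, {nxt}) - {'Empty'}
                        if 'Empty' not in first.get(nxt, {nxt}):
                            nullable = False
                            break
                    if nullable:
                        rest |= follow[head]   # head's follow leaks in
                    if not rest <= follow[sym]:
                        follow[sym] |= rest
                        changed = True
    return follow
```

For the classic grammar `E -> E + T | T; T -> n | ( E )`, this gives `first(E) = {n, (}` and `follow(E) = follow(T) = {+, ), $}`.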
+
+=================================
+Errors:
+
+grammar errors are "shift-reduce" conflicts and "reduce-reduce" conflicts.
+
+shift-reduce are resolved by shifting, but cause a warning.
+reduce-reduce are resolved arbitrarily I guess.  Best to avoid them.
+
+However we can use precedence to resolve these conflicts.... nearly
+as easy to be explicit about different nonterminals.
+
+Parse errors happen where we can neither 'shift' nor 'reduce'.
+
+We can discard unexpected symbols until we recognise something, but
+that may be never and might discard too much.
+Or we can reduce prematurely so we are likely to expect more significant
+symbols.
+What we actually do is replace an unexpected symbol by 'error' which
+should let us reduce.... don't really know how this is meant to work.
+We want to shift some and reduce some but how much of each?  Maybe
+replace each bad symbol by 'error' ... but then we lose stuff.
+Maybe check each state on the stack to see if anything recognises the
+next symbol.  If not, discard it and try again?  Or something.
+
+So an error production is:
+   A -> error term
+meaning that in the context of A, if we find an error, we look for 'term'
+and decide that is a (bad) A.
+So if we get an unexpected terminal, we pop states until one
+allows us to shift an error, then shift the error, then discard terminals
+until one is recognised.
+
+
+================================
+Lexer.
+hand crafted, which "knows" about
+  identifiers/reserved words
+  numbers
+  symbols
+  newlines
+  end of file
+  indentation
+  strings
+  comments
+
+Gets the list of reserved words from the grammar, and the list of symbols.
+
+ident/reserved: _alpha followed by _alphanum
+numbers: digit followed by digits, periods, a-z, and +- immediately
+  following an 'e' or 'E'
+newlines are .. newlines
+strings are: ' up to next ' on same line, " up to next ", 3 of these up
+  to the next 3 on same line, or some multiline thing with no indent.
+comments are: // to end of line, or /* to */ but without another /*
+
+symbols: any ASCII after space and before delete other than digits,
+  alpha and quotes, i.e. anything left.  Unknown symbols are the
+  "unknown" token.
+
+Grammar is:
+
+nonterm -> nonterm
+     foo += bar "$1 $2 - args for HEAD struct"
+   | something else ;
+
+Any identifier that does not appear on the LHS is a literal.
+Any symbol sequence is a symbol.
+So as we parse we add each word into the symbol table.  As we
+find productions we mark some of them as non-terminals.
+The first non-terminal is the 'start' symbol.
+
+This builds an abstract syntax tree.
+
+Each node in the tree has a name (the symbol) and either:
+  for non-terminals: a list of other nodes
+  for terminals: the id and value
+
+Each symbol on the rhs can be followed by a number which indicates where
+in the symbol list of the lhs this symbol should go.  1 is first, 2 is
+second.  0 has a special meaning: the list of symbols in the parent is
+set to the list of symbols in this child, then others are appended.
+
+===============================
+How do we handle fine indents in parsing?
+
+Each token carries not only the token name/value, but also
+the line-num, char-num of where it starts, and the indent size
+of that line.  A 'reduce' ensures all tokens have compatible indents.
+If the first token is a terminal, it is ignored (maybe).
+
+Indentation changes result in 'CLOSE' tokens
+   statementlist -> statementlist statement statement_end
+                  |
+   statement_end -> ;
+                  | ; CLOSE
+                  | CLOSE
+No, that doesn't work - it could swallow 'CLOSE's.
+I suspect we need OPEN and CLOSE to be paired.
+Or we want 'SAME INDENT' and 'SMALLER INDENT' to be distinct.
+Python uses INDENT and DEDENT and NEWLINE, which is sometimes ignored.
+
+Maybe:
+  If a newline is found where it isn't expected, it is quietly ignored.
+  If it is followed by an indent, it and following newlines to a dedent
+  are also ignored.
+Then + block -> : NEWLINE INDENT statementlist DEDENT + | { statementlist } + statementlist -> statementlist statement ; + | statementlist statement NEWLINE + | statementlist statement ; NEWLINE + +================================== +Just clarifying how the states in the parser work. + +A 'state' is a collection of points in different productions that are +all equivalent in some way. + +We have a stack. Every item on the stack is a pair of a 'state' and a +'frame'. A 'frame' is effectively the abstract syntax tree for some +symbol which lead to that state. + +The two main parsing operations are: + SHIFT(state): This pushes the given state onto the stack together + with a frame built for the terminal. + The given state is one step further along one or more productions from the + state that was on top of the stack (and is now second from top). + + REDUCE(production): This pops one entry from the stack for each body + part in the production and combines the frames as determined by the + production to product a new frame. + It also looks up the head in the goto table for the top-of-stack + state and makes that the new state + +So in each case, a state plus a symbol founds goes to a new state. + +So given a state we examine each production and look at each following +symbol. +For each unique symbol found, we create a state which is one step +forward in each production over precisely that symbol. +To this state we add the '0' start for all productions for all +non-terminals which could come next + +Once we get a new state, we create lists of the terms and non-terms +which come next in all the different productions. Then for each +nonterm we add all terms and non-terms at start of productions +for that nonterm. Isn't this just 'first' all over again? No, +because 'first' skips over 'empty' nonterminals, but we don't want to here. +This pair of symsets then lives with the itemset and are used to build +the goto table and to expand the itemsets to a full state. 
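The itemset completion and per-symbol goto just described can be sketched in a few lines of Python (an 'item' here is a (production-index, dot-offset) pair, matching the "production+offset" encoding below; function names are mine):

```python
def closure(items, productions):
    """Complete an itemset: whenever the dot sits before a nonterminal,
    add the '0' item for every production of that nonterminal."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for prod, dot in list(items):
            head, body = productions[prod]
            if dot < len(body):
                nxt = body[dot]
                for i, (h, _) in enumerate(productions):
                    if h == nxt and (i, 0) not in items:
                        items.add((i, 0))
                        changed = True
    return frozenset(items)

def goto(items, sym, productions):
    """All items one step forward over exactly this symbol, closed."""
    moved = {(p, d + 1) for p, d in items
             if d < len(productions[p][1]) and productions[p][1][d] == sym}
    return closure(moved, productions)
```

Using `frozenset` for the completed itemset gives the stable, hashable identity the notes want for "is this itemset already known?" lookups.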
+ +So each itemset needs a set of items and (computed from that) a set of +following symbols. We need to be able to quick check if a given +itemset is know. +The set of following symbols is used partly to help complete the +itemset, partly to build the goto table from that itemset. So we +don't need to hold on to it if we build the complete itemset. +So after we complete the itemset and build the goto table for it, +an itemset is just a list of items. + +However: completing an itemset is a lot of work, and if we have +already found it, that is pointless. So we want to check against +known itemsets before completion. So we need to be able to see the +non-completed form of existing itemsets. +The completion of an itemset involves adding "X,0" items, All other +items are non-zero. So as long as we get the encoding right, it +should be easy to just compare the prefix. + +itemsets need some sort of stable identity to be stored in the goto +tables, so we need to store a number in each itemset. +How do we provide lookup? I could use a skip list... yes. + + + +--------------------------------------------- +I've changed my mind ... I think. +I don't want the grammar processing to be in the same program as the +grammar matching. It is kind-a neat having just one program and being +able to fiddle the grammar and re-run it. However there are impedance +mismatches. +1/ symbol numbering. The matcher needs to 'know' the numbers of + symbols as compile time constants. The could be done by having the + lexer assign known numbers to known names while parsing the grammar. +2/ AST types. It requires a single data type with a simple list of + children. This is overly simplistic and makes the matchers + unnecessarily messy +3/ AST management code. This probably has to change with the grammar + anyway and keeping it separate doesn't gain anything but loses + much. 
+
+If we had an interpreted language with introspection and 'eval', then
+it might make sense, but I don't really want that sort of language.
+
+So: Have the grammar processing output some C (or later 'plato') code
+to do the parsing.  The grammar file can then define the ast data
+types and the per-production code.
+So we need some syntax to embed code and grammar in the same file.  It
+would be best if it looks like C to emacs so I can edit it easily.
+
+Each production needs to know the type (always a struct I think) of
+the frame for the head and some code to run.
+
+The code has '$N' where variables can be substituted.  $0 means the
+HEAD frame.
+Old frames are (probably) freed so we need to move whatever is
+needed out.
+
+The generated code needs to include the go_to table which maps
+state to token to state, and the action table which maps
+state to terminal to action, where action is ACCEPT, SHIFT or
+REDUCE(n) where n is a production.  The table of productions lists
+body size, token (to feed to go_to), and code fragment.
+
+gram file contains:
+
+#include lines
+#define lines
+enum lines
+struct lines
+static lines
+
+$$$$$ to mark start of grammar
+
+head -> token token ${ code }$
+
+#include lines go in .c
+#define lines go in .h
+struct lines go in .h
+other lines go in .c
+
+#define TOK_token N
+  gets put in .h
+
+func define for parser goes in .h
+
+"known" table goes in .c
+go_to as an array of pointers to arrays of pairs of integers, with a
+length.  So we index state into go_to and then binary-search for the
+token.
+
+action table is an array of numbers
+  -1 for accept
+  -2 for reduce
+  (N*256 + body_size) for a production.
+
+production table is a switch statement which includes the code
+and assigns the result token.
+
+Wait - do I care about token names in the C code?  I probably don't.
+If I want to use similar things I can create my own defines.
+So drop the TOK_token stuff... though it might make the generated
+C file more readable.
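The table layout above can be mocked up quickly (in Python rather than the generated C).  The packing below is only my reading of the notes - -1 for accept, prod*256 + body_size naming a reduction - and the binary-searched go_to row matches the "sorted pairs, search the token" description:

```python
import bisect

ACCEPT = -1   # sentinel action value, as sketched in the notes

def encode_reduce(prod, body_size):
    """Pack a reduction as prod*256 + body_size (body must fit a byte)."""
    assert 0 <= body_size < 256
    return prod * 256 + body_size

def decode(action):
    """Unpack an action number back into a tagged tuple."""
    if action == ACCEPT:
        return ('accept',)
    return ('reduce', action // 256, action % 256)

def goto_lookup(row, token):
    """A go_to row is a sorted list of (token, state) pairs;
    binary-search the token, returning None if absent."""
    i = bisect.bisect_left(row, (token,))
    if i < len(row) and row[i][0] == token:
        return row[i][1]
    return None
```

The per-state rows stay small and sorted, so the C version can be a plain array of (length, pointer) pairs with the same binary search.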
+ + +------------ +I need to be able to capture raw input text that matches some +sequence of symbols. This will be used to copy code out of the +grammar file and into the code or include files. +It is important to capture spacing and comments as well as the +functional bits. +Probably best to turn on capture when we see a newline, then if we +decide we don't want that, we can turn off and discard it. +So 'capture' collects every character processed which will typically +be leading spaces and the target token. + +When capturing C code we count '{}' and if we find something we don't +recognised, we add it to the previous stanza. +Or maybe have '$$$$$' tokens to differentiate include from C code. + + +------------------------------------- + +What about indents and newlines? +If I find a newline that I'm not expecting, I can ignore it but +require future tokens to be indented more than the token which +started the current production(s). But I don't really know which +are the current productions until we reduce, by which time it might be +a little late to report an error. + +With each frame on the stack we record the indent of the line +which started that nonterminal. On each 'shift', we ensure +the new symbol has at least the same indent. +On reduce, we set the new indent based on first token reduced. + +But that isn't the whole picture. +I think there are two different sorts of regions. +In one, we don't really expect newline, we are happy for them to +appear anywhere, as long as indenting never goes backwards. +In the other, we do expect newlines, they terminate something +but we allow extras as long as they are *more* indented than before. +In effect the extra indent in a line continuation and hides the +newline. + + if a and b : + then C + +that was expected and the indent is structural + + if a and + b: + C + +First one not expected but due to indent, is ignored. +Second one is expected and indent required and structural. 
+Intent relative to 'if' line + + if a and b : C + d e + +The newline was possible as a terminator, so the next indent +becomes a line continuation. That could be a bit confusing, +so we probably don't want to allow that? + +Each newline gets parsed as either + INDENT - if it is followed by an increasing indent + NEWLINE- if it is followed by the same indent + UNDENT+- if it is followed by a reduced indent. + +If an indent is not expected, it and all newlines and indents +and matching undents are ignored. +If newline or undent is not ignored or expected, it is an error. + +That would make: + if asdad + { + stuff + } +an error because that first NEWLINE isn't expected or ignored. +I guess it should be optional before a '{'. + + IF -> if Expression NEWLINE { Statementlist } + | if Expression { Statementlist } + | if Expression : INDENT Statementlist UNDENT + + +Maybe this: + - If we see an unexpected INDENT, we hold it and continue. + Next SHIFTed symbol gets it attached. + - when we reduce all the attached indents get collected and + then attached to the nonterm when it is 'shifted' in. + - If we see an unexpected UNDENT, we hold it and continue. + Any shift while we hold an UNDENT is an error (how to handle?) + unless it holds matching INDENTS (i.e. is a reduced nonterm). + +That is promising but doesn't explain newlines and allows: + a = b; + c = d; +Why is this a problem? It could be a single line, but is wrapped. +I think it is a problem because a NEWLINE was allowed after the ';'. +So it doesn't really work as one line. + a = b + ; c = d; +is more messy. Newline is allowed there too. + a = + b ; c = d +That makes sense. newline wasn't permitted. + a = b + + c ; d = f; +That is OK. Why the difference? INDENT is attached to '+' which +is not the start of the production that swallows it... actually it is. + statementlist -> statementlist statement +the 'statement' is 'd = f;', + was swallowed into statementlist. +Maybe that means it is bad. 
+So: When we reduce a production, and indents that are not on the first + symbol must be matched by pending undents. + Any indents that are on the first symbol must be carried to parent + +This is getting there, but doesn't work for +{ + foo; +} + +as shifting the last '}' will be an error as an undent is +pending.... No, that is the old rule. +Current rule is: + + - If we see an unexpected INDENT, we hold it and continue. + Next SHIFTed symbol gets it attached. i.e. we read the next + symbol and mark it as indented. + - When we see an unexpected UNDENT, we keep count of it and continue. + It is not attached (yet). + - When we reduce a production, Any indents attached to the first + token are moved to the created token. Any indents on other tokens + must match saved undents, and are canceled. If there aren't + enough saved undents, then it is an error. This means that if + a nonterminal continues onto more lines, it must end the last line. + So a a ; b b ; c + c ; d d + is not ok, but + a a ; b b ; c c ; + d d + is as is + a a ; b b ; c + c ; + d d + That seems to make sense. + + Only I want to revise the above again. + - If we see an INDENT or UNDENT which has no action, we count it, + keeping a single count which is +ve or -ve or 0. + - When we shift in a terminal, it collects the current count. + - When we reduce a production, the count on the first token gets + attached to the resulting nonterminal. The counts on the + remaining tokens must be non-negative, and there must be enough + excess UNDENTS to counter them. + + But what about NEWLINEs? + If there is an action for it, we just accept it and eventually + shift it. + But what if no action? + + 1/ + a = [ 1, 2, INDENT + 3, 4, NEWLINE + 5, 6, UNDENT + ] + + Actually, that isn't an UNDENT, it is an UNDENT INDENT if + anything. But in any case the NEWLINE is allowed and ignored + + 2/ + if true { INDENT + a = NEWLINE + b + c UNDENT + } + + First newline is confusing and should be illegal. But why? 
+ However is that different to '1'? + Maybe because NEWLINE is expected in a little while? + + 3/ + if true { a = + b + c; + d = e; + } + + is even more like 1, and even worse than 2. Even with ';'s so no + reason to expect newlines soon, it looks bad. But here it is the + indent which is bad, not the newline. + + Is it because '1' has "a -> a b" (first token is common) while + bads is "a -> b c d" - different token. + + Maybe we need to flag non-terminals which are, or are not, allowed + to have unexpected newlines. + So a statement-list may not, but other lists may. + + If a production contains a newline, then the head may not contain + unexpected newlines. But how would we know? And indents are still + allowed. So if the stack contains indents.... + + a = + b + c + is bad, but + a = + b + c + is fine. + a = + b + + c + is also fine. + + New attempt. + + When we see an INDENT, UNDENT, or NEWLINE which would otherwise be + an error, we record it. We should never end up recording an indent + and undent at the same time. Probably not a new line either. + When we SHIFT a terminal, we attach to it in the current recorded + state, which is a simple number: 1 for NEWLINE, 2 for INDENT, -N for + UNDENT, 0 for none of the above. + When we REDUCE a production: + Any state on the first symbol is carried up to the resulting + nonterminal. + Any other INDENTS must cancel with following UNDENTs. They can + follow either in subsequent symbols, or in the current (unshifted) + state. + There must be no uncanceled UNDENTs on any symbol, but there can + be in the state. These are carried forward. + Any NEWLINEs between an INDENT and UNDENT are canceled together with + the INDENTs. + If there are any extra NEWLINES, and a NEWLINE has an ACTION for + the current state, then that is an error - stray newline, indent + expected. + + That seems good, though will need to be tested. Remaining question: + Is an undent less than the previous indent allows? 
How is it parsed + and what does it mean? + if cond1 and (cond 2 + or cond 3) + and cond 4 + + This could be an UNDENT followed by an INDENT. + However this fails the above rules. We would reduce + "cond 2 or cond 3" which contains an INDENT before we see an UNDENT. + So we would have + '(' Cond-with-indent ')' [undent pending] ['and' is lookahead] + + Maybe each symbol gets a 'leading indent' and 'internal indent'. + Internal indents can be canceled against undents and hide newlines. + Leading indents must wait until they become internal. Only then can + they cancel newlines and undents. + + + I'm not sure about combining these things. Might need 2 numbers. + Any newline after an indent is ignored. So there is no point + keeping both and indent and a newline on a symbol - just the indent + is enough. + Undents can only ever be 'internal'. Intents can be leading or + internal. + So we have 'leading indent', 'internal undent', 'internal indent'. + where indent can also be newline. + When we shift in a terminal, it can have undent or indent. It + cannot have both. If there is an indent, the undent must have been + canceled. + When we reduce, indent/newline/undent is canceled. If there is an + undent followed by an indent, that is an error. so a symbol can + never carry both an indent and an undent. + If it carries and undent, it is probably a closing bracket. + if a { + b + } + '}' carries an undent which balances the indent on 'b'. + What about: + if a: + foo + else: + bar + + The undent doesn't get attached to 'else' as indenting syntax is + explict here. + How about: + if (a) + { + foo; + } + else + { + bar; + } + + 'else' sees an undent and a newline. The undent balances the + indent in the 'block' so it only carries the newline. + + But ... the 'else' - in both cases - gets a newline, and in the first + it isn't allowed because it is at the same level as newlines being + recognised terminators. 
But in this case the 'then' *must* be on
+   the next line because we just had a line-oriented block.
+   Maybe we just have to 'expect' that newline.
+
+   Anyway, a symbol never carries both an indent and an undent.
+   Can it carry a newline with either of those?  Any newline 'inside'
+   the dent will get ignored, so the interesting possibilities are:
+      newline indent
+   and
+      undent newline
+
+   I think 'newline' can be leading while 'indent' is internal.
+   However I don't think the other possibility is valid.
+
+
+   'leading_nl' is one of "nothing", "newline", "indent".
+   There can only be a single indent.
+
+   'internal_nl' can be a number of undents, or a number of indents,
+   or nothing.  There are no 'internal newlines'.  If they were
+   in a dent, they were ignored.  If they weren't, then based on the
+   production they were either ignored or caused an error.
+
+   So to process a production, we first move the leading nl out
+   of the way.  Then we look at leading, then internal, for each node.
+   If leading is:
+      nothing: do nothing
+      indent: increment current indent
+      newline: if indented, ignore.  if outdented, error.
+               If the nonterm allows it, ignore, else error
+   Then add 'internal' to the current indent.  However once a negative
+   indent has been seen, no further positives are allowed.
+
+
+(09may2013)
+Damn - I've broken something.
+My grammar has "NEWLINE" at end of statement, but the parser returns
+"UNDENT" which isn't recognised, so it is swallowed.
+So I think I really want every NEWLINE to be visible.
+So a same-level newline is "NEWLINE".
+A greater-level newline is "NEWLINE INDENT".
+A lesser-level newline is "NEWLINE UNDENT+ INDENT?"
+
+If a grammar rule can be ended by a newline it just puts NEWLINE at
+the end.
+If a grammar rule requires an undent to close it, then it must
+also list an indent:
+   IF -> if EXPR : INDENT commandlist UNDENT
+
+The newline before the INDENT gets held, then destroyed by the INDENT.
+The newline before the UNDENT is probably matched by the commandlist,
+otherwise it is held, then destroyed by the UNDENT.
+So we are moving NEWLINE-destruction from the lexer to the parser.
+
+So the pending dents are NEWLINE then UNDENT then INDENT or NEWLINE.
+The first NEWLINE is destroyed by a following INDENT or UNDENT. And an
+UNDENT cancels a previous INDENT, but an INDENT doesn't cancel an
+UNDENT. A newline following an INDENT is impossible for a terminal.
+So we can have UNDENT INDENT-or-NEWLINE where UNDENT can have a count.
+
+These are then attached to the next thing shifted, except that there
+cannot be both an UNDENT and an INDENT when the SHIFT happens, else we
+error with an unexpected INDENT.
+
+So a shifted terminal has an undent count and a possible newline, or no
+undents and a possible indent or newline.
+
+When we reduce a series of symbols to a nonterminal:
+ - the tags on the first symbol are carried to the non-terminal
+ because they come completely before it.
+ - remaining indents and undents must match, or must be able to
+ cancel against pending undents, including the look-ahead symbol.
+ - and newlines between indent and undent are ignored.
+
+ Arg.
+ There is a thing which is one of
+ undents
+ undents newline
+ indents
+ newline
+ -nothing-
+
+ Each symbol has 2 of these: leading and internal.
+ A Terminal always has -nothing- for the internal. These are
+ collected from previous reductions and shifted ignored symbols.
+ A non-terminal gains them in a reduction.
+ The leading thing of a new nonterminal is the leading thing of the
+ first reduced symbol.
+ The internal thing of a new nonterminal is a combination of the
+ leading and internal things (in that order) of the remaining
+ symbols, together with some undents that are pending, or in the
+ lookahead.
+ As we merge the remaining things:
+ - indents may not follow undents
+ - newlines following an indent and before the next undent are
+ discarded.
+ - indent then undent are discarded.
+ - newlines with no indent are either discarded or cause an error
+ depending on context
+ So the result for the internal of the new nonterminal is either
+ - some number of undents
+ - some number of indents
+
+ When we see an unexpected newline or indent it goes on the next
+ terminal.
+ When we see an unexpected undent it goes on the current
+ top-of-stack as internal.
+
+ So:
+ pending: 0 = nothing
+ 1 = newline
+ 2 = indent
+ leading: same
+ internal: count, -ve undents, +ve indents
+
+
+I'm getting close.
+ Stray indents must be disallowed in the same places where stray
+ newlines are.... or something.
+ And how do I know where these stray things are allowed?
+
+ Hypothesis: There are two sorts of nonterminals - horizontal
+ and vertical. Horizontal nonterminals are expected to be all on
+ one line but can use indents for line continuation. Their
+ productions do not contain a NEWLINE symbol.
+ Vertical nonterminals are expected to run vertically. Their
+ productions do often contain newlines, possibly indirectly.
+
+ A horizontal nonterminal may contain an indent, but not an
+ unindented newline.
+ A vertical nonterminal may contain newlines but no indents.
+
+ a = [
+ 1, 2,
+ 3, 4,
+ 5, 6,
+ ]
+
+ The list of numbers is horizontal and starts with an indent, so it is
+ safe.
+ The [list] structure is also horizontal and contains an indent.
+ The second indent is legal because it is horizontal.
+ if a {
+ a; b;
+ c; d;
+ }
+
+FOUND A MISTAKE. The first token in a production does not necessarily hold
+the first terminal. It could have an empty production. That is easy to
+handle: reducing an empty production collects the pending stuff.
+
+
+double-damn! I do need internal newlines after all - maybe.
+
+ {
+ a =
+ 1
+ }
+
+That indent is used by the statementlist, which is vertical, so the newline
+in the binding, which is horizontal, isn't protected by the indent and so
+is not allowed.
+i.e. the indent at the start doesn't protect things.
+ type str = {
+ a: int;
+ b: char;
+ }
+
+More thoughts next morning:
+ - Newline before indent needs to be invisible, else the indent
+ cannot effect line continuation DONE
+ - hidden undents might blur with expected undents. Need to decide if
+ care is needed: almost certainly it is
+ - I think the horiz/vert distinction is on productions, not nonterms.
+ A production "A -> A ... " is vertical, others are horizontal.
+ Maybe 'headed' and 'headless' are better terms.
+ A 'headless' production can contain unindented newlines, but not
+ unexpected indents. A 'headed' production can be indented but
+ must not have an unindented newline.
+
+And that afternoon....
+ a = 1 +
+ 2
+is now complaining because it sees an indent in "1+2" which is
+ Term -> Term + Factor
+so is vertical and doesn't like the indent. But the indent is really
+in "a = 1 + 2" which is horizontal so permitted.
+How do we express that?
+
+A production that starts with an indent or newline can contain
+a sequence of newlines (if vertical) or an indent and a sequence
+of newlines (if horizontal).
+
+A production that does not start with a Newline or Indent is deemed to
+be horizontal, so it can contain an indent and some newlines.
+
+A production which contains or starts with an indent which is not
+canceled by a trailing undent passes that indent up to its parent.
+
+So an indent must ultimately be on the first symbol of a recursive
+(vertical) production, or it must be the first indented symbol in a
+non-recursive (horizontal) production.
+
+Do I really need internal indents?
+
+A production starts:
+ - on a newline (same indent as something earlier)
+ - on an indented line
+ - in the middle of a line
+Based on this and its recursive nature we classify it as horizontal or
+vertical (h if middle or non-recursive; v if non-middle and recursive).
+The first non-middle token in the body must start with an indent or
+newline corresponding to the horiz or vert status.
Subsequent non-middle
+tokens may only start with a newline.
+Any non-initial indent in a body must be closed before the production
+is reduced. So a nonterm can only carry an initial indent.
+
+
+ a = 1 +
+ 2 + 3
+ + 4
+
+ a =
+ 1 + 2
+
+
+Wrong again!!!!
+Once we see an unexpected indent, future newlines are
+ignored. Sometimes. Until we see the undent.
+
+Can we have multiple unexpected indents? Yes - if we are never expecting
+any.
+So we have a count of unexpected indents. When we see one we expected
+we push the counter with the indent and restore it when the undent
+arrives.
+
+If the pending production is horizontal, newlines are ignored. If
+vertical, newlines are recognised.
+
+(next morning - 11may2013)
+
+So when we see a NEWLINE which the grammar allows, how do we know
+whether to SHIFT it or SKIP it?
+We SKIP it if it is or will be part of a production that contains an
+indent line continuation. But it might be in lots of productions
+concurrently, and the LALR engine doesn't really tell us about
+productions until they are ready to reduce.
+
+Proposal:
+ Require NEWLINEs to appear in the grammar following a nonterminal which
+ is the entire object that they complete.
+ So if any command can be terminated by newline or semi, then
+ Termcommand -> Command NEWLINE
+ | Command ;
+ Commandlist -> CommandlistN
+ |
+ CommandlistN -> Command
+ | CommandlistN Command
+
+ Alternately if the newline is only needed for some commands:
+ Command -> IfCommand
+ | WhileCommand
+ | Assignment ;
+ | Assignment NEWLINE
+ | FunCall ;
+ | FunCall NEWLINE
+
+ Then when we see a NEWLINE that is expected we trace ahead how
+ many stack frames would be absorbed before it is shifted.
+ i.e.
while action(state, NEWLINE) == Reduce(production):
+ if production.len > 0:
+ tos -= production.len
+ state = tos.state
+ state = goto(state, production.head)
+ tos++
+
+ Now if any symbol from 'tos' to the real tos contains an Indent,
+ we Skip the NEWLINE, otherwise we process all those reductions and
+ then SHIFT the NEWLINE.
+
+ Can we deduce this ahead of time so the matcher just does a
+ lookup?
+ The result needed is the stack depth to examine, and that cannot be
+ known ahead of time.
+ So we need the 'production->head' mapping to be more accessible,
+ but we still need to search.
+
+ How does this relate to 'recursive' and 'horiz' v 'vert'?
+ Is the symbol before a NEWLINE therefore horizontal?
+ It shouldn't come up.
+ We could flag every nonterm followed by NEWLINE as 'oneline',
+ and flag all nonterms which produce recursion as 'recursive' and
+ warn if any nonterm gets both flags.
+
+ When we 'Skip' a newline, do we still record it in m.nl?
+ I think not, it would be irrelevant.
+
+ Do we want to search ahead every time we see a newline?
+ We would need to skip to the next expected symbol, then see how
+ many symbols are about to be reduced, and examine those symbols.
+
+ Maybe this leads to a very different approach.
+ INDENT and UNDENT are still the same: attached to the next symbol and
+ passed back.
+ But NEWLINEs are not. When we see a newline, we optionally skip
+ forward to the next expected symbol (or ERROR), then examine
+ how much of the stack we would reduce.
+ If there is a net indent, or a vertical production, we ignore the
+ newline, otherwise we remember the decision and start reducing.
+
+ How does that work?
+ a = b +
+ c +
+ d
+ return
+
+ The first newline appears only as an INDENT and is recorded on the '+'.
+ The second would trigger a reduction back to "Assignment" "a = b +
+ c" which contains an indent (on the '+') so the newline is ignored and
+ the '+' is seen and shifted.
+ The third newline is likewise ignored.
The UNDENT gets queued + and the NEWLINE now sees no net indent and so it isn't ignored. + So the Assignment is reduced, the NEWLINE is SHIFTED and we reduce + a command. + a = [ + b, c, + d, e, + f, g, + ] + x = ... + + First newline is only seen as indent, attached to 'b'. + Second newline is not expected so we peek at 'd'. + 'd' is shiftable as current production is + "listN -> listN , IDENTIFIER" + This 'listN' contains the indent on 'b' as a leading indent, + but as it is a recursive production, that is accepted as an + indent and the newline is ignored. This continues until we have + the comma after 'g' shifted. + Look-ahead is now ']' which triggers a reduction to 'List' + List -> listN + | + Which is not recursive. Though it contains a recursive which + has an indent, so that might be cool + So we only go back until we find a suitable indent. + + { + thing + a = b + (c) + d + + First newline gets attached to start of 'thing' as indent + next terminates 'thing' as it is the next to be reduced + Third reduces a=b to 'assignment' which has no indent, so is not + ignored. + + { + if a { + stuff + } + a = b + + After the '}' we don't expect a newline because we have complete + command. We see a newline so are tempted to ignore it but need to + check the indents. look-ahead of 'a' (like everything else) + reduces + commandlistN -> commandlistN command + which is recursive and has a leading indent, so we are good. + + a = [ + b, c + ,d, e + ,f, g, + ] + + This is weird and probably should be illegal. + The NEWLINE after 'c' is not expected so we peek at ','. + This would be shifted so there is no indent to permit the 'ignore' + and we trigger an error. Probably good. + + So to sum it all up. + + - Lexer returns: + INDENT for a newline with an increased indent + NEWLINE for a newline with the same indent + NEWLINE UNDENT+ {NEWLINE | INDENT} for a newline with lesser + indent. 
+
+ - Parser tracks:
+ - depth of lexed "ignored_indent" - when an INDENT or UNDENT
+ isn't expected by the grammar
+ - number of "unparsed_undents" which is incremented when we
+ accept an unexpected undent, and decremented when we reduce
+ an indent away.
+ - When an INDENT is shifted, "ignored_indent" is attached to it,
+ and then set to zero. "unparsed_undents" must be zero.
+ - When an INDENT leads to ERROR, we tag the next terminal as
+ indented and count the indent in "ignored_indent".
+ - When an UNDENT is seen while "ignored_indent" is +ve, we
+ decrement the UNDENT count and increment "unparsed_undents".
+ However if there is a pending indent waiting to be shifted, we
+ cancel that instead of incrementing "unparsed_undents".
+ - When an UNDENT is seen while "ignored_indent" is zero, we either
+ SHIFT it or trigger an ERROR. On SHIFT we find the previous
+ UNDENT (which must exist) and restore the indent count.
+ - When we reduce a production we check that it has only one
+ indented symbol which may be internal or at the start, and
+ this must correspond to whether it is horizontal or vertical.
+ If it was at the start, the resulting symbol gets the indent.
+ If it is internal, "unparsed_undents" is decremented if it is
+ +ve, otherwise the indent is 'carried forward' to the next
+ symbol shifted.
+
+
+ - When we see a NEWLINE that the grammar expects, we check each
+ production that will be reduced before it is shifted, and if any
+ contains a legal indent we discard the newline, else we accept
+ it.
+ - When we see a NEWLINE that the grammar does not expect we skip
+ it till we find an expected symbol, repeat the 'search
+ reducible productions' process, and throw an error if the right
+ indent isn't present.
+
+if a and
+ b:
+ c
+ d
+ while true :
+
+
+ New idea: end-of-line is different from start-of-line.
+ End of line may or may not terminate a nonterminal.
+ Start of line is one of 'indent' or 'same'.
+ a new line is coded as one of: + Indent + Eol Same + Eol Undent+ Same + Eol Undent+ Indent + + Eol, Indent and Undent can be in the grammar, Same cannot. + Eol is ignored if not expected or a reduce production would contain + an indent. + NO that doesn't work - what would terminate that line? I was using + the newline after the indent, but that doesn't exist any more. + + Try again: + When a newline is ignored by an indent, we count that and ignore + all indent/undents and newlines until the matching undent arrives. + That is parsed as a Newline. This implies no nesting. What + about: + + if a == b && + c == d { + print("Hello"); + print("world"); + } + + The first newline is an indent, so it is ignored but indent is + remembered. + The second newline?? The third shouldn't be ignored. + Maybe the second is in a state where a recursive symbol is + expected. + But so is the first. + if + a == b + && c == d { + print(x); + print(y); + } + + Why is newline after 'b' different to the one after (x); ?? + The one after the 'b' isn't expected. The one after '(x);' + isn't either. So neither newlines are semantic. + If the ';' was missing, the second newline would reduce to a + vertical production, so is ignored. + + What about + a = b + c + * d + +e + ?? + The first newline is followed by indent, so ignored. + The second is at end if indented thing, so ignored. + The third is still indented, so still ignored + After the next undent a newline will fix it. + So maybe we want: { newline undent}* [ newline | indent ] ?? + + +OK, I have a new problem with indents. + + if condition : INDENT statementlist UNDENT else : INDENT statementlist UNDENT + +doesn't work, because it is a horizontal production, but the 'else' is not indented. +Actually I have: + If_Statement -> if Expression Block else Block +and + Block -> : INDENT Statementlist UNDENT + | { Statementlist } + +Also with same productions: + + if expression + { + stuff; + } +doesn't work because the '{' is not indented. 
+ +Can I find a clean way to allow these automatically, or do I need some explicit +tagging in the grammar? +Tagging with OptionalNewline seems to work and probably makes sense. It would be +hard to prevent: + + function foo + (thing) { .... + +at the same time as allowing '\n{' + + +------------------ +Next steps (02May2013) + - Get indenting and newlines worked out. + - create basic structure of grammar. This will be a long way from + final, but creates room to try things out. + + + A file contains declarations: + types, functions, methods, constants, implementations + maybe variables + interface - between modules. import/export etc + init function? + + + Types: array, struct, enum, union? pointer function + parameterized, string int float char Bool + closures + + + Code block: loops, conditionals, break/continue/return + catch, defer? + actions: assign (= := += *= ...), funcall, methodcall + + + expression: conditions, boolean, arithmetics, binary + 'in' + + + bindings, variable declarations, local variables etc. + local functions? are they just closures? + + +----------- +TO FIX +DONE - (" scanner doesn't notice quote +DONE - grammar cannot say "no type of these nonterms + - grammar to synthesise A' symbols +DONE - better tracing - show full stack info + - store 'indents' better. + - need lots of test code. + - use a 'tos' pointer rather than counter. + - GPL notice? + - easier to write 'main' function + - make it easier to handle plain-text files. + +shift-symbol, record-indents, handle linebreaks diff --git a/Rope-11 b/Rope-11 new file mode 100644 index 0000000..c859d44 --- /dev/null +++ b/Rope-11 @@ -0,0 +1,23 @@ + +Rope is my attempt to design a programming language. +The name comes from part of the design philosophy - to give the +programmer "enough rope" to do whatever they like, including hang +themselves. 
Thus the aim is to be very expressive but also to make it
+clear how different expressions are implemented so that the programmer
+has information to guide their choice of rope.
+
+While we avoid saying "You cannot do that", we might make some things
+a bit difficult or ugly so that only the programmer who really wants
+to do it that way will try.
+
+A program file can contain "pragmas" which state that certain
+language features may, or may not, be used. This allows the
+programmer to decide their level of risk. This is a bit like the
+multiple sorts of warnings that gcc can provide, but instead of being
+selected on the command line they are selected in the source code
+file.
+
+
+Rope understands the UCS and converts the source code to that format
+depending on the current locale. Thus any alphanumeric in any language,
+such as ελλενικ, can be used in names. diff --git a/WhyNot b/WhyNot new file mode 100644 index 0000000..0176dab --- /dev/null +++ b/WhyNot @@ -0,0 +1,27 @@
+
+Why not Go:
+
+ - numeric literals are typeless. This is really neat.
+ 1.0 can be an integer in the right context.
+ 500 can be a float.
+
+ However string literals are not.
+ '4' is a Rune literal
+ "4" is a string literal
+
+ Why?
+
+ - value = foo() can throw an error
+ value,err = foo() will instead return an error status in err.
+
+ value = <-chan can block
+ value,ok = <-chan will not block, but ok will say if it wanted to.
+
+ But different things should look different...
+
+
+
+
+Why not Rust:
+
+ - everything is an expression... What about while loops? diff --git a/harmful b/harmful new file mode 100644 index 0000000..c21f97a --- /dev/null +++ b/harmful @@ -0,0 +1,57 @@
+
+ Inheritance considered harmful.
+
+The definition of "inheritance" which I am using:
+
+ Defining one class of objects as an extension of another class of objects.
+
+Why it is harmful
+ Inheritance is harmful because it focuses on subtype polymorphism in
+ preference to parametric polymorphism.
+
+ List of open issues from that workshop:
+ point/colour point
+ linked list/doubly linked list
+ ..
+
+Why it isn't necessary
+ Inheritance is not necessary as it is an implementation convenience.
+ Subtype polymorphism can be expressed without it.
+
+
+Conclusion?
+
+
+
+TAKE TWO
+
+
+
+1/ Inheritance is an implementation technique, not an abstraction technique
+
+1a/ Inheritance is used to effect several different abstractions.
+ These should be available in their own right.
+
+ i) subtype polymorphism
+
+ ii) code reuse
+
+ iii) providing default behaviour
+
+ Benefits of defaults:
+ code brevity
+ code resilience to an extending type, if defaults are provided
+
+2/ Inheritance focuses on subtype polymorphism to the detriment of
+ parametric polymorphism.
+
+ i) Binary operators
+
+ ii) linked lists
+
+ iii) "selftype"
+
+Recommendations:
+ provide pure subtype and parametric polymorphism
+ provide explicit methods of assigning default values/behaviours
\ No newline at end of file diff --git a/lr b/lr new file mode 100644 index 0000000..b93f4a1 --- /dev/null +++ b/lr @@ -0,0 +1,581 @@
+
+Now that I've implemented an LR parser which handles indents nicely I
+want to describe it.
+
+First the basic LR parser. This is nothing new.
+
+There is a set of tokens - primarily numbers. They can have content
+attached but that does not affect the parsing.
+Any difference that affects parsing must make a different token.
+So each reserved word is a token, and each special symbol is too.
+Then 'identifier', 'number' and 'string' are all tokens.
+
+These are "terminals" - anything produced by lexical analysis aka
+scanning is a terminal token.
+
+There are also "non-terminal" tokens. These are the "heads" of
+productions.
+For each nonterminal there must be at least one production, possibly
+several. Each production consists of a "head" and a "body". The body
+is simply a list of tokens, whether terminals or non-terminals.
+
+ If_statement -> if Expression then Statementlist else Statementlist fi
+
+Here I have capitalised the non-terminals.
+What this means is that if the head token is permitted somewhere, then
+the body sequence can be used in its stead. Alternately, if the body
+sequence is found where the head might be expected, then it can be
+treated as if it *was* the head.
+
+
+Parsing a text (a sequence of terminals) against a grammar like this
+involves a sequence of "shift" and "reduce" steps. This happens in
+the context of a stack. Each entry on the stack represents a single
+token, though it may have other state attached.
+It also happens in the context of a 1-token lookahead. As well as
+knowing what is in the stack, we know what the next terminal token is.
+
+A "shift" operation pushes the look-ahead terminal onto the stack.
+This then allows us to see what the next terminal is.
+A "reduce" operation can happen when the last few tokens on the stack
+exactly match the body of some production (and when all the earlier
+tokens on the stack provide a valid preamble to the head of that
+production).
+When this happens we pop off the tokens representing the body, and
+push on the head token.
+So terminals get on to the stack from 'shift' and non-terminals get on
+the stack from 'reduce'.
+
+Eventually a reduce step will push the special 'start' non-terminal
+on to the stack. When that happens, we have finished.
+
+So the first interesting question is: how do we know when to shift and
+when to reduce - and which production to reduce. This is the fun bit.
+
+First we need to introduce the idea of a 'state'. A state is
+effectively some position in some production. So one state would
+represent the point in the above production where 'then' has been
+shifted. In this state we are expecting a statementlist.
+Another state might be after the statementlist, when we are expecting
+'else'.
In any state, we know there are a bunch of tokens on the
+stack, and there is something we are expecting.
+
+Every entry on the stack already mentioned has a 'state' along with
+the token. It is the state *after* that token was shifted or reduced.
+Based on the state of the top element of the stack, we know what to
+expect. Based on the state and the look-ahead terminal, we know what
+to do.
+
+To encode this knowledge we need a few different data structures.
+1/ An action table. This maps from a state plus a look-ahead terminal
+ to an action. The action is one of "Shift", "Accept" (which means
+ the Start symbol has been reduced) or "Reduce-N", where N is the
+ number of the production to be reduced.
+2/ A go_to table, or state-transition table. This maps a state plus a
+ token to a new state. Of the two states mentioned above, the 'go_to'
+ entry for the first state combined with the token "statementlist"
+ would be the second state.
+3/ A production table. For each production we need to know how many
+ tokens to pop, which token to push, and maybe what action to perform
+ when that happens. Often this will extract information from the
+ popped tokens and store it in the pushed token to build up an
+ abstract syntax tree.
+
+
+In simple terms the action of parsing then becomes:
+
+ 1- Set current state to '0'
+ 2- look up current state together with lookahead token in
+ the action table.
+ 2a- if "Accept" - stop.
+ 2b- if shift, push 'state' and 'look-ahead' onto stack
+ 2c- if reduce, find production, perform action, pop appropriate
+ number of tokens, keeping state from first token as current
+ state.
+ Then push head token and current state to stack
+ 3- look up current state and top-of-stack token in go-to table,
+ and set current state to the value found.
+ 4- go to 2.
+
+All we need now are those tables. The production table is easily
+extracted from whatever form the grammar is presented in. The action
+and go-to tables take a bit of work.
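The loop above can be made concrete. Here is a minimal sketch with hand-built tables for a toy grammar (E -> E + n | n); the table layout, names and the grammar are mine for illustration, not the actual implementation:

```python
# Toy grammar:  0: S -> E   1: E -> E + n   2: E -> n
# The ACTION, GOTO and PRODUCTIONS tables were derived by hand
# for this grammar; a real generator computes them from itemsets.
ACTION = {
    (0, 'n'): ('shift', 1),
    (1, '+'): ('reduce', 2), (1, '$'): ('reduce', 2),
    (2, '+'): ('shift', 3),  (2, '$'): ('accept', None),
    (3, 'n'): ('shift', 4),
    (4, '+'): ('reduce', 1), (4, '$'): ('reduce', 1),
}
GOTO = {(0, 'E'): 2}                       # state + nonterminal -> state
PRODUCTIONS = {1: ('E', 3), 2: ('E', 1)}   # number -> (head, body length)

def parse(tokens):
    """Return True if tokens match the grammar, else raise SyntaxError."""
    tokens = list(tokens) + ['$']          # '$' marks end of input
    stack = [(None, 0)]                    # (token, state) pairs
    i = 0
    while True:
        state = stack[-1][1]
        look = tokens[i]
        act = ACTION.get((state, look))
        if act is None:
            raise SyntaxError("unexpected %r in state %d" % (look, state))
        kind, arg = act
        if kind == 'accept':
            return True
        if kind == 'shift':
            stack.append((look, arg))
            i += 1
        else:                              # reduce: pop body, push head
            head, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]
            stack.append((head, GOTO[(stack[-1][1], head)]))
```

So parse(['n', '+', 'n']) accepts, while parse(['n', '+']) raises a syntax error when '$' arrives where an 'n' was expected.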
First we will need a few tools.
+
+Firstly we need to store a set of small numbers. This will be used
+both to hold a set of tokens, and a set of 'items' which will be
+described later. Items are actually pairs of small numbers. These
+sets need to support tests for membership, insertion, union, iteration,
+and an equality test.
+
+Next we need a search table that these sets can be stored in. The
+sets of items need to be indexed so we can check if a particular
+itemset already exists.
+
+Finally we will need some small lookup tables from token numbers to
+small numbers. These will usually be fairly sparsely populated.
+
+Given these we can start to find the states and create the goto table,
+and it is here that we need to understand what an item is.
+
+An 'item' is a production plus a number of body tokens that have
+already been shifted. So "5:0" refers to production 5 with no tokens
+shifted yet. If production 5 had 4 body tokens, then "5:0", "5:1",
+"5:2", "5:3", "5:4" would all be valid items.
+
+"5:0" is a state where we are expecting the body of production 5.
+If the second body token in production 3 was the head of production 5:
+
+ 3: A -> X Y Z
+ 5: Y -> B1 B2 B3 B4
+
+then "3:1" would also be a state where we are expecting the body of
+production '5'. So "3:1" and "5:0" are really the same state.
+
+A "State" then is a collection of "items". Some of the items will
+have a non-zero token count, and some will have a zero token count.
+We need a data structure that contains a set of "items" - much like a
+set of tokens, so probably the same data structure can be used. We
+need to be able to see if we already have a particular state, so they
+will need to be stored in some sort of lookup table with the ability to
+test two states to see if they are equal. It is best if the lookup only
+compares the non-zero items as they imply the zero items, but
+computing them for a new set is non-trivial.
If we can avoid that
+step when the state is already known, that is a useful optimisation.
+
+We build the set of states effectively by recursing down the
+production tree. We start with a state containing the item "0:0".
+This is assigned the next free state number, which is "0".
+
+For each state we need to produce all the states that it can reach.
+These are states which we reach after shifting a token, possibly via
+reduction, so there could be one for each token. We will need a set
+of tokens we have dealt with already.
+
+First we need to "complete" the itemset. This involves adding all the
+":0" items that are relevant.
+
+ a/ Initialise a set of possible 'next' tokens.
+ b/ For each item in the state, consider the 'next' token in the
+ body, if there is one.
+ c/ if this token is already in our 'next' set, move on to the next item at
+ b/
+ d/ for every production from this token, add an item with a token
+ count of zero
+ e/ We've modified the itemset we are iterating, so start again at
+ the beginning at 'b'.
+
+Now that we have a complete itemset and a set of possible 'next' tokens
+we can look for states to go to.
+
+For each token in the 'next' set we potentially create a new itemset.
+To do that we look at each item in our completed set and see if the
+production's next token is the current next token. If it is, we add
+the item to the new itemset with the token count incremented.
+Once we have this new itemset we check to see if it is already in the
+search table. If not, we:
+ - assign a number
+ - add it to the table
+ - add it to the list of states still to process
+ - add 'current_state + current_token -> new_state' to the goto
+ table.
+
+Once we have completed and created goto tables for every state, we are
+ready to move on to creating the action table.
+Part of this is fairly straightforward, part is somewhat laborious
+but not particularly difficult. We'll do the straightforward bit
+first.
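The completion step (a-e above) and the goto construction can be sketched like this. It is a minimal illustration using Python sets and string tokens with my own naming; the real implementation uses bitsets and numeric tokens as described earlier:

```python
# Items are (head, alternative-index, tokens-shifted) triples - the
# "5:0"-style pairs, plus which alternative body of the head is meant.
# A toy grammar, purely for illustration:
GRAMMAR = {                      # head -> list of alternative bodies
    'S': [['E']],
    'E': [['E', '+', 'n'], ['n']],
}

def closure(items):
    """Complete an itemset: add a ':0' item for every production of any
    nonterminal that appears just after the 'dot'."""
    items = set(items)
    again = True
    while again:                 # restart whenever something was added
        again = False
        for head, alt, dot in list(items):
            body = GRAMMAR[head][alt]
            if dot < len(body) and body[dot] in GRAMMAR:
                for a in range(len(GRAMMAR[body[dot]])):
                    if (body[dot], a, 0) not in items:
                        items.add((body[dot], a, 0))
                        again = True
    return frozenset(items)

def goto(items, token):
    """The itemset reached from 'items' by shifting 'token': advance
    the dot in every item that expects that token, then complete."""
    moved = {(h, a, d + 1) for h, a, d in items
             if d < len(GRAMMAR[h][a]) and GRAMMAR[h][a][d] == token}
    return closure(moved) if moved else None
```

The state table is then built by numbering each distinct frozenset as it first appears and applying goto() for every 'next' token, exactly as the list of steps above describes.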
+
+For each state we need to create a lookup table to map from a terminal
+to an action. So we iterate through the states, allocate a lookup
+table, and then iterate through the items in that state.
+
+If we find an item where the next token is a terminal, then we add a
+'SHIFT' action for that terminal in the current state.
+If we find an item where there is no next token - the token count
+matches the body length for the production - then we want to add a
+REDUCE action for that production, but for what lookahead tokens should
+we add it?
+
+In many cases it won't matter. There is usually only one reducible
+production, so if we arrange to reduce it whenever there is no
+SHIFTable token, the parse will proceed correctly. But sometimes it
+isn't that easy. Consider the rather contrived grammar:
+
+ A -> B c
+ | C d
+ B -> x y
+ C -> x y
+
+If we see 'x' and then 'y', it would seem valid to reduce either of
+the last two productions. What we choose depends on the next
+terminal. If it is 'c' we want to reduce to B, if 'd' we want to
+reduce to 'C'. To make that decision we need some idea of what the
+'next' token might be in different circumstances.
+
+This is the somewhat laborious part, but thankfully we have a
+computer. We need to compute 3 things for every non-terminal.
+
+1/ Whether it can produce an empty string or not. If there is a
+ production:
+
+ A ->
+
+ then clearly A can produce the empty string. Given that, if there
+ is another production:
+
+ B -> A A
+
+ then B can also produce the empty string. Rather than recursively
+ walking from each nonterminal, we will use a dynamic programming
+ approach where we repeatedly examine all productions, flagging those
+ non-terminals which can produce 'empty', until we don't find any
+ more non-terminals to flag.
+
+2/ The set of terminals which can come 'first' in the expansion of
+ each non-terminal.
This clearly includes the first token of each
+ production of that nonterminal if the token is a terminal, but also
+ all the 'first' terminals for the first token of each production
+ when that is a non-terminal. Further, if some of the first tokens
+ in a production can produce the empty string, then the 'first'
+ terminals of later tokens may also be needed.
+ Again we will use a dynamic programming approach to build up these
+ lists until they stop changing.
+
+3/ The set of terminals that can 'follow' a given non-terminal. This
+ is what we are really after. Given this, we will know which
+ lookahead terminal should trigger which reduction.
+
+
+Empty:
+ To correctly set the 'empty' flags we first clear them all (assume
+ nothing can produce 'empty') and then we walk the list of productions
+ and if we find one where the body is empty, or only contains symbols
+ which are nonterminals that can produce empty, then we flag the head
+ non-terminal as "can produce empty".
+
+ If while doing this we set the flag on any non-terminal, we need to
+ loop back to the start and check every production again, as some
+ more productions might now be able to produce "empty". When we have
+ made a full pass with no changes we can move on.
+
+First:
+ The "first" set for each terminal is trivially the set which
+ contains just that terminal. We might choose to create all of those
+ for simplicity, or we might choose to handle terminals specially
+ when creating unions later in this process.
+
+ For each non-terminal we start out assuming the set of 'first'
+ terminals is the empty set. Then for every production we examine
+ all the tokens in the body up to and including (but not beyond) the
+ first token which cannot produce 'empty'. For each of these we union
+ the 'first' terminals of the token into the set of 'first' terminals
+ for the head of the production.
+
+  Once we have examined every production, if any 'first' set was
+  modified we loop back to the start and examine every production
+  again.  We keep going until a full pass has not added any terminals
+  to any set.
+
+  Setting the 'empty' flag and computing 'first' can be combined.
+  This should reduce the total number of passes by 1 as there will
+  only be a single pass that does nothing.  It might reduce the number
+  of passes a little more as more happens in each.
+
+Follow
+  Now we get to use the "first" sets to produce "follow" sets.
+  We initially assume each 'follow' set is empty, and repeatedly scan
+  the production list adding things to the 'follow' sets until one
+  scan has made no changes (once again, a dynamic programming approach).
+
+  For each production, and for each token in that production, we add to
+  the token's 'follow' set the entire 'first' set of the next token,
+  and of every subsequent token until one is found that does not produce
+  'empty'.  As this part of calculating 'follow' is only based on
+  'first', which does not change any more, we do not need to repeat it
+  until stable.  Only the next step requires that.
+
+  For the last token in each production, and for every previous token
+  until one is found that does not produce 'empty', the 'follow' set
+  of the head token is added to the 'follow' set of the body token.
+
+  Once these sets are stable, we will know every terminal which can
+  possibly come after each non-terminal.
+
+
+Now that we have 'follow' we can examine each item in each itemset and
+decide which SHIFT or REDUCE actions are appropriate for each
+look-ahead terminal.
+It is still possible that some conflict might occur - two different
+productions might suggest actions for the same terminal.
+
+If two productions both suggest 'SHIFT' for the same terminal, then
+that isn't a conflict.  We just SHIFT and don't worry which production
+it belongs to.
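The three fixed-point passes described above ('empty', then 'first', then
'follow') can be sketched in a few lines.  This is only an illustrative toy -
the grammar representation and the name `analyse` are invented here, not taken
from the actual grammar processor:

```python
# Toy sketch of the Empty/First/Follow computation described above.
# A grammar is a list of (head, body) productions; any symbol that
# never appears as a head is treated as a terminal.

def analyse(grammar):
    heads = {h for h, _ in grammar}

    # Empty: repeatedly flag non-terminals whose whole body can
    # already produce the empty string, until a pass changes nothing.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, body in grammar:
            if head not in nullable and all(s in nullable for s in body):
                nullable.add(head)
                changed = True

    # First: for a terminal, 'first' is just itself.  For each
    # production, union in 'first' of every body symbol up to and
    # including the first one that cannot produce 'empty'.
    first = {h: set() for h in heads}
    def first_of(sym):
        return first[sym] if sym in heads else {sym}
    changed = True
    while changed:
        changed = False
        for head, body in grammar:
            for sym in body:
                new = first_of(sym) - first[head]
                if new:
                    first[head] |= new
                    changed = True
                if sym not in nullable:
                    break

    # Follow: 'first' of whatever can come next, plus 'follow' of the
    # head when everything after the symbol can produce 'empty'.
    follow = {h: set() for h in heads}
    changed = True
    while changed:
        changed = False
        for head, body in grammar:
            for i, sym in enumerate(body):
                if sym not in heads:
                    continue
                new = set()
                for nxt in body[i + 1:]:
                    new |= first_of(nxt)
                    if nxt not in nullable:
                        break
                else:
                    new |= follow[head]
                if new - follow[sym]:
                    follow[sym] |= new
                    changed = True
    return nullable, first, follow
```

On the contrived grammar above this yields follow(B) = {c} and
follow(C) = {d}, which is exactly the information needed to pick the
right reduction.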
+
+If one production suggests 'SHIFT' and one suggests 'REDUCE', then
+that is technically a conflict, but it is usually safe to resolve it
+in favour of the SHIFT.  This implies that we take the longest valid
+match, which is often what is desired.
+
+An example that can cause this sort of conflict happens with a
+grammar containing:
+
+	Statement -> if ( Expression ) Statement
+	           | if ( Expression ) Statement else Statement
+
+If we have:
+
+	if ( Expression ) if ( Expression ) Statement
+
+on the stack, and we see an 'else', we could reduce to
+
+	if ( Expression ) Statement
+
+or shift to
+
+	if ( Expression ) if ( Expression ) Statement else
+
+In the 'shift' case the else clause would apply to the most recent
+'if', in the 'reduce' case it would end up applying to the more
+distant 'if'.  The 'shift' is probably preferred.
+
+If two different productions suggest a 'REDUCE' for the same
+look-ahead terminal then that is a genuine conflict that cannot be
+resolved.  It means that the grammar is not LALR(1).  It might still
+be LR(1), but the simplifications of LALR(1) are an over-simplification
+in this case.  In many cases though, a REDUCE/REDUCE conflict signals
+a genuine ambiguity in the grammar.
+
+	Expr -> Expr + Expr
+	      | Val + Expr
+	      | Number
+
+	Val -> Number
+
+When we have a Number and we see a '+', do we reduce the Number to a
+Val or to an Expr?
+
+
+------------------------------------------
+I read the LALR paper:
+
+http://3e8.org/pub/scheme/doc/parsing/Efficient%20Computation%20of%20LALR(1)%20Look-Ahead%20Sets.pdf
+
+It doesn't help.
+So I'll figure it out myself.
+
+The simple "FOLLOW" function finds the 'next' terminals for each
+non-terminal in any context at all.  This can be too broad.
+LALR finds a more focused FOLLOW function that is potentially different
+for each state.
+
+For a start we can consider 'follow' for each specific instance of
+each non-terminal in the grammar.
+This is 'first' of the next symbol and, if that symbol can be empty,
+then 'first' of the following symbol, etc.
+But what is 'follow' of the non-terminal at the end of a production?
+It is exactly FOLLOW of the non-terminal at the head of the
+production.  No, that is too simple.  We've discarded info about the
+state we are in.
+
+Different approach.  Calculate the follow sets as we calculate the
+item sets.
+When we add an item to an itemset to complete it, we also add a
+'follow' set for that item, which is 'first' of the next token (after
+dot) possibly combined with 'first' of following tokens and possibly
+'follow' of the item.
+This 'follow' set remains with the item as it gets duplicated into
+other itemsets.  Once 'dot' gets to the end of the item, the 'follow'
+set becomes active for that itemset and that item.  i.e. if the
+look-ahead is in that follow set, the given item is reduced.
+
+That seems to be a completely general rule that doesn't merge
+anything...
+Yes it does.  It merged the itemsets earlier.
+
+i.e. when we have built an itemset with attached follow sets for each
+item, we can only merge if the follow sets are the same, or don't
+conflict.
+
+So: we need 'empty' and 'first'.
+
+An itemset contains a list of
+	production
+	offset
+	follow-set
+
+An itemset is consistent if the follow-sets are disjoint and don't
+contain any terminal which is 'next' on an item.
+So while building an itemset we keep a set of all 'next' terminals and
+check for membership before adding.
+
+We only merge two itemsets if merging the follow sets doesn't create
+an inconsistent itemset.
+
+Precedence can help avoid inconsistent sets.
+Each production can potentially be given a precedence and an
+associativity: left or right.
+When considering two productions that can use a symbol, if one has a
+higher precedence, it gets the symbol.
+If they have the same precedence and have different associativity,
+or no associativity, there is a conflict.
+If they have associativity, and one is a shift and one a reduce, then
+left-associative prefers reduce and right-associative prefers shift.
+
+
+We really need the 'go to' table, but the 'action' table might be optional.
+
+For LR(0), each state can identify a production to reduce.
+If there is no such production, we shift and follow the goto.
+If there is no 'goto' state it is an error.
+
+For a simple LR(1), we try 'go to' first.  If that works we shift.  If
+it doesn't we apply the one reduction.  If there is no reduction we
+error.
+
+Firstly we create the LR(0) states and impose precedence and see if
+there are any conflicts.  If not we finish.
+
+Then we create first/follow and see if that resolves all conflicts.
+If yes, we finish.
+Can we do this just for conflicts?  A conflict is two or more
+productions in an itemset which can act.  Each conflicted state
+indicates a single non-terminal.  So we only need 'follow' for those.
+But to create that we need 'follow' for the head of any 'reduce'
+production.  So to limit the set of follows needed, we would need to
+know all the non-terminals which can be an ancestor of each state.
+
+We can short-circuit the (SLR) FOLLOW step if a conflict is between
+two productions with the same head ... is that even possible?  Probably.
+But unlikely?
+
+If creating 'follow' doesn't resolve conflicts we need to try LALR.
+This means, for each conflicted state, finding the 'follow' for the
+heads of the items in 'this' state.
+This involves following the 'goto' links backwards to all previous
+states and asking them ... something.
+
+If this (conflicted) state has a completed production from X, we must
+have got here from a state which eventually led to this
+state, and has a 'goto' for 'X'.
+
+
+To find 'follow' for a given item X->a, we scan all states and see if
+following the body of the production (a) from each state T leads to the
+current state (S).
+If it does, we need to find FOLLOW(X, T) as the look-ahead tokens to
+choose this item.
+FOLLOW(X,T) is IFOLLOW(X,T) unioned with other stuff.
+IFOLLOW(X,T) is the 'first' of whatever comes after X in items in T.
+If empty can come after X, we need to union in the follow of this item
+too.
+
+It is all too hard.  Let's not bother.
+
+Just a simple LR(0) plus precedence, and assuming SHIFT over REDUCE if
+no precedence info.
+
+So we read a token, and if we cannot goto and we can reduce, we reduce.
+If we cannot reduce we have an error.
+
+
+However: I want to know about conflicts, not have them hidden.  So
+while I don't want to use a separate action table, and don't want the
+parser to know which look-ahead tokens are for which reduction,
+I want the grammar processor to know.
+
+So:
+
+Each itemset has a 'next' symset as well as 'items' and 'go_to'.
+This maps from non-terminal to a magic number.
+The magic number is an index into a list of symsets (which might be
+shared).  These are pure sets of terminals.
+
+These tell us the possible 'next' terminals after any of the
+productions from the given non-terminal in this state.
+
+In state zero, there is one symset for 'start' and it contains only 'EOF'.
+
+When we create a new itemset, we do so given an old itemset and a
+non-terminal.
+
+As we add each advanced item from the old itemset to the new, we copy
+the 'next' symset for the head non-terminal.
+As we add new items we calculate the 'next' for the new
+non-terminal as the 'first' of the following symbols and, if they all
+can produce 'empty', we union with the 'next' for the head of the
+production we are basing on.
+
+A finished itemset is consistent if all the 'next' symsets for
+productions that can reduce are disjoint, and are also disjoint with the
+set of terminals which can come 'next'.
+If it is inconsistent, that is an error.
+It can be made consistent by applying precedence rules.  This assigns
+a number to each production, and a left/right to each number.
+We don't record actions for items which have lower precedence than
+items we do record actions for.
+
+When we create an itemset we might find that it already exists.  If it
+does we merge the new 'next' symbols (if any) into the old.  These
+changes potentially need to be carried forward, so we flag that
+itemset for review.
+
+Precedence is set in the .gram file by lines that start with a lone $.
+The first word is one of 'left', 'right', 'nonassoc', which creates a
+precedence level and sets associativity.
+After that come terminals.  A second '$' precedes a virtual-terminal
+name which can be linked to the level.
+
+At the end of a production, '$$' before the '${' introduces a
+virtual-terminal which sets the precedence for this production,
+over-riding any defaults.
+
+Precedence setting must come before the first use of a terminal.
+
+
+-------
+It's not working.  What exactly do I store again?
+
+In each state we need the 'follow' set for each non-terminal.
+As we add non-terminals we calculate it by looking at 'first' of stuff in
+the production we started from, and possibly the follow for the head of
+that production.
+For old non-terminals they come from the previous state.
+
+So when we 'complete' a set by adding productions, we add new sets.
+When we create a new set, we copy what we have.
+
+
+It works.  But can I do better?
+- avoid multiple copies of 'next' sets using a refcount.
+- even more by using a lookup table.
+Currently have 756 follows with 186 itemsets.
+Only 73 unique next sets.
+346 calls to 'build' the 186 sets.  Some called 4 times.  Most called twice.
+
+After adding refcount:
+487 follow sets, still 73 unique.
+345 calls to 'build'.
+
+
+
+-----
+My LR code is wrong.
+With the itemset we just compare the core, which is easily found.
+With the LA sets we also want to just compare the "core", but what is
+that?
+I'd be happy for equality to be "new is a subset of found", but how then
+do we impose an ordering.
+As it is a sequential search, we could ask that the new is always >= the old.
+That might do for a start...
+
+No, all wrong.
+I need an LA set for each item, not for each non-terminal.
+
+In the core, LA is inherited from items in the previous state.
+In the completion, LA is calculated from core and completion items and
+*may* be affected by changes to the LA of the core.
+However changes to the LA of the core will propagate to subsequent states,
+so we need to do recalculation, so recording that 'may' probably isn't
+helpful.
+
+So when we visit an item we always compute the completion so that we
+can update the look-ahead sets.
diff --git a/notes b/notes
new file mode 100644
index 0000000..bd3acf1
--- /dev/null
+++ b/notes
@@ -0,0 +1,85 @@
+Random thoughts:
+
+ - go encourages a return type of
+	foo, err = func(whatever)
+
+   but that is wrong because you want to return 'foo' *or* an
+   'err', not both.
+   We almost want a type that is like boolean but contains
+   an error.
+   So if the response is treated like that type no exception
+   is thrown, else it is.
+
+	if err func(asdad) ....
+	if noerr func
+	or_err int myvar;
+	myvar = int_func(asdf)
+
+   The big thing we want to avoid here is deep nesting
+   for a sequence of failable calls.  Would goto work?
+
+	a = func(sss) on_err label;
+
+	a = f1
+	b = f2(a)
+	c = f3
+	on_err label;
+
+   once an error happens, nothing depending on the error
+   gets called??
+
+ - it is really nice if extra statements and clauses can always be added
+   with no syntax fiddling.
+   For statements, that means always having bracketing:
+	if (whatever) {
+	     statement1
+	}
+
+   adding statement2 is just a one-line change.
+
+   for clauses..
+	if test
+	     statement1
+   to
+	if test
+	&& test2
+	     statement1
+
+   is awkward to manage.
+   It would need:
+	if
+	     test1
+	and test2
+	then
+	     statement1
+	     statement2
+	endif
+
+   or
+
+	if test1
+	ifand test2
+	then statement1
+	then statement2
+
+   Both horrible.
+
+   Probably the best compromise is
+
+	if test1
+	and test2
+	then statement1
+	     statement2
+
+	while test3
+	or test4
+	then
+	     statement5
+
+	? test1
+	: statement1
+
+	do
+	while
+	then
+	else??
diff --git a/ocean-steps b/ocean-steps
new file mode 100644
index 0000000..6eda832
--- /dev/null
+++ b/ocean-steps
@@ -0,0 +1,39 @@
+About 3 years with nothing... though I've been doing edlib instead.
+
+If I had concrete steps I might be able to make progress.
+
+
+1/ Add some simple sample test code to git - maybe just fibonacci and gcd
+
+2/ Add error reporting to the parser.  It is really useless without this.
+   Error productions are:
+	A -> $error term
+   If a terminal cannot be shifted, and we cannot reduce, we push
+   the terminal back and consider 'error' instead.
+   We discard state until 'error' can be shifted, then discard
+   input until a terminal can be shifted.
+   The 'error' object gets told:
+	the first bad terminal
+	the earliest state and last terminal that were discarded
+   The error production needs an action which reports the error.
+
+3/ Add
+	var:type = value
+   type declarations.
+
+4a/ Add
+	[ len : type ]
+    array types
+    and var[expr] array access/assignment
+
+4b/ Add
+	struct name { field:type; f2, f3:type }
+    either the name or the fields are optional
+
+5/ Add
+	function name(field:type) : type { code }
+    functions.  Functions MUST return a value.
+
+6/ Add
+	proc name( fields ) : fields { code }
+    Procedures don't need to return anything, and may return several.
diff --git a/ooooo.tex b/ooooo.tex
new file mode 100755
index 0000000..df639c4
--- /dev/null
+++ b/ooooo.tex
@@ -0,0 +1,121 @@
+
+Observations on Object Oriented Opinion
+
+@section{Introduction}
+
+@section{Definitions}
+
+@subsection{Object}
+
+An Object can operationally be defined as simply a record.
+That is, it contains a number of values, each accessible through a
+name.  These values may be anything which the language can describe, but
+can usually be at least references to other objects and procedures.
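This operational view - an object is just a record of named values, some of
which are procedures closing over hidden state - can be rendered as a toy
sketch.  The counter example and all names here are invented for illustration;
no particular language is implied by the notes:

```python
# Toy rendering of "an object is a record": a record mapping names to
# values, where some values are procedures over the record's own state.

def make_counter(start=0):
    state = {"count": start}          # state, opaque from outside

    def incr():                       # a procedure attribute
        state["count"] += 1

    def value():
        return state["count"]

    # the object itself: names mapped to values and procedures
    return {"incr": incr, "value": value}

c = make_counter()
c["incr"]()
c["incr"]()
assert c["value"]() == 2              # state reachable only via operations
```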
+
+The various attributes may be fixed or changeable (constant or
+variable).
+The visibility of the attributes may also be restricted in various ways,
+with some attributes being visible to all procedures, and others being
+visible only to selected procedures.
+
+We will use the term @samp{operation} to refer to procedures which are
+fixed attributes of objects, and to which all attributes of the
+associated object are visible.  An operation is really more than the
+procedure; it is the combination of the procedure and the object which
+it is an attribute of.
+Many different objects could have the same procedure as an attribute,
+and each pairing would make a different operation.
+
+@subsection{Class}
+
+A class is a special kind of object.
+@itemize
+@item
+It has an operation called @samp{new} which returns a new object.
+All objects returned by the @samp{new} operation of a specific class
+will have the same set of attributes with the same mutability and
+visibility properties.
+The initial values will normally be the same from object to object,
+though arguments to the @samp{new} operation or state within the class
+may vary these initial values.
+
+@item
+Classes are "created" by the program text and so exist from the very
+commencement of program execution.  As such they can and do have unique
+names in the global namespace of the program.
+@end itemize
+
+To help provide confusion, the term "class" is often used to mean the
+collection of all objects that are or could be returned by the @samp{new}
+operation of some class object.
+
+The term "the class of" an object refers to the class which created
+the object.
+
+@subsection{Extension}
+
+Object extension is the process of adding attributes to an object.
+......
+
+@subsection{Types and polymorphism}
+
+We can say that a procedure @samp{expects} an object passed as an
+argument to behave in a certain way.
+The expectation may be explicit, or implicit.
An implicit expectation
+might be that a procedure attempts to access a particular attribute of
+the object.  This implies that the procedure expects the object to
+have a value for that attribute.
+
+The collection of expectations that a procedure places on an argument
+object is termed the @def{type} of the argument.
+A type may place expectations on the attributes which are visible, and
+on the types of the values of those attributes.
+
+An object can be said to @def{conform} to a type when it meets all the
+expectations of that type.
+A class can be said to @def{support} a type when all of the objects that
+its @code{new} operation creates will always conform to that type.
+
+@def{Polymorphism} is the exploitation of the observation that objects
+of several different classes may well conform to a particular type, and
+so a procedure may be written which will work with any of a number of
+differently shaped objects (from different classes) providing that they
+all conform to a particular type.
+
+The view of types as "expectations of behaviour" allows for a very broad
+understanding of types, as it puts them very much in the realm of the
+programmer's mind.  Some languages (those referred to as dynamically
+typed) are happy to leave them solely in that realm.
+Others try to encode part of the expectation so that conformance may be
+automatically checked by the compiler, or at least communicated to
+other programmers.
+
+
+
+@subsection{Inheritance}
+
+Inheritance is a notational convenience for describing classes.
+Rather than giving a complete description of a new class, the new
+class is described as being a blend of some number of other classes
+together with some new information.
+That is clearly an unworkably vague description.
+
+To be more precise, we will consider a class to involve primarily the
+@code{new} operation, which in turn is primarily concerned with declaring
+the name and nature of the attributes in a created object.
+In that context, inheritance involves saying that the attributes for new
+objects of this class are the union of the attributes of new objects of
+some other classes, with some modification.
+The modification could involve adding further attributes, deleting
+attributes, or changing the initial value of some attributes.
+
+@subsection{Virtual Classes}
+
+@section{Interpretation}
+
+class variables
+...
+
+@section{Theory}
+@subsection{types}
+
diff --git a/pony-thoughts b/pony-thoughts
new file mode 100644
index 0000000..4a5800a
--- /dev/null
+++ b/pony-thoughts
@@ -0,0 +1,47 @@
+
+I just went to a talk about pony ... now I want to get back to ocean!
+
+The main ideas were:
+ - type "Capabilities" like shared/unshared/once/whatever
+ - concurrency through message passing, not locks.
+
+I need to think about the latter.  How does it scale?
+e.g. can it implement the sort of concurrency we use for the dcache?
+That includes fine-grained spinlocks and RCU.
+
+I think that pony uses a separate thread (actor) to manage any
+shared/mutable object.  Messages are sent to the actor and it acts.
+
+That means one thread for each dentry??
+With coarse-grained locking this might be good, like Hoare monitors.
+How would you model locks and RCU in this sort of language?
+A control structure for holding the lock?
+A reference that is allowed to unlock?
+
+RCU ??  read_lock is easy enough.  So this would be supported in
+the "runtime".  What would the language check?...
+
+Certainly message passing is easier to work with.  If I was to
+provide that I would need a green-thread implementation.  This would
+need
+ - non-blocking kernel APIs so a thread can switch tasks instead of block
+ - stack allocation.  For internal code we might be able to use a tiny
+   stack and grow as needed.  For FFI a real stack allocation is
+   important.  But FFI isn't task-aware so it can use the thread stack.
+   Internal calls could be annotated with stack usage, calculated at
+   link time - whether dynamic or not.
If a recursive function
+   is called it has a preamble which allocates a dynamic stack.
+
+Sending a message would queue for the task, then try to take the lock and run.
+If the lock isn't available, then when the lock is released the next message
+will be dequeued.
+
+Do we wait for a reply?  If we did, we could deadlock.
+So we have to wait for *any* message (including the reply), and
+allow other messages to intervene.
+So if a monitor blocks waiting on a reply, then it must
+accept messages on other threads??  That means different threads
+hold the 'same' lock.  Is that a good idea?  Probably not.
+So how is pony deadlock free?  It doesn't allow reply messages.
+So you cannot say "insert if not present".  You have to "insert, and
+call me back when done".
diff --git a/spaces b/spaces
new file mode 100644
index 0000000..9e605f6
--- /dev/null
+++ b/spaces
@@ -0,0 +1,73 @@
+
+On adding indents to an LALR(1) parser.
+
+Two ideas.
+ 1/ end-of-line should sometimes be a semantic end-of-statement.
+    But when?
+
+ 2/ line continuation by indent is easy to read:
+
+	this is a semantic unit which has
+	    been broken on to multiple lines but clearly
+	    is all one unit
+	this is a new unit
+
+
+We could suggest that whenever something is indented it doesn't end
+the statement, but that doesn't allow for nesting:
+
+	if stuff {
+	    thing 1
+	    thing 2
+	}
+	thing 3
+
+Thing 1 is indented but should be terminated.  Only locally though; the
+'if' shouldn't be terminated.
+
+	a = [
+	    one, two,
+	    three, four,
+	]
+	b = 5
+
+The pattern of indents is identical here, but now several newlines should
+be ignored.  Why is this so?  Mostly because lists are declared as
+being terminated by newlines.
+
+More examples?
+
+
+First we need help from the lexer/scanner.  It must report newlines,
+indents, and undents (or outdents or dedents or whatever).
+If a line is more indented than the previous line, we get an 'indent'.
+If a line is less indented, then we get one or more undents.
+Possibly we can get undents and indents all at once.
+
+	line1
+	    line2
+	        line3
+	    line4
+	  line5
+
+Line 2 has an indent, line 3 another indent.
+Line 4 gets an undent so it lines up with line 2.
+Line 5 needs an undent, but that would imply it lines up with line 1,
+which it doesn't.  So it needs an undent and an indent.
+
+We get the lexer to produce these and the newlines, but it does
+something clever with newlines.  An 'indent' is never accompanied by
+a newline.  Instead the newline is delayed until after the matching
+undent.
+So the above would be scanned as
+
+	line1 indent line2 indent line3 newline undent newline line4 newline
+	undent newline indent line5 undent newline
+
+If we then choose to ignore a particular indent and all newlines,
+indents and undents until the matching undent, then we will still see
+a newline at the end, and the indented section will look like line
+continuation.
+
+How do we choose what to ignore?
+
diff --git a/thoughts b/thoughts
new file mode 100755
index 0000000..c752788
--- /dev/null
+++ b/thoughts
@@ -0,0 +1,168 @@
+
+Thoughts about OO type systems.
+
+What is it?
+Is it enough?
+How does it relate to current implemented models,
+  e.g. prototypes?
+Is it implementable, useful?
+Want some terminology:
+  object, class, subclass, type, subtype, inheritance
+
+Want to talk about zones of objects of a type: a pointer can point to
+a member of a zone.
+Look into islands too.
+Try to characterise linked lists and doubly linked lists.
+
+
+The basic premise of OO is that "computing is done with objects", or
+is that "programming is done with classes"??
+An "object" appears as a record of operations together with some
+opaque state.  The operations appear similar to functions in that they
+accept arguments/parameters, and produce a result.  In a typed setting
+there will be a signature indicating the allowed types of arguments
+and the required type of result.
+The operations are not functions because they have the state of the
+object as well as the arguments as inputs, and because they may change
+the state of the object.
+
+The state of an object is only accessible through the operations that
+it contains.  The implementations of those operations normally see the
+state of the object as a record of state variables which are
+themselves objects.  The implementation may define further operations on
+those state variables, which are visible only within the object or its
+descendants (see below).  These are not really "object operations" but
+are simple procedures.  The two are easily confused, but should not be.
+
+
+If two objects contain the same set of operations with the same
+signatures, then either can be used in place of the other (at least as
+far as type correctness is concerned).
+More generally, if a piece of code requires an object to supply a
+certain set of operations, then any object which supplies those
+operations, whether or not it supplies others, can be used.
+This is called "polymorphism", as objects of many different shapes can
+be used in the one place.
+This leads to a partial ordering on types of objects, where a type A is
+<= a type B if any object of type A can be used wherever an object of
+type B is required.  Applying this recursively gives a useful subtype
+relation.
+
+If A<=B and A.x : a->b and B.x : c->d then a>=c and b<=d.
+It seems tidy to assume that every object responds to every message.
+This implies that the top of the type lattice, which we shall call
+"OBJECT", should give signature BOT->OBJECT to every message, where BOT
+is the bottom of the type lattice, a subtype of every type.
+If we exclude the existence of any BOT object, then such operations
+would not be callable, which is what we would like.
+Only once the signature has been changed by subtyping to elevate the
+argument type will operations be callable.
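The '<=' ordering just described - argument types compared contravariantly,
result types covariantly - can be sketched concretely.  This is an invented
toy representation (a type is a dict from operation names to (argument,
result) signature pairs), not a claim about any real system; it also treats
a missing operation as simply "not required", rather than using the
BOT->OBJECT convention suggested above:

```python
# Toy structural subtype check: A <= B iff A supplies every operation
# B requires, with contravariant arguments and covariant results.

def subtype(a, b, depth=0):
    if a is b or depth > 8:           # crude guard against deep nesting
        return True
    for op, (b_arg, b_res) in b.items():
        if op not in a:
            return False
        a_arg, a_res = a[op]
        if not subtype(b_arg, a_arg, depth + 1):   # contravariant args
            return False
        if not subtype(a_res, b_res, depth + 1):   # covariant results
            return False
    return True

TOP = {}                              # requires nothing of an object
WIDE = {"x": (TOP, TOP), "y": (TOP, TOP)}
NARROW = {"x": (TOP, TOP)}

assert subtype(WIDE, NARROW)          # more operations => usable as fewer
assert not subtype(NARROW, WIDE)      # 'y' is missing
assert subtype(WIDE, TOP)             # everything is <= the top type
```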
+
+To make operations a bit more meaningful, we will associate a
+specification with each operation, i.e. a precondition and a
+postcondition.  These will normally need "access" to the state of the
+object, which is best described abstractly with variants and an
+invariant.
+
+Now, if A<=B and A.x :: [a/b] and B.x :: [c/d] then c->a and b->d.
+So the specification for all operations on OBJECT should be
+[True/False] == MAGIC.
+Similarly operations on BOT have signature OBJECT->BOT and
+specification [False/True] == ABORT.
+
+Clearly a BOT object is useless, so excluding them is not a problem.
+
+
+Parameterised types:
+-------------------
+ It turns out to be useful to be able to talk about incompletely
+specified, or parameterised, types.  This is similar to parameterised
+expressions that we call functions.
+So we define parameterised types:
+	P == lambda x<=T . F(x)
+We treat this like a virtual type, i.e. no objects have this type, they
+only conform to this type.
+A conforms to (is a subtype of) this P if there exists some t <= T
+such that A <= P t.
+Of course, P t may be another parameterised type.
+
+Parameterised types are not the same as types, they are only second
+class types.  A variable/name-binding must have a closed type, not a
+parameterised type.  A name-binding can have a closed type by applying
+a parameterised type to a type variable within another parameterised type.
+
+A word about inheritance:
+========================
+
+The only required relationship between a subtype and its supertype is
+that the signatures of all operations be in the same sub/super
+relationship, and that the invariant of the subtype imply the
+invariant of the supertype.  In particular, the state does not need to
+be represented in a similar way at all.
+
+However, it is often of practical value to use the implementation of a
+supertype when implementing the subtype.
+In this case the subtype will contain at least the state variables
+of the supertype and probably much of the procedure code of the
+supertype.
+Precisely, any local name binding (local variable or procedure) that
+is assumed by any borrowed code must be provided with a compatible
+signature in the subtype's implementation.
+
+We will call a procedure that only calls operations on the current
+object, and never observes state directly, a pure procedure.
+A pure procedure used as an operation on a supertype can be used as an
+operation on any subtype, independently of state storage.
+
+If a subtype only uses pure procedures from its supertype(s) then we
+will call this pure inheritance.
+If a subtype uses non-pure procedures and/or directly accesses state,
+we will call this dirty(?) inheritance.  It violates encapsulation.
+
+Both of these types of inheritance follow subtyping.
+
+The literature suggests two thoughts about inheritance which do not
+follow subtyping.
+They involve co-variance and deviance.":-)"
+
+Co-variance happens when we want the arguments to an operation to be
+re-typed down the type lattice during inheritance.
+A typical example is with binary operators, e.g. addition.
+The super-class can be added to itself, the subclass
+can be added to itself, but not to the superclass.
+I don't have any really good example through which to address this
+(I don't think point/colour-point makes sense), but I feel that
+parameterised types can provide the desired functionality.
+
+Deviance is when heavy-handed changes are made to operation
+signatures, such as removing an operation or changing the signatures
+in other non-conformant ways.  Given that the type of self gets
+modified in such a heavy-handed way, it is hard to see that the new
+self will make any sense in the context of the original operations.
+If it does, it can only be (I guess) because there is a common
+implicit supertype to which both the superclass and the subclass
+conform, and the operations of the superclass which are being borrowed
+are in fact valid for this common supertype.  I really need some more
+examples to understand this better.
+
+
+Object creation
+===============
+
+A question (prompted by the abstract factory pattern in 'design patterns'):
+  How are objects created and initialised?
+  Initialisation is particularly important for immutable objects.
+  If creation is the result of an operation, the owner of the operation
+could be:
+ 1/ an object of the same class as the one being created.
+    This begs the question of how that object was created.
+    This is done in prototype systems by allowing the type of an
+    object to be modified dynamically by adding operations.
+ 2/ an object which represents the class.
+    The object would have to be a pre-existing singleton.
+    Then, what other operations might a class have: what is its
+    type?
+ 3/ a "memory" object could create other objects, but how would the
+    type of the returned object be guaranteed?  Also, are we really
+    concerned with where something is stored, and can't the storage
+    type of an object change?
diff --git a/twod b/twod
new file mode 100644
index 0000000..9ae6ff9
--- /dev/null
+++ b/twod
@@ -0,0 +1,1358 @@
+How to write up two-dimensional parsing?
+
+What does the code do?
+
+- If an indent is ignored, the matching undent is ignored
+  except to decrement the 'ignored' stack and to turn
+  a pending indent into a pending newline,
+  or to count the pending undents.
+
+- If a newline is found and cannot be shifted yet, then we check
+  each production that would close before we shift, and if any still
+  have pending undents, we ignore the newline.
+
+- when we reduce a production, we 'resolve' indents which may leave
+  something pending.  Either an indent or a number of undents?
+
+- productions can be recursive if the first body symbol matches the head.
+  A recursive production may be horizontal if it starts in the middle
+  of a line.  Non-recursive productions are always horizontal.
+  A horizontal production must be all on one line or have its body
+  indented.
+  A vertical production must not be indented.
+
+
+Maybe I don't want to mention indents in the grammar ?
+
+   if CONDITION : STATEMENTLIST
+
+doesn't have to be indented but we will only reduce it when we
+see something that doesn't belong.... which could be an UNDENT
+
+So: when look-ahead is UNDENT we cannot shift so we reduce until
+the top state has an indent attached, at which point we consume the
+UNDENT.
+
+   if hello :
+       foo
+       bar
+       bar
+   UNDENT
+
+the undent will reduce until we just have the ifstatement which
+contains an indent.
+So any indent is pushed onto the stack and a reduction sums all the
+indents.
+
+What about newlines?  A newline will reduce any production that
+doesn't contain an indent.  It gets discarded when the top of stack
+was at start of line...
+
+So each 'state' also encodes
+   start-of-line or middle-of-line
+and
+   count of indents.
+
+When an INDENT is seen the current state increments its count of indents.
+When an UNDENT is seen we must reduce until the tos state has an indent,
+which we decrement.
+When a NEWLINE is seen, we reduce until the tos state has indents or was
+at start-of-line.
+
+nononono
+UNDENT shouldn't force a reduce.  Only newline should do that.  But as
+an undent is always followed by a newline, it still results in a
+reduce happening where I want it.
+
+So an undent shifts in like an indent(?).  Reduction nets the indents.
+A newline forces the reduction until we find excess indents.
+
+But what does the grammar look like?
+   statement -> assignment ';'
+             |  assignment
+So a newline will force the reduction to 'statement', but this will
+allow
+   a = b c = d
+which looks bad.  So:
+   statement -> assignment ';'
+             |  assignment NEWLINE
+Then the NEWLINE can be shifted normally.
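The token stream assumed throughout these notes can be produced by a small layout scanner in front of the parser.  Here is a minimal sketch (my own, not code from these notes), assuming space-only indentation and skipping blank lines; it also emits a NEWLINE after a run of UNDENTs, which reproduces the "NL UD NL" pattern these notes observe before a closing '}':

```python
def scan(src):
    """Turn indented text into words plus NEWLINE/INDENT/UNDENT tokens.

    A sketch only: indentation must use spaces, blank lines produce no
    tokens, and every INDENT is eventually matched by an UNDENT.
    """
    tokens, stack = [], [0]       # stack holds the open indent widths
    for line in src.splitlines():
        if not line.strip():
            continue              # blank lines are invisible here
        n = len(line) - len(line.lstrip())
        if n > stack[-1]:
            stack.append(n)
            tokens.append("INDENT")
        dedented = False
        while n < stack[-1]:
            stack.pop()
            tokens.append("UNDENT")
            dedented = True
        if dedented:
            tokens.append("NEWLINE")   # gives the NL UD NL pattern
        tokens.extend(line.split())
        tokens.append("NEWLINE")
    while len(stack) > 1:              # close anything still open at EOF
        stack.pop()
        tokens.append("UNDENT")
    return tokens
```

With input `"if ( a ) {\n    b ;\n}\n"` the tokens before the `}` are NEWLINE, UNDENT, NEWLINE, matching the "NL UD NL before '}'" situation analysed below.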
+So newline forces reduction until it can be shifted or until ToS
+contains an indent or started on a new line.
+If we hit an irreducible state that does not accept NEWLINE then we
+have to shift in an ERROR instead, which might then be reducible.
+When popping states we need to collect the indents.
+
+
+Where do we attach the indent/undent info?  To the state or to the
+symbol?  They are kept together but mean different things.
+indent/undent are between terminals, just like states.  But they
+can be inside non-terminals, which makes them unlike states.
+
+What "is" a state?  It is a point between two symbols in a sentential
+form.  It doesn't carry information.  So I think the indent has to go
+with a symbol.
+An 'indent' should attach to the symbol *before* it as that is the
+thing that is being continued.  An 'undent' should also attach before
+itself.  Eventually those two terminals become included in a single
+non-terminal and they collapse.
+You can never have more undents than indents as a newline will force
+reductions until the top symbol has no undents.
+In fact we reduce until the top symbol has a new indent, or is a
+vertical production, or is explicitly allowed.
+
+If we see an INDENT, we attach to the symbol at ToS
+If we see an UNDENT, we attach to the symbol at ToS (possibly canceling an indent)
+If we see a Newline and the current ToS holds a +ve indent,
+   we ignore it
+If look-ahead is expected, we shift it.
+If we can reduce, we do so.
+Else we trigger error handling:
+   pop states until one accepted ERROR - collecting indents
+   shift ERROR with those indents
+   Discard input, but process indent/undent, until the next symbol is
+   expected.
+   continue.
+
+
+Does this work?  If I have
+   a = b +
+       c +
+       d
+
+Then the NEWLINE after "c +" cannot be handled, but should not be an
+error.
+Stack would hold
+   var = expr +
+where 'expr' contains an indent.
+Maybe NEWLINE is ignored unless ToS has a negative indent.
+   a = b +
+       c +
+   d
+That must be an error.
+
+Problems:
+  1/ accepting NL must depend on what it will reduce.
+  2/ If UNDENT/INDENT appear together they must be kept separate.
+
+So maybe UNDENT does block shifts and so forces reduction.  If no
+reduction can be made, that is an error.
+
+if (a) {
+    b;
+}
+
+The NL UD NL before '}':
+  NL will ensure "b;" is reduced.
+  UD will reduce to command-list so we will have
+
+   if ( expr ) { commandlist
+where '{' holds the indent.  So '}' needs to bring in the undent,
+which is after another NL.
+
+So an Undent waits for a symbol to collect it when shifted.
+
+What about when I have:
+   if (a) {
+       foo;
+   while (b)
+       bar;
+
+When we try to reduce "foo;while(b)bar;" to "statementlist" the
+negative indent triggers an error.  This is a bit late though.  I want it
+when I see the 'while'.  I want an error there which can be reduced to a
+closed statement.
+The undent should close the statementlist so that '}' is the only
+shiftable symbol.  But after 'statementlist', 'while' is still
+permitted.
+
+If we just waited for the 'while' statement to reduce and then noticed
+the error, we could still report it against the 'while' token, but
+recovery would be a lot harder as we have already parsed the 'while'
+into the 'wrong' place.
+
+So when I see that token after NL,UD,NL, I need to be able to say
+'yes' or 'no'.
+Could require help from the grammar.
+e.g. if a token is allowed to have a lesser indent than the previous
+token, then there must be a literal UNDENT token.
+Or we could require:
+
+   Block -> BlockA }
+   BlockA -> { Statementlist
+
+That looks ugly, but the undent would close BlockA and allow the } to
+be expected.
+Or maybe
+
+   Block -> { Statementlist OptNewline }
+
+so the newline after the indent is shifted and doesn't
+reduce-or-error.
+
+Then the UNDENT would ensure that the Statementlist was closed, the
+Newline would be shifted, and only a '}' would be allowed.
+
+So: each symbol in a production must be in the same line as the
+previous, or in an indented line.
+
+No.
OptNewline there wouldn't work as a blank line would take us to
+expecting a '}'.
+
+I think I need to go back to what works... only it doesn't?
+
+
+Indent is attached to the previous symbol.  I think that is easier than
+keeping it at the 'start' of the 'next' symbol as I was doing.
+
+Newline closes any productions that don't include an indent, then can
+be shifted if wanted, or ignored.
+
+Undent reduces things until ToS has an indent that can be canceled.
+This means that if part of a production can have a negative relative
+indent, it must be separate:
+Block -> Blockhead }
+Blockhead -> { Statementlist
+
+So the Undent before the '}' will reduce Blockhead and can then be
+discarded.
+We would add
+Block -> Blockhead ERROR
+
+It would be nice if that could be auto-handled.  e.g.
+
+Block -> { Statementlist $undent }
+
+
+leads to the above.
+That would be messy for the reduce action though.
+Could each state "know"...
+
+--------------
+A state can have different items with different offsets, so how much
+of the stack is part of the 'current' production is not well defined.
+It is not completely random either.
+
+We can ignore non-kernel items as those are just potential productions
+- they haven't started yet.
+Of the remainder, many states have only one length.
+Many others will have some offset and '1' for recursive
+  productions:  A -> A . X
+These can be called "potential" as well as we have no commitment.
+What is left?
+   Funcall -> Factor . ( ExprList )
+   Term -> ~ Factor .
+
+   Funcall -> Factor . ( ExprList )
+   Term -> Term * Factor .
+etc
+
+i.e. just the bits that make the grammar distinctively LR.
+In my case one is always reducible, is that important?
+
+So when we see an Undent we find the max-index of the current state
+and determine if there is a matching indent within that distance.
+If there isn't then we must reduce or trigger an error.
+
+Alternately: Each 'goto' from this state has a max-history.
+We exclude any that don't have enough history to include balanced
+indents.
+If the next symbol would trigger one of those, it is an error.
+
+So:
+  For early error detection we record Indents against the previous
+  symbol, ignore newlines, and when we see an undent the next shift
+  must lead from a core item with enough history to keep balance.
+
+If that is well defined it might catch early undents, but what about
+unneeded indents?
+   a = b + c;
+     c = d + e;
+
+I expect that is less likely, but still an error.
+When we see the first ';' we will reduce to a statement and then a
+statementlist.  The second 'c' would lead us from
+   statementlist -> statementlist . statement
+which is recursive and so doesn't allow indents (unless it started in
+the middle of a line).
+
+Indents are allowed anywhere except after the first symbol of a
+left-recursive production which started at beginning-of-line.
+
+a = b +
+    c
+  * d;
+
+is not allowed as 'Term->Term*Factor' is left-recursive.  But why not?
+
+So maybe an indent has to be allowed anywhere.  Maybe a warning, but
+there is no real recovery option.
+
+An undent is allowed in a state with a history back to the matching
+indent.
+If we see an undent, we check if the state can consume it.  If not we
+look to the next symbol which is not an undent (or newline) and reduce
+until that symbol can be shifted.  At this point the undent must have
+been consumed.
+
+Or easier: We must reduce to the matching indent before we shift,
+unless a NEWLINE is explicitly allowed...
+
+Take 1:
+  An indent is between two symbols, or within one symbol.
+  We hold a pending indent which gets attached to the front of the next
+  symbol that we shift.
+  Undents and newlines collect until we see a "real" symbol.
+  Any reductions that are then indicated happen, and if the ToS symbol
+  is found to hold an indent, that cancels a pending undent.
+  Once we get to a state that can shift the lookahead we examine the
+  number of pending undents.
It should be zero.  i.e. we shouldn't
+still be in the middle of a non-terminal that started at a greater
+level of indent.
+
+If it isn't zero we trigger an error because a close-bracket or
+similar is missing.  This might be different to "unrecognised-error".
+The error should lead to reductions which cancel out the remaining
+undent.
+I think the parser needs to incorporate error handling into states.
+Each state knows what symbol to synthesise to get closer to a
+reduction.
+i.e. on a 'no close' error, we reduce what we can, and synthesise when
+we cannot.
+
+So that seems to handle early undent.
+We also have 'unexpected indent', 'missing indent' and 'late undent'.
+
+
+A 'missing indent' happens when a production that we expect to be
+one-line sees a newline rather than an indent.
+i.e.
+   a = b +
+   c;
+
+The newline after the '+' only reduces 'b' to Expression, which starts
+in the middle of the line.  A newline should see a reduction to a
+nonterm which started at the beginning of a line, or which contains an
+indent.
+
+We trigger a 'no close' error here so that '+' is the start of a new
+statement.  Maybe if the next symbol is shiftable, we just report an
+error and allow parsing to continue.
+
+An 'unexpected indent' happens when the fully reduced non-terminal
+started at the beginning of a line and is a recursive non-terminal.
+Or maybe we are in a recursive state.  This triggers an error.
+
+A 'late undent' is like an unexpected indent.  The previous
+fully-reduced non-terminal is recursive, started at the beginning of a
+line, and still contains an indent.
+
+------------------
+Each symbol can carry a 'start-of-line' flag and an 'indent' count.
+Each state may be flagged as 'recursive' if "A -> A . etc", and
+records the symbol to synthesise to resolve a 'missing' error.
+
+Current state includes an undent count, and a nothing/newline/indent flag.
+When look-ahead contains an undent, we increment that count.
+When look-ahead contains a newline we set the newline flag.
+When look-ahead contains an indent, we set the flag to indent.
+When we reduce we combine indent counts in symbols and keep
+start-of-line from the first symbol.
+If the ToS after a reduce contains indents and the current state contains
+undents, they are reduced so that one becomes zero (the min is subtracted
+from both).
+Before we shift we must test:
+- if undents are non-zero, we synthesise the symbol expected by the
+  state; we keep synthesising and reducing until the undent count
+  reaches zero.  We report that the synthesised symbols were missing.
+
+- If the newline flag is set, the ToS symbol should start at SOL or
+  contain an indent.  If it doesn't we just report a missing indent
+  error.
+
+- If the indent flag is set and the ToS symbol started at SOL and the state
+  is recursive, we report an unexpected indent.
+
+- If the state is recursive and ToS started at SoL and contains an indent,
+  we report an extra indent.
+
+
+---------------------
+(I think I'm getting there).
+Hypothesis:
+  An indent is allowed in a non-recursive production that started on a
+  newline, or in a recursive one (at the recursion point) that didn't.
+
+  The matching undent must be at the end of the production.
+
+  A newline is only allowed in a production that started at the start
+  of a line, but not always - only if recursive or annotated.
+
+  An undent is required at the end of a production that contains an
+  indent.
+
+
+  An indent or newline found where not "allowed" generates an error
+  but no recovery.
+  An undent not found where required generates an error and emergency
+  reductions.
+
+
+If we have an ambiguous grammar then the undent could simply
+force a reduction.  i.e. if the 'emergency reduction' just involves
+choosing REDUCE over SHIFT, then it isn't an error or even a warning.
+
+
+But where exactly is an indent or newline allowed - I don't think I
+have this right.
+  - a newline is definitely allowed in a list that started at the
+    beginning of a line.  We need to structure the production right
+    so we don't end up with a separator at the start of a line.
+     LIST -> LIST SEP ELEMENT
+          -> ELEMENT
+    would need to allow the newline before ELEMENT.  The logic there
+    would be that if the first ELEMENT can be at start-of-line, other
+    ELEMENTs can too.
+     LIST -> LIST ELEMENT
+          -> empty
+    is appropriate if no separator is needed, and an empty list is OK.
+    Here ... does the indent go after the LIST or before the ELEMENT?
+    i.e. if there are more symbols, how do we know?
+
+    If the list already has an indent, then from there on it "started
+    at the beginning of a line".
+    This is treating it a lot like LL(1) which would be:
+     LIST -> ELEMENT LIST
+           | empty
+
+     LIST -> ELEMENT SEP LIST
+           | ELEMENT
+
+  - a newline is definitely allowed in a non-list which has already
+    been indented.
+      for (i=0;
+           i<10;
+           i++)
+
+      a = b +
+          c +
+          d;
+
+    So: whatever holds the indent can also hold a newline.
+
+    i.e. a production can have a newline between symbols if an earlier
+    symbol has an indent before it.
+    This would allow
+       a = [ b
+           , c
+           , d ]
+    which is probably OK.
+
+    What about indents?  Can I have two indents in the one production?
+    No, I don't think so.
+
+    If an indent occurs *inside* a production, newlines can appear
+    later.
+    If an indent occurs *in front* of a list production, newlines can
+    appear later.
+
+
+
+So (again!)
+  a symbol (on stack) can have an indent at the start, and can have a
+  number of indents within.
+  When we shift the first symbol after an indent, it gets an indent at
+  the start.
+  When we reduce, the result inherits an indent at the start, and other
+  indents (within and at start) become within.
+  The indent at the start of a list can be treated as being within,
+  i.e. as if the list has an empty symbol at the front before the indent.
+
+  A newline is permitted if the max history for the current state
+  contains an indent.  An indent at the start of a non-list doesn't
+  count.  At the start of a list it does.
+  An indent is permitted if the max history does not contain an
+  indent.
+
+  An undent cancels indents within, or possibly before, the
+  immediately preceding symbol.
+
+
+Can I do all lists with an empty start?
+
+   // simple list
+   LIST -> LIST ELEMENT
+        | empty
+
+   // non-empty list
+   LIST -> LIST0 ELEMENT
+   LIST0 -> LIST
+         |
+
+   // non-empty with separator
+   LIST -> LIST0 ELEMENT
+   LIST0 -> LIST SEP
+         |
+
+   // can be empty, sep is needed
+   LIST ->
+        | LIST1
+   LIST1 -> ELEMENT
+         | LIST1 SEP ELEMENT
+
+   Just stick "IsList -> " in front of anything that is a list.
+
+
+For a newline to be allowed, do indents have to be between symbols in the
+same production, or is within a symbol OK?
+
+if ( a =
+       b )   // now I need another indent.. or an undent?
+   x = x + 2;
+
+The undent resets, then only an indent is allowed.
+if ( a =
+       b )
+     x = x + 2;
+
+No reset, so extra indent.  Anything is probably OK except newline.
+So the indent cannot be inside a previous symbol.
+But for lists, the indent will be exactly inside LIST.
+So maybe inside the first symbol is OK ??
+   x[a+
+     b] =
+      c
+
+   x = [ a, b, c,
+         d, e, f
+       ]
+
+
+I think I might be trying too hard.  I should allow newlines
+and indents anywhere, and give a clear interpretation for undents.
+However my real goal is to know exactly what newlines and undents mean
+and how to parse them.
+i.e. does the grammar need to mention all of them, some of them, or
+just be ambiguous and allow line breaks to disambiguate?
+
+That last option doesn't really work as we cannot allow a separator
+to be dropped anywhere but at end-of-line or before-close.
+So the grammar needs newlines.  Various separators can be "sep or newline".
+However I don't think indents need to be mentioned.
+   if condition : statementlist
+needs undents to disambiguate, but I don't need to say that.
+
+So:
+  - indents get shifted with the token
+  - indents are grouped by reduce
+  - an undent reduces until a matching indent in the top
+    state can be cancelled.
+  - newlines are messy.
+    We look ahead through possible reductions until we
+    either shift the newline or find an indent or cannot reduce.
+    If the shifted newline would not leave a newline contained in the
+    production, we choose that path.  If we find an indent, we discard
+    the newline.
+    If we cannot reduce we again ignore the newline.
+    We could also discard a newline early if not in the LA set.
+
+
+But then:
+   if x :
+      a = b
+      +
+      c = d
+
+looks to the compiler like an addition and to the human like two assignments.
+   a =
+     b
+     + c
+     + d
+
+?? Only allow a newline if the production started with an indent.
+So:
+  If not expected and the current production contains an indent, discard.
+  If not expected and the current production has no indent - error, recover.
+  If expected, test-reduce and discard if an indent is found, else accept.
+Not quite.
+
+What if we just test-recover until we hit an indent or want to shift the newline?
+
+The difference between statements and expressions is that a statement can
+be terminated by a newline.
+So if we find a newline and test-recover would use it, that is an
+error.
+If we find a newline and test-recover would not use it before an
+indent, it isn't an error.
+
+So each state which cannot reduce has a link to the best state to shift to.
+When we want to recover we follow these links until they run
+out.  Then we can reduce, which leads to a new state.
+
+If we find a newline, we do this until we find a state which will take
+a newline, or we cross an indent.
+If we crossed an indent, we discard the newline.
+If not and we had to take any shift steps, we call an error.
+This requires that newlines aren't buried too deep.
+
+If we find an undent, we do this until we have cancelled the undent
+against an indent.  If anything was shifted, we report an error.
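The undent rule summarised above - an undent reduces until a matching indent at the top of the stack can be cancelled, and a reduction sums the indents of its body - can be modelled on a toy parse stack.  This is a sketch of my own, not code from these notes; the symbols and the `reduce_two` folding step are hypothetical stand-ins for a real parser's reduce actions:

```python
def reduce_two(stack):
    """Demo reduction step: fold the top two entries into one
    non-terminal, summing their indent counts (a reduction sums
    all the indents).  Returns False when nothing can be reduced."""
    if len(stack) < 2:
        return False
    (s1, i1), (s2, i2) = stack[-2], stack[-1]
    stack[-2:] = [("%s.%s" % (s1, s2), i1 + i2)]
    return True

def handle_undent(stack, reduce_one):
    """On UNDENT: keep reducing until the top-of-stack entry holds an
    indent, then cancel one indent against the undent."""
    while stack and stack[-1][1] == 0:
        if not reduce_one(stack):
            raise SyntaxError("undent with no matching indent")
    sym, indents = stack[-1]
    stack[-1] = (sym, indents - 1)
```

Starting from a stack like `[("if",0), ("cond",0), (":",1), ("stmts",0)]`, one undent first folds `:` and `stmts` together (carrying the indent into the new non-terminal) and then cancels that indent, which is exactly the "reduce until ToS has an indent that can be canceled" behaviour.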
+
+No - using indent just to disambiguate an ambiguous grammar doesn't
+work as it still allows confusing constructs.
+   if foo:
+      bar
+   else: baz
+should be wrong as the 'else' looks to be part of the statement but
+cannot be.  It should be "unexpected".  That really means that
+"INDENT" "UNDENT" must be part of the grammar.
+So:
+   if COND : INDENT STATEMENTLIST UNDENT
+which is where I started.
+
+So an indent which is not expected protects all newlines and
+indent/undent pairs until the matching undent.... not quite.
+
+Indents can be expected in a state (if we compute lookahead).
+If an indent is not expected we simply discard it and keep count.
+When we shift an expected indent we push the count with it.  When we
+push the matching undent, we restore the count.
+When we get an undent, we discard it if the indent count is non-zero.
+
+When we get a newline we look ahead to see if it would reduce across
+an indent.  If so we discard the newline.  If not we keep it.
+
+But an undent still needs to force reductions, doesn't it?
+
+i.e. if an undent doesn't cancel against the top of stack we count it.
+If we ever try to shift while an undent is present, we throw an error
+and shift other stuff until the undent is gone.
+
+
+Indents must affect error handling.  We no longer just discard tokens
+left and right.  The 'error' state for any indented token is the
+matching undent.  So we discard stack tokens until we get an error
+state or an indented state.
+If error, we shift the error and discard input until a token
+   matches.  While discarding input we track indent/undent and
+   if we come to an unmatched undent we discard stack tokens again.
+
+If we hit an indented state we discard input until the matching undent
+arrives.
+
+But do we really want to discard tokens, or can we synthesise tokens?
+
+Discarding is standard.  It gets to a known position.  It allows any
+location in a production to accept errors.
+However discarding does discard data that then cannot get further analysis.
+If we use very fine grained error productions we might not miss much.
+
+Come back to this later.  For now just discard.
+
+If we find an Indent on the stack, an error "must" be expected.
+i.e.
+   block -> : INDENT Statementlist UNDENT
+   Statementlist -> ERROR
+
+So:
+   if we can shift ERROR, we do,
+   else if
+
+
+
+Indents and newlines.
+
+Indents and newlines can be taken as strong syntactic markers, useful
+both for regular parsing and error recovery.
+Unlike other bracketing markers such as parentheses or paired key
+words (BEGIN and END or DO and OD) indents cannot be lost.  Every line
+displays what the current indent level is, and for every indent there
+is precisely one location where the matching undent is (though
+of course one location can have undents for multiple indents).
+
+Similarly start-of-line and end-of-line can be strong bracketing
+markers which cannot be missed.
+
+The two can complement each other to provide a very effective
+indication of the grouping that the programmer intends.  Often the
+programmer will want more explicit marking as well (e.g. "{" and "}" )
+which is perfectly fine, as redundancy of expression is an important
+part of effective communication and this is equally true in writing a
+computer program.
+
+The connection between the two is simply that indentation effects a
+continuation of the previous line.  For example in
+
+   A B C
+     D E
+
+any bracketing implied by sol/eol should not be imposed around "A B
+C" but only around "A ... E" - the indent effectively hides the line
+break.
+So this could be tokenised as
+   sol A B C indent sol D E eol undent eol
+
+Note that every SOL has a matching EOL and every INDENT a matching
+UNDENT.
+
+As these markers are strongly paired they can be useful for error
+detection and recovery.  This is particularly true for indents.
+
+It seems universally true that any syntactic unit (such as a declaration,
+statement, or expression) which starts indented will finish while
+still indented.
That means it will complete before the undent which
+matches the most recent indent when it was started.
+
+This fact is helpful but not quite as strong as we might like.
+For example, in
+
+   A B C
+     D E
+     F G
+   H
+
+"F G" could be in the same syntactic unit as "A..E".  However if "D"
+started a new subunit, "F G" certainly aren't part of that.
+Assuming that "A" started a syntactic unit, "H" is the first symbol
+that is clearly not part of it.
+This isn't really a problem for
+   if condition :
+      foo;
+      bar;
+   baz;
+
+as the statementlist is the last part of the if-statement, so
+there is no room in it for baz - baz must be separate.
+However in
+
+   a = b + c
+       + d + e
+         + f
+       + g
+   h;
+
+this looks wrong but cannot be ruled out on the grounds of
+indentation.
+Here the expression of
+   statement -> var "=" expression ";"
+
+starts on the same line as the statement so there is no indent.
+
+We could argue that a syntactic element (expression) that started other
+than at the start of a line may only continue to the end of that line
+(when adjusted for indents).  This can hit problems with:
+
+   if condition {
+      foo;
+      bar;
+   }
+
+Here the "block" of
+   block -> "{" statementlist "}"
+starts other than at the beginning of a line, but continues on the
+next line.
+
+There are a few different ways we could try to deal with this:
+
+a/ some special-case marking in the grammar to say that a newline is
+   permitted before the '}'
+b/ Don't make a separate syntax element for "block":
+     ifstatement -> if condition { statementlist }
+   would mean that the '}' is at the same indent as the 'if'
+   which starts the element.
+c/ decide that the last terminal of a production is a special
+   case and can go on a new line.
+
+The last is almost certainly a bad idea as it would encourage
+   a = b
+   ;
+and discourage
+   if foo then
+      stuff
+   end if
+
+Another construct that is often used is:
+
+   if condition
+   {
+      foo;
+      bar;
+   }
+
+where the "block" is allowed to start on a new line.
+This potentially encourages 'block' to be a separate syntactic unit,
+so 'b' above doesn't seem encouraging.  That leaves 'a', and marking
+both the '{' and the '}' as being allowed to start a new line.
+
+
+If we assume these markings, then an undent can close any syntactic
+unit that started at a greater indent, or at the same indent but not
+at start of line.
+We can use this:
+ - to detect errors by reducing early and then ensuring the next
+   token is expected.
+   e.g. the undent after 'bar;' would close the statementlist after
+   which only a '}' is expected.  If it isn't there, that is an error.
+ - to handle ambiguous grammars.  In the classic
+     statement -> if ( expression ) statement
+               |  if ( expression ) statement else statement
+   the normal default is to treat 'else' as right-associative.
+   Using indents:
+      if (x)
+          if (y)
+              foo();
+      else
+          bar();
+
+   can be parsed with the 'else' applying to the 'if(x)' as you would
+   visually expect.
+
+
+Now we turn to line markings - sol and eol.
+The above suggested that any unit should be "on one line" (allowing
+for indents) unless specially marked.  This is a bit excessive.
+For example
+   a =
+     b + c +
+     d;
+
+"b + c + d" is all one expression, but appears on two lines.  Might
+seem unlikely, but
+   a[some_long_expression_with_function_call_etc] =
+     verylongling + otherverylongthing +
+     moreofthesame;
+
+might not seem so unlikely.  This could easily be made correct by
+moving the '=' to the start of the second line, or indenting the third
+line slightly.  It isn't certain that we want to do that, and
+annotating that a newline is allowed before the 'plus' seems clumsy.
+
+An alternate approach involves noting that 'statements' are rather
+special.  Not uniquely special but still special.
+That is, statements are normally on one line.  When they are too long
+for a line the remainder is indented (making it part of the previous
+line).
+When a statement appears on multiple lines it is a notable exception.
+For example the Python
+
+   if condition:
+      statementlist
+   elif condition:
+      statementlist
+   else:
+      statementlist
+
+has three lines in one statement.  This is a one-line statement which
+has newlines before the 'elif' and 'else', or maybe 3 one-line objects
+which can come together.
+
+While a statement is line-based, an expression probably isn't.  They
+tend to appear all on one line, but that is because the statement they
+are in does.
+
+In some sense the line-nature of the statement "captures" the
+expression.  An expression can only be on multiple lines if they are
+all indented with respect to the containing statement.
+
+So rather than marking some things as being allowed to continue to the
+next line, we can mark some things as being line-shaped.  These are
+sometimes whole units and sometimes parts.  e.g.
+
+   if ( condition )
+
+   { statementlist
+
+Note the missing '}' above - it is not part of the line shape.
+
+   if condition : statementlist
+
+   else: statementlist
+
+   function name(argumentlist)
+
+   lval = expression ;
+
+and so forth.
+
+Being line-shaped just means that it must be "all on one line", but
+not that it must be a line by itself.  Several line-shaped things can
+appear on a line
+
+   a = a+1; b= b-2;
+
+However it would be nice to allow a syntax modification when separate
+lines are used.  i.e.
+   a = a + 1
+   b = b - 2
+
+The trailing semicolon is not needed.  As the assignment is
+line-shaped it must finish at the end of the line.
+Declaring the semicolon to be optional would allow
+   a = a + 1 b = b - 2
+which looks more like an error than a correct construct.  Here the
+semicolon is needed.
+
+A simple solution is to allow the NEWLINE to be part of the syntax
+listed for the grammar:
+   OptSemi -> ;
+           | NEWLINE
+
+Then
+   Assignment -> lvalue = expression OptSemi
+
+as a line-shaped item should work.
+
+
+
+1/ If a point is 'newline allowed' then what follows can go
+   on a new line
+2/ If something is line-shaped then line breaks may only be indented.
+
+So in a production we can have 'optnewline' or 'optsemi'
+anywhere but at the start.  Or maybe 'at the start' means something a
+bit different.
+
+   statement -> if condition optnewline block
+             |  if condition optnewline block optnewline else optnewline block
+
+   block -> : statementlist
+         |  { statementlist optnewline }
+
+
+If something is line-shaped, then the EOL that ends the line on which
+it starts must end the thing.
+If something is not line-shaped, then an EOL can be safely ignored.
+
+An LR state can expect NEWLINE if it has a goto state for it.
+And it has an indent.
+
+If we look up the stack and find an indent before a newline-expected,
+then a newline is ignored.
+If we find a newline-expected before an indent, the newline is used.
+If a state expects a newline and was indented then the newline
+expectation comes last, is seen first when looking back, so the newline is
+used.
+
+Separately, an undent reduces until it can cancel with an indent.
+
+
+
+but what about
+   a = 1 + (3 *
+      4 + 2)
+   b = 1 +
+      3 * func(a, b,
+         c, d)
+
+This isn't so much about whether an indent is present, as how much of
+an indent there should be.  This sort of indent doesn't affect parsing
+or help with error recovery so it is probably best left to aesthetics.
+We could possibly require that everything in the brackets be indented
+w.r.t. the line on which the brackets started so that
+
+   b = 1 +
+     3 * func(a, b,
+   c, d)
+
+would not be allowed.  However we would need to ensure that
+   b = 1 +
+     3 * func(
+   a,b,c,d)
+were allowed (I think)
+
+
+I've got the UNDENT reduce a bit wrong.
+Anything that started with an indent must be reduced, not just one
+thing that started there and everything after.
+This is awkward as reducing back to 'statementlist' doesn't stop
+adding to that statementlist, but it should.  So we want e.g.
+   statementlist -> statements
+   statements -> statements statement
+              |
+Then we reduce back to 'statementlist' which cannot accept another
+statement.
The parser generator could insert these extra non-terms
+automagically.
+No, that's not quite it either.  If I have
+   a;
+   b;
+      c;
+   d;
+then the undent after 'c;' needs to reduce the statement (which
+is already reduced) but not the 'statementlist'.  But there is
+no extra state there.  We aren't reducing back to a state, but back to
+a symbol.  The indent happened after a statementlist.
+
+A recursive symbol A is a problem if:
+   A -> A B
+and
+   X -> ?? A Y
+for non-empty Y.  This results in a state:
+   X -> ?? A . Y
+   A -> A . B
+
+If the Y is empty, X will reduce and B becomes unshiftable.
+If this pattern is found we can create A'
+
+   X -> ?? A' Y
+   A' -> A
+
+Now 'A' only appears at the end of a production so there is no
+problem.
+
+
+Got it wrong again ...
+
+   foo
+      bar
+      baz
+   bat
+
+After 'bat', the recent indent is in Statements, which makes it look
+like we need to close all of Statements.  But we don't.  As the indent
+started in the middle of Statements we only need to close to it.
+So we really do need to know if an indent is at the start or in the
+middle.
+We can cancel an 'in the middle' as soon as we find it.  We can only
+cancel an at-the-start when reduce_size > 1 (as it is now 'in the
+middle').
+
+
+
+Now back to newlines.  It doesn't work very well to include newlines
+in the grammar wherever one is allowed, as it conflicts with where they
+are required.  E.g. for ';' to be optional we sometimes require one at
+the end of a statement, but for statements to be line-like we need to
+allow one at the start.
+
+I think we need to allow NEWLINE in the grammar for EOL markings.
+Maybe we want SOL too?
+
+The important cases are:
+
+   a = b *
+       c +
+       d;
+
+Statement is line-like so a return to SOL is not allowed, but
+expression is not line-like, so a return is.
+   if condition:
+       statement
+   else:
+       statement
+
+Both 'if ..' and 'else..' are line-like.
+   a = b + c
+
+A newline at the end of this line is allowed in place of ';'.
If we have the grammar recognise all newlines that are not protected,
we need to support blank lines in lists:

    Statements -> Statements OptNL Statement

No, I don't like that. EOL should be at the end.

Some things can have an EOL at the end, either optionally or
required. Or elsewhere in the production.

When we see an EOL we need to ask "Do I ignore this or use this?".
The answer relates to indents and expected newlines.

If a symbol can derive a sequence with an EOL, then the symbol is
flagged 'can-eol'.
If the symbol after DOT in any item in a state is flagged 'can-eol',
then the state is flagged 'can-eol'.
If a 'can-eol' state appears on the stack closer than a 'was indented'
state, then we use the EOL token.
If no state up to the first 'was indented' state is flagged
'can-eol', then we discard the EOL token.


That might work, but have I thought of all the issues?
 - blank lines. The grammar will need to allow for these explicitly
   in any context where EOL is expected. Not a big burden I guess.
 - forced reductions? If EOL closes things, does it close anything
   where not explicit? Probably not - that is sort of the point.
   If it appears where not expected but in a line context, that is an
   error just like an "unexpected token" error.
 - is the nesting 1-way? i.e. can I have line-oriented things in
   no-line boxes, or "vertical stacks in horizontal boxes"?

   A possible example is a function call. It might be nice to allow

       a = b + func(first
                    second
                    third) * 2

   There are two problems here.
   Firstly there is no EOL after 'first', so that cannot mark the end
   of an argument.
   Secondly it forces an EOL interpretation on surrounding
   expressions.

   And what about

       a = [1, 2, 3
            4, 5, 6
           ]

   I would like to be able to drop the ',' after the '3', but it
   isn't clear that it is safe.

       a = [1, 2, 3
            +4
            +2
            8, 9]

   is a bit forced but with clear intent.
I think it could be deemed wrong because the thing that wraps is not
the first list element on that line. Is that a well defined concept?



Recording the line structure allows for a number of different
options. As mentioned already, a "NEWLINE" at the end of each line
might do it. However as "IN" and "OUT" already imply a line break, we
could just have "SAME" on breaks which don't have "IN" or "OUT".
However this doesn't capture the idea that a line sometimes
corresponds to a syntactic unit.

Another approach is to follow the bracketing nature of "IN" and "OUT"
and have "SOL" and "EOL" for start-of-line and end-of-line. As we
noted already, an indented section implies continuation and so should
delay the end of line. If we follow this idea, the above structure
could be tokenized as:


Here every "SOL" has a matching "EOL", but it can be delayed until
after an indented block. So the "EOL" that matches the first "SOL"
comes after the "OUT". It is worth noting here that a single line
break can have multiple "OUT"s and even an "IN" as well.

    A B C
        D E
            F G
      H
    I

After the "G" we would have "EOL OUT EOL OUT EOL IN SOL" before the
"H". The three "EOL"s terminate the lines that start "F", "D", and
"A" in that order, and a new "IN" is needed as "H" is indented with
respect to "A".

This tokenization captures the structure very effectively and is what
I will be using, with one simplification. The "SOL" really isn't
needed. It is easily deduced and doesn't have a strong effect on the
parser. The "EOL" may force a syntactic element to be closed. The
"SOL" is simply recorded for reference, and that can be recorded when
"IN" or "EOL" is seen ("OUT" is always followed by "IN" or "EOL", so
"OUT" doesn't need to imply "SOL").
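A scanner producing this simplified stream (IN, OUT, and delayed EOL, with SOL dropped) can be modelled from indentation alone. The following is a hypothetical Python sketch for illustration only, not the actual scanner:

```python
# Hypothetical scanner model: emit IN when a line is indented more
# than the enclosing line; on a return to a lesser indent, emit EOL
# for the line that just ended, then OUT plus the delayed EOL for
# each enclosing line being closed. SOL is not emitted, as discussed.

def tokenize(lines):
    stack = [0]          # indents of still-open lines
    tokens = []
    for n, line in enumerate(lines):
        text = line.lstrip(' ')
        indent = len(line) - len(text)
        if n > 0:
            if indent > stack[-1]:
                tokens.append('IN')            # previous EOL is delayed
                stack.append(indent)
            else:
                tokens.append('EOL')           # previous line ends here
                while indent < stack[-1]:
                    stack.pop()
                    tokens += ['OUT', 'EOL']   # delayed EOLs
                if indent > stack[-1]:
                    tokens.append('IN')        # partial return, e.g. 'H'
                    stack.append(indent)
        tokens += text.split()
    tokens.append('EOL')                       # end of input
    while len(stack) > 1:
        stack.pop()
        tokens += ['OUT', 'EOL']
    return tokens

toks = tokenize(["A B C", "    D E", "        F G", "  H", "I"])
# Between "G" and "H" we get exactly: EOL OUT EOL OUT EOL IN.
assert toks[toks.index('G') + 1:toks.index('H')] == \
    ['EOL', 'OUT', 'EOL', 'OUT', 'EOL', 'IN']
```

Running it on the example above reproduces the token stream just described, including the single line break that yields two "OUT"s and an "IN".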
Unsurprisingly, these are exactly the tokens that can be generated by
the scanner I previously described, though the token names have been
changed.


When does an EOL actually mark an END?

Now that we have a good understanding of IN and OUT we can turn to
EOL. Intuitively EOL should mark the end of something, much like OUT
does. But what? Let's look at some cases:

    if a == b:
        print "Hello"
        print "World"

Here there are 3 EOL tokens: one after "Hello", one after "World",
and one after the OUT token right at the end. The first two should
close the statement on that line. The last probably shouldn't, as
there might be an "else:" still to come.

    a = b +
        c * d +
        e

There are 3 EOL tokens here in the same locations. Now the first two
should not close anything, while the last one should close the
statement.

These two examples have very similar structure and are largely
indistinguishable to the LR parser, and yet must be handled very
differently. This implies that we need to provide the parser with
extra information: we need to tell it something about the expected
line structure of the language.

To understand what to tell the parser we need to look into our (or
my) intuition about why the newlines are or are not wanted in the
different locations.

I think the key observation is that statements are generally expected
to be one-per-line, while there is no similar expectation for
expressions. The earlier observation that an "else:" might still be
expected shows that statements are not always one-per-line; however,
that is a fairly well defined exception. i.e. there are certain
places in a statement where a newline is expected, or at least not a
surprise.
The start/end is one place. Before an "else" or "elif" is another.
In C we might have:

    if (a == b)
    {
        print "Hello"
        print "World"
    }

so an EOL is expected before a '{' or a '}' in that language.

If we define all the places where an EOL is expected then we should
be able to work out when an EOL should close a syntactic unit and
when it should be ignored. However it isn't as simple as ignoring an
EOL wherever it isn't expected. We need to include our understanding
that indents protect EOLs in some sense. In the second EOL example
above (a = b + c ...) the first two EOL tokens were ignored in part
because they were indented. In the first example they were not
ignored because, despite being indented, they were expected. So the
interplay between indents and EOL expectations needs to be
understood.

If we go back to our very first example:

    A B C
        D E
            F
    G H

you might remember I suggested that if something started at "B C..."
it should finish at "F". The "OUT" should close that as well. Our
process described ..... not sure ..


Also you need to know how the scanner can report indent information
to the parser. I discuss that in some detail when introducing my
lexical scanner, but you probably don't need to read that now. It is
sufficient to know that the token stream will have paired "INDENT"
and "UNDENT" tokens, and that the "NEWLINE" token that you might
expect to come before an "INDENT" is delayed until after the matching
"UNDENT".

With this in place, it seems natural to allow the parser to absorb
INDENT and UNDENT, but to require that when it reduces a production,
the net indent absorbed while shifting the tokens which make up that
production is zero, i.e. for every INDENT there must be exactly one
UNDENT. So:

    if condition {

It might help to start with some examples.

1. Lorem ipsum dolor sit amet, consectetur adipisicing
2.    elit, sed do eiusmod tempor
3.    incididunt ut labore et dolore
4.       magna aliqua. Ut enim ad minim
5.  veniam, quis nostrud exercitation ullamco
6. laboris nisi ut aliquip ex ea commodo consequat.
7. Duis aute irure dolor in reprehenderit in voluptate
8.    velit esse cillum dolore eu fugiat nulla pariatur.

Here are 8 lines of random text - the line numbers are just for
reference and should not be considered part of the text. Similarly
any punctuation or capitals should be ignored when trying to
understand the structure.

Looking at just the newlines and indents, it appears that lines 1-5,
line 6, and lines 7-8 form three separate and equal units. Within the
first of these, lines 2 and 3-4 are again two separate and equal
units. Line 5 is not part of whatever 2 and 3-4 are, but is something
else that is part of 1-5.

Whether that structural analysis is actually correct cannot be
determined without knowledge of the meaning, so let's try something
more meaningful.

    a = [ 1, 2
          3, 4
          5, 6
        ]

Now we can clearly see that the continued lines are all part of what
was started on the first line. The last line is less indented than
the others, much like line 5 in the first example. We took that to
mean it is not a continuation of the 2nd and 3rd lines but only of
the first, and this example bears that out. The lack of a comma at
the end of the first 3 lines does not lead to any ambiguity. It is
clear from the structure that something is completed. It cannot be
the whole list that is completed, as that started on the first line
and we are still indented with respect to there, so it must be a
"smaller" or more local syntactic unit that has completed, i.e. the
list entry.

Here is a different example using the same structure.

    if a = 1 { a =
          b + c;
          d = e;
        }

This looks very wrong. Breaking the first assignment looks bad, and
having the second assignment start with the same indent as the
continuation of the first is also wrong. Yet the list begun, but not
finished, on the first line of the first example looks fine. Maybe it
is the amount of indenting that makes a difference.
We can try adjusting that and see:

    a = [ 1, 2
              3, 4
              5, 6
          ]
    if a = 1 { a =
                   b + c;
               d = e;
               }

Certainly the amount of indenting helped in the first example, but
fixing it doesn't really fix the second example.


30apr2014
    I seem to be stuck here...
    What do I know?

    When I see a newline, I must either IGNORE it, SHIFT it, or
    REDUCE. The key question is: when to ignore it?

    We ignore newlines when they are protected by an indent.
    What does "protected" mean? It means there is already an internal
    indent in any symbol which can derive a newline.

    So in the above examples, newlines are only expected at the end
    of a statement (with or without ';') and after '{'.

    So for all the newlines seen in "a = [..]", the closest symbol
    that derives newline is 'statement', and it has an internal
    indent. So they are ignored.

    In the "if a = 1 { ...", the first newline is after "+ c;". We
    are in statement (or statementlist) which derives newline, and
    that has an internal indent, so the newline is ignored. But I
    don't want it ignored.

    How about:
      When we see a newline, we find the closest reducible thing
      that can contain a newline. If it started at the start of a
      line and contains an internal indent, then we can ignore the
      newline.

    That has potential: a line-like thing can only be broken if it
    starts at the beginning of a line.

    But can we implement this in an LR grammar? We don't know what
    context we are in until we reduce into that context. So how do
    we determine if we are in a line-like thing?

    Each symbol has a 'can_eol' flag. EOL gets that flag, and any
    non-terminal which can produce a 'can_eol' symbol gets the flag
    too. So 'statement' has it but 'expression' does not.

    Each state has an 'expect_eol' flag. It is set if the 'next'
    symbol of any item in the state has can_eol set.
    OR: it is set if the head of any production in the state has
    can_eol set.
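The can_eol flag is a fixed-point computation over the grammar, and expect_eol then falls out of each state's items. A hypothetical sketch using a toy grammar representation (not the real generator):

```python
# Hypothetical sketch: compute can_eol as a fixed point (EOL has it;
# any non-terminal with a production containing a can_eol symbol gets
# it), then derive expect_eol for a state from its items, where an
# item is (head, body, dot) and the 'next' symbol is body[dot].

def compute_can_eol(grammar):
    can_eol = {'EOL'}
    changed = True
    while changed:
        changed = False
        for sym, prods in grammar.items():
            if sym not in can_eol and any(
                    s in can_eol for p in prods for s in p):
                can_eol.add(sym)
                changed = True
    return can_eol

def expect_eol(items, can_eol):
    # First definition above: some item's 'next' symbol has can_eol set.
    return any(body[dot] in can_eol
               for head, body, dot in items if dot < len(body))

grammar = {
    'statement':  [['expression', 'EOL'], ['expression', ';']],
    'expression': [['expression', '+', 'term'], ['term']],
    'term':       [['NUMBER']],
    'block':      [[':', 'statementlist']],
    'statementlist': [['statementlist', 'statement'], []],
}
flags = compute_can_eol(grammar)
# 'statement' has the flag but 'expression' does not, as noted above.
assert 'statement' in flags and 'expression' not in flags

# A state containing "block -> : . statementlist" expects an EOL;
# one containing only "expression -> term . + term" does not.
assert expect_eol([('block', (':', 'statementlist'), 1)], flags) is True
assert expect_eol([('expression', ('term', '+', 'term'), 1)], flags) is False
```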
    Now recall that the state stack is conceptually:
        STATE1 SYM1 STATE2 SYM2 STATE3 SYM3
    Each STATEn is expecting all the following syms to eventually be
    reduced to something which can then advance that state.
    So if STATEn has 'expect_eol', then everything following is part
    of something that maybe could have a newline in it.

    So if STATEk has expect_eol, then we ignore a newline precisely
    if SYMk was at start-of-line and somewhere from SYMk on there is
    an internal indent. So we need a new start-of-line flag in the
    frame.

    Is expect_eol precise enough? I'm wondering about a production
    that has a newline in the middle. What happens then?
    After the newline has been shifted the whole won't be reduced,
    so that expect_eol state will still be there and we can still be
    ignoring newlines.

    I think we declare that a production which can contain a newline
    but cannot end with a newline is an oddity which we don't
    support.

    Similarly if any production from a symbol can contain a newline,
    then every production is assumed to be line-like and so requires
    an indent for newlines to be ignored.


    "Newlines are ignored only inside line-like symbols which
    started at the beginning of a line and have since had an indent."

    Possibly good, but a grammar with no newlines would not allow
    newlines. Maybe that doesn't matter - the lexer can ignore them.
    Can we turn it around though?

    "Newlines are only shifted if the nearest line-like thing does
    not have an internal indent."
    So "they are ignored if there is no line-like thing, or it holds
    an indent".

    Also: "an internal indent in a line-like thing that started
    mid-line is not permitted".


    That didn't work - the 'expect_eol' definition doesn't make
    sense that way.

    New attempt:
      expect_eol is set if *all* of the *next* symbols in core items
      have can_eol set.
      When a core item does not have a 'next' symbol we ignore that
      item.

    Looks good, but now a new issue appears.
    If a production is reduced by an OUT, the following NEWLINE
    doesn't have a home. If we expect an optional newline there, it
    might use the newline before the outdent (which could have been
    ignored otherwise??). And worse, it can create a reduce-reduce
    conflict because the optional newline could terminate either the
    inner or the outer statement.

    This is particularly a problem with
        Block -> : Statementlist
        Statement -> if Expression Block else Block

    After an OUT reduces ": Statementlist" to Block there is a
    NEWLINE, but we need to see the "else". If we indent the 'else'
    it works nicely. I want to be able to say that a NEWLINE is
    needed there. That means I must put one after
    "Statement -> if Expression Block" to avoid a reduce/reduce
    conflict.

    After trying to describe this I need to refine it again.
    - A symbol is "can_eol" if any can_eol symbol in any production
      from this symbol is followed only by nullable symbols.
    - A state is "starts_line" if the previous symbol in any item is
      "can_eol", or is nullable and the one previous to that is
      "can_eol", etc.
    - The start state is "starts_line" if the start symbol is
      "can_eol".
\ No newline at end of file
diff --git a/x b/x
new file mode 100644
index 0000000..eb4bf80
--- /dev/null
+++ b/x
@@ -0,0 +1,4 @@
While I was writing test code for my first toy language I particularly noticed the lack of interesting data structures: to write interesting loops you can do a lot more if you can build data structures while doing it. So my second toy will have some basic data structures.

The first step to this is knowing how to declare variables, or local bindings. Then I'll have something to attach the types of data structures to. So this instalment is about local variables and their scope.
-- 
2.43.0