Thinking about types.

I’m still a long way from implementing any of this but I keep thinking about the sorts of types that I want “ocean” to have, so I thought I would write stuff down in the hope that it will stop looping around my brain.

Firstly, types will have names. Names will be global to a module (probably a file) and will live in a separate name-space from the names of objects like variables and functions. So yes, that means that types aren’t first class objects. Type equivalence will be name equivalence: if you have two types with the same structure and different names, they are different types.
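As a rough illustration of name equivalence, here is a sketch in Python rather than ocean (which doesn’t exist yet). The classes “Metres” and “Feet” and the two helper functions are made up for this example; they only show that two types with the same structure can still be kept distinct.

```python
from dataclasses import dataclass, fields

# Two hypothetical types with identical structure but different names.
@dataclass
class Metres:
    value: float

@dataclass
class Feet:
    value: float

def same_type(a, b) -> bool:
    """Name equivalence: compare the types themselves, not their fields."""
    return type(a) is type(b)

def same_structure(a, b) -> bool:
    """Structural equivalence: compare field names and field types only."""
    shape_a = [(f.name, f.type) for f in fields(a)]
    shape_b = [(f.name, f.type) for f in fields(b)]
    return shape_a == shape_b
```

Under name equivalence a “Metres” value is never accepted where a “Feet” value is wanted, even though a structural comparison would say they match.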

It may be possible to have anonymous types in some circumstances, and they may well support structural equivalence with named types, but that will only happen if it makes it easier to write code … it is just a vague idea at the moment. Anon types would be a bit like under-specified constants – like 45.0 might be int or float or something else. Type analysis will decide what it must be, then it will always have been that.

There will probably be cases where the syntax allows either an object name or a type name, though only one of those would survive further analysis. In these cases we might want them to be syntactically different. I think I will use a “:” prefix to indicate a type. Specifically, some types will accept parameters, both constants and types being likely sorts of parameters, and variables being possible. To pass “4” and “int” you would write “typename(4, :int)” or similar.

Scalars

The first group of types are scalars. These are simple things that the language knows about and that are always copied (i.e. no attempt is ever made to track multiple references to the one scalar) and they usually fit in a machine register.

Integers can be signed “int” or unsigned “uint”. These are probably 32bit – maybe 64. If you want a particular size, you can have i8, i16, i32, i64, u8, u16, u32, u64. I will probably support “int(N)” meaning an integer ranging from -N-1 to N, and similarly “uint(N)” for an unsigned integer up to N. “byte” might be a synonym for u8.

Floats are “float” or “float64” or “float128” etc.

A “number” will be a fraction with arbitrary large numerator and denominator.
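Python’s standard “fractions.Fraction” behaves much like what I have in mind for “number”, so it makes a convenient stand-in for a sketch. The variable names here are only for illustration.

```python
from fractions import Fraction

third = Fraction(1, 3)
sixth = Fraction(1, 6)
total = third + sixth            # exact arithmetic, no rounding

# Numerator and denominator grow as large as they need to.
tiny = Fraction(1, 3) ** 50

# The same sum that goes wrong in binary floating point stays exact here.
exact = Fraction(1, 10) + Fraction(2, 10)
```

The cost is that values can grow without bound, which is why this would be a separate type from the fixed-size floats.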

“Boolean” or “Bool” is “True” or “False” and “Order” is “Less” or “Equal” or “More” or something like that.

A “char” is a UNICODE codepoint.

Structures

A “struct” is a collection of fields. Each field is declared much like a variable as “name:type=value” though the initial value is optional.

struct complex:
     real:number=0
     imaginary:number=0

might define a struct. A struct is for use inside a program only – never for export. The compiler is free to change the order of fields to improve alignment without wasting space. It is not possible to cast a pointer to a structure into some other sort of pointer – the language owns the internals, not the programmer.

A field in a struct can be named “_” (a single underscore) in which case it is treated as anonymous. If it is a scalar, then it must be the only field and the struct works a bit like a typedef in C. If it is a struct, then all the fields in that struct are imported into the parent, and there must be no name conflicts. If it is an array, then it must be the only anonymous array, and it is used whenever an array index operation is attempted on the struct.
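The anonymous-array case can be imitated in Python by forwarding index operations to the one unnamed field. This is only a sketch of the intended behaviour; the class “Samples” and its “label” field are hypothetical.

```python
class Samples:
    """Hypothetical struct whose single anonymous array field backs indexing."""

    def __init__(self):
        self._ = [0.0] * 12       # the anonymous "_" array field
        self.label = "receipts"   # an ordinary named field

    def __getitem__(self, index):
        # Indexing the struct is indexing the anonymous array.
        return self._[index]

    def __setitem__(self, index, value):
        self._[index] = value

year = Samples()
year[2] = 150.0
```

Named fields are still reached with dot notation, so “year.label” and “year[2]” can sit side by side.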

Fields are accessed with standard dot notation, “foo.field” is a field of “struct foo”.

Records

A record is similar to a structure, but the internal layout is under programmer control. The way the data is stored in memory is well defined, so that memory can be written to a file or sent over a network or similar. A record can be declared to be “big endian” or “little endian” or “host endian” — though I don’t know yet what the default is. This applies to all fields in the record. If you want different fields to have different endianness, then you need a sub-record which is declared differently. The endian in the outer will set the default for the inner, but will not over-ride an explicit setting for the inner.
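Python’s “struct” module gives a feel for what a record with declared endianness looks like in memory. The field layout below (a 16-bit id followed by a 32-bit length) is invented for the example, and the last line imitates a little-endian sub-record inside a big-endian record.

```python
import struct

# Hypothetical record: a 16-bit id followed by a 32-bit length.
record_id, length = 0x0102, 0x0A0B0C0D

big = struct.pack(">HI", record_id, length)      # big endian, no padding
little = struct.pack("<HI", record_id, length)   # little endian, no padding

# Big-endian outer record with an explicitly little-endian inner part.
mixed = struct.pack(">H", record_id) + struct.pack("<I", length)

# The layout is well defined, so the bytes can be read straight back.
decoded = struct.unpack(">HI", big)
```

The “>” and “<” prefixes also switch off the host’s alignment padding, which is the sort of control a record would need to expose somehow.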

Charsets for strings and alignment and padding will probably also be controllable somehow.

While a struct can contain anything, a record cannot — it can only contain well-defined things. So a record can contain fixed sized ints and Booleans and chars and other records. They can also contain arrays of these things. They cannot contain pointers or structs or other more esoteric things that we haven’t met yet.

Because the representation of a record is well defined, it is possible to cast the address of a record to a pointer to an array of bytes, or to anything else that is well defined.

Arrays

An array cannot exist as a named type, but a variable or a field in a record or struct can be an array. If an array is an anonymous field (named with an underscore), the struct will appear for many practical purposes just like an array. An array is declared as [type:length] so:

struct months:
    receipts:[float:12]
    expenses:[float:12]

is a struct containing two twelve-element arrays. Array elements are indexed using standard name[index] notation.

Classes

A class is like a struct, but it can also have methods. A struct can hold function pointers in it which are a bit like methods, or it can hold a pointer to a separate struct of function pointers. A class might use either of these techniques, or it might do something else. It allows methods to be used, but leaves it up to the compiler to worry about implementation details.

Some sort of mechanism will be provided for declaring interfaces and sharing implementations, but I haven’t thought much about what this will be yet. I do expect there to be several internal implementation options, and that the programmer will have some opportunity to suggest a preferred approach.

One approach is that any struct can be “classified” (made into a class) by providing a set of methods. Pointers to objects in the class would then be implemented as fat pointers – two pointers together, one to the data, one to the implementation. This is essentially the interface used by the C-library “qsort()” function, which takes a pointer to the data and a pointer to a comparison function.
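A fat pointer can be sketched in Python as a pair of references: one to the data and one to a table of methods. Everything named here (“FatPointer”, “call”, the “point” data and its method table) is hypothetical, and a real compiler would be free to do something quite different.

```python
from typing import Any, Callable, Dict, NamedTuple

class FatPointer(NamedTuple):
    data: Any                                # reference to the object's fields
    methods: Dict[str, Callable[..., Any]]   # reference to the implementation

def call(ptr: FatPointer, name: str, *args):
    """Dispatch a method through the implementation half of the pointer."""
    return ptr.methods[name](ptr.data, *args)

# A plain "struct" (a dict here) and a separately supplied set of methods.
point = {"x": 3, "y": 4}
point_methods = {
    "norm_squared": lambda p: p["x"] ** 2 + p["y"] ** 2,
    "scale": lambda p, k: {"x": p["x"] * k, "y": p["y"] * k},
}

ptr = FatPointer(point, point_methods)
```

The data itself carries no hidden pointer to its methods, so the same struct could be “classified” in more than one way.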

Some fields in a class will be “private” to certain methods, others will be public parts of one or more interfaces.

Pointers

There will be a number of different sorts of pointers. Some of them will imply “ownership” of the referenced object, and some won’t. Different sorts of ownership will be supported.

In the first instance I suspect that the only sort of ownership that will be supported is refcounting – so only classes and structs with an identified ref counter can be owned. Non-owning (borrowed) references will only be valid while some other designated owning reference remains valid. For example, a borrowed reference can point to a member of a structure as long as there is a valid pointer to the structure that is borrowed from.
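To make the ownership rules a little more concrete, here is a toy model in Python. The class “RefCounted” and the function “borrow” are made up, and the checks happen at run time, whereas in ocean I would want the compiler to enforce them.

```python
class RefCounted:
    """Hypothetical object carrying its own reference counter."""

    def __init__(self, payload):
        self.payload = payload
        self.refs = 1             # the creator holds the first owning reference
        self.alive = True

    def retain(self):
        """Take another owning reference."""
        if not self.alive:
            raise RuntimeError("cannot take a reference to a freed object")
        self.refs += 1
        return self

    def release(self):
        """Drop an owning reference; the last one frees the object."""
        self.refs -= 1
        if self.refs == 0:
            self.alive = False    # stand-in for freeing the memory

def borrow(owner: RefCounted):
    """A borrowed reference is only usable while an owner keeps the object alive."""
    if not owner.alive:
        raise RuntimeError("borrowed reference outlived its owner")
    return owner.payload
```

A borrowed reference never touches the counter, which is what makes it cheap and also what makes it dangerous if the owner goes away first.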

Later I hope to allow owning references that have an implicit refcount of one, and probably other variations.

Pointer arithmetic will not be supported. If you want to do arithmetic on memory addresses, you need an array. Pointers can only point to scalars and to structs/records/classes. In particular you cannot have a pointer to a pointer, though you could have a pointer to a struct containing just a pointer.

If “foo” is a pointer then most accesses to “foo”, including array member access and field access, access the thing that “foo” points to. Only assignment modifies the pointer itself. If you want to modify the whole of the thing pointed to by a pointer (which is a structure or similar) then the “copy” or “swap” statement will be used. I imagine “copy” and “swap” to be statements in the core language which take 2 variables (or fields or similar) and copy or swap the content. That would mean that swapping pointers isn’t easy … I wonder if that matters.

There is probably a lot more to say about pointers, but their time hasn’t really come yet.

Enums

Enumerated types bother me. In C, the values in the type are global names, which feels a bit like name-space pollution. I could require a “type.” prefix, and it is not uncommon to see that sort of thing used in C – a common prefix for an enumeration – but it still feels a bit clumsy. It also introduces the typename into the object namespace, which I didn’t want. I’ll probably need to try things out and see what works. Possibly “:name” will find an enum with that name in any known type, and “:type:name” will disambiguate, when needed.

I suspect enums will look a lot like structs except that the names will be constants, not variables. No type will be needed and the value will still be optional.

In C we often want an enumeration of bits in a bit-field and Go has a syntax to make this easy – it seems like a hack to me though. I suspect I’ll just make the issue irrelevant by making such things unnecessary. One option is a “#” prefix operator which converts a number to a bit, so “#BUSY” is the same as “(1 << BUSY)”. Another option is to have infix operators which operate between a bitset and a bit, so “flags +/ BUSY” and “flags -/ BUSY” will set or clear the “BUSY” bit.
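Both options are easy to mimic with ordinary functions, which at least pins down what the operators would mean. The function names and the bit numbers chosen for “BUSY” and “DIRTY” are assumptions for this sketch.

```python
BUSY = 3    # hypothetical enum value naming bit number 3
DIRTY = 0   # hypothetical enum value naming bit number 0

def bit(n: int) -> int:
    """Stand-in for the proposed "#" prefix: bit number to mask."""
    return 1 << n

def set_bit(flags: int, n: int) -> int:
    """Stand-in for "flags +/ n"."""
    return flags | bit(n)

def clear_bit(flags: int, n: int) -> int:
    """Stand-in for "flags -/ n"."""
    return flags & ~bit(n)
```

With the infix form the enum can hold plain bit numbers and the programmer never writes a shift at all.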

Functions and procedures

Functions can be used in arbitrarily complex expressions so they really need to return precisely one value. Procedures can return any number of values so they can only be called in more restricted contexts. I think I want to maintain that distinction that Pascal had, rather than being like C and pretending they are all the same.

A function will be “name(parameters):return_type” while a procedure will have no return type, but (optionally) a second set of parameters separated by “::”. When calling a procedure, a multi-variable assignment statement can be used to collect the return values rather than passing them as special parameters. This can only work if all the names are being declared at this point, or if none of them are. I wonder if that is too restrictive.
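Python doesn’t make the function/procedure distinction, but tuple assignment is close to the calling style I have in mind for procedures. The names “square” and “div_mod” are just examples.

```python
def square(x: int) -> int:
    """Function style: exactly one result, usable inside any expression."""
    return x * x

def div_mod(a: int, b: int):
    """Procedure style: two results, collected by a multi-variable assignment."""
    return a // b, a % b

total = square(3) + square(4)          # functions nest freely in expressions
quotient, remainder = div_mod(17, 5)   # both names are declared at this point
```

Here both “quotient” and “remainder” are new names, which is the all-or-none case described above.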

Parameterized types

On top of all this, I want parameterized types – both integers and other types will be appropriate parameters, and when describing a function signature, there might be unbound types for which only an interface is given. Lots to think about there.

Error types

In the Linux kernel we have a practice where a pointer variable can hold an error code instead. An address with a signed-number equivalent between -1000 and 0 is treated as an error. The same thing can be done with positive numbers meaning success and negative meaning an error. Floating point has a somewhat-similar concept where a specific value – NaN – is not a number but is actually an error.
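The pointer trick can be modelled with plain integers. This sketch assumes 64-bit addresses and takes the limit of 1000 from the range quoted above; the function names are loosely modelled on the kernel’s helpers but this is not the kernel’s code.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1
MAX_ERR = 1000   # assumption: the range quoted in the text above

def err_ptr(code: int) -> int:
    """Encode a negative error code as an address-sized word."""
    if not -MAX_ERR <= code < 0:
        raise ValueError("error code out of range")
    return code & WORD_MASK   # two's-complement wrap to the top of the space

def is_err(word: int) -> bool:
    """True if the word lies in the reserved top slice of the address space."""
    return word >= (-MAX_ERR & WORD_MASK)

def ptr_err(word: int) -> int:
    """Recover the signed error code from an error word."""
    return word - (1 << WORD_BITS)
```

It works because no valid object ever lives in the last few addresses, so those bit patterns are free to carry an error code instead.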

This is very powerful, particularly for function return values. You effectively get a cheap discriminated union which is either a useful value or an error. Providing the caller always checks for an error, things work nicely.

I would like to support this natively in ocean, at least for pointers and numbers that don’t use the full range of the bits available. Some sort of type annotation would say that an error code can be encoded in spare parts of the bit-space. A simple ‘is_err’ test could be used on any error-enhanced type to see if an error is present. The compiler would refuse to let an error-enhanced value be used until the error status has been tested. If I end up adding exception handling, the use of an erroneous value could trigger an exception.
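The “must test before use” rule can be imitated at run time with a small wrapper. This is only a sketch: the class “Checked”, the function “parse_port” and the error code -22 are invented, and in ocean the check would be made by the compiler rather than by raising exceptions.

```python
class Checked:
    """Hypothetical error-enhanced value: holds a value or an error code."""

    def __init__(self, value=None, error=None):
        self._value = value
        self._error = error
        self._tested = False

    def is_err(self) -> bool:
        """Test for an error; this also unlocks access to the value."""
        self._tested = True
        return self._error is not None

    @property
    def value(self):
        if not self._tested:
            raise RuntimeError("error status was never tested")
        if self._error is not None:
            raise ValueError(f"use of erroneous value (error {self._error})")
        return self._value

def parse_port(text: str) -> Checked:
    """Illustrative producer: returns a port number or an arbitrary error code."""
    if text.isascii() and text.isdigit() and int(text) < 65536:
        return Checked(value=int(text))
    return Checked(error=-22)
```

A caller writes “port = parse_port("8080")”, tests “port.is_err()”, and only then reads “port.value”; skipping the test is an error even when the value happens to be good.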

Strings

I definitely want ocean to support strings natively, but that is hard — at least the witness of Python 3 seems to suggest that it is hard.

I think I want strings to be utf-8 encoded with a length (rather than nul termination), though there is a strong case for utf-16 in some cases. Working in the ASCII subset needs to be trivial. Probably the difficult part is understanding what an iterator looks like, and whether there need to be different sorts of iterators – code bytes, code-points, graphemes, something else. In the first instance, strings will be utf-8 with concatenation only. When I need more, I’ll have to invent something.
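The different sorts of iterator really do give different answers for the same string, as a few lines of Python show. The sample text is my own choice; Python’s standard library has no grapheme iterator, so only bytes and code-points are demonstrated.

```python
text = "cafe\u0301"   # "café" written as "e" plus a combining acute accent

utf8 = text.encode("utf-8")   # the stored form: utf-8 bytes with a known length
code_bytes = list(utf8)       # iterate by byte
code_points = list(text)      # iterate by Unicode code-point

# In the ASCII subset bytes and code-points coincide, so it stays trivial.
ascii_only = "ocean".encode("utf-8")
```

A reader sees four characters in “text”, yet it holds five code-points stored in six bytes, so none of the cheap iterators matches the human notion of a character.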
