This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

RFCs

Current and past design documents for features in Bramble and Thorn.

These documents provide in-depth details to the design process for major features added to Bramble and to Thorn.

The numbers used to identify RFCs correspond to the GitHub issue number used to track the feature. Which is why they they may be very large and show gaps between numbers.

1 - RFC 248 - Raw Pointers

How to represent raw pointers in Bramble

Design for Representing Addresses in Bramble

Note: I have used my rough understanding of how to state type rules in order to express the semantic rules for raw pointers. There may be mistakes in my statements due to lack of experience.

Problem

Within the Bramble language there is no representation of address based operations or pointers. This severely limits both what can be built with Bramble and the FFIs that Bramble can call. In order for Bramble to grow, there needs to be a representation of the most basic raw pointer concepts. These can then be used to build not only heap based allocations, but smart pointers, or garbage collectors, etc.

Goal

In this document is the design for how raw pointers will be represented in the Bramble language.

UX

Being danger, the use of raw pointers should be both discouraged except in special cases (features which require raw pointers, FFI uses, and performance) and it should be immediately obvious to a reader that a raw pointer is being used. And the syntax of raw pointers should reflect this discouragement for writers and make raw pointer use obvious for readers.

So, any use of raw pointers should be obviously different from safer uses of reference values (e.g. Rust borrows, smart pointers, garbage collected pointers, etc). This leads to several design goals:

  1. Raw pointer types should be as explicit as possible when written down.
  2. Creating a raw pointer directly to a variable should be obviously different. (e.g. Rust uses the addr_of macro to just get the address and bypass any safety checks).
  3. Dereferences should also be obvious and explicit. Where dereferences of smarter reference types might be implicitly injected by the compiler, for raw pointers they must all be explicitly written.

This document lays out the language features for supporting raw pointers themselves. After this feature is implemented, then the next step is to implement something like unsafe blocks in C# and Rust to further push raw pointers to something that a user must “buy into” and to make them obvious to a reader.

Requirements

  1. Immutability must be respected as far as it can possibly be guaranteed. It is impossible to guarantee it is respected on the other side of an FFI but within Bramble, if a variable is aquired as immutable then it is a compiler error for Bramble to ever allow it to be mutated.
  2. A raw pointer can be assigned either the address of a declared variable or the null value. It cannot be assigned an arbitrary integer.
  3. null is a keyword that can be assigned to or compared with all ref types.
  4. It is possible to get the address of any variable, no matter it’s type.
  5. Function pointers are not covered in this document
  6. The design should discourage their use to only when necessary.
  7. Raw pointers can be used on uninitialized memory. (Currently, Bramble does not allow for uninitialized variables but that will come in the future.)
  8. When a raw pointer is copied it’s value will be copied and there will now be at least two extant raw pointers to a location in memory.
  9. You can get the address of a field in a structure.
  10. with Raw pointers, you can have internal references: that is a field in a struct is a pointer to another field in the same struct.
  11. Comparisons compare addresses. Raw pointers are equal if and only if they point to the same location in memory or they are both null.
  12. There is a way to compute offsets from a raw pointer.
  13. There is a function for getting the size of a type. This is needed for custom memory allocation, computing offsets, etc.
  14. Comparison operators work on raw pointers. (<, >, <=, >=, ==, !=)

Design

Syntax

There are two parts to implementing raw pointers: the reference types, which describe variables that hold addresses, and the operators used to to create and use raw pointers.

Pointer Type

The type syntax is inspired from Rust:

Raw Pointer = *(const|mut) TYPE

Where Type is any primitive or valid custom type, including raw pointers. const and mut must be specified to help make raw pointers as explicit as possible and to discourage their use by making them tedious.

Reference Operators

There are two operators that are needed: one to get the reference and one to dereference a pointer so that a user can access the target variable.

REFERENCE = @(const|mut) (IDENTIFIER[.FIELDNAME]*)
DEREFERENCE = ^EXPRESSION

MUTATE = mut LVALUE := EXPRESSION
LVALUE = [^]IDENTIFIER             // The formal labelling of the LVALUE is new to Bramble

Note: currently braid does not support mutations of fields in a struct. This proposal optimistically hopes that that language change can be fit in here. The LVALUE in Bramble needs to be greatly matured so it can support derefs, fields, array elements, and parens to control order of operations. The LVALUE presented here is still not complete but fully realizing it is outside the scope of this RFC.

The Reference and Dereference operators are unary operators with precedence higher than member access or array access operators.

The Reference operator can only be applied to identifiers or the fields of an identifier that’s a structure as those are the only constructs in Bramble that have physical locations in the computer. Literals, for example, have a location in the machine code data section but are not conceptually something that can be referenced. The reference operator explicitly requires const and mut so that it mirrors the syntax of the raw pointer type names.

@ is chosen as the addressing operator rather than the traditional & to make & available for future use with safer reference/borrowing systems. In C/C++ the & is used to get the address of a variable. In Rust & is used for both borrowing and raw pointers (in Rust the & is cast to a raw pointer and a separate macro must be used to explicitly get the address). For Bramble, I want all interactions with raw addresses/pointers to be as explicit and obvious as possible so the addresses are gotten with @. Then & will be used for safe references which will be what users are encouraged to use.

The @ operator can only be applied to lvalues (identifier or a specific field). This is because the raw pointer logically refers to a specific location in memory and all locations in memory are refered to via identifiers. This mirrors the syntactic rules of C, and is carried into Bramble both because it maps well to the core philosophies of Bramble and because it provides a very strict rule which makes implementation easier.

The Dereference operator is ^, which differs from other languages which use * to dereference pointers. This was chosen to to make the dereferencing of a raw pointer explicitly different from dereferencing safer references or smart pointers. The ^ will probably be used for bitwise xor when that operator is added to Bramble but I believe it will be clear which operator is being used by whether it is unary or binary.

Unlike the reference operator, the dereference operator can, syntactically, be applied to any expression. The requirement being that the expression resolve to a reference type. This is because dereferencing is operating on a value (the address) and the address is expected to have already come from a variable.

This makes using dereferenced identifiers legal as the left value in mutation statement.

Both the reference and dereference operator symbols were chosen to be uniquely applicable to raw pointers and distinct from future, safer, methods of handling references in Bramble (be those GC pointers, reference counters, or a lifetime system a la Rust). The priority here is given to clarity for the reader of code rather than efficiency for the writer.

Operator Precedence

  1. @ and ^ will have the same precedence as unary minus (-) and not (!)

Semantics

Some notes, currently the following types and their uses are semantically legal:

x: i32, y: i32
1. *const *const i32
2. *mut *mut i32
3. ptr: *const *mut i32 :- mut ^^ptr := 5
4. *mut *const i32 :- mut ^ptr := @y

The latter two may seem unintuitive or that they should be illegal. Example 3 is legal because the outer *const refers to the location in memory that is storing the *mut i32, that pointer cannot be changed through indirect means, but because that pointer is an indirect mutable pointer, the location it points to (the i32) can be changed. Example 4 is legal, because the location that the outer pointer points to is mutable (it can be changed to point to a different i32) but the i32 that the inner pointer points to cannot be mutated.

It may make more sense to semantically have an occurance of a *const in a chain of nested pointers to override and turn the entire chain into *const.

Reference Operator

The reference operator can only be applied to an Identifier that is a variable in the current scope. The type of the reference expression is *const T where T is the type of the variable.

The mutable reference operator @mut can only be applied to a variable that is also mutable. The type of the mutable reference expression is *mut T.

mut and const are modifiers on the * reference type.

Deference Operator

T: Is a Type :- R: *(const|mut) T
-----------------------
:- ^R -> T

Mutating a dereference: If the raw pointer is to a mutable location (*mut T) then the value of that location can be mutated via the dereference operator;

let x: i64 := 1;
let mptr: *mut i64 := @mut x;
mut ^mptr := 5;

The type rules are:

mut ^R := V => T: Type :- R: *mut T, V: T

Dereferencing null or a variable whose value is null is undefined behavior.

Assignment Semantics

A variable of type *const T can be assigned a value of type *const T or a value of *mut T.

A variable of type *mut T can only be assigned a value of type *mut T.

let x: i64 := 5;
let mut y: i64 := 10;

let cptr_x: *const i64 := @const x;  // Legal
let cptr_y: *const i64 := @const y;  // Legal

let mptr_x: *mut i64 := @mut x; // Illegal, cannot get mutable address to an immutable variable
let mptr_y: *mut i64 := @mut y; // Legal

let mptr_x: *mut i64 := @const x; // Illegal, cannot get mutable address to an immutable variable
let mptr_y: *mut i64 := @const y; // Illegal, cannot cast a const location to a mutable pointer
Ref Type Variable Mutability

A variable of reference type (immutable and immutable) can itself be annotated as mutable: this means that the address stored in the variable can be mutated it has no bearing on if the location the address refers to can be mutated.

let x: i64 := 5;
let x2: i64 := 5;
let mut y: i64 := 10;
let mut y2: i64 := 10;

let mut cptr_x: *const := @const x;
let mut cptr_y: *mut := @mut y;

mut cptr_x := @const x2;     // Legal
mut cptr_y := @mut y2; // Legal 

Copying References

Copying: a mutable reference can be copied to a variable of type *const or *mut, so long as the underlying type is the same. An immutable reference can only be copied toa variable of type *const.

Examples:

fn write_to(m: *mut i64) {...};
fn read_from(i: *const i64) {...};

let x: i64 := 5;
let mut y: i64 := 10;

let cptr_x: *const i64 := @const x;       // Legal
let cptr_y: *const i64 := @const y;       // Legal
let cptr_x2: *const i64 := cptr_x;  // Legal
let cptr_y2: *const i64 := cptr_y;  // Legal

let mptr_y: *mut i64 := @mut y; // Legal
let mptr_x: *mut i64 := @mut x; // Illegal, cannot get mutable address to an immutable variable
let mptr_x: *mut i64 := cptr_x; // Illegal, cannot get mutable address to an immutable variable

write_to(@mut y); // Legal
write_to(@const y);     // Illegal, type mismatch requires *mut i64 but `@x` has type *const i64
write_to(@mut x); // Illegal, @mut cannot be applied to an immutable variable

read_from(cptr_x); // Legal
read_from(cptr_y); // Legal
read_from(mptr_y); // Legal
read_from(@const x);     // Legal
read_from(@const y);     // Legal
read_from(@mut y); // Legal

Comparisons

Comparisons are done on the addresses stored in the reference variable. Mutable and immutable references can be compared with each other.

Equality

  • == - returns true if the LHS and RHS have the same address value, otherwise false.
  • != - returns true if the LHS and RHS have different address value, otherwise false.

Less Than/Greater Than

Comparison operators will operate on the numerical values of the addresses as if they were integer types.

Offsets

A key part of raw pointers is being able to use them to move through large blocks of allocated memory: e.g. dynamic arrays and so on. On a semantic level this mirrors the array access operation [n] with the key difference being that and offset could be negative.

A similar, but visibly different operator is provided for getting offsets from a given raw pointer (which addresses the need for pointer arithmetic).

ptr<i64> // The <> operator takes a signed 64 bit integer and returns a new address

The offset operator will get an offset address in terms of multiples of the size of the underlying type. The address it will return computes as ptr value + size_of(T) * offset where T is the underlying type of the ptr.

So for a pointer to a i64, the size of the underying type is size_of(i64) = 8 and the offset will move in steps of 8: ptr<2> = ptr + 2 * 8 or ptr<-3> = ptr - 3 * 8.

The <> characters were chosen to make the pointer offset operation visually distinct from the array access operation while still sharing some resemblance because logically they operate in identical manners with the array access being more restrictive in its inputs (they must be unsigned).

Syntax:

PtrOffset = IDENTIFIER<EXPRESSION>

Semantics:

T: Type :- R: *(const|mut) T
----------------------
p: R, o: i64 :- p<o> -> R    // The offset of a *const is *const and the offset of a *mut is *mut

Alternative syntax: Use this as a binary operator: ~ e.g. ptr ~ -5. I don’t like this because it’s very subtle and with some fonts could be confused with -.

One reason to avoid using <...> is it’s a pain to implement in the parser. Perhaps $ or # as binary operators. The semantic rules for typing stay the same but it becomes easier to parse.

Decision

At the time of implementation, I’ve decided to use @ as a binary operator for raw pointer arithmetic. This is because all the other operators (e.g. <> and # and $) could have future uses, the @ is already associated with raw pointers, and this minimizes the amount of special characters which are devoted to raw pointer manipulation.

Member Access on Values of Reference Type

If a reference points to a structure then to access members of the structure you need to dereference the pointer first: *ptr.field or (*ptr).field. To dereference a field in a structure that is a reference: *(ptr.field_ptr).field or (*(ptr.field_ptr)).field.

size_of

The size_of function takes a type and returns the number of bytes allocated to store a type in memory. This takes into account padding bytes that might be added for memory alignment purposes.

Syntax:

SIZE_OF = size_of(TYPE)

Type:

:- size_of(T) -> u64

For simplicities sake, this is operating on the C notion of size where every type as a size. For primitive types with the bitsize in the name, the size of the type shall remain constant across all versions of Bramble and all platforms and architectures.

Insight

The insight should focus on the physical nature of pointers and addresses, as they represent the actual locations in memory and deal with interacting with those physical realizations of the logical variables. So, when possible the visualization and Insight should emphasize that link to the physical resource.

Recorded events: When the @ operator is applied to a variable then:

  1. The lexer will emit an event for the @ token
  2. The parser will emit an event for generating a @ unary node in the AST.
  3. The type resolver will emit an event which includes a reference span to the definition for whatever is the operand of @, because the declaration of that variable is also the aquisition of the physical memory.
  • If the operand is a field of a structure, then it will reference the declaration of that structure variable.
  • The reference type should probably be annotated to make it explicit that this is refering to the same physical location and not that it is using a reference for a compile time decision.
  1. LLVM will emit an event with the LLVM get address instruction.
  2. Record if the pointer is pointing to a type which is memory aligned
  3. When a pointer is bound to a Field in a struct, the node should display the field name. Have a reference to the span for the field’s definition.

When the * deref instruction is applied to an expression:

  1. The lexer will emit an event for the * token.
  2. The parser will emit an event for generating the * unary node in the AST.
  3. The type resolver will emit an event which includes a reference to the definition of where the operand comes from.

This is a bit outside the scope of this RFC, but because this is dealing with language features that surface the physical nature of how variables are stored:

It would be nice to record the memory layout information for all structs and display that in the LLVM tab for the Viewer. It would also be nice to have a compiler flag to print out memory layout table for all types after compilation.

Viewer

Highlight the creation of raw pointers and the dereference of raw pointers in some manner to emphasize that something dangers is being done.

In this case, the Cell of an @ expression or a * expression will be will be colored a light orange as a warning sign.

Tests

Integration Tests

  1. Get Reference to each primitive type
  2. Get Reference to array
  3. Get Reference to structure
  4. Get reference to field in structure
  5. Get reference to element in array
  6. Get reference to reference
  7. Repeat all tests for mutable references
  8. Test passing a reference to a function
  9. Test passing a mutable reference to a function and function mutates value
  10. Return reference from function
  11. Test using malloc and free
  12. Copy struture with reference field => both copies point to the same value
  13. Test derefence of each of the Reference tests
  14. Test mutable dereference of each of the reference tests
  15. Reference to a mutable value, mutate the value, test that the reference reflects the mutation.
  16. test initializiing a reference to null
  17. Test comparing a reference to a null
  18. Test comparing two references compares addresses
  19. Get pointer to first element in the array, then use pointer arithmetic to move to each element of the array.

Implementation

Types

The const keyword will be added to the token set and the lexer.

A new Ref variant will be added to the Type enumeration. The Ref variant will have two fields. One field is_mutable boolean which is false for immutable references and true for mutable references. And a boxed value to the target type that the reference refers to.

The Parser will be updated to parse the ref type annotations according to this grammatical rule:

RefType = *(const|mut) <Type>

*const <Type> will set is_mutable to false and the target type to <Type>. *mut <Type> will set is_mutable to true and the target type to <Type>.

@ Operator

@ will be added to the Lexer and token set as Ampersand.

Two new variants will be added to the Expression enumeration: AddressOf and AddressOfMut.

The parser will follow these rules:

AddressOf = @<Identifier>
AddressOfMut = @mut <Identifier>

Semantic rules will be added of the form:

  1. For AddressOf the operand must be a variable. It can be immutable or mutable.
  2. For AddressOfMut the operand must be a mutable variable.

^ Operator

The ^ token already exists in the Lexer and token set.

A new unary rule will be added to the Parser:

Deref = ^EXPRESSION
DeRefAssignment = mut *EXPRESSION := EXPRESSION

A new variant will be added to the Expression enum. This variant will have a single child, which is the expression it is dereferencing. A new variant for mutations will be added for mutating a reference.

Semantic Rules will be added of the form

  1. For DeRefAssignment, the <Expression> on the LHS must be mutable reference.
  2. For *<Expression> in the LHS, the type of <Expression> must be a Reference type (mutable or immutable). If the type of <Expression> :- *(const|mut) T then the type of *<Expression> :- T

size_of

The size_of function will be implemented using the LLVM API and computed as a static value during compile-time.

https://stackoverflow.com/questions/14608250/how-can-i-find-the-size-of-a-type https://llvm.org/doxygen/classllvm_1_1DataLayout.html#aa48b3b7e554b44f4e513d5dd8d9f9343

Syntactic Fuzzer Changes

  1. Add type annotations for *const T and *mut T. This should be able to use the same logic as arrays.
  2. Add expression generation for taking the reference of an identifier.
  3. Get a raw pointer to a field in a structure.
  4. Add a failure test by generating a reference operator where the operand is not an identifier.
  5. Add deref generation for expressions.
  6. Add deref mutation generation for statement generators.
  7. Compute raw pointer offsets
  8. Mutate dereference to a field name

2 - RFC 257 - Addressable Expressions

How to represent the expressions that can be used on the left hand side of assignments and mutations.

Addressable Expressions

Overview

Allow the Bramble compiler to support more than just identifiers as the LHS of mutation operations and the operand of @(const|mut) operations. Specifically, be able to use array elements and structure fields so that the following would be legal:

mut arr[0] := 5;
mut st.field := -1;
@const arr[0]
@const st.field
mut ^(@mut arr[0]) := 2;

Solution

(The term “Addressable” is inspired by the language used in the Go Lang spec).

The expressions which can be mutated or referenced with @ are expressions which have places somewhere in memory (the stack or the heap). These are called Addressable values. Without a place in memory, obviously, the expression cannot have its value changed or its address taken. Expressions which evaluate to a value are Value Expressions. All addressable expressions are also value expressions: in this case their value is read from their place in memory.

Whether an Addressable expression resolves to its address in memory or its value is context dependent: specifically, the operation and what operand position the expression takes. For example, the LHS of a mutation operation would resolve to an address, but the RHS of the same mutation operation would evaluate to an address.

I propose adding a new is_addressable flag to the SemanticContext of an expression. This will be added before type resolution. During type resolution, the compiler will not only check that the types are correct, but will also check if an operand is addressable for the operations which require addressable values.

In order to support mutation operations, an is_mutable flag will also have be added to the SemanticContext. Currently, this flag only exists in the symbol table for variables, but this cannot support dereference expressions or field access or array indexing expressions. The is_mutable flag will remain in the symbol table and a flag will be added to SemanticContext for mutability of addressable expressions.

The mutability and addressability of an expression will be combined into a single flag with 4 possible states: None|Value|Addressable|AddressableMutable. An expression can only be mutable if it is addressable, so having two separate flags creates risk of accidentally marking a value expression as mutable. None is used both as a place holder for when an expression has not had it’s Addressability computed and for nodes which are not expressions (structure defintions, etc).

A new phase ComputeAddressable will traverse the AST and set the addressability flag for every node in the AST, based upon a set of rules detailed in a following section.

Then in LLVM Generation, a new method will be added to Expression called to_addr -> Option<T>, which will return the address of an Addressable Expression, rather than its stored value, if the expression is addressable otherwise it will return None. This implementation is easily done by simply returning the pointer and skipping the build load operation.

Why this design? This design allows for determining whether expressions can be used for mutations and referencing in a way that is fully independent of the grammar rules or AST design. Which means both: no major refactoring or changes to the AST and no added bounds on how the grammar and AST could evolve into the future.

Why have this separate from type resolution? Type resolution is already large and complex, the danger is that addressability rules checking would get lost in all the type checking logic. Making it easy to break and hard to fix in the future. Having it as a separate step will make it more obvious that it occurs and easier to maintain and grow. It also is logically not the same thing as type checking.

Addressable Rules

All rules here assume that the expressions are of valid types for the operators (e.g. that in ^EXP, EXP is a raw pointer type). The type checking can happen after addressability is determined, because IDENTIFIERs are always addressable and the result of dereferencing operation must be addressable. Addressability can also be done after type checking, in which case it would be combined with checking that the operands for place operators are addressable.

All expressions are Value expressions unless they satisify the Addressable rules below. All other grammar rules are None.

X: IDENTIFIER
--------------
X :- Addressable


X: EXPRESSION :- Addressable
--------------
(X) :- Addressable


X: EXPRESSION :- Addressable, I: EXPRESSION
----------------
X[I] :- Addressable


X: EXPRESSION :- Addressable, F: IDENTIFIER
-----------------
X.F :- Addressable


X: EXPRESSION :- Addressable
----------------
@(const|mut) X :- Addressable


T: Type, X: *(const|mut) T
-----------------
^X :- Addressable


T: Type, X: *mut T
-----------------
^X :- Mutable


T: Type, X: *(const|mut) T
-----------------
X :- Addressable


X: EXPRESSION :- Addressable and Mutable, Y: Expression
----------------
mut X := Y :- Not Addressable

Array elements and structure fields are not de facto addressable. For example, foo().field, the function foo returns a structure and we access the field field. Within Bramble, we logically describe this as a value and not an addressable expression so it cannot be used for @ or mutations. (Even though, in the LLVM IR we generate we actually allocate space in the caller stack frame for the returned structure value, we still logically describe it as a value expression, this is because there is no label associated with it, so it would be hard for a user to mentally or visually track how the value is being used or mutated).

Implementation

Status

Add an enumeration that has the variants: None|Value|Addressable|AddressableMutable Add a field for this enumeration to SemanticContext with default value None.

Can I update canonize/foreach_mut to support POST order execution (as a configurable flag)? If so, then I think I can use that foreach for the traversal? Part of this would be updating foreach to do the Insight event recording too.

It might be easier to start out by adding this into type_resolver then move to its own stage?

Update the grammar rules so that any expression can be used on teh LHS for mutation and that any expression can be used as the operand for @(const|mut).

UX

The only impact on UX will be greater flexibility in what can be mutated and referenced. So, this will move Bramble towards being more intuitive in that regard.

Insights

We should mark what expressions resolve to being addressable and what expressions resolve to being only value expressions. This can be done by merging the events generated by the addressability analysis with semantic node events via spans tests.

A new compiler stage will be added to the trace output stage set: Addressability. This stage will record the assignation of every span in the input source code with whether it is Addressable or not. For each node in the AST that is checked, this will emit an event with the result of the Addressability analysis.

Tests

  1. Test mutating an array element
  2. Test mutating a structure field
  3. Test referencing an array element
  4. Test referencing a field
  5. Test mutating a dereference (mut ^a := 5)
  6. Test mutating an integer => fails
  7. Test mutating (x+5) => fails
  8. Test (x)
  9. Test (x)[5].field, an array of structures
  10. Test (x.field)[5], field is an array
  11. Test ^x.field, field is a pointer
  12. test ^x.field[2], field is an array of pointers
  13. Test ^x[0] array of pointers
  14. Mix in combinations of ()

3 - RFC 288 - MIR - Control Flow Graph

Use a Control Flow Graph as an IR to make implementation of rules that test data flow easier.

Control Flow Graph Intermediate Representation

This RFC proposes using a Control Flow Graph (CFG) IR as a stage between the AST and LLVM forms, because it will help simplify the implementation of key Bramble features: including consistency semantics, ownership rules a la Rust, and automatic injection of destructors for values.

Bramble’s MIR language design was inspired by Rust’s MIR and the design of the conversion from AST to MIR was inspired by LLVM.

Problem

Bramble needs to perform data flow analysis on each function in a program. The analyses needed are: consistency rules checking, automatic insertion of destructors for values, and ownership rules. More analyses may be added in the future.

For each of these analyses, Bramble needs to look at the sequence of operations that are applied to one or more locations in memory, then determine if every possible sequence of operations satisfy some given set of rules. This is falls under dataflow analysis.

While an AST can be used for such analyses, it is not well suited and would make implementation of the analyses more complex.

Requirements

  1. Simplify the forms of analysis which follow a “dataflow” paradigm.
  2. Be able to lower into LLVM IR.
  3. Integrate with Thorn Insights platform.
  4. Be able to represent all Bramble programs in a MIR form.
  5. MIR representations must be fully independent of the preceding AST representation. This means that the AST form of a Bramble program could be deleted after conversion to MIR and the compilation will always be able to complete successfully.
  6. Allow for intra-procedure (local) optimizations.

Solution

Rather than use the AST IR for dataflow analyses, create a new Control Flow Graph IR (henceforth referred to as a MIR), and all dataflow analyses are done against this form rather than the AST form.

The AST will be “lowered” into the MIR form after parsing and type resolution and before transformation into the LLVM IR form. After being lowered, any analyses and optimizations which work best with a CFG are applied. Then the MIR is lowered into LLVM IR form and compilation continues with the LLVM compiler toolset.

Alternatives

The alternative solution is to continue using an AST for all compiler analyses and optimizations. Most compilers do their work off an AST form and only use CFG forms for optional optimizations or linting. The Rust compiler currently uses a CFG IR form as an intermediate step between the AST and LLVM IR; however, its earliest versions used only an AST and the CFG IR was added later to help simplify the ownership rules analysis.

The reason to continue using the AST for all analysis rather than the MIR is to save development time and simplify the implementation of rules and optimizations that focus on the flow of data. It is a fact that all analyses that work on the CFG IR model will work on an AST, but their implementation may be much more complex and fragile.

Problems with using an AST for this analysis include the following. First, all these analyses need to validate some rule holds for one or more variables through every possible path of execution, the AST form makes it difficult to correlate all the different paths when there are branches. Second, the AST form recursively embeds operators as children of operators, this makes it difficult to find the leaf operands (specifically the variables being used), because the desired analyses are focused largely on tracking how each specific variable is used, being able to easily extract those variables from operations is critical for easy implementation. Third, all these analyses will want to know what unbroken chain of operations must be applied to a set of variables (Basic Block), this information is difficult to extract from an AST. All this information is encoded in the AST form, but the AST form prioritizes preserving the hierarchy of operations, which obfuscates the information that the dataflow analyses require; this leads to the implementaion of analyzers being more complex.

The ulimate price of the increased complexity of implentation is reducing the power of the analyzers. As experienced by the Rust team when the lifetime checker was implemented on an AST: having a less powerful analyzer means more blunt enforcement of the rules. This pushes more burden onto the user of the language. In the case of Rust, the “more blunt” analyzer meant that it could not understand more complex or subtle memory operations and would fail code that was technically correct. Using the AST as the basis for dataflow analysis, leads to simpler analyzers, which leads to more false negatives, forcing users to write worse code.

I chose to implement the MIR because significantly reducing the complexity and fragility of the dataflow analyses will be a net long-term benefit for the Bramble compiler. The consistency rules system could be incredibly complex, as it evolves, and a CFG IR model will greatly reduce the complexity of its implementation. Further, many optimizations benefit from the CFG IR model and implementing one now will allow me to start safely adding optimizations to Bramble. Finally, this decision was influenced by taking advantage of Rust’s experience: using the AST IR form greatly limited the power of ownership rules system and moving to a CFG IR was critical for building a more powerful and useable ownership rules model. Given that the consistency rules model that Bramble will implement shares many mathematical fundamentals with Rust’s ownership model and lifetime model; Rust’s experience provides valid insight into a major pitfall Bramble could suffer.

Design

MIR Architecture Model

The Bramble MIR’s design represents a simplified model of the physical memory and CPU instructions. MIR breaks memory down into 3 locations, based upon their lifetime relative to the execution of a program: static, heap, and stack. All operations in the MIR correspond to a subset of the instructions on a CPU and, like a CPU, the results of each single operation must be stored somewhere before being used in a subsequent operation. Further, in MIR, operations are a linear sequence and have no hierarchy.

Static memory represents the “.data” and “.code” (For x86) sections of a program, which correspond the parts of memory where data lives from the moment the program begins execution to when it stops execution. In MIR, the representations of each function are stored in the “Static” memory group. Heap memory corresponds to the heap, and data here will live from the moment it is explicitly allocated until it is explicitly deallocated (currently, no constructs in MIR correspond to or represent heap data). Stack memory, or local memory, corresponds to the stack frame of a function: data in this section will be valid from the moment a function is entered to the moment that instance of the function exits. In MIR, all variables defined within a function (user and temporary) are stored on the stack.

How the MIR uses abstractions to represent different parts of memory.

MIR Components

When lowered into its MIR form, a project is represented as a single MirProgram, which represents all the functions and types in the project. Each function is defined as a set of Basic Blocks. Each user defined type is stored as an ordered set of fields, where each field has a name and a type.

A Basic Block represent the largest set of operations where: if one operation is executed they are all executed and every operation is executed immediately after the preceding operation. Every Basic Block is composed of 0 or more Statements and a Terminator. The Statements represent operations that have results that are stored in memory. Terminators tell where the program will go after the last Statement in the Basic Block is executed.

Basic Blocks.

A visualizaiton of how a function with an If Expression would be represented via Basic Blocks. Each Basic Block is an ordered set of statements ending with a Terminator. The Terminator determines what the next Basic Block to be executed will be.

There is a single statement: Assign. This statement executes an operation and stores the result in a location in memory; this location can be either a user defined variable or a temporary variable for intermediate results. The operations correspond to unary and binary arithmetic instructions available on the CPU. Operations take one or two operands; operands can be either constants or specific locations in memory. Data can only be stored and read as base types (e.g., i32 or f64). For types constructed from base types (arrays and structs), Accessors are used to access the location of a specific element or field within a value of the constructed type.

The Terminator ends every Basic Block and is always executed. The simplest Terminator is the Return, which tells the program to exit the current function and remove its stack-frame from the stack. The GoTo Terminator tells the program to go to the specified Basic Block within the current function. The Conditional GoTo Terminator tells the program to go to one of two Basic Blocks based upon whether a given conditional value is true or false. Finally, the Call Terminator tells the program to call a function and provides a location to store the result of the function.

Lowering AST to MIR

The design for lowering the AST to the MIR resembles the current design for lowering the AST to LLVM IR. There is a set of MIR types that can represent any Bramble program in MIR and implement a CFG. The lowering process will generate a set of CFGs, composed of MIR values, that each represent a function from the input AST. The process uses a MIR Builder, which constructs the MIR form of a single function, and a Traverser, which traverses the AST and uses the Builder to convert types and functions into MIR forms.

The MirProgram type manages all custom types defined by the user and the MIR representation of each function. This is the final output of the lowering process. Subsequent operations that apply to the MIR will take the MirProgram as input.

The lowering process begins by traversing the AST for every user defined type and adds the type to the Type Table in MirProgram. Types are converted to MIR first, so that when they are encountered while lowering function definitions, their definitions can be easily found. This also separates responsibilities and leads to simpler code.

The Type Table assigns a unique integer identifier to each type and this identifier will be used within the MIR to represent types. This makes representation smaller in memory and makes programming easier and execution faster. Every type is represented by a single integer, so passing to other functions can be done by copying instead of cloning. Likewise, comparison will amount to a single integer comparison instruction. This will also lead to better Rust code.

How the MIR Builder manages types.

Note: The TypeTable is not strictly required: the MIR could continue to use Paths and the Structure Definitions in the AST form for referencing and looking up types. However, using an integer TypeId to identify types results in significantly simpler Rust code and is a better design overall. By taking this opportunity to add in the TypeTable we can begin to reap the benefits of this design and then, in the future, move the construction of the TypeTable to an earlier stage in the BrambleC pipeline so that the AST also uses TypeIds rather than the cumbersome Type enum.

After all types are lowered to the MIR, the AST is traversed again, and the name and signature of every defined function is added to the set of functions in MirProgram (without the MIR definition of the function). This is because functions are identified using integers in the MIR and we need to know the identifier for every function before lowering the function definitions, otherwise, there may be a call to a function that has not yet been lowered and it will not have an assigned function identifier.

After adding every function name and signature to the MirProgram and assigning them identifiers, the lowering process transforms the definition of each function from an AST to a MIR. A post order traversal of the AST of the function is performed and the Builder type is used to construct the MIR representation. After the function’s AST has been fully traversed the MIR is finalized and added to the MirProgram as the definition of the function. The lowering process is complete when every function has been converted into a MIR form.

How the MIR Builder manages functions.

Lowering MIR to LLVM IR

After completing every test and transformation which relies upon the MIR form, the compiler lowers the final MIR into its LLVM form. The lowering process is split into two components: the Traverser and the Builder. The Traverser iterates over each function and custom type, and calls methods on the Builder to construct the LLVM IR. At the completion of the lowering process there will be an LLVM Module that contains all the types and functions in the Bramble project.

The Builder provides a set of methods that correspond to the MIR instruction set. The Traverser calls these methods as it encounters the corresponding MIR instructions. The methods cause the Builder to create the LLVM IR instructions that correspond to the Bramble MIR instructions.

The Builder is responsible for generating the labels for all custom types and all functions, that are used within the LLVM IR and, ultimately, the generated assembly code.

References

The main inspiration for the MIR was from reading through the Rust Compiler source code and this blog post: https://blog.rust-lang.org/2016/04/19/MIR.html. Much of the Bramble MIR design is directly inspired by Rust’s MIR, though a significantly simpler form.

The second source of inspiration was “Engineering a Compiler” by Keith Cooper and Linda Torczan. The chapters on dataflow analysis and control flow graphs directly led me to believe those concepts the best for implementing the consistency rules system for Bramble.

In addition, I looked at Clang and Rosalyn to understand how those compilers used Control Flow Graphs and how they designed their CFG models.

The following links were used during research for the MIR design:

  1. http://s4.ce.sharif.edu/blog/2019/12/31/clang/
  2. https://homepages.dcc.ufmg.br/~fernando/classes/dcc888/ementa/slides/ControlFlowGraphs.pdf
  3. https://eli.thegreenplace.net/2013/09/16/analyzing-function-cfgs-with-llvm
  4. https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.flowanalysis?view=roslyn-dotnet
  5. https://github.com/dotnet/roslyn-analyzers/blob/main/docs/Writing%20dataflow%20analysis%20based%20analyzers.md
  6. https://github.com/dotnet/roslyn/blob/1deafee3682da88bf07d1c18521a99f47446cee8/src/Compilers/Core/Portable/Operations/BasicBlock.cs
  7. https://github.com/rust-lang/rust/blob/f7ec6873ccfbf7dcdbd1908c0857c866b3e7087a/src/librustc/mir/repr.rs