This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Documentation

Bramble is a compiled, statically typed language whose core goal is to develop language abstractions which allow a developer to explicitly define how they logically want their code to work while acknowledging the consequences of code be executed by a CPU which exists in the real world.

This goal was inspired while learning the Rust programming language and realizing that lifetime’s in Rust are semantic representations of the stack frames created by the CPU while executing a program. Rust made explicit something which was implicit in languages like C and complete hidden in GC languages like Java and C#. Through lifetimes, Rust guides the developer into explicitly thinking about the physical way a program is executed. Bramble’s core philosophy is to take this approach as far as possible.

1 - Overview

The Bramble Programming Language

What is Bramble

Bramble is a AOT compiled language designed from the gound up to help developers write optimized, efficient code. There are two major prongs of design to accomplish this goal:

  1. Amplification of the mechanics of the actual physical CPU, memory, and computer.
  2. Maximum transparency and feedback to what the compiler is doing and why.

2 - Getting Started

How to build and use Bramble

Instructions for building and using Bramble and Thorn.

Building the Bramble Compiler

Requirements

Rust

Bramble is written in Rust and to build the compiler you will need Rust and Cargo installed. The easiest way to setup Rust is to get rustup and install the latest stable toolchain.

LLVM

Bramble uses LLVM as a compiler backend, so you will need to install the LLVM libraries on your machine. Bramble has been tested against LLVM 11 and 13.

From Source
  1. You can go to the LLVM site and follow the instructions to build and install the libraries yourself.
MacOS

Using Homebrew:

brew install llvm
Fedora

Using DNF:

dnf install llvm-devel
Debian & Ubuntu

Follow the instructions provided on this site.

gcc

The Bramble make script, make.sh, uses GCC as a linker between compiled binaries.

Building

After Rust has been installed.

  1. Clone the Bramble repository:
  2. Build and test Bramble by running cargo test. This will build Bramble and execute all the unit tests.
  3. Build a release version of Bramble: cargo build --release
  4. Execute integration tests: go into the ./tests directory and run ./tests.sh. This will execute all the integration tests for Bramble and will validate that both the compiler and LLVM are functioning properly.
  5. Use ./make.sh to compile and run your own projects.

Hello World

  1. In the ./tests/scratch/ directory create a file named hello.br.
  2. Open hello.br in a text editor and write the following:
fn my_main() -> i64 {
  project::std::io::write("Hello World\n");
  return 0;
}
  1. Go to ./tests/ and run ./make.sh -i ./scratch/hello.br. This will compile and execute hello.br

Using Thorn Insights

Requirements

  • Rust: Latest Stable Toolchain
  • NodeJS v16+

Building

Thorn insights consists of two components: the thorns server and the viewer. You run the thorns server from the root of your project directory and then start the viewer which will connect to thorns and let you explore your insight data.

Thorns

To install Thorns, clone the GitHub repository and go to the repo directory.

From there run the following:

cd thorns
cargo install --path .

This will create a release build of thorns and install it into the Rust binaries set.

Then go to your Bramble project root and run thorns. This will automatically host the insights data that has been output to ./target.

Viewer

After the you have started the thorns server. Go to the Thorns repository directory and:

cd viewer
npm start

This will start the React based Thorn Viewer, which will connect to thorns and let you interact with the insights data.

2.1 - The Language

How to learn the language

Documents about the language itself.

Learning the Language

The documentation for the language itself is in the process of being written. Bramble is currently both very young and very similar to Rust, so it’s easy to learn. As such, for the launch of this website, the documentation has focused on building the compiler and unique features of the language (such as Thorn). In the near future, full documentation will be made for the language but for its current state and to cover through every evolution.

The best place to start for learning the current Bramble langauge rules is to look at the large suite of integration tests in the Bramble repository: ./tests/src. These integration tests are categorized by the language feature they are testing. So, they give both a reference for all the features and examples of how to use those features.

2.2 - Testing Bramble

How to test the compiler.

How to test the compiler source code.

The Bramble project comes with a suite of testing tools to help make the development process easy.

The tests are broken down into three categories: unit tests, integration tests, and fuzz testing. Integration tests are tests which compile and execute Bramble source code and validate the output. The Fuzz tester is a specially built tool which randomly generates Bramble source code for the Bramble compiler.

Unit Tests

The unit tests for Bramble are all included in the Rust source code, and can be identified by the source code file name containing the word test. Unit test modules are kept in the same directory as the module they are testing, rather than in a fully separate test module.

To run the Bramble unit tests, use the Cargo test command:

cargo test

If you want to run the tests against the release build:

cargo test --release

Integration Tests

All integration tests are run via a shell script; this makes it easy to run all the integration tests. The shell script will build a release version of Bramble and use that binary for all integration tests. The release build is used to make execution of the tests as fast as possible.

The integration tests are all stored in the ./tests with the source code for each test stored in ./tests/src and the scripts to run the tests in ./tests/. There are about 100 integration tests under the ./tests/src/ directory which the integration test will use for validation.

To run the integration tests, go to ./tests/ and then run ./test.sh.

There are three types of integration tests:

  1. Positive integration tests: these test that features build and run correctly.
  2. Failure tests: these test compiler errors.
  3. Import tests: these test the compiler importing one Bramble project into another Bramble project.

Adding an Integration Test

All that is necessary to create a new integration test is to add the following files to the ./tests/src directory.

Integration tests consist of 2 to 3 parts:

  1. The test file or project itself. This is one or more Bramble source code files.
  2. The expected output file. This what the test source code should output if it runs correctly. This file has the same name as the test code but with .out as the extension.
  3. Optionally, input data for tests that read from the console. This file has the test name with .in as the extension.

The test.sh script will automatically detect and execute your file. If you inlude an input file then the script will automatically pipe each line from the .in file to your test project.

Fuzz Testing

Testing all the possible syntactically valid Bramble expressions would be incredibly difficult if done manually. To avoid that work, a special Fuzz testing tool was written which randomly generates both syntactically valid and invalid Bramble code and tests that the compiler correctly handles it. Valid code should parse successfully and invalid code should generate a compiler error.

To run the fuzz tester:

cd ./tests
./test-syntax.sh

Currently, this only tests syntax and not semantics. The code it randomly generates will have undefined variables, functions, etc. As such, this fuzz tester only tests that Bramble can parse the code. The next step for the fuzz testing tool is to build a fuzz tester that generates semantically valid code which can fully compile and be executed.

3 - Thorn

The Thorn Insights Platform

How Thorn provides deep insights into what the Bramble compiler is doing.

Thorn is the insights platform that enables exploration into exactly what the Bramble compiler does with your code and why. It is inspired by the question of what would happen if a compiler was built with the goal of providing Compiler Explorer out of the box and times a thousand.

Currently, Thorn is in the MVP stage. It comprises a small project server that allows queries into the audit data generated by the Bramble compiler when insight recording is turned on and a web front end that allows you to explore that data in a pleasant visual form.

To tell Bramble to generate a Thorn insight record, simply add this CLI parameter: --json-trace. The Bramble compiler will then save a JSON file with the full insight record to the ./target/ directory.

Thorn Features

  1. Here’s a visualization of the AST generated by the Bramble Compiler for a small program:

    The user has selected an expression and the corresponding subtree has been highlighted. Helping the viewer investigate how the compiler sees their code.

  2. And here’s an example of the type resolution for every expression in a program:

    Here, Thorn provides a breakdown of how the type of the selected expression was resolved. The declaration of `pp` is highlighted in yellow to tell the user that Bramble used the delcaration to provide context when it was resolving the type.

  3. And here’s a beakdown of how the same expression was transpiled into LLVM:

  4. And a different view showing the LLVM IR for the selected function:

4 - Concepts

Guiding ideas for Bramble and Thorn

Design principles which are guiding the development of Bramble and Thorn.

4.1 - Projects

How projects work in Bramble.

How projects work in Bramble.

Projects

A project can be either a single Bramble source code file or a directory containing one or more Bramble files.

To compile a single file, pass the pass to the file itself to the compiler. To compile a multi-file project, pass the path to the projects root directory to the compiler.

Importing a Project

To import an external Bramble project into your project you will need to manifest and the binary of the import target project.

Referring to Imported Items

To use an imported item in your project, you will have to use the canonical path to the item. This path is specified with

project::<project name>[::<sub directory>]*::<file>[::<module>]*::<item>

Allowing use statements to create aliases and remove the need for full paths to imported items is a QOL feature that has not yet been added to Bramble.

Manifests

If you want to use a project as a library in another project, you need to generate a project manifest. This manifest is a file which includes information about exported functions and types.

To generate a manifest, pass the --manifest flag to the compiler.

Importing

The manifest is then passed to the compiler so that the compiler will add the items from the other project to your current project as external imports.

Linking

Bramble currently uses gcc to perform linking. This is the final step for handling imports. You will need to pass the binary for your project and the binary for the imported project to gcc and it will output the final executable file.

Note: that only one binary can include a main function.

Examples

There are two examples of multi-file projects in the Bramble integration test suite: ./tests/src/projects/basic and ./tests/src/projects/nested.

All integration tests import the std::io project and the ./tests/make.sh script shows how to build a project with imports.

4.2 - Modules & Paths

Using Modules to Organize Code

Using Modules to Organize Code

Modules

A module is a purely organizational structure in Bramble. They allow you to group related functions and types together under a shared name.

Accessibility

The concepts of public and private and accessibility are not yet implemented in Bramble.

Default Modules

In Bramble, there are a few default modules that always exist and help to organize your code.

  • The project module. This is the parent module and corresponds to your project as a whole. There is only one project module in any given project. The project module is also what you import when you import another Bramble project into your code.
  • The file module. The file module is a module that represents a source code file and it has a name that is the same as the file name.
  • directory modules. If there are subdirectories in your project, those will be added as modules which contain all the times in those subdirectories.

Custom Modules

Within your own code, you can use mod to create modules that fall under the file module.

Using Paths

There are two types of paths in Bramble: relative and canonical. The relative path is a path that gives directions to an item starting from the current module, this path is dependent upon where it is used. The canonical path starts from the project root. This path will always point to the same item no matter where in the source code it is used.

Path Keywords

There are several path keywords.

  • project: all canonical paths begin with the project keyword.
  • self: self is an alias for the current module.
  • super: move to the current modules parent module.

Example

Here is an example of a module with another module in it:

mod my_mod {
    fn is_3(i: i64) -> bool {
        let x: i64 := self::inner::add(1, 2);
        return i == x;
    }
        
    fn plus(a: i64, b: i64) -> i64 {
        return a + b;
    }
    
    mod inner {
        fn add(a: i64, b: i64) -> i64 {
            return super::plus(a, b);
        }
    }
}

If this source code were contained in the project: example in the file foo.br, then the project module is example and the file module is foo. The canonical path to add is project::example::foo::my_mod::inner::add.

In this example, self::inner::add and super::plus represent relative paths.

4.3 - Types

Basics of Bramble Types

Basics of Bramble Types

Primitives

Keyword Description
bool boolean
i8 signed 8 bit integer
i16 signed 16 bit integer
i32 signed 32 bit integer
i64 signed 64 bit integer
u8 unsigned 8 bit integer
u16 unsigned 16 bit integer
u32 unsigned 32 bit integer
u64 unsigned 32 bit integer
f64 64 bit float
string C Style String
null A special type that corresponds to the null keyword. It represents a pointer that is not pointing to any address.

Arrays

Bramble supports staticly sized arrays of any type. The syntax for an array type is [<type>; <size>]. For example, [i32; 4] creates an array of 4 i32 values. Getting an element in array is done via the [] operators. For example, arr[0] will return the 0th element of the array arr.

Example:

    let arr: [i64; 2] := [0, 1];
    let x: i32 := arr[0];

Structures

A structure is a named aggregate type consisting of 0 or more named fields, where each field as a type. Fields can be primtive, arrays, or structures.

struct MyStruct {
    field: i32,
    foo: AnotherStruct,
    barr: [i8; 5],
}

To access a field in the structure, use the access operator: foo.bar. For an array of structures you would use arr[0].bar. For a structure in a structure: foo.bar.fizz.

Raw Pointers

Like most modern languages, Bramble will have a safe system for handling references, but to build that system and to support FFI a form of raw pointers is necessary. Raw pointers simply represent and address to a location in memory and offer no safe guards.

There are two types of raw pointers:

Keyword Description
*const T Raw pointer to an immutable location in memory
*mut T Raw pointer to a mutable location in memory

A raw pointer can be created for any type in the langauge, including structures and arrays.

Creating a Raw Pointer

@const x will create a raw pointer to any variable x. If x has type T then @const x will have type *const T. x can be immutable or mutable.

@mut x will create a raw pointer to a mutable location x. x must be mutable.

    let mut x: i32 := 5;

    let cp: *const i32 := @const x; // Create a raw pointer to an immutable location

    let p: *mut i32 := @mut x; // Create a raw pointer to a mutable location

The following would fail, because x is immutable:

    let x: i32 := 5;

    let p: *mut i32 := @mut x; // Error: cannot make a mutable pointer to immutable variable

Dereferencing a Raw Pointer

The ^ operator dereferences raw pointers.

Mutable raw pointers can be used on the left side of mutations:

    let mut x: i32 := 5;
    let p: *mut i32 := @mut x;
    mut ^p := ^p * 2;  // This will change the value stored in the location `x`

Unsafe Blocks

In the future, Bramble will require the use of raw pointers to be contained within unsafe blocks, as is standard for many languages.

4.4 - Thorn Protocol

The interface between Bramble and Thorn.

How Bramble communicates insight information to the Thorn insights platform.

Insights

The Thorn insights platform provides a first-class feature to give developers complete insight into everything Bramble does during compilation and why.

In order for Bramble to communicate with Thorn there is a communication protocol that defines how Bramble records every decision it makes. This is a set of JSON files that Bramble will generate when insight tracing is turned on.

Generating Insights

To generate the Thorn insights data, use the --json-trace flag on the Bramble compiler.

Thorn Protocol

Spans

The core value in the Thorn protocol is the span. This is a tuple that uniquely identifies a span of code in your project. Spans are unique across all the files in your project.

Source Map

In order to make sure that every file in your project can be assigned unique spans, each file must be assigned a span range from the global span space. Every span within a specific file will be contained within it’s span range in the Source Map file.

In theory, spans can cross files, but there are currently no cases where that would happen.

Example Source Map:

[
    {
        "source": "./main.br",
        "span": [0, 264]
    },
    {
        "source": "./second.br",
        "span": [264, 401]
    }
]

Trace File

The events themselves are recorded to trace.json. This file stores the trace of your source code through every stage of the compiler. It is consists of an array of compiler events. The compiler events represent decisions the compiler makes about different spans of your code. Each event consists of the span that caused the decision, the compiler stage, and information about the decision that was made.

Every event is given a unique integer ID. Some events have a parent ID, this means that the parent event triggered this event.

Event Results

Currently, an event can have one of two results: ok or err. Both of which have string values. ok means that the compiler successfully made a decision. err means that the compiler failed to make a decision and includes the error message detailing why the compiler failed.

Ref Spans

Some stages of compilation require context (for example, type resolution has to look up the declarations of functions). When a compiler decision needs to refer to another part of your code to make a decision, that reference is recorded with a ref span, which records the secondary span of your code that the compiler referred to.

Example Trace File

[{"id": 428, "parent_id": 427, "stage": "parser", "source": [393, 398], "ok": "Boolean Negate"},
{"id": 427, "parent_id": 426, "stage": "parser", "source": [393, 398]},
{"id": 426, "parent_id": 425, "stage": "parser", "source": [393, 398]},
{"id": 425, "parent_id": 424, "stage": "parser", "source": [393, 398]},
{"id": 424, "parent_id": 423, "stage": "parser", "source": [393, 398]},
{"id": 423, "parent_id": 422, "stage": "parser", "source": [393, 398]},
{"id": 422, "parent_id": 401, "stage": "parser", "source": [386, 398], "ok": "Return"},
{"id": 401, "parent_id": 321, "stage": "parser", "source": [353, 401], "ok": "Funtion Definition"},
{"id": 321, "stage": "parser", "source": [264, 401], "ok": "File Module"},
{"id": 437, "stage": "canonize-item-path", "source": [0, 401], "ok": "$basic::basic"},
{"id": 438, "stage": "canonize-item-path", "source": [0, 264], "ok": "$basic::main::main"},
{"id": 439, "stage": "canonize-item-path", "source": [0, 231], "ok": "$basic::main::my_main"},
{"id": 440, "stage": "canonize-item-path", "source": [233, 264], "ok": "$basic::main::MyStruct"},
{"id": 441, "stage": "canonize-item-path", "source": [255, 261], "ok": "$basic::main::f"},
{"id": 442, "stage": "canonize-item-path", "source": [264, 401], "ok": "$basic::second::second"},
{"id": 443, "stage": "canonize-item-path", "source": [264, 351], "ok": "$basic::second::my_bool"},
{"id": 444, "stage": "canonize-item-path", "source": [353, 401], "ok": "$basic::second::notter"},
{"id": 445, "stage": "canonize-item-path", "source": [363, 370], "ok": "$basic::second::b"},
{"id": 446, "stage": "canonize-type-ref", "source": [75, 89], "ok": "$basic::main::MyStruct"},
{"id": 447, "stage": "canonize-type-ref", "source": [153, 176], "ok": "$basic::second::my_bool"},
{"id": 448, "stage": "canonize-type-ref", "source": [305, 317], "ok": "$basic::second::notter"},
{"id": 449, "stage": "canonize-type-ref", "source": [321, 334], "ok": "$basic::second::notter"},
{"id": 453, "parent_id": 452, "stage": "type-resolver", "source": [39, 40], "ok": "i64"},
{"id": 455, "parent_id": 454, "stage": "type-resolver", "source": [44, 45], "ok": "i64"},
{"id": 454, "parent_id": 452, "stage": "type-resolver", "source": [43, 45], "ok": "i64"},
{"id": 452, "parent_id": 451, "stage": "type-resolver", "source": [39, 45], "ok": "i64"},
{"id": 451, "parent_id": 450, "stage": "type-resolver", "source": [26, 46], "ok": "i64"},
{"id": 459, "parent_id": 458, "stage": "type-resolver", "source": [87, 88], "ok": "i64"},
{"id": 458, "parent_id": 457, "stage": "type-resolver", "source": [75, 89], "ok": "$basic::main::MyStruct", "ref": [233, 264]}]

4.5 - Raw Pointers

How Bramble represents memory addresses, pointers to locations in memory, and the manipulation of memory directly.

How Bramble represents memory addresses, pointers to locations in memory, and the manipulation of memory directly.

Overview

In order to support low level memory operations and full FFI, Bramble must have the ability to operate on memory directly through pointers and pointer manipulation. As with many modern languages, Bramble wants users to avoid using these tools, except in the cases where they are the only or clearly correct option. To this end, the operations to get raw pointers are tedious and typing heavy and will, eventually, have to happen within an unsafe expression block.

Pointer Types

There are two raw pointer types: *const T and *mut T. *const T is a pointer to an immutable location in memory: dereferencing this pointer will let you read from this location in memory but it will not let you write to this location. *mut T is a pointer to a mutable location in memory: dereferencing this pointer will let you read from this location and write to this location.

Creating a Pointer

To create a pointer, use the address of operators: @const or @mut. @const will return a *const pointer to the operand. @mut will return a *const pointer to the operand, but it can only be applied to mutable values.

The operand of @const and @mut must be an addressable expression. This is an expression that resolves to a value that has a location in memory, generally this is a variable.

Dereferencing a Pointer

To read from or write to the location through a raw pointer, you need to use the dereference operator: ^. To read from a memory location you can use the ^pointer in any expression where the underyling type would be valid. To write to a memory location, use mut ^p := ....

Pointer Arithmetic

To get addresses that are offsets from a given pointer, use the @ in its binary form: p@-2 will return a new address which is equivalent to p - 2 * (size_of(T)) where T is the type of the value that p points to.

5 - Standard Library

The Standard Library

The Standard Library

5.1 - std::io

Input & Output Functions

A list of all the functions in the std::io module.

This is a standard library containing support functions for handling some basic IO operations.

All functions in std::io have a canonical path that starts with project::std::io::<function>. For example: project::std::io::write("Hello World\n").

Functions

Function Description
readi64 Reads a single i64 from stdin
write Writes a string to stdout
writeln Same as write but appends a newline character
writebool Writes a bool to stdout. This will write either true or false
writeboolln Writes a bool to stdout and moves to a newline. This will write either true or false.
writef64 Writes an f32 to stdout
writef64ln Writes an f32 to stdout and moves to a newline
writei8 Writes an i8 to stdout
writei16 Writes an i16 to stdout
writei32 Writes an i32 to stdout
writei64 Writes an i32 to stdout
writeu8 Writes an u8 to stdout
writeu16 Writes an u16 to stdout
writeu32 Writes an u32 to stdout
writeu64 Writes an u64 to stdout
writei8ln Writes an i8 to stdout and moves to a newline
writei16ln Writes an i16 to stdout and moves to a newline
writei32ln Writes an i32 to stdout and moves to a newline
writei64ln Writes an i64 to stdout and moves to a newline
writeu8ln Writes a u8 to stdout and moves to a newline
writeu16ln Writes a u16 to stdout and moves to a newline
writeu32ln Writes a u32 to stdout and moves to a newline
writeu64ln Writes a u64 to stdout and moves to a newline

6 - Bramble Reference

Reference documentation for the Bramble Language and the Compiler

This provides reference documentation for the Bramble language and for the compiler.

Within this section will be clear and explicit documentation covering the features of the Bramble language, including syntax rules, semantic rules, and so on.

6.1 - Keywords

Bramble Language Keywords

Operators

Name Description
null Represents a null reference or address. A reference that does not point to any location in memory

null

null represents a raw pointer value that does not point to any location in memory. This is used to initialize a raw pointer with a placeholder value, when it is not yet known where the pointer should point.

null is also used to check if a pointer points to a valid location in memory or not. For managing heap allocated memory, null can be used to check if memory has been allocated on the heap or not.

null can be used for initialization, mutation, or comparison operators, but not for any other operation.

6.2 - Operators

Bramble Operators

Operators

Operator Description
^ Dereference raw pointer
==, != Equal to and Not Equal to
<, <=, >, >= Comparison operators
@const Address Of immutable addressable expression
@mut Address Of mutable addressable expression
@ Raw pointer offset

== & !=

Equality Operations

== checks if the value on the left and the value on the right are equal. != checks if the value on the left and the value on the right are not equal.

For raw pointers, this compares the memory addresses stored in the pointers rather than the values stored at the memory address.

Syntax

EQUAL := EXPRESSION == EXPRESSION
NOT_EQUAL := EXPRESSION != EXPRESSION

Semantics

x, y: T, T: Primitive :- x == y -> bool
x, y: T, T: Primitive :- x != y -> bool

<, <=, >, & >=

Comparison Operations

These are the ordering operations and compare two values to see what relative order they are too each other.

For raw pointers, these compare the memory addresses stored in the pointer rather than the values stored at the memory address.

Syntax

LT := EXPRESSION < EXPRESSION
LTE := EXPRESSION <= EXPRESSION
GT := EXPRESSION > EXPRESSION
GTE := EXPRESSION >= EXPRESSION

Semantics

x, y: T, T: Primitive :- x < y -> bool
x, y: T, T: Primitive :- x <= y -> bool
x, y: T, T: Primitive :- x > y -> bool
x, y: T, T: Primitive :- x >= y -> bool

@const & @mut

Raw Pointer Creation

The Address Of operators return a raw pointer to the location in memory where the value of the operand is stored. The operand for @const and @mut must be an addressable expression.

An addressable expression is any expression which is logically tied to a location in memory and is not a temporary variable used for transferring data between calls. An example of an addressable expression is a named variable, or the element of an array which is bound to a variable. An addressable expression may be stored in the stack or in the heap.

An example of a non-addressable expression, which may have a location in memory is the temporary value of a structure expression: MyStruct{ x: 1, y: 2, z: 3}. This may be allocated to the stack to because it is too big to fit in a register, but that location is considered strictly temporary and may get used for later structure expressions, or, for small structures, be optimized into a physical register or integer literal. Therefore, there can be no guarantee that it can be safely addressed and it is considered non-addressable.

Note about Mutability

It’s important to remember that there are two different concepts of mutability for a value of type pointer. The first, and most important, is the mut in *mut T: this means that the location the pointer points to can be mutated through the pointer. The second is the mut that refers to a variable of type *(const|mut) T (let mut p: *const i64 ...), this means that the pointer variable can be mutated (not the location it points to); in other words, that the pointer can be changed to point do a different address.

Example

  let i: i64 := 10;
  let mut pi: *const i64 := @const i;


  let mut j: i64 := 15;
  let pj: *mut i64 := @mut j;

  mut pi := @const j;   // Here, `pi` is mutated to point to `j`. 
                        // Note, that it still _cannot_ be used to mutate `j`

Syntax Rules

ADDRESS_OF := @(const|mut) EXPRESSION

The @ operator must be followed by either const or mut. This indicates whether to pointer can be used to write to the memory location or only to read from the memory location.

Semantic Rules

// x is an addressable value, T is a type
x: T :- @const x -> *const T
mut x: T :- @const x -> *const T, @mut x -> *mut T

@const can take any addressable value, whether it is mutable or immutable and will always return a value of *const T, where T is the type of the operand.

@mut can only be applied to an addressable value that is declared as mutable. And will return a value of type *mut T, where T is the type of the operand. This pointer can be used to write to the memory location as well as read from the memory location.

^

Raw Pointer Dereference operator

Taking a *const T or *mut T value as an operand, the ^ operator accesses the value stored in the location pointed to by the operand. If the location is mutable (operand is of type *mut T), then the dereference expression can be used on the left of a mutation expression. If T is an array or a structure, then the result of the dereference operation can be used as the left of element access or field access operators.

Example

  // Dereference a primitive value
  let i: i64 := 5;
  let p: *const i64 := @const i;
  project::std::io::writei64ln(^p);  // Prints 5

  // Using the dereference in a mutation expression
  let mut j: i64 := 10;
  let pj: *mut i64 := @mut j;
  mut ^pj := ^pj + 1;               // Adds 1 to j

  // Using with an array
  let arr: [i32; 2] := [1i32, 2i32];
  let p: *const [i32; 2] := @const arr;
  ^p[0]; // Will resolve to 1i32

Syntax Rules

DEREFERENCE := ^EXPRESSION

Semantic Rules

x: *const T :- ^x -> T
x: *mut T :- ^x -> mut T

@

Raw Pointer Offset

The @ operator is a binary operator and is used for computing address offsets from a raw pointer value. Given a pointer to a memory location, you can use @ to move to the left or the right of that memory location. The offsets are computed in increments of the size of the underlying type. So, if p is pointer to an i64, then p@1 would increment the address by 1*size_of(i64) or 8 bytes and p@-3 would decrement the address by -3*size_of(i64) or -24 bytes.

The left operand must be of type *(const|mut) T and the resulting type will be the same as the left operand. The right operand must be an integer type (either signed or unsigned).

Example:

    let mut arr: [i64; 3] := [1, 2, 3];
    let mut p: *const i64 := @const arr[1];

    project::std::io::writei64ln(^(p@-1)); // Will print 1
  
    let p2: *const i64 := p@1;
    project::std::io::writei64ln(^p2);  // Will print 3

Syntax Rules

RAW_POINTER_OFFSET := EXPRESSION @ EXPRESSION

Semantic Rules

x: *const T, o: i8|i16|i32|i64|u8|u16|u32|u64 :- x@o -> *const T
x: *mut T, o: i8|i16|i32|i64|u8|u16|u32|u64 :- x@o -> *mut T

6.3 - Type Casting

Type Casting

Operator

Name Description
as Allows primitive types to be cast between each other.

Casting

The as operator converts a value of one type to a value of another type. As much as possible, the actual value is preserved, but there are certain value ranges for which that cannot be done.

This can also be used to change the target type of a Raw Pointer value. For example, *const MyStruct can be changed to *const u8.

Valid Casts

Expression Target Type Description
Numerical Numerical Signed integers, unsigned integers, and floating point numbers can be cast between each other.
*mut T *const Y or *mut Y Mutable raw pointers can be cast to constant or mutable raw pointers
*mut T or *const T *const Y Any raw pointer can be cast to a constant raw pointer
*mut T or *const T integer Any raw pointer can be cast to an integer
Integer *const T An integer can be cast to a constant raw pointer

Numerical types: bool i8 i16 i32 i64 u8 u16 u32 u64 f64

Integer types: bool i8 i16 i32 i64 u8 u16 u32 u64

Floating point types: f64

T and Y can be any type.

For raw pointer casts, the goal is to make it as difficult as possible to bypass the declared mutability of a location in memory.

Integer Upcasting

When an integer is cast to an integer with higher bit width the additional bits need to be filled. If the source expression operand is signed, then the upcast will extend the sign bit. If the source expression operand is unsigned, then the upcast will do a zero extend.

Examples

Expression Result Description
1i8 as i32 1 Upcasting from signed to signed will result in the same value.
-1i8 as i32 -1 The negative sign is extended resulting in the same value.
-2i8 as u64 18446744073709551614 When upcasting a signed integer, the sign is extended.
65535u16 as i64 65535 Unsigned integers are upcast by extending zeroes.

Integer Downcasting

When an integer is cast to a type with a smaller bit width, then the extra bits in the source are truncated.

Examples

Expression Result Description
-2i16 as i8 -2 These values are small enough that truncation has no effect.
-5i16 as u8 254 u8 is formed by truncating all but the least significant byte from -5.
-500i16 as i8 12 Truncation converts the negative i16 to a positive i8.

Floating Point to Integer Casting

FP to Int conversion takes the FP value and converts it to an integer value. If the FP value is outside the possible bounds of the target Integer type, then the result is a poison value. Conversion is always done by rounding towards zero.

For more details on poison values, see the LLVM Documentation.

Syntax Rules

EXPR as TYPE

Semantic Rules

S, T: Primitive Type, e: S :- e as T -> T

7 - Contribution Guidelines

How to contribute to Bramble and Thorn

Any contribution to the Bramble language is welcome.

Bramble is developed in the Rust programming language and is built on top of LLVM for its backend. If you wish to contribute, feel free to reach out through one of the contact options or download the repository from GitHub and play around with the language.

The Thorn Insight tools are developed with Rust and React (for any GUI tools).

Language Proposals

If you have any proposals for the language, add them as Enhancement issues on the GitHub repository. Include a brief description of what problem the proposal would solve and any examples or diagrams you believe necessary to help explain your idea.

Code Review

All submissions will require a code review. To submit a PR, please document the changes that were made, why they were made (with a link to an issue if one exists), along with a brief outline of how the changes were tested.

Tests

Bramble is always striving to be as thoroughly tested as possible, so please make sure to add or update any necessary tests for you changes. For any changes to the language or to LLVM output, make especially sure to check the integration test suite.

8 - RFCs

Current and past design documents for features in Bramble and Thorn.

These documents provide in-depth details to the design process for major features added to Bramble and to Thorn.

The numbers used to identify RFCs correspond to the GitHub issue number used to track the feature. Which is why they they may be very large and show gaps between numbers.

8.1 - RFC 248 - Raw Pointers

How to represent raw pointers in Bramble

Design for Representing Addresses in Bramble

Note: I have used my rough understanding of how to state type rules in order to express the semantic rules for raw pointers. There may be mistakes in my statements due to lack of experience.

Problem

Within the Bramble language there is no representation of address based operations or pointers. This severely limits both what can be built with Bramble and the FFIs that Bramble can call. In order for Bramble to grow, there needs to be a representation of the most basic raw pointer concepts. These can then be used to build not only heap based allocations, but smart pointers, or garbage collectors, etc.

Goal

In this document is the design for how raw pointers will be represented in the Bramble language.

UX

Being danger, the use of raw pointers should be both discouraged except in special cases (features which require raw pointers, FFI uses, and performance) and it should be immediately obvious to a reader that a raw pointer is being used. And the syntax of raw pointers should reflect this discouragement for writers and make raw pointer use obvious for readers.

So, any use of raw pointers should be obviously different from safer uses of reference values (e.g. Rust borrows, smart pointers, garbage collected pointers, etc). This leads to several design goals:

  1. Raw pointer types should be as explicit as possible when written down.
  2. Creating a raw pointer directly to a variable should be obviously different. (e.g. Rust uses the addr_of macro to just get the address and bypass any safety checks).
  3. Dereferences should also be obvious and explicit. Where dereferences of smarter reference types might be implicitly injected by the compiler, for raw pointers they must all be explicitly written.

This document lays out the language features for supporting raw pointers themselves. After this feature is implemented, then the next step is to implement something like unsafe blocks in C# and Rust to further push raw pointers to something that a user must “buy into” and to make them obvious to a reader.

Requirements

  1. Immutability must be respected as far as it can possibly be guaranteed. It is impossible to guarantee it is respected on the other side of an FFI but within Bramble, if a variable is aquired as immutable then it is a compiler error for Bramble to ever allow it to be mutated.
  2. A raw pointer can be assigned either the address of a declared variable or the null value. It cannot be assigned an arbitrary integer.
  3. null is a keyword that can be assigned to or compared with all ref types.
  4. It is possible to get the address of any variable, no matter it’s type.
  5. Function pointers are not covered in this document
  6. The design should discourage their use to only when necessary.
  7. Raw pointers can be used on uninitialized memory. (Currently, Bramble does not allow for uninitialized variables but that will come in the future.)
  8. When a raw pointer is copied it’s value will be copied and there will now be at least two extant raw pointers to a location in memory.
  9. You can get the address of a field in a structure.
  10. with Raw pointers, you can have internal references: that is a field in a struct is a pointer to another field in the same struct.
  11. Comparisons compare addresses. Raw pointers are equal if and only if they point to the same location in memory or they are both null.
  12. There is a way to compute offsets from a raw pointer.
  13. There is a function for getting the size of a type. This is needed for custom memory allocation, computing offsets, etc.
  14. Comparison operators work on raw pointers. (<, >, <=, >=, ==, !=)

Design

Syntax

There are two parts to implementing raw pointers: the reference types, which describe variables that hold addresses, and the operators used to to create and use raw pointers.

Pointer Type

The type syntax is inspired from Rust:

Raw Pointer = *(const|mut) TYPE

Where Type is any primitive or valid custom type, including raw pointers. const and mut must be specified to help make raw pointers as explicit as possible and to discourage their use by making them tedious.

Reference Operators

There are two operators that are needed: one to get the reference and one to dereference a pointer so that a user can access the target variable.

REFERENCE = @(const|mut) (IDENTIFIER[.FIELDNAME]*)
DEREFERENCE = ^EXPRESSION

MUTATE = mut LVALUE := EXPRESSION
LVALUE = [^]IDENTIFIER             // The formal labelling of the LVALUE is new to Bramble

Note: currently braid does not support mutations of fields in a struct. This proposal optimistically hopes that that language change can be fit in here. The LVALUE in Bramble needs to be greatly matured so it can support derefs, fields, array elements, and parens to control order of operations. The LVALUE presented here is still not complete but fully realizing it is outside the scope of this RFC.

The Reference and Dereference operators are unary operators with precedence higher than member access or array access operators.

The Reference operator can only be applied to identifiers or the fields of an identifier that’s a structure as those are the only constructs in Bramble that have physical locations in the computer. Literals, for example, have a location in the machine code data section but are not conceptually something that can be referenced. The reference operator explicitly requires const and mut so that it mirrors the syntax of the raw pointer type names.

@ is chosen as the addressing operator rather than the traditional & to make & available for future use with safer reference/borrowing systems. In C/C++ the & is used to get the address of a variable. In Rust & is used for both borrowing and raw pointers (in Rust the & is cast to a raw pointer and a separate macro must be used to explicitly get the address). For Bramble, I want all interactions with raw addresses/pointers to be as explicit and obvious as possible so the addresses are gotten with @. Then & will be used for safe references which will be what users are encouraged to use.

The @ operator can only be applied to lvalues (identifier or a specific field). This is because the raw pointer logically refers to a specific location in memory and all locations in memory are refered to via identifiers. This mirrors the syntactic rules of C, and is carried into Bramble both because it maps well to the core philosophies of Bramble and because it provides a very strict rule which makes implementation easier.

The Dereference operator is ^, which differs from other languages which use * to dereference pointers. This was chosen to to make the dereferencing of a raw pointer explicitly different from dereferencing safer references or smart pointers. The ^ will probably be used for bitwise xor when that operator is added to Bramble but I believe it will be clear which operator is being used by whether it is unary or binary.

Unlike the reference operator, the dereference operator can, syntactically, be applied to any expression. The requirement being that the expression resolve to a reference type. This is because dereferencing is operating on a value (the address) and the address is expected to have already come from a variable.

This makes using dereferenced identifiers legal as the left value in mutation statement.

Both the reference and dereference operator symbols were chosen to be uniquely applicable to raw pointers and distinct from future, safer, methods of handling references in Bramble (be those GC pointers, reference counters, or a lifetime system a la Rust). The priority here is given to clarity for the reader of code rather than efficiency for the writer.

Operator Precedence

  1. @ and ^ will have the same precedence as unary minus (-) and not (!)

Semantics

Some notes, currently the following types and their uses are semantically legal:

x: i32, y: i32
1. *const *const i32
2. *mut *mut i32
3. ptr: *const *mut i32 :- mut ^^ptr := 5
4. *mut *const i32 :- mut ^ptr := @y

The latter two may seem unintuitive or that they should be illegal. Example 3 is legal because the outer *const refers to the location in memory that is storing the *mut i32, that pointer cannot be changed through indirect means, but because that pointer is an indirect mutable pointer, the location it points to (the i32) can be changed. Example 4 is legal, because the location that the outer pointer points to is mutable (it can be changed to point to a different i32) but the i32 that the inner pointer points to cannot be mutated.

It may make more sense to semantically have an occurance of a *const in a chain of nested pointers to override and turn the entire chain into *const.

Reference Operator

The reference operator can only be applied to an Identifier that is a variable in the current scope. The type of the reference expression is *const T where T is the type of the variable.

The mutable reference operator @mut can only be applied to a variable that is also mutable. The type of the mutable reference expression is *mut T.

mut and const are modifiers on the * reference type.

Deference Operator

T: Is a Type :- R: *(const|mut) T
-----------------------
:- ^R -> T

Mutating a dereference: If the raw pointer is to a mutable location (*mut T) then the value of that location can be mutated via the dereference operator;

let x: i64 := 1;
let mptr: *mut i64 := @mut x;
mut ^mptr := 5;

The type rules are:

mut ^R := V => T: Type :- R: *mut T, V: T

Dereferencing null or a variable whose value is null is undefined behavior.

Assignment Semantics

A variable of type *const T can be assigned a value of type *const T or a value of *mut T.

A variable of type *mut T can only be assigned a value of type *mut T.

let x: i64 := 5;
let mut y: i64 := 10;

let cptr_x: *const i64 := @const x;  // Legal
let cptr_y: *const i64 := @const y;  // Legal

let mptr_x: *mut i64 := @mut x; // Illegal, cannot get mutable address to an immutable variable
let mptr_y: *mut i64 := @mut y; // Legal

let mptr_x: *mut i64 := @const x; // Illegal, cannot get mutable address to an immutable variable
let mptr_y: *mut i64 := @const y; // Illegal, cannot cast a const location to a mutable pointer
Ref Type Variable Mutability

A variable of reference type (immutable and immutable) can itself be annotated as mutable: this means that the address stored in the variable can be mutated it has no bearing on if the location the address refers to can be mutated.

let x: i64 := 5;
let x2: i64 := 5;
let mut y: i64 := 10;
let mut y2: i64 := 10;

let mut cptr_x: *const := @const x;
let mut cptr_y: *mut := @mut y;

mut cptr_x := @const x2;     // Legal
mut cptr_y := @mut y2; // Legal 

Copying References

Copying: a mutable reference can be copied to a variable of type *const or *mut, so long as the underlying type is the same. An immutable reference can only be copied toa variable of type *const.

Examples:

fn write_to(m: *mut i64) {...};
fn read_from(i: *const i64) {...};

let x: i64 := 5;
let mut y: i64 := 10;

let cptr_x: *const i64 := @const x;       // Legal
let cptr_y: *const i64 := @const y;       // Legal
let cptr_x2: *const i64 := cptr_x;  // Legal
let cptr_y2: *const i64 := cptr_y;  // Legal

let mptr_y: *mut i64 := @mut y; // Legal
let mptr_x: *mut i64 := @mut x; // Illegal, cannot get mutable address to an immutable variable
let mptr_x: *mut i64 := cptr_x; // Illegal, cannot get mutable address to an immutable variable

write_to(@mut y); // Legal
write_to(@const y);     // Illegal, type mismatch requires *mut i64 but `@x` has type *const i64
write_to(@mut x); // Illegal, @mut cannot be applied to an immutable variable

read_from(cptr_x); // Legal
read_from(cptr_y); // Legal
read_from(mptr_y); // Legal
read_from(@const x);     // Legal
read_from(@const y);     // Legal
read_from(@mut y); // Legal

Comparisons

Comparisons are done on the addresses stored in the reference variable. Mutable and immutable references can be compared with each other.

Equality

  • == - returns true if the LHS and RHS have the same address value, otherwise false.
  • != - returns true if the LHS and RHS have different address value, otherwise false.

Less Than/Greater Than

Comparison operators will operate on the numerical values of the addresses as if they were integer types.

Offsets

A key part of raw pointers is being able to use them to move through large blocks of allocated memory: e.g. dynamic arrays and so on. On a semantic level this mirrors the array access operation [n] with the key difference being that and offset could be negative.

A similar, but visibly different operator is provided for getting offsets from a given raw pointer (which addresses the need for pointer arithmetic).

ptr<i64> // The <> operator takes a signed 64 bit integer and returns a new address

The offset operator will get an offset address in terms of multiples of the size of the underlying type. The address it will return computes as ptr value + size_of(T) * offset where T is the underlying type of the ptr.

So for a pointer to a i64, the size of the underying type is size_of(i64) = 8 and the offset will move in steps of 8: ptr<2> = ptr + 2 * 8 or ptr<-3> = ptr - 3 * 8.

The <> characters were chosen to make the pointer offset operation visually distinct from the array access operation while still sharing some resemblance because logically they operate in identical manners with the array access being more restrictive in its inputs (they must be unsigned).

Syntax:

PtrOffset = IDENTIFIER<EXPRESSION>

Semantics:

T: Type :- R: *(const|mut) T
----------------------
p: R, o: i64 :- p<o> -> R    // The offset of a *const is *const and the offset of a *mut is *mut

Alternative syntax: Use this as a binary operator: ~ e.g. ptr ~ -5. I don’t like this because it’s very subtle and with some fonts could be confused with -.

One reason to avoid using <...> is it’s a pain to implement in the parser. Perhaps $ or # as binary operators. The semantic rules for typing stay the same but it becomes easier to parse.

Decision

At the time of implementation, I’ve decided to use @ as a binary operator for raw pointer arithmetic. This is because all the other operators (e.g. <> and # and $) could have future uses, the @ is already associated with raw pointers, and this minimizes the amount of special characters which are devoted to raw pointer manipulation.

Member Access on Values of Reference Type

If a reference points to a structure then to access members of the structure you need to dereference the pointer first: *ptr.field or (*ptr).field. To dereference a field in a structure that is a reference: *(ptr.field_ptr).field or (*(ptr.field_ptr)).field.

size_of

The size_of function takes a type and returns the number of bytes allocated to store a type in memory. This takes into account padding bytes that might be added for memory alignment purposes.

Syntax:

SIZE_OF = size_of(TYPE)

Type:

:- size_of(T) -> u64

For simplicities sake, this is operating on the C notion of size where every type as a size. For primitive types with the bitsize in the name, the size of the type shall remain constant across all versions of Bramble and all platforms and architectures.

Insight

The insight should focus on the physical nature of pointers and addresses, as they represent the actual locations in memory and deal with interacting with those physical realizations of the logical variables. So, when possible the visualization and Insight should emphasize that link to the physical resource.

Recorded events: When the @ operator is applied to a variable then:

  1. The lexer will emit an event for the @ token
  2. The parser will emit an event for generating a @ unary node in the AST.
  3. The type resolver will emit an event which includes a reference span to the definition for whatever is the operand of @, because the declaration of that variable is also the aquisition of the physical memory.
  • If the operand is a field of a structure, then it will reference the declaration of that structure variable.
  • The reference type should probably be annotated to make it explicit that this is refering to the same physical location and not that it is using a reference for a compile time decision.
  1. LLVM will emit an event with the LLVM get address instruction.
  2. Record if the pointer is pointing to a type which is memory aligned
  3. When a pointer is bound to a Field in a struct, the node should display the field name. Have a reference to the span for the field’s definition.

When the * deref instruction is applied to an expression:

  1. The lexer will emit an event for the * token.
  2. The parser will emit an event for generating the * unary node in the AST.
  3. The type resolver will emit an event which includes a reference to the definition of where the operand comes from.

This is a bit outside the scope of this RFC, but because this is dealing with language features that surface the physical nature of how variables are stored:

It would be nice to record the memory layout information for all structs and display that in the LLVM tab for the Viewer. It would also be nice to have a compiler flag to print out memory layout table for all types after compilation.

Viewer

Highlight the creation of raw pointers and the dereference of raw pointers in some manner to emphasize that something dangers is being done.

In this case, the Cell of an @ expression or a * expression will be will be colored a light orange as a warning sign.

Tests

Integration Tests

  1. Get Reference to each primitive type
  2. Get Reference to array
  3. Get Reference to structure
  4. Get reference to field in structure
  5. Get reference to element in array
  6. Get reference to reference
  7. Repeat all tests for mutable references
  8. Test passing a reference to a function
  9. Test passing a mutable reference to a function and function mutates value
  10. Return reference from function
  11. Test using malloc and free
  12. Copy struture with reference field => both copies point to the same value
  13. Test derefence of each of the Reference tests
  14. Test mutable dereference of each of the reference tests
  15. Reference to a mutable value, mutate the value, test that the reference reflects the mutation.
  16. test initializiing a reference to null
  17. Test comparing a reference to a null
  18. Test comparing two references compares addresses
  19. Get pointer to first element in the array, then use pointer arithmetic to move to each element of the array.

Implementation

Types

The const keyword will be added to the token set and the lexer.

A new Ref variant will be added to the Type enumeration. The Ref variant will have two fields. One field is_mutable boolean which is false for immutable references and true for mutable references. And a boxed value to the target type that the reference refers to.

The Parser will be updated to parse the ref type annotations according to this grammatical rule:

RefType = *(const|mut) <Type>

*const <Type> will set is_mutable to false and the target type to <Type>. *mut <Type> will set is_mutable to true and the target type to <Type>.

@ Operator

@ will be added to the Lexer and token set as Ampersand.

Two new variants will be added to the Expression enumeration: AddressOf and AddressOfMut.

The parser will follow these rules:

AddressOf = @<Identifier>
AddressOfMut = @mut <Identifier>

Semantic rules will be added of the form:

  1. For AddressOf the operand must be a variable. It can be immutable or mutable.
  2. For AddressOfMut the operand must be a mutable variable.

^ Operator

The ^ token already exists in the Lexer and token set.

A new unary rule will be added to the Parser:

Deref = ^EXPRESSION
DeRefAssignment = mut *EXPRESSION := EXPRESSION

A new variant will be added to the Expression enum. This variant will have a single child, which is the expression it is dereferencing. A new variant for mutations will be added for mutating a reference.

Semantic Rules will be added of the form

  1. For DeRefAssignment, the <Expression> on the LHS must be mutable reference.
  2. For *<Expression> in the LHS, the type of <Expression> must be a Reference type (mutable or immutable). If the type of <Expression> :- *(const|mut) T then the type of *<Expression> :- T

size_of

The size_of function will be implemented using the LLVM API and computed as a static value during compile-time.

https://stackoverflow.com/questions/14608250/how-can-i-find-the-size-of-a-type https://llvm.org/doxygen/classllvm_1_1DataLayout.html#aa48b3b7e554b44f4e513d5dd8d9f9343

Syntactic Fuzzer Changes

  1. Add type annotations for *const T and *mut T. This should be able to use the same logic as arrays.
  2. Add expression generation for taking the reference of an identifier.
  3. Get a raw pointer to a field in a structure.
  4. Add a failure test by generating a reference operator where the operand is not an identifier.
  5. Add deref generation for expressions.
  6. Add deref mutation generation for statement generators.
  7. Compute raw pointer offsets
  8. Mutate dereference to a field name

8.2 - RFC 257 - Addressable Expressions

How to represent the expressions that can be used on the left hand side of assignments and mutations.

Addressable Expressions

Overview

Allow the Bramble compiler to support more than just identifiers as the LHS of mutation operations and the operand of @(const|mut) operations. Specifically, be able to use array elements and structure fields so that the following would be legal:

mut arr[0] := 5;
mut st.field := -1;
@const arr[0]
@const st.field
mut ^(@mut arr[0]) := 2;

Solution

(The term “Addressable” is inspired by the language used in the Go Lang spec).

The expressions which can be mutated or referenced with @ are expressions which have places somewhere in memory (the stack or the heap). These are called Addressable values. Without a place in memory, obviously, the expression cannot have its value changed or its address taken. Expressions which evaluate to a value are Value Expressions. All addressable expressions are also value expressions: in this case their value is read from their place in memory.

Whether an Addressable expression resolves to its address in memory or its value is context dependent: specifically, the operation and what operand position the expression takes. For example, the LHS of a mutation operation would resolve to an address, but the RHS of the same mutation operation would evaluate to an address.

I propose adding a new is_addressable flag to the SemanticContext of an expression. This will be added before type resolution. During type resolution, the compiler will not only check that the types are correct, but will also check if an operand is addressable for the operations which require addressable values.

In order to support mutation operations, an is_mutable flag will also have be added to the SemanticContext. Currently, this flag only exists in the symbol table for variables, but this cannot support dereference expressions or field access or array indexing expressions. The is_mutable flag will remain in the symbol table and a flag will be added to SemanticContext for mutability of addressable expressions.

The mutability and addressability of an expression will be combined into a single flag with 4 possible states: None|Value|Addressable|AddressableMutable. An expression can only be mutable if it is addressable, so having two separate flags creates risk of accidentally marking a value expression as mutable. None is used both as a place holder for when an expression has not had it’s Addressability computed and for nodes which are not expressions (structure defintions, etc).

A new phase ComputeAddressable will traverse the AST and set the addressability flag for every node in the AST, based upon a set of rules detailed in a following section.

Then in LLVM Generation, a new method will be added to Expression called to_addr -> Option<T>, which will return the address of an Addressable Expression, rather than its stored value, if the expression is addressable otherwise it will return None. This implementation is easily done by simply returning the pointer and skipping the build load operation.

Why this design? This design allows for determining whether expressions can be used for mutations and referencing in a way that is fully independent of the grammar rules or AST design. Which means both: no major refactoring or changes to the AST and no added bounds on how the grammar and AST could evolve into the future.

Why have this separate from type resolution? Type resolution is already large and complex, the danger is that addressability rules checking would get lost in all the type checking logic. Making it easy to break and hard to fix in the future. Having it as a separate step will make it more obvious that it occurs and easier to maintain and grow. It also is logically not the same thing as type checking.

Addressable Rules

All rules here assume that the expressions are of valid types for the operators (e.g. that in ^EXP, EXP is a raw pointer type). The type checking can happen after addressability is determined, because IDENTIFIERs are always addressable and the result of dereferencing operation must be addressable. Addressability can also be done after type checking, in which case it would be combined with checking that the operands for place operators are addressable.

All expressions are Value expressions unless they satisify the Addressable rules below. All other grammar rules are None.

X: IDENTIFIER
--------------
X :- Addressable


X: EXPRESSION :- Addressable
--------------
(X) :- Addressable


X: EXPRESSION :- Addressable, I: EXPRESSION
----------------
X[I] :- Addressable


X: EXPRESSION :- Addressable, F: IDENTIFIER
-----------------
X.F :- Addressable


X: EXPRESSION :- Addressable
----------------
@(const|mut) X :- Addressable


T: Type, X: *(const|mut) T
-----------------
^X :- Addressable


T: Type, X: *mut T
-----------------
^X :- Mutable


T: Type, X: *(const|mut) T
-----------------
X :- Addressable


X: EXPRESSION :- Addressable and Mutable, Y: Expression
----------------
mut X := Y :- Not Addressable

Array elements and structure fields are not de facto addressable. For example, foo().field, the function foo returns a structure and we access the field field. Within Bramble, we logically describe this as a value and not an addressable expression so it cannot be used for @ or mutations. (Even though, in the LLVM IR we generate we actually allocate space in the caller stack frame for the returned structure value, we still logically describe it as a value expression, this is because there is no label associated with it, so it would be hard for a user to mentally or visually track how the value is being used or mutated).

Implementation

Status

Add an enumeration that has the variants: None|Value|Addressable|AddressableMutable Add a field for this enumeration to SemanticContext with default value None.

Can I update canonize/foreach_mut to support POST order execution (as a configurable flag)? If so, then I think I can use that foreach for the traversal? Part of this would be updating foreach to do the Insight event recording too.

It might be easier to start out by adding this into type_resolver then move to its own stage?

Update the grammar rules so that any expression can be used on teh LHS for mutation and that any expression can be used as the operand for @(const|mut).

UX

The only impact on UX will be greater flexibility in what can be mutated and referenced. So, this will move Bramble towards being more intuitive in that regard.

Insights

We should mark what expressions resolve to being addressable and what expressions resolve to being only value expressions. This can be done by merging the events generated by the addressability analysis with semantic node events via spans tests.

A new compiler stage will be added to the trace output stage set: Addressability. This stage will record the assignation of every span in the input source code with whether it is Addressable or not. For each node in the AST that is checked, this will emit an event with the result of the Addressability analysis.

Tests

  1. Test mutating an array element
  2. Test mutating a structure field
  3. Test referencing an array element
  4. Test referencing a field
  5. Test mutating a dereference (mut ^a := 5)
  6. Test mutating an integer => fails
  7. Test mutating (x+5) => fails
  8. Test (x)
  9. Test (x)[5].field, an array of structures
  10. Test (x.field)[5], field is an array
  11. Test ^x.field, field is a pointer
  12. test ^x.field[2], field is an array of pointers
  13. Test ^x[0] array of pointers
  14. Mix in combinations of ()

8.3 - RFC 288 - MIR - Control Flow Graph

Use a Control Flow Graph as an IR to make implementation of rules that test data flow easier.

Control Flow Graph Intermediate Representation

This RFC proposes using a Control Flow Graph (CFG) IR as a stage between the AST and LLVM forms, because it will help simplify the implementation of key Bramble features: including consistency semantics, ownership rules a la Rust, and automatic injection of destructors for values.

Bramble’s MIR language design was inspired by Rust’s MIR and the design of the conversion from AST to MIR was inspired by LLVM.

Problem

Bramble needs to perform data flow analysis on each function in a program. The analyses needed are: consistency rules checking, automatic insertion of destructors for values, and ownership rules. More analyses may be added in the future.

For each of these analyses, Bramble needs to look at the sequence of operations that are applied to one or more locations in memory, then determine if every possible sequence of operations satisfy some given set of rules. This is falls under dataflow analysis.

While an AST can be used for such analyses, it is not well suited and would make implementation of the analyses more complex.

Requirements

  1. Simplify the forms of analysis which follow a “dataflow” paradigm.
  2. Be able to lower into LLVM IR.
  3. Integrate with Thorn Insights platform.
  4. Be able to represent all Bramble programs in a MIR form.
  5. MIR representations must be fully independent of the preceding AST representation. This means that the AST form of a Bramble program could be deleted after conversion to MIR and the compilation will always be able to complete successfully.
  6. Allow for intra-procedure (local) optimizations.

Solution

Rather than use the AST IR for dataflow analyses, create a new Control Flow Graph IR (henceforth referred to as a MIR), and all dataflow analyses are done against this form rather than the AST form.

The AST will be “lowered” into the MIR form after parsing and type resolution and before transformation into the LLVM IR form. After being lowered, any analyses and optimizations which work best with a CFG are applied. Then the MIR is lowered into LLVM IR form and compilation continues with the LLVM compiler toolset.

Alternatives

The alternative solution is to continue using an AST for all compiler analyses and optimizations. Most compilers do their work off an AST form and only use CFG forms for optional optimizations or linting. The Rust compiler currently uses a CFG IR form as an intermediate step between the AST and LLVM IR; however, its earliest versions used only an AST and the CFG IR was added later to help simplify the ownership rules analysis.

The reason to continue using the AST for all analysis rather than the MIR is to save development time and simplify the implementation of rules and optimizations that focus on the flow of data. It is a fact that all analyses that work on the CFG IR model will work on an AST, but their implementation may be much more complex and fragile.

Problems with using an AST for this analysis include the following. First, all these analyses need to validate some rule holds for one or more variables through every possible path of execution, the AST form makes it difficult to correlate all the different paths when there are branches. Second, the AST form recursively embeds operators as children of operators, this makes it difficult to find the leaf operands (specifically the variables being used), because the desired analyses are focused largely on tracking how each specific variable is used, being able to easily extract those variables from operations is critical for easy implementation. Third, all these analyses will want to know what unbroken chain of operations must be applied to a set of variables (Basic Block), this information is difficult to extract from an AST. All this information is encoded in the AST form, but the AST form prioritizes preserving the hierarchy of operations, which obfuscates the information that the dataflow analyses require; this leads to the implementaion of analyzers being more complex.

The ulimate price of the increased complexity of implentation is reducing the power of the analyzers. As experienced by the Rust team when the lifetime checker was implemented on an AST: having a less powerful analyzer means more blunt enforcement of the rules. This pushes more burden onto the user of the language. In the case of Rust, the “more blunt” analyzer meant that it could not understand more complex or subtle memory operations and would fail code that was technically correct. Using the AST as the basis for dataflow analysis, leads to simpler analyzers, which leads to more false negatives, forcing users to write worse code.

I chose to implement the MIR because significantly reducing the complexity and fragility of the dataflow analyses will be a net long-term benefit for the Bramble compiler. The consistency rules system could be incredibly complex, as it evolves, and a CFG IR model will greatly reduce the complexity of its implementation. Further, many optimizations benefit from the CFG IR model and implementing one now will allow me to start safely adding optimizations to Bramble. Finally, this decision was influenced by taking advantage of Rust’s experience: using the AST IR form greatly limited the power of ownership rules system and moving to a CFG IR was critical for building a more powerful and useable ownership rules model. Given that the consistency rules model that Bramble will implement shares many mathematical fundamentals with Rust’s ownership model and lifetime model; Rust’s experience provides valid insight into a major pitfall Bramble could suffer.

Design

MIR Architecture Model

The Bramble MIR’s design represents a simplified model of the physical memory and CPU instructions. MIR breaks memory down into 3 locations, based upon their lifetime relative to the execution of a program: static, heap, and stack. All operations in the MIR correspond to a subset of the instructions on a CPU and, like a CPU, the results of each single operation must be stored somewhere before being used in a subsequent operation. Further, in MIR, operations are a linear sequence and have no hierarchy.

Static memory represents the “.data” and “.code” (For x86) sections of a program, which correspond the parts of memory where data lives from the moment the program begins execution to when it stops execution. In MIR, the representations of each function are stored in the “Static” memory group. Heap memory corresponds to the heap, and data here will live from the moment it is explicitly allocated until it is explicitly deallocated (currently, no constructs in MIR correspond to or represent heap data). Stack memory, or local memory, corresponds to the stack frame of a function: data in this section will be valid from the moment a function is entered to the moment that instance of the function exits. In MIR, all variables defined within a function (user and temporary) are stored on the stack.

How the MIR uses abstractions to represent different parts of memory.

MIR Components

When lowered into its MIR form, a project is represented as a single MirProgram, which represents all the functions and types in the project. Each function is defined as a set of Basic Blocks. Each user defined type is stored as an ordered set of fields, where each field has a name and a type.

A Basic Block represent the largest set of operations where: if one operation is executed they are all executed and every operation is executed immediately after the preceding operation. Every Basic Block is composed of 0 or more Statements and a Terminator. The Statements represent operations that have results that are stored in memory. Terminators tell where the program will go after the last Statement in the Basic Block is executed.

Basic Blocks.

A visualizaiton of how a function with an If Expression would be represented via Basic Blocks. Each Basic Block is an ordered set of statements ending with a Terminator. The Terminator determines what the next Basic Block to be executed will be.

There is a single statement: Assign. This statement executes an operation and stores the result in a location in memory; this location can be either a user defined variable or a temporary variable for intermediate results. The operations correspond to unary and binary arithmetic instructions available on the CPU. Operations take one or two operands; operands can be either constants or specific locations in memory. Data can only be stored and read as base types (e.g., i32 or f64). For types constructed from base types (arrays and structs), Accessors are used to access the location of a specific element or field within a value of the constructed type.

The Terminator ends every Basic Block and is always executed. The simplest Terminator is the Return, which tells the program to exit the current function and remove its stack-frame from the stack. The GoTo Terminator tells the program to go to the specified Basic Block within the current function. The Conditional GoTo Terminator tells the program to go to one of two Basic Blocks based upon whether a given conditional value is true or false. Finally, the Call Terminator tells the program to call a function and provides a location to store the result of the function.

Lowering AST to MIR

The design for lowering the AST to the MIR resembles the current design for lowering the AST to LLVM IR. There is a set of MIR types that can represent any Bramble program in MIR and implement a CFG. The lowering process will generate a set of CFGs, composed of MIR values, that each represent a function from the input AST. The process uses a MIR Builder, which constructs the MIR form of a single function, and a Traverser, which traverses the AST and uses the Builder to convert types and functions into MIR forms.

The MirProgram type manages all custom types defined by the user and the MIR representation of each function. This is the final output of the lowering process. Subsequent operations that apply to the MIR will take the MirProgram as input.

The lowering process begins by traversing the AST for every user defined type and adds the type to the Type Table in MirProgram. Types are converted to MIR first, so that when they are encountered while lowering function definitions, their definitions can be easily found. This also separates responsibilities and leads to simpler code.

The Type Table assigns a unique integer identifier to each type and this identifier will be used within the MIR to represent types. This makes representation smaller in memory and makes programming easier and execution faster. Every type is represented by a single integer, so passing to other functions can be done by copying instead of cloning. Likewise, comparison will amount to a single integer comparison instruction. This will also lead to better Rust code.

How the MIR Builder manages types.

Note: The TypeTable is not strictly required: the MIR could continue to use Paths and the Structure Definitions in the AST form for referencing and looking up types. However, using an integer TypeId to identify types results in significantly simpler Rust code and is a better design overall. By taking this opportunity to add in the TypeTable we can begin to reap the benefits of this design and then, in the future, move the construction of the TypeTable to an earlier stage in the BrambleC pipeline so that the AST also uses TypeIds rather than the cumbersome Type enum.

After all types are lowered to the MIR, the AST is traversed again, and the name and signature of every defined function is added to the set of functions in MirProgram (without the MIR definition of the function). This is because functions are identified using integers in the MIR and we need to know the identifier for every function before lowering the function definitions, otherwise, there may be a call to a function that has not yet been lowered and it will not have an assigned function identifier.

After adding every function name and signature to the MirProgram and assigning them identifiers, the lowering process transforms the definition of each function from an AST to a MIR. A post order traversal of the AST of the function is performed and the Builder type is used to construct the MIR representation. After the function’s AST has been fully traversed the MIR is finalized and added to the MirProgram as the definition of the function. The lowering process is complete when every function has been converted into a MIR form.

How the MIR Builder manages functions.

Lowering MIR to LLVM IR

After completing every test and transformation which relies upon the MIR form, the compiler lowers the final MIR into its LLVM form. The lowering process is split into two components: the Traverser and the Builder. The Traverser iterates over each function and custom type, and calls methods on the Builder to construct the LLVM IR. At the completion of the lowering process there will be an LLVM Module that contains all the types and functions in the Bramble project.

The Builder provides a set of methods that correspond to the MIR instruction set. The Traverser calls these methods as it encounters the corresponding MIR instructions. The methods cause the Builder to create the LLVM IR instructions that correspond to the Bramble MIR instructions.

The Builder is responsible for generating the labels for all custom types and all functions, that are used within the LLVM IR and, ultimately, the generated assembly code.

References

The main inspiration for the MIR was from reading through the Rust Compiler source code and this blog post: https://blog.rust-lang.org/2016/04/19/MIR.html. Much of the Bramble MIR design is directly inspired by Rust’s MIR, though a significantly simpler form.

The second source of inspiration was “Engineering a Compiler” by Keith Cooper and Linda Torczan. The chapters on dataflow analysis and control flow graphs directly led me to believe those concepts the best for implementing the consistency rules system for Bramble.

In addition, I looked at Clang and Rosalyn to understand how those compilers used Control Flow Graphs and how they designed their CFG models.

The following links were used during research for the MIR design:

  1. http://s4.ce.sharif.edu/blog/2019/12/31/clang/
  2. https://homepages.dcc.ufmg.br/~fernando/classes/dcc888/ementa/slides/ControlFlowGraphs.pdf
  3. https://eli.thegreenplace.net/2013/09/16/analyzing-function-cfgs-with-llvm
  4. https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.flowanalysis?view=roslyn-dotnet
  5. https://github.com/dotnet/roslyn-analyzers/blob/main/docs/Writing%20dataflow%20analysis%20based%20analyzers.md
  6. https://github.com/dotnet/roslyn/blob/1deafee3682da88bf07d1c18521a99f47446cee8/src/Compilers/Core/Portable/Operations/BasicBlock.cs
  7. https://github.com/rust-lang/rust/blob/f7ec6873ccfbf7dcdbd1908c0857c866b3e7087a/src/librustc/mir/repr.rs