#garbage-collection #type-checker

kamo

A library to assist in the creation of an interpreter or compiler and its associated runtime

14 releases (8 breaking)

new 0.9.3 Apr 28, 2024
0.8.1 Apr 16, 2024
0.6.0 Mar 26, 2024
0.4.0 Dec 27, 2023
0.1.1 Nov 27, 2023

#20 in Parser tooling

Download history 16/week @ 2024-01-14 15/week @ 2024-02-25 4/week @ 2024-03-10 82/week @ 2024-03-24 146/week @ 2024-03-31 105/week @ 2024-04-07 147/week @ 2024-04-14 167/week @ 2024-04-21

569 downloads per month

MIT/Apache

1MB
15K SLoC

Kamo

Build Status Crates.io Crates.io

Kamo (カモ) is japanese for duck.

  • This is the release of the third part of the Kamo project: type system and type checker. The type system is available when the types feature is enabled. It is not enabled by default. When the type system is enabled the env module is also available. It is used by the TypeChecker.

  • This is the second supplement to the second part of the Kamo project: memory management and runtime values. It adds a runtime parser for s-expressions in the module kamo::form::sexpr. See the change log for more information about the changes and fixes.

  • This is a supplement to the second part of the Kamo project: memory management and runtime values. It adds procedural macros to parse s-expressions into runtime values. The macros are defined in the crate kamo-macros.

  • This is the release of the second part of the Kamo project: memory management and runtime values.

  • The first part was the parser combinator library.

Memory Management Module kamo::mem

This module implements automatic memory management. Values are allocated in a mutator which holds an arena allocator for each type. Memory collection is done by a mark and sweep garbage collector.

The mutator implementation is thread local. The mutator holds a vector of arena allocators for each type. The arena allocator is implemented as a vector of buckets. Each bucket holds a vector of slots. The slots are used to store the values. Buckets have a fixed size and are allocated on the heap. The buckets are allocated on demand when the slots are full.

The hierarchy of types is as follows:

  • Mutator contains Arenas.
  • Arena contains Buckets.
  • Bucket contains Slots.
  • Slot contains a value.
  • a value is represented by a Pointer.

Values managed by the mutator must be traceable. The mutator uses the Trace trait to trace the values. The Trace trait is implemented for all types that can contain values managed by the mutator.

Pointers use reference counting to keep track of the number of references to the value. When the number of references reaches zero the value may be subject to garbage collection. To avoid cycles Pairs, also known as Cons cells, and Vectors do not hold locks to their elements. Instead they are traced by the mutator when memory collection is done. This is the reason why a drop of the reference count of a pointer to zero does not immediately free the value.

The garbage collector is implemented as a mark and sweep collector. All values which hold a lock are roots. The garbage collector traces all roots and marks all values that are reachable from the roots. All values that are not marked are unreachable and are freed. Tracing is done iteratively by repeatedly calling the trace method on the values and adding the values to a work list. This continues until the work list is empty.

std::alloc::Allocator Trait

The mutator uses the global allocator to allocate memory for buckets, arenas, vectors and strings etc. Dynamic memory allocation is done by the allocator provided by the std::alloc crate. Since the implementation of the allocator provided by the standard library is highly optimized it would be a waste of effort to reimplement it.

As a consequence the mutator is not able to free memory when the global allocator fails to allocate memory. This is because the mutator does not know which memory is allocated by the global allocator. The mutator only knows which memory is allocated by its arenas. Collection of memory is triggered only by the allocation pressure.

When the allocator trait is stabilized the mutator will be able to use the the trait to mitigate allocation failures by freeing memory when the global allocator fails.

Runtime Values Module kamo::value

This module implements runtime values. Values can either hold immediate values or pointers to values allocated in the mutator.

The following immediate values are supported:

  • nil, the empty list
  • bool
  • char, a Unicode scalar value
  • i64
  • f64

The following pointer values are supported:

  • pair, a pair of values, also known as Cons cell
  • symbol, a symbol, an interned string
  • string, a UTF-8 string
  • vector, a vector of values
  • bytevector, a vector of bytes

Value provides a safe interface to the values in the runtime. It is implemented as a wrapper around ValueKind. It provides methods to convert the value into a Rust type, to check the type of the value, and to visit the value. Heap values are automatically locked and unlocked when necessary.

Value implements the visitor pattern. The Visitor trait is used to visit the value. When implementing the Visitor trait it must be considered that the value may be cyclic. The visitor must keep track of the values it has already visited to avoid infinite loops.

The Visitor trait is used to implement printing of values. The Value itself does not implement printing. Instead the Display trait is implemented for a specific vistor. This makes it possible to implement different printing strategies for values.

An implementation of the Visitor trait is provided to print values in a Scheme-like syntax.

Parser Combinator Library kamo::parser

This module implements a parser combinator library. The library is focused on parsing UTF-8 text in a safe and mostly zero-copy way. It is designed to be used to implement parsers for programming languages. It is not designed to be used to parse binary data.

One design goal of this library is to make it easy to write parsers that keep track of the position of the input and the cause of the error. The position is tracked automatically by its offset in bytes from the begining of the input and the line and column number in UTF-8 characters. The cause of the error is tracked by a stack of causes in the error. The stack is used to keep track of multiple causes of one error. The cause is added to the stack when a parser fails and holds the position, error code and a message. The error code is used to identify the cause of the error. Typically parsers wich take other parsers as input will add a cause to the stack when the input parser fails.

Example

Example of a parser that parses a Scheme-like byte-vector and returns it as Vec<u8>.

The grammar is defined as follows:

ByteVector = "#u8(" Byte* ')'
Byte       = <any exact integer between 0 and 255>

The parser is defined as follows:

use kamo::parser::{code, prelude::*, Input, ParseError, ParseResult, Span};

fn main() {
    assert_eq!(
        parse_bytevec(Input::new("#u8(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)")),
        Ok((
            vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            Input::new("")
        ))
    );
}

fn parse_bytevec(input: Input<'_>) -> ParseResult<'_, Vec<u8>> {
    context_as(
        delimited(
            pair(tag("#u8("), ascii::whitespace0),
            list0(parse_bytevec_value, ascii::whitespace1),
            context_as(
                preceded(ascii::whitespace0, char(')')),
                code::ERR_CUSTOM + 2,
                "expecting a closing parenthesis: )",
            ),
        ),
        code::ERR_CUSTOM + 1,
        "expecting a byte vector: #u8(<byte>*)",
    )(input)
}

fn parse_bytevec_value(input: Input<'_>) -> ParseResult<'_, u8> {
    let result = preceded(
        ascii::whitespace0,
        any((
            parse_binary_natural,
            parse_octal_natural,
            parse_decimal_natural,
            parse_hexadecimal_natural,
            parse_natural,
        )),
    )(input);

    match result {
        Ok((value, cursor)) => {
            if value > u8::MAX as u64 {
                Err(ParseError::new(
                    Span::new(input.position(), cursor.position()),
                    code::ERR_CUSTOM + 3,
                    "expecting a byte value: 0..255",
                ))
            } else {
                Ok((value as u8, cursor))
            }
        }
        Err(err) => Err(err),
    }
}

#[inline]
fn parse_binary_natural(input: Input<'_>) -> ParseResult<u64> {
    preceded(tag("#b"), literal::natural(literal::Radix::Binary))(input)
}

#[inline]
fn parse_octal_natural(input: Input<'_>) -> ParseResult<u64> {
    preceded(tag("#o"), literal::natural(literal::Radix::Octal))(input)
}

#[inline]
fn parse_decimal_natural(input: Input<'_>) -> ParseResult<u64> {
    preceded(tag("#d"), parse_natural)(input)
}

#[inline]
fn parse_hexadecimal_natural(input: Input<'_>) -> ParseResult<u64> {
    preceded(tag("#x"), literal::natural(literal::Radix::Hexadecimal))(input)
}

#[inline]
fn parse_natural(input: Input<'_>) -> ParseResult<u64> {
    literal::natural(literal::Radix::Decimal)(input)
}

Type System

The type system is available when the types feature is enabled. It is not enabled by default. When the type system is enabled the env module is also available. It is used by the TypeChecker.

Types are defined by a type code which is a byte array. The type code is used to represent the type of a value. This allows for a simple and efficient way to represent types. The syntax of the type code is straightforward and easy to understand and parse. The type code allows the construction of complex types with multiple levels of nesting by combining simpler types.

The supported primitive types are the following:

  • Boolean: A boolean value.
  • Character: A Unicode character.
  • Integer: A signed integer value.
  • Float: A floating-point value.
  • Symbol: An interned string value.
  • Type: A type value.
  • Binary: A byte array value.

Besides the primitive types, the type system supports the following compound types:

  • Array: A homogeneous collection of elements.
  • Pair: A pair of two values also known as a cons cell.
  • Lambda: A function type.
  • Option: A type that can be None or Some(T).
  • Union: A type that can be one of the given types. There must be at least two types in the union.

Special types are also supported:

  • Any: A type that can be any value.
  • Void: A type that can't be any type or value. It is used to represent the absence of a type. It is used in the context of functions that don't return a value or do not take any arguments.
  • Nil: A type that represents the absence of a value. It is used in the context of optional types.

The types are grouped into different categories:

  • Specific Types: Types that represent a specific value and can be a member of a union type. These types are the following: Boolean, Character, Integer, Float, Symbol, Type, Binary, Array, Pair and Lambda.
  • Filled Types: Types that represent a mor general specific value. These types can be used in an option type. These types are the following: Any, Union and Specific Types.

Parser

The type system includes two parsers: a type code parser and a parser for a textual representation of types. The type code parser is used to parse type codes and convert them into a type. The parser for the textual representation of types is used to parse a string and convert it into a type. The textual representation of types is a human-readable format that can be used to represent types in a more readable way than the type code.

Type Checker

The TypeChecker trait is used to implement type checking for an interpreter or compiler. The type checker is used to check if a value has a specific type or if a type is a subtype of another type. The type checker can also be used to infer the type of an expression or a value. The type checker is used to implement type inference and type checking for a programming language.

Macros

The macros are used to parse a string literal into a single or an array of kamo::value::Values. The macros are defined as follows:

  • sexpr!(<mutator>, <expression>) - A macro for parsing a single s-expression from a string. It returns a kamo::value::Value.

  • sexpr_file!(<mutator>, <filename>) - A macro for parsing multiple s-expressions from a file. It returns an array of kamo::value::Values. The array may be empty.

  • sexpr_script!(<mutator>, <expressions>) - A macro for parsing multiple s-expressions from a string. It returns an array of kamo::value::Values. The array may be empty.

The macros all take an optional MutatorRef identifier. This is used to allocate values on the heap. If the expression does not contain any values that need to be allocated on the heap, then the Mutator identifier can be omitted.

The syntax for the macros is as defined by the Scheme standard R7RS for the read procedure. The syntactic definition is the <datum> in section "7.1.2 External representations" of the standard.

The syntax deviations from the standard are:

  • The extactness (#e and #i) of numbers is not supported. Floating-point numbers are always inexact and integers are always exact.

  • Numbers may only be signed 64-bit integers or IEEE 754 double precision floating-point numbers. The standard allows for arbitrary precision integers, rationals and complex numbers.

  • Labels are not supported.

  • The #; comment syntax is only supported in the macros which parse multiple s-expressions. The #; comment syntax may not be nested.

  • Character literals defined by a hex escape sequence may have 1 to 6 digits. The standard excepts 1 or more digits. The code must be a valid Unicode code point.

The parser is implemented with the pest crate. The grammar is defined in src/sexpr/sexpr.pest. This is necessary because the combination parser library defined in kamo::parser cannot be used here. It would be cyclic dependency. There will be an implementation of a parser for s-expressions in the kamo::form module in the future. It will be based on the kamo::parser crate and will be used by the scheme interpreter.

Examples

use kamo::{mem::Mutator, sexpr, value::{print, Value}};
 
let m = Mutator::new_ref();
let value = sexpr!(m, "(1 2 3)");
 
assert_eq!(print(value).to_string(), "(1 2 3)");
use kamo::{sexpr_file, value::{print, Value}};
 
let m = Mutator::new_ref();
let values = sexpr_file!("tests/sexpr/values.scm");
 
assert_eq!(values.len(), 3);
assert_eq!(print(values[0].clone()).to_string(), "()");
assert_eq!(print(values[1].clone()).to_string(), "100");
assert_eq!(print(values[2].clone()).to_string(), "#t");

let values: &[Value] = &sexpr_file!("tests/sexpr/empty.scm");
assert_eq!(values.len(), 0);
use kamo::{mem::Mutator, sexpr_script, value::{print, Value}};
 
let m = Mutator::new_ref();
let values = sexpr_script!(m, "(define a 1)\n(define b 2)\n(+ a b)");
 
assert_eq!(values.len(), 3);
assert_eq!(print(values[0].clone()).to_string(), "(define a 1)");
assert_eq!(print(values[1].clone()).to_string(), "(define b 2)");
assert_eq!(print(values[2].clone()).to_string(), "(+ a b)");

let values: &[Value] = &sexpr_script!("");
 
assert_eq!(values.len(), 0);

Feature List

  • Module kamo::parser for parsing UTF-8 text. A parser combinator library for parsing UTF-8 text in a safe and mostly zero-copy way.
  • Module kamo::mem for automatic memory management. Values are allocated in a mutator which holds an arena allocator for each type. Memory collection is done by a mark and sweep garbage collector.
  • Module kamo::value for values. Values can either hold immediate values or pointers to values allocated in the mutator.
  • Module kamo::types for types. The type system is used to infer the types of the intermediate representation and the AST.
  • Module kamo::eval for evaluation. The evaluator processes an AST, which is an symbolic expression tree, and evaluates it to an intermediate representation. The intermediate representation can then be interpreted or compiled to a target representation. The interpreter is generic and can be used to interpret any intermediate representation.
  • Module kamo::repl for a read-eval-print-loop. The REPL is used to interactively evaluate expressions and is generic and can be used to evaluate any intermediate representation.
  • Module kamo::lang::scheme for the Scheme language. The Scheme language is implemented as a library on top of the kamo modules. It implements a subset of the R7RS standard.

Dependencies

~0.4–1.5MB
~27K SLoC