Compiler Weekly: Calculator REPL

For this post (and maybe the next few) I want to try something different. Im calling this series “compiler weekly” because I want to explore different parts of compilers in fast iterations.

My goal of this week was to figure out what tools to use when building compilers in rust. I needed a good lexer and parser. Ideally a set of libraries that are easy to use, expressive, extendable and with good error messages.

To test this I build a very simple calculator REPL. It takes math input like this: 5 + 5 / 10 and will print out the answer. It will not handle brackets, order of operations or any other intricacies of calculators. Those are problems for a later time.

The first thing to build for the compiler is a language definition. I roughly used the following definition, written here in BNF:

<equation>       ::= <add> | <subtract> | <multiply> | <divide> | <value>

<add>            ::= <equation> "+" <equation>
<subtract>       ::= <equation> "-" <equation>
<multiply>       ::= <equation> "*" <equation>
<divide>         ::= <equation> "/" <equation>
<value>          ::= <characteristic> | <characteristic> <mantissa>

<characteristic> ::= "-" <number> | <number>
<mantissa>       ::= "." <number>
<number>         ::= <number> <digit> | <digit>

<digit>          ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

I should have started by figuring this out before writing the rest of the program, but in this case I wrote it afterwards.

You can see that a value is a valid equation. A value is basic decimal notation with an optional minus sign: -10, 5.6, -0.01 and so on. The simplest possible program is the text 0. Then you can combine values with a series of symbols 5 + 5, 100 - 60 or 1.5 + 1 - 1.5.

There is one ambiguous part of this definition. The - symbol can either be used to create a subtract expression or a characteristic with a minus sign. This could be improved for a future version. For now the program will just need to separate operator and the value with a space.

Lexer

The lexer is responsible for taking a stream of text and converting that text into a stream symbols that are easier to work with. This is the first piece of code that rune. For example the symbols '-', '1', '0' would pass through the lexer and become Number(-10).

For this I tried using Logos. The reason that I chose it was that it was easy to implement. Did not require many traits to be implemented and allowed for simple regex based parsing. The final lexer is here:

use logos::{Logos, Lexer};
use std::str::FromStr;
use std::iter::FromIterator;
use core::fmt;

#[derive(Logos, Debug, Clone, PartialEq, Copy)]
pub enum Token {
    #[regex(r"-?\d+(?:\.\d+)?", number)]
    Number(f64),
    #[token("+")]
    Plus,
    #[token("-")]
    Minus,
    #[token("*")]
    Multiply,
    #[token("/")]
    Divide,
    #[error]
    #[regex(r"[ \t\n\f]+", logos::skip)]
    Error
}

pub fn lex(input: &str) -> Vec<Token> {
    Vec::from_iter(Token::lexer(input))
}

impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Token::Plus => write!(f, "+"),
            Token::Minus => write!(f, "-"),
            Token::Multiply => write!(f, "*"),
            Token::Divide => write!(f, "/"),
            Token::Number(n) => write!(f, "{}", n),
            Token::Error =>  write!(f, "ERROR"),
        }
    }
}

fn number(lex: &mut Lexer<Token>) -> Option<f64> {
    f64::from_str(lex.slice()).ok()
}

The logos library did a great job of making lexing simple. Adding a few macros to an enum was all that is needed. You can even reference functions within the macro to add extra functionality if the built in conversions are not enough (number is an example of this). Unfortunately I had to implement some extra traits for the parser in the next section. Those traits made the parser more complex than it needed to be.

I think the thing that I would do differently next time would be to return the Lexer type directly rather than converting everything directly to Vec<Token>. The Lexer type has methods such as span and slice that would be super useful when generating error messages later.

Parser

The parser takes the stream of symbols generated by the lexer and converts them into an AST. The AST is much more useful because it contains information about how the program is structured rather than just being a big list. For example Number(1), Plus(), Number (1) becomes Equation(Value(1), Add, Value(1)). With the AST the program should have the same understanding of the input as the programmer who wrote it.

For this I went with pom. It is a fairly simple example of a parser combinator library. The main thing that drew me here was that there were no macros and everything could be adjusted.

extern crate pom;

use pom::parser::*;
use crate::lexer::Token;
use self::pom::Error;
use crate::math_engine::{Math, MathOp};

pub fn parse(tokens: &Vec<Token>) -> Result<Math, Error> {
    let parser = equation() - end();
    parser.parse(&tokens)
}

fn value<'a>() -> Parser<'a, Token, Math> {
    Parser::new(move |input: &[Token], start: usize| {
        if let Some(s) = input.get(start) {
            if let Token::Number(x) = *s {
                Ok((Math::Value(x), start + 1))
            } else {
                Err(Error::Mismatch {
                    message: format!("expect: Number, found: {}", s),
                    position: start,
                })
            }
        } else {
            Err(Error::Incomplete)
        }
    })
}

fn binary<'a>(t: Token, o: MathOp) -> Parser<'a, Token, Math>{
    let expr = value() - sym(t) + equation();
    expr.convert(move |vt| Ok(Math::Equation(Box::from(vt.0), o, Box::from(vt.1))) as Result<Math, Error>)
}

fn equation<'a>() -> Parser<'a, Token, Math>  {
    call(||binary(Token::Plus, MathOp::Add))
        | call(||binary(Token::Minus, MathOp::Sub))
        | call(||binary(Token::Multiply, MathOp::Mul))
        | call(||binary(Token::Divide, MathOp::Div))
        | value()
}

At the moment I am not fully satisfied with pom. The library seems to have been built with the goal of parsing text rather than symbols. From what I saw there was no way to extend the error messages and reference different spans that came from logos. I would also prefer not to use overloaded operators for the combining stages. I get that its clean when you understand all of the symbols but their implementation is not intuitive. Functions with good names and comments would be better. Ill probably replace this library in future versions.

Runtime

The core of the program was very simple. One module that runs all of the math calculations. It simply looks at the AST and converts it into simple commands to run. It recurses down the tree and executes what it can. Eventually it will return a number that can be displayed.

#[derive(Debug)]
pub enum Math {
    Value(f64),
    Equation(Box<Math>, MathOp, Box<Math>)
}

#[derive(Debug, Copy, Clone)]
pub enum MathOp {
    Add,
    Sub,
    Mul,
    Div,
}

pub fn exec(math: &Math) -> f64 {
    match math {
        Math::Value(n) => *n,
        Math::Equation(a,t,b) => {
            match t {
                MathOp::Add => exec(a) + exec(b),
                MathOp::Sub => exec(a) - exec(b),
                MathOp::Mul => exec(a) * exec(b),
                MathOp::Div => exec(a) / exec(b),
            }
        }
    }
}

The main program just puts all of the pieces together. There is a loop that grabs input. That input goes to the lexer to be converted into symbols. The symbols go to the parser to become equations. The finally the math engine runs the equations and prints out the results. Those steps repeat until the program is closed.

mod lexer;
mod parser;
mod math_engine;

use std::io::stdin;
use crate::lexer::lex;
use crate::parser::parse;
use crate::math_engine::exec;

fn main() {
    let mut buffer = String::new();
    while stdin().read_line(&mut buffer).is_ok() {
        let tokens = lex(&buffer);
        let res = parse(&tokens);
        match res {
            Ok(math) => println!("{}", exec(&math)),
            Err(e) =>  eprintln!("error: {:?}", e)
        }
        buffer.clear();
    }
}

Thoughts For The Future

Overall, I think that this was a good first start. I really liked working with the logos library. I think it did a great job of building a lexer and made it easy to understand what I needed. The pom library was a good start but I will probably look for something better later. It just does not align with my mental model of how the AST should work. Maybe I just need to build a wrapper around it or something.

Next time I want to dive deeper into the parsing side of things, so it will be a good time to experiment there.

_13/10/2021