Moonbit 与 llvm 共舞 上篇 - 实现语法前端

引言
编程语言设计与编译器实现历来被视为计算机科学领域中最具挑战性的课题之一。传统的编译器教学路径往往要求学生首先掌握复杂的理论基础:
- 自动机理论:有限状态自动机与正则表达式
- 类型理论:λ演算与类型系统的数学基础
- 计算机体系结构:从汇编语言到机器码的底层实现
然而,Moonbit作为一门专为现代开发环境设计的函数式编程语言,为我们提供了一个全新的视角。它不仅具备严谨的类型系统和卓越的内存安全保障,更重要的是,其丰富的语法特性和为AI时代量身定制的工具链,使得Moonbit成为学习和实现编译器的理想选择。
系列概述 本系列文章将通过构建一个名为TinyMoonbit的小型编程语言编译器,深入探讨现代编译器实现的核心概念和最佳实践。
- 上篇:聚焦语言前端的实现,包括词法分析、语法解析和类型检查,最终生成带有完整类型标记的抽象语法树
- 下篇:深入代码生成阶段,利用Moonbit官方的
llvm.mbt绑定库,将语法树转换为LLVM中间表示,并最终生成RISC-V汇编代码
TinyMoonbit 语言设计
TinyMoonbit是一种系统级编程语言,其抽象层次与C语言相当。虽然在语法设计上大量借鉴了Moonbit的特性,但TinyMoonbit实际并非Moonbit语言的子集,而是一个为测试llvm.mbt功能完备性兼具教学作用的简化版本。
注:由于篇幅限制,本系列文章所提到的TinyMoonbit实现比真正的TinyMoonbit更加简单,完整版本请参考 TinyMoonbitLLVM。
核心特性
TinyMoonbit提供了现代系统编程所需的基础功能:
- ✅ 底层内存操作:直接的指针操作和内存管理
- ✅ 控制流结构:条件分支、循环和函数调用
- ✅ 类型安全:静态类型检查和明确的类型声明
- ❌ 简化设计:为降低实现复杂度,不支持类型推导和闭包等高级特性
语法示例
让我们通过一个经典的斐波那契数列实现来展示TinyMoonbit的语法:
extern fn (x : Int) -> Unit
print_int(Int
x : Int
Int) -> Unit
Unit;
// 递归实现斐波那契数列
fn (n : Int) -> Int
fib(Int
n : Int
Int) -> Int
Int {
if Int
n (self_ : Int, other : Int) -> Bool
<= 1 {
return Int
n;
}
return (n : Int) -> Int
fib(Int
n (self : Int, other : Int) -> Int
Performs subtraction between two 32-bit integers, following standard two's
complement arithmetic rules. When the result overflows or underflows, it
wraps around within the 32-bit integer range.
Parameters:
self : The minuend (the number being subtracted from).
other : The subtrahend (the number to subtract).
Returns the difference between self and other.
Example:
let a = 42
let b = 10
inspect(a - b, content="32")
let max = 2147483647 // Int maximum value
inspect(max - -1, content="-2147483648") // Overflow case
- 1) (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self : The first integer operand.
other : The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ (n : Int) -> Int
fib(Int
n (self : Int, other : Int) -> Int
Performs subtraction between two 32-bit integers, following standard two's
complement arithmetic rules. When the result overflows or underflows, it
wraps around within the 32-bit integer range.
Parameters:
self : The minuend (the number being subtracted from).
other : The subtrahend (the number to subtract).
Returns the difference between self and other.
Example:
let a = 42
let b = 10
inspect(a - b, content="32")
let max = 2147483647 // Int maximum value
inspect(max - -1, content="-2147483648") // Overflow case
- 2);
}
fn main() -> Unit {
(x : Int) -> Unit
print_int((n : Int) -> Int
fib(10));
}
编译目标
经过完整的编译流程后,上述代码将生成如下的LLVM中间表示:
; ModuleID = 'tinymoonbit'
source_filename = "tinymoonbit"
define i32 @fib(i32 %0) {
entry:
%1 = alloca i32, align 4
store i32 %0, ptr %1, align 4
%2 = load i32, ptr %1, align 4
%3 = icmp sle i32 %2, 1
br i1 %3, label %4, label %6
4: ; preds = %entry
%5 = load i32, ptr %1, align 4
ret i32 %5
6: ; preds = %4, %entry
%7 = load i32, ptr %1, align 4
%8 = sub i32 %7, 1
%9 = call i32 @fib(i32 %8)
%10 = load i32, ptr %1, align 4
%11 = sub i32 %10, 2
%12 = call i32 @fib(i32 %11)
%13 = add i32 %9, %12
ret i32 %13
}
define void @main() {
entry:
%0 = call i32 @fib(i32 10)
call void @print_int(i32 %0)
}
declare void @print_int(i32 %0)
第二章:词法分析
词法分析(Lexical Analysis)构成了编译过程的第一道关卡,其核心使命是将连续的字符流转换为具有语义意义的词法单元(Tokens)序列。这个看似简单的转换过程,实际上是整个编译器流水线的基石。
从字符到符号:Token的设计与实现
考虑以下代码片段:
let Int
x : Int
Int = 5;
经过词法分析器处理后,将产生如下的Token序列:
(Keyword "let") → (Identifier "x") → (Symbol ":") →
(Type "Int") → (Operator "=") → (IntLiteral 5) → (Symbol ";")
这个转换过程需要处理多种复杂情况:
- 空白符过滤:跳过空格、制表符和换行符
- 关键字识别:区分保留字与用户定义标识符
- 数值解析:正确识别整数、浮点数的边界
- 运算符处理:区分单字符和多字符运算符
Token类型系统设计
基于TinyMoonbit的语法规范,我们将所有可能的符号分类为以下Token类型:
pub enum Token {
(Bool) -> Token
Bool(Bool
Bool) // 布尔值:true, false
(Int) -> Token
Int(Int
Int) // 整数:1, 2, 3, ...
(Double) -> Token
Double(Double
Double) // 浮点数:1.0, 2.5, 3.14, ...
(String) -> Token
Keyword(String
String) // 保留字:let, if, while, fn, return
(String) -> Token
Upper(String
String) // 类型标识符:首字母大写,如 Int, Double, Bool
(String) -> Token
Lower(String
String) // 变量标识符:首字母小写,如 x, y, result
(String) -> Token
Symbol(String
String) // 运算符和标点:+, -, *, :, ;, ->
(Char) -> Token
Bracket(Char
Char) // 括 号类:(, ), [, ], {, }
Token
EOF // 文件结束标记
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq)
利用模式匹配
Moonbit的强大模式匹配能力使我们能够以一种前所未有的优雅方式实现词法分析器。与传统的有限状态自动机方法相比,这种基于模式匹配的实现更加直观和易于理解。
核心分析函数
pub fn (code : String) -> Array[Token]
lex(String
code: String
String) -> type Array[T]
An Array is a collection of values that supports random access and can
grow in size.
Array[enum Token {
Bool(Bool)
Int(Int)
Double(Double)
Keyword(String)
Upper(String)
Lower(String)
Symbol(String)
Bracket(Char)
EOF
} derive(Show, Eq)
Token] {
let Array[Token]
tokens = type Array[T]
An Array is a collection of values that supports random access and can
grow in size.
Array::(capacity? : Int) -> Array[Token]
Creates a new empty array with an optional initial capacity.
Parameters:
capacity : The initial capacity of the array. If 0 (default), creates an
array with minimum capacity. Must be non-negative.
Returns a new empty array of type Array[T] with the specified initial
capacity.
Example:
let arr : Array[Int] = Array::new(capacity=10)
inspect(arr.length(), content="0")
inspect(arr.capacity(), content="10")
let arr : Array[Int] = Array::new()
inspect(arr.length(), content="0")
new()
loop String
code[:] {
// 跳过空白字符
StringView
[' ' | '\n' | '\r' | '\t', ..rest] =>
continue StringView
rest
// 处理单行注释
StringView
[.."//", ..rest] =>
continue loop StringView
rest {
StringView
['\n' | '\r', ..rest_str] => break StringView
rest_str
StringView
[_, ..rest_str] => continue StringView
rest_str
StringView
[] as rest_str => break StringView
rest_str
}
// 识别多字符运算符(顺序很重要!)
StringView
[.."->", ..rest] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol("->")); continue StringView
rest }
StringView
[.."==", ..rest] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol("==")); continue StringView
rest }
StringView
[.."!=", ..rest] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol("!=")); continue StringView
rest }
StringView
[.."<=", ..rest] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol("<=")); continue StringView
rest }
StringView
[..">=", ..rest] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol(">=")); continue StringView
rest }
// 识别单字符运算符和标点符号
[':' | '.' | ',' | ';' | '+' | '-' | '*' |
'/' | '%' | '>' | '<' | '=' as c, ..rest] => {
Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((String) -> Token
Symbol("\{Char
c}"))
continue StringView
rest
}
// 识别括号
StringView
[Char
'(' | ')' | '[' | ']' | '{' | '}' as cStringView
, ..rest] => {
Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push((Char) -> Token
Bracket(Char
c))
continue StringView
rest
}
// 识别标识符和字面量
StringView
['a'..='z', ..] as code => {
let (Token
tok, StringView
rest) = (StringView) -> (Token, StringView)
lex_ident(StringView
code);
Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push(Token
tok)
continue StringView
rest
}
['A'..='Z', ..] => { ... }
['0'..='9', ..] => { ... }
// 到达文件末尾
[] => { Array[Token]
tokens.(self : Array[Token], value : Token) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push(Token
EOF); break Array[Token]
tokens }
}
}
关键字识别策略
标识符解析需要特别处理关键字的识别:
pub fn (rest : StringView) -> (Token, StringView)
let_ident(StringView
rest: type StringView
StringView represents a view of a String that maintains proper Unicode
character boundaries. It allows safe access to a substring while handling
multi-byte characters correctly.
@string.View) -> (enum Token {
Bool(Bool)
Int(Int)
Double(Double)
Keyword(String)
Upper(String)
Lower(String)
Symbol(String)
Bracket(Char)
EOF
} derive(Show, Eq)
Token, type StringView
StringView represents a view of a String that maintains proper Unicode
character boundaries. It allows safe access to a substring while handling
multi-byte characters correctly.
@string.View) {
// 预定义关键字映射表
let Unit
keyword_map = Unit
Map.(Array[(String, Token)]) -> Unit
from_array([
("let", Token::(String) -> Token
Keyword("let")),
("fn", Token::(String) -> Token
Keyword("fn")),
("if", Token::(String) -> Token
Keyword("if")),
("else", Token::(String) -> Token
Keyword("else")),
("while", Token::(String) -> Token
Keyword("while")),
("return", Token::(String) -> Token
Keyword("return")),
("extern", Token::(String) -> Token
Keyword("extern")),
("true", Token::(Bool) -> Token
Bool(true)),
("false", Token::(Bool) -> Token
Bool(false)),
])
let Array[Char]
identifier_chars = type Array[T]
An Array is a collection of values that supports random access and can
grow in size.
Array::(capacity? : Int) -> Array[Char]
Creates a new empty array with an optional initial capacity.
Parameters:
capacity : The initial capacity of the array. If 0 (default), creates an
array with minimum capacity. Must be non-negative.
Returns a new empty array of type Array[T] with the specified initial
capacity.
Example:
let arr : Array[Int] = Array::new(capacity=10)
inspect(arr.length(), content="0")
inspect(arr.capacity(), content="10")
let arr : Array[Int] = Array::new()
inspect(arr.length(), content="0")
new()
let StringView
remaining = loop StringView
rest {
StringView
[Char
'a'..='z' | 'A'..='Z' | '0'..='9' | '_' as cStringView
, ..rest_str] => {
Array[Char]
identifier_chars.(self : Array[Char], value : Char) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push(Char
c)
continue StringView
rest_str
}
StringView
_ as rest_str => break StringView
rest_str
}
let String
ident = (ArrayView[Char]) -> String
String::(chars : ArrayView[Char]) -> String
Convert char array to string.
let s = @string.from_array(['H', 'e', 'l', 'l', 'o'])
assert_eq(s, "Hello")
Do not convert large datas to Array[Char] and build a string with String::from_array.
For efficiency considerations, it's recommended to use Buffer instead.
from_array(Array[Char]
identifier_chars)
let Token
token = Unit
keyword_map.(Unit) -> Unit
get(Unit
identifier).(() -> Token) -> Token
or_else(() => Token::(String) -> Token
Lower(String
ident))
(Token
token, StringView
remaining)
}