正则表达式引擎的两种实现方法:导数与 Thompson 虚拟机
正则表达式引擎的实现方式多样,不同方法在性能、内存消耗和实现复杂度上各有权衡。本文将介绍两种数学上等价但实际表现迥异的正则匹配方法:Brzozowski 导数方法和 Thompson 虚拟机方法。
这两种方法都基于相同的抽象语法树表示,为直接的性能对比提供了统一的基础。其核心思想在于:这些看似不同的方法实际上是用不同的计算策略来解决同一个问题——一个依靠代数变换,另一个则通过程序执行。
约定与定义
为了建立统一的基础,两种正则表达式引擎都采用相同的抽象语法树(AST)表示,用树形结构来描述正则表达式的基本构造:
enum Ast {
(Char) -> Ast
Chr(Char
Char)
(Ast, Ast) -> Ast
Seq(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast)
(Ast, Int?) -> Ast
Rep(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Int
Int?)
(Ast) -> Ast
Opt(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait ToJson {
to_json(Self) -> Json
}
Trait for types that can be converted to Json
ToJson, trait Hash {
hash_combine(Self, Hasher) -> Unit
hash(Self) -> Int
}
Trait for types that can be hashed
The hash
method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine
method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq
trait, they should produce the same hash value.
The hash
method does not need to be implemented if hash_combine
is implemented,
When implemented separately, hash
does not need to produce a hash value that is consistent with hash_combine
.
Hash, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq)
此外,我们还提供了智能构造函数来简化正则表达式的构建:
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::(chr : Char) -> Ast
chr(Char
chr : Char
Char) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Char) -> Ast
Chr(Char
chr)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::(self : Ast, other : Ast) -> Ast
seq(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Ast
other : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Ast, Ast) -> Ast
Seq(Ast
self, Ast
other)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::(self : Ast, n? : Int) -> Ast
rep(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Int?
n? : Int
Int) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Ast, Int?) -> Ast
Rep(Ast
self, Int?
n)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::(self : Ast) -> Ast
opt(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
Unit
@fs.
(Ast) -> Ast
Opt(Ast
self)
}
AST 定义了四种基本的正则表达式操作:
Chr(Char)
- 匹配单个字符字面量Seq(Ast, Ast)
- 序列匹配,即一个模式紧跟另一个模式Rep(Ast, Int?)
- 重复匹配,None
表示无限次重复,Some(n)
表示恰好重复 n 次Opt(Ast)
- 可选匹配,相当于标准正则语法中的pattern?
举个例子,正则表达式 (ab*)?
表示一个可选的序列('a' 后跟零个或多个 'b'),可以这样构建:
Ast::chr('a').seq(Ast::chr('b').rep()).opt()
Brzozowski 导数方法
导数方法基于形式语言理论,通过代数变换来处理正则表达式。对于输入的每个字符,该方法计算正则表达式的"导数",实质上是在问:"消费掉这个字符后,还剩下什么需要匹配?"这样就得到了一个新的正则表达式,代表剩余的匹配模式。
为了明确表示导数和可空性,我们对基本的 Ast
类型进行了扩展:
enum Exp {
Exp
Nil
Exp
Eps
(Char) -> Exp
Chr(Char
Char)
(Exp, Exp) -> Exp
Alt(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp)
(Exp, Exp) -> Exp
Seq(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp)
(Exp) -> Exp
Rep(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Hash {
hash_combine(Self, Hasher) -> Unit
hash(Self) -> Int
}
Trait for types that can be hashed
The hash
method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine
method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq
trait, they should produce the same hash value.
The hash
method does not need to be implemented if hash_combine
is implemented,
When implemented separately, hash
does not need to produce a hash value that is consistent with hash_combine
.
Hash, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq, trait Compare {
compare(Self, Self) -> Int
}
Trait for types whose elements are ordered
The return value of [compare] is:
- zero, if the two arguments are equal
- negative, if the first argument is smaller
- positive, if the first argument is greater
Compare, trait ToJson {
to_json(Self) -> Json
}
Trait for types that can be converted to Json
ToJson)
Exp
中各构造器的含义如下:
Nil
- 表示不可能匹配的模式,即空集Eps
- 匹配空字符串Chr(Char)
- 匹配单个字符Alt(Exp, Exp)
- 表示选择(或),在多个模式间进行选择Seq(Exp, Exp)
- 表示连接,将两个模式依次连接Rep(Exp)
- 表示重复,对模式进行零次或多次重复
通过 Exp::of_ast
函数,我们可以将 Ast
转换为表达能力更强的 Exp
格式:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp {
match Ast
ast {
(Char) -> Ast
Chr(Char
c) => (Char) -> Exp
Chr(Char
c)
(Ast, Ast) -> Ast
Seq(Ast
a, Ast
b) => (Exp, Exp) -> Exp
Seq(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
b))
(Ast, Int?) -> Ast
Rep(Ast
a, Int?
None) => (Exp) -> Exp
Rep(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a))
(Ast, Int?) -> Ast
Rep(Ast
a, (Int) -> Int?
Some(Int
n)) => {
let Exp
sec = enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a)
let mut Exp
exp = Exp
sec
for _ in Int
1..<Int
n {
Exp
exp = (Exp, Exp) -> Exp
Seq(Exp
exp, Exp
sec)
}
Exp
exp
}
(Ast) -> Ast
Opt(Ast
a) => (Exp, Exp) -> Exp
Alt(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), Exp
Eps)
}
}
同样,我们也为 Exp
提供了智能构造函数来简化模式构建:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(a : Exp, b : Exp) -> Exp
seq(Exp
a : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, Exp
b : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp {
match (Exp
a, Exp
b) {
(Exp
Nil, _) | (_, Exp
Nil) => Exp
Nil
(Exp
Eps, Exp
b) => Exp
b
(Exp
a, Exp
Eps) => Exp
a
(Exp
a, Exp
b) => (Exp, Exp) -> Exp
Seq(Exp
a, Exp
b)
}
}
不过,Alt
的智能构造函数特别重要——它保证构造出的 Exp
符合 Brzozowski 原论文中的"相似性"标准化要求。两个正则表达式如果能通过以下规则相互转换,就被认为是相似的:
因此,我们对 Alt
构造进行标准化,确保始终使用一致的结合律和选择顺序:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(a : Exp, b : Exp) -> Exp
alt(Exp
a : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, Exp
b : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp {
match (Exp
a, Exp
b) {
(Exp
Nil, Exp
b) => Exp
b
(Exp
a, Exp
Nil) => Exp
a
((Exp, Exp) -> Exp
Alt(Exp
a, Exp
b), Exp
c) => Exp
a.(a : Exp, b : Exp) -> Exp
alt(Exp
b.(a : Exp, b : Exp) -> Exp
alt(Exp
c))
(Exp
a, Exp
b) => {
if Exp
a (Exp, Exp) -> Bool
automatically derived
== Exp
b {
Exp
a
} else if Exp
a (self_ : Exp, other : Exp) -> Bool
> Exp
b {
(Exp, Exp) -> Exp
Alt(Exp
b, Exp
a)
} else {
(Exp, Exp) -> Exp
Alt(Exp
a, Exp
b)
}
}
}
}
nullable
函数用于判断一个模式是否能够在不消费任何输入的情况下成功匹配(即匹配空字符串):
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(self : Exp) -> Bool
nullable(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp) -> Bool
Bool {
match Exp
self {
Exp
Nil => false
Exp
Eps => true
(Char) -> Exp
Chr(_) => false
(Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
|| Exp
r.(self : Exp) -> Bool
nullable()
(Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
&& Exp
r.(self : Exp) -> Bool
nullable()
(Exp) -> Exp
Rep(_) => true
}
}
deriv
函数计算模式对于特定字符的导数,按照 Brzozowski 导数理论中定义的规则对模式进行变换。我们对规则进行了重新排列,使其与 deriv
函数的实现顺序保持一致:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(self : Exp, c : Char) -> Exp
deriv(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, Char
c : Char
Char) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp {
match Exp
self {
Exp
Nil => Exp
self
Exp
Eps => Exp
Nil
(Char) -> Exp
Chr(Char
d) if Char
d (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:
self
: The first character to compare.
other
: The second character to compare.
Returns true
if both characters represent the same Unicode code point,
false
otherwise.
Example:
let a = 'A'
let b = 'A'
let c = 'B'
inspect(a == b, content="true")
inspect(a == c, content="false")
== Char
c => Exp
Eps
(Char) -> Exp
Chr(_) => Exp
Nil
(Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
(Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => {
let Exp
dl = Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c)
if Exp
l.(self : Exp) -> Bool
nullable() {
Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
} else {
Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r)
}
}
(Exp) -> Exp
Rep(Exp
e) => Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
seq(Exp
self)
}
}
为了简化实现,我们这里只进行严格匹配,也就是说模式必须匹配整个输入字符串。因此,只有在处理完所有输入字符后,我们才检查最终模式的可空性:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(self : Exp, s : String) -> Bool
matches(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp, String
s : String
String) -> Bool
Bool {
loop (Exp
self, String
s.(self : String, start_offset? : Int, end_offset? : Int) -> @string.View
Creates a View
into a String
.
Example
let str = "Hello🤣🤣🤣"
let view1 = str.view()
inspect(view1, content=
"Hello🤣🤣🤣"
)
let start_offset = str.offset_of_nth_char(1).unwrap()
let end_offset = str.offset_of_nth_char(6).unwrap() // the second emoji
let view2 = str.view(start_offset~, end_offset~)
inspect(view2, content=
"ello🤣"
)
view()) {
(Exp
Nil, _) => {
return false
}
(Exp
e, []) => {
return Exp
e.(self : Exp) -> Bool
nullable()
}
(Exp
e, @string.View
[Char
c@string.View
, .. s]) => {
continue (Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c), @string.View
s)
}
}
}
虚拟机方法
虚拟机方法将正则表达式编译成简单虚拟机的字节码指令。这种方法把模式匹配问题转化为程序执行过程,虚拟机同时模拟非确定性有限自动机中所有可能的执行路径。
Ken Thompson 在 1968 年的经典论文中描述了一种将正则模式编译为 IBM 7094 机器代码的引擎。其关键思路是:通过维护多个执行线程来避免指数级回溯,这些线程同步地在输入中前进,每次处理一个字符,同时探索所有可能的匹配路径。
指令集与程序表示
该虚拟机基于四种基本指令运行,它们分别对应 NFA 的不同操作:
enum Ops {
Ops
Done
(Char) -> Ops
Char(Char
Char)
(Int) -> Ops
Jump(Int
Int)
(Int) -> Ops
Fork(Int
Int)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait ToJson {
to_json(Self) -> Json
}
Trait for types that can be converted to Json
ToJson)
每条指令在 NFA 模拟中都有其特定作用:Done
标记匹配成功完成,对应 Thompson 原设计中的 match
;Char(c)
消费输入字符 c
并跳转到下一条指令;Jump(addr)
无条件跳转至地址 addr
,即 Thompson 的 jmp
;Fork(addr)
创建两条执行路径——一条继续执行下一条指令,另一条跳转到 addr
,对应 Thompson 的 split
。
Fork
指令是处理模式非确定性的关键,比如选择和重复操作,这些情况下需要同时探索多条执行路径。这直接对应了 NFA 中的 ε-转换, 即执行流可以在不消费输入的情况下发生分支。
我们定义了 Prg
类型,它封装了指令数组并提供便捷的方法来构建和操作字节码程序:
type Prg type Array[T]
An Array
is a collection of values that supports random access and can
grow in size.
Array[enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show, ToJson)
Ops] derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait ToJson {
to_json(Self) -> Json
}
Trait for types that can be converted to Json
ToJson)
fn type Prg Array[Ops] derive(Show, ToJson)
Prg::(self : Prg, inst : Ops) -> Unit
push(Prg
self : type Prg Array[Ops] derive(Show, ToJson)
Prg, Ops
inst : enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show, ToJson)
Ops) -> Unit
Unit {
Prg
self.(self : Prg) -> Array[Ops]
Convert newtype to its underlying type, automatically derived.
inner().(self : Array[Ops], value : Ops) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push(Ops
inst)
}
fn type Prg Array[Ops] derive(Show, ToJson)
Prg::(self : Prg) -> Int
length(Prg
self : type Prg Array[Ops] derive(Show, ToJson)
Prg) -> Int
Int {
Prg
self.(self : Prg) -> Array[Ops]
Convert newtype to its underlying type, automatically derived.
inner().(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length()
}
fn type Prg Array[Ops] derive(Show, ToJson)
Prg::(self : Prg, index : Int, inst : Ops) -> Unit
op_set(Prg
self : type Prg Array[Ops] derive(Show, ToJson)
Prg, Int
index : Int
Int, Ops
inst : enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show, ToJson)
Ops) -> Unit
Unit {
Prg
selfArray[Ops]
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:
array
: The array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws an error if index
is negative or greater than or equal to the length
of the array.
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
.(self : Prg) -> Array[Ops]
Convert newtype to its underlying type, automatically derived.
innerArray[Ops]
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:
array
: The array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws an error if index
is negative or greater than or equal to the length
of the array.
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
()(Array[Ops], Int, Ops) -> Unit
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:
array
: The array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws an error if index
is negative or greater than or equal to the length
of the array.
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[index] = Ops
inst
}
AST 到字节码的编译
Prg::of_ast
函数采用标准的 NFA 构造技术,将 AST 模式转换为虚拟机指令:
-
Seq(a, b)
:code for a code for b
-
Rep(a, None)
(无界重复):Fork L1, L2 L1: code for a Jump L1 L2:
-
Rep(a, Some(n))
(固定重复):code for a code for a ... (n times) ...
-
Opt(a)
(可选):Fork L1, L2 L1: code for a L2:
需要注意的是,Fork
构造器只接受一个地址参数,这是因为我们总是希望在 Fork
指令后继续执行下一条指令。
fn type Prg Array[Ops] derive(Show, ToJson)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> type Prg Array[Ops] derive(Show, ToJson)
Prg {
fn (Prg, Ast) -> Unit
compile(Prg
prog : type Prg Array[Ops] derive(Show, ToJson)
Prg, Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> Unit
Unit {
match Ast
ast {
(Char) -> Ast
Chr(Char
chr) => Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Char) -> Ops
Char(Char
chr))
(Ast, Ast) -> Ast
Seq(Ast
l, Ast
r) => {
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
l)
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
r)
}
(Ast, Int?) -> Ast
Rep(Ast
e, Int?
None) => {
let Int
fork = Prg
prog.(self : Prg) -> Int
length()
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Jump(Int
fork))
Prg
prog(Prg, Int, Ops) -> Unit
[fork] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
}
(Ast, Int?) -> Ast
Rep(Ast
e, (Int) -> Int?
Some(Int
n)) =>
for _ in Int
0..<Int
n {
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
}
(Ast) -> Ast
Opt(Ast
e) => {
let Int
fork_inst = Prg
prog.(self : Prg) -> Int
length()
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
Prg
prog(Prg, Int, Ops) -> Unit
[fork_inst] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
}
}
}
let Prg
prog : type Prg Array[Ops] derive(Show, ToJson)
Prg = []
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
ast)
Prg
prog.(self : Prg, inst : Ops) -> Unit
push(Ops
Done)
Prg
prog
}
虚拟机执行循环
在 Rob Pike 的实现中,虚拟机会在输入字符串结束后再执行一轮来处理最终的接受状态。为了明确这个过程,我们的 matches
函数采用两阶段方法来实现核心的虚拟机执行循环:
阶段一:字符处理。对于每个输入字符,处理当前上下文中所有活跃的线程。如果 Char
指令匹配当前字符,就在下一个上下文中创建新线程。Jump
和 Fork
指令会立即在当前上下文中产生新线程。处理完所有线程后,交换上下文并继续处理下一个字符。
阶段二:最终接受判断。处理完所有输入后,检查剩余线程中是否有 Done
指令。同时处理那些不消费输入的 Jump
/Fork
指令。如果有任何线程到达 Done
指令,就返回 true
。
fn type Prg Array[Ops] derive(Show, ToJson)
Prg::(self : Prg, data : @string.View) -> Bool
matches(Prg
self : type Prg Array[Ops] derive(Show, ToJson)
Prg, @string.View
data : #builtin.valtype
type @string.View
A @string.View
represents a view of a String that maintains proper Unicode
character boundaries. It allows safe access to a substring while handling
multi-byte characters correctly.
@string.View) -> Bool
Bool {
let (Array[Ops]) -> Prg
Prg(Array[Ops]
prog) = Prg
self
let mut Ctx
curr = struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length())
let mut Ctx
next = struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length())
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(0)
for Char
c in @string.View
data {
while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
match Array[Ops]
prog(Array[Ops], Int) -> Ops
Retrieves an element from the array at the specified index.
Parameters:
array
: The array to get the element from.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a panic if the index is negative or greater than or equal to the
length of the array.
Example:
let arr = [1, 2, 3]
inspect(arr[1], content="2")
[pc] {
Ops
Done => ()
(Char) -> Ops
Char(Char
char) if Char
char (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:
self
: The first character to compare.
other
: The second character to compare.
Returns true
if both characters represent the same Unicode code point,
false
otherwise.
Example:
let a = 'A'
let b = 'A'
let c = 'B'
inspect(a == b, content="true")
inspect(a == c, content="false")
== Char
c => {
Ctx
next.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
(Int) -> Ops
Jump(Int
jump) =>
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
jump)
(Int) -> Ops
Fork(Int
fork) => {
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
fork)
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
_ => ()
}
}
let Ctx
temp = Ctx
curr
Ctx
curr = Ctx
next
Ctx
next = Ctx
temp
Ctx
next.(self : Ctx) -> Unit
reset()
}
while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
match Array[Ops]
prog(Array[Ops], Int) -> Ops
Retrieves an element from the array at the specified index.
Parameters:
array
: The array to get the element from.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a panic if the index is negative or greater than or equal to the
length of the array.
Example:
let arr = [1, 2, 3]
inspect(arr[1], content="2")
[pc] {
Ops
Done => return true
(Int) -> Ops
Jump(Int
x) => Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
(Int) -> Ops
Fork(Int
x) => {
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
_ => ()
}
}
false
}
在 Rob Pike 的原始博客中,他使用递归函数来处理 Fork
和 Jump
指令,以保证线程按优先级执行。而我们这里采用了类似栈的结构来管理所有执行线程,这样可以自然地维护线程优先级:
struct Ctx {
@deque.Deque[Int]
deque : type @deque.Deque[A]
@deque.T[Int
Int]
FixedArray[Bool]
visit : type FixedArray[A]
FixedArray[Bool
Bool]
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Int
length : Int
Int) -> struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx {
{ @deque.Deque[Int]
deque: (capacity? : Int) -> @deque.Deque[Int]
Creates a new empty deque with an optional initial capacity.
Parameters:
capacity
: The initial capacity of the deque. If not specified, defaults
to 0 and will be automatically adjusted as elements are added.
Returns a new empty deque of type T[A]
where A
is the type of elements
the deque will hold.
Example
let dq : @deque.Deque[Int] = @deque.new()
inspect(dq.length(), content="0")
inspect(dq.capacity(), content="0")
let dq : @deque.Deque[Int] = @deque.new(capacity=10)
inspect(dq.length(), content="0")
inspect(dq.capacity(), content="10")
@deque.new(), FixedArray[Bool]
visit: type FixedArray[A]
FixedArray::(len : Int, init : Bool) -> FixedArray[Bool]
Creates a new fixed-size array with the specified length, initializing all
elements with the given value.
Parameters:
length
: The length of the array to create. Must be non-negative.
initial_value
: The value used to initialize all elements in the array.
Returns a new fixed-size array of type FixedArray[T]
with length
elements, where each element is initialized to initial_value
.
Throws a panic if length
is negative.
Example:
let arr = FixedArray::make(3, 42)
inspect(arr[0], content="42")
inspect(arr.length(), content="3")
WARNING: A common pitfall is creating with the same initial value, for example:
let two_dimension_array = FixedArray::make(10, FixedArray::make(10, 0))
two_dimension_array[0][5] = 10
assert_eq(two_dimension_array[5][5], 10)
This is because all the cells reference to the same object (the FixedArray[Int] in this case).
One should use makei() instead which creates an object for each index.
make(Int
length, false) }
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx, pc : Int) -> Unit
add(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx, Int
pc : Int
Int) -> Unit
Unit {
if Bool
!Ctx
selfBool
.FixedArray[Bool]
visit(FixedArray[Bool], Int) -> Bool
Retrieves an element at the specified index from a fixed-size array. This
function implements the array indexing operator []
.
Parameters:
array
: The fixed-size array to access.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a runtime error if the index is out of bounds (negative or greater
than or equal to the length of the array).
Example:
let arr = FixedArray::make(3, 42)
inspect(arr[1], content="42")
[Bool
pc] {
Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int], value : Int) -> Unit
Adds an element to the back of the deque.
If the deque is at capacity, it will be reallocated.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
dv.push_back(6)
assert_eq(dv.back(), Some(6))
push_back(Int
pc)
Ctx
self.FixedArray[Bool]
visit(FixedArray[Bool], Int, Bool) -> Unit
Sets a value at the specified index in a fixed-size array. The original value
at that index is overwritten.
Parameters:
array
: The fixed-size array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws a runtime error if the index is out of bounds (less than 0 or greater
than or equal to the length of the array).
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[pc] = true
}
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Int?
pop(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx) -> Int
Int? {
match Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Int?
Removes a back element from a deque and returns it, or None
if it is empty.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
assert_eq(dv.pop_back(), Some(5))
pop_back() {
(Int) -> Int?
Some(Int
pc) => {
Ctx
self.FixedArray[Bool]
visit(FixedArray[Bool], Int, Bool) -> Unit
Sets a value at the specified index in a fixed-size array. The original value
at that index is overwritten.
Parameters:
array
: The fixed-size array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws a runtime error if the index is out of bounds (less than 0 or greater
than or equal to the length of the array).
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[pc] = false
(Int) -> Int?
Some(Int
pc)
}
Int?
None => Int?
None
}
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Unit
reset(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx) -> Unit
Unit {
Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Unit
Clears the deque, removing all values.
This method has no effect on the allocated capacity of the deque, only setting the length to 0.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
dv.clear()
inspect(dv.length(), content="0")
clear()
Ctx
self.FixedArray[Bool]
visit.(self : FixedArray[Bool], value : Bool, start? : Int, end? : Int) -> Unit
Fill the array with a given value.
This method fills all or part of a FixedArray with the given value.
Parameters
value
: The value to fill the array with
start
: The starting index (inclusive, default: 0)
end
: The ending index (exclusive, optional)
If end
is not provided, fills from start
to the end of the array.
If start
equals end
, no elements are modified.
Panics
- Panics if
start
is negative or greater than or equal to the array length
- Panics if
end
is provided and is less than start
or greater than array length
- Does nothing if the array is empty
Example
// Fill entire array
let fa : FixedArray[Int] = [0, 0, 0, 0, 0]
fa.fill(3)
inspect(fa, content="[3, 3, 3, 3, 3]")
// Fill from index 1 to 3 (exclusive)
let fa2 : FixedArray[Int] = [0, 0, 0, 0, 0]
fa2.fill(9, start=1, end=3)
inspect(fa2, content="[0, 9, 9, 0, 0]")
// Fill from index 2 to end
let fa3 : FixedArray[String] = ["a", "b", "c", "d"]
fa3.fill("x", start=2)
inspect(fa3, content=(
#|["a", "b", "x", "x"]
))
fill(false)
}
visit
数组用于过滤掉低优先级的重复线程。添加新线程时,我们先通过 visit
数组检查该线程是否已存在于 deque
中。如果已存在就直接丢弃;否则加入 deque
并标记为已访问。这个机制对于处理像 (a?)*
这样可能无限扩展的模式很重要,能够有效避免无限循环或指数级的线程爆炸。
基准测试与性能分析
我们通过一个对很多正则表达式实现都构成挑战的病理性案例来比较这两种方法:
test (@bench.T
b : type @bench.T
@bench.T) {
let Int
n = 15
let String
txt = "a".(self : String, n : Int) -> String
Returns a new string with self
repeated n
times.
repeat(Int
n)
let Ast
chr = enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::(chr : Char) -> Ast
chr('a')
let Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast = Ast
chr.(self : Ast) -> Ast
opt().(self : Ast, n~ : Int) -> Ast
rep(Int
n~).(self : Ast, other : Ast) -> Ast
seq(Ast
chr.(self : Ast, n~ : Int) -> Ast
rep(Int
n~))
let Exp
exp = enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast)
@bench.T
b.(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="derive", () => Exp
exp.(self : Exp, s : String) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:
value
: The value to be ignored. Can be of any type.
Example:
let x = 42
ignore(x) // Explicitly ignore the value
let mut sum = 0
ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
let Prg
tvm = type Prg Array[Ops] derive(Show, ToJson)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast)
@bench.T
b.(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="thompson", () => Prg
tvm.(self : Prg, data : @string.View) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:
value
: The value to be ignored. Can be of any type.
Example:
let x = 42
ignore(x) // Explicitly ignore the value
let mut sum = 0
ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
}
模式 (a?){n}a{n}
是回溯引擎中典型的指数爆炸案例。这个模式有 n 种不同的方式来匹配 n 个 'a' 字符,在朴素的实现中会产生指数级的搜索空间。
name time (mean ± σ) range (min … max)
derive 41.78 µs ± 0.14 µs 41.61 µs … 42.13 µs in 10 × 2359 runs
thompson 12.79 µs ± 0.04 µs 12.74 µs … 12.84 µs in 10 × 7815 runs
从基准测试结果可以看出,在这种情况下虚拟机方法明显快于导数方法。导数方法需要频繁分配中间的正则表达式结构,带来了更高的开销和更慢的性能。相比之下,虚拟机执行的是一组固定的指令,一旦双端队列扩展到完整大小后,就很少需要分配新的结构了。
不过,导数方法在理论分析上更简洁。我们可以很容易地证明算法的终止性,因为需要计算的导数数量受到 AST 大小的限制,并且随着 deriv
函数的每次递归调 用而严格递减。而虚拟机方法则不同,如果输入的 Prg
包含无限循环,程序可能永远不会终止,这就需要仔细处理线程优先级,以避免无限循环和线程数量的指数级增长。