跳到主要内容

正则表达式引擎的两种实现方法:导数与 Thompson 虚拟机

· 阅读需 12 分钟

正则表达式引擎的实现方式多样,不同方法在性能、内存消耗和实现复杂度上各有权衡。本文将介绍两种数学上等价但实际表现迥异的正则匹配方法:Brzozowski 导数方法和 Thompson 虚拟机方法。

这两种方法都基于相同的抽象语法树表示,为直接的性能对比提供了统一的基础。其核心思想在于:这些看似不同的方法实际上是用不同的计算策略来解决同一个问题——一个依靠代数变换,另一个则通过程序执行。

约定与定义

为了建立统一的基础,两种正则表达式引擎都采用相同的抽象语法树(AST)表示,用树形结构来描述正则表达式的基本构造:

enum Ast {
  
(Char) -> Ast
Chr
(
Char
Char
)
(Ast, Ast) -> Ast
Seq
(
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
,
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
)
(Ast, Int?) -> Ast
Rep
(
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
,
Int
Int
?)
(Ast) -> Ast
Opt
(
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
)
} derive(
trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}

Trait for types that can be converted to String

Show
,
trait ToJson {
  to_json(Self) -> Json
}

Trait for types that can be converted to Json

ToJson
,
trait Hash {
  hash_combine(Self, Hasher) -> Unit
  hash(Self) -> Int
}

Trait for types that can be hashed

The hash method should return a hash value for the type, which is used in hash tables and other data structures. The hash_combine method is used to combine the hash of the current value with another hash value, typically used to hash composite types.

When two values are equal according to the Eq trait, they should produce the same hash value.

The hash method does not need to be implemented if hash_combine is implemented, When implemented separately, hash does not need to produce a hash value that is consistent with hash_combine.

Hash
,
trait Eq {
  equal(Self, Self) -> Bool
  op_equal(Self, Self) -> Bool
}

Trait for types whose elements can test for equality

Eq
)

此外,我们还提供了智能构造函数来简化正则表达式的构建:

fn 
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
::
(chr : Char) -> Ast
chr
(
Char
chr
:
Char
Char
) ->
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
{
(Char) -> Ast
Chr
(
Char
chr
)
} fn
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
::
(self : Ast, other : Ast) -> Ast
seq
(
Ast
self
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
,
Ast
other
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
) ->
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
{
(Ast, Ast) -> Ast
Seq
(
Ast
self
,
Ast
other
)
} fn
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
::
(self : Ast, n? : Int) -> Ast
rep
(
Ast
self
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
,
Int?
n
? :
Int
Int
) ->
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
{
(Ast, Int?) -> Ast
Rep
(
Ast
self
,
Int?
n
)
} fn
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
::
(self : Ast) -> Ast
opt
(
Ast
self
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
) ->
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
{
Unit
@fs.
(Ast) -> Ast
Opt
(
Ast
self
)
}

AST 定义了四种基本的正则表达式操作:

  1. Chr(Char) - 匹配单个字符字面量
  2. Seq(Ast, Ast) - 序列匹配,即一个模式紧跟另一个模式
  3. Rep(Ast, Int?) - 重复匹配,None 表示无限次重复,Some(n) 表示恰好重复 n 次
  4. Opt(Ast) - 可选匹配,相当于标准正则语法中的 pattern?

举个例子,正则表达式 (ab*)? 表示一个可选的序列('a' 后跟零个或多个 'b'),可以这样构建:

Ast::chr('a').seq(Ast::chr('b').rep()).opt()

Brzozowski 导数方法

导数方法基于形式语言理论,通过代数变换来处理正则表达式。对于输入的每个字符,该方法计算正则表达式的"导数",实质上是在问:"消费掉这个字符后,还剩下什么需要匹配?"这样就得到了一个新的正则表达式,代表剩余的匹配模式。

为了明确表示导数和可空性,我们对基本的 Ast 类型进行了扩展:

enum Exp {
  
Exp
Nil
Exp
Eps
(Char) -> Exp
Chr
(
Char
Char
)
(Exp, Exp) -> Exp
Alt
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
)
(Exp, Exp) -> Exp
Seq
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
)
(Exp) -> Exp
Rep
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
)
} derive(
trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}

Trait for types that can be converted to String

Show
,
trait Hash {
  hash_combine(Self, Hasher) -> Unit
  hash(Self) -> Int
}

Trait for types that can be hashed

The hash method should return a hash value for the type, which is used in hash tables and other data structures. The hash_combine method is used to combine the hash of the current value with another hash value, typically used to hash composite types.

When two values are equal according to the Eq trait, they should produce the same hash value.

The hash method does not need to be implemented if hash_combine is implemented, When implemented separately, hash does not need to produce a hash value that is consistent with hash_combine.

Hash
,
trait Eq {
  equal(Self, Self) -> Bool
  op_equal(Self, Self) -> Bool
}

Trait for types whose elements can test for equality

Eq
,
trait Compare {
  compare(Self, Self) -> Int
}

Trait for types whose elements are ordered

The return value of [compare] is:

  • zero, if the two arguments are equal
  • negative, if the first argument is smaller
  • positive, if the first argument is greater
Compare
,
trait ToJson {
  to_json(Self) -> Json
}

Trait for types that can be converted to Json

ToJson
)

Exp 中各构造器的含义如下:

  1. Nil - 表示不可能匹配的模式,即空集
  2. Eps - 匹配空字符串
  3. Chr(Char) - 匹配单个字符
  4. Alt(Exp, Exp) - 表示选择(或),在多个模式间进行选择
  5. Seq(Exp, Exp) - 表示连接,将两个模式依次连接
  6. Rep(Exp) - 表示重复,对模式进行零次或多次重复

通过 Exp::of_ast 函数,我们可以将 Ast 转换为表达能力更强的 Exp 格式:

fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
ast
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
) ->
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
{
match
Ast
ast
{
(Char) -> Ast
Chr
(
Char
c
) =>
(Char) -> Exp
Chr
(
Char
c
)
(Ast, Ast) -> Ast
Seq
(
Ast
a
,
Ast
b
) =>
(Exp, Exp) -> Exp
Seq
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
a
),
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
b
))
(Ast, Int?) -> Ast
Rep
(
Ast
a
,
Int?
None
) =>
(Exp) -> Exp
Rep
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
a
))
(Ast, Int?) -> Ast
Rep
(
Ast
a
,
(Int) -> Int?
Some
(
Int
n
)) => {
let
Exp
sec
=
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
a
)
let mut
Exp
exp
=
Exp
sec
for _ in
Int
1
..<
Int
n
{
Exp
exp
=
(Exp, Exp) -> Exp
Seq
(
Exp
exp
,
Exp
sec
)
}
Exp
exp
}
(Ast) -> Ast
Opt
(
Ast
a
) =>
(Exp, Exp) -> Exp
Alt
(
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
a
),
Exp
Eps
)
} }

同样,我们也为 Exp 提供了智能构造函数来简化模式构建:

fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(a : Exp, b : Exp) -> Exp
seq
(
Exp
a
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
Exp
b
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
) ->
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
{
match (
Exp
a
,
Exp
b
) {
(
Exp
Nil
, _) | (_,
Exp
Nil
) =>
Exp
Nil
(
Exp
Eps
,
Exp
b
) =>
Exp
b
(
Exp
a
,
Exp
Eps
) =>
Exp
a
(
Exp
a
,
Exp
b
) =>
(Exp, Exp) -> Exp
Seq
(
Exp
a
,
Exp
b
)
} }

不过,Alt 的智能构造函数特别重要——它保证构造出的 Exp 符合 Brzozowski 原论文中的"相似性"标准化要求。两个正则表达式如果能通过以下规则相互转换,就被认为是相似的:

AAABBAA(BC)(AB)C \begin{align} & A \mid \emptyset &&\rightarrow A \\ & A \mid B &&\rightarrow B \mid A \\ & A \mid (B \mid C) &&\rightarrow (A \mid B) \mid C \end{align}

因此,我们对 Alt 构造进行标准化,确保始终使用一致的结合律和选择顺序:

fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(a : Exp, b : Exp) -> Exp
alt
(
Exp
a
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
Exp
b
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
) ->
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
{
match (
Exp
a
,
Exp
b
) {
(
Exp
Nil
,
Exp
b
) =>
Exp
b
(
Exp
a
,
Exp
Nil
) =>
Exp
a
(
(Exp, Exp) -> Exp
Alt
(
Exp
a
,
Exp
b
),
Exp
c
) =>
Exp
a
.
(a : Exp, b : Exp) -> Exp
alt
(
Exp
b
.
(a : Exp, b : Exp) -> Exp
alt
(
Exp
c
))
(
Exp
a
,
Exp
b
) => {
if
Exp
a
(Exp, Exp) -> Bool

automatically derived

==
Exp
b
{
Exp
a
} else if
Exp
a
(self_ : Exp, other : Exp) -> Bool
>
Exp
b
{
(Exp, Exp) -> Exp
Alt
(
Exp
b
,
Exp
a
)
} else {
(Exp, Exp) -> Exp
Alt
(
Exp
a
,
Exp
b
)
} } } }

nullable 函数用于判断一个模式是否能够在不消费任何输入的情况下成功匹配(即匹配空字符串):

fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(self : Exp) -> Bool
nullable
(
Exp
self
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
) ->
Bool
Bool
{
match
Exp
self
{
Exp
Nil
=> false
Exp
Eps
=> true
(Char) -> Exp
Chr
(_) => false
(Exp, Exp) -> Exp
Alt
(
Exp
l
,
Exp
r
) =>
Exp
l
.
(self : Exp) -> Bool
nullable
()
(Bool, Bool) -> Bool
||
Exp
r
.
(self : Exp) -> Bool
nullable
()
(Exp, Exp) -> Exp
Seq
(
Exp
l
,
Exp
r
) =>
Exp
l
.
(self : Exp) -> Bool
nullable
()
(Bool, Bool) -> Bool
&&
Exp
r
.
(self : Exp) -> Bool
nullable
()
(Exp) -> Exp
Rep
(_) => true
} }

deriv 函数计算模式对于特定字符的导数,按照 Brzozowski 导数理论中定义的规则对模式进行变换。我们对规则进行了重新排列,使其与 deriv 函数的实现顺序保持一致:

Da=Daϵ=Daa=ϵDab= for (ab)Da(PQ)=(DaP)(DaQ)Da(PQ)=(DaPQ)(ν(P)DaQ)Da(P)=DaPP \begin{align} D_{a} \emptyset &= \emptyset \\ D_{a} \epsilon &= \emptyset \\ D_{a} a &= \epsilon \\ D_{a} b &= \emptyset & \text{ for }(a \neq b) \\ D_{a} (P \mid Q) &= (D_{a} P) \mid (D_{a} Q) \\ D_{a} (P \cdot Q) &= (D_{a} P \cdot Q) \mid (\nu(P) \cdot D_{a} Q) \\ D_{a} (P\ast) &= D_{a} P \cdot P\ast \\ \end{align}
fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(self : Exp, c : Char) -> Exp
deriv
(
Exp
self
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
Char
c
:
Char
Char
) ->
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
{
match
Exp
self
{
Exp
Nil
=>
Exp
self
Exp
Eps
=>
Exp
Nil
(Char) -> Exp
Chr
(
Char
d
) if
Char
d
(self : Char, other : Char) -> Bool

Compares two characters for equality.

Parameters:

  • self : The first character to compare.
  • other : The second character to compare.

Returns true if both characters represent the same Unicode code point, false otherwise.

Example:

  let a = 'A'
  let b = 'A'
  let c = 'B'
  inspect(a == b, content="true")
  inspect(a == c, content="false")
==
Char
c
=>
Exp
Eps
(Char) -> Exp
Chr
(_) =>
Exp
Nil
(Exp, Exp) -> Exp
Alt
(
Exp
l
,
Exp
r
) =>
Exp
l
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
).
(a : Exp, b : Exp) -> Exp
alt
(
Exp
r
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
))
(Exp, Exp) -> Exp
Seq
(
Exp
l
,
Exp
r
) => {
let
Exp
dl
=
Exp
l
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
)
if
Exp
l
.
(self : Exp) -> Bool
nullable
() {
Exp
dl
.
(a : Exp, b : Exp) -> Exp
seq
(
Exp
r
).
(a : Exp, b : Exp) -> Exp
alt
(
Exp
r
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
))
} else {
Exp
dl
.
(a : Exp, b : Exp) -> Exp
seq
(
Exp
r
)
} }
(Exp) -> Exp
Rep
(
Exp
e
) =>
Exp
e
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
).
(a : Exp, b : Exp) -> Exp
seq
(
Exp
self
)
} }

为了简化实现,我们这里只进行严格匹配,也就是说模式必须匹配整个输入字符串。因此,只有在处理完所有输入字符后,我们才检查最终模式的可空性:

fn 
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(self : Exp, s : String) -> Bool
matches
(
Exp
self
:
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
,
String
s
:
String
String
) ->
Bool
Bool
{
loop (
Exp
self
,
String
s
.
(self : String, start_offset? : Int, end_offset? : Int) -> @string.View

Creates a View into a String.

Example

  let str = "Hello🤣🤣🤣"
  let view1 = str.view()
  inspect(view1, content=
   "Hello🤣🤣🤣"
  )
  let start_offset = str.offset_of_nth_char(1).unwrap()
  let end_offset = str.offset_of_nth_char(6).unwrap() // the second emoji
  let view2 = str.view(start_offset~, end_offset~)
  inspect(view2, content=
   "ello🤣"
  )
view
()) {
(
Exp
Nil
, _) => {
return false } (
Exp
e
, []) => {
return
Exp
e
.
(self : Exp) -> Bool
nullable
()
} (
Exp
e
,
@string.View
[
Char
c
@string.View
, .. s]
) => {
continue (
Exp
e
.
(self : Exp, c : Char) -> Exp
deriv
(
Char
c
),
@string.View
s
)
} } }

虚拟机方法

虚拟机方法将正则表达式编译成简单虚拟机的字节码指令。这种方法把模式匹配问题转化为程序执行过程,虚拟机同时模拟非确定性有限自动机中所有可能的执行路径。

Ken Thompson 在 1968 年的经典论文中描述了一种将正则模式编译为 IBM 7094 机器代码的引擎。其关键思路是:通过维护多个执行线程来避免指数级回溯,这些线程同步地在输入中前进,每次处理一个字符,同时探索所有可能的匹配路径。

指令集与程序表示

该虚拟机基于四种基本指令运行,它们分别对应 NFA 的不同操作:

enum Ops {
  
Ops
Done
(Char) -> Ops
Char
(
Char
Char
)
(Int) -> Ops
Jump
(
Int
Int
)
(Int) -> Ops
Fork
(
Int
Int
)
} derive(
trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}

Trait for types that can be converted to String

Show
,
trait ToJson {
  to_json(Self) -> Json
}

Trait for types that can be converted to Json

ToJson
)

每条指令在 NFA 模拟中都有其特定作用:Done 标记匹配成功完成,对应 Thompson 原设计中的 matchChar(c) 消费输入字符 c 并跳转到下一条指令;Jump(addr) 无条件跳转至地址 addr,即 Thompson 的 jmpFork(addr) 创建两条执行路径——一条继续执行下一条指令,另一条跳转到 addr,对应 Thompson 的 split

Fork 指令是处理模式非确定性的关键,比如选择和重复操作,这些情况下需要同时探索多条执行路径。这直接对应了 NFA 中的 ε-转换,即执行流可以在不消费输入的情况下发生分支。

我们定义了 Prg 类型,它封装了指令数组并提供便捷的方法来构建和操作字节码程序:

type Prg 
type Array[T]

An Array is a collection of values that supports random access and can grow in size.

Array
[
enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show, ToJson)
Ops
] derive(
trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}

Trait for types that can be converted to String

Show
,
trait ToJson {
  to_json(Self) -> Json
}

Trait for types that can be converted to Json

ToJson
)
fn
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(self : Prg, inst : Ops) -> Unit
push
(
Prg
self
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
,
Ops
inst
:
enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show, ToJson)
Ops
) ->
Unit
Unit
{
Prg
self
.
(self : Prg) -> Array[Ops]

Convert newtype to its underlying type, automatically derived.

inner
().
(self : Array[Ops], value : Ops) -> Unit

Adds an element to the end of the array.

If the array is at capacity, it will be reallocated.

Example

  let v = []
  v.push(3)
push
(
Ops
inst
)
} fn
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(self : Prg) -> Int
length
(
Prg
self
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
) ->
Int
Int
{
Prg
self
.
(self : Prg) -> Array[Ops]

Convert newtype to its underlying type, automatically derived.

inner
().
(self : Array[Ops]) -> Int

Returns the number of elements in the array.

Parameters:

  • array : The array whose length is to be determined.

Returns the number of elements in the array as an integer.

Example:

  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length
()
} fn
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(self : Prg, index : Int, inst : Ops) -> Unit
op_set
(
Prg
self
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
,
Int
index
:
Int
Int
,
Ops
inst
:
enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show, ToJson)
Ops
) ->
Unit
Unit
{
Prg
self
Array[Ops]

Sets the element at the specified index in the array to a new value. The original value at that index is overwritten.

Parameters:

  • array : The array to modify.
  • index : The position in the array where the value will be set.
  • value : The new value to assign at the specified index.

Throws an error if index is negative or greater than or equal to the length of the array.

Example:

  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
.
(self : Prg) -> Array[Ops]

Convert newtype to its underlying type, automatically derived.

inner
Array[Ops]

Sets the element at the specified index in the array to a new value. The original value at that index is overwritten.

Parameters:

  • array : The array to modify.
  • index : The position in the array where the value will be set.
  • value : The new value to assign at the specified index.

Throws an error if index is negative or greater than or equal to the length of the array.

Example:

  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
()
(Array[Ops], Int, Ops) -> Unit

Sets the element at the specified index in the array to a new value. The original value at that index is overwritten.

Parameters:

  • array : The array to modify.
  • index : The position in the array where the value will be set.
  • value : The new value to assign at the specified index.

Throws an error if index is negative or greater than or equal to the length of the array.

Example:

  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
[
index] =
Ops
inst
}

AST 到字节码的编译

Prg::of_ast 函数采用标准的 NFA 构造技术,将 AST 模式转换为虚拟机指令:

  1. Seq(a, b)

    code for a
    code for b
    
  2. Rep(a, None) (无界重复):

        Fork L1, L2
    L1: code for a
        Jump L1
    L2:
    
  3. Rep(a, Some(n)) (固定重复):

    code for a
    code for a
    ... (n times) ...
    
  4. Opt(a) (可选):

        Fork L1, L2
    L1: code for a
    L2:
    

需要注意的是,Fork 构造器只接受一个地址参数,这是因为我们总是希望在 Fork 指令后继续执行下一条指令。

fn 
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(ast : Ast) -> Prg
of_ast
(
Ast
ast
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
) ->
type Prg Array[Ops] derive(Show, ToJson)
Prg
{
fn
(Prg, Ast) -> Unit
compile
(
Prg
prog
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
,
Ast
ast
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
) ->
Unit
Unit
{
match
Ast
ast
{
(Char) -> Ast
Chr
(
Char
chr
) =>
Prg
prog
.
(self : Prg, inst : Ops) -> Unit
push
(
(Char) -> Ops
Char
(
Char
chr
))
(Ast, Ast) -> Ast
Seq
(
Ast
l
,
Ast
r
) => {
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
l
)
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
r
)
}
(Ast, Int?) -> Ast
Rep
(
Ast
e
,
Int?
None
) => {
let
Int
fork
=
Prg
prog
.
(self : Prg) -> Int
length
()
Prg
prog
.
(self : Prg, inst : Ops) -> Unit
push
(
(Int) -> Ops
Fork
(0))
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
e
)
Prg
prog
.
(self : Prg, inst : Ops) -> Unit
push
(
(Int) -> Ops
Jump
(
Int
fork
))
Prg
prog
(Prg, Int, Ops) -> Unit
[
fork] =
(Int) -> Ops
Fork
(
Prg
prog
.
(self : Prg) -> Int
length
())
}
(Ast, Int?) -> Ast
Rep
(
Ast
e
,
(Int) -> Int?
Some
(
Int
n
)) =>
for _ in
Int
0
..<
Int
n
{
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
e
)
}
(Ast) -> Ast
Opt
(
Ast
e
) => {
let
Int
fork_inst
=
Prg
prog
.
(self : Prg) -> Int
length
()
Prg
prog
.
(self : Prg, inst : Ops) -> Unit
push
(
(Int) -> Ops
Fork
(0))
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
e
)
Prg
prog
(Prg, Int, Ops) -> Unit
[
fork_inst] =
(Int) -> Ops
Fork
(
Prg
prog
.
(self : Prg) -> Int
length
())
} } } let
Prg
prog
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
= []
(Prg, Ast) -> Unit
compile
(
Prg
prog
,
Ast
ast
)
Prg
prog
.
(self : Prg, inst : Ops) -> Unit
push
(
Ops
Done
)
Prg
prog
}

虚拟机执行循环

在 Rob Pike 的实现中,虚拟机会在输入字符串结束后再执行一轮来处理最终的接受状态。为了明确这个过程,我们的 matches 函数采用两阶段方法来实现核心的虚拟机执行循环:

阶段一:字符处理。对于每个输入字符,处理当前上下文中所有活跃的线程。如果 Char 指令匹配当前字符,就在下一个上下文中创建新线程。JumpFork 指令会立即在当前上下文中产生新线程。处理完所有线程后,交换上下文并继续处理下一个字符。

阶段二:最终接受判断。处理完所有输入后,检查剩余线程中是否有 Done 指令。同时处理那些不消费输入的 Jump/Fork 指令。如果有任何线程到达 Done 指令,就返回 true

fn 
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(self : Prg, data : @string.View) -> Bool
matches
(
Prg
self
:
type Prg Array[Ops] derive(Show, ToJson)
Prg
,
@string.View
data
:
#builtin.valtype
type @string.View

A @string.View represents a view of a String that maintains proper Unicode character boundaries. It allows safe access to a substring while handling multi-byte characters correctly.

@string.View
) ->
Bool
Bool
{
let
(Array[Ops]) -> Prg
Prg
(
Array[Ops]
prog
) =
Prg
self
let mut
Ctx
curr
=
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(length : Int) -> Ctx
new
(
Array[Ops]
prog
.
(self : Array[Ops]) -> Int

Returns the number of elements in the array.

Parameters:

  • array : The array whose length is to be determined.

Returns the number of elements in the array as an integer.

Example:

  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length
())
let mut
Ctx
next
=
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(length : Int) -> Ctx
new
(
Array[Ops]
prog
.
(self : Array[Ops]) -> Int

Returns the number of elements in the array.

Parameters:

  • array : The array whose length is to be determined.

Returns the number of elements in the array as an integer.

Example:

  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length
())
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(0)
for
Char
c
in
@string.View
data
{
while
Ctx
curr
.
(self : Ctx) -> Int?
pop
() is
(Int) -> Int?
Some
(
Int
pc
) {
match
Array[Ops]
prog
(Array[Ops], Int) -> Ops

Retrieves an element from the array at the specified index.

Parameters:

  • array : The array to get the element from.
  • index : The position in the array from which to retrieve the element.

Returns the element at the specified index.

Throws a panic if the index is negative or greater than or equal to the length of the array.

Example:

  let arr = [1, 2, 3]
  inspect(arr[1], content="2")
[
pc] {
Ops
Done
=> ()
(Char) -> Ops
Char
(
Char
char
) if
Char
char
(self : Char, other : Char) -> Bool

Compares two characters for equality.

Parameters:

  • self : The first character to compare.
  • other : The second character to compare.

Returns true if both characters represent the same Unicode code point, false otherwise.

Example:

  let a = 'A'
  let b = 'A'
  let c = 'B'
  inspect(a == b, content="true")
  inspect(a == c, content="false")
==
Char
c
=> {
Ctx
next
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
pc
(self : Int, other : Int) -> Int

Adds two 32-bit signed integers. Performs two's complement arithmetic, which means the operation will wrap around if the result exceeds the range of a 32-bit integer.

Parameters:

  • self : The first integer operand.
  • other : The second integer operand.

Returns a new integer that is the sum of the two operands. If the mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to 2,147,483,647), the result wraps around according to two's complement rules.

Example:

  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+
1)
}
(Int) -> Ops
Jump
(
Int
jump
) =>
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
jump
)
(Int) -> Ops
Fork
(
Int
fork
) => {
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
fork
)
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
pc
(self : Int, other : Int) -> Int

Adds two 32-bit signed integers. Performs two's complement arithmetic, which means the operation will wrap around if the result exceeds the range of a 32-bit integer.

Parameters:

  • self : The first integer operand.
  • other : The second integer operand.

Returns a new integer that is the sum of the two operands. If the mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to 2,147,483,647), the result wraps around according to two's complement rules.

Example:

  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+
1)
} _ => () } } let
Ctx
temp
=
Ctx
curr
Ctx
curr
=
Ctx
next
Ctx
next
=
Ctx
temp
Ctx
next
.
(self : Ctx) -> Unit
reset
()
} while
Ctx
curr
.
(self : Ctx) -> Int?
pop
() is
(Int) -> Int?
Some
(
Int
pc
) {
match
Array[Ops]
prog
(Array[Ops], Int) -> Ops

Retrieves an element from the array at the specified index.

Parameters:

  • array : The array to get the element from.
  • index : The position in the array from which to retrieve the element.

Returns the element at the specified index.

Throws a panic if the index is negative or greater than or equal to the length of the array.

Example:

  let arr = [1, 2, 3]
  inspect(arr[1], content="2")
[
pc] {
Ops
Done
=> return true
(Int) -> Ops
Jump
(
Int
x
) =>
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
x
)
(Int) -> Ops
Fork
(
Int
x
) => {
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
x
)
Ctx
curr
.
(self : Ctx, pc : Int) -> Unit
add
(
Int
pc
(self : Int, other : Int) -> Int

Adds two 32-bit signed integers. Performs two's complement arithmetic, which means the operation will wrap around if the result exceeds the range of a 32-bit integer.

Parameters:

  • self : The first integer operand.
  • other : The second integer operand.

Returns a new integer that is the sum of the two operands. If the mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to 2,147,483,647), the result wraps around according to two's complement rules.

Example:

  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+
1)
} _ => () } } false }

在 Rob Pike 的原始博客中,他使用递归函数来处理 ForkJump 指令,以保证线程按优先级执行。而我们这里采用了类似栈的结构来管理所有执行线程,这样可以自然地维护线程优先级:

struct Ctx {
  
@deque.Deque[Int]
deque
:
type @deque.Deque[A]
@deque.T
[
Int
Int
]
FixedArray[Bool]
visit
:
type FixedArray[A]
FixedArray
[
Bool
Bool
]
} fn
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(length : Int) -> Ctx
new
(
Int
length
:
Int
Int
) ->
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
{
{
@deque.Deque[Int]
deque
:
(capacity? : Int) -> @deque.Deque[Int]

Creates a new empty deque with an optional initial capacity.

Parameters:

  • capacity : The initial capacity of the deque. If not specified, defaults to 0 and will be automatically adjusted as elements are added.

Returns a new empty deque of type T[A] where A is the type of elements the deque will hold.

Example

  let dq : @deque.Deque[Int] = @deque.new()
  inspect(dq.length(), content="0")
  inspect(dq.capacity(), content="0")

  let dq : @deque.Deque[Int] = @deque.new(capacity=10)
  inspect(dq.length(), content="0")
  inspect(dq.capacity(), content="10")
@deque.new
(),
FixedArray[Bool]
visit
:
type FixedArray[A]
FixedArray
::
(len : Int, init : Bool) -> FixedArray[Bool]

Creates a new fixed-size array with the specified length, initializing all elements with the given value.

Parameters:

  • length : The length of the array to create. Must be non-negative.
  • initial_value : The value used to initialize all elements in the array.

Returns a new fixed-size array of type FixedArray[T] with length elements, where each element is initialized to initial_value.

Throws a panic if length is negative.

Example:

  let arr = FixedArray::make(3, 42)
  inspect(arr[0], content="42")
  inspect(arr.length(), content="3")

WARNING: A common pitfall is creating with the same initial value, for example:

  let two_dimension_array = FixedArray::make(10, FixedArray::make(10, 0))
  two_dimension_array[0][5] = 10
  assert_eq(two_dimension_array[5][5], 10)

This is because all the cells reference to the same object (the FixedArray[Int] in this case). One should use makei() instead which creates an object for each index.

make
(
Int
length
, false) }
} fn
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(self : Ctx, pc : Int) -> Unit
add
(
Ctx
self
:
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
,
Int
pc
:
Int
Int
) ->
Unit
Unit
{
if
Bool
!
Ctx
self
Bool
.
FixedArray[Bool]
visit
(FixedArray[Bool], Int) -> Bool

Retrieves an element at the specified index from a fixed-size array. This function implements the array indexing operator [].

Parameters:

  • array : The fixed-size array to access.
  • index : The position in the array from which to retrieve the element.

Returns the element at the specified index.

Throws a runtime error if the index is out of bounds (negative or greater than or equal to the length of the array).

Example:

  let arr = FixedArray::make(3, 42)
  inspect(arr[1], content="42")
[
Bool
pc]
{
Ctx
self
.
@deque.Deque[Int]
deque
.
(self : @deque.Deque[Int], value : Int) -> Unit

Adds an element to the back of the deque.

If the deque is at capacity, it will be reallocated.

Example

  let dv = @deque.of([1, 2, 3, 4, 5])
  dv.push_back(6)
  assert_eq(dv.back(), Some(6))
push_back
(
Int
pc
)
Ctx
self
.
FixedArray[Bool]
visit
(FixedArray[Bool], Int, Bool) -> Unit

Sets a value at the specified index in a fixed-size array. The original value at that index is overwritten.

Parameters:

  • array : The fixed-size array to modify.
  • index : The position in the array where the value will be set.
  • value : The new value to assign at the specified index.

Throws a runtime error if the index is out of bounds (less than 0 or greater than or equal to the length of the array).

Example:

  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
[
pc] = true
} } fn
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(self : Ctx) -> Int?
pop
(
Ctx
self
:
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
) ->
Int
Int
? {
match
Ctx
self
.
@deque.Deque[Int]
deque
.
(self : @deque.Deque[Int]) -> Int?

Removes a back element from a deque and returns it, or None if it is empty.

Example

  let dv = @deque.of([1, 2, 3, 4, 5])
  assert_eq(dv.pop_back(), Some(5))
pop_back
() {
(Int) -> Int?
Some
(
Int
pc
) => {
Ctx
self
.
FixedArray[Bool]
visit
(FixedArray[Bool], Int, Bool) -> Unit

Sets a value at the specified index in a fixed-size array. The original value at that index is overwritten.

Parameters:

  • array : The fixed-size array to modify.
  • index : The position in the array where the value will be set.
  • value : The new value to assign at the specified index.

Throws a runtime error if the index is out of bounds (less than 0 or greater than or equal to the length of the array).

Example:

  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
[
pc] = false
(Int) -> Int?
Some
(
Int
pc
)
}
Int?
None
=>
Int?
None
} } fn
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
::
(self : Ctx) -> Unit
reset
(
Ctx
self
:
struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx
) ->
Unit
Unit
{
Ctx
self
.
@deque.Deque[Int]
deque
.
(self : @deque.Deque[Int]) -> Unit

Clears the deque, removing all values.

This method has no effect on the allocated capacity of the deque, only setting the length to 0.

Example

  let dv = @deque.of([1, 2, 3, 4, 5])
  dv.clear()
  inspect(dv.length(), content="0")
clear
()
Ctx
self
.
FixedArray[Bool]
visit
.
(self : FixedArray[Bool], value : Bool, start? : Int, end? : Int) -> Unit

Fill the array with a given value.

This method fills all or part of a FixedArray with the given value.

Parameters

  • value: The value to fill the array with
  • start: The starting index (inclusive, default: 0)
  • end: The ending index (exclusive, optional)

If end is not provided, fills from start to the end of the array. If start equals end, no elements are modified.

Panics

  • Panics if start is negative or greater than or equal to the array length
  • Panics if end is provided and is less than start or greater than array length
  • Does nothing if the array is empty

Example

// Fill entire array
let fa : FixedArray[Int] = [0, 0, 0, 0, 0]
fa.fill(3)
inspect(fa, content="[3, 3, 3, 3, 3]")

// Fill from index 1 to 3 (exclusive)
let fa2 : FixedArray[Int] = [0, 0, 0, 0, 0]
fa2.fill(9, start=1, end=3)
inspect(fa2, content="[0, 9, 9, 0, 0]")

// Fill from index 2 to end
let fa3 : FixedArray[String] = ["a", "b", "c", "d"]
fa3.fill("x", start=2)
inspect(fa3, content=(
  #|["a", "b", "x", "x"]
))
fill
(false)
}

visit 数组用于过滤掉低优先级的重复线程。添加新线程时,我们先通过 visit 数组检查该线程是否已存在于 deque 中。如果已存在就直接丢弃;否则加入 deque 并标记为已访问。这个机制对于处理像 (a?)* 这样可能无限扩展的模式很重要,能够有效避免无限循环或指数级的线程爆炸。

基准测试与性能分析

我们通过一个对很多正则表达式实现都构成挑战的病理性案例来比较这两种方法:

test (
@bench.T
b
:
type @bench.T
@bench.T
) {
let
Int
n
= 15
let
String
txt
= "a".
(self : String, n : Int) -> String

Returns a new string with self repeated n times.

repeat
(
Int
n
)
let
Ast
chr
=
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
::
(chr : Char) -> Ast
chr
('a')
let
Ast
ast
:
enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast
=
Ast
chr
.
(self : Ast) -> Ast
opt
().
(self : Ast, n~ : Int) -> Ast
rep
(
Int
n
~).
(self : Ast, other : Ast) -> Ast
seq
(
Ast
chr
.
(self : Ast, n~ : Int) -> Ast
rep
(
Int
n
~))
let
Exp
exp
=
enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare, ToJson)
Exp
::
(ast : Ast) -> Exp
of_ast
(
Ast
ast
)
@bench.T
b
.
(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit

Run a benchmark in batch mode

bench
(
String
name
="derive", () =>
Exp
exp
.
(self : Exp, s : String) -> Bool
matches
(
String
txt
) |>
(t : Bool) -> Unit

Evaluates an expression and discards its result. This is useful when you want to execute an expression for its side effects but don't care about its return value, or when you want to explicitly indicate that a value is intentionally unused.

Parameters:

  • value : The value to be ignored. Can be of any type.

Example:

  let x = 42
  ignore(x) // Explicitly ignore the value
  let mut sum = 0
  ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore
())
let
Prg
tvm
=
type Prg Array[Ops] derive(Show, ToJson)
Prg
::
(ast : Ast) -> Prg
of_ast
(
Ast
ast
)
@bench.T
b
.
(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit

Run a benchmark in batch mode

bench
(
String
name
="thompson", () =>
Prg
tvm
.
(self : Prg, data : @string.View) -> Bool
matches
(
String
txt
) |>
(t : Bool) -> Unit

Evaluates an expression and discards its result. This is useful when you want to execute an expression for its side effects but don't care about its return value, or when you want to explicitly indicate that a value is intentionally unused.

Parameters:

  • value : The value to be ignored. Can be of any type.

Example:

  let x = 42
  ignore(x) // Explicitly ignore the value
  let mut sum = 0
  ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore
())
}

模式 (a?){n}a{n} 是回溯引擎中典型的指数爆炸案例。这个模式有 n 种不同的方式来匹配 n 个 'a' 字符,在朴素的实现中会产生指数级的搜索空间。

name     time (mean ± σ)         range (min … max)
derive     41.78 µs ±   0.14 µs    41.61 µs …  42.13 µs  in 10 ×   2359 runs
thompson   12.79 µs ±   0.04 µs    12.74 µs …  12.84 µs  in 10 ×   7815 runs

从基准测试结果可以看出,在这种情况下虚拟机方法明显快于导数方法。导数方法需要频繁分配中间的正则表达式结构,带来了更高的开销和更慢的性能。相比之下,虚拟机执行的是一组固定的指令,一旦双端队列扩展到完整大小后,就很少需要分配新的结构了。

不过,导数方法在理论分析上更简洁。我们可以很容易地证明算法的终止性,因为需要计算的导数数量受到 AST 大小的限制,并且随着 deriv 函数的每次递归调用而严格递减。而虚拟机方法则不同,如果输入的 Prg 包含无限循环,程序可能永远不会终止,这就需要仔细处理线程优先级,以避免无限循环和线程数量的指数级增长。