正则表达式引擎的两种实现方法:导数与 Thompson 虚拟机

正则表达式引擎的实现方式多样,不同方法在性能、内存消耗和实现复杂度上各有权衡。本文将介绍两种数学上等价但实际表现迥异的正则匹配方法:Brzozowski 导数方法和 Thompson 虚拟机方法。
这两种方法都基于相同的抽象语法树表示,为直接的性能对比提供了统一的基础。其核心思想在于:这些看似不同的方法实际上是用不同的计算策略来解决同一个问题——一个依靠代数变换,另一个则通过程序执行。
约定与定义
为了建立统一的基础,两种正则表达式引擎都采用相同的抽象语法树(AST)表示,用树形结构来描述正则表达式的基本构造:
enum Ast {
(Char) -> Ast
Chr(Char
Char)
(Ast, Ast) -> Ast
Seq(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast)
(Ast, Int?) -> Ast
Rep(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Int
Int?)
(Ast) -> Ast
Opt(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait ToJson {
to_json(Self) -> Json
}
Trait for types that can be converted to Json
ToJson, trait Hash {
hash_combine(Self, Hasher) -> Unit
hash(Self) -> Int
}
Trait for types that can be hashed
The hash method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq trait, they should produce the same hash value.
The hash method does not need to be implemented if hash_combine is implemented,
When implemented separately, hash does not need to produce a hash value that is consistent with hash_combine.
Hash, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq)
此外,我们还提供了智能构造函数来简化正则表达式的构建:
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::fn Ast::chr(chr : Char) -> Ast
chr(Char
chr : Char
Char) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Char) -> Ast
Chr(Char
chr)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::fn Ast::seq(self : Ast, other : Ast) -> Ast
seq(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Ast
other : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Ast, Ast) -> Ast
Seq(Ast
self, Ast
other)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::fn Ast::rep(self : Ast, n? : Int) -> Ast
rep(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast, Int?
n? : Int
Int) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
(Ast, Int?) -> Ast
Rep(Ast
self, Int?
n)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast::fn Ast::opt(self : Ast) -> Ast
opt(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, ToJson, Hash, Eq)
Ast {
Unit
@fs.
(Ast) -> Ast
Opt(Ast
self)
}
AST 定义了四种基本的正则表达式操作:
Chr(Char)- 匹配单个字符字面量Seq(Ast, Ast)- 序列匹配,即一个模式紧跟另一个模式Rep(Ast, Int?)- 重复匹配,None表示无限次重复,Some(n)表示恰好重复 n 次Opt(Ast)- 可选匹配,相当于标准正则语法中的pattern?
举个例子,正则表达式 (ab*)? 表示一个可选的序列('a' 后跟零个或多个 'b'