Generating strings and identifying substrings is very slow

Question

I want to benchmark certain operations in Rust, but I seem to be running into some trouble:

fn main(){

    let needle   = (0..100).map(|_| "b").collect::<String>();
    let haystack = (0..100_000).map(|_| "a").collect::<String>();

    println!("Data ready.");

    for _ in 0..1_000_000 {
        if haystack.contains( &needle ) {
            // Stuff...
        }
    }

}

The code above takes a very long time to complete, whereas the same operation in Ruby finishes in about 4.5 seconds:

needle   = 'b' * 100
haystack = 'a' * 100_000

puts 'Data ready.'

1_000_000.times do
    haystack.include? needle
end

I can't help but think that I'm doing something fundamentally wrong. What is the proper way to do this in Rust?

rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]

Answer 1

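A fix for this issue was merged today. That means it should be part of the next nightly, and it is expected to be released in Rust 1.3. The fix revived the Two-way substring search implementation that Rust used to have and adapted it to the new Pattern API in the standard library.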

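The Two-way algorithm is a good fit for Rust's libcore because it is a linear-time substring search algorithm that uses O(1) space and needs no dynamic allocation.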

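This particular implementation contains a simple addition that rejects the specific query in the question extremely quickly (and no, it was not written because of this question; it was part of the old code too).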

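During setup, the searcher computes a kind of fingerprint for the needle: for each byte in the needle, it takes the low 6 bits, which give a number in the range 0-63, and sets the corresponding bit in the u64 variable byteset.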

let byteset = needle.iter().fold(0, |a, &b| (1 << ((b & 0x3f) as usize)) | a);

Since the needle contains only 'b's, the value of byteset will have only bit 34 set (98 & 63 == 34).
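To make the arithmetic concrete, here is a minimal standalone sketch of the fingerprint computation. The function name byteset_of is hypothetical and not part of libcore; the fold mirrors the line quoted above, with the accumulator type spelled out as u64 so that it compiles on its own. It uses String::repeat, which is only available in Rust versions newer than the one in the question.

```rust
/// Hypothetical helper (not the libcore API): builds a 64-bit fingerprint
/// in which bit (b & 0x3f) is set for every byte b of the needle.
fn byteset_of(needle: &[u8]) -> u64 {
    needle
        .iter()
        .fold(0u64, |acc, &b| acc | (1u64 << ((b & 0x3f) as u32)))
}

fn main() {
    let needle = "b".repeat(100);
    let byteset = byteset_of(needle.as_bytes());

    // 'b' is 98, and 98 & 63 == 34, so only bit 34 is set.
    assert_eq!(byteset, 1u64 << 34);

    // 'a' is 97, and 97 & 63 == 33; that bit is clear, so 'a' can be rejected.
    assert_eq!((byteset >> ((b'a' & 0x3f) as u32)) & 1, 0);

    println!("lowest set bit: {}", byteset.trailing_zeros()); // prints 34
}
```

Because only 6 bits of each byte are kept, different bytes can share a bit (for example, bytes 34 and 98 both map to bit 34). A set bit therefore only means "possibly in the needle", while a clear bit means "definitely not in the needle".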

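Now we can test whether any given byte can possibly be part of the needle. If its corresponding bit is not set in byteset, the needle cannot match at a position that includes that byte. In this case, every byte we test in the haystack is 'a' (97 & 63 == 33), which cannot match. So the algorithm reads a single byte, rejects it, and then skips ahead by the length of the needle.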

fn byteset_contains(&self, byte: u8) -> bool {
    (self.byteset >> ((byte & 0x3f) as usize)) & 1 != 0
}

// Quickly skip by large portions unrelated to our substring
if !self.byteset_contains(haystack[self.position + needle.len() - 1]) {
    self.position += needle.len();
    continue 'search;
}

From libcore/str/pattern.rs in rust-lang/rust
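The skip can be tried out in isolation with a rough sketch like the one below. It is not the libcore code: the Two-way matching step is replaced by a plain slice comparison to keep the example short, and the names byteset_of and find_with_byteset are hypothetical. Only the byteset rejection follows the excerpt above.

```rust
/// Hypothetical helper: bit (b & 0x3f) is set for every byte b of the needle.
fn byteset_of(needle: &[u8]) -> u64 {
    needle
        .iter()
        .fold(0u64, |acc, &b| acc | (1u64 << ((b & 0x3f) as u32)))
}

/// Naive substring search with the byteset prefilter (a sketch, not Two-way).
/// Returns the byte offset of the first match, if any.
fn find_with_byteset(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    let byteset = byteset_of(needle);
    let mut position = 0;

    while position + needle.len() <= haystack.len() {
        // Look at the last byte of the current window.
        let last = haystack[position + needle.len() - 1];
        if (byteset >> ((last & 0x3f) as u32)) & 1 == 0 {
            // That byte is not in the needle, so no match can overlap it:
            // every window starting in position..position + needle.len()
            // would have to contain it. Skip past all of them at once.
            position += needle.len();
            continue;
        }
        // Possible match: fall back to a direct comparison.
        if &haystack[position..position + needle.len()] == needle {
            return Some(position);
        }
        position += 1;
    }
    None
}

fn main() {
    let needle = "b".repeat(100);
    let haystack = "a".repeat(100_000);

    // Each window is rejected after reading a single byte.
    assert_eq!(find_with_byteset(haystack.as_bytes(), needle.as_bytes()), None);
    assert_eq!(find_with_byteset(b"xxabcxx", b"abc"), Some(2));
}
```

For the query in the question, this loop inspects roughly haystack.len() / needle.len() bytes, about 1,000 reads for a 100,000-byte haystack, rather than comparing the needle at every position.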

