How to quickly identify a contiguous range of 1s (index) in huge binary data?

Question

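Can anyone suggest a faster algorithm for identifying a contiguous range of 1s in large binary data?

Is traversing the data the only solution? Traversal is O(n) in the worst case, which I really want to avoid.

Can anyone recommend a faster algorithm?

As shown in the figure below, I need to find index 4000, which is where the contiguous range of 1s starts.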

index 0
|
00000000000000000000000000000000000000000011111100000

Answer 1

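I can't think of anything that isn't O(n), because the data is always unsorted.

However, I can think of a shortcut, since you want a group of at least 3 and the data is binary.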

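#include <iostream>

int main()
{
    // Assumes a 32-bit unsigned int.
    // 3-bit mask, starting at the three most significant bits.
    unsigned int seed = 0xE0000000u;    // 11100000000000000000000000000000
    unsigned int testvar = 419307644u;  // 00011000111111100010000001111100
    unsigned int result = 0;            // collects the bits of the first run found
    bool continuous = false;            // true while the mask is still inside a run

    // Slide the mask one bit to the right per step. Stop once the mask has moved
    // past the lowest position (seed < 7) or the first run has ended.
    while (seed >= 7 && (continuous || result == 0)) {
        if (seed == (testvar & seed)) {
            result |= seed;
            continuous = true;
        } else {
            continuous = false;
        }
        seed >>= 1;
    }
    // result = 16646144 or 00000000111111100000000000000000
    std::cout << result << std::endl;
    // the index, 8388608 or 00000000100000000000000000000000
    std::cout << ((result ^ (result >> 1)) & ~(result >> 1)) << std::endl;
    return 0;
}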

How it works: it is a binary filter. It creates a 3-bit mask and shifts it right by 1 at every step of the loop.

So you have these numbers as filters:

3758096384 - 11100000000000000000000000000000
1879048192 - 01110000000000000000000000000000
939524096  - 00111000000000000000000000000000
...
14         - 00000000000000000000000000001110
7          - 00000000000000000000000000000111

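It then checks whether the seed equals the bitwise AND of the test number and the seed (this rejects every position where the test number does not have all three mask bits set).

If the seed and the AND result match, it ORs the seed into the result and sets the continuous flag, which tracks whether the run is still going. The first time a match fails after a run has started, the loop ends.

Finally, you have the result, and the index can be computed as follows: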

1110
0111 SHIFT RIGHT by 1 and XOR
1001
0111 NOT (SHIFT RIGHT by 1) and AND
------------
1000
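The index step can also be written as a small helper. This is only a sketch: the names `leading_bit_of_runs` and `position_from_left` are made up for illustration, and a 32-bit word with index 0 at the most significant bit is assumed. Since `(r ^ s) & ~s` equals `r & ~s` for any `s`, the XOR can be dropped without changing the result.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the leftmost (most significant) bit of every run of 1s.
// A bit survives when it is set and the bit to its left is clear.
// Equivalent to (runs ^ (runs >> 1)) & ~(runs >> 1).
std::uint32_t leading_bit_of_runs(std::uint32_t runs)
{
    return runs & ~(runs >> 1);
}

// Position of a single set bit, counted from the left (MSB = position 0).
int position_from_left(std::uint32_t one_bit)
{
    // Precondition: exactly one bit is set.
    assert(one_bit != 0 && (one_bit & (one_bit - 1)) == 0);
    int pos = 0;
    while ((one_bit & 0x80000000u) == 0) {
        one_bit <<= 1;
        ++pos;
    }
    return pos;
}
```

For the result in the program above, `leading_bit_of_runs(16646144u)` gives 8388608, and `position_from_left(8388608u)` gives 8, i.e. the run starts at the ninth bit from the left.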

You will need to scan the 50GB of data in 32-bit chunks (this adapts easily to 64 bits, and can even be vectorized).
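One thing a chunked scan has to watch for is a run that straddles a chunk boundary. The following is a minimal sketch of a 64-bit variant, not the answer's own code: the function name `find_first_run` is hypothetical, the data is assumed to be already loaded into a `std::vector<std::uint64_t>`, and bit index 0 is taken to be the most significant bit of the first word. It is still O(n), but all-zero words are skipped with a single comparison, and the running count carries over from one word to the next.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the bit index where the first run of at least min_len consecutive
// 1s starts, or -1 if there is no such run.
long long find_first_run(const std::vector<std::uint64_t>& words, std::size_t min_len)
{
    assert(min_len >= 1);
    std::size_t run = 0;  // length of the run of 1s ending at the previous bit
    for (std::size_t w = 0; w < words.size(); ++w) {
        const std::uint64_t word = words[w];
        if (word == 0) {  // 64 zero bits rejected with one comparison
            run = 0;
            continue;
        }
        for (int b = 63; b >= 0; --b) {  // walk the word from left to right
            if ((word >> b) & 1u) {
                ++run;
                if (run >= min_len) {
                    const std::size_t index = w * 64 + static_cast<std::size_t>(63 - b);
                    return static_cast<long long>(index + 1 - run);
                }
            } else {
                run = 0;
            }
        }
    }
    return -1;
}
```

Because `run` is not reset between words, a run whose first two bits end one word and whose third bit begins the next is still reported at the correct starting index.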

Answer 2

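Well, you can't avoid going through the whole data at least once (you have to look at everything at least once!), but you can avoid going through it multiple times if you, for example, run-length encode the data.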
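As a rough illustration of the run-length idea, the sketch below makes one O(n) pass and records every run of 1s as a (start, length) pair, so that later queries only touch the list of runs. The `OneRun` struct and both function names are invented for this example, and the input is assumed to be a string of '0' and '1' characters purely to keep it short; real data would be packed bits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One run of consecutive 1s: where it starts and how long it is.
struct OneRun {
    std::size_t start;
    std::size_t length;
};

// Single pass over the data that records every run of 1s.
std::vector<OneRun> encode_runs_of_ones(const std::string& bits)
{
    std::vector<OneRun> runs;
    std::size_t i = 0;
    while (i < bits.size()) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] != '1') {
            ++i;
            continue;
        }
        const std::size_t start = i;
        while (i < bits.size() && bits[i] == '1') {
            ++i;
        }
        runs.push_back({start, i - start});
    }
    return runs;
}

// Position in `runs` of the first run with at least min_len ones,
// or runs.size() if there is none. Cost depends on the number of runs only.
std::size_t first_run_at_least(const std::vector<OneRun>& runs, std::size_t min_len)
{
    for (std::size_t i = 0; i < runs.size(); ++i) {
        if (runs[i].length >= min_len) {
            return i;
        }
    }
    return runs.size();
}
```

For sparse data like the example in the question, the list of runs is far smaller than the input, so repeated lookups become cheap once the encoding pass is done.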

c++

algorithm

search