In python, speeding up a data stream word counting approximation algorithm

Question

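This is for a homework assignment I'm working on. I have a working version of the code, but it currently takes about 1 hour to run on the files we've been given. I'll share an example of the files along with my code (and a high-level description), and would appreciate thoughts on why my code is running so slowly. The first file below is the words file; I am estimating the number of times each word (represented as a number) appears: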

the_words.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
13
16
17
6
18
19
20
21
22
23
24
25
6
26
27
28
29
30
9
31
32
33
34
15
35
36
37
9
38
39
11
40
13
41
42

The second file holds the parameters for the 5 hash functions used in my script:

the_hashes.txt
3   1561
17  277
38  394
61  13
78  246

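Here is a version of my code. At a high level, I (1) do my imports and set variables, (2) create a hash function, and (3) loop over the words in the_words.txt (which are ints, confusing I know), hash each word with the 5 hash functions, and increment by 1 the value at the appropriate index in the C matrix. My code: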

# imports
import numpy as np
import matplotlib.pyplot as plt
import math


# variables used throughout the program
dlt = math.exp(-5)
eps = math.exp(1) * math.pow(10, -4)
my_p = 123457

the_hashes = map(str.split, open('the_hashes.txt', 'r'))
the_hashes = [[int(float(j)) for j in i] for i in the_hashes]
end = len(the_hashes)

rows = math.ceil(math.log(1/dlt))
cols = math.ceil(math.exp(1)/eps)
C = np.zeros((rows,cols))


# Returns hash(x) for hash function 
# given by parameters a, b, p and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a*y + b) % p
    output = hash_val % n_buckets
    return(output)


# read the file line by line, implementing the algorithm
counter = 0
with open("the_words.txt", "r") as file:
    for word in file:
        counter = counter + 1
        my_x = int(word)

        # loop over the 5 different pairs of (a,b) values for the hashes
        for i in range(0,end):
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]

            my_output = hash_fun(my_a, my_b, my_p, cols, my_x)
            C[i,my_output] += 1

        if(counter % 10000 == 0):
            print counter

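However, for a file with 200M words, this currently takes far too long. Is there anything obvious that is making my code slow? I know streaming over 200M words can take a while, but I'd like to cut down on the time it currently takes.

Thanks!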

Answer 1

If you can't load the data into memory, there are some parts you can inline and factor out:

my_range = range(0, end)  # python 2 only, see note below
with open("the_words.txt", "r") as file:
    for word in file:
        counter = counter + 1
        y = int(word) % my_p  # factor this out: computed once per word instead of once per hash
        # loop over the 5 different pairs of (a,b) values for the hashes
        for i in my_range:
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]

            # save a function call by inlining
            # my_output = hash_fun(my_a, my_b, my_p, cols, my_x)

            hash_val = (my_a*y + my_b) % my_p
            my_output = hash_val % cols
            C[i,my_output] += 1

        if(counter % 10000 == 0):
            print counter

I would also look at the math in hash_val = ... to see if you can factor out some of the calculations.
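One possible simplification, offered as a sketch rather than a measured speedup: in modular arithmetic (a*(x % p) + b) % p equals (a*x + b) % p, and Python integers do not overflow, so the separate x % p step can be dropped. The names hash_original and hash_simplified are made up for this illustration, and n_buckets = 10000 is an assumed value, not one taken from the question's run. The snippet is written for Python 3.

```python
# Hypothetical helper names; the parameters follow the question's hash_fun.
def hash_original(a, b, p, n_buckets, x):
    y = x % p
    return ((a * y + b) % p) % n_buckets


def hash_simplified(a, b, p, n_buckets, x):
    # (a*(x % p) + b) % p == (a*x + b) % p, so the extra reduction is redundant
    return ((a * x + b) % p) % n_buckets


# Spot-check that both versions agree, using the (a, b) pairs from the_hashes.txt.
p, n_buckets = 123457, 10000  # n_buckets is an assumed, illustrative value
pairs = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]
same = all(
    hash_original(a, b, p, n_buckets, x) == hash_simplified(a, b, p, n_buckets, x)
    for a, b in pairs
    for x in range(0, 500000, 7)
)
print(same)
```

Whether this saves noticeable time depends on where the bottleneck really is, so it is worth profiling before and after.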

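For range(0, end), depending on which version of Python you're using, you may want to cache the call, since in Python 2 range builds a new list each time it is called. (I suspect Python 2 from your print statement.)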

Also, I'd suggest reading Python performance characteristics for some interesting ways to improve performance, or at least to better understand what you're doing.

The above are just guesses. Check out How can you profile a script? for ways to profile your code and know for sure where the bottleneck is.
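As a rough illustration of profiling, here is a minimal sketch using the standard library's cProfile and pstats modules. The process function is a hypothetical stand-in for the question's main loop: it uses plain Python lists instead of a numpy array and a small synthetic input instead of the_words.txt, and n_buckets = 10000 is an assumed value. It is written for Python 3.

```python
import cProfile
import io
import pstats


def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a * y + b) % p
    return hash_val % n_buckets


def process(words, hashes, p, n_buckets):
    # Hypothetical stand-in for the question's loop over the words file
    counts = [[0] * n_buckets for _ in hashes]
    for x in words:
        for i, (a, b) in enumerate(hashes):
            counts[i][hash_fun(a, b, p, n_buckets, x)] += 1
    return counts


hashes = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]

profiler = cProfile.Profile()
profiler.enable()
counts = process(range(20000), hashes, 123457, 10000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists call counts and time per function, which should show whether the hash function calls, the array updates, or the file reading dominate.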

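My other guess is that, since you're using numpy, you could lean on its matrix calculation functions, which I think would be better optimized. (a*y + b) % p looks like nice vector math to me :)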
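To make the numpy idea concrete, here is a minimal sketch that hashes a whole chunk of words at once with broadcasting and updates the counts with np.bincount. The function name update_counts, the chunking approach, and n_buckets = 10000 are assumptions for illustration, not something from the question. It is written for Python 3.

```python
import numpy as np


def update_counts(C, a, b, p, xs):
    """Add one chunk of integer words `xs` to the count matrix C in place."""
    n_buckets = C.shape[1]
    xs = np.asarray(xs, dtype=np.int64)
    y = xs % p  # shape (n_words,)
    # Broadcast to shape (n_hashes, n_words): one row per hash function
    buckets = ((a[:, None] * y[None, :] + b[:, None]) % p) % n_buckets
    for i in range(C.shape[0]):
        # bincount counts repeated bucket indices correctly within a chunk
        C[i] += np.bincount(buckets[i], minlength=n_buckets)


# (a, b) pairs from the_hashes.txt
hashes = np.array(
    [[3, 1561], [17, 277], [38, 394], [61, 13], [78, 246]], dtype=np.int64
)
a, b = hashes[:, 0], hashes[:, 1]
p, n_buckets = 123457, 10000  # n_buckets is an assumed, illustrative value
C = np.zeros((len(hashes), n_buckets), dtype=np.int64)

chunk = [1, 2, 3, 13, 6, 13, 6, 9]  # a few words from the sample file
update_counts(C, a, b, p, chunk)
print(C[0, 1600])  # bucket of word 13 under the first hash
```

For the full file, the same function could be called on successive chunks of lines (say, a few hundred thousand words at a time) so that memory use stays bounded while the Python-level loop runs far fewer iterations.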
