Best algorithm to compare massive data
I have a large CSV dataset (334 MB) that looks like this:
month, output
1,"['23482394','4358309','098903284'....(total 2.5 million entries)]"
2,"['92438545','23482394','323103404'....(total 2.2 million entries)]"
3,"[...continue
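Now I need to compute what percentage of each month's output overlaps with the previous month's output.
For example, when I compare month 1 and month 2, I want a result like "Month 2 output has 90% overlap against month 1", and then "Month 3 has 88% overlap against month 2".
What is the best way to solve this in Python 3?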
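You can use set intersection to extract the elements common to two arrays or lists. Converting a list to a set takes time linear in its length, and intersecting two sets takes O(min(len(a), len(b))) time on average.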
# generate random numpy array with unique elements
import numpy as np
month1 = np.random.choice(range(10**5, 10**7), size=25*10**5, replace=False)
month2 = np.random.choice(range(10**5, 10**7), size=22*10**5, replace=False)
month3 = np.random.choice(range(10**5, 10**7), size=21*10**5, replace=False)
print('Month 1, 2, and 3 contains {}, {}, and {} elements respectively'.format(len(month1), len(month2), len(month3)))
Month 1, 2, and 3 contains 2500000, 2200000, and 2100000 elements respectively
# Compare consecutive month arrays for overlap
import time
startTime = time.time()
# Elements common to both months (an intersection, not a union)
common_m1m2 = set(month1).intersection(month2)
common_m2m3 = set(month2).intersection(month3)
# Each percentage is relative to the size of the later month
print('Percent of elements in both month 1 & 2: {}%'.format(round(100*len(common_m1m2)/len(month2),2)))
print('Percent of elements in both month 2 & 3: {}%'.format(round(100*len(common_m2m3)/len(month3),2)))
print('Process time:{:.2f}s'.format(time.time()-startTime))
Percent of elements in both month 1 & 2: 25.3%
Percent of elements in both month 2 & 3: 22.24%
Process time:2.46s
With your actual data, the overlap between months will probably be higher than it is for these random arrays.
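To apply the same set intersection to the CSV layout from the question, here is a minimal sketch. It assumes each `output` cell is the text of a Python list of quoted IDs, as in the sample rows. The function name `monthly_overlap`, the 1 GiB field limit, and the tiny in-memory sample are illustrative assumptions, not part of the original data.

```python
import ast
import csv
import io


def monthly_overlap(csv_file):
    """Yield (month, percent of that month's IDs also seen in the previous row)."""
    # Each 'output' cell holds millions of IDs, far more than the csv module's
    # default per-field limit allows, so raise it (assumed ceiling: 1 GiB).
    csv.field_size_limit(2**30)
    reader = csv.reader(csv_file, skipinitialspace=True)
    next(reader)  # skip the "month, output" header row
    previous = None
    for row in reader:
        if not row:
            continue  # ignore blank lines
        month, output = row
        # The cell is the text of a Python list; literal_eval parses it safely.
        # IDs stay strings, so leading zeros ('098903284') are preserved.
        current = set(ast.literal_eval(output))
        if previous is not None and current:
            yield int(month), 100 * len(current & previous) / len(current)
        previous = current


# Tiny in-memory stand-in for the real file (made-up IDs).
sample = io.StringIO(
    'month, output\n'
    '1,"[\'1\', \'2\', \'3\', \'4\']"\n'
    '2,"[\'2\', \'3\', \'4\', \'5\']"\n'
    '3,"[\'4\', \'9\']"\n'
)
for month, pct in monthly_overlap(sample):
    print('Month {} has {:.2f}% overlap against month {}'.format(month, pct, month - 1))
```

For the real file you would pass `open('data.csv', newline='')` instead of `sample` (the file name is hypothetical). Only two months' sets are held in memory at a time, although parsing a cell with millions of entries is still memory-hungry.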