Best way to deal with a giant number of combinations in Python

Question

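I have a pile of Twitter data (300 million messages from 450 thousand users) and I'm trying to unravel a social network through @mentions. My end goal is a collection of pairs where the first item is a pair of @mentions and the second item is the number of users who mention both of those people. For example: [(@sam, @kim), 25]. The order of the @mentions doesn't matter, so (@sam, @kim) = (@kim, @sam).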

First, I create a dictionary where the key is a user ID and the value is a set of @mentions:

from collections import defaultdict

userData = defaultdict(set)  # user ID -> set of lowercased @mentions

for row in data:
    user_id = int(row[1])
    # Python 2: decode the raw bytes, silently dropping anything non-ASCII
    msg = str(unicode(row[0], errors='ignore'))

    # keep every whitespace-delimited token that starts with "@"
    userData[user_id] |= set(tag.lower() for tag in msg.split() if tag.startswith("@"))

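Then I loop through the users and create a dictionary where the key is a tuple of @mentions and the value is the number of users who mention both: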

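import itertools
from collections import Counter

hashtag_set = Counter()  # (mention_a, mention_b) -> number of users mentioning both

for mentions in userData.values():
    if len(mentions) < MENTION_THRESHOLD:
        continue
    # sort first so every pair has one canonical order; iterating a set directly
    # could yield (@sam, @kim) for one user and (@kim, @sam) for another
    hashtag_set.update(itertools.combinations(sorted(mentions), 2))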

The second part takes forever to run. Is there a better way to run this and/or a better way to store this data?

Answer 1

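Instead of trying to do all of this in memory as you are now, I suggest using generators to pipeline your data. Take a look at this slide deck by David Beazley from PyCon 2008: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf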

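In particular, Part 2 has a number of examples of parsing large data that apply directly to what you're trying to do. By using generators you can avoid most of the memory consumption you have now, and I expect you would see a significant performance improvement.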
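As a rough illustration of the idea (a sketch, not a drop-in replacement), the two loops from the question could be chained as generator stages. The function names and the `threshold` parameter are made up for this example, the row layout (message text in `row[0]`, user ID in `row[1]`) follows the question, and the merging stage assumes the rows arrive already grouped by user ID, for example from a sorted file.

```python
import itertools
from collections import Counter


def parse_rows(rows):
    """Yield (user_id, mentions) for one message at a time."""
    for row in rows:
        text, user_id = row[0], int(row[1])
        mentions = set(tok.lower() for tok in text.split() if tok.startswith("@"))
        if mentions:
            yield user_id, mentions


def mentions_per_user(parsed):
    """Merge the mention sets of consecutive rows that share a user ID.

    Assumption: the input is already grouped (e.g. sorted) by user ID.
    """
    for user_id, group in itertools.groupby(parsed, key=lambda item: item[0]):
        merged = set()
        for _, mentions in group:
            merged |= mentions
        yield user_id, merged


def pair_counts(users, threshold=2):
    """Count, for each unordered pair of mentions, how many users mention both."""
    counts = Counter()
    for _, mentions in users:
        if len(mentions) < threshold:
            continue
        # sorted() gives each pair a single canonical order
        counts.update(itertools.combinations(sorted(mentions), 2))
    return counts


# Tiny made-up sample: (message, user_id), grouped by user ID
rows = [
    ("hi @Sam and @kim", "1"),
    ("@kim again", "1"),
    ("@sam @kim @lee", "2"),
    ("no mentions here", "3"),
]
counts = pair_counts(mentions_per_user(parse_rows(rows)))
```

With this layout only one user's mention set is held at a time. The pair counter itself still has to fit in memory, so if the number of distinct pairs is the real bottleneck, the counts would need to go to disk or a database instead.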

python

bigdata