Concatenate all rows of a large csv

So I have a large csv file (5 million rows) with multiple columns. I am particularly interested in the column that contains text.

The input csv has the following format:

system_id,member_name,message,is_post

0157e407,member1011,"I have had problems with my lungs for years now. It all started with an infection...",false

1915d457, member1055, "Looks like a lot of people take Paracetamol for managing pain and....",false

The 'message' column contains the text and is the one of interest.

The task now is to concatenate all rows of this column into one big text and then compute n-grams (n=1,2,3,4,5) on it. The output should be 5 different files, one per n-gram, in the following format, for example:

bigram.csv

n-gram,count

"word1 word2", 7

"word1 word3", 11

trigram.csv

n-gram,count

"word1 word2 word3", 22

"word 1 word2 word4", 24

Here is what I have tried so far:

from collections import OrderedDict
import csv
import re
import sys

import nltk


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <inp_file_path>" % sys.argv[0]
        exit(1)
    ifpath = sys.argv[1]
    with open(ifpath, 'r') as ifp:
        reader = csv.DictReader(ifp)
        all_msgs = []
        fieldnames = reader.fieldnames
        processed_rows = []
        for row in reader:
            msg = row['message']
            res = {'message': msg}
            txt = msg.decode('ascii', 'ignore')
            # some preprocessing
            txt = re.sub(r'[\.]{2,}', r". ", txt)
            txt = re.sub(r'([\.,;!?])([A-Z])', r' ', txt)
            sentences = nltk.tokenize.sent_tokenize(txt.strip())
            all_msgs.append(' '.join(sentences))
    text = ' '.join(all_msgs)

    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if len(token) > 1]
    bi_tokens = list(nltk.bigrams(tokens))
    tri_tokens = list(nltk.trigrams(tokens))
    bigrms = []
    for item in sorted(set(bi_tokens)):
        bb = OrderedDict()
        bb['bigrams'] = ' '.join(item)
        bb['count'] = bi_tokens.count(item)
        bigrms.append(bb)

    trigrms = []
    for item in sorted(set(tri_tokens)):
        tt = OrderedDict()
        tt['trigrams'] = ' '.join(item)
        tt['count'] = tri_tokens.count(item)
        trigrms.append(tt)

    with open('bigrams.csv', 'w') as ofp2:
        header = ['bigrams', 'count']
        dict_writer = csv.DictWriter(ofp2, header)
        dict_writer.writeheader()
        dict_writer.writerows(bigrms)

    with open('trigrams.csv', 'w') as ofp3:
        header = ['trigrams', 'count']
        dict_writer = csv.DictWriter(ofp3, header)
        dict_writer.writeheader()
        dict_writer.writerows(trigrms)

    tokens = nltk.word_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    quadgrams = []
    for fourgram, freq in fourgrams.ngram_fd.items():
        dd = OrderedDict()
        dd['quadgram'] = " ".join(fourgram)
        dd['count'] = freq
        quadgrams.append(dd)
    with open('quadgram.csv', 'w') as ofp4:
        header = ['quadgram', 'count']
        dict_writer = csv.DictWriter(ofp4, header)
        dict_writer.writeheader()
        dict_writer.writerows(quadgrams)

This has been running for the past 2 days on a 4-core machine. How can I make it more efficient (perhaps using pandas and/or multiprocessing) and speed it up as much as reasonably possible?

I would make a few changes:

Find the bottleneck

Which part is taking so long? (For a quick way to measure this, see the sketch after this list.)

  • Reading the CSV
  • Tokenizing
  • Making the n-grams
  • Counting the n-grams
  • Writing to disk
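
A quick way to find out is to time each stage separately; a minimal sketch (the timed helper and the commented-out calls are placeholders for your own functions, not part of the original code):

import time

def timed(label, func, *args):
    # run one stage and report how long it took
    start = time.perf_counter()
    result = func(*args)
    print('%s took %.1f s' % (label, time.perf_counter() - start))
    return result

# text = timed('reading the csv', read_text, csv_file)
# tokens = timed('tokenizing', nltk.word_tokenize, text)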

So the first thing I would do is create a cleaner separation between the different steps, ideally in a way that lets you restart halfway through.
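
One way to get that restartability is to cache each step's result in an intermediate file and skip the work when that file already exists; a sketch (the cached helper and the tokens.json file name are illustrative, not part of the original code):

import json
import os

def cached(path, compute):
    # reuse an intermediate result from a previous run if it exists,
    # otherwise compute it and write it out for next time
    if os.path.exists(path):
        with open(path) as fp:
            return json.load(fp)
    result = compute()
    with open(path, 'w') as fp:
        json.dump(result, fp)
    return result

# tokens = cached('tokens.json', lambda: nltk.word_tokenize(text))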

Reading the text

I would extract this into a separate method. From what I have read (e.g. here), pandas reads csv files a lot faster than the csv module. If reading the csv only takes 1 minute of the 2 days, it is probably not the problem, but I would do it like this:

def read_text(filename):  # you could add **kwargs to pass on to read_csv
    df = pd.read_csv(filename) # add info on file encoding etc
    message = df['message'].str.replace(r'[\.]{2,}', r". ")  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
    message = message.str.replace(r'([\.,;!?])([A-Z])', r' ')

    message = message.str.strip()
    sentences = message.apply(nltk.tokenize.sent_tokenize)
    return ' '.join(sentences.apply(' '.join))

You could even do this in chunks and yield the sentences instead of returning them, turning it into a generator, which may save memory.
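
A sketch of what that generator could look like, assuming the same preprocessing as read_text above (the name read_text_chunked and the chunk size are my own choices, not from the original code):

import pandas as pd

def read_text_chunked(filename, chunksize=100000):
    # yield the preprocessed text of one chunk at a time instead of
    # holding all 5 million messages in memory at once
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        message = chunk['message'].str.replace(r'[\.]{2,}', '. ', regex=True)
        message = message.str.replace(r'([\.,;!?])([A-Z])', ' ', regex=True)
        yield ' '.join(message.str.strip())

# text = ' '.join(read_text_chunked(csv_file))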

Is there a specific reason why you join the sentences back together after sent_tokenize? I found this in the documentation:

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
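
If you do keep the sentence splitting, you could pass each sentence to word_tokenize directly instead of joining everything back into one string and splitting it again; a minimal sketch:

sentences = nltk.tokenize.sent_tokenize(text)
tokens = [token for sentence in sentences for token in nltk.word_tokenize(sentence)]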

So you could call read_text like this:

text = read_text(csv_file)
with open(text_file, 'w') as file:
    file.write(text)
print('finished reading text from file') # or use logging

Tokenization

This stays roughly the same:

tokens = nltk.word_tokenize(text)
print('finished tokenizing the text')

def save_tokens(filename, tokens):
    # save the list (json here; pickle works too) so you can pick up later if something goes wrong; needs `import json`
    with open(filename, 'w') as fp:
        json.dump(tokens, fp)

Making the n-grams, counting them and writing them to disk

Your code contains a lot of boilerplate that does the same thing with just a different function or filename, so I abstracted it into a list of tuples containing the name, the method that builds the n-grams, the method that counts them, and the filename to save them to:

ngrams = [
    ('bigrams', nltk.bigrams, collections.Counter, 'bigrams.csv'),
    ('trigrams', nltk.trigrams, collections.Counter, 'trigrams.csv'),
    ('quadgrams', nltk.collocations.QuadgramCollocationFinder.from_words, parse_quadgrams, 'quadgrams.csv'),
]

If you want to count how many of each item there are in a list, just use collections.Counter instead of building an (expensive) collections.OrderedDict for every item. If you want to do the counting yourself, it would be better to use tuples than an OrderedDict. You could also use pd.Series.value_counts().
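
For instance, all bigrams can be counted in a single pass (a sketch; the same call works for trigrams and quadgrams):

bigram_counts = collections.Counter(nltk.bigrams(tokens))
# bigram_counts[('word1', 'word2')] is the number of times that bigram occurs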

def parse_quadgrams(quadgrams):
    return quadgrams.ngram_fd #from what I see in the code this dict already contains the counts

for name, ngram_method, parse_method, output_file in ngrams:
    grams = ngram_method(tokens)
    print('finished generating ', name)
    # You could write this intermediate result to a temporary file in case something goes wrong
    # join the n-gram tuples into strings so each n-gram ends up in one column
    counts = {' '.join(gram): freq for gram, freq in parse_method(grams).items()}
    count_df = pd.Series(counts).reset_index().rename(columns={'index': name, 0: 'count'})
    # if you need it sorted you can do this on the DataFrame
    print('finished counting ', name)
    count_df.to_csv(output_file, index=False)
    print('finished writing ', name, ' to file: ', output_file)
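
The question also asks for unigrams and 5-grams; nltk.ngrams generalizes the same pattern to any n, so you could extend the ngrams list (before the loop runs) with two more entries, for example (the output file names are only suggestions):

ngrams += [
    ('unigrams', lambda tokens: nltk.ngrams(tokens, 1), collections.Counter, 'unigram.csv'),
    ('fivegrams', lambda tokens: nltk.ngrams(tokens, 5), collections.Counter, 'fivegram.csv'),
]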