Concatenate all rows of a large csv
So I have a large csv file (5 million rows) with multiple columns. The column I am particularly interested in is the one containing text.
The input csv has the following format:
system_id,member_name,message,is_post
0157e407,member1011,"I have had problems with my lungs for years now. It all started with an infection...",false
1915d457, member1055, "Looks like a lot of people take Paracetamol for managing pain and....",false
The 'message' column contains the text and is the one of interest.
The task now is to concatenate all rows of this column into one big text and then compute n-grams (n=1,2,3,4,5) on it. The output should be 5 different files, one per n-gram, in the following format:
For example:
bigram.csv
n-gram,count
"word1 word2", 7
"word1 word3", 11
trigram.csv
n-gram,count
"word1 word2 word3", 22
"word1 word2 word4", 24
Here is what I have tried so far:
from collections import OrderedDict
import csv
import re
import sys
import nltk
if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <inp_file_path>" % sys.argv[0]
        exit(1)
    ifpath = sys.argv[1]
    with open(ifpath, 'r') as ifp:
        reader = csv.DictReader(ifp)
        all_msgs = []
        fieldnames = reader.fieldnames
        processed_rows = []
        for row in reader:
            msg = row['message']
            res = {'message': msg}
            txt = msg.decode('ascii', 'ignore')
            # some preprocessing
            txt = re.sub(r'[\.]{2,}', r". ", txt)
            txt = re.sub(r'([\.,;!?])([A-Z])', r' ', txt)
            sentences = nltk.tokenize.sent_tokenize(txt.strip())
            all_msgs.append(' '.join(sentences))

    text = ' '.join(all_msgs)
    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if len(token) > 1]

    bi_tokens = list(nltk.bigrams(tokens))
    tri_tokens = list(nltk.trigrams(tokens))

    bigrms = []
    for item in sorted(set(bi_tokens)):
        bb = OrderedDict()
        bb['bigrams'] = ' '.join(item)
        bb['count'] = bi_tokens.count(item)
        bigrms.append(bb)

    trigrms = []
    for item in sorted(set(tri_tokens)):
        tt = OrderedDict()
        tt['trigrams'] = ' '.join(item)
        tt['count'] = tri_tokens.count(item)
        trigrms.append(tt)

    with open('bigrams.csv', 'w') as ofp2:
        header = ['bigrams', 'count']
        dict_writer = csv.DictWriter(ofp2, header)
        dict_writer.writeheader()
        dict_writer.writerows(bigrms)

    with open('trigrams.csv', 'w') as ofp3:
        header = ['trigrams', 'count']
        dict_writer = csv.DictWriter(ofp3, header)
        dict_writer.writeheader()
        dict_writer.writerows(trigrms)

    tokens = nltk.word_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    quadgrams = []
    for fourgram, freq in fourgrams.ngram_fd.items():
        dd = OrderedDict()
        dd['quadgram'] = " ".join(fourgram)
        dd['count'] = freq
        quadgrams.append(dd)

    with open('quadgram.csv', 'w') as ofp4:
        header = ['quadgram', 'count']
        dict_writer = csv.DictWriter(ofp4, header)
        dict_writer.writeheader()
        dict_writer.writerows(quadgrams)
This has been running on a 4-core machine for the past 2 days now. How can I make it more efficient (maybe using pandas and/or multiprocessing) and speed it up as much as reasonably possible?
I would make a few changes:
Find the bottleneck
Which part is taking so long?
- reading the CSV
- tokenization
- making the n-grams
- counting the n-grams
- writing to disk
So the first thing I would do is make a cleaner separation between the different steps, ideally in a way that lets you restart halfway through.
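To see which of those steps dominates the 2 days, you could wrap each stage in a small timing helper, for instance with time.perf_counter() (a minimal sketch; the labels and the commented usage are illustrative only, and read_text refers to the function defined below):
import time

def timed(label, func, *args):
    # run one pipeline stage and report how long it took,
    # so you can see which step actually eats the time
    start = time.perf_counter()
    result = func(*args)
    print('%s took %.1f s' % (label, time.perf_counter() - start))
    return result

# hypothetical usage once the stage functions exist:
# text = timed('reading the csv', read_text, csv_file)
# tokens = timed('tokenizing', nltk.word_tokenize, text)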
Reading the text
I would extract this into its own method. From what I have read (for example here), pandas reads csv files a lot faster than the csv module. If reading the csv only takes 1 minute out of those 2 days it is probably not the problem, but I would do it like this:
import pandas as pd

def read_text(filename):  # you could add **kwargs to pass on to read_csv
    df = pd.read_csv(filename)  # add info on file encoding etc
    message = df['message'].str.replace(r'[\.]{2,}', r". ")  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
    message = message.str.replace(r'([\.,;!?])([A-Z])', r' ')
    message = message.str.strip()
    sentences = message.apply(nltk.tokenize.sent_tokenize)
    return ' '.join(sentences.apply(' '.join))
You could even do this in chunks, and yield the sentences instead of returning them to turn it into a generator, which might save memory.
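A rough sketch of that generator version, assuming the same column name and preprocessing as above (the chunk size is just an illustrative default, and regex=True is spelled out for newer pandas versions):
def iter_text(filename, chunksize=100000):
    # read the csv in chunks and yield one preprocessed blob of text per chunk,
    # so the whole 5-million-row file never has to sit in memory at once
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        message = chunk['message'].str.replace(r'[\.]{2,}', r". ", regex=True)
        message = message.str.replace(r'([\.,;!?])([A-Z])', r' ', regex=True)
        message = message.str.strip()
        sentences = message.apply(nltk.tokenize.sent_tokenize)
        yield ' '.join(sentences.apply(' '.join))

# text = ' '.join(iter_text(csv_file))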
Is there a specific reason you join the sentences back together after sent_tokenize? I found this in the documentation:
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
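In other words, you could skip the re-joining and tokenize sentence by sentence, roughly like this (a sketch; sentences here stands for a flat list of sentence strings as produced by sent_tokenize):
tokens = [token.lower()
          for sentence in sentences
          for token in nltk.word_tokenize(sentence)
          if len(token) > 1]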
So you could call read_text like this:
text = read_text(csv_file)
with open(text_file, 'w') as file:
    file.write(text)
print('finished reading text from file') # or use logging
Tokenization
This stays roughly the same:
tokens = nltk.word_tokenize(text)
print('finished tokenizing the text')

def save_tokens(filename, tokens):
    # save the list somewhere, either json or pickle, so you can pick up later if something goes wrong
    # (json variant shown; needs an `import json` at the top of the script)
    with open(filename, 'w') as fp:
        json.dump(tokens, fp)
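And a matching loader so you can actually pick up from that checkpoint (a sketch under the same json assumption as above):
import json

def load_tokens(filename):
    # reload a previously saved token list, so a failure in a later stage
    # does not force you to re-read and re-tokenize the whole csv
    with open(filename) as fp:
        return json.load(fp)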
Making the n-grams, counting them and writing them to disk
Your code contains a lot of boilerplate that does the same thing with only a different function or filename, so I abstracted that into a list of tuples containing the name, the function that generates the n-grams, the function that counts them, and the filename to save them to:
ngrams = [
    ('bigrams', nltk.bigrams, collections.Counter, 'bigrams.csv'),
    ('trigrams', nltk.trigrams, collections.Counter, 'trigrams.csv'),
    ('quadgrams', nltk.collocations.QuadgramCollocationFinder.from_words, parse_quadgrams, 'quadgrams.csv'),
]
If you want to count how many of each item there are in a list, just use collections.Counter instead of building an (expensive) collections.OrderedDict per item. If you do want to do the counting yourself, tuples are a better fit than an OrderedDict. You could also use pd.Series.value_counts().
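Counting the bigrams then becomes a one-liner, for example (the token list here is made up purely for illustration):
import collections
import nltk

tokens = ['this', 'is', 'just', 'an', 'example', 'this', 'is']  # made-up tokens
bigram_counts = collections.Counter(nltk.bigrams(tokens))
# Counter({('this', 'is'): 2, ('is', 'just'): 1, ...}) -- one pass over the data,
# instead of calling list.count() once per distinct bigram, which is quadratic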
def parse_quadgrams(quadgrams):
    return quadgrams.ngram_fd  # from what I see in the code this dict already contains the counts
for name, ngram_method, parse_method, output_file in ngrams:
    grams = ngram_method(tokens)
    print('finished generating ', name)
    # You could write this intermediate result to a temporary file in case something goes wrong
    count_df = pd.Series(parse_method(grams)).reset_index().rename(columns={'index': name, 0: 'count'})
    # if you need it sorted you can do this on the DataFrame
    print('finished counting ', name)
    count_df.to_csv(output_file)
    print('finished writing ', name, ' to file: ', output_file)
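For the sorting mentioned in the comment above, something like this would do (sorting by descending count is an assumption about what you want):
count_df = count_df.sort_values('count', ascending=False)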