Python tokenizing words

Question

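import csv
from nltk.tokenize import word_tokenize

summaries = []
texts = []
# Raw string, so the backslashes in the Windows path are not read as escape sequences
with open(r"C:\Users\apandey\Documents\Reviews.csv", "r", encoding="utf8") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # clean() is presumably the asker's own helper; it is not shown in the question
        clean_text = clean(row['Text'])
        clean_summary = clean(row['Summary'])
        summaries.append(word_tokenize(clean_summary))
        texts.append(word_tokenize(clean_text))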

I just want to tokenize the rows of a csv file, but I get this error: "list indices must be integers or slices, not str"

Answer 1

I assume your csv file looks like this:

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1,'B001E4KFG0','A3SGXH7AUHU8GW','delmartian',1,1,5,1303862400,'Good Quality Dog Food','I have bought several of the Vitality canned dog food products and have found them all to be of good quality...'

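In that case you should use DictReader, as Peter Wood suggested in the comments. csv.reader returns each row as a list, which can only be indexed by position, whereas csv.DictReader returns each row as a dict keyed by the header names, so row["Text"] works.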
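
The difference between the two readers can be seen in a minimal sketch. The two-column sample and the variable names below are made up for illustration and held in memory; they are not the real Reviews.csv.

```python
import csv
import io

# Hypothetical sample standing in for the real file
sample = "Summary,Text\nGood food,My dog likes it\n"

# csv.reader yields every row (header included) as a plain list
list_rows = list(csv.reader(io.StringIO(sample)))
print(list_rows[1][1])  # positional access works: My dog likes it

# Indexing a list with a column name raises the error from the question
try:
    list_rows[1]["Text"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# csv.DictReader consumes the header and yields each data row as a dict
dict_rows = list(csv.DictReader(io.StringIO(sample)))
print(dict_rows[0]["Text"])  # access by column name works: My dog likes it
```

Applied to the code from the question, the change looks like this: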

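import csv
from nltk.tokenize import word_tokenize

summaries = []
texts = []
# newline='' lets the csv module handle line endings itself
with open("foo.csv", encoding="utf8", newline='') as csvfile:
    # DictReader takes the column names from the header row
    reader = csv.DictReader(csvfile)
    for row in reader:
        clean_text = row["Text"]
        clean_summary = row["Summary"]
        summaries.append(word_tokenize(clean_summary))
        texts.append(word_tokenize(clean_text))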

Output:

# texts
[["'I", 'have', 'bought', 'several', 'of', 'the', 'Vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'The', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', '.', 'My', 'Labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', 'most', '.', "'"]]

# summaries
[["'Good", 'Quality', 'Dog', 'Food', "'"]]
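
The stray 'I and ' tokens in this output appear because the csv module treats the double quote as its quote character by default, so the single quotes in the sample are kept as part of the field text. A sketch of one possible way around that, assuming the file really does wrap its fields in single quotes as the sample above suggests (the real Reviews.csv may be formatted differently), is to pass quotechar="'" to DictReader. The sample row here is a shortened, made-up one.

```python
import csv
import io

# Hypothetical sample with single-quoted fields; the Text field contains a comma
sample = (
    "Summary,Text\n"
    "'Good Quality Dog Food','I have bought several, and all were good.'\n"
)

# Default dialect: single quotes are ordinary characters, so they stay in the
# data and the comma inside the Text field is treated as a field separator
default_row = next(csv.DictReader(io.StringIO(sample)))

# With quotechar="'" the quotes are stripped and the comma is kept in the field
quoted_row = next(csv.DictReader(io.StringIO(sample), quotechar="'"))

print(default_row["Summary"])
print(quoted_row["Summary"])
print(quoted_row["Text"])
```

Note that an apostrophe inside a field (for example in "dog's") would have to be doubled or escaped in the file for this to parse cleanly, so this only fits data that is quoted consistently.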
