Why is pre-processing causing me to lose dictionary keys?

Question

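I have a very strange problem. The extract function takes an XML file and produces a dictionary with restaurant reviews as keys. Here I am doing some basic pre-processing on the text, since I am using it for sentiment analysis: the text is tokenized, punctuation is removed, and it is 'un-tokenized' before being reinserted into the dictionary.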

import string
from nltk.tokenize import word_tokenize, RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(file):
    d = extract(file)
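    for text in list(d.keys()):
        # Keep only runs of word characters; this drops all punctuation.
        tokenized_text = tokenizer.tokenize(text)
        # Re-join the tokens with single spaces ('un-tokenize').
        text2 = ''.join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokenized_text]).strip()
        # Re-key the entry; any entry already stored under text2 is overwritten.
        d[text2] = d.pop(text)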
    return d

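Of the 675 reviews, 2 are missing after the function has run. They are 'great service.' and 'Delicious'. I would expect these to be returned as they are, except that the full stop should be removed from the first one.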

For reference, the extract function:

from collections import OrderedDict, defaultdict
import xml.etree.ElementTree as ET

def extract(file):

    tree = ET.parse(file)
    root = tree.getroot()

    if file == 'EN_REST_SB1_TEST.xml':
        d = OrderedDict()
        for sentence in root.findall('.//sentence'):
            opinion = sentence.findall('.//Opinion')
            if opinion == []:
                text = sentence.find('text').text
                d[text] = 0

        return d

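If anyone is familiar with the SemEval ABSA task, you will notice that I have done this in a somewhat roundabout way, without using the id tags in the XML, but I would prefer to stick with how I have done it.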

Answer 1

You are using the reviews as keys, which means you will lose any duplicates. Evidently, these very short reviews appear twice.
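Two reviews need not be identical in the raw file to collide: once punctuation is stripped, texts that differ only in punctuation end up as the same key, and the later assignment overwrites the earlier one. A minimal sketch of that effect, assuming made-up review texts (the real data may differ) and using `re.findall(r"\w+", ...)` as a stand-in for `RegexpTokenizer(r'\w+')`; `normalize` is a hypothetical helper, not part of the original code:

```python
import re
from collections import OrderedDict

def normalize(text):
    # Roughly what the question's loop does: keep word tokens, join with spaces.
    return " ".join(re.findall(r"\w+", text))

# Hypothetical reviews: four distinct raw keys.
d = OrderedDict([
    ("great service.", 0),
    ("great service!", 0),
    ("Delicious", 0),
    ("Delicious!", 0),
])

for text in list(d.keys()):
    # Same re-keying pattern as in preprocess(); colliding keys are merged.
    d[normalize(text)] = d.pop(text)

print(len(d))   # 2
print(list(d))  # ['great service', 'Delicious']
```

Four entries go in and only two come out, which matches the symptom described in the question.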

I can't think of any reason to use the reviews as keys, especially if you care about preserving duplicates. So why not collect them in a list instead?

d = []
...
d.append(text)
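The list-based suggestion could be fleshed out roughly as follows. This is only a sketch: the inline XML is a made-up stand-in for the real file (the actual SemEval layout may differ), `extract_texts` and `preprocess_texts` are hypothetical names, and `re.findall` is used in place of the NLTK tokenizer so the example has no external dependencies.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of the input file, for illustration only.
SAMPLE_XML = """
<Reviews>
  <sentence id="1"><text>great service.</text></sentence>
  <sentence id="2"><text>great service!</text></sentence>
  <sentence id="3"><text>Nice food.</text><Opinions><Opinion polarity="positive"/></Opinions></sentence>
</Reviews>
"""

def extract_texts(root):
    # Collect the text of every sentence that has no Opinion element.
    texts = []
    for sentence in root.findall(".//sentence"):
        if not sentence.findall(".//Opinion"):
            texts.append(sentence.find("text").text)
    return texts

def preprocess_texts(texts):
    # Strip punctuation by keeping only word tokens; a list keeps duplicates.
    return [" ".join(re.findall(r"\w+", t)) for t in texts]

root = ET.fromstring(SAMPLE_XML)
cleaned = preprocess_texts(extract_texts(root))
print(cleaned)  # ['great service', 'great service']
```

Because the results live in a list, both occurrences survive even though they are identical after cleaning.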
