In text preprocessing, contractions are not recognised because of different single and double quotes

Question

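I am doing text preprocessing on article text. One step in my preprocessing code handles contractions: I try to expand forms such as "I've", "I'm" and so on. The problem is that the expansion works when I type a sample text myself, but not when I process my own article text. I also know why: the characters differ. For example, here is a sample from my text:

“I’m here because a Cabinet minister is needed.”

Below is the same text, but typed by me: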

"I'm here because a Cabinet minister is needed."

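If you look closely, you will see that the quotation marks differ (both the single and the double quotes).

How can I solve this?

Below is the code I use to expand contractions.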

def expand_contractions(row, contraction_mapping=CONTRACTION_MAP):
    Japan_3 = row['Articles']
    Japan_3 = Japan_3.apply(lambda x: str(x).replace("’", "'"))
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, Japan_3)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


Japan['expanded_text'] = Japan.apply(expand_contractions, axis=1)

After changing the code I get the following error:

AttributeError: ("'str' object has no attribute 'apply'", 'occurred at index 0')

I don't know how to explain it in a less confusing way.

Thanks in advance!

Answer 1

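One way to solve this is to replace every wrong apostrophe with the correct one. In your case, this can be done by applying a replace function to the Articles column of the pandas DataFrame: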

Japan_3 = Japan_3.apply(lambda x:str(x).replace("’","'"))
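Since the question mentions that the double quotes differ as well, the same idea can be extended to all four typographic quote characters. The following is a minimal sketch, not tested against the asker's data; `QUOTE_TABLE` and `normalize_quotes` are names made up for this illustration.

```python
# Map typographic ("curly") quotes to their plain ASCII equivalents.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return str(text).translate(QUOTE_TABLE)


sample = "\u201cI\u2019m here because a Cabinet minister is needed.\u201d"
print(normalize_quotes(sample))
```

Assuming the column is called Articles as in the question, this could be applied with Japan["Articles"].apply(normalize_quotes) before any contraction handling.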

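I cannot test your function because I don't have the contraction mapping you pass as an argument. My guess is that you can add that line right after Japan_3 = row['Articles'] and then carry out the rest of the contraction handling as usual. In fact, I would call the function like this: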

expand_contractions(Japan, contraction_mapping=CONTRACTION_MAP)
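A note on the AttributeError from the question: when the function is called through Japan.apply(..., axis=1), each row['Articles'] is a single string rather than a column, so it has no .apply method and plain str.replace has to be used instead. Below is a rough row-wise sketch under that assumption. The two-entry `CONTRACTION_MAP` with lowercase keys is hypothetical (the real mapping is not shown in the question), a dict stands in for one DataFrame row, and the final apostrophe-stripping step of the original function is left out.

```python
import re

# Hypothetical mapping with lowercase keys; the question's real CONTRACTION_MAP is not shown.
CONTRACTION_MAP = {"i'm": "i am", "i've": "i have"}

# Escape the keys so that every character in them is matched literally.
CONTRACTIONS_PATTERN = re.compile(
    "({})".format("|".join(re.escape(key) for key in CONTRACTION_MAP)),
    flags=re.IGNORECASE,
)


def expand_match(match):
    contraction = match.group(0)
    expanded = CONTRACTION_MAP.get(contraction) or CONTRACTION_MAP.get(contraction.lower())
    # Keep the capitalisation of the first character of the original text.
    return contraction[0] + expanded[1:]


def expand_contractions(row):
    # Each row value is a plain str here, so use str.replace, not .apply.
    text = str(row["Articles"]).replace("\u2019", "'")
    return CONTRACTIONS_PATTERN.sub(expand_match, text)


# A dict stands in for one DataFrame row in this sketch.
row = {"Articles": "I\u2019m here because a Cabinet minister is needed."}
print(expand_contractions(row))
# With pandas: Japan["expanded_text"] = Japan.apply(expand_contractions, axis=1)
```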

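To be honest, though, I don't know exactly what you are trying to do in that code to get rid of the contractions. To expand contractions, I would simply replace each one in the text with its expanded form. Here is what I would do. I haven't tested it, so it may not work as written, but the real solution should look similar.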

CONTRACTION_MAP = {"I'm": "I am"}  # example only; replace it with your own contractions
# Replace the typographic apostrophe (’) with the plain ASCII one (')
Japan["Articles"] = Japan["Articles"].apply(lambda x: str(x).replace("’", "'"))
# Replace each contraction with its expanded form, one contraction at a time
for contraction, expansion in CONTRACTION_MAP.items():
    Japan["Articles"] = Japan["Articles"].apply(lambda x: str(x).replace(contraction, expansion))
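One caveat with replacing contractions one at a time is ordering: if a short contraction is a prefix of a longer one, replacing the short one first leaves a half-expanded word behind. The sketch below shows one possible way around that by processing longer keys first. The mapping entries and the helper name `expand_by_replacement` are made up for illustration, and the matching is still case-sensitive.

```python
# Hypothetical entries, chosen so that one key is a prefix of another.
CONTRACTION_MAP = {"can't": "cannot", "can't've": "cannot have", "I'm": "I am"}


def expand_by_replacement(text, mapping=CONTRACTION_MAP):
    """Expand contractions in a single string by plain substring replacement."""
    text = str(text).replace("\u2019", "'")
    # Longest keys first, so "can't've" is handled before its prefix "can't".
    for contraction in sorted(mapping, key=len, reverse=True):
        text = text.replace(contraction, mapping[contraction])
    return text


print(expand_by_replacement("I\u2019m sure he can\u2019t\u2019ve left."))
```

On a DataFrame this helper could be used as Japan["Articles"].apply(expand_by_replacement), again assuming the column name from the question.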


nlp

python-3.7