Regex to prepend NOT_ between a word and punctuation

I am trying to use a regex to reproduce a classic tokenization trick for handling sentences like:

"I didn't like that SO question, but I like pizza!"

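The solution proposed in the literature is simple: prepend NOT_ to every token between "didn't" and the next punctuation mark. In our example, the sentence becomes: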

"I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!"

How can we do this with Python or a regex?

Thanks!

Use a regex to split the sentence into tokens, then rewrite the relevant token and join everything back together:

import re
sentence = "I didn't like that SO question, but I like pizza!"
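# The capturing group keeps the delimiters (punctuation and "didn't") in the result.
words = re.split("([,.?:!;]|didn't)", sentence)
# Only the chunk right after "didn't" gets NOT_ prefixed to each of its words.
# The replacement must be a raw string, otherwise "\1" is the control character \x01.
not_sentence = "".join([word if (idx == 0 or words[idx-1] != "didn't")
                        else re.sub(r"(\w+)", r"NOT_\1", word)
                        for idx, word in enumerate(words)])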
print(not_sentence)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
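
To cover more than one negation word, the same idea can probably be generalized with a re.sub callback. This is only a minimal sketch: the NEGATIONS tuple and the helper name mark_negation are hypothetical choices of mine, not something taken from the question or from the literature it mentions.

```python
import re

# Hypothetical list of negation triggers (an assumption); extend as needed.
NEGATIONS = ("didn't", "don't", "not", "never")

# A negation word, followed by everything up to (not including) the next
# punctuation mark. Group 1 is the trigger, group 2 is the negated scope.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in NEGATIONS) + r")\b([^,.?:!;]*)",
    re.IGNORECASE,
)


def mark_negation(text):
    """Prefix NOT_ to every word between a negation and the next punctuation."""
    def _prefix(match):
        trigger, scope = match.group(1), match.group(2)
        # [\w']+ keeps contractions such as "don't" together as one token.
        return trigger + re.sub(r"([\w']+)", r"NOT_\1", scope)

    return _PATTERN.sub(_prefix, text)


print(mark_negation("I didn't like that SO question, but I like pizza!"))
```

Because the scope stops at the first punctuation mark, text after the comma is left untouched, and a sentence without any of the listed negation words comes back unchanged.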
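
Another option is to match the segment that follows "didn't" and prefix the words inside it:

import re

text = "I didn't like that SO question, but I like pizza!"

# Match everything after "didn't" up to (not including) the next punctuation mark.
regex = re.compile(r"(?<=didn't)[^,.?:!;]+")

segment = regex.search(text).group(0)

# Every word in the segment is preceded by a space, so prefix NOT_ after each space.
result = text.replace(segment, segment.replace(' ', ' NOT_'))

print(result)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!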