Regex to prepend NOT_ between a word and punctuation
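I'm trying to use a regular expression to reproduce the classic tokenization trick for handling sentences like:
"I didn't like that SO question, but I like pizza!"
The solution proposed in the literature is actually quite simple: prepend NOT_ to every token between "didn't" and the next punctuation mark. So in our example, it becomes:
"I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!"
How can we do this with Python or a regular expression?
Thanks!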
Tokenize with a regular expression, then split and join like this:
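import re

sentence = "I didn't like that SO question, but I like pizza!"
# The capturing group keeps the delimiters (punctuation and "didn't") in the result list.
words = re.split(r"([,.?:!;]|didn't)", sentence)
# Prefix every word in the piece that directly follows "didn't" with NOT_.
# The replacement must be a raw string so that \1 is a backreference, not the character \x01.
not_sentence = "".join([word if (idx == 0 or words[idx-1] != "didn't")
                        else re.sub(r"(\w+)", r"NOT_\1", word)
                        for idx, word in enumerate(words)])
print(not_sentence)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!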
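The split-based answer above relies on one detail of `re.split`: when the pattern contains a capturing group, the matched delimiters are kept in the returned list. A minimal sketch that only prints that intermediate list (the variable name `pieces` is just for illustration):

```python
import re

sentence = "I didn't like that SO question, but I like pizza!"

# Because the pattern is wrapped in a capturing group, "didn't" and each
# punctuation mark show up as their own list items instead of being dropped.
pieces = re.split(r"([,.?:!;]|didn't)", sentence)
print(pieces)
# ['I ', "didn't", ' like that SO question', ',', ' but I like pizza', '!', '']
```

Joining the pieces back with `"".join(pieces)` reproduces the original sentence, which is why the answer can rewrite only the piece after "didn't" and leave everything else untouched.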
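import re

text = "I didn't like that SO question, but I like pizza!"
# Alternative: find the text after "didn't" up to the next punctuation mark.
regex = re.compile(r"(?<=didn't)[^,.?:!;]+")
match = regex.search(text)
segment = match.group(0) if match else ""
# Prefix each word of that segment with NOT_ (only the first "didn't" is handled).
result = text.replace(segment, segment.replace(" ", " NOT_"), 1) if segment else text
print(result)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!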
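As a possible variation on the answers above, the same idea can be written as a single `re.sub` call with a callback, which handles every "didn't" in the text in one pass. This is only a sketch: the names `PUNCT`, `NEGATION_SCOPE`, and `mark_negation` are made up for illustration, and the punctuation set is assumed to be the one used in the split-based answer.

```python
import re

PUNCT = ",.?:!;"  # assumed punctuation set, same as in the split-based answer

# Group 1: the trigger word; group 2: everything up to the next punctuation mark.
NEGATION_SCOPE = re.compile(r"(didn't)([^" + re.escape(PUNCT) + r"]*)")


def mark_negation(sentence: str) -> str:
    """Prefix NOT_ to each word between "didn't" and the next punctuation mark."""

    def prefix_words(match: re.Match) -> str:
        # Rewrite only the scope after "didn't"; the trigger word stays as-is.
        scope = re.sub(r"\w+", lambda m: "NOT_" + m.group(0), match.group(2))
        return match.group(1) + scope

    return NEGATION_SCOPE.sub(prefix_words, sentence)


print(mark_negation("I didn't like that SO question, but I like pizza!"))
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
```

A second "didn't" inside the same scope (before any punctuation) is not treated specially here, so that case would need extra handling.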