How to remove ' in strings with RegexpTokenizer

from nltk.tokenize import RegexpTokenizer
text="That's some text, you know!"
tokens=[]
tokenizer = RegexpTokenizer(r'\w+')
tokens+=tokenizer.tokenize(text.lower())

This currently returns: ['that', 's', 'some', 'text', 'you', 'know']

I need it to return: ['thats', 'some', 'text', 'you', 'know'] (with "thats" as a single word).

There are two options. Either preprocess the text variable to strip the apostrophes before tokenizing:

text = text.replace("'", "")

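Or match "that's" as a single token by allowing apostrophes in the pattern. Note that this keeps the apostrophe, so the token comes out as "that's" rather than "thats":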

tokenizer = RegexpTokenizer(r'[\w\']+')
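To compare the two options side by side, here is a minimal sketch that runs both on the sample sentence from the question. The variable names (word_tokenizer, apostrophe_tokenizer, stripped_tokens, kept_tokens) are made up for illustration, and it assumes the text uses the plain ASCII apostrophe ('); a typographic apostrophe (’) would need to be handled separately.

```python
from nltk.tokenize import RegexpTokenizer

text = "That's some text, you know!"

# Option 1: strip apostrophes first, then tokenize on word characters only.
word_tokenizer = RegexpTokenizer(r"\w+")
stripped_tokens = word_tokenizer.tokenize(text.lower().replace("'", ""))

# Option 2: leave the text alone and let the pattern accept apostrophes,
# so "that's" stays together as one token (apostrophe included).
apostrophe_tokenizer = RegexpTokenizer(r"[\w']+")
kept_tokens = apostrophe_tokenizer.tokenize(text.lower())

print(stripped_tokens)  # ['thats', 'some', 'text', 'you', 'know']
print(kept_tokens)      # ["that's", 'some', 'text', 'you', 'know']
```

Only the first option produces 'thats' as asked in the question. The second is probably the better fit if the apostrophe should survive in the tokens.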