Tokenize "don't" to "dont" using NLTK in Python
When I use:
nltk.word_tokenize("don't")
I get:
["do", "n't"]
What I want is:
["dont"]
You can use TweetTokenizer:
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()
sen = "don't won't can't"
# TweetTokenizer keeps contractions as single tokens; strip the apostrophes afterwards
res = [x.replace("'", '') for x in tweet_tokenizer.tokenize(sen)]
print(res)
Output:
['dont', 'wont', 'cant']
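If you'd rather avoid a tokenizer dependency entirely, a minimal regex sketch gives the same result (this assumes simple whitespace-separated English text; the function name `tokenize_no_apostrophes` is my own, not an NLTK API):

```python
import re

def tokenize_no_apostrophes(text):
    # Drop apostrophes first so contractions stay as one token,
    # then collect runs of word characters as tokens.
    return re.findall(r"\w+", text.replace("'", ""))

print(tokenize_no_apostrophes("don't won't can't"))
# → ['dont', 'wont', 'cant']
```

Note this is cruder than TweetTokenizer: it also splits on punctuation like hyphens, so prefer the NLTK approach for messier input.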