How to remove a duplicate substring from text without removing duplicated punctuation?
Here is my sample string:

string = '111 East Sego Lily Drive Lily Drive, Suite 200 Sandy, UT 84070'

Here "Lily Drive" appears twice and I want to remove that duplicate. However, the punctuation "," also appears twice, and I do not want that removed.
import nltk
from collections import OrderedDict

string = nltk.word_tokenize(string)
string = OrderedDict.fromkeys(string)
string = " ".join(string)
This returns:
'111 East Sego Lily Drive, Suite 200 Sandy UT 84070'
What I'm looking for is:
'111 East Sego Lily Drive, Suite 200 Sandy, UT 84070'
Instead of OrderedDict, you can use a workaround that prevents the removal of duplicate "," tokens (or any other tokens you define). Like this:
import nltk.tokenize as nltk
string = '111 East Sego Lily Drive Lily Drive, Suite 200 Sandy, UT 84070'
s = nltk.word_tokenize(string)
uniques = set()
res = []
for word in s:
    if word not in uniques or word == ',':
        uniques.add(word)
        res.append(word)
out = ' '.join(res).replace(' ,', ',')
print(out)
111 East Sego Lily Drive, Suite 200 Sandy, UT 84070
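The same idea can be sketched without NLTK: a regex splits words and punctuation into separate tokens, and any token in an allow-list may repeat. The `KEEP` set and the `dedupe_tokens` name are illustrative choices, not part of the original answer; note that, like the answer above, this deduplicates individual word tokens rather than whole phrases.

```python
import re

# Tokens in KEEP are allowed to appear more than once.
KEEP = {',', '.', ';'}

def dedupe_tokens(text, keep=KEEP):
    # Split into word tokens and single punctuation tokens,
    # e.g. 'Drive,' -> 'Drive', ','
    tokens = re.findall(r"\w+|[^\w\s]", text)
    seen = set()
    out = []
    for tok in tokens:
        if tok not in seen or tok in keep:
            seen.add(tok)
            out.append(tok)
    # Re-attach punctuation to the preceding word.
    return (' '.join(out)
            .replace(' ,', ',')
            .replace(' .', '.')
            .replace(' ;', ';'))

result = dedupe_tokens('111 East Sego Lily Drive Lily Drive, Suite 200 Sandy, UT 84070')
print(result)  # 111 East Sego Lily Drive, Suite 200 Sandy, UT 84070
```

Comparing on punctuation-free tokens is what lets the second "," survive while the second "Lily" and "Drive" are dropped.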