String manipulation in Python using a list
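I have some tweets that contain shorthand text such as ur, bcz, etc. I am using a dictionary to map them to the correct words. I know that strings in Python cannot be mutated, so after replacing a shorthand word with the correct one, I store the copy in a new list. This works. The problem arises when a tweet contains more than one shorthand word.
My code replaces only one word at a time. How can I replace multiple words in a single string?
Here is my code: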
# some sample tweets
tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]
short_text={
"bcz" : "because",
"ur" : "your",
"grt" : "great",
"gr8" : "great",
"u" : "you"
}
import re

def find_word(text, search):
    # raw strings keep \b as a regex word boundary instead of a backspace character
    result = re.findall(r'\b' + search + r'\b', text, flags=re.IGNORECASE)
    if len(result) > 0:
        return True
    else:
        return False

corrected_tweets = list()
for i in tweet:
    tweettoken = i.split()
    for short_word in short_text:
        print("current iteration")
        for tok in tweettoken:
            if(find_word(tok, short_word)):
                print(tok)
                print(i)
                newi = i.replace(tok, short_text[short_word])
                corrected_tweets.append(newi)
                print(newi)
My output is:
['stats is great',
'india is grt because it is colourfull',
'india is great bcz it is colourfull',
'your movie is great',
'i hate your book of hatred']
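What I need is for tweets 2 and 3 in the output to be appended once, as a single entry with all corrections applied. I am new to Python. Any help would be great.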
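The duplicated entries appear because `i.replace(...)` always starts from the original tweet `i`, and `append` runs once per match. A minimal sketch of the smallest change to that loop, assuming the same `tweet` and `short_text` data: keep one working copy per tweet, apply every replacement to that copy, and append once at the end. The use of `re.sub` with `re.escape` here is a substitution for the original `find_word` plus `str.replace` pair, not the asker's code.

```python
import re

tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you',
         'your movie is grt', 'i hate ur book of hatred']
short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

corrected_tweets = []
for i in tweet:
    newi = i  # working copy; strings are immutable, so the name is rebound
    for short_word, full_word in short_text.items():
        # Replace whole-word, case-insensitive matches in the working copy.
        pattern = r'\b' + re.escape(short_word) + r'\b'
        newi = re.sub(pattern, full_word, newi, flags=re.IGNORECASE)
    corrected_tweets.append(newi)  # exactly once per tweet

print(corrected_tweets)
```

Because the replacements run one after another on the same string, a replacement that itself contains another shorthand word as a whole word would be rewritten again by a later pass. That does not happen with this dictionary, but it is worth keeping in mind for larger ones.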
You can use a list comprehension for this:
[' '.join(short_text.get(s, s) for s in new_str.split()) for new_str in tweet]
Result:
In [1]: tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]
...:
In [2]: short_text={
...: "bcz" : "because",
...: "ur" : "your",
...: "grt" : "great",
...: "gr8" : "great",
...: "u" : "you"
...: }
In [4]: [' '.join(short_text.get(s, s) for s in new_str.split()) for new_str in tweet]
Out[4]:
['stats is great',
'india is great because it is colourfull',
'i like you',
'your movie is great',
'i hate your book of hatred']
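The comprehension matches tokens exactly, so `GR8` or `BCZ` would be left alone, whereas the code in the question used `re.IGNORECASE`. If case-insensitive matching is wanted, one possible variant lowercases each token for the lookup only. The mixed-case sample tweets below are made up for illustration and are not part of the original data.

```python
short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

# Hypothetical sample with mixed-case shorthand (not from the original data).
tweet = ['Stats is GR8', 'India is grt BCZ it is colourfull']

# Look up the lowercased token, but fall back to the token as it was written.
corrected = [' '.join(short_text.get(s.lower(), s) for s in t.split()) for t in tweet]

print(corrected)
```

Words that are not in the dictionary keep their original capitalisation, since only the lookup key is lowercased.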
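Use a regular expression substitution on word boundaries, with a function that fetches the replacement from the dictionary (defaulting to the original word, so the same word is returned if it is not found):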
tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]
short_text={
"bcz" : "because",
"ur" : "your",
"grt" : "great",
"gr8" : "great",
"u" : "you"
}
import re
changed = [re.sub(r"\b(\w+)\b",lambda m:short_text.get(m.group(1),m.group(1)),x) for x in tweet]
Result:
['stats is great', 'india is great because it is colourfull', 'i like you', 'your movie is great', 'i hate your book of hatred']
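This approach is very fast because each word is looked up in the dictionary in O(1) time on average, so the cost does not depend on the size of the dictionary.
Compared with str.split, the advantage of re with word boundaries is that it also works when words are separated by punctuation.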
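To make the punctuation point concrete, here is a small comparison of the two approaches on a tweet where a shorthand word is followed by a comma. The sample sentence is hypothetical and only serves to show the difference.

```python
import re

short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

# Hypothetical tweet with punctuation attached to a shorthand word.
sample = 'ur movie is grt, bcz it is fun!'

# str.split keeps the comma glued to the token, so 'grt,' is not a dictionary key.
split_based = ' '.join(short_text.get(w, w) for w in sample.split())

# The regex matches only the word characters, so 'grt' is found and the comma is kept.
regex_based = re.sub(r"\b(\w+)\b",
                     lambda m: short_text.get(m.group(1), m.group(1)),
                     sample)

print(split_based)
print(regex_based)
```

The split-based version leaves `grt,` untouched, while the regex-based version replaces it and preserves the comma.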
You can try this approach:
tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]
short_text={
"bcz" : "because",
"ur" : "your",
"grt" : "great",
"gr8" : "great",
"u" : "you"
}
for j, i in enumerate(tweet):
    data = i.split()
    for index_np, value in enumerate(data):
        if value in short_text:
            data[index_np] = short_text[value]
    tweet[j] = " ".join(data)
print(tweet)
Output:
['stats is great', 'india is great because it is colourfull', 'i like you', 'your movie is great', 'i hate your book of hatred']
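Note that this loop overwrites the entries of `tweet` itself. If the original tweets should stay available (the question kept a separate `corrected_tweets` list), the same logic could be wrapped in a function that returns a new list. This is only a sketch; `expand_shorthand` is a hypothetical name, and the input is shortened to two tweets.

```python
def expand_shorthand(tweets, mapping):
    """Return a new list with shorthand tokens replaced; the input list is not modified."""
    result = []
    for text in tweets:
        # Replace each whitespace-separated token if it is a key in the mapping.
        words = [mapping.get(word, word) for word in text.split()]
        result.append(" ".join(words))
    return result

tweet = ['stats is gr8', 'i hate ur book of hatred']
short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

corrected = expand_shorthand(tweet, short_text)
print(corrected)
print(tweet)  # the original list is unchanged
```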