How do I split a string by spaces, then split the resulting list elements by special characters and numbers, and then join them again?
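What I want to do is convert some of the words in a string to their corresponding words in a dictionary and leave the rest as is. For example, for the input: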
standarisationn("well-2-34 2 @$%23beach bend com")
I want the output to be:
"well-2-34 2 @$%23bch bnd com"
The code I used is:
import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)
But it gives the wrong output:
'well - 2 - 34 2 @ $ % 23beach bnd com'
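There are too many spaces, and it does not even convert "beach" to "bch". So this is what I think the issue is. What about first splitting the string by spaces, then splitting the resulting elements by special characters and numbers and applying the dictionary, then joining the split pieces back together with no space between them, and finally joining the whole list with spaces? Can anyone suggest how to go about this, or a better approach?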
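The split-then-rejoin idea described above could be sketched roughly as follows. This is only an illustration under some assumptions: the names standardise_tokens and lookup are hypothetical, the table is a shortened copy of the dictionary, and each space-separated token is split into runs of letters and runs of non-letters rather than strictly "special characters and numbers".

```python
import re

# Hypothetical, shortened lookup table for illustration only.
lookup = {"beach": "bch", "bend": "bnd"}

def standardise_tokens(addr, table):
    out = []
    # Treat commas as spaces, then split on whitespace.
    for token in addr.replace(",", " ").split():
        # Split the token into runs of letters and runs of non-letters.
        pieces = re.findall(r"[A-Za-z]+|[^A-Za-z]+", token)
        # Replace letter runs found in the table; glue the pieces back with no spaces.
        out.append("".join(table.get(p, p) for p in pieces))
    # Join the rebuilt tokens with single spaces.
    return " ".join(out)

print(standardise_tokens("well-2-34 2 @$%23beach bend com", lookup))
```

Note that str.split() collapses runs of whitespace, so the original spacing is not preserved exactly, and the lookup is case-sensitive.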
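You can build a regular expression from the keys of your dictionary, making sure that a key is not part of another word (i.e. not directly preceded or followed by a letter):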
import re

def standarisationn(addr):
    # Replace each comma and each run of whitespace with a single space
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}
    # Replace each key only where it is not directly preceded or followed by a letter
    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2 @$%23beach bend com"))
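The expression consists of the following parts:
^
matches the beginning of the string
(?<=[^a-zA-Z])
is a lookbehind (a zero-width assertion that consumes no characters) that checks that the preceding character is not a letter
{wrd}
is the key from your dictionary
(?=[^a-zA-Z]|$)
is a lookahead (also zero-width) that checks that the key is followed either by a non-letter character or by the end of the string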
Output:
well-2-34 2 @$%23bch bnd com
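To see the effect of the lookarounds in isolation, here is a small sketch that applies the same boundary pattern to the single key "beach". The sample strings are made up for illustration. It also shows what I believe is an equivalent, slightly shorter spelling using negative lookarounds, which cover the start and end of the string without needing ^ or $.

```python
import re

# Same boundary pattern as in the answer, specialised to the key "beach".
pattern = re.compile(r"(?:^|(?<=[^a-zA-Z]))beach(?=[^a-zA-Z]|$)")

# Alternative spelling: "not preceded by a letter" and "not followed by a letter".
negative = re.compile(r"(?<![a-zA-Z])beach(?![a-zA-Z])")

# Made-up sample inputs.
samples = ["23beach", "beach", "beach-2", "beaches", "seabeach"]

results = {s: pattern.sub("bch", s) for s in samples}
results_negative = {s: negative.sub("bch", s) for s in samples}

for s in samples:
    print(s, "->", results[s])
```

"beaches" and "seabeach" are left untouched because the key is directly followed or preceded by a letter.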
Edit: you can compile the whole expression and call re.sub only once if you replace the loop with:
    repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=[^a-zA-Z]|$)")
    addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)
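This should be much faster if your dictionary grows, because we build a single expression from all of your dictionary keys:
- ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...
- The lambda function in re.sub replaces the matched element with the corresponding value in lookp_dict (for more details, see for example this link)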
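Putting the single-pattern version together as a self-contained function might look like the sketch below. Several things here are my own assumptions rather than part of the answer above: the names LOOKUP, PATTERN and standardise are hypothetical, the table is shortened, the keys are passed through re.escape and sorted longest first as a precaution, negative lookarounds stand in for the ^ and $ alternatives, and commas and whitespace runs are collapsed together into one space.

```python
import re

# Shortened copy of the lookup table, for illustration only.
LOOKUP = {"av": "ave", "avenue": "ave", "beach": "bch", "bend": "bnd", "bluffs": "blf"}

# Longest keys first is a defensive ordering, and re.escape guards against keys
# containing regex metacharacters; neither is strictly needed for the keys above.
_keys = sorted(LOOKUP, key=len, reverse=True)

# Compile once at import time; the lookarounds reject matches inside longer words.
PATTERN = re.compile(
    r"(?<![a-zA-Z])(" + "|".join(map(re.escape, _keys)) + r")(?![a-zA-Z])"
)

def standardise(addr):
    # Collapse any run of commas and whitespace into a single space.
    addr = re.sub(r"[,\s]+", " ", addr)
    # Look up each matched key and substitute its abbreviation.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2 @$%23beach bend com"))
```

Compiling the pattern at module level means it is built once rather than on every call, which is where most of the benefit over the per-key loop would come from.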