Python strings: whole word match not working as intended
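My objective is to check whether certain (whole) words occur in a string. The code is below. I can't understand why I get a match for the search word 'odin', since it is not a whole word in my string. Can someone explain? I expect no match in this case.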
import re
#search words
hero = ['catwoman', 'hellboy', 'eternals', 'elektra', 'hydra', 'iron-man', 'iron man', 'green arrow', 'nightwing', 'flash gordon', 'lanterne verte', 'lantern',
'kryptonite', 'asgard', 'spider-man', 'spiderman', 'superheroes', 'super heroes', 'super hero', 'hancock', 'daredevil', 'avengers', 'metropolis',
'gotham', 'batman', 'captain america', 'wolverine', 'magneto', 'dark knight', 'aquaman', 'shazam', 'wolverine', 'punisher', 'batmobile',
'daredevil', 'superwoman', 'supergirl', 'wonderwoman', 'batgirl', 'catgirl', 'starfire', 'sandman', 'superman', 'thor', 'x-men', 'x men',
'marvel', 'spidey', 'superheroine', 'supervillain', 'supervillains', 'odin', 'loki', 'spiderman', 'ragnarok', 'asgardian', 'supergirl', 'spiderman',
'teen titans', 'stan lee', 'doctor strange', 'groot', 'ant man', 'ant-man', 'deadpool', 'professor x', 'wasp', 'phoenix', 'star wars',
'eternals', 'morbius', 'shang-chi', 'shang', 'rocketeer']
#string
s = "Hoping to escape from his troubled past, former DEA agent Phil Broker (Jason Statham) moves to a seemingly quiet backwater town in the bayou with his daughter. However, he finds anything but quiet there, for the town is riddled with drugs and violence. When Gator Bodine (James Franco), a sociopathic druglord, puts the newcomer and his young daughter in harm's way, Broker is forced back into action to save her and their home. Based on a novel by Chuck Logan.^A former DEA agent (Jason Statham) returns to action to save his daughter and his new town from a drug dealing sociopath (James Franco).^A former DEA agent (Jason Statham) encounters trouble when he moves to a small town"
match = re.search(r'\b{}\b'.format('|'.join(hero)),s )
print(match)
Output:
<re.Match object; span=(265, 269), match='odin'>
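re.search itself is accurate. It matches 'odin' because the sentence contains "When Gator B>odin<e (James Franco)", and in the joined pattern the \b anchors apply only to the first and last alternatives, so 'odin' is searched for as a plain substring.
How about a simpler approach without regular expressions?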
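The precedence issue is easier to see on a shortened list. This is a minimal sketch, assuming a three-word list; `words` and the test strings are illustrative and not taken from the question:

```python
import re

# Illustrative shortened list (an assumption, not the original data)
words = ['catwoman', 'odin', 'rocketeer']

# Same construction as in the question
pattern = r'\b{}\b'.format('|'.join(words))
print(pattern)  # \bcatwoman|odin|rocketeer\b

# '|' has the lowest precedence, so the three alternatives are
# r'\bcatwoman', r'odin' and r'rocketeer\b'.
print(re.search(pattern, "Gator Bodine").group())  # odin
print(re.search(pattern, "catwomanly").group())    # catwoman
print(re.search(pattern, "skyrocketeer").group())  # rocketeer
```

Only the first alternative is guarded on its left and only the last on its right; every word in between is unguarded on both sides.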
#search words
hero = ['catwoman', 'hellboy', 'eternals', 'elektra', 'hydra', 'iron-man',
'iron man', 'green arrow', 'nightwing', 'flash gordon', 'lanterne verte',
'lantern',
'kryptonite', 'asgard', 'spider-man', 'spiderman', 'superheroes', 'super heroes', 'super hero', 'hancock', 'daredevil', 'avengers', 'metropolis',
'gotham', 'batman', 'captain america', 'wolverine', 'magneto', 'dark knight', 'aquaman', 'shazam', 'wolverine', 'punisher', 'batmobile',
'daredevil', 'superwoman', 'supergirl', 'wonderwoman', 'batgirl', 'catgirl', 'starfire', 'sandman', 'superman', 'thor', 'x-men', 'x men',
'marvel', 'spidey', 'superheroine', 'supervillain', 'supervillains', 'odin', 'loki', 'spiderman', 'ragnarok', 'asgardian', 'supergirl', 'spiderman',
'teen titans', 'stan lee', 'doctor strange', 'groot', 'ant man', 'ant-man', 'deadpool', 'professor x', 'wasp', 'phoenix', 'star wars',
'eternals', 'morbius', 'shang-chi', 'shang', 'rocketeer']
#string
s = ("Hoping to escape from his troubled past, former DEA agent Phil Broker "
     "(Jason Statham) moves to a seemingly quiet backwater town in the bayou with "
     "his daughter. However, he finds anything but quiet there, for the town is "
     "riddled with drugs and violence. When Gator Bodine (James Franco), a "
     "sociopathic druglord, puts the newcomer and his young daughter in harm's way, "
     "Broker is forced back into action to save her and their home. Based on a "
     "novel by Chuck Logan.^A former DEA agent (Jason Statham) returns to action to "
     "save his daughter and his new town from a drug dealing sociopath (James "
     "Franco).^A former DEA agent (Jason Statham) encounters trouble when he moves "
     "to a small town")
# Split on single spaces; punctuation stays attached to the words,
# and multi-word names such as 'iron man' can never match this way
split_sentence = s.split(" ")
for word in split_sentence:
    if word in hero:
        print("{} is in hero list!".format(word))
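Splitting on single spaces leaves punctuation attached to words ("Bodine," or "town.") and cannot find multi-word entries such as 'iron man'. Below is a rough regex-free sketch that handles both, assuming a case-insensitive comparison is acceptable; `find_heroes` is a hypothetical helper name and the sample sentence is made up for illustration:

```python
import string

def find_heroes(text, names):
    """Return the names that occur in text as whole words (case-insensitive)."""
    # Turn every punctuation character except '-' into a space, so that
    # 'Bodine,' becomes 'bodine' while 'spider-man' stays one token.
    to_space = string.punctuation.replace('-', '')
    table = str.maketrans(to_space, ' ' * len(to_space))
    tokens = text.lower().translate(table).split()
    # Pad with spaces so a name can only match at token boundaries
    padded = ' ' + ' '.join(tokens) + ' '
    return [name for name in names if ' ' + name + ' ' in padded]

sample = "When Gator Bodine (James Franco) met Iron Man, Thor and spider-man."
print(find_heroes(sample, ['odin', 'iron man', 'thor', 'spider-man', 'x-men']))
# ['iron man', 'thor', 'spider-man']
```

The space padding plays the role that \b plays in the regex version: 'odin' is not reported because it only appears inside the token 'bodine'.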
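I realized what was wrong. The search pattern did not put a word boundary around each word in "hero".
I changed the search pattern from r'\b{}\b'.format('|'.join(hero))
to r'\b{}\b'.format(r'\b|\b'.join(hero))
and now it works as intended. Here is the full code: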
import re
#search words
hero = ['catwoman', 'hellboy', 'eternals', 'elektra', 'hydra', 'iron-man', 'iron man', 'green arrow', 'nightwing', 'flash gordon', 'lanterne verte', 'lantern',
'kryptonite', 'asgard', 'spider-man', 'spiderman', 'superheroes', 'super heroes', 'super hero', 'hancock', 'daredevil', 'avengers', 'metropolis',
'gotham', 'batman', 'captain america', 'wolverine', 'magneto', 'dark knight', 'aquaman', 'shazam', 'wolverine', 'punisher', 'batmobile',
'daredevil', 'superwoman', 'supergirl', 'wonderwoman', 'batgirl', 'catgirl', 'starfire', 'sandman', 'superman', 'thor', 'x-men', 'x men',
'marvel', 'spidey', 'superheroine', 'supervillain', 'supervillains', 'odin', 'loki', 'spiderman', 'ragnarok', 'asgardian', 'supergirl', 'spiderman',
'teen titans', 'stan lee', 'doctor strange', 'groot', 'ant man', 'ant-man', 'deadpool', 'professor x', 'wasp', 'phoenix', 'star wars',
'eternals', 'morbius', 'shang-chi', 'shang', 'rocketeer']
#string
s = "Hoping to escape from his troubled past, former DEA agent Phil Broker (Jason Statham) moves to a seemingly quiet backwater town in the bayou with his daughter. However, he finds anything but quiet there, for the town is riddled with drugs and violence. When Gator Bodine (James Franco), a sociopathic druglord, puts the newcomer and his young daughter in harm's way, Broker is forced back into action to save her and their home. Based on a novel by Chuck Logan.^A former DEA agent (Jason Statham) returns to action to save his daughter and his new town from a drug dealing sociopath (James Franco).^A former DEA agent (Jason Statham) encounters trouble when he moves to a small town"
match = re.search(r'\b{}\b'.format(r'\b|\b'.join(hero)),s )
print(match)
Output:
None
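Joining with r'\b|\b' works because it places a boundary on each side of every alternative. An equivalent and arguably more readable sketch wraps the alternation in a non-capturing group and escapes each word with re.escape; `build_pattern` is a hypothetical helper name, and the short word list is illustrative only:

```python
import re

def build_pattern(words):
    """Compile a pattern that matches any of the words as a whole word."""
    # re.escape guards against regex metacharacters inside the words;
    # (?:...) makes the two \b anchors apply to every alternative.
    alternation = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:{})\b'.format(alternation))

pattern = build_pattern(['odin', 'thor', 'iron-man'])
print(pattern.search("When Gator Bodine (James Franco) arrives"))  # None
print(pattern.search("Thor meets Odin"))          # None, matching is case-sensitive
print(pattern.search("thor meets odin").group())  # thor
```

If matching should ignore case, passing `re.IGNORECASE` as the second argument to `re.compile` is one option.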