Remove unwanted characters from a set of strings in Python
I am trying to clean up a set of strings by removing unwanted characters.
Input
Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .
Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5
Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .
One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5
Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30
Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14
Desired output
Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods
Case Key
I tried
re.findall(r'([a-zA-Z ]*)\d*.*', final_df.loc[index, 'Horse'])
This removes everything after the digits, but it keeps the t on the first entry. Is there a better way?
I would use re.split instead:
import re

# data is assumed to hold the input above as one multi-line string
for d in data.splitlines():
    print(re.split(r'\s+t?[0-9]\+?', d)[0])
Result
Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods
Case Key
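Explanation: re.split splits the string wherever the given pattern matches, and [0] takes the first piece, which is everything before the first match. The pattern only allows an optional t in front of the digit, so you may want to adjust it so that other prefixes (for example v5+ or cp5+ directly after the name) also match.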
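One possible adjustment, sketched under an assumption about the data rather than as a tested rule: split at the first whitespace-separated token that contains a digit, whatever letters come before the digit. The helper name horse_name and the pattern are illustrative only.

```python
import re

# Assumption: the name ends where the first token containing a digit begins.
FIRST_DIGIT_TOKEN = re.compile(r'\s+\S*\d')

def horse_name(line):
    """Return the text before the first token that contains a digit."""
    return FIRST_DIGIT_TOKEN.split(line, maxsplit=1)[0]

samples = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
]
names = [horse_name(s) for s in samples]
print(names)
```

A line with no digit at all is returned unchanged, so it may be worth checking for that case separately if it can occur.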
In Pandas
I just noticed that you seem to be using Pandas. Assuming your df looks like this:
Horse
0 Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...
1 Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...
2 Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
3 Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...
4 One Night Stand 0 0 D 34 W Jarvis . Silvestre ...
5 Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...
6 Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...
you can do
from operator import itemgetter

df["name"] = df.Horse.str.split(r'\s+t?[0-9]\+?').map(itemgetter(0))
to get this:
Horse name
0 Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A... Lethal Lunch
1 Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ... Muscika
2 Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 . Typhoon Ten
3 Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke... Wentworth Falls
4 One Night Stand 0 0 D 34 W Jarvis . Silvestre ... One Night Stand
5 Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami... Dancinginthewoods
6 Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M... Case Key
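A similar result can probably be had without itemgetter by using str.extract with a capture group. This is only a sketch: it assumes the name ends before the first token that contains a digit, and both the pattern and the three-row frame are illustrative rather than taken from the real data set.

```python
import pandas as pd

df = pd.DataFrame({"Horse": [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
]})

# Lazily capture everything before the first token that contains a digit.
# expand=False returns a Series; rows without a match become NaN.
df["name"] = df["Horse"].str.extract(r'^(.*?)\s+\S*\d', expand=False)
print(df["name"].tolist())
```

Unlike the split version, rows where the pattern does not match end up as NaN instead of keeping the full string, which makes unexpected rows easier to spot.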
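Something like this should work, assuming text is a list of lines:
filtered_text = []
for line in text:
    words = []
    for word in line.split(" "):  # split the current line, not the whole list
        if len(word) <= 3:
            # stop at the first short token (1, 0, t5+, v3, ...)
            break
        words.append(word)
    filtered_text.append(" ".join(words))
Note that the length check also stops at short words inside a name, so "Typhoon Ten" becomes "Typhoon" and "One Night Stand" becomes an empty string.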
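The length check in the loop above treats any word of three characters or fewer as the end of the name, which also cuts short name words such as Ten or One. A variant that stops at the first token that is not purely alphabetic might look like the sketch below; leading_words is a hypothetical helper name, and the rule is an assumption that fits the sample rows.

```python
from itertools import takewhile

def leading_words(line):
    """Join the leading tokens that consist of letters only."""
    return " ".join(takewhile(str.isalpha, line.split()))

text = [
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
]
filtered_text = [leading_words(line) for line in text]
print(filtered_text)
```

This would still cut a name at a word containing an apostrophe or hyphen, and it would run past the name if a letters-only token followed it directly.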
Would something like this be enough?
import re

lines = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5",
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
    "Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
]
for line in lines:
    # the first run of letters and spaces ends with a space, so strip it
    print(re.findall(r'\b[a-zA-Z ]+\b', line)[0].strip())
We basically ignore words that contain digits or other symbols.
Output:
Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods
Case Key