pandas: extract specific text before or after a hyphen that ends in given substrings
I'm new to pandas and have a data frame like the one below:
import pandas as pd
df = pd.DataFrame({'id': ["1", "2", "3", "4", "5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd",
                            "DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill",
                            "Greatest Company – Great World POM"]})
  id                                               mill
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM
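From the data frame above, I would like to extract only the mill name from each row.
Is there a simple way to extract these substrings into the same column? The mill name is sometimes before the "–" and sometimes after it, but it almost always ends with Palm Oil Mill, POM or Mill.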

Previous solution: you can use .str.split() and do this:
df.mill = df.mill.str.split(' –').str[0]
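To see why that one-liner is not enough here, a minimal sketch on two of the rows from the question may help. It always keeps the text before the dash, so it picks the wrong piece whenever the mill name comes after it. The variable names `mills` and `first_part` are only for illustration.

```python
import pandas as pd

mills = pd.Series([
    "Company X POM – Company X Ltd",   # mill name before the dash
    "Company Not Special – R Mill",    # mill name after the dash
])

# Keep only the text before the dash, as in the one-liner above.
first_part = mills.str.split(' –').str[0]
print(first_part.tolist())
# ['Company X POM', 'Company Not Special']
```

The second row comes back as "Company Not Special" instead of "R Mill", which is what the update below addresses.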
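Update: seeing that you have some constraints, you can write your own function (called func below) and put whatever logic you need inside it. It loops over the pieces obtained by splitting on " – " and returns the first piece that ends in Mill or POM (case-insensitive).
For other cases I recommend Wen's solution.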
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "3", "4", "5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd",
                            "DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill",
                            "Greatest Company – Great World POM"]})

def func(x):
    # Split the string on the dash separator
    parts = x.split(' – ')
    # With fewer than two pieces there is nothing to choose from
    if len(parts) < 2:
        return x
    # Otherwise return the first piece that ends in 'mill' or 'pom'
    for part in parts:
        if part.lower().endswith(('mill', 'pom')):
            return part
    # Nothing found: return the original string unchanged
    return x

df.mill = df.mill.apply(func)
print(df)
Returns:
  id                     mill
0  1  Company A Palm Oil Mill
1  2            Company X POM
2  3                DDDD Mill
3  4                   R Mill
4  5          Great World POM

IIUC, you can use str.contains with the keywords Palm Oil Mill, POM and Mill:
s = df.mill.str.split(' – ', expand=True)
df['Name'] = s[s.apply(lambda x: x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(axis=1)
df
Out[230]:
  id                                               mill  \
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM

                      Name
0  Company A Palm Oil Mill
1            Company X POM
2                DDDD Mill
3                   R Mill
4          Great World POM
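One thing to watch with str.contains is that it matches the keyword anywhere in a piece, not only at the end. The sketch below is a variation on the same idea, not the answer's original code: it anchors the pattern with `$` and passes `na=False` so that rows with fewer pieces do not produce missing values in the mask. The "Miller Holdings" row is made up to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'mill': [
    "DDDD Mill – Company New and Old Ltd",
    "Miller Holdings – R Mill",            # hypothetical row: 'Mill' inside 'Miller'
    "Greatest Company – Great World POM",
]})

parts = df['mill'].str.split(' – ', expand=True)

# True only where a piece ends with one of the keywords.
mask = parts.apply(lambda col: col.str.contains(r'(?:Mill|POM)$', na=False))

# Keep matching pieces, blank out the rest, then join each row into one string.
df['Name'] = parts.where(mask, '').apply(''.join, axis=1)
print(df['Name'].tolist())
# ['DDDD Mill', 'R Mill', 'Great World POM']
```

With an unanchored pattern, "Miller Holdings" would also match and the two pieces of that row would be concatenated. "Palm Oil Mill" already ends in "Mill", so it does not need its own alternative in the pattern.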

You want to split on the hyphen (if there is one) and return the substring that ends in 'Mill' or 'POM':
def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs
    return None  # parsing error. Could raise Exception instead

df.mill.apply(extract_mill_name)
0    Company A Palm Oil Mill
1              Company X POM
2                  DDDD Mill
3                     R Mill
4            Great World POM
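The sample data uses an en dash ("–"), while the question text also mentions a plain "-". If your real data mixes the two, a variant along these lines might be useful. This is only a sketch under that assumption: the helper name `extract_mill_name_any_dash` and the `SEPARATOR` and `SUFFIXES` constants are hypothetical, and the sample rows are made up.

```python
import re
import pandas as pd

# Split on a hyphen, en dash (\u2013) or em dash (\u2014) surrounded by whitespace.
SEPARATOR = re.compile(r'\s+[-\u2013\u2014]\s+')
SUFFIXES = ('Mill', 'POM')

def extract_mill_name_any_dash(text):
    """Return the first piece ending in a known suffix, or None if there is none."""
    for piece in SEPARATOR.split(text):
        piece = piece.strip()
        if piece.endswith(SUFFIXES):
            return piece
    return None

mills = pd.Series([
    "Company Not Special - R Mill",             # plain hyphen
    "Greatest Company \u2013 Great World POM",  # en dash
    "No Name Here Ltd",                         # nothing ends in Mill or POM
])
names = mills.apply(extract_mill_name_any_dash)
print(names.isna().tolist())
# [False, False, True]
```

Requiring whitespace around the dash means a hyphen inside a name (for example "Co-op Mill") is not treated as a separator. Rows where nothing matches come back as missing, so you can find them afterwards with `names.isna()`.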