pandas: extract specific text before or after a hyphen that ends in given substrings

I'm new to pandas and have a data frame like the one below:

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

  id                                               mill
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM

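What I want to get from the data frame above is just the mill name from each row.

Is there a simple way to extract these substrings into the same column? The mill name is sometimes before the "–" and sometimes after it, but it almost always ends in Palm Oil Mill, POM, or Mill.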

Previous solution: You can use .str.split() like this: df.mill = df.mill.str.split(' –').str[0].

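Update: Since you have some constraints, you can write your own function (called func below) and put whatever logic you need inside it. It loops over the parts of the string split on ' – ' and returns the first part that ends in 'mill' or 'pom' (case-insensitive).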

Otherwise, I recommend Wen's solution.

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

def func(x):
    # Split the string on the ' – ' separator
    parts = x.split(' – ')

    # No separator found: return the value unchanged
    if len(parts) < 2:
        return x

    # Otherwise return the first part that ends in 'mill' or 'pom'
    for part in parts:
        if part.lower().endswith(('mill', 'pom')):
            return part

    # Nothing found: return the original string
    return x

df.mill = df.mill.apply(func)

print(df)

Returns:

  id                     mill
0  1  Company A Palm Oil Mill
1  2            Company X POM
2  3                DDDD Mill
3  4                   R Mill
4  5          Great World POM

IIUC, you can use str.contains with the keywords Palm Oil Mill, POM and Mill:

s = df.mill.str.split(' – ', expand=True)

df['Name'] = s[s.apply(lambda x: x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(axis=1)
df
Out[230]: 
  id                                               mill  \
0  1  Company A Palm Oil Mill – Special Company A of...   
1  2                      Company X POM – Company X Ltd   
2  3                DDDD Mill – Company New and Old Ltd   
3  4                       Company Not Special – R Mill   
4  5                 Greatest Company – Great World POM   
                      Name  
0  Company A Palm Oil Mill  
1            Company X POM  
2                DDDD Mill  
3                   R Mill  
4          Great World POM  
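One thing to watch with str.contains is that it matches the keywords anywhere in a part, so a name such as "Miller Trading" would also count as a hit. The sketch below is a variant, not the answer's exact code: it anchors the pattern with `$` so only parts that end in a keyword are kept, and it joins the surviving parts per row with a plain Python join in place of `.sum(1)`. The third row is a made-up edge case that is not in the question's data.

```python
import pandas as pd

df = pd.DataFrame({'mill': ["Company X POM – Company X Ltd",
                            "Company Not Special – R Mill",
                            # Hypothetical row: "Miller" contains "Mill" but does not end in it
                            "Miller Trading – Great World POM"]})

# One column per part of the string
parts = df['mill'].str.split(' – ', expand=True)

# True only where a part ENDS in one of the keywords
mask = parts.apply(lambda col: col.str.contains(r'(?:Palm Oil Mill|POM|Mill)$'))

# Keep the matching parts and join whatever is left in each row
df['Name'] = parts.where(mask).apply(lambda row: ''.join(row.dropna()), axis=1)

print(df['Name'].tolist())
```

With the unanchored pattern, the third row would keep both parts and concatenate them.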

You want to split on the hyphen (if any) and return the substring that ends in 'Mill' or 'POM':
def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs

    return None # parsing error. Could raise Exception instead

df.mill.apply(extract_mill_name)

0    Company A Palm Oil Mill
1              Company X POM
2                  DDDD Mill
3                     R Mill
4            Great World POM
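If the real data mixes separators or letter case, the same idea can be loosened a little. This is only a sketch under assumptions that the question does not confirm: that the separator may be either a plain hyphen or an en dash with whitespace on both sides, and that the suffix may appear in any case. The function name `extract_mill_name_loose` and the last two sample strings are hypothetical.

```python
import re

# Assumption: the separator is a hyphen or an en dash with whitespace on both sides,
# so hyphens inside a name (no surrounding spaces) are left alone.
SEPARATOR = re.compile(r'\s+[-–]\s+')
SUFFIXES = ('mill', 'pom')  # lower-case; 'palm oil mill' is already covered by 'mill'


def extract_mill_name_loose(s):
    """Return the first part ending in Mill/POM (case-insensitive), or None."""
    for part in SEPARATOR.split(s):
        part = part.strip()
        if part.lower().endswith(SUFFIXES):
            return part
    return None  # no part matched


names = [extract_mill_name_loose(s) for s in [
    "Company Not Special – R Mill",
    "Greatest Company - Great World pom",  # hypothetical: plain hyphen, lower-case suffix
    "No separator Ltd",                    # hypothetical: nothing matches
]]
print(names)
```

It can be applied to the column in the same way as above, for example with df.mill.apply(extract_mill_name_loose).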