Split string in data frame depending on units and assign content to specific columns

From a raw CSV file, I import the following df:

import pandas as pd  
import numpy as np

# assign data of lists.  
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C']}     
# Create DataFrame  
df = pd.DataFrame(data)       
# Print the output.  
print(df)  

My objective is to split the content of the INTERVAL cells into specific columns according to their units, like this:

# Objective 
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C'],'INTERVAL_A': ['100', '100', '100',np.nan],'INTERVAL_B': [np.nan, '20', '20', np.nan],'INTERVAL_C': [np.nan, np.nan, '3','5']}     
# Create DataFrame  
df = pd.DataFrame(data)       
# Print the output.  
print(df)  

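I can split the INTERVAL column and assign the pieces to other columns, but this approach fails when the entries in INTERVAL are arranged differently. See the last row of the data (`5 C`) with the snippet below.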

# Split Interval
A0 = df['INTERVAL'].str.split(pat="or",expand=True, n=-1)
df['INTERVAL_X1'] = A0.loc[:,0] # Assign
df['INTERVAL_X2'] = A0.loc[:,1]
df['INTERVAL_X3'] = A0.loc[:,2]

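So what is a good way to loop over the contents of the INTERVAL_X columns and reassign them according to their units? A second question is how to isolate the values, given that the column labels already carry the unit information.

Thanks in advance, everyone.

Expected output: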

               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100         20          3
3                   5 C        NaN        NaN          5

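You can use a regular expression that matches a number followed by a space and an uppercase letter, together with str.extractall. Then reshape the data and finally join it to the original DataFrame: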

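# Extract every "<number> <unit>" pair; the result has one row per match
df2 = (df['INTERVAL'].str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
      .droplevel(1)                  # drop the match counter, keep the original row index
      .set_index('ID', append=True)  # index by (row, unit)
      .unstack('ID')                 # one column per unit
      )

# Flatten the ('INTERVAL', unit) column pairs into INTERVAL_A, INTERVAL_B, ...
df2.columns = df2.columns.map('_'.join)

df.join(df2)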

Output:

               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100         20          3
3                   5 C        NaN        NaN          5
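
On the second part of the question (isolating the values): str.extractall returns the numbers as strings. A minimal sketch, assuming numeric columns are wanted, is to run pd.to_numeric over the reshaped frame before joining. The names `wide` and `result` are only illustrative, and the columns typically come out as floats because of the NaN gaps.

```python
import pandas as pd

df = pd.DataFrame({'INTERVAL': ['100 A', '100 A or 20 B',
                                '100 A or 20 B or 3 C', '5 C']})

# Same extraction and reshape as in the answer above
wide = (df['INTERVAL'].str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
        .droplevel(1)
        .set_index('ID', append=True)
        .unstack('ID'))
wide.columns = wide.columns.map('_'.join)

# Convert the extracted strings to numbers, column by column
wide = wide.apply(pd.to_numeric)

result = df.join(wide)
print(result)
```

With numeric columns, arithmetic such as `result['INTERVAL_A'].sum()` works directly on the values.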

Fine-tuning

If you have longer identifiers (e.g. A/AB/GHI), use: '(?P<INTERVAL>\d+) (?P<ID>[A-Z]+)'

If the space is optional or there can be several: '(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)'
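
To see what the relaxed pattern picks up, here is a small sketch on made-up input (the sample strings are hypothetical, not from the question) with multi-letter units and irregular spacing:

```python
import pandas as pd

# Hypothetical sample: multi-letter units, missing and doubled spaces
s = pd.Series(['100 AB', '100AB or 20  GHI', '5 C'])

pattern = r'(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)'
matches = s.str.extractall(pattern)
print(matches)
```

The lowercase "or" is never captured, since the ID group only accepts uppercase letters and must be preceded by digits.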

Inspired by @mozway's answer:

df.join(
    df['INTERVAL']                                      # Select the column to extract info from
    .str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])') # Extract INTERVAL and ID as separate columns
    .pivot(columns="ID")                                # Use the values of the ID column as columns
    .droplevel(0, axis=1)                               # Drop the original column name from the column levels
    .groupby(level=0).first()                           # One row per original index (sum(level=0) was removed in pandas 2.0)
)
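
Note that this chain names the new columns after the bare units (A, B, C). A minimal sketch, assuming the INTERVAL_ prefix from the question is wanted, with `groupby(level=0).first()` as the collapsing step and `add_prefix` for the names (`units` and `out` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'INTERVAL': ['100 A', '100 A or 20 B',
                                '100 A or 20 B or 3 C', '5 C']})

units = (
    df['INTERVAL']
    .str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
    .pivot(columns='ID')        # one column per unit, still one row per match
    .droplevel(0, axis=1)       # columns are now A, B, C
    .groupby(level=0).first()   # first non-null value per original row
    .add_prefix('INTERVAL_')    # A -> INTERVAL_A, and so on
)

out = df.join(units)
print(out)
```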