Split string in data frame depending on units and assign content to specific columns
From a raw CSV file, I import the following df:
import pandas as pd
import numpy as np
# assign data of lists.
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C','5 C']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
My objective is to split the content of each INTERVAL cell into specific columns according to its units, like this:
# Objective
data = {'INTERVAL': ['100 A', '100 A or 20 B', '100 A or 20 B or 3 C', '5 C'],
        'INTERVAL_A': ['100', '100', '100', np.nan],
        'INTERVAL_B': [np.nan, '20', np.nan, np.nan],
        'INTERVAL_C': [np.nan, np.nan, '3', '5']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
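I can split the INTERVAL column and assign the pieces to other columns, but this approach fails when the units in INTERVAL are arranged differently; see the last row of the data with the following snippet.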
# Split Interval
A0 = df['INTERVAL'].str.split(pat="or",expand=True, n=-1)
df['INTERVAL_X1'] = A0.loc[:,0] # Assign
df['INTERVAL_X2'] = A0.loc[:,1]
df['INTERVAL_X3'] = A0.loc[:,2]
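So what would be a good way to loop over the contents of the INTERVAL_X columns and reassign them according to their units? A further question is how to isolate the values, given that the column labels already carry the unit information.
Thanks in advance, everyone.
Expected output: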
               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100        NaN          3
3                   5 C        NaN        NaN          5
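You can use a regular expression that matches digits followed by a space and an uppercase letter, together with str.extractall. Then reshape the data and finally join it to the original DataFrame:
df2 = (df['INTERVAL'].str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')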
.droplevel(1)
.set_index('ID', append=True)
.unstack('ID')
)
df2.columns = df2.columns.map('_'.join)
df.join(df2)
Output:
               INTERVAL INTERVAL_A INTERVAL_B INTERVAL_C
0                 100 A        100        NaN        NaN
1         100 A or 20 B        100         20        NaN
2  100 A or 20 B or 3 C        100         20          3
3                   5 C        NaN        NaN          5
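The values in the output above are strings, because str.extractall returns the captured text as-is. If numbers are needed, one option is to convert them before reshaping. The following is only a sketch: the helper name split_by_unit and the group names value and unit are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def split_by_unit(df, col='INTERVAL',
                  pattern=r'(?P<value>\d+)\s*(?P<unit>[A-Z]+)'):
    """Hypothetical helper: add one numeric column per unit found in df[col]."""
    # One row per (original row, match), with columns 'value' and 'unit'
    long = df[col].str.extractall(pattern)
    # extractall yields strings; convert the captured digits to numbers
    long['value'] = pd.to_numeric(long['value'])
    wide = (long.droplevel(1)                         # drop the 'match' level
                .set_index('unit', append=True)['value']
                .unstack('unit')                      # one column per unit
                .add_prefix(f'{col}_'))               # e.g. INTERVAL_A
    return df.join(wide)

df = pd.DataFrame({'INTERVAL': ['100 A', '100 A or 20 B',
                                '100 A or 20 B or 3 C', '5 C']})
out = split_by_unit(df)
print(out)
```

Because missing units become NaN, the new columns end up as floats rather than integers.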
Fine-tuning
If you have longer identifiers (e.g. A/AB/GHI), use: r'(?P<INTERVAL>\d+) (?P<ID>[A-Z]+)'
If the space is optional or there can be several: r'(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)'
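To see what the looser pattern changes, here is a small comparison on made-up input (the sample strings are an assumption, not data from the question). The strict pattern skips entries with no space or a double space and truncates multi-letter units, while the relaxed one picks them all up.

```python
import pandas as pd

# Hypothetical messy input: missing space, double space, multi-letter units
s = pd.Series(['100 A', '100A or 20  AB', '3 GHI'])

# Original pattern: exactly one space, exactly one uppercase letter
strict = s.str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')
# Relaxed pattern: any amount of whitespace, one or more uppercase letters
loose = s.str.extractall(r'(?P<INTERVAL>\d+)\s*(?P<ID>[A-Z]+)')

strict_ids = strict['ID'].tolist()
loose_ids = loose['ID'].tolist()
print(strict_ids)
print(loose_ids)
```

The lowercase "or" separators are never captured, since the ID group only accepts uppercase letters directly after a number.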
Inspired by @mozway's answer:
df.join(
    df['INTERVAL']  # Select the column to extract info from
    .str.extractall(r'(?P<INTERVAL>\d+) (?P<ID>[A-Z])')  # Extract INTERVAL and ID as separate columns
    .pivot(columns="ID")  # Use the values of the ID column as columns
    .droplevel(0, axis=1)  # Drop the original column name from the column levels
    .groupby(level=0).first()  # Collapse the match rows into one row per original index
)