How do I make separate columns from one column that contains null and multiple values?
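I converted this file from PDF to CSV in order to train a model. Three columns of the PDF (product ID, commodity, and country) were merged into a single column in the CSV.
I tried to separate these columns with a regular expression, but I am not sure how the columns should be arranged.
This is the data I am working with: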
   country/commodity                 Unit  Quantity  Value
1  0011101 BREEDING BULLS (OXEN)     NO    NaN       75
2  DUBAI                             NaN   NaN       75
3  0011102 BREEDING BULLS (BUFFALO)  NO    248       1921
4  SRI LUNKA                         NaN   248       1921
5  0011103 BUFFALO,BREEDING          NO    NaN       90
6  SRI LUNKA                         NaN   NaN       90
7  0011104 COWS BREEDING             NO    1249      258921665
8  AJMAN                             NaN   NaN       NaN
9  CYPRUS                            NaN   NaN       NaN
I need the data in the following format:
0  ProductID  Commodity                 Country    Unit  Quantity  Value
1  0011101    BREEDING BULLS (OXEN)     DUBAI      NaN   NaN       75
3  0011102    BREEDING BULLS (BUFFALO)  SRI LUNKA  NaN   248       1921
4  0011103    BUFFALO,BREEDING          SRI LUNKA  NaN   NaN       90
7  0011104    COWS BREEDING             AJMAN      NaN   NaN       NaN
8  0011104    COWS BREEDING             CYPRUS     NaN   NaN       NaN
9  0011104    COWS BREEDING             CHINA      NaN   590       3290
First, we build your ProductID, Commodity and Country columns by extracting information from the country/commodity column, using these methods:
str.split
str.extract
Series.where
Series.mask
str.contains
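As a small illustration of how these methods behave, here is a sketch on a two-row Series. The variable names (`s`, `has_digits`, and so on) are made up for this example and are not part of the answer's code.

```python
import pandas as pd

# Two sample values in the style of the question: a product row and a country row.
s = pd.Series(["0011101 BREEDING BULLS (OXEN)", "DUBAI"])

# str.contains returns a boolean Series: True where the regex matches.
has_digits = s.str.contains(r"\d+")

# Series.where keeps values where the condition is True and sets NaN elsewhere.
kept_where = s.where(has_digits)

# Series.mask is the inverse: it sets NaN where the condition is True.
kept_mask = s.mask(has_digits)

# str.extract returns the first capture group; expand=False gives a Series.
product_id = s.str.extract(r"^(\d+)", expand=False)

print(kept_where, kept_mask, product_id, sep="\n\n")
```

So `where` is the natural choice for keeping only the product rows, and `mask` for keeping only the country rows.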
Then we GroupBy on ProductID to bring the information for each product together. For this we use named aggregation, which is new since pandas 0.25.0:
# Extract information from country/commodity
# (df is assumed to hold the sample data shown in the question)
df['ProductID'] = df['country/commodity'].str.split(' ', n=1).str[0].str.extract(r'(\d+)', expand=False).ffill()
df['Commodity'] = df['country/commodity'].str.split(r'\d+').str[-1].str.strip().where(df['Unit'].notna())
df['Country'] = df['country/commodity'].mask(df['country/commodity'].str.contains(r'\d+')).fillna('')
# Group by ProductID to bring each product's information together
df_new = df.groupby(['ProductID']).agg(
    Commodity=('Commodity', 'first'),
    Country=('Country', ', '.join),
    Unit=('Unit', 'first'),
    Quantity=('Quantity', 'first'),
    Value=('Value', 'first')
).reset_index()
# Strip the leading ', ' left by the empty Country value on each product row
df_new['Country'] = df_new['Country'].str.lstrip(', ')
Output
  ProductID                 Commodity        Country Unit  Quantity        Value
0   0011101     BREEDING BULLS (OXEN)          DUBAI   NO       NaN         75.0
1   0011102  BREEDING BULLS (BUFFALO)      SRI LUNKA   NO     248.0       1921.0
2   0011103          BUFFALO,BREEDING      SRI LUNKA   NO       NaN         90.0
3   0011104             COWS BREEDING  AJMAN, CYPRUS   NO    1249.0  258921665.0
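Note that the result above joins all countries of a product into one cell, while the format requested in the question has one row per country. If that layout is what you need, a possible variation is to forward-fill the product information and keep only the country rows. The sketch below is an assumption about the intended result, not part of the original answer. It rebuilds the sample from the question, and the names `parts`, `is_product` and `long_df` are made up for illustration. The CHINA row in the desired output does not appear in the input sample, so it is not reproduced.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "country/commodity": [
        "0011101 BREEDING BULLS (OXEN)", "DUBAI",
        "0011102 BREEDING BULLS (BUFFALO)", "SRI LUNKA",
        "0011103 BUFFALO,BREEDING", "SRI LUNKA",
        "0011104 COWS BREEDING", "AJMAN", "CYPRUS",
    ],
    "Unit": ["NO", np.nan, "NO", np.nan, "NO", np.nan, "NO", np.nan, np.nan],
    "Quantity": [np.nan, np.nan, 248, 248, np.nan, np.nan, 1249, np.nan, np.nan],
    "Value": [75, 75, 1921, 1921, 90, 90, 258921665, np.nan, np.nan],
})

# Rows starting with a numeric code are product rows; all other rows are countries.
parts = df["country/commodity"].str.extract(
    r"^(?P<ProductID>\d+)\s+(?P<Commodity>.+)$"
)
is_product = parts["ProductID"].notna()

# Carry each product's ID and commodity down onto the country rows below it.
df = df.join(parts.ffill())
df["Country"] = df["country/commodity"].where(~is_product)

# Keep only the country rows; they keep their own Unit, Quantity and Value.
columns = ["ProductID", "Commodity", "Country", "Unit", "Quantity", "Value"]
long_df = df.loc[~is_product, columns].reset_index(drop=True)
print(long_df)
```

This keeps the per-country Quantity and Value instead of taking the first non-null value per product, which seems closer to the requested table.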