Read dataframe split by nan rows and extract specific columns in Python
I have a sample Excel file data2.xlsx from here, which has a Sheet1 as follows:
Preprocessing:
The columns 2018, 2019, 2020, num are of object type, and I need to convert them to float:
cols = ['2018', '2019', '2020', 'num']
df[cols] = df[cols].replace('--', np.nan, regex=True).astype(float)
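A minimal sketch of this conversion on a small made-up frame (the values are invented for illustration and are not taken from data2.xlsx). It swaps in pd.to_numeric with errors='coerce' instead of replace, so any placeholder that cannot be parsed as a number, such as '--', becomes NaN.

```python
import pandas as pd

# Hypothetical sample; the real contents of data2.xlsx are not shown here.
sample = pd.DataFrame({
    '2018': ['1', '--', '3'],
    '2019': ['4', '5', '--'],
    'num': ['--', '7.5', '8'],
})
cols = ['2018', '2019', 'num']

# errors='coerce' maps anything that cannot be parsed as a number to NaN,
# so every converted column ends up as float64.
converted = sample[cols].apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
```

Compared with replace('--', np.nan), this also handles other stray strings, which may or may not be what you want.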
I also need to extract the city names bj, sh, gz, sz from 2019-bj-price-quantity, 2019-sh-price-quantity, 2019-gz-price-quantity, 2019-sz-price-quantity:
pattern = '|'.join(['2019-', '-price-quantity'])
df['city'] = df['city'].str.replace(pattern, '', regex=True)
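As an alternative to stripping the prefix and suffix, the city code can be captured directly. This is a sketch that assumes every title has the shape `<4-digit year>-<city>-price-quantity`; the regular expression is my own and not part of the original post.

```python
import pandas as pd

titles = pd.Series([
    '2019-bj-price-quantity',
    '2019-sh-price-quantity',
    '2019-gz-price-quantity',
    '2019-sz-price-quantity',
])

# Capture the text between the year prefix and the '-price-quantity' suffix.
# expand=False returns a Series because there is a single capture group.
city = titles.str.extract(r'^\d{4}-(\w+)-price-quantity$', expand=False)
print(city.tolist())  # ['bj', 'sh', 'gz', 'sz']
```

Titles that do not match the pattern come back as NaN, which makes unexpected rows easy to spot.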
Finally, I need to extract price and quantity from num for each city, and reshape the result into a new dataframe like this:
How can I do this in pandas? Thanks.
Update:
df = pd.read_excel('./data2.xlsx', sheet_name = 'Sheet1', header = None)
df.groupby(df.iloc[:, 0].isna().cumsum()).transform('first')
Output:
0 1 2 3 4
0 2019-bj-price-quantity 2018.0 2019.0 2020.0 num
1 2019-bj-price-quantity 2018.0 2019.0 2020.0 num
2 2019-bj-price-quantity 2018.0 2019.0 2020.0 num
3 2019-bj-price-quantity 2018.0 2019.0 2020.0 num
4 2019-sh-price-quantity 2018.0 2019.0 2020.0 num
5 2019-sh-price-quantity 2018.0 2019.0 2020.0 num
6 2019-sh-price-quantity 2018.0 2019.0 2020.0 num
7 2019-sh-price-quantity 2018.0 2019.0 2020.0 num
8 2019-sh-price-quantity 2018.0 2019.0 2020.0 num
9 NaN NaN NaN NaN NaN
10 2019-gz-price-quantity 2018.0 2019.0 2020.0 num
11 2019-gz-price-quantity 2018.0 2019.0 2020.0 num
12 2019-gz-price-quantity 2018.0 2019.0 2020.0 num
13 2019-gz-price-quantity 2018.0 2019.0 2020.0 num
14 2019-gz-price-quantity 2018.0 2019.0 2020.0 num
15 NaN NaN NaN NaN NaN
16 2019-sz-price-quantity 2018.0 2019.0 2020.0 num
17 2019-sz-price-quantity 2018.0 2019.0 2020.0 num
18 2019-sz-price-quantity 2018.0 2019.0 2020.0 num
19 2019-sz-price-quantity 2018.0 2019.0 2020.0 num
20 2019-sz-price-quantity 2018.0 2019.0 2020.0 num
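To see why grouping on isna().cumsum() separates the stacked tables, here is a small sketch on a made-up two-table frame (the layout is an assumption about the sheet; the names raw, block_id and title are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical stacked sheet: two small tables separated by an all-NaN row.
raw = pd.DataFrame({
    0: ['2019-bj-price-quantity', 'price', 'quantity', np.nan,
        '2019-sh-price-quantity', 'price'],
    1: [np.nan, 21, 10, np.nan, np.nan, 12],
})

# Each NaN in column 0 bumps the counter, so every table gets its own id.
block_id = raw[0].isna().cumsum().rename('block')

# 'first' picks the first non-null value per block, i.e. the title row,
# and transform broadcasts it back to the rows of that block.
title = raw.groupby(block_id)[0].transform('first').rename('title')
print(pd.concat([block_id, title], axis=1))
```

Selecting a single column before transform gives a Series, which is what you need if the result is to be stored as one new column.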
*Note: I use column indices because the column names are uncertain.
You can split the table with:
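# label every row with the title (first non-null value) of its block
df['city'] = df.groupby(df.iloc[:, 0].isna().cumsum())[df.columns[0]].transform('first')
# drop the blank separator rows
df.dropna(subset=[df.columns[0]], inplace=True)
# drop the title rows themselves
df = df.loc[df[df.columns[0]] != df.city]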
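Now df has an additional column city holding the table title, while the title rows and blank rows have been dropped. You can use .str.split and .str.get to access any part of the city column: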
df.city = df.city.str.split('-').str.get(1)
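A quick illustration of how .str.split followed by .str.get picks out one piece of each title, using two of the titles from the question (the variable names are hypothetical):

```python
import pandas as pd

titles = pd.Series(['2019-bj-price-quantity', '2019-sz-price-quantity'])

# Split each title on hyphens; every element becomes a list of pieces.
parts = titles.str.split('-')

year = parts.str.get(0)  # text before the first hyphen
city = parts.str.get(1)  # second piece, the city code
print(city.tolist())  # ['bj', 'sz']
```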
Finally, you only want to keep the num column, which is the easiest step:
df = df.iloc[:, [0, 4, 5]]
df = df.pivot(index='city', columns=df.columns[0], values=df.columns[1])
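The pivot step can be sketched on its own with a small long-format frame. The column names item, num and city are assumptions chosen for readability; the numbers are the bj and gz values from the expected result.

```python
import pandas as pd

# Hypothetical long-format rows left after title and blank rows are removed.
long_df = pd.DataFrame({
    'item': ['price', 'quantity', 'price', 'quantity'],
    'num': [21.0, 10.0, 6.0, 15.0],
    'city': ['bj', 'bj', 'gz', 'gz'],
})

# One row per city, one column per item, cells filled from 'num'.
wide = long_df.pivot(index='city', columns='item', values='num')
print(wide)
```

Note that pivot raises an error if a (city, item) pair appears more than once, so any leftover header rows should be filtered out first.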
My code, based on jezrael's great answer; feel free to share a better solution or improve on it:
import numpy as np
import pandas as pd

# add header=None for default columns names
df = pd.read_excel('./data2.xlsx', sheet_name = 'Sheet1', header=None)
# convert columns by second row
df.columns = df.iloc[1].rename(None)
# create new column `city` by forward filling non missing values by second column
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
# strip the year prefix and the suffix; the pattern is a regex alternation
pattern = '|'.join(['2019-', '-price-quantity'])
df['city'] = df['city'].str.replace(pattern, '', regex=True)
df['year'] = df['year'].str.replace(pattern, '', regex=True)
# convert floats to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
df = df[df.year.isin(['price', 'quantity'])]
df = df[['city', 'year', 'num']]
df['num'] = df['num'].replace('--', np.nan, regex=True).astype(float)
df = df.set_index(['city', 'year']).unstack().reset_index()
df.columns = df.columns.droplevel(0)
df.rename({'year': 'city'}, axis=1, inplace=True)
print(df)
Output:
year price quantity
0 bj 21.0 10.0
1 gz 6.0 15.0
2 sh 12.0 NaN
3 sz 13.0 NaN
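For anyone who wants to try the whole idea without the Excel file, here is a self-contained sketch on an in-memory frame. The layout (title row, header row, price and quantity rows, blank separator) is my assumption about Sheet1, the 2018 values are invented, and the column names item, 2018 and num are hypothetical. It is not the code from either answer, just one possible way to combine the steps.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., header=None); the layout is
# an assumption and the 2018 values are invented.
raw = pd.DataFrame([
    ['2019-bj-price-quantity', np.nan, np.nan],
    ['year', 2018, 'num'],
    ['price', 20, 21],
    ['quantity', 9, 10],
    [np.nan, np.nan, np.nan],
    ['2019-sh-price-quantity', np.nan, np.nan],
    ['year', 2018, 'num'],
    ['price', 11, 12],
    ['quantity', '--', '--'],
], columns=['item', '2018', 'num'])

# Title rows are the ones with a label in 'item' but nothing in 'num'.
is_title = raw['item'].notna() & raw['num'].isna()
raw['city'] = raw['item'].where(is_title).ffill().str.split('-').str.get(1)

# Keep only the data rows and coerce '--' to NaN.
data = raw[raw['item'].isin(['price', 'quantity'])].copy()
data['num'] = pd.to_numeric(data['num'], errors='coerce')

# Reshape to one row per city with price and quantity as columns.
result = (data.pivot(index='city', columns='item', values='num')
              .rename_axis(columns=None)
              .reset_index())
print(result)
```

Using rename_axis(columns=None) before reset_index leaves a first column that is actually named city, rather than a leftover axis name printed above the index.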