How can I use Python pandas to parse a CSV into the format I want?

I am new to Python pandas. I have a CSV file like this:

insectName   count   weather  location   time        date      Condition
  aaa         15      sunny   balabala  0900:1200   1990-02-10     25
  bbb         10      sunny   balabala  0900:1200   1990-02-10     25
  ccc         20      sunny   balabala  0900:1200   1990-02-10     25
  ddd         50      sunny   balabala  0900:1200   1990-02-10     25
  ...        ...      ...      ...        ...            ...       ...
  XXX         40      sunny   balabala  1300:1500   1990-02-15     38
  yyy         10      sunny   balabala  1300:1500   1990-02-15     38
  yyy         25      sunny   balabala  1300:1500   1990-02-15     38

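The file contains a lot of data, and an insectName may repeat within a single day. I want to reshape the data by 'date' so that each day's observations end up on one row, like this: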

insectName  count  insectName  count  insectName  count  weather  location  time        date      Condition
  ccc         20      bbb       10       aaa        15    sunny   balabala  0900:1200   1990-02-10     25
  yyy         25      yyy       10       XXX        40    sunny   balabala  1300:1500   1990-02-15     38
  ...        ...      ...      ...       ...        ...    ...      ...        ...            ...        ...     

How should I do this?

There is a groupby/cumcount/unstack trick for converting a long-format DataFrame into a wide-format one:

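import pandas as pd

# Whitespace-delimited file; the raw string avoids an invalid-escape warning
df = pd.read_csv('data', sep=r'\s+')

# Columns that are identical for every row of one observation period
common = ['weather', 'location', 'time', 'date', 'Condition']
grouped = df.groupby(common)
# Number the rows within each group: 0, 1, 2, ...
df['idx'] = grouped.cumcount()
df2 = df.set_index(common + ['idx'])
# Move each idx value into its own (insectName, count) column pair
df2 = df2.unstack('idx')
df2 = df2.swaplevel(0, 1, axis=1)
# DataFrame.sortlevel was removed in pandas 1.0; sort on idx only so that
# insectName stays ahead of count within each pair
df2 = df2.sort_index(axis=1, level=0, sort_remaining=False)
df2.columns = df2.columns.droplevel(0)
df2 = df2.reset_index()
print(df2)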

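yields (recent pandas versions may print the count columns as floats, because the missing pair is filled with NaN)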

  weather  location       time        date  Condition insectName  count  \
0   sunny  balabala  0900:1200  1990-02-10         25        aaa     15   
1   sunny  balabala  1300:1500  1990-02-15         38        XXX     40   

  insectName  count insectName  count insectName  count  
0        bbb     10        ccc     20        ddd     50  
1        yyy     10        yyy     25        NaN    NaN  
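
To see what each step of the trick contributes, here is a minimal self-contained sketch. The inline sample is an assumption: it is a shortened copy of the rows in the question, loaded through io.StringIO instead of a file, and the names raw, idx_values and wide are only illustrative.

```python
import io
import pandas as pd

# Shortened copy of the question's data (assumed sample, not the real file)
raw = """insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
common = ["weather", "location", "time", "date", "Condition"]

# cumcount numbers the rows inside each group, in original row order
df["idx"] = df.groupby(common).cumcount()
idx_values = df["idx"].tolist()  # [0, 1, 2, 0, 1]

# unstack turns each idx value into its own column under
# insectName and count; groups with fewer rows are padded with NaN
wide = df.set_index(common + ["idx"]).unstack("idx")
print(wide.shape)  # (2, 6): two groups, three slots for each of two fields
```

The idx column is what makes the index unique within a group, which unstack needs; without it, repeated keys such as the two yyy rows in the question would raise a duplicate-entry error.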

While the wide format may be useful for presentation, note that the long format is usually the right format for data processing. See Hadley Wickham's article on the virtues of tidy data (PDF).
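
As a rough illustration of why the long format is easier to process, the sketch below aggregates counts directly with groupby. The frame long_df is an assumed, hand-typed subset of the question's columns, not the full data.

```python
import pandas as pd

# Hand-typed subset of the question's long-format data (assumption)
long_df = pd.DataFrame({
    "insectName": ["aaa", "bbb", "ccc", "XXX", "yyy", "yyy"],
    "count": [15, 10, 20, 40, 10, 25],
    "date": ["1990-02-10"] * 3 + ["1990-02-15"] * 3,
})

# Total insects observed on each date
per_date = long_df.groupby("date")["count"].sum()

# Total per insect per date; the repeated yyy rows are summed together
per_insect = long_df.groupby(["date", "insectName"])["count"].sum()

print(per_date)
print(per_insect)
```

In the wide layout the same totals would require collecting every count column first, since the number of insectName/count pairs differs from row to row.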