How can I use Python pandas to parse a CSV into the format I want?
I'm new to Python pandas. I have a CSV file like this:
insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
... ... ... ... ... ... ...
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
The file has a lot of data, and an insectName may repeat within a day.
I want to transform the data so that each 'date' takes up a single row,
like this:
insectName count insectName count insectName count weather location time date Condition
ccc 20 bbb 10 aaa 15 sunny balabala 0900:1200 1990-02-10 25
yyy 25 yyy 10 XXX 40 sunny balabala 1300:1500 1990-02-15 38
... ... ... ... ... ... ... ... ... ... ...
How can I do this?
There is a groupby/cumcount/unstack trick for converting a long-format
DataFrame into a wide-format DataFrame:
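import pandas as pd
df = pd.read_table('data', sep=r'\s+')
common = ['weather', 'location', 'time', 'date', 'Condition']
grouped = df.groupby(common)
# Number the rows within each group: 0, 1, 2, ...
df['idx'] = grouped.cumcount()
df2 = df.set_index(common+['idx'])
# Move idx from the row index into the columns
df2 = df2.unstack('idx')
# Make idx the outer column level, then group the columns by idx so that
# each insectName/count pair sits side by side
df2 = df2.swaplevel(0, 1, axis=1)
# sortlevel() was removed from pandas; sort_index() is its replacement
df2 = df2.sort_index(axis=1, level=0, sort_remaining=False)
df2.columns = df2.columns.droplevel(0)
df2 = df2.reset_index()
print(df2)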
yields
weather location time date Condition insectName count \
0 sunny balabala 0900:1200 1990-02-10 25 aaa 15
1 sunny balabala 1300:1500 1990-02-15 38 XXX 40
insectName count insectName count insectName count
0 bbb 10 ccc 20 ddd 50
1 yyy 10 yyy 25 NaN NaN
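For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same groupby/cumcount/unstack idea. It makes a few assumptions: the sample rows from the question are pasted inline (the elided "..." rows are left out), `long_to_wide` is a hypothetical helper name rather than anything pandas provides, and the column order is set explicitly instead of by sorting.

```python
import io
import pandas as pd

# Inline copy of the sample rows from the question (elided rows omitted).
raw = """insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
"""


def long_to_wide(df, common, values):
    """Hypothetical helper: one row per `common` group, `values` side by side."""
    out = df.copy()
    # Position of each row within its group: 0, 1, 2, ...
    out['idx'] = out.groupby(common).cumcount()
    # Pivot idx into the columns; columns become (value name, idx) pairs.
    wide = out.set_index(common + ['idx'])[values].unstack('idx')
    # Order the columns explicitly: all `values` for idx 0, then idx 1, ...
    n = out['idx'].max() + 1
    wide = wide[[(v, i) for i in range(n) for v in values]]
    # Keep only the value names, so they repeat across the row.
    wide.columns = [v for v, _ in wide.columns]
    return wide.reset_index()


df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
common = ['weather', 'location', 'time', 'date', 'Condition']
wide = long_to_wide(df, common, ['insectName', 'count'])
print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, which is also why the count columns come back as floats.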
While the wide format may be useful for presentation, note that the long format
is usually the right format for data processing. See Hadley Wickham's article on the virtues of tidy data (PDF).
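To illustrate why the long format tends to be easier to work with, here is a small sketch that totals the counts per insect per day. The frame below is rebuilt by hand from the question's sample rows, and the names `long_df` and `totals` are only illustrative.

```python
import pandas as pd

# Long-format rows taken from the question's sample (only the columns needed here).
long_df = pd.DataFrame({
    'insectName': ['aaa', 'bbb', 'ccc', 'ddd', 'XXX', 'yyy', 'yyy'],
    'count': [15, 10, 20, 50, 40, 10, 25],
    'date': ['1990-02-10'] * 4 + ['1990-02-15'] * 3,
})

# One groupby on the long frame gives the total count per insect per day.
totals = long_df.groupby(['date', 'insectName'], as_index=False)['count'].sum()
print(totals)
```

In the wide layout the same question would mean scanning every repeated insectName/count pair in each row, whereas here the two yyy rows on 1990-02-15 simply collapse into one total of 35.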