Convert awkwardly formatted data into a DataFrame with pandas (Python)
I have some data in the following CSV format:
Variable 1
Time | Value
Time1 | 12
Time2 | 32
Time3 | 4
Time4 | 5
Time5 | 34
Time6 | 5
Time7 | 46
Time8 | 7
Time9 | 8
Time10 | 543
Variable 2
Time | Value
Time1 | 1 | 2 | 3
Time2 | 2 | 45 | 5
Time3 | 4 | 2 | 54
Time4 | 3 | 1 | 2
Time5 | 3 | 2 | 4
Time6 | 4 | 5 | 8
Time7 | 4 | 7 | 4
Time8 | 8 | 65 | 12
Time9 | 12 | 8 | 14
Time10 | 65 | 65 | 13
Variable 3
Time | Value
Time1 | 3
Time2 | 4
Time3 | 5
Time4 | 2
Time5 | 1
Time6 | 7
Time7 | 5
Time8 | 3
Time9 | 5
Time10 | 7
and I would like to get it into a pandas DataFrame in the following format:
       Variable1  Variable2  Variable3
Time1  12         [1,2,3]    3
Time2  32         [2,45,5]   4
Time3  4          [4,2,54]   5
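How would I go about doing this? I know it's a horrible format, and don't ask me why it's like that, but I'm stuck with it. I really have no idea where to start. TIA
Updated code based on the comments
The initial file read is based on the code from here: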
import pandas as pd

file = r'C:\Test\TIMBER-1100-10M.csv'

# Rows have different numbers of fields, so find the widest row first
with open(file, 'r') as temp_f:
    col_count = [len(line.split(",")) for line in temp_f]

# Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = list(range(max(col_count)))

# Read everything as text so short rows are padded with NaN and numbers are not turned into floats
df = pd.read_csv(file, header=None, delimiter=",", names=column_names, dtype=str)

# Preparing the dataframe for the pivot: tag every row with the 'VARIABLE:' header above it
is_header = df[0].str.contains('VARIABLE:', na=False)
df['Variable'] = df[0].where(is_header).ffill()

# Drop the 'Timestamp' and 'VARIABLE:' header rows
drop_values = ['Timestamp', 'VARIABLE:']
df2 = df[~df[0].str.contains('|'.join(drop_values), na=False)].copy()

# Join the non-empty value cells of each row into one comma-separated string
value_cols = [col for col in df2.columns if col not in (0, 'Variable')]
df2['Value'] = df2[value_cols].apply(lambda row: ','.join(row.dropna()), axis=1)
df2 = df2.rename(columns={0: 'Time'})

# Creating the pivot as the final dataframe
# ('first' because each Time/Variable pair is expected to occur only once)
pivot = (
    df2.pivot_table(index='Time', columns='Variable', values='Value', aggfunc='first')
    .rename_axis(None, axis=1)
    .reset_index()
)
pivot.to_excel(r'C:\Test\temp1.xlsx')
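For reference, here is a minimal sketch of one way to reach the target layout directly. It assumes the file really looks like the simplified sample in the question: block headers that start with `Variable`, a `Time,Value` row under each header, and integer readings. The function name `parse_blocks` and the inline `SAMPLE` text are made up for illustration. The code above looks for `VARIABLE:` and `Timestamp` instead, so the string checks would need adjusting for the real file.

```python
import csv
import io

import pandas as pd

# Hypothetical sample mirroring the layout in the question (comma-separated, ragged rows)
SAMPLE = """\
Variable 1
Time,Value
Time1,12
Time2,32
Time3,4
Variable 2
Time,Value
Time1,1,2,3
Time2,2,45,5
Time3,4,2,54
Variable 3
Time,Value
Time1,3
Time2,4
Time3,5
"""


def parse_blocks(handle):
    """Return a DataFrame indexed by time label, with one column per variable block."""
    data = {}        # variable name -> {time label: value or list of values}
    current = None   # name of the block currently being read
    for row in csv.reader(handle):
        row = [cell.strip() for cell in row if cell.strip()]
        if not row:
            continue                      # blank line
        if row[0].startswith("Variable"):
            current = row[0].replace(" ", "")
            data[current] = {}
        elif row[0] == "Time":
            continue                      # the "Time,Value" header row
        elif current is not None:
            values = [int(v) for v in row[1:]]
            # A single reading stays a scalar, several readings become a list
            data[current][row[0]] = values[0] if len(values) == 1 else values
    # Building each column as a Series keeps list cells intact
    return pd.DataFrame({name: pd.Series(cells) for name, cells in data.items()})


result = parse_blocks(io.StringIO(SAMPLE))
print(result)
```

With a real file, `parse_blocks(open(path, newline=""))` would take the place of the `io.StringIO` wrapper. Keeping lists inside cells matches the requested output, but it makes later numeric work awkward, so splitting the multi-value variable into separate columns may be worth considering.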