Operations on columns across multiple files in Pandas
I am trying to perform some arithmetic operations in Python Pandas and merge the results into one of the files.
Path_1: File_1.csv, File_2.csv, ....
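This path contains several files, and more are expected to be added at regular time intervals. They have the following columns:
File_1.csv     | File_2.csv
Nos,12:00:00   | Nos,12:30:00
123,1451       | 485,5464
656,4544       | 456,4865
853,5484       | 658,4584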
Path_2: Master_1.csv
Nos,00:00:00
123,2000
485,1500
656,1000
853,2500
456,4500
658,5000
I am trying to read n .csv files from Path_1 and compare the time-series header of col[1] in each file with the time series of col[last] in Master_1.csv.
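If Master_1.csv does not have that time, it should create a new column named after the time series from the path_1 .csv file and update the values according to col['Nos'], subtracting them from col[1] of Master_1.csv.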
If the column with the time from the path_1 file is already present, then look up col['Nos'] and replace the NaN with the subtracted value relative to Master_1.csv.
That is, the expected output in Master_1.csv is:
Nos,00:00:00,12:00:00,12:30:00
123,2000,549,NAN
485,1500,NAN,3964
656,1000,3544,NAN
853,2500,2984,NAN
456,4500,NAN,365
658,5000,NAN,-416
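I understand the arithmetic, but I am not able to loop over Nos and the time series. I have tried to put some code together and to work around the looping problem. I need help with this. Thanks.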
import pandas as pd
import numpy as np
path_1 = '/'
path_2 = '/'
df_1 = pd.read_csv(os.path_1('/.*csv'), Index=None, columns=['Nos', 'timeseries'] #times series is different in every file eg: 12:00, 12:30, 17:30 etc
df_2 = pd.read_csv('master_1.csv', Index=None, columns=['Nos', '00:00:00']) #00:00:00 time series
for Nos in df_1 and df_2:
    df_1['Nos'] = df_2['Nos']
    new_tseries = df_2['00:00:00'] - df_1['timeseries']
merged.concat('master_1.csv', Index=None, columns=['Nos', '00:00:00', 'new_tseries'], axis=0) # new_timeseries is the dynamic time series that every .csv file will have from path_1
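This can be done in three steps:
- Read your CSVs into a list of DataFrames.
- Merge the DataFrames together (the equivalent of a SQL left join or an Excel VLOOKUP).
- Calculate the derived columns using vectorized subtraction.
Here is some code you can try: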
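# Read each CSV in path_1 (the directory from the question) into a list of DataFrames
import glob
import os
import pandas as pd

L = []
for fname in sorted(glob.glob(os.path.join(path_1, '*.csv'))):
    L.append(pd.read_csv(fname))

# Read the master DataFrame and left-merge the other DataFrames on 'Nos'
df_2 = pd.read_csv('master_1.csv')
for df in L:
    df_2 = pd.merge(df_2, df, on='Nos', how='left')

# For each merged time column, calculate the difference from the master column
time_cols = df_2.columns.difference(['Nos', '00:00:00'])
df_2[time_cols] = df_2[time_cols].sub(df_2['00:00:00'], axis=0)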
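The merge above does not cover the second case in the question, where the time column already exists in the master and only its NaN values should be filled. Below is a minimal, self-contained sketch of one way to handle both cases. It rests on a few assumptions: the function name add_differences and the in-memory CSV strings are hypothetical stand-ins for the real files, each file has exactly two columns ('Nos' plus one time column), 'Nos' is unique within each file, and the difference is taken as file value minus master value, as in the code above. Under that convention row 123 comes out as -549 rather than the 549 shown in the expected output, while the other five rows match.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for Master_1.csv, File_1.csv and File_2.csv,
# filled with the sample values from the question.
MASTER_CSV = "Nos,00:00:00\n123,2000\n485,1500\n656,1000\n853,2500\n456,4500\n658,5000\n"
FILE_CSVS = [
    "Nos,12:00:00\n123,1451\n656,4544\n853,5484\n",
    "Nos,12:30:00\n485,5464\n456,4865\n658,4584\n",
]


def add_differences(master, frames, base_col="00:00:00"):
    """Return a copy of master with one difference column per frame.

    Assumes every frame has 'Nos' plus a single time column, and that the
    wanted value is (file value - base_col value) for matching 'Nos'.
    """
    out = master.set_index("Nos")  # new object, so master itself is not modified
    for frame in frames:
        time_col = frame.columns.drop("Nos")[0]
        # Align the file's values to the master's 'Nos' order; missing 'Nos' become NaN.
        values = frame.set_index("Nos")[time_col].reindex(out.index)
        diff = values - out[base_col]
        if time_col in out.columns:
            # Column already present: keep existing values, fill only the NaN gaps.
            out[time_col] = out[time_col].combine_first(diff)
        else:
            # Column not present yet: create it.
            out[time_col] = diff
    return out.reset_index()


master = pd.read_csv(io.StringIO(MASTER_CSV))
frames = [pd.read_csv(io.StringIO(text)) for text in FILE_CSVS]
result = add_differences(master, frames)
print(result)
```

With real files, the frames list could be built with glob as in the code above, and the result written back with result.to_csv('master_1.csv', index=False). Calling add_differences again with a later file for an existing time column only fills the rows that are still NaN.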