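Pandas column mathematical operations: no error, no answer
I am trying to perform some simple mathematical operations on the files.
The columns in file_1.csv below are dynamic in nature: the number of columns increases from time to time, so we cannot rely on a fixed last_column.
master_ids.csv: before any preprocessing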
Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345
master_count.csv: before any processing
Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
master_Ids.csv: after one preprocessing step
Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
master_count.csv: expected output (append/merge)
Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,550
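For example, Ids 1234 occurs 2 times, so the value of Ids 1234 at the current time (00:30:00), which is 500, is divided by the count of occurrences of that Ids value and then added to the corresponding ref1 value, and a new column is created with the current time as its name.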
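The rule just described can be written out as a small worked example. This is only a sketch under assumptions: the two files are replaced by inline CSV text copied from the tables above, and the variable names `occurrences` and `current` are illustrative, not part of the original program.

```python
import io
import pandas as pd

# Inline copies of the sample tables (assumption: stands in for the CSV files).
master_count = pd.read_csv(io.StringIO("""Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""))
master_ids = pd.read_csv(io.StringIO("""Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
"""))

# Number of rows sharing each Ids value (2 for every id in this sample).
occurrences = master_count.groupby("Ids")["ref1"].transform("count")

# Current-time value looked up for each row's Ids.
current = master_count["Ids"].map(master_ids.set_index("Ids")["00:30:00"])

# New column = ref1 + (current-time value / number of occurrences).
master_count["00:30:00"] = master_count["ref1"] + current / occurrences
print(master_count)
```

For Ids 1234 this gives 500 + 500/2 = 750 for London and 100 + 500/2 = 350 for Prague.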
master_Ids.csv: after another preprocessing step
Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
master_count.csv: expected output after another execution (merge/append)
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,550,600
So here the current time is 00:45:00: we divide the current time value by the count of Ids occurrences and then add the result to the corresponding ref1 values, creating a new column named after the new current time.
Program (by Jianxun Li):
import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add the column to every time series
def my_func(group):
    num_obs = len(group)
    # process the columns from the next time series onwards (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
The program executes with no error and no output. I need some suggestions for a fix.
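This program assumes that master_counts.csv and master_ids.csv are updated over time, and it is meant to be robust to the timing of the updates. That is, it should produce correct results if it is run multiple times on the same update or if an update is missed.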
import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:,:5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

# columns 0 and 1 are Ids and ref; every later column is a time column
for i in range( 2, len(master_ids.columns) ):
    master_counts = master_counts.merge( master_ids.iloc[:,[0,i]], on='Ids' )
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:,-1] = master_counts['ref1'] + master_counts.iloc[:,-1]/count

master_counts.to_csv('master_counts.csv',index=False)
%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
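The robustness claim can be checked in memory. What follows is a minimal sketch rather than the exact code above: `update_counts` is a hypothetical helper, it looks values up with `Series.map` instead of `merge`, and inline CSV text stands in for the two files.

```python
import io
import pandas as pd

def update_counts(master_counts: pd.DataFrame, master_ids: pd.DataFrame) -> pd.DataFrame:
    """Rebuild every time column from scratch (hypothetical helper)."""
    # Keep only the five base columns so results from earlier runs are discarded.
    out = master_counts.iloc[:, :5].copy()
    count = out.groupby("Ids")["ref1"].transform("count")
    lookup = master_ids.set_index("Ids")
    # Columns 0 and 1 of master_ids are Ids and ref; the rest are time columns.
    for col in master_ids.columns[2:]:
        out[col] = out["ref1"] + out["Ids"].map(lookup[col]) / count
    return out

counts = pd.read_csv(io.StringIO("""Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""))
ids = pd.read_csv(io.StringIO("""Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""))

# First run picks up both time columns at once (as if one update had been missed).
once = update_counts(counts, ids)
# Running again on the already-updated table must not change anything.
twice = update_counts(once, ids)
print(once.equals(twice))
```

Because each run starts again from the five base columns, re-running on the same update recomputes the same values instead of stacking a second adjustment on top of the first.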
import pandas as pd
import numpy as np
csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
Out[53]:
      00:00:00  00:30:00  00:45:00
Ids
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
Out[81]:
         Name   lat   lon  00:00:00
Ids
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400
# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])
Out[55]:
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200
# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process the columns from 00:30:00 onwards (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
Out[104]:
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
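The Out[...] tables above come from an interactive session, which echoes the value of each expression. Run as a plain script, the same statements display nothing, because result is only assigned and never printed or written out. The sketch below shows one way to make the result visible. It rests on assumptions: inline CSV text replaces the two files, a vectorized `transform("count")` replaces `my_func`, and an in-memory buffer stands in for an output file.

```python
import io
import pandas as pd

# Inline stand-ins for the two CSV files shown in the session above.
df1 = pd.read_csv(io.StringIO("""Ids,00:00:00,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
""")).set_index("Ids")
df2 = pd.read_csv(io.StringIO("""Ids,Name,lat,lon,00:00:00
1234,London,40.4,10.1,500
1234,Prague,40.4,10.1,500
2341,NewYork,60.6,30.3,700
2341,Austria,60.6,30.3,700
7352,Japan,70.7,80.8,500
7352,China,70.7,80.8,500
8435,Paris,50.5,20.2,400
8435,Berlin,50.5,20.2,400
""")).set_index("Ids").sort_index()

temp = df2.join(df1.iloc[:, 1:])

# Rows per Ids, aligned with temp row by row.
num_obs = temp.groupby(level="Ids")["00:00:00"].transform("count")

result = temp.copy()
for col in temp.columns[4:]:
    # time value / occurrences + base column, as in my_func
    result[col] = temp[col] / num_obs + temp["00:00:00"]

# A script has to emit the result explicitly.
print(result)
buffer = io.StringIO()  # assumption: replace with a real path to save to disk
result.to_csv(buffer)
```

Calling `result.to_csv` with a file path instead of the buffer would persist the new columns between runs.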
My suggestion is to reformat your data so that it looks like this:
Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
Then after your "first preprocess" it would look like this:
Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500
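...and so on. The idea is that you should make a single column to hold the time information, and then for each preprocessing step insert the new data as new rows, giving those rows a value in the time column that indicates which time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.
I haven't fully followed how you are computing the new ref1 values, but the point is that doing it this way is likely to simplify your life greatly. In general, instead of adding an unbounded number of new columns, it is usually nicer to add a single new column whose values are the ones you were going to use as headers for the open-ended new columns.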
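A rough sketch of that long layout in pandas, assuming the column names Ids, ref0, time and ref1 from the example above and the 00:45:00 values from the question. The names `long_df`, `new_batch` and `wide` are illustrative only.

```python
import io
import pandas as pd

# Long layout after the first preprocessing step: one row per (Ids, time) pair.
long_df = pd.read_csv(io.StringIO("""Ids,ref0,time,ref1
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500
"""))

# A later preprocessing step appends rows instead of adding a column.
new_batch = pd.DataFrame({
    "Ids": [1234, 8435, 2341, 7352],
    "ref0": [1000, 5243, 563, 345],
    "time": "00:45:00",
    "ref1": [100, 200, 400, 600],
})
long_df = pd.concat([long_df, new_batch], ignore_index=True)

# Pivot only when a wide view with one column per time is actually needed.
wide = long_df.pivot(index="Ids", columns="time", values="ref1")
print(wide)
```

With this layout the set of columns never changes, so code that reads the file does not need to guess where the last column is.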