Pandas column mathematical operations: no error, no answer

Question

I am trying to perform some simple mathematical operations on two files.

The columns in file_1.csv below are dynamic: the number of columns increases from time to time, so we cannot rely on a fixed last_column.

master_ids.csv : before any preprocessing

Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv : before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_Ids.csv : after one preprocessing run

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv : expected output (Append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,550

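For example: Ids 1234 appears 2 times. The value of Ids 1234 at the current time (00:30:00) is 500; it is divided by the number of occurrences of that Ids, then added to the corresponding ref1 value, and the result goes into a new column named after the current time.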
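As a sanity check on the rule just described, here is a minimal sketch that applies it to the sample data. It assumes the tables are read from in-memory strings via `io.StringIO` rather than from the real files, and the `Series.map` lookup is just one possible way to fetch each row's current-time value; neither is part of the original program.

```python
import io
import pandas as pd

# Sample data copied from the tables above (first preprocessing run only).
ids_csv = """Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
"""
count_csv = """Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""
master_ids = pd.read_csv(io.StringIO(ids_csv))
master_count = pd.read_csv(io.StringIO(count_csv))

# Number of rows each Ids has in master_count (2 for every Ids here).
occurrences = master_count.groupby("Ids")["ref1"].transform("size")

# Look up each row's current-time value by Ids, then apply the rule:
# new value = ref1 + current-time value / occurrences
current = master_count["Ids"].map(master_ids.set_index("Ids")["00:30:00"])
master_count["00:30:00"] = master_count["ref1"] + current / occurrences

print(master_count)
```

Under this rule the 1234/London row works out to 500 + 500/2 = 750, and the 7352/China row to 300 + 500/2 = 550.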

master_Ids.csv : after another preprocessing run

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv : expected output after another execution (Merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,550,600

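So here the current time is 00:45:00. We divide the current-time value by the number of occurrences of each Ids, add the result to the corresponding ref1 value, and store it in a new column named after the new current time.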

Program (by Jianxun Li):

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# divide by the number of occurrences of each Ids
# and add the base column (position 3) to every time-series column
def my_func(group):
    num_obs = len(group)
    # process every time-series column (position 4 onward)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program runs without errors but produces no output. Any suggestions for fixing it would be appreciated.

Answer 1

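This program assumes that both master_counts.csv and master_ids.csv are updated over time, and it should be robust to the timing of those updates. That is, it should produce correct results if it is run multiple times on the same update or if an update is missed.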

import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:, :5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

# columns 0 and 1 of master_ids are Ids and ref; time columns start at position 2
for i in range(2, len(master_ids.columns)):
    col = master_ids.columns[i]
    master_counts = master_counts.merge(master_ids.iloc[:, [0, i]], on='Ids')
    # number of rows sharing each Ids
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    # assign by name so the merged integer column can hold float results
    master_counts[col] = master_counts['ref1'] + master_counts[col] / count

master_counts.to_csv('master_counts.csv', index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
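
The claim that rerunning on the same update is harmless can be checked with a small in-memory sketch. The function name `update_counts` and the `io.StringIO` inputs are assumptions made for illustration, and the lookup uses `Series.map` rather than the `merge` shown above; the arithmetic is the same.

```python
import io
import pandas as pd

def update_counts(master_counts, master_ids):
    """Rebuild the time columns of master_counts from master_ids."""
    # Keep only the five base columns so every run starts from the same state.
    out = master_counts.iloc[:, :5].copy()
    # Number of rows sharing each Ids, aligned with out's index.
    count = out.groupby("Ids")["ref1"].transform("size")
    lookup = master_ids.set_index("Ids")
    # Columns from position 2 onward in master_ids are the time columns.
    for col in master_ids.columns[2:]:
        out[col] = out["ref1"] + out["Ids"].map(lookup[col]) / count
    return out

master_ids = pd.read_csv(io.StringIO("""Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""))
master_counts = pd.read_csv(io.StringIO("""Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""))

once = update_counts(master_counts, master_ids)
# Feeding the already-updated table back in should change nothing.
twice = update_counts(once, master_ids)
print(twice)
print(once.equals(twice))
```

Because the time columns are discarded and recomputed from `master_ids` on every run, a repeated or skipped run does not accumulate errors.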

Answer 2

import pandas as pd
import numpy as np

csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

Out[53]: 
      00:00:00  00:30:00  00:45:00
Ids                               
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

Out[81]: 
         Name   lat   lon  00:00:00
Ids                                
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])



Out[55]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200

# divide by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group



result = temp.groupby(level='Ids').apply(my_func)

Out[104]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
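
The `Out[...]` displays above come from an interactive session. When the same statements run as a script, nothing is echoed, so `result` has to be printed or written to a file explicitly. The sketch below is one possible script form of the same computation, assuming in-memory copies of the two tables shown above; it also replaces `groupby(...).apply` with `transform("size")`, which is a substitution for illustration and not the code used in this answer.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Ids,00:00:00,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
""")).set_index("Ids")

df2 = pd.read_csv(io.StringIO("""Ids,Name,lat,lon,00:00:00
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,500
8435,Berlin,50.5,20.2,400
2341,Austria,60.6,30.3,700
7352,China,70.7,80.8,500
""")).set_index("Ids").sort_index()

# Same join as above: drop df1's duplicated 00:00:00 column first.
temp = df2.join(df1.iloc[:, 1:])

# Rows per Ids and the base column, as plain arrays so that the
# duplicated index values play no role in the arithmetic.
num_obs = temp.groupby(level="Ids")["Name"].transform("size").to_numpy()
base = temp["00:00:00"].to_numpy()

result = temp.copy()
for col in temp.columns[4:]:  # every column after 00:00:00
    result[col] = temp[col].to_numpy() / num_obs + base

# A script must print (or save) the result explicitly.
print(result)
```

Replacing the final `print` with `result.to_csv(...)` would persist the table instead of displaying it.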

Answer 3

My suggestion is to reformat your data so that it looks like this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None

Then after your "first preprocess" it would become this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500

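...and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data as new rows, giving those rows a value in the time column that indicates which time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.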

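I haven't fully followed how you are computing the new ref1 values, but the point is that doing it this way is likely to simplify your life considerably. In general, instead of adding an unbounded number of new columns, it is often better to add a single new column whose values are the ones you would otherwise have used as headers for the open-ended new columns.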
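
A minimal sketch of that reshaping in pandas, assuming the wide table from the question is available as a DataFrame. The column names `time` and `ref1` follow the layout suggested above, and the 01:00:00 values are invented purely to illustrate appending a new time period.

```python
import io
import pandas as pd

# Wide layout: one column per time period (as in the question).
wide = pd.read_csv(io.StringIO("""Ids,ref0,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""))

# Wide -> long: one row per (Ids, time) pair; the old headers become values.
long_df = wide.melt(id_vars=["Ids", "ref0"], var_name="time", value_name="ref1")

# A later preprocess appends rows instead of adding a column.
# The 01:00:00 numbers are made up for this example.
new_rows = pd.DataFrame({
    "Ids": [1234, 8435, 2341, 7352],
    "ref0": [1000, 5243, 563, 345],
    "time": "01:00:00",
    "ref1": [10, 20, 30, 40],
})
long_df = pd.concat([long_df, new_rows], ignore_index=True)

# Long -> wide again whenever a column-per-time view is needed.
wide_again = long_df.pivot(index="Ids", columns="time", values="ref1")

print(long_df)
print(wide_again)
```

With this layout the set of columns never changes, so code that reads the file does not need to discover which column is the latest one.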

python

csv

datetime

multiple-columns

pandas