Interpolate specific entries in DataFrame depending on groups

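I have AIS data for several trips from Rotterdam to Hamburg. The route is divided into 6 sectors with predefined sector borders, and I need to know when and where the vessels enter the next sector. I tried using only the last record within a sector, but the resolution of the data is not high enough. So I want to interpolate the time and latitude at the longitude of each sector border.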

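You can see the borders I chose for this trip in the picture below. The longitude of a crossing is always exactly on the border line. What I need to determine is the latitude at which the vessel crosses that line.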

My DataFrame looks like this:

       TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

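The border between sector 1 and sector 2 is predefined at longitude 4.8, so for every trip and sector border I want to interpolate the latitude and time at longitude 4.8. I guess a good solution would involve something like df.groupby(['TripID', 'SectorID']).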

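I tried adding one entry per trip and sector that contains only the longitude of the sector border and then calling interpolate, but adding the entries took about an hour, and interpolating the missing values crashed immediately.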

The result I am looking for should look like this:

       TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
8      42     9.5   53.05     4.8        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
9       5    10.3   53.06     4.8        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

I would also be happy with, and able to work with, a result that looks like this:

 TripID  SectorID  leave_lat  leave_lon  leave_time
 42      1         53.05      4.8        9.5
 5       1         53.06      4.8        10.3

If my description of the problem is unclear, please ask.

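Since the usual pandas crowd has not spotted this nice question, here is a solution, with some caveats. This is the sample input I used; the asterisk annotations are notes for the reader and not part of the file: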

TripID  time  Latitude Longitude  
42       7    52.9     4.4        
42       8    53.0     4.6        
42       9    53.0     4.7 * missing value
42      10    53.1     4.9 
42      11    53.2     4.9         
42      12    53.3     5.3 * missing value
42      15    53.7     5.6    
5        9    53.0     4.5        
5       10    53.0     4.7  * missing value
5       11    53.2     5.0       
5       12    53.4     5.2        
5       14    53.6     5.3  * missing value
5       17    53.4     5.5        
5       18    53.3     5.7  
34      19    53.0     4.5  
34      20    53.0     4.7          
34      24    53.9     4.8  ** value already exists
34      25    53.8     4.9        
34      27    53.8     5.3        
34      28    53.8     5.3  * missing value
34      31    53.7     5.6        
34      32    53.6     5.7 

This code:

import numpy as np
import pandas as pd

#import data
#sep=r"\s+" splits on any whitespace; delim_whitespace is deprecated in recent pandas
df = pd.read_csv("test.txt", sep=r"\s+")

#set floating point output precision to prevent excessively long columns
pd.set_option("display.precision", 2)
#remember original column order
cols = df.columns
#define the sector borders
sectors = [4.8, 5.4]

#create all combinations of sector borders and TripIDs
dfborders = pd.DataFrame(index = pd.MultiIndex.from_product([df.TripID.unique(), sectors], names = ["TripID", "Longitude"])).reset_index()
#delete those combinations of TripID and Longitude that already exist in the original dataframe
dfborders = pd.merge(df, dfborders, on = ["TripID", "Longitude"], how = "right")
dfborders = dfborders[dfborders.isnull().any(axis = 1)]
#insert missing data points
df = pd.concat([df, dfborders])
#and sort dataframe to insert the missing data points in the right position
df = df[cols].groupby("TripID", sort = False).apply(pd.DataFrame.sort_values, ["Longitude", "time", "Latitude"])

#temporarily set longitude as index for value-based interpolation
df.set_index(["Longitude"], inplace = True, drop = False)
#interpolate group-wise
df = df.groupby("TripID", sort = False).apply(lambda g: g.interpolate(method = "index"))
#create sector ID column assuming that longitude is between -180 and +180
df["SectorID"] = np.digitize(df["Longitude"], bins = [-180] + sectors + [180])
#and reset index
df.reset_index(drop = True, inplace = True)
print(df)

produces the following output:

    TripID   time  Latitude  Longitude  SectorID
0       42   7.00     52.90        4.4         1
1       42   8.00     53.00        4.6         1
2       42   9.00     53.00        4.7         1
3       42   9.50     53.05        4.8         2 * interpolated data point
4       42  10.00     53.10        4.9         2
5       42  11.00     53.20        4.9         2
6       42  12.00     53.30        5.3         2
7       42  13.00     53.43        5.4         3 * interpolated data point
8       42  15.00     53.70        5.6         3
9        5   9.00     53.00        4.5         1
10       5  10.00     53.00        4.7         1
11       5  10.33     53.07        4.8         2 * interpolated data point
12       5  11.00     53.20        5.0         2
13       5  12.00     53.40        5.2         2
14       5  14.00     53.60        5.3         2
15       5  15.50     53.50        5.4         3 * interpolated data point
16       5  17.00     53.40        5.5         3
17       5  18.00     53.30        5.7         3
18      34  19.00     53.00        4.5         1
19      34  20.00     53.00        4.7         1
20      34  24.00     53.90        4.8         2
21      34  25.00     53.80        4.9         2
22      34  27.00     53.80        5.3         2
23      34  28.00     53.80        5.3         2
24      34  29.00     53.77        5.4         3 * interpolated data point
25      34  31.00     53.70        5.6         3
26      34  32.00     53.60        5.7         3
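
The step that does the actual work is the index-based interpolation. As a minimal sketch, here it is in isolation on two records of trip 42 around the 4.8 border (the tiny frame `g` is made up for illustration), together with the `np.digitize` call that assigns the sector:

```python
import numpy as np
import pandas as pd

# Two records of trip 42 around the 4.8 border, plus an empty row at the border
g = pd.DataFrame(
    {"time": [9.0, np.nan, 10.0], "Latitude": [53.0, np.nan, 53.1]},
    index=pd.Index([4.7, 4.8, 4.9], name="Longitude"),
)

# method="index" weights by the numeric index values, not by row position
filled = g.interpolate(method="index")
print(filled)

# np.digitize puts a value that lies exactly on a border into the higher sector
sector_ids = np.digitize([4.7, 4.8, 4.9], bins=[-180, 4.8, 5.4, 180])
print(sector_ids)
```

Because `np.digitize` treats the left bin edge as inclusive by default, the interpolated rows end up in the sector that is being entered (SectorID 2 above), whereas the desired output in the question lists them under the sector that is being left.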

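Now the caveats. I don't know how to add the missing rows in place. I will ask a question about how to do this, and if I get an answer I will update mine here. Until then, the side effect is that the table is sorted by Longitude within each TripID, which assumes that Longitude never decreases. That may not actually be the case.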
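
One possible way around the sorting side effect, as a rough sketch rather than a tested solution: keep the records in time order and look for consecutive records of the same trip that lie on opposite sides of a border, then interpolate linearly between just those two records. The helper name `border_crossings` and the `leave_*` column names (borrowed from the question) are hypothetical. Records that sit exactly on a border are not reported as crossings here, on the assumption that they already provide the value.

```python
import pandas as pd

def border_crossings(df, borders):
    """Interpolate time and latitude where consecutive records straddle a border."""
    df = df.sort_values(["TripID", "time"])
    # the following record of the same trip, aligned on the same index
    nxt = df.groupby("TripID")[["time", "Latitude", "Longitude"]].shift(-1)
    rows = []
    for border in borders:
        # negative product: this record and the next lie on opposite sides
        crossed = ((df["Longitude"] - border) * (nxt["Longitude"] - border)) < 0
        a, b = df[crossed], nxt[crossed]
        frac = (border - a["Longitude"]) / (b["Longitude"] - a["Longitude"])
        rows.append(pd.DataFrame({
            "TripID": a["TripID"],
            "leave_lon": border,
            "leave_lat": a["Latitude"] + frac * (b["Latitude"] - a["Latitude"]),
            "leave_time": a["time"] + frac * (b["time"] - a["time"]),
        }))
    return pd.concat(rows).sort_values(["TripID", "leave_time"]).reset_index(drop=True)

# the example data from the question
sample = pd.DataFrame({
    "TripID": [42, 42, 42, 42, 5, 5, 5, 5],
    "time": [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude": [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
})
crossings = border_crossings(sample, borders=[4.8, 5.4])
print(crossings)
```

Since only neighbouring records are compared, a vessel that turns back and crosses the same border several times would produce one row per crossing instead of breaking the interpolation.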

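I solved the problem in a different way. Since this solved the problem for me but is not exactly the solution I asked for, I will accept Mr. T's answer. I am posting this anyway for completeness, so here is my solution: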

Starting from the DataFrame df in my question

        TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

I used this code

# sort so that shift(-1) returns the next record in time within each trip
df = df.sort_values(['TripID', 'time'])

# attach the following record of the same trip to every row
df['next_lat'] = df.groupby('TripID')['Latitude'].shift(-1)
df['next_lon'] = df.groupby('TripID')['Longitude'].shift(-1)
df['next_time'] = df.groupby('TripID')['time'].shift(-1)
df['next_sector_id'] = df.groupby('TripID')['SectorID'].shift(-1)

# last record of every sector within a trip (copy to avoid SettingWithCopyWarning)
lasts = df[df['SectorID'] != df['next_sector_id']].copy()

# longitude of the border that is crossed when leaving sector 1
lasts.loc[lasts['SectorID'] == 1, 'sector_leave_lon'] = 4.8

# linear interpolation between the last record in the sector and the next record;
# rows without a sector_leave_lon stay NaN
frac = (lasts['sector_leave_lon'] - lasts['Longitude']) / (lasts['next_lon'] - lasts['Longitude'])
lasts['sector_leave_lat'] = lasts['Latitude'] + frac * (lasts['next_lat'] - lasts['Latitude'])
lasts['sector_leave_time'] = lasts['time'] + frac * (lasts['next_time'] - lasts['time'])

# copy the values back (aligned on the index) and spread them over the whole sector
df['sector_leave_lat'] = lasts['sector_leave_lat']
df['sector_leave_time'] = lasts['sector_leave_time']
df['sector_leave_lat'] = df.groupby(['TripID', 'SectorID'])['sector_leave_lat'].transform('last')
df['sector_leave_time'] = df.groupby(['TripID', 'SectorID'])['sector_leave_time'].transform('last')

df = df.drop(['next_lat', 'next_lon', 'next_time', 'next_sector_id'], axis=1)

which gives a result like this

        TripID  time  Latitude Longitude  SectorID  sector_leave_lat  sector_leave_time
0      42       7    52.9     4.4        1          53.05              9.5
1      42       8    53.0     4.6        1          53.05              9.5
2      42       9    53.0     4.7        1          53.05              9.5
3      42      10    53.1     4.9        2          NaN               NaN
4       5       9    53.0     4.5        1          53.06             10.3
5       5      10    53.0     4.7        1          53.06             10.3
6       5      11    53.2     5.0        2          NaN               NaN
7       5      12    53.3     5.2        2          NaN               NaN
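
If the compact per-sector table from the question is preferred, the repeated values can be collapsed afterwards. A small sketch, assuming the column names shown in the table above; the frame `result` is rebuilt by hand from that table for illustration and leaves out the position columns:

```python
import numpy as np
import pandas as pd

# the relevant columns of the result table, typed in by hand
result = pd.DataFrame({
    "TripID": [42, 42, 42, 42, 5, 5, 5, 5],
    "SectorID": [1, 1, 1, 2, 1, 1, 2, 2],
    "sector_leave_lat": [53.05, 53.05, 53.05, np.nan, 53.06, 53.06, np.nan, np.nan],
    "sector_leave_time": [9.5, 9.5, 9.5, np.nan, 10.3, 10.3, np.nan, np.nan],
})

# keep one row per trip and sector that actually has a crossing
summary = (
    result.dropna(subset=["sector_leave_time"])
    .drop_duplicates(subset=["TripID", "SectorID"])
    .rename(columns={"sector_leave_lat": "leave_lat", "sector_leave_time": "leave_time"})
    .reset_index(drop=True)
)
print(summary)
```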

I hope this helps anyone for whom the actual solution does not solve their problem.