Contact Tracing in Python - working with timeseries
Say I have timeseries data (time on the x-axis, coordinates in the y-z plane).
Given a seed set of infected users, I want to find all users who come within distance d of a point from the seed set within time t. This is basically just contact tracing.
What is a smart way of accomplishing this?
The naive approach would be something like this:
points_at_end_of_iteration = []
for p in seed_set:
    other_ps = find_points_t_time_away(t)
    points_at_end_of_iteration += find_points_d_distance_away_from_set(other_ps)
What is a smarter way of doing this - preferably keeping all of the data in RAM (though I'm not sure that's feasible)? Is Pandas a good choice? I've also been looking at Bandicoot, but it doesn't seem able to do this for me.
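For reference, a minimal sketch of what the naive check for a single seed observation could look like with pandas, assuming the records fit in a DataFrame with columns uid, timestamp, lat and lon (those column names, the helper name and the flat Euclidean distance in degrees are illustrative assumptions, not part of the data):

# Sketch only: naive "who was near this seed observation" filter with pandas.
# Assumes a DataFrame `df` with columns uid, timestamp (datetime64), lat, lon.
import numpy as np
import pandas as pd

def contacts_of_point(df, seed_time, seed_lat, seed_lon, t_window, d_degrees):
    # keep records within the time window around the seed observation
    near_in_time = df[(df["timestamp"] - seed_time).abs() <= t_window]
    # crude distance check in degrees; use the haversine formula for real distances
    dist = np.sqrt((near_in_time["lat"] - seed_lat) ** 2 +
                   (near_in_time["lon"] - seed_lon) ** 2)
    return set(near_in_time.loc[dist <= d_degrees, "uid"])

# e.g. contacts_of_point(df, pd.Timestamp("2015-05-01 05:22:25"),
#                        12.111, -12.111, pd.Timedelta("1h"), 0.001)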
Let me know if I can improve the question - maybe it's too broad.
Edit:
I think the algorithm I gave above is flawed.
Is this better:
for user, time, pos in infected_set:
    info = get_next_info(user, time)  # info will be a tuple: (t, pos)
    intersecting_users = find_intersecting_users(user, time, delta_t, pos, delta_pos)  # intersect if close enough to the user's pos/time
    infected_set.update(intersecting_users)
    update_infected_set(user, info)  # change last_time and last_pos (described below)
infected_set
I think it should actually be a hashmap: {user_id: {last_time: ..., last_pos: ...}, user_id2: ...}
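For example, the state map might look like this, with an update helper in the spirit of update_infected_set above (the field names are just the ones suggested here):

# Sketch only: infected users tracked as {user_id: {"last_time": ..., "last_pos": ...}}
infected_set = {
    123: {"last_time": "2015-05-01 05:22:25", "last_pos": (12.111, -12.111)},
}

def update_infected_set(infected_set, user_id, info):
    # info is the (time, pos) tuple returned by get_next_info in the sketch above
    t, pos = info
    infected_set[user_id] = {"last_time": t, "last_pos": pos}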
One potential problem is that users are treated independently, so user 2's next timestamp might be hours or days after user 1's.
Contact tracing might be easier if I interpolate, so that every user has data for every point in time (say, every hour), although that would greatly increase the amount of data.
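If interpolation is the way to go, pandas can do the resampling per user; a rough sketch, assuming the columns uid, timestamp, lat, lon derived from the data sample below:

# Sketch only: resample each user's track onto an hourly grid and interpolate positions.
# Assumes a DataFrame with columns uid, timestamp (datetime64), lat, lon.
import pandas as pd

def interpolate_hourly(df):
    frames = []
    for uid, track in df.groupby("uid"):
        track = (track.set_index("timestamp")
                      .sort_index()[["lat", "lon"]]
                      .resample("1h")
                      .mean()
                      .interpolate(method="time"))
        track["uid"] = uid
        frames.append(track)
    return pd.concat(frames).reset_index()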
Data Format/Sample
user_id = 123
timestamp = 2015-05-01 05:22:25
position = 12.111,-12.111 # lat,long
There is one csv file containing all of the records:
uid1,timestamp1,position1
uid1,timestamp2,position2
uid2,timestamp3,position3
and also a directory of files (same format), one file per user:
records/uid1.csv
records/uid2.csv
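A possible way to load the combined csv with pandas - the file name is hypothetical, and it assumes the position is written as two unquoted comma-separated numbers, so each row splits into four fields:

# Sketch only: load all records into one DataFrame.
import pandas as pd

records = pd.read_csv(
    "records.csv",                             # hypothetical name for the combined file
    names=["uid", "timestamp", "lat", "lon"],  # "position" splits into lat, lon
    parse_dates=["timestamp"],
)
records = records.sort_values(["uid", "timestamp"])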
First solution, with interpolation:
# I would use a shelf (a persistent, dictionary-like object,
# included with Python).
import shelve

# hashmap of clean users indexed by timestamp:
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
clean_users = shelve.open("clean_users.dat")
# Load the data into clean_users from the csv (shelve uses the same syntax as
# a hashmap). You will interpolate the data (only the data for a given
# timestamp needs to be in memory at any one time). Note: the timestamp must
# be a string.

# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")

# for each iteration
for iteration in range(1, N):
    # compute the current timestamp; because we interpolated, every user has a
    # location at this timestamp
    current_timestamp = timestamp_from_iteration(iteration)
    # get clean users for this iteration (in memory)
    current_clean_users = clean_users[current_timestamp]
    # get infected users for this iteration (in memory)
    current_infected_users = infected_users[current_timestamp]
    # new infected users for this iteration
    new_infected_users = dict()
    # compute the new infected users for this iteration from current_clean_users
    # and current_infected_users, then store the result in new_infected_users
    # remove the users in new_infected_users from clean_users
    # add the users in new_infected_users to infected_users

# close the shelves
infected_users.close()
clean_users.close()
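To make the "compute the new infected users" step concrete, one possible version, using a crude degree-based distance threshold for illustration (a real implementation would likely use the haversine distance):

# Sketch only: fill in the "compute the new infected users" step above.
import math

def close_enough(pos_a, pos_b, d_degrees):
    # crude flat-Earth distance in degrees; swap in haversine for real use
    return math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1]) <= d_degrees

def new_infections(current_clean_users, current_infected_users, d_degrees):
    new_infected = dict()
    for uid, pos in current_clean_users.items():
        if any(close_enough(pos, infected_pos, d_degrees)
               for infected_pos in current_infected_users.values()):
            new_infected[uid] = pos
    return new_infected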
Second solution, without interpolation:
# I would use a shelf (a persistent, dictionary-like object,
# included with Python).
import shelve

# hashmap of clean users indexed by timestamp:
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
clean_users = shelve.open("clean_users.dat")
# Load the data into clean_users from the csv (shelve uses the same syntax as
# a hashmap). Note: the timestamp must be a string.

# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")

# for each iteration (not time-related as in the previous version);
# you could also stop when an iteration produces no new infected users
for iteration in range(1, N):
    # new infected users for this iteration
    new_infected_users = dict()
    # get the timestamps from infected_users
    for an_infected_timestamp in infected_users.keys():
        # get the infected users for this timestamp
        current_infected_users = infected_users[an_infected_timestamp]
        # get the relevant timestamps from clean_users
        for a_clean_timestamp in clean_users.keys():
            if time_stamp_in_delta(an_infected_timestamp, a_clean_timestamp):
                # get the clean users for this clean timestamp
                current_clean_users = clean_users[a_clean_timestamp]
                # compute the infected users from current_clean_users and
                # current_infected_users, then append the result to
                # new_infected_users
    # remove the users in new_infected_users from clean_users
    # add the users in new_infected_users to infected_users

# close the shelves
infected_users.close()
clean_users.close()
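And a possible time_stamp_in_delta helper for the version above, assuming the timestamps are stored as strings in the format from the data sample:

# Sketch only: decide whether two string timestamps are within delta_t of each other.
from datetime import datetime, timedelta

DELTA_T = timedelta(hours=1)   # illustrative threshold
FMT = "%Y-%m-%d %H:%M:%S"      # matches "2015-05-01 05:22:25"

def time_stamp_in_delta(ts_a, ts_b, delta_t=DELTA_T):
    a = datetime.strptime(ts_a, FMT)
    b = datetime.strptime(ts_b, FMT)
    return abs(a - b) <= delta_t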