Efficiently check if dataframe has date between a range, and return a count

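Suppose we have a dataframe df containing a list of dates in chronological order.

The goal is to obtain, for each date in df, the number of people whose date range (start to end) contains that date.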

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)], 
                        'num_people_on_day': [0,0,0]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5)],
                        'end': [datetime.date(2007, 12, 6), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30)]})

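So for each date in df, how can I efficiently check it against all of dg, count the matching rows, and put that count into df?

I'm not even sure a merge is necessary here (I'm also trying to save memory), and I really want this to run as fast as possible.

EDIT: OK, so I came up with a different way to do this, but I dislike using apply. Is there a way to implement this new approach without using .apply?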

import pandas as pd
import datetime

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5),
                                  datetime.date(2007, 11, 6)],
                        'end': [datetime.date(2007, 12, 1), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30),
                                datetime.date(2007, 12, 2)]})

def get_num_persons(date, vec_starts, vec_ends):
    """
    Helper function for .apply to get number of persons.
    For each given date, check whether it lies between each
    person's start and end date (inclusive). The bitwise AND
    combines both boolean vectors, and .sum() counts the True values.
    """
    return (((vec_starts <= date) & (vec_ends >= date)).sum())

def num_of_persons(persons, dates_df):
    """
    Obtains the number of persons for each day.
    """
    dates_df['num_persons'] = dates_df['date'].apply(lambda row: 
                                                   get_num_persons(row, 
                                                   persons['start'],
                                                   persons['end']))
    return dates_df

num_of_persons(dg, df.copy())

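With enough memory, merge and then count the dates that fall between start and end. The .reindex ensures that dates with no matches still appear, with a count of 0.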

#df['date'] = pd.to_datetime(df.date)
#dg['start'] = pd.to_datetime(dg.start)
#dg['end'] = pd.to_datetime(dg.end)

m = df[['date']].assign(k=1).merge(dg.assign(k=1))

(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0)
   .rename('num_people_on_day').reset_index())

         date  num_people_on_day
0  2007-12-01                  1
1  2007-12-02                  1
2  2007-12-03                  1
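
The same between-check can also be written with NumPy broadcasting instead of a merge. The following is only a sketch, assuming the three-person sample data from the question and that all date columns convert cleanly with pd.to_datetime. Like the merge, it builds one comparison per (date, person) pair, so memory still grows with len(df) * len(dg).

```python
import datetime

import pandas as pd

df = pd.DataFrame({'date': [datetime.date(2007, 12, 1),
                            datetime.date(2007, 12, 2),
                            datetime.date(2007, 12, 3)]})
dg = pd.DataFrame({'person': ['Alice', 'Bob', 'Chuck'],
                   'start': [datetime.date(2007, 11, 5),
                             datetime.date(2007, 12, 8),
                             datetime.date(2007, 1, 5)],
                   'end': [datetime.date(2007, 12, 6),
                           datetime.date(2008, 1, 3),
                           datetime.date(2007, 11, 30)]})

# Convert to datetime64 arrays so the comparisons are vectorized.
dates = pd.to_datetime(df['date']).to_numpy()[:, None]  # shape (n_dates, 1)
starts = pd.to_datetime(dg['start']).to_numpy()         # shape (n_people,)
ends = pd.to_datetime(dg['end']).to_numpy()             # shape (n_people,)

# Broadcasting yields an (n_dates, n_people) boolean matrix:
# inside[i, j] is True when date i lies within person j's range (inclusive).
inside = (starts <= dates) & (dates <= ends)

# Summing across people gives the count for each date.
df['num_people_on_day'] = inside.sum(axis=1)
print(df)
```

On the sample data this gives 1 for each of the three dates, in line with the merge output above.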

Another option is to use apply. It is a loop, so performance will suffer as df grows.

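def get_len(x, dg):
    # Assumes 'date', 'start' and 'end' are datetime64 (see the pd.to_datetime lines above).
    try:
        loc = dg.index.get_loc(x)
    except KeyError:  # Deal with dates that have 0
        return 0
    # get_loc can return a single integer position when exactly one interval
    # matches; only a slice or array result should be measured with len().
    if isinstance(loc, slice) or getattr(loc, 'ndim', 0) > 0:
        return len(dg.iloc[loc])
    return 1

dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)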

To illustrate the timings, look first at your small set, and then at a larger df.

%%timeit 
m = df[['date']].assign(k=1).merge(dg.assign(k=1))
(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0)
   .rename('num_people_on_day').reset_index())
#9.39 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)
#4.06 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But once df is longer (even just 90 rows), the difference becomes apparent.

df = pd.DataFrame({'date': pd.date_range('2007-01-01', '2007-03-31')})

%%timeit merge
#9.78 ms ± 75.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit IntervalIndex
#65.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Alternative: expand each person's range into one row per day, then count rows per date.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
frames = [pd.DataFrame({'date': pd.date_range(row.start, row.end).date,
                        'name': row.person})
          for row in dg.itertuples(index=False)]
data_df = pd.concat(frames, ignore_index=True)

df['date'] = pd.to_datetime(df['date']).dt.date
counts = (data_df.groupby('date', as_index=False)['name'].count()
                 .rename(columns={'name': 'count'}))

# The left merge keeps every date in df; dates nobody covers come back as NaN, so fill with 0.
final_df = pd.merge(df[['date']], counts, on='date', how='left')
final_df['count'] = final_df['count'].fillna(0).astype(int)
print(final_df)
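
If memory is the main concern, the counts can also be obtained without building any date-by-person table. A date d is covered by (number of starts <= d) minus (number of ends < d), provided every range satisfies start <= end. The sketch below uses np.searchsorted on sorted copies of the two columns; count_people_per_day is a hypothetical helper name, and the sample data is the four-person dg from the question's edit.

```python
import datetime

import numpy as np
import pandas as pd


def count_people_per_day(dates, starts, ends):
    """Hypothetical helper: count ranges [start, end] containing each date.

    Assumes start <= end for every range and that all inputs can be
    converted with pd.to_datetime.
    """
    dates = pd.to_datetime(pd.Series(dates)).to_numpy()
    sorted_starts = np.sort(pd.to_datetime(pd.Series(starts)).to_numpy())
    sorted_ends = np.sort(pd.to_datetime(pd.Series(ends)).to_numpy())

    # side='right' counts the starts that are <= date.
    started = np.searchsorted(sorted_starts, dates, side='right')
    # side='left' counts the ends that are strictly < date.
    finished = np.searchsorted(sorted_ends, dates, side='left')
    return started - finished


df = pd.DataFrame({'date': [datetime.date(2007, 12, 1),
                            datetime.date(2007, 12, 2),
                            datetime.date(2007, 12, 3)]})
dg = pd.DataFrame({'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                   'start': [datetime.date(2007, 11, 5),
                             datetime.date(2007, 12, 8),
                             datetime.date(2007, 1, 5),
                             datetime.date(2007, 11, 6)],
                   'end': [datetime.date(2007, 12, 1),
                           datetime.date(2008, 1, 3),
                           datetime.date(2007, 11, 30),
                           datetime.date(2007, 12, 2)]})

df['num_persons'] = count_people_per_day(df['date'], dg['start'], dg['end'])
print(df)
```

This needs only two sorts and two binary searches per date, so it avoids both .apply and the cross join. I have not timed it against the approaches above.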