Counting the page visited immediately before a specific page in a time-ordered page-view dataframe

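I created the following dataframe, which lists the pages each user visited in ascending order of visit date. There are 5 pages in total: BLQ2_1 to BLQ2_5.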

user_id  created_at  PAGE  
72672    2017-02-20  BLQ2_1
72672    2017-03-03  BLQ2_5
72672    2017-03-03  BLQ2_3
72672    2017-03-05  BLQ2_4
12370    2017-03-06  BLQ2_4
12370    2017-03-06  BLQ2_5
12370    2017-03-06  BLQ2_3
94822    2017-03-06  BLQ2_2
94822    2017-03-10  BLQ2_4
94822    2017-03-10  BLQ2_5
94822    2017-02-24  BLQ2_4
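
For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: the rows are copied from the table in the same order, and converting `created_at` to a datetime column is an assumption about how the original frame was typed.

```python
import pandas as pd

# Rows copied from the table above, in the same order.
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "created_at": [
        "2017-02-20", "2017-03-03", "2017-03-03", "2017-03-05",
        "2017-03-06", "2017-03-06", "2017-03-06",
        "2017-03-06", "2017-03-10", "2017-03-10", "2017-02-24",
    ],
    "PAGE": [
        "BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
        "BLQ2_4", "BLQ2_5", "BLQ2_3",
        "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4",
    ],
})

# Assumption: the dates are stored as datetimes rather than strings.
df["created_at"] = pd.to_datetime(df["created_at"])
print(df)
```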

For each page, I want statistics on the previous page visited, across all users. That is, I need to compute statistics for each page such as:

Path to BLQ2_5 is: 2 times from BLQ2_4 and 1 time from BLQ2_1.

Path to BLQ2_3 is: 2 times from BLQ2_5 and 1 time from BLQ2_4.

Path to BLQ2_4 is: 1 time from BLQ2_5, 1 time from BLQ2_3, 1 time from BLQ2_2, and 1 time from nowhere.

Do I have to use a loop for this? Or is there a way to take advantage of pandas' groupby functionality? Any suggestions?

Here is my solution using a for loop:

import pandas as pd

# collect the per-user pieces in lists and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
pg_BLQ2_5, pg_BLQ2_4, pg_BLQ2_3, pg_BLQ2_2, pg_BLQ2_1 = [], [], [], [], []
first_pages = []

for user_id in df['user_id'].unique():
    # get only the current user's records, and reset the index
    _pg = df[df['user_id'] == user_id].reset_index(drop=True)

    # the first page visited has no previous page, so treat it differently
    first_pages.append(_pg.iloc[[0]])

    # page visited immediately after each row (NaN for the user's last row)
    next_page = _pg['PAGE'].shift(-1)

    # for each page, keep the records that immediately precede a visit to it
    pg_BLQ2_5.append(_pg[next_page == 'BLQ2_5'])
    pg_BLQ2_4.append(_pg[next_page == 'BLQ2_4'])
    pg_BLQ2_3.append(_pg[next_page == 'BLQ2_3'])
    pg_BLQ2_2.append(_pg[next_page == 'BLQ2_2'])
    pg_BLQ2_1.append(_pg[next_page == 'BLQ2_1'])

first_pages = pd.concat(first_pages)
pg_BLQ2_5 = pd.concat(pg_BLQ2_5)
pg_BLQ2_4 = pd.concat(pg_BLQ2_4)
pg_BLQ2_3 = pd.concat(pg_BLQ2_3)
pg_BLQ2_2 = pd.concat(pg_BLQ2_2)
pg_BLQ2_1 = pd.concat(pg_BLQ2_1)
First, create a column holding the previous page (assuming the dataframe is sorted by user, then by date):

import numpy as np

df['prev'] = df['PAGE'].shift()
# blank out the previous page where the row above belongs to a different user
df['prev'] = df['prev'].where(df['user_id'].shift() == df['user_id'], np.nan)
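
As a possible alternative to shifting and then masking, the shift can be done inside each user group, so the first row of every user becomes NaN on its own. The sketch below uses a shortened sample of the question's rows (two users only) and assumes the rows are already in time order within each user.

```python
import pandas as pd

# Shortened sample: the first two users from the question's table.
df = pd.DataFrame({
    "user_id": [72672, 72672, 72672, 72672, 12370, 12370, 12370],
    "PAGE": ["BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
             "BLQ2_4", "BLQ2_5", "BLQ2_3"],
})

# Shift PAGE by one row within each user's rows; a user's first visit
# has no predecessor, so its 'prev' value is NaN.
df["prev"] = df.groupby("user_id")["PAGE"].shift()
print(df)
```

This version does not need the users' rows to be contiguous, only correctly ordered within each user.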

Then just group by the page and count the values:

df.groupby('PAGE')['prev'].value_counts()

PAGE    prev  
BLQ2_3  BLQ2_5    2
BLQ2_4  BLQ2_2    1
        BLQ2_3    1
        BLQ2_5    1
BLQ2_5  BLQ2_4    2
        BLQ2_1    1
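
Note that `value_counts` drops NaN by default, so first visits (the "from nowhere" case in the question) do not appear in the output above. One way to include them, sketched here, is to fill the missing previous page with a placeholder before counting. The label `"nowhere"` is only an illustrative choice, and the frame is rebuilt from the question's table so the snippet runs on its own.

```python
import pandas as pd

# user_id and PAGE columns copied from the question's table, same row order.
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "PAGE": ["BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
             "BLQ2_4", "BLQ2_5", "BLQ2_3",
             "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4"],
})

# Previous page per user; first visits get the placeholder label instead of NaN.
df["prev"] = df.groupby("user_id")["PAGE"].shift().fillna("nowhere")

# Count how often each previous page led to each page.
counts = df.groupby("PAGE")["prev"].value_counts()
print(counts)
```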

You can also reshape the result, for example with unstack.
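
A minimal sketch of that reshaping, again rebuilding the question's data so it runs standalone: `unstack` moves the `prev` level into columns, giving one row per page and one column per previous page. Passing `fill_value=0` is an assumption about the desired result (zeros rather than NaN for transitions that never occur).

```python
import pandas as pd

# user_id and PAGE columns copied from the question's table, same row order.
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "PAGE": ["BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
             "BLQ2_4", "BLQ2_5", "BLQ2_3",
             "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4"],
})
df["prev"] = df.groupby("user_id")["PAGE"].shift()

counts = df.groupby("PAGE")["prev"].value_counts()

# Rows: page visited; columns: page visited immediately before it.
table = counts.unstack(fill_value=0)
print(table)
```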