Efficient way to create subsequences of list Pandas

So I have a dataframe that looks like this:

    user_id movie_embedding_index
0   6   [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0...
1   7   [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0,...
2   8   [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0, ...
3   10  [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2...
4   25  [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1...

Each user_id has a movie history, e.g. [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0], and I want to create multiple sequences from that history which capture the past history and the next movie watched. So for the history [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0] I want to create the following sequences:

past_history   next_movie
[]             998.0
[998.0]        520.0
[998.0,520.0]  755.0
...
[998.0, 520.0, 755.0, 684.0, 13.0] 4248.0
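
In plain Python, the pairs for a single history can be produced like this (just a sketch of what I mean; this is the kind of per-row loop I would like to replace with pandas operations):

history = [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0]

# (past_history, next_movie): the prefix before each position and the movie at that position
pairs = [(history[:i], movie) for i, movie in enumerate(history)]
# -> [([], 998.0), ([998.0], 520.0), ([998.0, 520.0], 755.0), ...]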

I want to build this for all users in the dataframe and end up with a final result like this:

    user_id past_history next_movie
0   6   []             998.0
1   6   [998.0]        520.0
2   6   [998.0,520.0]  755.0
.
.
.

I can think of ways to do this, but they are extremely inefficient and don't use pandas methods. Are there any pandas methods that would help do this more efficiently?

One solution could be to first call apply to perform the computation you want:

import pandas as pd

# Generate an example dataframe
d = {'user_id': [1, 2, 3], 'movie_embedding_index': [[998.0, 520.0, 755.0, 684.0, 13.0, 4248.0], [98.0, 20.0, 55.0, 84.0], [132.0, 5432.0, 97.0, 675.0]]}
df = pd.DataFrame(data=d)

# Calculate lists of past movies and current movie
df['calculation'] = df.movie_embedding_index.apply(lambda x: [(x[:index], elem) for index, elem in enumerate(x, start=0)])

Then apply explode on this calculation column:

df = df.explode('calculation')

And finally retrieve these values as new columns:

df['past_history'] = df['calculation'].apply(lambda x: x[0])
df['next_movie'] = df['calculation'].apply(lambda x: x[1])

The final result then has a past_history and a next_movie column for every user, alongside the original columns.
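
To get exactly the user_id / past_history / next_movie layout from the question, a small cleanup can follow (a sketch that continues from the df built above; dropping the helper columns and resetting the index are additions of mine, not part of the steps above):

# Keep only the columns the question asks for and renumber the rows
result = df.drop(columns=['calculation', 'movie_embedding_index']).reset_index(drop=True)
print(result)  # user_id, past_history, next_movie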

  • The sample data has ellipsis (...) continuation characters; a little massaging turns each entry into a valid list
  • A lambda with a list comprehension builds a dict of watched and next for every position
  • explode() turns the lists from the step above into rows
  • apply(pd.Series) expands the dicts into columns
  • join() back onto the original dataframe
import io
import json
import pandas as pd

df = pd.read_csv(io.StringIO("""    user_id  movie_embedding_index
0   6   [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0...
1   7   [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0,...
2   8   [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0, ...
3   10  [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2...
4   25  [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1..."""), sep=r"\s\s+", engine="python")
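# The strings end with a '...' continuation: strip the trailing dots/comma, re-close the
# bracket, and parse each row with json.loads so the column holds real lists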
df.movie_embedding_index = (df.movie_embedding_index.str.strip(". ,")+"]").apply(lambda s: json.loads(s))

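# For each row: build [{'watched': prefix, 'next': movie}, ...], explode to one dict per row,
# expand the dicts into columns with apply(pd.Series), and join back on the original index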
df = df.join(df.movie_embedding_index.apply(lambda l: [{"watched":l[0:i],"next":m} 
                                                  for i,m in enumerate(l)]).explode().apply(pd.Series))


user_id movie_embedding_index watched next
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [] 998
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0] 520
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0, 520.0] 755
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0, 520.0, 755.0] 684
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0, 520.0, 755.0, 684.0] 13
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0, 520.0, 755.0, 684.0, 13.0] 4248
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0] 1
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [] 1216
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0] 12
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0, 12.0] 148
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0, 12.0, 148.0] 1
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0, 12.0, 148.0, 1.0] 289
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0, 12.0, 148.0, 1.0, 289.0] 64
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0] 110
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [] 40
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0] 199
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0, 199.0] 42
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0, 199.0, 42.0] 316
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0, 199.0, 42.0, 316.0] 96
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0, 199.0, 42.0, 316.0, 96.0] 34
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] [40.0, 199.0, 42.0, 316.0, 96.0, 34.0] 152
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [] 117
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0] 2283
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0] 1
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0, 1.0] 25
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0, 1.0, 25.0] 29
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0, 1.0, 25.0, 29.0] 14
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0] 11
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0] 2
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [] 5263
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0] 117
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0, 117.0] 5003
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0, 117.0, 5003.0] 5086
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0, 117.0, 5003.0, 5086.0] 34
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0, 117.0, 5003.0, 5086.0, 34.0] 152
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0] 1
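
To match the column names asked for in the question, the watched / next columns can be renamed afterwards (a small follow-up sketch, assuming the df produced above):

# Drop the full-history column and rename to the question's layout
result = (df.drop(columns='movie_embedding_index')
            .rename(columns={'watched': 'past_history', 'next': 'next_movie'})
            .reset_index(drop=True))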

We can do this with a simple explode, cumcount and slicing via a lambda. Since we need to apply a method to each row, I can only think of a way that uses apply; maybe someone smarter than me can do better (one possible tweak that drops the row-wise apply is sketched after the output below).

import pandas as pd 

# if your list is not a true list, i.e. it's a string:
# df['movie_embedding_index'] = df['movie_embedding_index'].map(pd.eval)


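# One row per movie: explode the history, then number each movie within its user with cumcount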
s = df.explode('movie_embedding_index')
s = s.assign(seq=s.groupby('user_id').cumcount())

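# Look up each user's full history again, then take the first `seq` movies as the past history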
s['movie_embedding_list'] = s['user_id'].map(df.set_index('user_id')['movie_embedding_index'])
s['movie_embedding_list'] = s.apply(lambda x: x['movie_embedding_list'][:x['seq']], axis=1)

print(s.drop(columns='seq'))

   user_id movie_embedding_index                          movie_embedding_list
0        6                   998                                            []
0        6                   520                                       [998.0]
0        6                   755                                [998.0, 520.0]
0        6                   684                         [998.0, 520.0, 755.0]
0        6                    13                  [998.0, 520.0, 755.0, 684.0]
0        6                  4248            [998.0, 520.0, 755.0, 684.0, 13.0]
0        6                     1    [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0]
1        7                  1216                                            []
1        7                    12                                      [1216.0]
1        7                   148                                [1216.0, 12.0]
1        7                     1                         [1216.0, 12.0, 148.0]
1        7                   289                    [1216.0, 12.0, 148.0, 1.0]
1        7                    64             [1216.0, 12.0, 148.0, 1.0, 289.0]
1        7                   110       [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0]
2        8                    40                                            []
2        8                   199                                        [40.0]
2        8                    42                                 [40.0, 199.0]
2        8                   316                           [40.0, 199.0, 42.0]
2        8                    96                    [40.0, 199.0, 42.0, 316.0]
2        8                    34              [40.0, 199.0, 42.0, 316.0, 96.0]
2        8                   152        [40.0, 199.0, 42.0, 316.0, 96.0, 34.0]
3       10                   117                                            []
3       10                  2283                                       [117.0]
3       10                     1                               [117.0, 2283.0]
3       10                    25                          [117.0, 2283.0, 1.0]
3       10                    29                    [117.0, 2283.0, 1.0, 25.0]
3       10                    14              [117.0, 2283.0, 1.0, 25.0, 29.0]
3       10                    11        [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0]
3       10                     2  [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0]
4       25                  5263                                            []
4       25                   117                                      [5263.0]
4       25                  5003                               [5263.0, 117.0]
4       25                  5086                       [5263.0, 117.0, 5003.0]
4       25                    34               [5263.0, 117.0, 5003.0, 5086.0]
4       25                   152         [5263.0, 117.0, 5003.0, 5086.0, 34.0]
4       25                     1  [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0]
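
One possible way to avoid the row-wise apply: replace the apply(axis=1) line above with a plain list comprehension over the two columns (a sketch that assumes s still holds the full histories from the map step, together with the seq counter):

# Same result as the apply(axis=1) above; zip iterates the two columns directly
# instead of building a row object per record
s['movie_embedding_list'] = [full[:k] for full, k in zip(s['movie_embedding_list'], s['seq'])]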