Efficient way to create subsequences of a list in Pandas
So I have a dataframe that looks like this:
user_id movie_embedding_index
0 6 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0...
1 7 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0,...
2 8 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0, ...
3 10 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2...
4 25 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1...
Each user_id has a movie history, e.g. [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0], and I want to create multiple sequences from this history that capture the past history and the next movie watched. So for the history [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0]
I want to create the following sequences:
past_history next_movie
[] 998.0
[998.0] 520.0
[998.0,520.0] 755.0
...
[998.0, 520.0, 755.0, 684.0, 13.0] 4248.0
I want to build this for all users in the dataframe and get a final result like this:
user_id past_history next_movie
0 6 [] 998.0
1 6 [998.0] 520.0
2 6 [998.0,520.0] 755.0
.
.
.
I can think of ways to do this, but they are extremely inefficient and don't use pandas methods. Are there any pandas methods that would help do this more efficiently?
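For concreteness, the kind of straightforward loop-based approach meant above would look something like this (a minimal sketch, not actual posted code, assuming movie_embedding_index already holds real Python lists and that df is the dataframe shown above):
import pandas as pd

# Loop-based baseline (illustrative only): for each user, slice the history
# list at every position to build (past_history, next_movie) pairs.
rows = []
for user_id, history in zip(df['user_id'], df['movie_embedding_index']):
    for i, next_movie in enumerate(history):
        rows.append({'user_id': user_id,
                     'past_history': history[:i],
                     'next_movie': next_movie})

result = pd.DataFrame(rows, columns=['user_id', 'past_history', 'next_movie'])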
One solution could be to first call apply to do the calculation you want:
import pandas as pd
# Generate an example dataframe
d = {'user_id': [1, 2, 3], 'movie_embedding_index': [[998.0, 520.0, 755.0, 684.0, 13.0, 4248.0], [98.0, 20.0, 55.0, 84.0], [132.0, 5432.0, 97.0, 675.0]]}
df = pd.DataFrame(data=d)
# Calculate lists of past movies and current movie
df['calculation'] = df.movie_embedding_index.apply(lambda x: [(x[:index], elem) for index, elem in enumerate(x, start=0)])
Then apply explode on this calculation column:
df = df.explode('calculation')
And finally retrieve the values as new columns:
df['past_history'] = df['calculation'].apply(lambda x: x[0])
df['next_movie'] = df['calculation'].apply(lambda x: x[1])
Final result: each row now pairs a past_history list with the next_movie that followed it.
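To inspect it (a small sketch, assuming the steps above have been run; result is just an illustrative name), the helper column can be dropped and the first rows printed:
# Drop the intermediate 'calculation' column and reset the exploded index
result = df.drop(columns='calculation').reset_index(drop=True)
print(result[['user_id', 'past_history', 'next_movie']].head(3))
# Expected first rows for the example data above (formatting approximate):
#    user_id    past_history  next_movie
# 0        1              []       998.0
# 1        1         [998.0]       520.0
# 2        1  [998.0, 520.0]       755.0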
- The sample data has ellipsis (...) continuation characters; some massaging turns each entry into a valid list
- Build a dict of watched and next movies in a list comprehension inside a lambda function
- explode() the lists from the step above into rows
- apply(pd.Series) to expand the dicts into columns
- join() back to the original dataframe
import io
import json
import pandas as pd

# Columns in the pasted sample are separated by two or more spaces,
# which is what sep=r"\s\s+" splits on
df = pd.read_csv(io.StringIO("""user_id  movie_embedding_index
0  6   [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0...
1  7   [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0,...
2  8   [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0, ...
3  10  [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2...
4  25  [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1..."""), sep=r"\s\s+", engine="python")

# Strip the trailing ellipsis, re-close the bracket and parse each list as JSON
df.movie_embedding_index = (df.movie_embedding_index.str.strip(". ,") + "]").apply(lambda s: json.loads(s))

# Build {"watched": ..., "next": ...} dicts per row, explode into one row per dict,
# expand the dicts into columns and join back onto the original dataframe
df = df.join(df.movie_embedding_index.apply(lambda l: [{"watched": l[0:i], "next": m}
                                                       for i, m in enumerate(l)]).explode().apply(pd.Series))
|   | user_id | movie_embedding_index | watched | next |
|---|---------|-----------------------|---------|------|
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [] | 998 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0] | 520 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0, 520.0] | 755 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0, 520.0, 755.0] | 684 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0, 520.0, 755.0, 684.0] | 13 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0, 520.0, 755.0, 684.0, 13.0] | 4248 |
| 0 | 6 | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0, 1.0] | [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0] | 1 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [] | 1216 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0] | 12 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0, 12.0] | 148 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0, 12.0, 148.0] | 1 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0, 12.0, 148.0, 1.0] | 289 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0, 12.0, 148.0, 1.0, 289.0] | 64 |
| 1 | 7 | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0, 110.0] | [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0] | 110 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [] | 40 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0] | 199 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0, 199.0] | 42 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0, 199.0, 42.0] | 316 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0, 199.0, 42.0, 316.0] | 96 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0, 199.0, 42.0, 316.0, 96.0] | 34 |
| 2 | 8 | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0, 152.0] | [40.0, 199.0, 42.0, 316.0, 96.0, 34.0] | 152 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [] | 117 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0] | 2283 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0] | 1 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0, 1.0] | 25 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0, 1.0, 25.0] | 29 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0, 1.0, 25.0, 29.0] | 14 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0] | 11 |
| 3 | 10 | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0, 2] | [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0] | 2 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [] | 5263 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0] | 117 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0, 117.0] | 5003 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0, 117.0, 5003.0] | 5086 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0, 117.0, 5003.0, 5086.0] | 34 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0, 117.0, 5003.0, 5086.0, 34.0] | 152 |
| 4 | 25 | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0, 1] | [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0] | 1 |
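If the past_history / next_movie column names from the question are preferred, a rename could follow (a small sketch, assuming the joined dataframe built above):
# Rename the generated columns to match the names used in the question
df = df.rename(columns={"watched": "past_history", "next": "next_movie"})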
We can do this with a simple explode, cumcount, and slicing inside a lambda. Since we need to apply a method to each row, I can only think of a way that uses apply; maybe someone smarter than me can do better.
import pandas as pd
# if your list is not a true list i.e its a string.
# df['movie_embedding_index'] = df['movie_embedding_index'].map(pd.eval)
s = df.explode('movie_embedding_index')
s = s.assign(seq=s.groupby('user_id').cumcount())
s['movie_embedding_list'] = s['user_id'].map(df.set_index('user_id')['movie_embedding_index'])
s['movie_embedding_list'] = s.apply(lambda x: x['movie_embedding_list'][:x['seq']], axis=1)
print(s.drop(columns='seq'))
user_id movie_embedding_index movie_embedding_list
0 6 998 []
0 6 520 [998.0]
0 6 755 [998.0, 520.0]
0 6 684 [998.0, 520.0, 755.0]
0 6 13 [998.0, 520.0, 755.0, 684.0]
0 6 4248 [998.0, 520.0, 755.0, 684.0, 13.0]
0 6 1 [998.0, 520.0, 755.0, 684.0, 13.0, 4248.0]
1 7 1216 []
1 7 12 [1216.0]
1 7 148 [1216.0, 12.0]
1 7 1 [1216.0, 12.0, 148.0]
1 7 289 [1216.0, 12.0, 148.0, 1.0]
1 7 64 [1216.0, 12.0, 148.0, 1.0, 289.0]
1 7 110 [1216.0, 12.0, 148.0, 1.0, 289.0, 64.0]
2 8 40 []
2 8 199 [40.0]
2 8 42 [40.0, 199.0]
2 8 316 [40.0, 199.0, 42.0]
2 8 96 [40.0, 199.0, 42.0, 316.0]
2 8 34 [40.0, 199.0, 42.0, 316.0, 96.0]
2 8 152 [40.0, 199.0, 42.0, 316.0, 96.0, 34.0]
3 10 117 []
3 10 2283 [117.0]
3 10 1 [117.0, 2283.0]
3 10 25 [117.0, 2283.0, 1.0]
3 10 29 [117.0, 2283.0, 1.0, 25.0]
3 10 14 [117.0, 2283.0, 1.0, 25.0, 29.0]
3 10 11 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0]
3 10 2 [117.0, 2283.0, 1.0, 25.0, 29.0, 14.0, 11.0]
4 25 5263 []
4 25 117 [5263.0]
4 25 5003 [5263.0, 117.0]
4 25 5086 [5263.0, 117.0, 5003.0]
4 25 34 [5263.0, 117.0, 5003.0, 5086.0]
4 25 152 [5263.0, 117.0, 5003.0, 5086.0, 34.0]
4 25 1 [5263.0, 117.0, 5003.0, 5086.0, 34.0, 152.0]
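One possible way to avoid the row-wise apply (a sketch only, assuming movie_embedding_index already holds real Python lists and user_id values are unique) is to zip the cumcount positions with the full per-user lists and slice in a plain list comprehension:
# Explode to one row per movie, as above
s = df.explode('movie_embedding_index')
# Position of each movie within its user's history
seq = s.groupby('user_id').cumcount()
# Full history list for each exploded row
full_lists = s['user_id'].map(df.set_index('user_id')['movie_embedding_index'])
# Slice each full list up to the current position; no DataFrame.apply(axis=1) needed
s['movie_embedding_list'] = [lst[:i] for lst, i in zip(full_lists, seq)]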