Basic filtering of data based on user & item in Python SciKit

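I am trying to implement a recommendation system that suggests items to users based on their ratings, which I believe is the most common kind. I read a lot and shortlisted Surprise, a Python scikit-based recommendation system.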

Although I am able to import the data and run predictions, it does not quite do what I want.

What I have now: I can pass a user_id, an item_id and a rating, and get back the probability of that user giving the rating I passed.

What I actually want to do: pass a user_id and get back a list of items that this user is likely to like or rate highly, based on the data.

from surprise import Reader, Dataset
from surprise import SVD

# Define the format
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data from the file using the reader format
data = Dataset.load_from_file('./data/ecomm/e.data', reader=reader)

algo = SVD()

# Build a trainset from the whole dataset and train on it.
trainset = data.build_full_trainset()
algo.fit(trainset)

# Inputs are: user_id, item_id & rating.
# Raw ids read from a file are strings, so they are passed as strings here.
print(algo.predict('3', '107', r_ui=1))

Sample rows from the data file.

First column is user_id, 2nd is item id, 3rd is rating and then timestamp.

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013

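You need to iterate over all possible item_id values for a single user_id and predict the rating for each one. Then you collect the items with the highest predicted ratings and recommend those to the user.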

But make sure the user_id/item_id pairs are not already in the training dataset. Something like this function does that:

build_anti_testset

Return a list of ratings that can be used as a testset in the test() method.

The ratings are all the ratings that are not in the trainset, i.e. all the ratings r_ui where the user u is known, the item i is known, but the rating r_ui is not in the trainset. As r_ui is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings, global_mean.
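For a single user, the same idea can be sketched by hand: walk over every item in the trainset, skip the ones the user has already rated, and predict the rest. This is only a minimal sketch; top_n_for_user is a hypothetical helper name, and it assumes algo is an already fitted Surprise algorithm and trainset is the Trainset it was fitted on.

```python
def top_n_for_user(algo, trainset, raw_uid, n=10):
    """Return the n unrated items with the highest estimated rating for one user.

    Hypothetical helper: `algo` is assumed to be a fitted Surprise algorithm
    and `trainset` the Trainset it was fitted on.
    """
    # Surprise uses inner ids internally; this raises ValueError for unknown users.
    inner_uid = trainset.to_inner_uid(raw_uid)

    # trainset.ur maps an inner user id to a list of (inner item id, rating).
    already_rated = {inner_iid for inner_iid, _ in trainset.ur[inner_uid]}

    scored = []
    for inner_iid in trainset.all_items():
        if inner_iid in already_rated:
            continue  # the pair is in the training data, so skip it
        raw_iid = trainset.to_raw_iid(inner_iid)
        # predict() expects raw ids and returns a Prediction with an `est` field.
        scored.append((raw_iid, algo.predict(raw_uid, raw_iid).est))

    # Highest estimated rating first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

With the data from the question, a call would look like top_n_for_user(algo, trainset, '3'). The id is a string because raw ids loaded from a file are strings.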

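After that, you can pass these pairs to the test() or predict() method, collect the estimated ratings, and take the top N recommendations for a particular user from that data.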

An example is given here:
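A minimal sketch of the approach described above (not necessarily identical to the linked example) could look like the following. The helper names get_top_n and recommend_all are illustrative, and algo and trainset are assumed to be the fitted SVD and the full trainset from the question.

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Group predictions by user and keep the n highest estimates for each.

    Each prediction is assumed to unpack as (uid, iid, r_ui, est, details),
    which is the shape of the Prediction objects returned by algo.test().
    """
    top_n = defaultdict(list)
    for uid, iid, _r_ui, est, _details in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's candidates by estimated rating, highest first.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return dict(top_n)


def recommend_all(algo, trainset, n=10):
    """Predict every (user, item) pair missing from the trainset, then rank."""
    # All pairs where user and item are known but no rating exists.
    anti_testset = trainset.build_anti_testset()
    predictions = algo.test(anti_testset)
    return get_top_n(predictions, n)
```

For the user in the question this would be used as recommend_all(algo, trainset, n=10)['3']. Note that the anti-testset covers every user, so on a large dataset it can be big; if only one user matters, filtering the pairs for that user before calling test() keeps the work small.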