Filter products that have n values in each rating using Python

Question

I'm working with Amazon review data, and I'm still learning Python and DataFrames.

The df looks like this:

ID   Product_ID   Rating  reviewText
1    product_1    1       'ABC...
2    Product_1    1       'ABC...
3    Product_1    4       'ABC...
4    Product_1    5       'ABC...
5    Product_1    3       'ABC...
6    Product_2    3       'ABC...
7    Product_2    1       'ABC...
8    Product_2    1       'ABC...
9    Product_2    5       'ABC...
10   Product_2    2       'ABC...
11   Product_2    4       'ABC...
12   Product_2    4       'ABC...
.
.
.

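I want to keep only the products that have at least n reviews for each rating (ratings are integers from 1 to 5). For example, I want the products with at least 10 reviews for each rating, which means at least 10*5 reviews in total.

The goal is to have a sufficient number of reviews for each rating of each product, and then carry out further NLP analysis. I have been stuck for two days trying to group them and use counts, but I can't get it right. Any hints or help would be much appreciated.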

Answer 1

Here you go, a few simple steps:

Get the review count for each product and rating

reviews_per_rating = df[['Product_ID', 'Rating']].value_counts()

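Check that each product has at least n reviews (here n = 10) for all 5 ratings. Counting the ratings that pass the threshold, rather than calling .all(), also rejects products that have no reviews at all for some rating, since those ratings never appear in the counts

select_product = (reviews_per_rating >= 10).groupby('Product_ID').sum() == 5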

Now get the result as a list

select_product = select_product.index[select_product].to_list()

Finally, filter the products

df.loc[df['Product_ID'].isin(select_product)]
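
For reference, the steps above can be put together into one small function. This is only a sketch: the function name filter_products, its ratings parameter, and the toy demo data are made up for illustration and are not part of the question. It uses groupby/size with unstack instead of value_counts so that ratings with no reviews show up explicitly as a count of 0.

```python
import pandas as pd


def filter_products(df, n, ratings=range(1, 6)):
    """Keep the rows of products with at least n reviews for every rating.

    Hypothetical helper; assumes columns named 'Product_ID' and 'Rating'.
    """
    counts = (
        df.groupby(['Product_ID', 'Rating'])
          .size()                        # reviews per (product, rating)
          .unstack(fill_value=0)         # one column per rating, 0 if absent
          .reindex(columns=list(ratings), fill_value=0)  # force all ratings
    )
    # Products where every rating column reaches the threshold
    keep = counts.index[(counts >= n).all(axis=1)]
    return df[df['Product_ID'].isin(keep)]


# Toy data, invented for this example
rows = []
for rating in range(1, 6):
    rows += [('Product_1', rating)] * 2   # 2 reviews for each rating
for rating in (1, 5):
    rows += [('Product_2', rating)] * 3   # reviews only for ratings 1 and 5
demo = pd.DataFrame(rows, columns=['Product_ID', 'Rating'])

result = filter_products(demo, n=2)
print(result['Product_ID'].unique())
```

With n=2, only the Product_1 rows should remain: Product_2 has enough reviews for ratings 1 and 5 but none for ratings 2, 3 and 4.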

python

pandas

data-science