Counting unique combination frequencies in a market-basket

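I have a set of 1,000,000 market baskets, each containing 1-4 items. I want to count how often each unique combination of items was purchased.

The data is organized as follows: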

[in] print(training_df.head(n=5))

[out]                     product_id
transaction_id                      
0000001                   [P06, P09]
0000002         [P01, P05, P06, P09]
0000003                   [P01, P06]
0000004                   [P01, P09]
0000005                   [P06, P09]

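In this example, [P06, P09] has a frequency of 2 and every other combination has a frequency of 1. I created the following binary matrix and calculated the frequency of each individual item: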

# Create a binary (one-hot) matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ['P{:02d}'.format(i + 1) for i in range(10)]

mlb = MultiLabelBinarizer(classes=product_ids)
# Name the columns to drop by keyword; newer pandas versions no longer
# accept the positional axis form drop('product_id', 1).
training_df1 = training_df.drop(columns='product_id').join(
    pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                 columns=mlb.classes_,
                 index=training_df.index))

# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = int((training_df1[column] > 0).sum())

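How can I calculate the frequency of each unique combination of 1-4 items that occurs in the data?

Well, since you can't use df.groupby('product_id').count() (the column holds lists, which are not hashable), this is the best I can come up with. We build a dictionary keyed by the string representation of each list and count the occurrences.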

counts = dict()
for i in df['product_id']:
    key = repr(i)  # lists are unhashable, so use their string form as the key
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1
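
The loop above can probably be shortened with collections.Counter. The sketch below is only one way to do it: the helper name count_baskets is made up for illustration, and sorting each basket before counting is my own assumption, so that [P06, P09] and [P09, P06] land on the same key (the repr-based key treats them as different).

```python
from collections import Counter


def count_baskets(baskets):
    """Count how many times each exact basket occurs.

    Each basket is converted to a sorted tuple so that it is hashable
    and so that item order inside a basket does not matter.
    """
    return Counter(tuple(sorted(basket)) for basket in baskets)


# The five sample transactions from the question.
sample_baskets = [
    ["P06", "P09"],
    ["P01", "P05", "P06", "P09"],
    ["P01", "P06"],
    ["P01", "P09"],
    ["P06", "P09"],
]

basket_counts = count_baskets(sample_baskets)
for basket, n in basket_counts.most_common():
    print(basket, n)
```

With the DataFrame from the question, count_baskets(df['product_id']) should work the same way, since iterating over the column yields the lists.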

Maybe:

# frozenset is hashable and ignores item order
df['frozensets'] = df['product_id'].apply(frozenset)
df['frozensets'].value_counts()

This creates a column of frozensets from product_id (hashable, and order-insensitive), then counts the occurrences of each unique value.
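
Both approaches here count whole baskets. If the question instead means how often each combination of 1-4 items appears inside any basket (so [P06, P09] would also be counted within [P01, P05, P06, P09]), something like the following might do. It is a rough sketch under that reading of the question, not a tuned implementation; count_itemsets and max_size are names invented for the example.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(baskets, max_size=4):
    """Count, for every item combination of size 1..max_size,
    the number of baskets that contain it."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each combination has one canonical tuple.
        items = sorted(set(basket))
        for size in range(1, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    return counts


# The five sample transactions from the question.
sample_baskets = [
    ["P06", "P09"],
    ["P01", "P05", "P06", "P09"],
    ["P01", "P06"],
    ["P01", "P09"],
    ["P06", "P09"],
]

itemset_counts = count_itemsets(sample_baskets)
print(itemset_counts[("P06", "P09")])  # baskets containing both P06 and P09
print(itemset_counts[("P01",)])        # baskets containing P01
```

Since a basket holds at most 4 items, each one contributes at most 15 combinations, so a single pass over 1,000,000 baskets should stay manageable.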