Counting unique combination frequencies in a market basket
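I have a set of 1,000,000 market baskets, each containing 1-4 items. I want to count how often each unique combination of purchased items occurs.
The data is organized as follows: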
[in] print(training_df.head(n=5))
[out] product_id
transaction_id
0000001 [P06, P09]
0000002 [P01, P05, P06, P09]
0000003 [P01, P06]
0000004 [P01, P09]
0000005 [P06, P09]
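In this example, [P06, P09] has a frequency of 2 and every other combination has a frequency of 1. I created the following binary matrix and computed the frequency of each individual item: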
# Create a matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
product_ids = ['P{:02d}'.format(i+1) for i in range(10)]
mlb = MultiLabelBinarizer(classes=product_ids)
# axis is passed by keyword; newer pandas versions reject the positional form drop('product_id', 1)
training_df1 = training_df.drop('product_id', axis=1).join(
    pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                 columns=mlb.classes_,
                 index=training_df.index))
# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = sum(training_df1[column] > 0)
How can I count the frequency of each unique combination of 1-4 items that appears in the data?
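Well, since you can't use df.groupby('product_id').count() (lists are not hashable), this is the best I can think of. We build a dictionary keyed by the string representation of each list and count how many times each key occurs.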
counts = dict()
for i in df['product_id']:
    key = repr(i)
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1
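One caveat with string keys is that they are order-sensitive: [P06, P09] and [P09, P06] would be counted as different combinations. A minimal sketch of an order-independent variant follows, assuming the baskets are plain Python lists. The `baskets` list is hypothetical sample data modeled on the question, with the last basket's order reversed for illustration, and `collections.Counter` stands in for the hand-written dictionary loop.

```python
from collections import Counter

# Hypothetical sample baskets modeled on the question's example.
baskets = [
    ["P06", "P09"],
    ["P01", "P05", "P06", "P09"],
    ["P01", "P06"],
    ["P01", "P09"],
    ["P09", "P06"],  # same items as the first basket, listed in a different order
]

# Sorting each basket and converting it to a tuple gives a hashable key
# that does not depend on the order in which the items were recorded.
counts = Counter(tuple(sorted(basket)) for basket in baskets)

for combo, n in counts.most_common():
    print(combo, n)
```

Whether order should be ignored depends on how the baskets were recorded; if item order is already consistent, the string-key approach gives the same counts.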
Maybe:
df['frozensets'] = df.apply(lambda row: frozenset(row.product_id),axis=1)
df['frozensets'].value_counts()
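This creates a column of frozensets from product_id (frozensets are hashable and ignore ordering), then counts the occurrences of each unique value.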
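For reference, here is a small self-contained sketch of the frozenset approach run on the five sample rows from the question. The DataFrame construction is an assumption about how the data might be loaded, and `combo_counts` is a name introduced here for illustration. It uses `Series.apply(frozenset)` on the column directly rather than a row-wise `DataFrame.apply`, which should give the same result.

```python
import pandas as pd

# Rebuild the five sample rows shown in the question (assumed layout).
training_df = pd.DataFrame(
    {
        "product_id": [
            ["P06", "P09"],
            ["P01", "P05", "P06", "P09"],
            ["P01", "P06"],
            ["P01", "P09"],
            ["P06", "P09"],
        ]
    },
    index=pd.Index(
        ["0000001", "0000002", "0000003", "0000004", "0000005"],
        name="transaction_id",
    ),
)

# Convert each list to a frozenset (hashable, order-independent),
# then count how often each distinct set occurs.
combo_counts = training_df["product_id"].apply(frozenset).value_counts()

print(combo_counts)
```

Calling `combo_counts.to_dict()` turns the result into a plain dictionary keyed by frozenset, which is convenient for looking up a specific combination.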