Compare values of a dictionary and return a count of matching values

I have a dictionary that maps product names to the unique email addresses of the customers who purchased each item, as shown below:

customer_emails = {
'Backpack':['customer1@gmail.com','customer2@gmail.com','customer3@yahoo.com','customer4@msn.com'], 
'Baseball Bat':['customer1@gmail.com','customer3@yahoo.com','customer5@gmail.com'],
'Gloves':['customer2@gmail.com','customer3@yahoo.com','customer4@msn.com']}

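I am trying to iterate over the values for each key and determine how many emails also appear under the other keys. I converted this dictionary to a DataFrame and got the answer I wanted for a single pair of columns using something like this: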

customers[customers['Baseball Bat'].notna() == True]['Baseball Bat'].isin(customers['Gloves']).sum()

What I want to end up with is a DataFrame that looks essentially like this, so that I can easily use it for a correlation chart.

              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3

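I think the way to do this is to iterate over the customer_emails dictionary, but I am not sure how to pick one key, compare its values against those of every other key, and then store the results.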

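You can first compute, for each product, the number of emails it shares with every product (including itself), and then pass the resulting nested dictionary to pd.DataFrame: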

import pandas as pd
# Keys are listed in the same order as the question so the result matches the output below.
emails = {'Backpack': ['customer1@gmail.com', 'customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com'],
          'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com', 'customer5@gmail.com'],
          'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com']}
# For each pair of products (a, c), count the emails of a that also appear under c.
results = {a: {c: sum(h in j for h in b) for c, j in emails.items()} for a, b in emails.items()}
df = pd.DataFrame(results)

Output:

               Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
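
As a variation on the comprehension above, here is a minimal sketch that converts each list to a set once and counts shared emails with a set intersection. It assumes the emails under each product are unique, as the question states; with duplicates, the counts would differ from the list-based version. The names `email_sets`, `overlap`, and `overlap_df` are illustrative and not part of the original answer.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Build each set once so every pairwise comparison is a cheap intersection.
email_sets = {product: set(addresses)
              for product, addresses in customer_emails.items()}

# Nested dict: outer key becomes the column, inner key becomes the row.
overlap = {
    a: {b: len(set_a & set_b) for b, set_b in email_sets.items()}
    for a, set_a in email_sets.items()
}

overlap_df = pd.DataFrame(overlap)
print(overlap_df)
```

Because the overlap counts are symmetric, it does not matter that the outer keys end up as columns and the inner keys as rows.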

Start with pd.DataFrame.from_dict:

df = pd.DataFrame.from_dict(customer_emails, orient='index').T

df
              Backpack         Baseball Bat               Gloves
0  customer1@gmail.com  customer1@gmail.com  customer2@gmail.com
1  customer2@gmail.com  customer3@yahoo.com  customer3@yahoo.com
2  customer3@yahoo.com  customer5@gmail.com    customer4@msn.com
3    customer4@msn.com                 None                 None
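
The reason for `orient='index'` followed by `.T` is probably that the lists have different lengths. The sketch below illustrates this under that assumption: passing the dictionary straight to the constructor fails, while building one row per product pads the shorter lists with missing values. The names `direct_ok` and `wide` are illustrative only.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Column-wise construction needs equal-length lists, so this raises ValueError.
try:
    pd.DataFrame(customer_emails)
    direct_ok = True
except ValueError:
    direct_ok = False

# Row-wise construction pads short rows; transposing puts products in columns.
wide = pd.DataFrame.from_dict(customer_emails, orient='index').T

print(direct_ok)
print(wide.shape)
print(wide.isna().sum())
```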

Now use stack + get_dummies + sum + dot:

# sum(level=1) was removed in pandas 2.0; groupby(level=1).sum() is the equivalent.
v = df.stack().str.get_dummies().groupby(level=1).sum()
v.dot(v.T)

              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
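
To see why `v.dot(v.T)` gives the counts, it may help to build the product-by-email indicator matrix by hand. In the sketch below (the names `all_emails`, `indicator`, and `cooc` are illustrative), each row is a product and each column an email, with 1 where that customer bought the product. Entry (i, j) of the product of this matrix with its transpose sums, over all emails, the product of the two indicators, which is the number of emails shared by products i and j.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Every distinct email across all products, in a fixed order.
all_emails = sorted({e for emails in customer_emails.values() for e in emails})

# 1 if the email appears under the product, otherwise 0.
indicator = pd.DataFrame(
    [[int(e in emails) for e in all_emails]
     for emails in customer_emails.values()],
    index=list(customer_emails),
    columns=all_emails,
)

# Row i dotted with row j counts the emails common to products i and j.
cooc = indicator.dot(indicator.T)
print(cooc)
```

The diagonal holds each product's own customer count, since a row dotted with itself just counts its ones.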

Alternatively, swap stack for melt to get a bit of extra performance.

v = (df.melt()
       .set_index('variable')['value']
       .str.get_dummies()
       .groupby(level=0).sum()  # replaces sum(level=0), removed in pandas 2.0
)
v.dot(v.T)

variable      Backpack  Baseball Bat  Gloves
variable                                    
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3

Using the same logic, create a Series and then take the set intersection of the lists:

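import numpy as np
import pandas as pd
s = pd.Series(customer_emails)

# Intersection size for every ordered pair of products, reshaped into a square matrix.
pd.DataFrame(np.reshape([len(set(x).intersection(set(y))) for x in s for y in s], (len(s), len(s))), index=s.index, columns=s.index)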
Out[299]: 
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
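
Since the matrix is symmetric, one possible refinement is to compute each unordered pair only once and mirror the result. This is a sketch of that idea, not part of the answer above; `products`, `sets`, and `matrix` are illustrative names.

```python
from itertools import combinations_with_replacement

import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

products = list(customer_emails)
sets = {p: set(customer_emails[p]) for p in products}

# Start from an all-zero square frame labelled by product on both axes.
matrix = pd.DataFrame(0, index=products, columns=products)

# Visit each unordered pair (including a product with itself) exactly once.
for a, b in combinations_with_replacement(products, 2):
    shared = len(sets[a] & sets[b])
    matrix.loc[a, b] = shared
    matrix.loc[b, a] = shared  # mirror across the diagonal

print(matrix)
```

For three products the saving is negligible; it would only start to matter with many keys.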