How to compare two lists by filtering and sorting "repeated" values

I have the following email activity file, act2.txt:

2021-04-02//email@example.com//Enhance your presentation skills in 15 minutes//Open
2021-04-11//email@example.com//Enroll in the presentations skills - FREE WEBINAR//Open
2021-04-11//email@example.com//Enroll in the presentations skills - FREE WEBINAR//Delivered
2021-04-11//email@example.com//Enroll in the presentations skills - FREE WEBINAR//Delivered
2021-04-11//email@example.com//Enroll in the presentations skills - FREE WEBINAR//Delivered
2021-04-16//email@example.com//YOU ARE INVITED TO THIS PROGRAMMING EVENT//Delivered
2021-04-01//email@example.com//Enhance your presentation skills in 15 minutes//Delivered
2021-04-09//email@example.com//we are here to help you improve your skills//Delivered
2021-04-12//email@example.com//(1st meeting) here is our recorded presentation skills webinar//Delivered
2021-04-13//email@example.com//YOU ARE INVITED TO THIS PROGRAMMING EVENT//Delivered

I want to track email activity per customer: I count the delivered emails and the opened emails, then compute the open rate.

I generate two lists, one for delivered emails and one for opened emails:

from pprint import pprint

# Read the file; each line holds one activity with fields separated by //
afile = "act2.txt"
with open(afile, "r") as afile_read:
    lines = afile_read.readlines()

# Each entry becomes [date, customer_email, email_title, action]
activityList = []
for activities in lines:
    activity = activities.split("//")
    # Strip trailing whitespace (including the newline) from every field
    stripped_line = [s.rstrip() for s in activity]
    activityList.append(stripped_line)

#print (activityList)


stripped_email = 'email@example.com'
email_actions = [x for x in activityList if stripped_email in x[1]]
delivered = [x for x in email_actions if 'Delivered' in x]
Opened = [x for x in email_actions if 'Open' in x]
delcount = (len(delivered))
opencount = (len(Opened))
try:
  Open_rate =  opencount / delcount * 100
except ZeroDivisionError:
  Open_rate = 0
print (stripped_email,",", delcount,",", opencount,",", Open_rate,"%")

pprint(delivered)
pprint (Opened)

Delivered list:

[['2021-04-11',
  'email@example.com',
  'Enroll in the presentations skills - FREE WEBINAR',
  'Delivered'],
 ['2021-04-11',
  'email@example.com',
  'Enroll in the presentations skills - FREE WEBINAR',
  'Delivered'],
 ['2021-04-11',
  'email@example.com',
  'Enroll in the presentations skills - FREE WEBINAR',
  'Delivered'],
 ['2021-04-16',
  'email@example.com',
  'YOU ARE INVITED TO THIS PROGRAMMING EVENT',
  'Delivered'],
 ['2021-04-01',
  'email@example.com',
  'Enhance your presentation skills in 15 minutes',
  'Delivered'],
 ['2021-04-09',
  'email@example.com',
  'we are here to help you improve your skills',
  'Delivered'],
 ['2021-04-12',
  'email@example.com',
  '(1st meeting) here is our recorded presentation skills webinar',
  'Delivered'],
 ['2021-04-13',
  'email@example.com',
  'YOU ARE INVITED TO THIS PROGRAMMING EVENT',
  'Delivered']]

Opened list:

[['2021-04-02',
  'email@example.com',
  'Enhance your presentation skills in 15 minutes',
  'Open'],
 ['2021-04-11',
  'email@example.com',
  'Enroll in the presentations skills - FREE WEBINAR',
  'Open']]

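I want to compare the two lists and generate a third list (the combined activity), filtered by email subject: if a subject appears in both the delivered list and the opened list, it should count as one activity. However, a subject can repeat, for example an email that was delivered 3 times but opened only once. Since I am still learning Python, I have not been able to work out the right logic.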

Edit for clarity:

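If an email is found in the opened list (matching by title), then entries with the same title should be removed from the delivered list up to the last date, and a new list containing the combined activity should be generated.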

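You need to think about this differently: you are not combining lists.

If an email was opened, it must also have been received. That means your opened list already is your combined list.

Once you see that, all that remains is to copy the emails that were never opened into a separate result list.

Go through the opened list and copy each subject into a set. Then go through the received emails and check whether each subject is in that set: if it is, do nothing; if it is not, copy the email to the unopened list.

A fairly simple piece of code: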

# `opened` and `received` stand for the question's `Opened` and `delivered` lists
opened_subjects = set()
unopened = []
for email in opened:
    opened_subjects.add(email[2])

unopened_subjects = set()
for email in received:
    if all(email[2] not in subj_set 
           for subj_set in (opened_subjects, unopened_subjects)):
        unopened.append(email)
        unopened_subjects.add(email[2])

print('Both received and opened:', opened)
print('Unopened emails:', unopened)

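A small note:
The two sets are there for different reasons. The first set, opened_subjects, exists because a set can only hold unique items, which is exactly what is needed here. The second set, unopened_subjects, is there because checking whether an item is in a set is faster than checking a list; since I check membership before adding to it anyway, its ability to store only unique items is not what I am relying on.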
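
Both properties can be seen in a tiny sketch. The subject strings below are made-up placeholders, not data from the question.

```python
# Placeholder subjects with repeats, standing in for email titles.
subjects = ["webinar", "webinar", "webinar", "event", "event"]

# A set keeps each value once, so repeats collapse.
unique_subjects = set(subjects)
print(len(subjects), len(unique_subjects))  # 5 2

# Membership tests read the same as for a list, but use hashing
# rather than scanning every element.
print("event" in unique_subjects)       # True
print("newsletter" in unique_subjects)  # False
```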