python: data cleaning - detect pattern for fraudulent email addresses
I'm cleaning a dataset that contains fraudulent email addresses, which I want to remove.
I've built several rules that catch duplicates and fraudulent domains, but there is one scenario where I can't figure out how to write a rule in Python to flag the addresses.
So I have rules like this:
import string
import numpy as np

# delete punctuation
df['email'] = df['email'].apply(lambda x: ''.join(i for i in x if i not in string.punctuation))
# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)
Here is the data I can't figure out a rule to catch. Basically, I'm looking for a way to flag addresses that start the same way but end in sequential numbers.
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
First, take a look at the regex question here.
Second, try filtering the email addresses like this:
import re

# Let's say email = 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
m = re.search(r'\d+$', email_name)
# if the string ends in digits, m will be a Match object, or None otherwise
if m is not None:
    print('%s is BAD' % email)
else:
    print('%s is good' % email)
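If you are working with the DataFrame from the question, the same ends-in-digits check can be vectorized; here is a minimal sketch, assuming a `df` with an `email` column as in the question (the `digit_suffix` column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'email': ['attn1234@gmail.com', 'alice@gmail.com']})
# take the local part before '@' and flag it when it ends in one or more digits
local = df['email'].str.split('@').str[0]
df['digit_suffix'] = local.str.contains(r'\d+$')
```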
You could use edit distance (aka Levenshtein distance) and pick a difference threshold. In Python:
$ pip install editdistance
$ ipython
>>> import editdistance
>>> threshold = 5  # This could be anything, really
>>> data = ["attn1@gmail.com", ...]  # set up data to be the set you gave
>>> fraudulent_emails = set(email for email in data for other in data
...                         if other != email and editdistance.eval(email, other) < threshold)
If you want to be smarter about it, you can run through the resulting list and, rather than turning it into a set, keep track of how many other email addresses each one is close to, then use that count as a 'weight' to determine the fakes.
This gets you not only the given case (where the fraudulent addresses all share a common start and differ only in a numeric suffix), but also addresses padded with digits or letters at the start or middle of the address.
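The neighbor-count weighting idea can be sketched like this; a stdlib-only sketch that uses a hand-rolled Levenshtein function in place of the `editdistance` package, where the cutoff of 2 neighbors is an arbitrary choice for illustration:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

data = ["attn1@gmail.com", "attn12@gmail.com", "attn123@gmail.com", "other@example.com"]
threshold = 5
# weight = how many *other* addresses sit within the edit-distance threshold
weights = {e: sum(1 for o in data if o != e and levenshtein(e, o) < threshold) for e in data}
suspicious = [e for e, w in weights.items() if w >= 2]
```

Each address in the `attn` chain is within a few edits of its neighbors, so it accumulates weight, while the unrelated address gets weight 0.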
You can use a regex to do it; example below:
import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"
pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")
if pattern.search(a):
    print("spam1")
if pattern.search(b):
    print("spam2")
if pattern.search(c):
    print("spam3")
if pattern.search(d):
    print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The good thing about this approach is that it is standardized (regular expressions) and you can easily adjust the strength of the match by tweaking the values inside {}; that means you can have a global config file where you set/adjust the values. You can also easily adjust the regex without rewriting code.
import numpy as np

ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        # flag the pair when mj is mi plus exactly one trailing digit
        if len(mj) == len(mi) + 1 and mj.startswith(mi) and mj[-1].isdigit():
            det[j, i] = True
            det[i, j] = True
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
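As a quick, self-contained illustration of this pairwise check (the four-address list here is made up):

```python
import numpy as np

email_list = ['attn1@gmail.com', 'attn12@gmail.com', 'attn123@gmail.com', 'foo@bar.com']
ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi, mj = ids[i], ids[j]
        # mark the pair when one id is the other plus a single trailing digit
        if len(mj) == len(mi) + 1 and mj.startswith(mi) and mj[-1].isdigit():
            det[i, j] = det[j, i] = True
spam_indices = np.where(det.sum(axis=0) != 0)[0].tolist()
print(spam_indices)  # [0, 1, 2] -- foo is the only address not in a chain
```

Note that abc7020.1@gmail.com would not pair with abc7020@gmail.com under this check, since '.1' adds two characters; stripping punctuation first would handle that.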
Here is how I would approach it: fuzzywuzzy.
Create a set of unique emails, loop through them, and compare them with fuzzywuzzy.
Example:
import re
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            pass  # 'flagging operation' goes here
I'm being somewhat loose about how the data is represented, but I feel the variable names are mnemonic enough that you can follow what I mean. This is a very rough piece of code, in that I haven't thought through how to stop repeated flagging.
Anyway, the elif part compares two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com), and if the ratio is above 80 (play around with this number), applies your flagging operation.
For example:
>>> fuzz.partial_ratio("abc7020.1", "abc7020")
100
My solution is not efficient, nor pretty, but check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...]  # for example the 16 email addresses you gave in your question
shuffle(emails)  # everyday i'm shuffling
emails = sorted(emails)  # sort that shit!
names = [email.split('@')[0] for email in emails]

T = 0.7  # <- set your string similarity threshold here!!

split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i - 1]).ratio() < T:
        split_indices.append(i)  # we want to remember where dissimilar email addresses occur

# split the sorted list at each dissimilarity boundary
grouped = []
prev = 0
for i in split_indices + [len(emails)]:
    grouped.append(emails[prev:i])
    prev = i

# now we have similar email addresses grouped; we want to find the common prefix for each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']
In [31]: spam
Out[31]:
['abc7020.10@gmail.com',
'abc7020.11@gmail.com',
'abc7020.12@gmail.com',
'abc7020.13@gmail.com',
'abc7020.14@gmail.com',
'abc7020.15@gmail.com',
'abc7020.1@gmail.com',
'attn12345678@gmail.com',
'attn1234567@gmail.com',
'attn123456@gmail.com',
'attn12345@gmail.com',
'attn1234@gmail.com',
'attn123@gmail.com',
'attn12@gmail.com']
# THE TRUTH YALL!
Here is one way of going about it, which should be pretty efficient.
We do it by grouping the email addresses by length, so that we only need to check whether each email address matches the level below it, by slicing and a set membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np
string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the @foo.bar portion, then filter to only those local parts ending in a digit, and add a 'length' column:
#split on @, expand means into two columns
emails = x.x.str.split('@', expand = True)
#keep only rows where the last character of the local part is a digit
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(), :]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
Now, all we have to do is take each length, and length - 1, and see whether each local part, with its last character removed, appears in the set of local parts one character shorter (and we have to check the reverse, in case it is the shortest duplicate):
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j - 1
    #we make a set of unique values, for a hashed lookup
    againstset = set(emails.loc[against, 0])
    #we cut off the last char of each in totest
    tests = emails.loc[totest, 0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #vice versa, otherwise we miss the smallest one in the group
    againstset = set(emails.loc[totest, 0].str[:-1])
    tests = emails.loc[against, 0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
The resulting mask can be converted to boolean and used to subset the original (de-duplicated) dataframe. The index should match the original index, so we subset like this:
x.loc[~mask.astype(bool),:]
x
0 abc7020@gmail.com
16 foo123@bar.com
17 foo1@bar.com
You can see that we didn't remove your first value, because the '.' means it doesn't match - you could remove punctuation first.
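A minimal sketch of that punctuation pre-step, stripping '.' from the local part before the length-based comparison (the variable names here are made up):

```python
import pandas as pd

emails = pd.Series(['abc7020@gmail.com', 'abc7020.1@gmail.com'])
# remove dots from the local part so 'abc7020.1' reduces to 'abc70201',
# which is 'abc7020' plus exactly one trailing character
local = emails.str.split('@').str[0].str.replace('.', '', regex=False)
```

After this step, the two local parts differ by a single trailing digit and the length-grouping check above will pair them.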