R 中的数据清理:删除测试客户名称
Data Cleaning in R: remove test customer names
我正在处理包含客户名字和姓氏的客户数据。我想清除任何随机击键的名称。测试账户在数据集中杂乱无章,并且有垃圾名称。例如,在下面的数据中,我想删除客户 2、5、9、10、12 等。非常感谢你的帮助。
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF
问题似乎是您需要对不可能的名称进行可靠的定义,而这与 R 并没有真正的关系。无论如何,我建议您按照名字进行操作并删除所有那些不合理的名称.作为合理的名字或肯定列表的来源,您可以使用例如SSA Baby Name Database。这应该可以很好地过滤掉英文名字。如果您对名字有更多特定于位置的需求,只需在线查找其他婴儿姓名数据库并尝试将它们作为肯定列表抓取。
将它们放入名为 positiveNames
的向量后,像这样过滤掉所有非正名称:
data_new <- data_original[!data_original$firstName %in% positiveNames,]
我的方法如下:
1) 将 FirstName
和 LastName
合并为一个字符串 strname
。
然后,计算每个 strname
的字母数。
2) 至此,我们发现对于实名来说,像"MARNIMONTANEZ",是由两个'M'组成的;两个 'A';一个 'R';一个 'I';三 'N';一个 'O';一个 'T'.
而且我们发现像"GFCTFCGCFGCCGGFCGFCGFCGF"这样的假名是由六个'G'组成的;五 'F'; 8 'C'。
3) 区分真名和假名的模式变得清晰:
- 真实姓名的特点是字母种类更多。我们可以通过创建一个计算为
number of unique letters / total string length
的变量 check_real
来衡量这一点
- 假名的特点是几个字母重复多次。我们可以通过创建一个计算为
check_fake
的变量来衡量这一点:average frequency of each letter
4) 最后,我们只需要定义一个阈值来识别两个变量的异常。在触发这些阈值的情况下,会出现 flag_real
和 flag_fake
。
- 如果
flag_real == 1 & flag_fake == 0
,名字是真实的
- 如果
flag_real == 0 & flag_fake == 1
,名字是假的
- 在极少数情况下两个标志一致(即
flag_real == 1 & flag_fake == 1
),您必须手动调查记录以优化阈值。
You can calculate variability strength of full name (combine FirstName and LastName) by calculating length of unique letters in full name divided by total number of characters in the full name. Then, just remove the names that has low variability strength. This means that you are removing the names that has a high frequency of same random keystrokes resulting in low variability strength.
我使用 charToRaw
函数执行此操作,因为它速度非常快,并且使用 dplyr
库,如下所示:
# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
## Combining FirstName and LastName
mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
group_by(FullName) %>%
## Calculating variability strength for each full name
mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
## Filtering full name, I set above or equal to 0.4 (You can change this)
## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909
我尝试结合以下代码中的建议。谢谢大家的帮助。
# load required libraries
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8),
FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspel
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)
# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
files<-list.files("/names",pattern="*.txt")
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x,
col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flad for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)
# Unique string length
for(n in 1:nrow(df))
{
df$ulen[n]<-length(unique(strsplit(df$strname[n], "")[[1]]))
}
# Ratio variable for unique string length over total string length
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]
我正在处理包含客户名字和姓氏的客户数据。我想清除任何随机击键的名称。测试账户在数据集中杂乱无章,并且有垃圾名称。例如,在下面的数据中,我想删除客户 2、5、9、10、12 等。非常感谢你的帮助。
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF
问题似乎是您需要对不可能的名称进行可靠的定义,而这与 R 并没有真正的关系。无论如何,我建议您按照名字进行操作并删除所有那些不合理的名称.作为合理的名字或肯定列表的来源,您可以使用例如SSA Baby Name Database。这应该可以很好地过滤掉英文名字。如果您对名字有更多特定于位置的需求,只需在线查找其他婴儿姓名数据库并尝试将它们作为肯定列表抓取。
将它们放入名为 positiveNames
的向量后,像这样过滤掉所有非正名称:
data_new <- data_original[!data_original$firstName %in% positiveNames,]
我的方法如下:
1) 将 FirstName
和 LastName
合并为一个字符串 strname
。
然后,计算每个 strname
的字母数。
2) 至此,我们发现对于实名来说,像"MARNIMONTANEZ",是由两个'M'组成的;两个 'A';一个 'R';一个 'I';三 'N';一个 'O';一个 'T'.
而且我们发现像"GFCTFCGCFGCCGGFCGFCGFCGF"这样的假名是由六个'G'组成的;五 'F'; 8 'C'。
3) 区分真名和假名的模式变得清晰:
- 真实姓名的特点是字母种类更多。我们可以通过创建一个计算为
number of unique letters / total string length
的变量 - 假名的特点是几个字母重复多次。我们可以通过创建一个计算为
check_fake
的变量来衡量这一点:average frequency of each letter
check_real
来衡量这一点
4) 最后,我们只需要定义一个阈值来识别两个变量的异常。在触发这些阈值的情况下,会出现 flag_real
和 flag_fake
。
- 如果
flag_real == 1 & flag_fake == 0
,名字是真实的 - 如果
flag_real == 0 & flag_fake == 1
,名字是假的 - 在极少数情况下两个标志一致(即
flag_real == 1 & flag_fake == 1
),您必须手动调查记录以优化阈值。
You can calculate variability strength of full name (combine FirstName and LastName) by calculating length of unique letters in full name divided by total number of characters in the full name. Then, just remove the names that has low variability strength. This means that you are removing the names that has a high frequency of same random keystrokes resulting in low variability strength.
我使用 charToRaw
函数执行此操作,因为它速度非常快,并且使用 dplyr
库,如下所示:
# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
## Combining FirstName and LastName
mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
group_by(FullName) %>%
## Calculating variability strength for each full name
mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
## Filtering full name, I set above or equal to 0.4 (You can change this)
## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909
我尝试结合以下代码中的建议。谢谢大家的帮助。
# load required libraries
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8),
FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspel
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)
# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
files<-list.files("/names",pattern="*.txt")
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x,
col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flad for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)
# Unique string length
for(n in 1:nrow(df))
{
df$ulen[n]<-length(unique(strsplit(df$strname[n], "")[[1]]))
}
# Ratio variable for unique string length over total string length
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]