R 中的数据清理:删除测试客户名称

Data Cleaning in R: remove test customer names

我正在处理包含客户名字和姓氏的客户数据。我想清除任何随机击键的名称。测试账户在数据集中杂乱无章,并且有垃圾名称。例如,在下面的数据中,我想删除客户 2、5、9、10、12 等。非常感谢你的帮助。

 Customer Id    FirstName   LastName
1   MARY    MEYER
2   GFRTYUIO    UHBVYY
3   CHARLES BEAL
4   MARNI   MONTANEZ
5   GDTDTTD DTTHDTHTHTHD
6   TIFFANY BAYLESS
7   CATHRYN JONES
8   TINA    CUNNINGHAM
9   FGCYFCGCGFC FGCGFCHGHG
10  ADDHJSDLG   DHGAHG
11  WALTER  FINN
12  GFCTFCGCFGC CG GFCGFCGFCGF
13  ASDASDASD   AASDASDASD
14  TYKTYKYTKTY YTKTYKTYK
15  HFHFHF  HAVE
16  REBECCA CROSSWHITE
17  GHSGHG  HGASGH
18  JESSICA TREMBLEY
19  GFRTYUIO    UHBVYY
20  HUBHGBUHBUH YTVYVFYVYFFV
21  HEATHER WYRICK
22  JASON   SPLICHAL
23  RUSTY   OWENS
24  DUSTIN  WILLIAMS
25  GFCGFCFGCGFC    GRCGFXFGDGF
26  QWQWQW  QWQWWW
27  LIWNDVLIHWDV    LIAENVLIHEAV
28  DARLENE SHORTRIDGE
29  BETH    HDHDHDH
30  ROBERT  SHIELDS
31  GHERDHBXFH  DFHFDHDFH
32  ACE TESSSSSRT
33  ALLISON AWTREY
34  UYGUGVHGVGHVG   HGHGVUYYU
35  HCJHV   FHJSEFHSIEHF

问题似乎是您需要对不可能的名称进行可靠的定义,而这与 R 并没有真正的关系。无论如何,我建议您按照名字进行操作并删除所有那些不合理的名称.作为合理的名字或肯定列表的来源,您可以使用例如SSA Baby Name Database。这应该可以很好地过滤掉英文名字。如果您对名字有更多特定于位置的需求,只需在线查找其他婴儿姓名数据库并尝试将它们作为肯定列表抓取。

将它们放入名为 positiveNames 的向量后,像这样过滤掉所有非正名称:

data_new <- data_original[!data_original$firstName %in% positiveNames,]

我的方法如下:

1) 将 FirstNameLastName 合并为一个字符串 strname。 然后,计算每个 strname 的字母数。

2) 至此,我们发现对于实名来说,像"MARNIMONTANEZ",是由两个'M'组成的;两个 'A';一个 'R';一个 'I';三 'N';一个 'O';一个 'T'.
而且我们发现像"GFCTFCGCFGCCGGFCGFCGFCGF"这样的假名是由六个'G'组成的;五 'F'; 8 'C'。

3) 区分真名和假名的模式变得清晰:

  • 真实姓名的特点是字母种类更多。我们可以通过创建一个计算为 number of unique letters / total string length
  • 的变量 check_real 来衡量这一点
  • 假名的特点是几个字母重复多次。我们可以通过创建一个计算为 check_fake 的变量来衡量这一点:average frequency of each letter

4) 最后,我们只需要定义一个阈值来识别两个变量的异常。在触发这些阈值的情况下,会出现 flag_realflag_fake

  • 如果flag_real == 1 & flag_fake == 0,名字是真实的
  • 如果flag_real == 0 & flag_fake == 1,名字是假的
  • 在极少数情况下两个标志一致(即 flag_real == 1 & flag_fake == 1),您必须手动调查记录以优化阈值。

You can calculate variability strength of full name (combine FirstName and LastName) by calculating length of unique letters in full name divided by total number of characters in the full name. Then, just remove the names that has low variability strength. This means that you are removing the names that has a high frequency of same random keystrokes resulting in low variability strength.

我使用 charToRaw 函数执行此操作,因为它速度非常快,并且使用 dplyr 库,如下所示:

# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7), 
          FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
          LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)


#test data: df
#   CustomerId    FirstName         LastName
#1         1           MARY            MEYER
#2         2    FGCYFCGCGFC       FGCGFCHGHG
#3         3    GFCTFCGCFGC      GFCGFCGFCGF
#4         4      ASDASDASD       AASDASDASD
#5         5        GDTDTTD     DTTHDTHTHTHD
#6         6         WALTER             FINN
#7         7    GFCTFCGCFGC   CG GFCGFCGFCGF

library(dplyr)
df %>%
  ## Combining FirstName and LastName
  mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
  group_by(FullName) %>%
  ## Calculating variability strength for each full name
  mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
  ## Filtering full name, I set above or equal to 0.4 (You can change this)
  ## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
  filter(Variability >= 0.40)


# A tibble: 2 x 5
# Groups:   FullName [2]
# CustomerId FirstName LastName    FullName   Variability
#  <dbl>     <chr>      <chr>        <chr>        <dbl>
#1   1        MARY      MEYER     MARY MEYER    0.6000000
#2   6      WALTER      FINN     WALTER FINN    0.9090909

我尝试结合以下代码中的建议。谢谢大家的帮助。

# load required libraries 
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8), 
               FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
               LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspel
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)

# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
files<-list.files("/names",pattern="*.txt")
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x, 
                          col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flad for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)

# Unique string length
for(n in 1:nrow(df))
{
  df$ulen[n]<-length(unique(strsplit(df$strname[n], "")[[1]]))
}

# Ratio variable for unique string length over total string length 
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]