How to Construct Nested For Loops in R

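I am using R to match names across two different datasets, which means comparing strings. I have two data frames of strings, each containing a location ID (not unique) and a person's full name. For some people, the full name in one data frame contains two last names. The other data frame has the same location codes (also not unique), but only one of the two last names (which one is random).

What I would like to do is run grep(), going row by row through the first data frame, and get the search results from the second data frame. My approach is as follows:

  1. Use paste() to combine the location ID and the first name. This helps with matching, but I really need to match on the last name as well (either one of the two last names). Let's call this new vector location_first

  2. Use strsplit() on the last name column. Some elements of the resulting list will have only one item, while others (people with two last names) will have two items. Let's call this list strsplit_ln

  3. Then do a second paste inside a loop: paste the first element of strsplit_ln onto location_first and grep for it, then move to the next element of strsplit_ln and grep for that. I would like to print the full grep search results to my console, captured in a text file via sink().

This is the process I would like to step through in the form of a loop (or nested loop):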

# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez",  "Mar", "Yoon"), stringsAsFactors = FALSE)

names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"), stringsAsFactors = FALSE)


# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step. 
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)


# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list. 
last_name_strsplit = strsplit(names_df1$last_name, split = " ")


          # these are the strings I would be searching for. Note that the loop searches through each sub element v of the ith element in the list.
          # paste(location_name_df1[i], last_name_strsplit[[i]][v])
          paste(location_name_df1[1], last_name_strsplit[[1]][1])

          paste(location_name_df1[2], last_name_strsplit[[2]][1])
          paste(location_name_df1[2], last_name_strsplit[[2]][2])

          paste(location_name_df1[3], last_name_strsplit[[3]][1])
          paste(location_name_df1[3], last_name_strsplit[[3]][2])

          paste(location_name_df1[4], last_name_strsplit[[4]][1])

          paste(location_name_df1[5], last_name_strsplit[[5]][1])


    # this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
    names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strsplit
    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful

    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row

    names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful

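I am fairly sure I need a nested loop structure in which I iterate over each element of the list (i) and then over each of its sub elements (v). However, when I write nested loops, I tend to end up with a lot of duplicated paste results and the search itself goes wrong.

Can someone give me some pointers on how to create a loop structure using the steps above? Again, I am using R/RStudio to match the data.

Thanks!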

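Here's a simpler approach. First, we join the two data frames on location and first name (note that merge() with its defaults performs an inner join; pass all = TRUE for a full join). Then we use stringr::str_detect (which, unlike grep, is vectorized over both the string and the pattern) to filter out the rows where the single last name isn't one of the possible double last names: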

full = merge(names_df1, names_df2, by = c("location", "first_name"))

library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches           
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar
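
If you would rather not depend on stringr, roughly the same filter can be sketched in base R. grepl() accepts only a single pattern, so the sketch below pairs each pattern with its own string via mapply(). The names is_match and matches_base are made up for illustration, and the test data is copied from the question.

```r
# Test data copied from the question
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

full <- merge(names_df1, names_df2, by = c("location", "first_name"))

# Pair the i-th pattern (single last name) with the i-th string (possibly double last name)
is_match <- mapply(
  function(pattern, string) grepl(pattern, string, fixed = TRUE),
  full$last_name.y, full$last_name.x,
  USE.NAMES = FALSE
)

matches_base <- full[is_match, ]
print(matches_base)
```

Like fixed() in the stringr version, fixed = TRUE does plain substring matching, so a short name such as Mar would also match inside a longer one such as Martinez. If that matters for your data, compare against the strsplit() tokens instead.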

If you like dplyr, you can do it this way:

library(dplyr)
library(stringr)  # for str_detect() and fixed()
full_join(names_df1, names_df2, by = c("location", "first_name")) %>% 
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y)))
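
For completeness, the nested loop the question literally asks for could look something like the sketch below. It follows the question's three steps: the outer loop walks the rows (i), the inner loop walks the split last names (v), and it stops at the first last name that produces a hit. The names results, needle, hits and matched are illustrative, and using fixed = TRUE in grep() is my own choice to avoid regex surprises, not something the question specifies.

```r
# Test data copied from the question
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

# Steps 1 and 2 from the question
location_name_df1 <- paste(names_df1$location, names_df1$first_name)
location_name_df2 <- paste(names_df2$location, names_df2$first_name, names_df2$last_name)
last_name_strsplit <- strsplit(names_df1$last_name, split = " ")

# One slot per row of names_df1; a slot stays NULL if nothing matches
results <- vector("list", length(last_name_strsplit))

for (i in seq_along(last_name_strsplit)) {          # each person in names_df1
  for (v in seq_along(last_name_strsplit[[i]])) {   # each of that person's last names
    needle <- paste(location_name_df1[i], last_name_strsplit[[i]][v])
    hits <- grep(needle, location_name_df2, fixed = TRUE)
    if (length(hits) > 0) {
      results[[i]] <- names_df2[hits, ]  # rows of the second data frame that matched
      break                              # first matching last name is enough
    }
  }
}

# Stack the per-row results; NULL slots are skipped by rbind()
matched <- do.call(rbind, results)
print(matched)
```

seq_along() keeps both loops safe for elements of any length, which avoids the duplicated pastes that tend to appear when the inner index is hard-coded. To capture the output in a text file, the print() call could be wrapped between sink("results.txt") and sink(), where the file name is only a placeholder.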