How to Construct Nested For Loops in R
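I am using R to match names across two different datasets, which means comparing strings. I have two data frames of strings; each contains a location ID (not unique) and people's full names. In one data frame, some people's full names include two last names. The other data frame has the same location codes (again not unique), but only one of the two last names (which of the two is random).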
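What I would like to do is use grep() to go through the first data frame row by row and get the search results from the second data frame. My approach would be as follows: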
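Use the paste() function to paste the location ID and the first name together. This helps with matching, but I really need to match on the last name (which can be either of the two last names). Let's call this new vector location_first.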
Use strsplit() on the last name column. Some elements of the resulting list will have one item, while others (people with two last names) will have two. Let's call this list strsplit_ln.
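Then I would do a second paste inside a loop: paste the first element of strsplit_ln onto location_first and grep for it, then move on to the next element of strsplit_ln and grep for that. I would like the complete grep search results printed to a text file via sink().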
This is the process I would like to step through in the form of a loop (or nested loop):
# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)
names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)
# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step.
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)
# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list.
last_name_strsplit = strsplit(names_df1$last_name, split = " ")
# these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
# paste(location_name_df1[i], last_name_strsplit[[i]][v])
paste(location_name_df1[1], last_name_strsplit[[1]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][2])
paste(location_name_df1[3], last_name_strsplit[[3]][1])
paste(location_name_df1[3], last_name_strsplit[[3]][2])
paste(location_name_df1[4], last_name_strsplit[[4]][1])
paste(location_name_df1[5], last_name_strsplit[[5]][1])
# this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strsplit
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row
names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful
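I am fairly sure I need a nested loop structure, in which I go through each element of the list (i) and then each of its sub-elements (v). However, when I write nested loops I tend to end up with many duplicated pastes, and the search itself goes wrong.
Could someone give me some pointers on how to create a loop structure for the steps above? Again, I am using R/RStudio to match the data.
Thank you!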
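Here is a simpler approach. First, we join the two data frames on location and first name (merge() performs an inner join by default), and then use stringr::str_detect (which, unlike grep, is vectorized over both the string and the pattern) to keep only the rows in which the single last name is one of the possible double last names: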
full = merge(names_df1, names_df2, by = c("location", "first_name"))
library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar
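To make the vectorization point concrete, here is a small sketch on made-up vectors (the names x and p are illustrative and not part of the question's data). It assumes stringr is installed. str_detect() pairs each string with the pattern in the same position, whereas grepl() only ever uses the first pattern; mapply() is shown as one possible base R equivalent of the paired comparison.

```r
library(stringr)

# Illustrative vectors (hypothetical, not from the question's data)
x = c("Johnson Clarke", "Lopez Gutierrez", "Mar")
p = c("Clarke", "Lopez", "Yoon")

# str_detect() compares x[i] with p[i], element by element
paired = str_detect(string = x, pattern = fixed(p))

# grepl() accepts a single pattern; given several it uses only the first
# (and warns), so every element of x is tested against "Clarke"
first_only = suppressWarnings(grepl(p, x, fixed = TRUE))

# Base R equivalent of the paired comparison
paired_base = unname(mapply(grepl, p, x, MoreArgs = list(fixed = TRUE)))

print(paired)
print(first_only)
print(paired_base)
```

This is why the join-then-filter approach needs no explicit loop: once the candidate rows sit side by side, a single vectorized call checks every pair.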
If you prefer dplyr, you can do the same thing like this:
library(dplyr)
full_join(names_df1, names_df2, by = c("location", "first_name")) %>%
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y)))
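For completeness, the nested loop the question asks about could be sketched roughly as follows. This is only one possible arrangement under stated assumptions: the test data is rebuilt so the block runs on its own, the helper names matched_row, results and out_file are made up for illustration, the output file name is a placeholder, and fixed = TRUE is added so that names are treated as literal text rather than regular expressions. The outer loop walks the rows of names_df1, and the inner loop walks that person's last names and stops at the first one that matches.

```r
# Test data from the question
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)
names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)

location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)
last_name_strsplit = strsplit(names_df1$last_name, split = " ")

# For each row of names_df1, the index of the matching row in names_df2 (NA if none)
matched_row = rep(NA_integer_, nrow(names_df1))

for (i in seq_along(last_name_strsplit)) {          # each person in names_df1
  for (v in seq_along(last_name_strsplit[[i]])) {   # each last name of that person
    key = paste(location_name_df1[i], last_name_strsplit[[i]][v])
    hits = grep(key, location_name_df2, fixed = TRUE)
    if (length(hits) > 0) {
      matched_row[i] = hits[1]
      break  # a last name matched, move on to the next person
    }
  }
}

results = cbind(names_df1, matched_last_name = names_df2$last_name[matched_row])

# Write the results to a text file instead of the console
# (the file name and location are placeholders)
out_file = file.path(tempdir(), "grep_results.txt")
sink(out_file)
print(results)
sink()
```

Note that grep() matches substrings, so a key ending in "Mar" would also hit a name such as "Martinez" at the same location and first name; comparing whole words would be needed if that matters for the real data.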