R fuzzy string match to return specific column based on matched string
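I have two large datasets, one with around half a million records and the other with around 70K. Both contain addresses. I want to check whether any address in the smaller dataset is present in the larger one. As you would imagine, addresses can be written in different ways, with different cases, spellings and so on. In addition, an address can be duplicated if it is written only down to the building level, so different flats share the same address. I did some research and found the package stringdist, which can be used for this.
I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the rows where the address matches.
Below is some dummy sample data, along with the code I wrote to explain the situation: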
library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]
fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}
df1[!is.na(Address1)
    , address_match :=
      fn_match(Address1, df2$Address2,3)
    ]
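This returns the closest string match within a distance of 3. However, I also want to have the "Year" and "unique_id" columns from df2 in df1. That would tell me which row of df2 the string was matched with. So in the end, for each row in df1, I want to know the closest match in df2 within the specified distance, along with the "Year" and "unique_id" of the matched row from df2.
I guess this has something to do with merge (left join), but I am not sure how to merge while keeping duplicates and ensuring that I end up with the same number of rows as in df1 (the smaller dataset).
Any kind of solution would help!!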
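You are 90% of the way there...
You say you want to
know with which row of data the string was matched from df2
You just need to understand the code you already have. See ?amatch:
amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.
In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match for each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.
Instead, retrieve either the index itself for lookup, or the unique_id (if you are truly confident that it is a unique ID) for a left join.
An illustration of both approaches: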
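To make the "keep the position, not the string" idea concrete without any packages, here is a minimal base-R sketch. It is only a stand-in: nearest_pos is a hypothetical helper written for this illustration, and it uses utils::adist (plain Levenshtein distance) rather than the "dl" method from the question, so distances may differ slightly from what amatch would report.

```r
# Hypothetical base-R stand-in for amatch(): returns, for each element of x,
# the position in `table` of the nearest string, or NA if none is close enough.
# Assumes x contains no NA values.
nearest_pos <- function(x, table, max_dist) {
  d <- utils::adist(x, table)              # length(x) by length(table) distance matrix
  pos <- apply(d, 1, which.min)            # first minimum per row (first match wins on ties)
  best <- d[cbind(seq_along(x), pos)]      # distance of each chosen match
  pos[best > max_dist] <- NA_integer_      # nothing within the threshold
  pos
}

x     <- c("45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR")
table <- c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ")
years <- c(2001L, 2002L, 2003L)

pos <- nearest_pos(x, table, max_dist = 3)
table[pos]   # the matched strings, which is all the question's function returned
years[pos]   # any other column can be looked up with the same positions
```

Once the positions are kept, every column of the lookup table is available through the same index, and unmatched rows simply come back as NA.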
library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2 = c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2 = 2001:2010)
df2[, unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also the empty string, perhaps strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)] ]
merge(df1, df2, by = 'unique_id', all.x = TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')) := df2[df2_pos, .(Address2, Year2, unique_id)] ][]
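On the concern about keeping the same number of rows as df1: a left join preserves the row count only if the join key is unique in the lookup table. The following is a small sketch in base R with made-up toy data (the names left, right and right_dup are hypothetical and not part of the example above) showing both the safe case and what happens when the key is duplicated.

```r
# Toy data: three rows to enrich, one of them without a match (NA key)
left  <- data.frame(Address1  = c("a", "b", "c"),
                    unique_id = c("000002", NA, "000002"))
right <- data.frame(unique_id = c("000001", "000002"),
                    Year2     = c(2001L, 2002L))

# Unique key on the right: every left row appears exactly once
joined <- merge(left, right, by = "unique_id", all.x = TRUE)
nrow(joined) == nrow(left)

# Duplicated key on the right: each matching left row is repeated
right_dup  <- rbind(right, data.frame(unique_id = "000002", Year2 = 2009L))
joined_dup <- merge(left, right_dup, by = "unique_id", all.x = TRUE)
nrow(joined_dup)
```

Note also that merge sorts the result by the key, so the original row order of df1 is not kept unless you restore it yourself.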
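Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax, with stringdist as one of the possible types of fuzzy matching.
You can use the stringdist method = "dl" (or other methods that might work better).
To meet your requirement of "ensuring that I have same number of rows as in df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to keep only the best match with the minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin. (Hopefully it will become part of the package itself in the future.)
(I also had to assume that, in the case of distance ties, the row with the largest Year2 should be taken.)
Code: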
library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007)
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]
library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2,
                by = c("Address1" = "Address2"),
                mode = "left",
                method = "dl",
                max_dist = 99,
                distance_col = "dist") %>%
  group_by(Address1, Year1) %>%
  top_n(1, -dist) %>%
  top_n(1, Year2)
Result:
# A tibble: 7 x 6
# Groups: Address1, Year1 [7]
Address1 Year1 Address2 Year2 unique_id dist
<chr> <int> <chr> <int> <chr> <dbl>
1 786, GALI NO 5, XYZ 2001 786, GALI NO 4 XYZ 2007 000007 2
2 rambo, 45, strret 4, atlast, pqr 2002 del, 546, strret2, towards east, pqr 2009 000009 17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR 2003 23/4, STREET 2, PQR 2010 000010 19
4 45-B, GALI NO5, XYZ 2004 45B, GALI NO 5, XYZ 2008 000008 2
5 HECTIC, 99 STREET, PQR 2005 23/4, STREET 2, PQR 2010 000010 11
6 786, GALI NO 5, XYZ 2006 786, GALI NO 4 XYZ 2007 000007 2
7 rambo, 45, strret 4, atlast, pqr 2007 del, 546, strret2, towards east, pqr 2009 000009 17
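The tie-breaking rule used here (smallest distance first, then largest Year2) can also be written as a sort followed by "keep the first row per group". The sketch below uses base R and made-up candidate pairs (cand and its columns are hypothetical, standing in for the output of a fuzzy join with a large max_dist). One possible advantage of this form is that it returns exactly one row per group even if both the distance and Year2 are tied.

```r
# Hypothetical candidate pairs, one row per (df1 row, df2 row) combination
cand <- data.frame(
  row1  = c(1L, 1L, 1L, 2L, 2L),                  # which df1 row the candidate belongs to
  Year2 = c(2002L, 2007L, 2003L, 2005L, 2010L),
  dist  = c(2, 2, 5, 11, 11)
)

# Sort by group, then nearest first, then latest Year2 first
ord  <- order(cand$row1, cand$dist, -cand$Year2)
best <- cand[ord, ]

# Keep only the first (best) candidate for each df1 row
best <- best[!duplicated(best$row1), ]
best
```

With this toy input, the first group keeps the candidate from 2007 and the second keeps the one from 2010, which mirrors the rule assumed in the answer.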