Extracting city and state information from a Google street address
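I have a dataset containing lat/long information for different point locations, and I would like to know which city and state each point is associated with.
Following this example, I used the revgeocode function from ggmap to obtain a street address for each location, producing the following data frame: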
df <- structure(list(PointID = c(1787L, 2805L, 3025L, 3027L, 3028L,
3029L, 3030L, 3031L, 3033L), Latitude = c(38.36648102, 36.19548585,
43.419774, 43.437222, 43.454722, 43.452643, 43.411949, 43.255479,
43.261464), Longitude = c(-76.4802046, -94.21554661, -87.960399,
-88.018333, -87.974722, -87.978542, -87.94149, -87.986433, -87.968612
), Address = structure(c(2L, 8L, 5L, 3L, 9L, 7L, 4L, 1L, 6L), .Label = c("13004 N Thomas Dr, Mequon, WI 53097, USA",
"2160 Turner Rd, Lusby, MD 20657, USA", "2805 County Rd Y, Saukville, WI 53080, USA",
"3701-3739 County Hwy W, Saukville, WI 53080, USA", "3907 Echo Ln, Saukville, WI 53080, USA",
"4823 W Bonniwell Rd, Mequon, WI 53097, USA", "5100-5260 County Rd I, Saukville, WI 53080, USA",
"7948 W Gibbs Rd, Springdale, AR 72762, USA", "River Park Rd, Saukville, WI 53080, USA"
), class = "factor")), row.names = c(NA, -9L), class = "data.frame", .Names = c("PointID",
"Latitude", "Longitude", "Address"))
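I would like to use R to extract the city/state information from the full street address and create two columns to store it ("City" and "State").
I assume the stringr package is the way to go, but I am not sure how to use it. The example above used the following code to extract the zip code (the address column is named "result" in that example). Their dataset: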
# ID Longitude Latitude result
# 1 311175 41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA
# 2 292058 41.93694 -87.66984 1632 West Nelson Street, Chicago, IL 60657, USA
# 3 12979 37.58096 -77.47144 2077-2199 Seddon Way, Richmond, VA 23230, USA
And the code to extract the zip code:
library(stringr)
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6)
data[,-4]
Can the code above be easily modified to get the city and state data?
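1) sub  Use sub like this. No packages are needed.
The regular expression matches the start of the string (^), followed by the shortest string up to a comma and a space, followed by the shortest string (the city) up to another comma and a space, followed by two characters (the state), a space, 5 characters (the zip code), a comma, a space, USA and the end of the string. The parenthesized matches can be referred to as \\1, \\2 and \\3; inside double quotes the backslash must be doubled.
If your zip codes are not all 5 digits, try pat <- "^.*?, (.*?), (..) (.*), USA$" instead.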
pat <- "^.*?, (.*?), (..) (.....), USA$"
transform(df, City = sub(pat, "\\1", Address),
              State = sub(pat, "\\2", Address),
              Zip = sub(pat, "\\3", Address))
giving:
PointID Latitude Longitude Address City State Zip
1 1787 38.36648 -76.48020 2160 Turner Rd, Lusby, MD 20657, USA Lusby MD 20657
2 2805 36.19549 -94.21555 7948 W Gibbs Rd, Springdale, AR 72762, USA Springdale AR 72762
3 3025 43.41977 -87.96040 3907 Echo Ln, Saukville, WI 53080, USA Saukville WI 53080
4 3027 43.43722 -88.01833 2805 County Rd Y, Saukville, WI 53080, USA Saukville WI 53080
5 3028 43.45472 -87.97472 River Park Rd, Saukville, WI 53080, USA Saukville WI 53080
6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA Saukville WI 53080
7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA Saukville WI 53080
8 3031 43.25548 -87.98643 13004 N Thomas Dr, Mequon, WI 53097, USA Mequon WI 53097
9 3033 43.26146 -87.96861 4823 W Bonniwell Rd, Mequon, WI 53097, USA Mequon WI 53097
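One caveat with sub: when an address does not match pat at all, sub returns the whole address unchanged, so the City, State and Zip columns would each contain the full string. The following is a minimal base R sketch of one way to get NA instead, using regexec and regmatches with the same pat. The second address is a made-up input chosen so that it does not match, and the names addresses, m and parts are only illustrative.

```r
# Same pattern as above: street, city, two-character state, five-character zip.
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Hypothetical input: the second address has no street part, so pat does not match it.
addresses <- c("2160 Turner Rd, Lusby, MD 20657, USA",
               "Saukville, WI, USA")

# regexec() records the full match plus one entry per capture group;
# regmatches() yields character(0) for a string that does not match.
m <- regmatches(addresses, regexec(pat, addresses))

# Keep the three capture groups, or NA when there was no match.
parts <- t(vapply(m, function(x) {
  if (length(x) == 4) x[2:4] else rep(NA_character_, 3)
}, character(3)))
colnames(parts) <- c("City", "State", "Zip")
parts
```

The resulting character matrix has one row per address and can be attached to the data frame with cbind, in the same way as the other approaches here.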
2) read.pattern  Another possibility is read.pattern from the gsubfn package, with the same pat as above:
library(gsubfn)
cn <- c("City", "State", "Zip")
Address <- as.character(df$Address)
cbind(df, read.pattern(text = Address, pattern = pat, as.is = TRUE, col.names = cn))
If you want to use stringr, you could do it like this:
library(stringr)
library(data.table)
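# Split a single address on commas and pull out each component.
parse_address <- function(address){
  address <- address %>%
    str_split(",") %>%
    .[[1]]
  # The third piece looks like " MD 20657": capitals give the state, digits the zip.
  state <- address %>%
    .[3] %>%
    str_replace_all("[^A-Z]","")
  zip <- address %>%
    .[3] %>%
    str_replace_all("[^0-9]","")
  city <- address %>%
    .[2] %>%
    str_trim()
  street <- address %>%
    .[1] %>%
    str_trim()
  data.table(street, city, state, zip)
}
# Address is a factor in df, so convert it before parsing; then stack the one-row tables.
lapply(as.character(df$Address), parse_address) %>%
  rbindlist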
You can use revgeocode() itself to get the city and state:
df <- cbind(df,do.call(rbind,
lapply(1:nrow(df),
function(i)
revgeocode(as.numeric(
df[i,3:2]), output = "more")[c("administrative_area_level_1","locality")])))
df
# PointID Latitude Longitude Address
# 1 1787 38.36648 -76.48020 2160 Turner Rd, Lusby, MD 20657, USA
# 2 2805 36.19549 -94.21555 7948 W Gibbs Rd, Springdale, AR 72762, USA
# 3 3025 43.41977 -87.96040 3907 Echo Ln, Saukville, WI 53080, USA
# 4 3027 43.43722 -88.01833 2805 County Rd Y, Saukville, WI 53080, USA
# 5 3028 43.45472 -87.97472 River Park Rd, Saukville, WI 53080, USA
# 6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA
# 7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA
# 8 3031 43.25548 -87.98643 13004 N Thomas Dr, Mequon, WI 53097, USA
# 9 3033 43.26146 -87.96861 4823 W Bonniwell Rd, Mequon, WI 53097, USA
# administrative_area_level_1 locality
# 1 Maryland Lusby
# 2 Arkansas Springdale
# 3 Wisconsin Saukville
# 4 Wisconsin Saukville
# 5 Wisconsin Saukville
# 6 Wisconsin Saukville
# 7 Wisconsin Saukville
# 8 Wisconsin Mequon
# 9 Wisconsin Mequon
P.S. You can do it all in one step (including getting the address and/or the zip code). Just add "address" and/or "postal_code" to c("administrative_area_level_1","locality"), which is the list of variables you want to extract.
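Note that administrative_area_level_1 holds the full state name ("Maryland"), whereas the addresses use two-letter codes ("MD"). If the two-letter form is wanted, one option is a lookup against R's built-in state.name and state.abb vectors. This is a minimal sketch: full_names is a hypothetical stand-in for that column, and the built-in vectors cover only the 50 states, so anything else (for example the District of Columbia) comes back as NA.

```r
# Full state names, as they appear in the administrative_area_level_1 column above.
full_names <- c("Maryland", "Arkansas", "Wisconsin", "Wisconsin")

# state.name and state.abb are parallel built-in vectors for the 50 US states;
# match() finds each name's position, and names not found give NA.
State <- state.abb[match(full_names, state.name)]
State
```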