Extract House Number from (address) string using R
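I want to parse (extract) addresses into HouseNumber and Streetname.
Later I should be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).
So let's assume I have a data frame called "shops":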
> shops
                Name     city                          street
1          Something Fakecity                    New Street 3
2     SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity                 Fake Street 14a
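Is there a way to split the street column into two lists, one with the street name and one with the house number, including cases like "1-3" and "14a", so that the result can be assigned back to the data frame and look like this?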
> shops
                Name     city                  Streetname HouseNumber
1          Something Fakecity                  New Street           3
2     SomethingOther Fakecity Some-Complicated-Casestreet         1-3
3 SomethingDifferent Fakecity                 Fake Street         14a
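Example: Easyfakestreet 5 --> Easyfakestreet, 5
It gets slightly more complicated because some of my street strings contain hyphens, and some house numbers have non-numeric parts.
Examples:
New Street 3 --> ['New Street', '3']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a --> ['Fake Street', '14a']
Any help is much appreciated!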
You can try:
# Everything before the last whitespace is the street name
shops$Streetname <- gsub("(.+)\\s[^ ]+$", "\\1", shops$street)
# Everything after the last whitespace is the house number
shops$HouseNumber <- gsub(".+\\s([^ ]+)$", "\\1", shops$street)
Data
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Results
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
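The two calls above treat whatever follows the last whitespace as the house number, so a street string without any whitespace would be copied unchanged into both columns. A minimal sketch of a guard for that case follows; the helper name split_last_token is hypothetical and not part of the answer.

```r
# Hypothetical helper (not from the answer): split on the last whitespace,
# returning NA as the house number when the string contains no whitespace.
split_last_token <- function(x) {
  has_space <- grepl("\\s", x)
  data.frame(
    Streetname  = ifelse(has_space, sub("^(.+)\\s+\\S+$", "\\1", x), x),
    HouseNumber = ifelse(has_space, sub("^.+\\s+(\\S+)$", "\\1", x), NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- split_last_token(c("New Street 3", "Fake Street 14a", "Easyfakestreet"))
print(res)
```

The third row keeps "Easyfakestreet" as the street name and gets NA as the house number instead of a copy of the whole string.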
Here is a possible tidyr solution:
library(tidyr)
# (\\D+) captures everything up to the first digit, (\\d.*) captures the rest
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                  Streetname HouseNumber
# 1          Something Fakecity                  New Street           3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet         1-3
# 3 SomethingDifferent Fakecity                 Fake Street         14a
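Note that the first group, (\\D+), also captures the space in front of the number. As a rough base R sketch of the same two-group idea, assuming every string matches the pattern, regexec() and regmatches() return the groups, and trimws() removes that trailing space:

```r
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# regexec() yields the full match followed by one entry per capture group.
# Assumption: every element matches; a non-matching element would return
# character(0) and break the rbind() below.
m <- regmatches(street, regexec("^(\\D+)(\\d.*)$", street))
parts <- do.call(rbind, m)

Streetname  <- trimws(parts[, 2])  # group 1, minus the trailing space
HouseNumber <- parts[, 3]          # group 2
```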
Create a pattern with back references that match the street and the number, then use sub to replace the string with each back reference in turn. No packages are needed:
pat <- "(.*) (\\d.*)"
transform(shops,
  street = sub(pat, "\\1", street),
  HouseNumber = sub(pat, "\\2", street)
)
giving:
                Name     city                      street HouseNumber
1          Something Fakecity                  New Street           3
2     SomethingOther Fakecity Some-Complicated-Casestreet         1-3
3 SomethingDifferent Fakecity                 Fake Street         14a
Here is a visualization of pat:
(.*) (\d.*)
Notes:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
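2) David Arenburg's pattern could be used here instead; just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the street number may be missing.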
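The trade-off in note 2 can be checked on two made-up inputs. This sketch assumes that the pattern meant in note 2 is the digit-boundary pattern from the tidyr answer; the variable names and the test strings are illustrative only.

```r
pat_space  <- "(.*) (\\d.*)"   # pattern from this answer: needs a space before the number
pat_digits <- "(\\D+)(\\d.*)"  # assumed alternative: splits at the first digit

x <- c("Highway 72 East 5", "Fakestreet14a")

street_space  <- sub(pat_space,  "\\1", x)  # "Highway 72 East" "Fakestreet14a"
number_space  <- sub(pat_space,  "\\2", x)  # "5"               "Fakestreet14a"
street_digits <- sub(pat_digits, "\\1", x)  # "Highway "        "Fakestreet"
number_digits <- sub(pat_digits, "\\2", x)  # "72 East 5"       "14a"
```

With pat_space the second string contains no space, so nothing matches and sub() returns it unchanged for both back references. With pat_digits the first string is split at the embedded "72" rather than at the house number.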
You can use the package unglue:
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>                 Name     city                      street value
#> 1          Something Fakecity                  New Street     3
#> 2     SomethingOther Fakecity Some-Complicated-Casestreet   1-3
#> 3 SomethingDifferent Fakecity                 Fake Street   14a
Created on 2019-10-08 by the reprex package (v0.3.0)
With international addresses the problem becomes very complex:
$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a
123/3 dsdsdfs
Roobertinkatu 36-40
Flats 1-24 Acacia Avenue
Apartment 9D, 1 Acacia Avenue
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2
Apartment 5005 no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º
102 - 3 Esq
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5
11334 Nc Highway 72 E ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
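For readers staying in R, here is a much simpler sketch of the same concern. It is not a port of the PHP regex above. It only assumes the house number is either the first or the last whitespace-separated token; the helper name extract_house_number and the token pattern are hypothetical.

```r
# Hypothetical helper: look for a number-like token at the start of the
# string, then at the end; return NA when neither position has one.
extract_house_number <- function(x) {
  num   <- "\\d+(?:[-/]\\d+)?[A-Za-z]?"   # e.g. 14, 36-40, 123/3, 14a, 456b
  lead  <- paste0("^(", num, ")\\s.*$")
  trail <- paste0("^.*\\s(", num, ")$")
  ifelse(grepl(lead, x, perl = TRUE), sub(lead, "\\1", x, perl = TRUE),
    ifelse(grepl(trail, x, perl = TRUE), sub(trail, "\\1", x, perl = TRUE),
      NA_character_))
}

addresses <- c("234 Test Road", "456b Tester Road", "Mystreet 14a",
               "Roobertinkatu 36-40", "123/3 dsdsdfs", "Flat 24, 1 Acacia Avenue")
house_numbers <- extract_house_number(addresses)
print(house_numbers)
```

The last address comes back as NA because its number sits in the middle of the string, which is exactly the kind of case that makes international addresses hard to cover with a single pattern.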