Extract house number from (address) string using R

I want to parse (extract) addresses into HouseNumber and Streetname. Afterwards I should be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).

So let's say I have a data frame named "shops":
> shops
      Name                 city        street
 1    Something            Fakecity    New Street 3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet 1-3
 3    SomethingDifferent   Fakecity    Fake Street 14a

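Is there a way to split the street column into two lists, one with the street names and one with the house numbers, including cases such as "1-3" and "14a", so that the result can be assigned back to the data frame and look like this?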

 > shops
      Name                 city        Streetname                    HouseNumber
 1    Something            Fakecity    New Street                    3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet   1-3
 3    SomethingDifferent   Fakecity    Fake Street                   14a 

Example: Easyfakestreet 5 --> Easyfakestreet, 5

It gets a bit more complicated because some of my street strings contain hyphenated street names and house numbers with non-numeric parts.

Examples:
New Street 3 --> ['New Street', '3']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a --> ['Fake Street', '14a']

Any help would be much appreciated!

You can try:

# Everything before the last whitespace-separated token is the street name
shops$Streetname <- gsub("(.+)\\s[^ ]+$", "\\1", shops$street)
# The last whitespace-separated token is the house number
shops$HouseNumber <- gsub(".+\\s([^ ]+)$", "\\1", shops$street)

Data

shops$street
#[1] "New Street 3"                    "Some-Complicated-Casestreet 1-3" "Fake Street 14a"

Results

shops$Streetname
#[1] "New Street"                  "Some-Complicated-Casestreet" "Fake Street"

shops$HouseNumber
#[1] "3"   "1-3" "14a"

Here is a possible tidyr solution

library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                   Streetname HouseNumber
# 1          Something Fakecity                  New Street            3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
# 3 SomethingDifferent Fakecity                 Fake Street          14a
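
One detail worth checking with this pattern: the first group, (\D+), matches every non-digit character up to the first digit, and that includes the space in front of the house number. The following is a minimal base R sketch that uses sub() instead of tidyr to show the effect on the question's street strings, with trimws() as one possible way to strip the trailing space. The variable names raw_name and clean_name are made up for illustration.

```r
# Street strings from the question
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# Same pattern as in the tidyr call; keep only the first capture group
raw_name <- sub("(\\D+)(\\d.*)", "\\1", street)

# The first group still ends with the space that preceded the number
endsWith(raw_name, " ")

# trimws() removes leading and trailing whitespace
clean_name <- trimws(raw_name)
clean_name
```

If the same applies to the extract() result, trimws() could be applied to the Streetname column afterwards.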

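Create a pattern whose backreferences match the street and the number, then use sub to replace the string with each backreference in turn. No packages are needed: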

# Group 1: street name; group 2: house number (starts with a digit)
pat <- "(.*) (\\d.*)"
transform(shops,
   street = sub(pat, "\\1", street),
   HouseNumber = sub(pat, "\\2", street)
)

giving:

                Name     city                      street  HouseNumber
1          Something Fakecity                  New Street            3
2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
3 SomethingDifferent Fakecity                 Fake Street          14a

Here is a visualization of pat:

(.*) (\d.*)


Notes:

1) We used this for shops:

shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3", 
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name", 
"city", "street"), class = "data.frame", row.names = c(NA, -3L))

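2) David Arenburg's pattern could alternatively be used here. Just set pat to it. The pattern above has the advantage that it allows street names with numbers embedded in them, whereas David's has the advantage that the space before the house number may be missing.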
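
To make the trade-off in note 2 concrete, here is a small sketch. It assumes that "David's pattern" refers to the (\D+)(\d.*) pattern from the tidyr answer. The two test addresses are hypothetical and not taken from the question: one has a number inside the street name, the other has no space before the house number.

```r
# Hypothetical test addresses (not from the question)
x <- c("Highway 66 12b", "Fakestreet14a")

pat_space   <- "(.*) (\\d.*)"   # pattern from this answer
pat_nodigit <- "(\\D+)(\\d.*)"  # assumed to be David's pattern

# This answer's pattern: handles the embedded "66",
# but leaves "Fakestreet14a" unsplit because it needs a space
street_space <- sub(pat_space, "\\1", x)
number_space <- sub(pat_space, "\\2", x)

# The other pattern: splits "Fakestreet14a",
# but cuts "Highway 66 12b" at the first digit
street_nodigit <- trimws(sub(pat_nodigit, "\\1", x))
number_nodigit <- sub(pat_nodigit, "\\2", x)

data.frame(x, street_space, number_space, street_nodigit, number_nodigit)
```

When a pattern does not match, sub() returns the input unchanged, so unsplit rows show the full address in both columns.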

You can use the package unglue

library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>                 Name     city                      street value
#> 1          Something Fakecity                  New Street     3
#> 2     SomethingOther Fakecity Some-Complicated-Casestreet   1-3
#> 3 SomethingDifferent Fakecity                 Fake Street   14a

Created on 2019-10-08 by the reprex package (v0.3.0)

The problem is very complicated for international addresses

$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a 
123/3 dsdsdfs
Roobertinkatu 36-40 
Flats 1-24 Acacia Avenue 
Apartment 9D, 1 Acacia Avenue 
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2 
Apartment 5005  no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º 
102 - 3 Esq 
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5 
11334 Nc Highway 72 E ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Output example

https://regex101.com/r/WVPBji/1