Cleaning addresses - add last token in street name (Ave, St,..) where missing, based on other records

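In the sample data below, some addresses are missing the last 'token' that makes up the street name: ave, st, dr, etc. I'm geocoding with OSM, and I find that these records do get hits, but often in other countries. I'd like to clean them further by adding the most likely missing token, based on the other records in the data.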

library(dplyr)    # mutate(), %>%
library(stringr)  # str_split()

valid_ends <- c("AVE", "ST", "EXT", "BLVD")

data.frame(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
                       "934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
                       "42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
                       "65 WASHINGTON AVE EXT", "94 WASHINGTON AVE")) %>%
    mutate(addr_tokens = str_split(address, " ")) %>%
    mutate(addr_fix = NA_character_)  # placeholder for the cleaned address

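Desired result: a new character column ("addr_fix") added to the above, containing the "augmented" addresses for records 3, 5 and 7 (with "AVE", "AVE" and "ST" appended, respectively). A record is augmented when its last address token is not in valid_ends. The token appended is the one that occurs most often for that street, where streets are matched by stripping the leading numeric token and any valid end tokens from the addresses in the dataset.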

It's a bit messy, but this approach should work:

  1. First get the "core address" (the street name without a suffix), then copy the suffix / "valid end", if there is one, into `end`:
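# df is assumed to hold the example data with a single `address` column
valid_ends_rgx <- paste0(valid_ends, collapse = "|")

df2 <- df %>% 
  mutate(has_valid_end = str_detect(address, valid_ends_rgx),
         core_addr = 
           str_remove_all(address, valid_ends_rgx) %>%  # drop every suffix
           str_trim() %>% 
           str_remove("\\d+ "),                         # drop the house number
         end = str_match(address, valid_ends_rgx)[, 1]  # first suffix found
         ) 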

df2
# A tibble: 11 x 4
   address               has_valid_end core_addr  end  
   <chr>                 <lgl>         <chr>      <chr>
 1 75 NEW PARK AVE       TRUE          NEW PARK   AVE  
 2 245 NEW PARK AVE      TRUE          NEW PARK   AVE  
 3 42 NEW PARK           FALSE         NEW PARK   NA   
 4 934 NEW PARK ST       TRUE          NEW PARK   ST   
 5 394 NEW PARK          FALSE         NEW PARK   NA   
 6 34 ASYLUM ST          TRUE          ASYLUM     ST   
 7 42 ASYLUM             FALSE         ASYLUM     NA   
 8 953 ASYLUM AVE        TRUE          ASYLUM     AVE  
 9 23 ASYLUM ST          TRUE          ASYLUM     ST   
10 65 WASHINGTON AVE EXT TRUE          WASHINGTON AVE  
11 94 WASHINGTON AVE     TRUE          WASHINGTON AVE  
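
One caveat with the pattern above: the plain alternation `AVE|ST|EXT|BLVD` also matches inside longer words, so a street such as WESTERN would be treated as ending in ST. Here is a minimal sketch of a stricter pattern, assuming suffixes always appear as whole space-separated tokens. The name `valid_ends_strict` and the test addresses are made up for illustration and are not part of the sample data.

```r
library(stringr)

valid_ends <- c("AVE", "ST", "EXT", "BLVD")

# Wrap the alternation in word boundaries so only whole tokens match.
valid_ends_strict <- paste0("\\b(", paste0(valid_ends, collapse = "|"), ")\\b")

# Hypothetical addresses, not part of the sample data.
test_addr <- c("12 WESTERN", "12 WESTERN AVE", "7 STATE ST")

# The loose pattern flags all three, because "ST" occurs inside WESTERN and STATE.
loose_hit  <- str_detect(test_addr, paste0(valid_ends, collapse = "|"))

# The strict pattern only flags addresses that carry a standalone suffix token.
strict_hit <- str_detect(test_addr, valid_ends_strict)
strict_end <- str_match(test_addr, valid_ends_strict)[, 1]
```

On the sample data in the question both patterns should behave the same, since none of those street names contains a suffix as a substring.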
  2. Find the most common valid ending for each street:
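replacements <- df2 %>% 
  filter(!is.na(end)) %>%          # rows with no ending must not win the count
  group_by(core_addr, end) %>% 
  summarise(end_ct = n()) %>%      # how often each ending occurs per street
  group_by(core_addr) %>% 
  summarise(most_end = end[which.max(end_ct)])  # ties go to the first ending

replacements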
# A tibble: 3 x 2
  core_addr  most_end
  <chr>      <chr>   
1 ASYLUM     ST      
2 NEW PARK   AVE     
3 WASHINGTON AVE     
  3. Update the `address` field where the ending is missing, using the `most_end` field from `replacements`:
df2 %>% 
  left_join(replacements, by = "core_addr") %>% 
  transmute(
    address = if_else(has_valid_end, address, str_c(address, most_end, sep = " "))
  )

# A tibble: 11 x 1
   address              
   <chr>                
 1 75 NEW PARK AVE      
 2 245 NEW PARK AVE     
 3 42 NEW PARK AVE      
 4 934 NEW PARK ST      
 5 394 NEW PARK AVE     
 6 34 ASYLUM ST         
 7 42 ASYLUM ST         
 8 953 ASYLUM AVE       
 9 23 ASYLUM ST         
10 65 WASHINGTON AVE EXT
11 94 WASHINGTON AVE
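
The three steps can also be wrapped into one function that writes to `addr_fix`, as the question asks, instead of overwriting `address`. This is only a sketch under a few assumptions: `fix_addresses` is a hypothetical helper name, suffixes are matched as whole tokens, and a street with no known ending in the data keeps its original address rather than becoming `NA`.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not from the answer above): runs the three steps and
# stores the result in `addr_fix`, leaving `address` untouched.
fix_addresses <- function(df, valid_ends) {
  # Whole-token match, so "ST" inside a longer word is not treated as a suffix.
  ends_rgx <- paste0("\\b(", paste0(valid_ends, collapse = "|"), ")\\b")

  # Step 1: core street name and the first suffix found, if any.
  keyed <- df %>%
    mutate(
      has_valid_end = str_detect(address, ends_rgx),
      core_addr = address %>%
        str_remove_all(ends_rgx) %>%
        str_remove("^\\d+ ") %>%
        str_squish(),
      end = str_match(address, ends_rgx)[, 1]
    )

  # Step 2: most frequent known ending per street; missing endings are ignored.
  replacements <- keyed %>%
    filter(!is.na(end)) %>%
    count(core_addr, end, name = "end_ct") %>%
    group_by(core_addr) %>%
    summarise(most_end = end[which.max(end_ct)])

  # Step 3: append the ending only where one is missing and a candidate exists.
  keyed %>%
    left_join(replacements, by = "core_addr") %>%
    mutate(addr_fix = if_else(
      has_valid_end | is.na(most_end),
      address,
      str_c(address, most_end, sep = " ")
    )) %>%
    select(-has_valid_end, -core_addr, -end, -most_end)
}

# Usage with the sample addresses from the question.
df <- tibble(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
                         "934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
                         "42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
                         "65 WASHINGTON AVE EXT", "94 WASHINGTON AVE"))

fixed <- fix_addresses(df, c("AVE", "ST", "EXT", "BLVD"))
```

Keeping `address` alongside `addr_fix` makes it easy to review which records were changed before sending them to the geocoder.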