Cleaning addresses - add last token in street name (Ave, St, ...) where missing, based on other records
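In the sample data below, some addresses are missing the last token of the street name (AVE, ST, DR, etc.). I'm geocoding with OSM, and I've found that these records do get hits, but often in other countries. I'd like to clean them further by appending the most likely missing token, based on other records in the data.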
library(dplyr)
library(stringr)

valid_ends <- c("AVE", "ST", "EXT", "BLVD")

data.frame(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
                       "934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
                       "42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
                       "65 WASHINGTON AVE EXT", "94 WASHINGTON AVE")) %>%
  mutate(addr_tokens = str_split(address, " ")) %>%
  mutate(addr_fix = NA)
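Desired result: a new character column ("addr_fix") added to the above, containing the "augmented" addresses for records 3, 5 and 7 (with "AVE", "AVE" and "ST" appended, respectively). An address is augmented when its last token is not in valid_ends. The token appended is the one that occurs most often for that street, where streets are matched after removing the leading numeric token and any valid end tokens from the addresses in the dataset.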
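It's a bit messy, but this approach should work:
- First get the "core address" (the street name without its suffix), then copy the suffix / "valid end", if there is one, into `end`: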
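# `df` holds the question's data; only the `address` column is used here
valid_ends_rgx <- paste0(valid_ends, collapse = "|")

df2 <- df %>%
  mutate(has_valid_end = str_detect(address, valid_ends_rgx),
         core_addr =
           str_remove_all(address, valid_ends_rgx) %>%
           str_trim() %>%
           str_remove("\\d+ "),   # drop the house number (note the doubled backslash)
         end = str_match(address, valid_ends_rgx)[, 1]   # first suffix found, NA if none
  )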
df2
# A tibble: 11 x 4
   address               has_valid_end core_addr  end
   <chr>                 <lgl>         <chr>      <chr>
 1 75 NEW PARK AVE       TRUE          NEW PARK   AVE
 2 245 NEW PARK AVE      TRUE          NEW PARK   AVE
 3 42 NEW PARK           FALSE         NEW PARK   NA
 4 934 NEW PARK ST       TRUE          NEW PARK   ST
 5 394 NEW PARK          FALSE         NEW PARK   NA
 6 34 ASYLUM ST          TRUE          ASYLUM     ST
 7 42 ASYLUM             FALSE         ASYLUM     NA
 8 953 ASYLUM AVE        TRUE          ASYLUM     AVE
 9 23 ASYLUM ST          TRUE          ASYLUM     ST
10 65 WASHINGTON AVE EXT TRUE          WASHINGTON AVE
11 94 WASHINGTON AVE     TRUE          WASHINGTON AVE
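One caveat with the pattern above: `AVE|ST|EXT|BLVD` matches anywhere in the string, so a street such as WESTON would be cut down to "WEON". Below is a minimal sketch of a stricter variant, assuming suffixes always appear as whole, space-separated words. The name `valid_ends_rgx_wb` and the sample addresses are introduced here for illustration and are not part of the original answer.

```r
library(stringr)

valid_ends <- c("AVE", "ST", "EXT", "BLVD")

# Hypothetical variant: wrap the alternation in word boundaries so that
# only whole tokens are treated as suffixes.
valid_ends_rgx_wb <- paste0("\\b(", paste0(valid_ends, collapse = "|"), ")\\b")

# Illustrative addresses (WESTON is made up to show the substring problem)
addresses <- c("12 WESTON ST", "42 NEW PARK", "65 WASHINGTON AVE EXT")

has_valid_end <- str_detect(addresses, valid_ends_rgx_wb)

# Remove whole-word suffixes, collapse leftover whitespace,
# then drop the leading house number.
core_addr <- str_remove_all(addresses, valid_ends_rgx_wb)
core_addr <- str_squish(core_addr)
core_addr <- str_remove(core_addr, "^\\d+ ")

# Column 1 of str_match() is the full match: the first suffix, or NA if none.
end <- str_match(addresses, valid_ends_rgx_wb)[, 1]
```

On the question's sample data this should behave the same as the unanchored pattern, since none of those street names contain a suffix as a substring.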
- Find the most common valid ending for each street:
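replacements <- df2 %>%
  filter(!is.na(end)) %>%                       # skip addresses with no suffix so NA can never win
  group_by(core_addr, end) %>%
  summarise(end_ct = n(), .groups = "drop") %>%
  group_by(core_addr) %>%
  summarise(most_end = end[which.max(end_ct)])  # on a tie, the first suffix listed is used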
# A tibble: 3 x 2
  core_addr  most_end
  <chr>      <chr>
1 ASYLUM     ST
2 NEW PARK   AVE
3 WASHINGTON AVE
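Because a street like NEW PARK appears with both AVE and ST, the majority suffix is only a guess. As a rough sketch, one could also record how dominant the winning suffix is and review low-share streets by hand. The `end_counts` table below is typed in from the sample data (suffix counts per street, rows without a suffix excluded), and `top_share` is a name introduced here for illustration.

```r
library(dplyr)

# Suffix counts per street, copied from the sample data
# (addresses without a suffix are left out).
end_counts <- data.frame(
  core_addr = c("ASYLUM", "ASYLUM", "NEW PARK", "NEW PARK", "WASHINGTON"),
  end       = c("AVE", "ST", "AVE", "ST", "AVE"),
  end_ct    = c(1L, 2L, 2L, 1L, 2L)
)

# For each street: the most frequent suffix and its share of all
# suffixed records for that street.
confidence <- end_counts %>%
  group_by(core_addr) %>%
  summarise(
    most_end  = end[which.max(end_ct)],
    top_share = max(end_ct) / sum(end_ct),
    .groups   = "drop"
  )

confidence
```

Here ASYLUM and NEW PARK each come out with a share of 2/3, while WASHINGTON is unambiguous at 1.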
- Update the `address` field where the ending is missing, using the `most_end` field from `replacements`.
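df2 %>%
  left_join(replacements, by = "core_addr") %>%
  transmute(
    # append the street's most common suffix; keep the address unchanged if it
    # already has a suffix or if no suffix is known for that street
    address = if_else(has_valid_end | is.na(most_end), address,
                      str_c(address, most_end, sep = " "))
  )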
# A tibble: 11 x 1
   address
   <chr>
 1 75 NEW PARK AVE
 2 245 NEW PARK AVE
 3 42 NEW PARK AVE
 4 934 NEW PARK ST
 5 394 NEW PARK AVE
 6 34 ASYLUM ST
 7 42 ASYLUM ST
 8 953 ASYLUM AVE
 9 23 ASYLUM ST
10 65 WASHINGTON AVE EXT
11 94 WASHINGTON AVE
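The question asked for the result in a new `addr_fix` column rather than an overwritten `address`. The three steps could be wrapped up roughly as follows. This is only a sketch: the function name `add_missing_suffix` is hypothetical, it uses a word-boundary version of the suffix pattern, and it assumes the input has a character column called `address`.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not from the original answer): appends the most
# frequent suffix seen for the same street to addresses that lack one.
add_missing_suffix <- function(df, valid_ends) {
  rgx <- paste0("\\b(", paste0(valid_ends, collapse = "|"), ")\\b")

  # Step 1: split each address into core street name and first suffix
  df2 <- df %>%
    mutate(
      has_valid_end = str_detect(address, rgx),
      core_addr = address %>%
        str_remove_all(rgx) %>%
        str_squish() %>%
        str_remove("^\\d+ "),
      end = str_match(address, rgx)[, 1]
    )

  # Step 2: most common non-missing suffix per street
  # (on a tie, the first suffix in the counted table is used)
  replacements <- df2 %>%
    filter(!is.na(end)) %>%
    count(core_addr, end, name = "end_ct") %>%
    group_by(core_addr) %>%
    summarise(most_end = end[which.max(end_ct)], .groups = "drop")

  # Step 3: fill in addr_fix, leaving addresses untouched when they already
  # have a suffix or when no suffix is known for the street
  df2 %>%
    left_join(replacements, by = "core_addr") %>%
    mutate(addr_fix = if_else(
      has_valid_end | is.na(most_end),
      address,
      str_c(address, most_end, sep = " ")
    )) %>%
    select(-c(has_valid_end, core_addr, end, most_end))
}

# Usage with the question's sample data
valid_ends <- c("AVE", "ST", "EXT", "BLVD")
df <- data.frame(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
                             "934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
                             "42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
                             "65 WASHINGTON AVE EXT", "94 WASHINGTON AVE"))

result <- add_missing_suffix(df, valid_ends)
result$addr_fix[c(3, 5, 7)]
```

For records 3, 5 and 7 this should give "42 NEW PARK AVE", "394 NEW PARK AVE" and "42 ASYLUM ST", matching the desired result in the question; all other addresses pass through unchanged.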