regex/stringr: splitting a joined sequence of country names

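I have a string containing several country names. The names are not separated by any delimiter; the only pattern is that an uppercase letter follows a lowercase letter with no space in between. Spaces cannot be used as separators, because they are part of some country names (e.g. Democratic Republic of the Congo).

My stringr/regex attempts come close, but I lose the first letter of the second and every subsequent country name. Any help? Many thanks.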

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.3
v <- structure(list(countries = c("Democratic Republic of the CongoSweden", 
                             "DenmarkIran (Islamic Republic of)", "AfghanistanSweden", "AzerbaijanSwedenGermany", 
                             "BangladeshSweden", "DenmarkSri Lanka", "CanadaSri Lanka", "DenmarkNigeria", 
                             "CanadaIreland", "CanadaMexico")), class = c("tbl_df", "tbl", 
                                                                          "data.frame"), row.names = c(NA, -10L))



v %>% 
  mutate(index=row_number()) %>% 
  #mutate(countries_split=str_split(countries, "[A-Z][a-z]*[a-z:space:]+(?=[A-Z])")) %>%
  #mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+[A-Z][a-z]{1,20}+).")) %>% 
  mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")) %>% 
  unnest(countries_split)
#> # A tibble: 21 x 3
#>    countries                              index countries_split                 
#>    <chr>                                  <int> <chr>                           
#>  1 Democratic Republic of the CongoSweden     1 Democratic Republic of the Congo
#>  2 Democratic Republic of the CongoSweden     1 weden                           
#>  3 DenmarkIran (Islamic Republic of)          2 Denmark                         
#>  4 DenmarkIran (Islamic Republic of)          2 ran (Islamic Republic of)       
#>  5 AfghanistanSweden                          3 Afghanistan                     
#>  6 AfghanistanSweden                          3 weden                           
#>  7 AzerbaijanSwedenGermany                    4 Azerbaijan                      
#>  8 AzerbaijanSwedenGermany                    4 weden                           
#>  9 AzerbaijanSwedenGermany                    4 ermany                          
#> 10 BangladeshSweden                           5 Bangladesh                      
#> # ... with 11 more rows

Created on 2020-03-06 by the reprex package (v0.3.0)

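The first letter is lost because the trailing [A-Z] is part of the split pattern, so str_split() consumes it as the delimiter. We can put that uppercase letter in a positive lookahead, (?=[A-Z]), so that it is asserted but not consumed: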

library(tidyverse)

v %>%
  mutate(row = row_number(), 
         countries = str_split(countries, 
                   "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")) %>%
  unnest(countries)

# A tibble: 21 x 2
#   countries                          row
#   <chr>                            <int>
# 1 Democratic Republic of the Congo     1
# 2 Sweden                               1
# 3 Denmark                              2
# 4 Iran (Islamic Republic of)           2
# 5 Afghanistan                          3
# 6 Sweden                               3
# 7 Azerbaijan                           4
# 8 Sweden                               4
# 9 Germany                              4
#10 Bangladesh                           5
# … with 11 more rows
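
To make the difference between the two patterns easier to see, here is a minimal sketch on two of the sample strings. It assumes that a shorter boundary pattern, a lowercase letter immediately followed by an uppercase letter, is enough for this data; the vector `x` and the names `consumed` and `kept` are made up for illustration and are not part of the original code.

```r
library(stringr)

# Two strings taken from the sample data in the question
x <- c("AzerbaijanSwedenGermany", "DenmarkIran (Islamic Republic of)")

# The uppercase letter is part of the match, so str_split() treats it
# as the delimiter and drops it from the result.
consumed <- str_split(x, "(?<=[a-z])[A-Z]")

# Lookbehind + lookahead match an empty position between the two letters,
# so nothing is removed from the pieces.
kept <- str_split(x, "(?<=[a-z])(?=[A-Z])")

consumed[[1]]
#> [1] "Azerbaijan" "weden"      "ermany"
kept[[1]]
#> [1] "Azerbaijan" "Sweden"     "Germany"
kept[[2]]
#> [1] "Denmark"                    "Iran (Islamic Republic of)"
```

This shorter pattern would not split after a name that ends in something other than a lowercase letter, such as a closing parenthesis, so it is only a sketch for data shaped like the sample. As a side note, the `[a-z:space:]` part of the longer pattern does not appear to act as a POSIX space class (that would be written `[[:space:]]`), which is consistent with multi-word names such as Sri Lanka staying intact in the output above.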