regex/stringr: splitting a joined sequence of country names
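I have a string containing multiple country names. The names are not separated by any pattern, other than a capital letter following a lowercase letter without a space (but a space is part of some country names, e.g. Democratic Republic of the Congo).
My stringr/regex attempt is close, but I am losing the first letter of the second and subsequent country names. Any help? Much appreciated.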
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.3
v <- structure(list(countries = c("Democratic Republic of the CongoSweden",
"DenmarkIran (Islamic Republic of)", "AfghanistanSweden", "AzerbaijanSwedenGermany",
"BangladeshSweden", "DenmarkSri Lanka", "CanadaSri Lanka", "DenmarkNigeria",
"CanadaIreland", "CanadaMexico")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))
v %>%
  mutate(index=row_number()) %>%
  #mutate(countries_split=str_split(countries, "[A-Z][a-z]*[a-z:space:]+(?=[A-Z])")) %>%
  #mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+[A-Z][a-z]{1,20}+).")) %>%
  mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")) %>%
  unnest(countries_split)
#> # A tibble: 21 x 3
#>    countries                              index countries_split
#>    <chr>                                  <int> <chr>
#>  1 Democratic Republic of the CongoSweden     1 Democratic Republic of the Congo
#>  2 Democratic Republic of the CongoSweden     1 weden
#>  3 DenmarkIran (Islamic Republic of)          2 Denmark
#>  4 DenmarkIran (Islamic Republic of)          2 ran (Islamic Republic of)
#>  5 AfghanistanSweden                          3 Afghanistan
#>  6 AfghanistanSweden                          3 weden
#>  7 AzerbaijanSwedenGermany                    4 Azerbaijan
#>  8 AzerbaijanSwedenGermany                    4 weden
#>  9 AzerbaijanSwedenGermany                    4 ermany
#> 10 BangladeshSweden                           5 Bangladesh
#> # ... with 11 more rows
Created on 2020-03-06 by the reprex package (v0.3.0)
We can use a positive lookahead for the second part, so the capital letter is checked but not consumed by the split.
library(tidyverse)
v %>%
  mutate(row = row_number(),
         countries = str_split(countries,
                "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")) %>%
  unnest(countries)
# A tibble: 21 x 2
#   countries                          row
#   <chr>                            <int>
# 1 Democratic Republic of the Congo     1
# 2 Sweden                               1
# 3 Denmark                              2
# 4 Iran (Islamic Republic of)           2
# 5 Afghanistan                          3
# 6 Sweden                               3
# 7 Azerbaijan                           4
# 8 Sweden                               4
# 9 Germany                              4
#10 Bangladesh                           5
# … with 11 more rows
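To see why the original attempt drops letters: `str_split()` discards whatever the pattern matches, so a trailing `[A-Z]` is removed together with the split point, whereas a lookahead only asserts a position. As far as I can tell, `[a-z:space:]` is not a whitespace class either (that would be written `[a-z[:space:]]`), which fits the output above, where names containing spaces stay in one piece. If that reading is right, the split effectively happens wherever a lowercase letter is directly followed by a capital, and a shorter pattern may be enough. The sketch below is my own simplification, not the pattern from the question; the vector `x` is just a few of the sample rows, and it assumes every boundary is a lowercase letter followed by a capital.

```r
library(stringr)

# A few of the sample strings from the question (illustrative subset).
x <- c("Democratic Republic of the CongoSweden",
       "DenmarkIran (Islamic Republic of)",
       "AzerbaijanSwedenGermany",
       "DenmarkSri Lanka")

# Consuming pattern: the capital letter is part of the match,
# so str_split() throws it away along with the split point.
consumed <- str_split(x, "(?<=[a-z])[A-Z]")

# Zero-width pattern: lookbehind plus lookahead match only the position
# between a lowercase letter and a capital, so nothing is removed.
kept <- str_split(x, "(?<=[a-z])(?=[A-Z])")

consumed[[3]]
# "Azerbaijan" "weden" "ermany"

kept[[3]]
# "Azerbaijan" "Sweden" "Germany"

kept[[4]]
# "Denmark" "Sri Lanka"
```

One caveat with this simplified version: a name ending in `)` that is followed directly by another country would not be split, because the character before the capital is not a lowercase letter. Adding `)` to the lookbehind class, e.g. `(?<=[a-z)])`, is one way to cover that case if it occurs in your data.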