Counting strings in R
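I have a dataset as shown below. I want to group by journal (SO) and then count how often each string occurs in AU_UN. Thank you very much.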
SO = c("Journal Of Business", "Journal Of Business", "Journal of Economy")
AU_UN = c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
"Georgetown Univ;Fed Reserve Syst",
"Georgetown Univ;Fed Reserve Syst")
df <- data.frame(SO, AU_UN);df
Expected answer
Journal Of Business Dartmouth Coll (1);Wellesley Coll (2); Georgetown Univ (1);Fed Reserve Syst (1)
Journal of Economy Georgetown Univ (1); Fed Reserve Syst (1)
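Using base::strsplit() we can extract the "substrings". strsplit() returns a list containing one vector of strings per row. The new list-column (or nested column) can be unnested with tidyr::unnest(). To get the frequency of each string for each journal, we use dplyr::count().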
library(tidyverse)
df %>%
mutate(strings = strsplit(AU_UN, ";")) %>%
unnest(strings) %>%
count(SO, strings)
#> # A tibble: 6 x 3
#> SO strings n
#> <chr> <chr> <int>
#> 1 Journal Of Business Dartmouth Coll 1
#> 2 Journal Of Business Fed Reserve Syst 1
#> 3 Journal Of Business Georgetown Univ 1
#> 4 Journal Of Business Wellesley Coll 2
#> 5 Journal of Economy Fed Reserve Syst 1
#> 6 Journal of Economy Georgetown Univ 1
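To make the intermediate list-column more concrete, here is a minimal sketch that inspects what strsplit() returns for the question's AU_UN vector before any unnesting happens. The variable name parts is only illustrative and does not appear in the answer above.

```r
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")

# One character vector per input row, collected in a list
# (fixed = TRUE treats ";" as a literal string, not a regex)
parts <- strsplit(AU_UN, ";", fixed = TRUE)

lengths(parts)  # number of substrings in each row
parts[[1]]      # substrings of the first row, duplicates kept
```

Each list element becomes several rows once unnest() is applied, which is why the counts are taken after unnesting.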
Use separate_rows to convert to long format, count the rows, and then convert back using summarize.
library(dplyr)
library(tidyr)
df %>%
separate_rows(AU_UN, sep = ";") %>%
count(SO, AU_UN) %>%
group_by(SO) %>%
summarize(AU_UN = paste(sprintf("%s (%d)", AU_UN, n), collapse=";"), .groups = "drop")
giving:
# A tibble: 2 x 2
SO AU_UN
<chr> <chr>
1 Journal Of Business Dartmouth Coll (1);Fed Reserve Syst (1);Georgetown Univ (1);Wellesley Coll (2)
2 Journal of Economy Fed Reserve Syst (1);Georgetown Univ (1)
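For comparison, the same split, count, and collapse idea can be sketched in base R alone, assuming no tidyverse packages are available. The helper fmt and the names parts, long, counts, and result are hypothetical and are not part of either answer above. Like count(), table() orders the strings alphabetically, so the order differs from the one shown in the expected answer.

```r
SO <- c("Journal Of Business", "Journal Of Business", "Journal of Economy")
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")
df <- data.frame(SO, AU_UN, stringsAsFactors = FALSE)

# Split every row into its substrings (a list, one vector per row)
parts <- strsplit(df$AU_UN, ";", fixed = TRUE)

# Long format: repeat each journal once per substring so both columns align
long <- data.frame(SO = rep(df$SO, lengths(parts)),
                   AU_UN = unlist(parts),
                   stringsAsFactors = FALSE)

# Hypothetical helper: tabulate a vector and format it as "name (n);name (n)"
fmt <- function(x) {
  tab <- table(x)
  paste(sprintf("%s (%d)", names(tab), as.integer(tab)), collapse = ";")
}

# Apply the helper within each journal, then rebuild a two-column data frame
counts <- tapply(long$AU_UN, long$SO, fmt)
result <- data.frame(SO = names(counts),
                     AU_UN = as.vector(counts),
                     stringsAsFactors = FALSE)
result
```

The tidyverse version is shorter and easier to read; this variant mainly shows that strsplit(), table(), and tapply() cover the same steps.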