Counting strings in R

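I have a dataset like the one below. I want to group the rows by journal and then count how many times each string occurs. Thank you very much.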

SO = c("Journal Of Business", "Journal Of Business", "Journal of Economy")

AU_UN = c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
          "Georgetown Univ;Fed Reserve Syst",
          "Georgetown Univ;Fed Reserve Syst")

df <- data.frame(SO, AU_UN)
df

Expected output

Journal Of Business      Dartmouth Coll (1);Wellesley Coll (2);Georgetown Univ (1);Fed Reserve Syst (1)
Journal of Economy       Georgetown Univ (1);Fed Reserve Syst (1)

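Using base::strsplit() we can extract the substrings. strsplit() returns a list containing one character vector per row. The resulting list-column (nested column) can be unnested with tidyr::unnest(). To get the frequency of each string within each journal, we use dplyr::count().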

library(tidyverse)
df %>% 
  mutate(strings  = strsplit(AU_UN, ";")) %>% 
  unnest(strings) %>% 
  count(SO, strings)
#> # A tibble: 6 x 3
#>   SO                  strings              n
#>   <chr>               <chr>            <int>
#> 1 Journal Of Business Dartmouth Coll       1
#> 2 Journal Of Business Fed Reserve Syst     1
#> 3 Journal Of Business Georgetown Univ      1
#> 4 Journal Of Business Wellesley Coll       2
#> 5 Journal of Economy  Fed Reserve Syst     1
#> 6 Journal of Economy  Georgetown Univ      1
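
To make the intermediate step more concrete, here is a small base R sketch of what strsplit() produces before unnesting. It only uses the AU_UN vector from the question; the name parts is chosen here for illustration, and fixed = TRUE is an optional addition that treats ";" as a literal string rather than a regular expression.

```r
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(AU_UN, ";", fixed = TRUE)

lengths(parts)  # number of pieces per row: 3 2 2
parts[[1]]      # "Dartmouth Coll" "Wellesley Coll" "Wellesley Coll"
```

Each list element becomes several rows once unnest() is applied, which is why the counted result has more rows than the original data frame.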

Use separate_rows to convert to long format, count the rows, and then collapse back to one row per journal with summarize.

library(dplyr)
library(tidyr)

df %>% 
  separate_rows(AU_UN, sep = ";") %>% 
  count(SO, AU_UN) %>% 
  group_by(SO) %>% 
  summarize(AU_UN = paste(sprintf("%s (%d)", AU_UN, n), collapse=";"), .groups = "drop")

giving:

# A tibble: 2 x 2
  SO                  AU_UN                                                                         
  <chr>               <chr>                                                                         
1 Journal Of Business Dartmouth Coll (1);Fed Reserve Syst (1);Georgetown Univ (1);Wellesley Coll (2)
2 Journal of Economy  Fed Reserve Syst (1);Georgetown Univ (1)
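
Note that the output above lists the institutions alphabetically within each journal, whereas the expected output in the question lists them in order of first appearance. If that order matters, one possible variation is the base R sketch below. It is only a sketch under that assumption: the helper count_one and the intermediate names parts and long are made up for illustration and are not part of any package.

```r
SO <- c("Journal Of Business", "Journal Of Business", "Journal of Economy")
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")
df <- data.frame(SO, AU_UN)

# Split every row on ";" and build a long data frame with one string per row
parts <- strsplit(as.character(df$AU_UN), ";", fixed = TRUE)
long <- data.frame(SO = rep(as.character(df$SO), lengths(parts)),
                   AU_UN = unlist(parts),
                   stringsAsFactors = FALSE)

# Hypothetical helper: count the strings of one group, keeping first-appearance order
count_one <- function(x) {
  x <- factor(x, levels = unique(x))
  tab <- table(x)
  paste(sprintf("%s (%d)", names(tab), as.integer(tab)), collapse = ";")
}

# One collapsed string per journal
res <- aggregate(AU_UN ~ SO, data = long, FUN = count_one)
res
```

Setting the factor levels to unique(x) is what keeps the strings in the order they first occur within each journal; without it, table() would sort them alphabetically.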