Using purrr to efficiently count regex matches in a large dataframe

Question

Edited to change the regexes and to show my tidyr/dplyr solution

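I'm looking for an efficient way (preferably with purrr) to search for and count a large number of regex patterns in a large data frame.

Here is a simple example of what I'm trying to achieve.

Say I have a data frame of sentences: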

library(stringr)
library(tibble)

dat <- tibble(id = 1:5,
              text = sentences[1:5])
dat

# A tibble: 5 x 2
     id text                                       
  <int> <chr>                                      
1     1 The birch canoe slid on the smooth planks. 
2     2 Glue the sheet to the dark blue background.
3     3 It's easy to tell the depth of a well.     
4     4 These days a chicken leg is a rare dish.   
5     5 Rice is often served in round bowls.

I also have a table of search patterns, categorised by type, with the corresponding regexes:

# Note the doubled backslashes: in an R string "\b" is a backspace, "\\b" is the regex word boundary
searches <- tibble(type = c("Article","Article","Preposition","Preposition","Preposition","Preposition"),
                   pattern = c("the","a","on","of","in","to"),
                   regex = c("\\b[Tt]he\\b", "\\b[Aa]\\b","\\b[Oo]n\\b","\\b[Oo]f\\b",
                             "\\b[Ii]n\\b","\\b[Tt]o\\b"))

searches

# A tibble: 6 x 3
  type        pattern regex           
  <chr>       <chr>   <chr>           
1 Article     the     "\\b[Tt]he\\b"
2 Article     a       "\\b[Aa]\\b"  
3 Preposition on      "\\b[Oo]n\\b" 
4 Preposition of      "\\b[Oo]f\\b" 
5 Preposition in      "\\b[Ii]n\\b" 
6 Preposition to      "\\b[Tt]o\\b"

I want to run every search pattern against every sentence and count the matches found, so that the output looks like:

# A tibble: 10 x 5
      id text                                        type        pattern count
   <int> <chr>                                       <chr>       <chr>   <int>
 1     1 The birch canoe slid on the smooth planks.  Article     the         2
 2     1 The birch canoe slid on the smooth planks.  Preposition on          1
 3     2 Glue the sheet to the dark blue background. Article     the         2
 4     2 Glue the sheet to the dark blue background. Preposition to          1
 5     3 It's easy to tell the depth of a well.      Article     the         1
 6     3 It's easy to tell the depth of a well.      Article     a           1
 7     3 It's easy to tell the depth of a well.      Preposition of          1
 8     3 It's easy to tell the depth of a well.      Preposition to          1
 9     4 These days a chicken leg is a rare dish.    Article     a           2
10     5 Rice is often served in round bowls.        Preposition in          1
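To make the counting step concrete, here is a minimal sketch for a single sentence and three of the patterns. It assumes only stringr and its built-in sentences vector; the variable name s is just for illustration.

```r
library(stringr)

# First sentence from stringr's built-in `sentences` vector:
# "The birch canoe slid on the smooth planks."
s <- sentences[1]

# "\\b" in an R string literal is the regex word boundary \b
n_the <- str_count(s, "\\b[Tt]he\\b")  # "The" and "the" -> 2
n_on  <- str_count(s, "\\b[Oo]n\\b")   # "on" -> 1
n_a   <- str_count(s, "\\b[Aa]\\b")    # no standalone "a" -> 0

c(the = n_the, on = n_on, a = n_a)
```

Rows with a count of 0, such as "a" here, are the ones I want dropped from the final output.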

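The real data and search table are orders of magnitude larger, so I'd like to avoid loops. I know there must be a way to do this with a couple of map calls or pmap, but I can't wrap my head around it.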

Added tidyr solution

This seems to work, but I'm wondering whether there is a purrr alternative that would be faster:

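library(dplyr)
library(tidyr)

# Every sentence/regex combination, counted in one vectorised str_count() call
crossing(text = dat$text, regex = searches$regex) %>%
  mutate(count = str_count(text, regex)) %>%
  inner_join(searches, ., by = "regex") %>%
  inner_join(dat, ., by = "text") %>%
  filter(count > 0) %>%
  select(-regex)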

Answer 1

You can try map_df:

library(tidyverse)

map_df(searches$regex, ~dat %>%
                      mutate(count = str_count(text, .x)) %>%
                      filter(count > 0)) %>%
  arrange(id)

#     id text                                        count
#   <int> <chr>                                       <int>
# 1     1 The birch canoe slid on the smooth planks.      2
# 2     1 The birch canoe slid on the smooth planks.      1
# 3     2 Glue the sheet to the dark blue background.     2
# 4     2 Glue the sheet to the dark blue background.     1
# 5     3 It's easy to tell the depth of a well.          1
# 6     3 It's easy to tell the depth of a well.          1
# 7     3 It's easy to tell the depth of a well.          1
# 8     3 It's easy to tell the depth of a well.          1
# 9     4 These days a chicken leg is a rare dish.        2
#10     5 Rice is often served in round bowls.            1
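The output above no longer shows which regex produced each row. If that is all you need alongside the count, one option is to name the input vector and pass `.id` to map_df. This is only a sketch: the object names regexes and res are made up for the example, and the data are rebuilt so it runs on its own.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

dat <- tibble(id = 1:5, text = sentences[1:5])
regexes <- c("\\b[Tt]he\\b", "\\b[Aa]\\b", "\\b[Oo]n\\b",
             "\\b[Oo]f\\b", "\\b[Ii]n\\b", "\\b[Tt]o\\b")

# set_names() names each regex after itself; .id stores that name
# in a new "regex" column of the row-bound result
res <- map_df(set_names(regexes),
              ~ dat %>%
                mutate(count = str_count(text, .x)) %>%
                filter(count > 0),
              .id = "regex") %>%
  arrange(id)

res
```

The result has the same rows as before plus a leading regex column.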

If you need all the information from the searches data frame, use pmap_df:

pmap_df(searches, ~dat %>%
                  mutate(type = ..1, 
                         pattern = ..2, 
                         count = str_count(text, ..3)) %>% 
       filter(count > 0)) %>%
  arrange(id)

#      id text                                        type        pattern count
#   <int> <chr>                                       <chr>       <chr>   <int>
# 1     1 The birch canoe slid on the smooth planks.  Article     the         2
# 2     1 The birch canoe slid on the smooth planks.  Preposition on          1
# 3     2 Glue the sheet to the dark blue background. Article     the         2
# 4     2 Glue the sheet to the dark blue background. Preposition to          1
# 5     3 It's easy to tell the depth of a well.      Article     the         1
# 6     3 It's easy to tell the depth of a well.      Article     a           1
# 7     3 It's easy to tell the depth of a well.      Preposition of          1
# 8     3 It's easy to tell the depth of a well.      Preposition to          1
# 9     4 These days a chicken leg is a rare dish.    Article     a           2
#10     5 Rice is often served in round bowls.        Preposition in          1
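In purrr 1.0.0 and later, map_df() and pmap_df() are superseded in favour of combining map()/pmap() with list_rbind(). A rough equivalent of the call above, assuming that purrr version is installed, might look like the sketch below; the name res is only for illustration and the data are rebuilt so the block is self-contained.

```r
library(dplyr)
library(purrr)   # assumes purrr >= 1.0.0 for list_rbind()
library(stringr)
library(tibble)

dat <- tibble(id = 1:5, text = sentences[1:5])
searches <- tibble(
  type    = c("Article", "Article", "Preposition",
              "Preposition", "Preposition", "Preposition"),
  pattern = c("the", "a", "on", "of", "in", "to"),
  regex   = c("\\b[Tt]he\\b", "\\b[Aa]\\b", "\\b[Oo]n\\b",
              "\\b[Oo]f\\b", "\\b[Ii]n\\b", "\\b[Tt]o\\b")
)

# pmap() calls the function once per row of `searches`,
# matching columns to arguments by name
res <- pmap(searches, function(type, pattern, regex) {
  dat %>%
    mutate(type = type,
           pattern = pattern,
           count = str_count(text, regex)) %>%
    filter(count > 0)
}) %>%
  list_rbind() %>%   # row-bind the list of per-pattern tibbles
  arrange(id)

res
```

Naming the function arguments after the columns of searches avoids the positional ..1, ..2, ..3 placeholders; whether it is any faster than the crossing() approach is something to measure on the real data.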
