Using purrr to efficiently count regex matches in a large dataframe
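Edited to change the regexes and to show my tidyr/dplyr solution.
I'm looking for an efficient way (preferably with purrr) to search for and count a large number of regex patterns in a large data frame.
Here is a simple example of what I'm trying to achieve.
Suppose I have a data frame of sentences: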
library(stringr)  # provides `sentences` and str_count()
library(tibble)   # provides tibble()

dat <- tibble(id = 1:5,
              text = sentences[1:5])
dat
# A tibble: 5 x 2
     id text
  <int> <chr>
1     1 The birch canoe slid on the smooth planks.
2     2 Glue the sheet to the dark blue background.
3     3 It's easy to tell the depth of a well.
4     4 These days a chicken leg is a rare dish.
5     5 Rice is often served in round bowls.
I also have a table of search patterns, categorized by type, with the corresponding regular expressions:
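# Backslashes must be doubled in R string literals: "\\b" is a word boundary,
# whereas "\b" would be a literal backspace character.
searches <- tibble(type = c("Article", "Article", "Preposition", "Preposition", "Preposition", "Preposition"),
                   pattern = c("the", "a", "on", "of", "in", "to"),
                   regex = c("\\b[Tt]he\\b", "\\b[Aa]\\b", "\\b[Oo]n\\b", "\\b[Oo]f\\b",
                             "\\b[Ii]n\\b", "\\b[Tt]o\\b"))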
searches
# A tibble: 6 x 3
  type        pattern regex
  <chr>       <chr>   <chr>
1 Article     the     "\\b[Tt]he\\b"
2 Article     a       "\\b[Aa]\\b"
3 Preposition on      "\\b[Oo]n\\b"
4 Preposition of      "\\b[Oo]f\\b"
5 Preposition in      "\\b[Ii]n\\b"
6 Preposition to      "\\b[Tt]o\\b"
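As a quick sanity check of the escaping, here is a minimal sketch that runs one of the patterns against a single sentence. The variable names (`the_regex`, `s`, `n_the`, `n_the_ci`) are only illustrative, and the case-insensitive variant using stringr's `regex()` helper is an alternative spelling, not something the tables above require.

```r
library(stringr)

# Word-boundary pattern; note the doubled backslashes in the R string literal.
the_regex <- "\\b[Tt]he\\b"

# First sentence of `dat`, typed out so the snippet stands alone.
s <- "The birch canoe slid on the smooth planks."

# Counts whole-word occurrences of "the"/"The" only.
n_the <- str_count(s, the_regex)

# Same count using a case-insensitive regex instead of a character class.
n_the_ci <- str_count(s, regex("\\bthe\\b", ignore_case = TRUE))

# A word that merely contains "the" should not match.
n_other <- str_count("other", the_regex)
```

With a single backslash (`"\b"`), R would store a backspace character rather than a word-boundary assertion, so the pattern would silently match nothing in ordinary text.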
I want to run every search pattern against every sentence and count the matches found, so that the output looks like:
# A tibble: 10 x 5
      id sentence                                    type        pattern count
   <int> <chr>                                       <chr>       <chr>   <int>
 1     1 The birch canoe slid on the smooth planks.  Article     the         2
 2     1 The birch canoe slid on the smooth planks.  Preposition on          1
 3     2 Glue the sheet to the dark blue background. Article     the         2
 4     2 Glue the sheet to the dark blue background. Preposition to          1
 5     3 It's easy to tell the depth of a well.      Article     the         1
 6     3 It's easy to tell the depth of a well.      Article     a           1
 7     3 It's easy to tell the depth of a well.      Preposition of          1
 8     3 It's easy to tell the depth of a well.      Preposition to          1
 9     4 These days a chicken leg is a rare dish.    Article     a           2
10     5 Rice is often served in round bowls.        Preposition in          1
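The real data and search table are orders of magnitude larger, so I'd like to avoid loops. I know there must be a way to do this with a couple of map calls or with pmap, but I can't wrap my head around it.
Added a tidyr solution.
This seems to work, but I'd like to know whether there is a purrr alternative that would be faster: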
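library(dplyr)
library(tidyr)

# Pair every sentence with every regex, count the matches,
# then join the search metadata and the sentence ids back on.
crossing(text = dat$text, regex = searches$regex) %>%
  mutate(count = str_count(text, regex)) %>%
  inner_join(searches, ., by = "regex") %>%
  inner_join(dat, ., by = "text") %>%
  filter(count > 0) %>%
  select(-regex)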
You can try map_df:
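library(tidyverse)

# For each regex, count matches in every sentence and keep the non-zero rows;
# map_df() row-binds the per-regex results into one data frame.
map_df(searches$regex, ~ dat %>%
         mutate(count = str_count(text, .x)) %>%
         filter(count > 0)) %>%
  arrange(id)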
#      id text                                        count
#   <int> <chr>                                       <int>
# 1     1 The birch canoe slid on the smooth planks.      2
# 2     1 The birch canoe slid on the smooth planks.      1
# 3     2 Glue the sheet to the dark blue background.     2
# 4     2 Glue the sheet to the dark blue background.     1
# 5     3 It's easy to tell the depth of a well.          1
# 6     3 It's easy to tell the depth of a well.          1
# 7     3 It's easy to tell the depth of a well.          1
# 8     3 It's easy to tell the depth of a well.          1
# 9     4 These days a chicken leg is a rare dish.        2
#10     5 Rice is often served in round bowls.            1
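The result above does not record which regex produced each row. One possible way to keep that information, sketched here under the assumption that purrr 1.0.0 or later is installed (where `map_df()` is superseded in favour of `map()` plus `list_rbind()`), is to name the regex vector and let `list_rbind(names_to = ...)` turn the names into a column. The names `regexes` and `counts` are illustrative only.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

dat <- tibble(id = 1:5, text = sentences[1:5])

# Named character vector: the names become the `pattern` column below.
regexes <- c(the = "\\b[Tt]he\\b", a = "\\b[Aa]\\b", on = "\\b[Oo]n\\b",
             of = "\\b[Oo]f\\b", `in` = "\\b[Ii]n\\b", to = "\\b[Tt]o\\b")

counts <- regexes %>%
  # One filtered copy of `dat` per regex; map() keeps the vector's names.
  map(~ dat %>%
        mutate(count = str_count(text, .x)) %>%
        filter(count > 0)) %>%
  # Row-bind the list and store each element's name in `pattern`.
  list_rbind(names_to = "pattern") %>%
  arrange(id)
```

This carries only one label per regex; when several columns of metadata are needed, iterating over the rows of the search table is the more natural fit.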
If you need all of the information from the searches data frame, use pmap_df:
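# pmap_df() iterates over the rows of `searches`; ..1, ..2 and ..3 are the
# type, pattern and regex columns of the current row.
pmap_df(searches, ~ dat %>%
          mutate(type = ..1,
                 pattern = ..2,
                 count = str_count(text, ..3)) %>%
          filter(count > 0)) %>%
  arrange(id)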
#      id text                                        type        pattern count
#   <int> <chr>                                       <chr>       <chr>   <int>
# 1     1 The birch canoe slid on the smooth planks.  Article     the         2
# 2     1 The birch canoe slid on the smooth planks.  Preposition on          1
# 3     2 Glue the sheet to the dark blue background. Article     the         2
# 4     2 Glue the sheet to the dark blue background. Preposition to          1
# 5     3 It's easy to tell the depth of a well.      Article     the         1
# 6     3 It's easy to tell the depth of a well.      Article     a           1
# 7     3 It's easy to tell the depth of a well.      Preposition of          1
# 8     3 It's easy to tell the depth of a well.      Preposition to          1
# 9     4 These days a chicken leg is a rare dish.    Article     a           2
#10     5 Rice is often served in round bowls.        Preposition in          1
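For comparison, here is a rough sketch of a different technique that skips the per-pattern data frames entirely: base R's `outer()` hands `str_count()` every sentence/regex pair in one vectorised call and returns a count matrix. This is an alternative not proposed in the question or the answer, the names `m`, `hits` and `result` are illustrative, and whether it is actually faster on large inputs is untested here. Like `crossing()`, it materialises all sentence × pattern combinations, so memory use grows with the product of the two table sizes.

```r
library(stringr)
library(tibble)

dat <- tibble(id = 1:5, text = sentences[1:5])
searches <- tibble(
  type    = c("Article", "Article", rep("Preposition", 4)),
  pattern = c("the", "a", "on", "of", "in", "to"),
  regex   = c("\\b[Tt]he\\b", "\\b[Aa]\\b", "\\b[Oo]n\\b",
              "\\b[Oo]f\\b", "\\b[Ii]n\\b", "\\b[Tt]o\\b")
)

# Count matrix: one row per sentence, one column per regex.
m <- outer(dat$text, searches$regex, str_count)

# (row, col) index pairs of the non-zero counts.
hits <- which(m > 0, arr.ind = TRUE)

# Look up sentence and search metadata by those indices.
result <- tibble(
  id      = dat$id[hits[, "row"]],
  text    = dat$text[hits[, "row"]],
  type    = searches$type[hits[, "col"]],
  pattern = searches$pattern[hits[, "col"]],
  count   = m[hits]
)

# Sort by sentence id; ties keep the order of `searches`.
result <- result[order(result$id), ]
```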