Using purrr to iteratively replace strings in a dataframe column

Question

I want to use purrr to iteratively run multiple string replacements on a data frame column with gsub().

Here is an example data frame:

df <- data.frame(Year = "2019",
                 Text = c(rep("a aa", 5), 
                          rep("a bb", 3), 
                          rep("a cc", 2)))

> df
   Year Text
1  2019 a aa
2  2019 a aa
3  2019 a aa
4  2019 a aa
5  2019 a aa
6  2019 a bb
7  2019 a bb
8  2019 a bb
9  2019 a cc
10 2019 a cc

This is how I would normally run the string replacements, followed by the desired result:

df$Text <- gsub("aa", "One", df$Text, fixed = T)
df$Text <- gsub("bb", "Two", df$Text, fixed = T)
df$Text <- gsub("cc", "Three", df$Text, fixed = T)

> df
   Year    Text
1  2019   a One
2  2019   a One
3  2019   a One
4  2019   a One
5  2019   a One
6  2019   a Two
7  2019   a Two
8  2019   a Two
9  2019 a Three
10 2019 a Three

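However, this becomes impractical as the list of string replacements grows, so I tried to use purrr to iterate over vectors of patterns and replacements, but I have only managed to produce warning messages. For each pattern/replacement pair, I want the code to step through text_pattern and text_replacement and run gsub on df$Text. My attempt and the resulting messages are below.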

text_pattern <- c("aa", "bb", "cc")
text_replacement <- c("One", "Two", "Three")

walk2(text_pattern, text_replacement, function(...){
  gsub(text_pattern, text_replacement, df$Text, fixed = F)
  }
)

Warning messages:
1: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'pattern' has length > 1 and only the first element will be used
4: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'replacement' has length > 1 and only the first element will be used
5: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'pattern' has length > 1 and only the first element will be used
6: In gsub(text_former, text_replace, df$Text, fixed = F) :
  argument 'replacement' has length > 1 and only the first element will be used

Is it possible to do this with a function from purrr? Or am I reaching for the wrong tool, and should I be using something else?

Answer 1

We can use reduce2:

library(purrr)
library(stringr)
df$Text <- reduce2(text_pattern, text_replacement, ~ str_replace(..1, ..2, ..3), 
           .init = df$Text)
df$Text
#[1] "a One"   "a One"   "a One"   "a One"   "a One"   "a Two"   "a Two"   "a Two"   "a Three" "a Three"

Or without the anonymous function call:

reduce2(text_pattern, text_replacement, .init = df$Text, str_replace)
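Note that str_replace replaces only the first match in each string and treats the pattern as a regular expression, whereas the gsub(..., fixed = TRUE) calls in the question replace every literal match. If that difference matters for your data, a minimal sketch of the same reduce2 idea built on gsub might look like this (the argument names acc, pat and repl are illustrative, not part of purrr):

```r
library(purrr)

df <- data.frame(Year = "2019",
                 Text = c(rep("a aa", 5), rep("a bb", 3), rep("a cc", 2)))
text_pattern <- c("aa", "bb", "cc")
text_replacement <- c("One", "Two", "Three")

# acc is the text vector as modified so far;
# pat and repl are the current pattern/replacement pair
df$Text <- reduce2(text_pattern, text_replacement,
                   function(acc, pat, repl) gsub(pat, repl, acc, fixed = TRUE),
                   .init = df$Text)
df$Text
```

Each call receives the result of the previous one, so all three replacements end up applied to the same vector.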

Answer 2

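@akrun's answer is good, but a few intermediate points may also help you understand purrr better.

walk2 does not return the output of the function; it just returns the first input vector.

From the docs: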

walk() calls .f for its side-effect and returns the input .x.
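The behaviour quoted above can be checked with a small sketch. The input vectors and the function body here are made up purely for illustration; the point is only that whatever the function returns is discarded and the first input comes back (invisibly):

```r
library(purrr)

patterns <- c("aa", "bb")
replacements <- c("One", "Two")

# The function's return value is thrown away by walk2()
out <- walk2(patterns, replacements, function(x, y) paste(x, "->", y))

# out is just the first input vector, not the pasted strings
identical(out, patterns)
```

This is why the walk2 call in the question could never have updated df$Text, even with the arguments fixed.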

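The closest analogue to what you were doing is map2, but see below for why that is not what you need either.
Inside purrr functions such as map and walk, the function's arguments are generic stand-ins for the elements of the vectors being iterated over.

You have a few options for referring to the input vectors. One is to name the arguments instead of writing function(...). For example, with function(x, y) the following runs without warnings: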
```
map2(text_pattern, text_replacement, function(x, y){
  gsub(x, y, df$Text, fixed = F)
}
)  # switching to map2() because walk2 gives silent output
```
You can also use the ~ syntax and refer to the current elements as .x and .y:
```
map2(text_pattern, text_replacement, ~gsub(.x, .y, df$Text, fixed = F))
```
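The output is not what you expect.

purrr functions such as map and walk call the function once per pattern, each time on the entire original vector. Both map2 snippets above produce the following output: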
```
[[1]]
 [1] "a One" "a One" "a One" "a One" "a One" "a bb"  "a bb"  "a bb"  "a cc"  "a cc" 

[[2]]
 [1] "a aa"  "a aa"  "a aa"  "a aa"  "a aa"  "a Two" "a Two" "a Two" "a cc"  "a cc" 

[[3]]
 [1] "a aa"    "a aa"    "a aa"    "a aa"    "a aa"    "a bb"    "a bb"    "a bb"   
 [9] "a Three" "a Three"  
```
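So even with the syntax fixed, you still get a three-element list, where each element is the result of applying just one text_pattern/text_replacement pair to the original df$Text. You would still need some way to merge the three results so that all the replaced values end up in one vector. That is what @akrun's switch to reduce2 accomplishes.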
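To see why feeding each result into the next step solves the problem, here is a rough base R equivalent written as an explicit loop. It is only a sketch of the idea, not how reduce2 is implemented internally, and the variable name text is illustrative:

```r
text <- c(rep("a aa", 5), rep("a bb", 3), rep("a cc", 2))
text_pattern <- c("aa", "bb", "cc")
text_replacement <- c("One", "Two", "Three")

# Each pass overwrites text, so the next replacement
# starts from the already-modified vector
for (i in seq_along(text_pattern)) {
  text <- gsub(text_pattern[i], text_replacement[i], text, fixed = TRUE)
}
text
```

map2 has no equivalent of the overwrite inside the loop: every call starts again from the untouched input, which is why it returns three partially replaced copies.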

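An additional note on the reduce syntax: the arguments ..1, ..2 and ..3 refer to the inputs at each iteration. ..1 is the accumulated result, and using .init makes it equal to df$Text on the first iteration. ..2 and ..3 correspond to .x and .y in the earlier map2 example (that is, the pattern and the replacement). See the reduce docs for more information.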
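One way to watch ..1 change from iteration to iteration is purrr's accumulate2, which works like reduce2 but keeps every intermediate result. The following is a small sketch on a shortened three-element vector (not the full df$Text) so that each step is easy to read:

```r
library(purrr)

text <- c("a aa", "a bb", "a cc")
text_pattern <- c("aa", "bb", "cc")
text_replacement <- c("One", "Two", "Three")

# ..1 = result so far, ..2 = current pattern, ..3 = current replacement
steps <- accumulate2(text_pattern, text_replacement,
                     ~ gsub(..2, ..3, ..1, fixed = TRUE),
                     .init = text)

steps[[1]]              # the .init value, untouched
steps[[length(steps)]]  # after all three replacements
```

The first element of steps is the .init value, and each later element is the vector after one more pattern/replacement pair has been applied; the last element is what reduce2 would return.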
