How to remove everything within brackets for every row in a column using dplyr stringr

Question

I have the following data frame:

library(tidyverse)
dat <- tribble(
  ~x, ~y,
  1,  "foo",
  2,  "bar (103 xxx)",
  3,  "bar",
  4,  "foo (yyy)"
)

dat 
#> # A tibble: 4 x 2
#>       x             y
#>   <dbl>         <chr>
#> 1     1           foo
#> 2     2 bar (103 xxx)
#> 3     3           bar
#> 4     4     foo (yyy)

What I want to do is clean up column y by removing every string enclosed in () brackets, brackets included. The desired result:

      x             y
  <dbl>         <chr>
1     1           foo
2     2           bar 
3     3           bar
4     4           foo

How can I do that?

I tried this, but it throws an error:

> dat  %>% stringr::str_replace(y, "\\([a-zA-Z0-9]+\\)","")
Error in stringr::str_replace(., y, "\\([a-zA-Z0-9]+\\)", "") : 
  unused argument ("")

Answer 1

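The problem is the pipe %>%: it passes dat as the first argument to str_replace (the dot in the error message), which is not what str_replace expects. The call therefore receives four arguments instead of three, hence the "unused argument" error:

> Error in stringr::str_replace(., y, "\\([a-zA-Z0-9]+\\)", "") :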
 #                              ^   dat passed here
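
To make the pipe behaviour concrete, here is a minimal sketch. The vector y below is made up for illustration; it only shows that x %>% f(a, b) is evaluated as f(x, a, b), so the piped object must be the string vector itself, not the whole data frame.

```r
library(magrittr)  # provides %>%
library(stringr)

# Hypothetical standalone vector, not the data frame from the question
y <- c("foo", "bar (103 xxx)")

# The left-hand side of %>% becomes the first argument of str_replace
piped  <- y %>% str_replace("\\(.*?\\)", "")
direct <- str_replace(y, "\\(.*?\\)", "")

identical(piped, direct)
```

With dat on the left-hand side instead, the call expands to str_replace(dat, y, pattern, ""), which is one argument too many.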

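You can use str_replace inside mutate to rewrite the column. The parentheses must be escaped with a double backslash in an R string, and trimws removes the space left in front of the deleted bracket:

dat %>% mutate(y = trimws(str_replace(y, "\\(.*?\\)", "")))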

# A tibble: 4 x 2
#      x     y
#  <dbl> <chr>
#1     1   foo
#2     2   bar
#3     3   bar
#4     4   foo
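
Note that str_replace only replaces the first match in each string. If a row could contain more than one bracketed group (an assumption; the sample data does not), a sketch along these lines may help. The tibble dat2 is hypothetical, and str_squish is used here instead of trimws so that doubled inner spaces are collapsed as well.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical data with two bracketed groups in one row
dat2 <- tibble(x = 1:2, y = c("foo (a) baz (b)", "bar"))

res <- dat2 %>%
  # str_replace_all removes every match, not just the first one
  mutate(y = str_squish(str_replace_all(y, "\\(.*?\\)", "")))

res$y
```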

If you want to apply str_replace directly after the pipe, it can only operate on a single column/vector:

# here use pull to extract the column and manipulate it
dat %>% pull(y) %>% str_replace("\\(.*?\\)", "") %>% trimws()
# [1] "foo" "bar" "bar" "foo"

Answer 2

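Assuming these are the only patterns, a base R option would be to drop everything from the first opening bracket (and any whitespace before it) to the end of the string:

dat$y <- sub("\\s*\\(.*", "", dat$y)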
dat$y
#[1] "foo" "bar" "bar" "foo"
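
Because that pattern deletes everything after the first opening bracket, it would also drop any text that follows the closing bracket. A hedged alternative, assuming brackets are never nested, is to match only up to the next closing bracket. The vector y below is made up for illustration.

```r
# Hypothetical input: the second element has text after the bracket
y <- c("foo (yyy)", "bar (1) baz")

# Pattern from the answer: removes from the first "(" to the end
to_end <- sub("\\s*\\(.*", "", y)

# Alternative: remove only each "( ... )" group and the space before it
groups_only <- gsub("\\s*\\([^)]*\\)", "", y)

to_end
groups_only
```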

Answer 3

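You could also just do the following, which avoids dealing with the brackets by keeping only the leading word:

library(dplyr)
library(stringr)
dat %>%
  mutate(y = str_extract(y, "^\\w+"))

But I am not sure whether your actual dataset has this structure.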
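
To illustrate that caveat, here is a small sketch with made-up values: "^\\w+" keeps only the first run of word characters, so multi-word or hyphenated entries would be truncated.

```r
library(stringr)

# Hypothetical values that do not follow the "single word (note)" shape
y <- c("foo (yyy)", "foo bar (x)", "foo-bar")

first_word <- str_extract(y, "^\\w+")
first_word
```

If your real column contains such values, one of the bracket-matching approaches above is probably the safer choice.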
