Execute gsub after first instance of closing character instead of continuing to end of string

Question

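I have a dataset in which the columns are survey questions and the values in the rows contain the answer the respondent chose, wrapped in several HTML tags. I am trying to remove all the HTML tags and leave only the answer text.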

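In Excel, this can be done by replacing <*> with an empty string. I can't work out how to do this in R: the problem is that I can't get the wildcard to stop after the first greater-than sign. Instead, it treats that character as part of the wildcard and continues to the end of the string. A toy dataset and my attempt are included below.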

temp <- data.frame(one = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 1</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 2</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 3</span></b>'),
                   two = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are red</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are blue</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are bananas</span></b>'))


temp[] <- sapply(temp, function(x) gsub('<.*>+', "", x))

# what I want the new temp to look like (the code above results in empty strings)
data.frame(one = c("Answer 1", 
                   "Answer 2", 
                   "Answer 3"),
           two = c("apples are red",
                   "apples are blue", 
                   "apples are bananas"))

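I tried code for matching the nth occurrence, and a few other things, but the match still runs from the first instance to the end of the string.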

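What regex syntax am I missing that makes the match terminate after the first instance? Also, I assume that after the first removal it will move on to the next row, forcing me to run gsub() n times, where n is the maximum number of tags in any given column. That is not especially problematic, but is there a workaround?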

Answer 1

See this excerpt from the regex documentation:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

temp[] <- sapply(temp, function(x) gsub('<.*?>', "", x))

       one                two
1 Answer 1     apples are red
2 Answer 2    apples are blue
3 Answer 3 apples are bananas
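
To see the difference in isolation, here is a minimal sketch on a single string. The string `x` is a shortened, made-up version of one cell from the question, and the negated character class `<[^>]*>` is an alternative I am adding for comparison, not something quoted from the documentation above.

```r
# Hypothetical, shortened version of one cell from the question's data
x <- '<b style="font-weight: normal;"><span style="font-size: 12pt;">Answer 1</span></b>'

# Greedy: .* runs from the first "<" to the LAST ">", so the whole string is removed
greedy <- gsub("<.*>", "", x)

# Lazy: .*? stops at the first ">" after each "<", so each tag is removed separately
lazy <- gsub("<.*?>", "", x)

# Negated class: "<", then any run of characters that are not ">", then ">"
negated <- gsub("<[^>]*>", "", x)

print(c(greedy = greedy, lazy = lazy, negated = negated))
```

The lazy and negated-class patterns should both leave only the answer text here. The negated class does not rely on lazy-quantifier support, which can make it a bit more portable across regex engines.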

To answer your second question: gsub replaces all matches (unlike sub, which replaces only the first), so you should be fine.
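
A small sketch of that sub/gsub difference, using made-up strings rather than the question's data. It also illustrates that gsub is vectorised, so one call handles every element of a column and every tag within each element.

```r
# Hypothetical two-element column with nested tags
x <- c("<b><span>apples are red</span></b>",
       "<b><span>apples are blue</span></b>")

# sub() replaces only the first match in each element: just the leading <b>
first_only <- sub("<.*?>", "", x)

# gsub() replaces every match in each element, so a single call is enough
all_tags <- gsub("<.*?>", "", x)

print(first_only)
print(all_tags)
```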

Answer 2

With str_extract, we can extract the word characters and whitespace between > and <:

library(stringr)
library(dplyr)

temp %>%
  # backslashes must be doubled inside an R string; > and < need no escaping
  mutate_all(str_extract, "(?<=>)[\\w\\s]+(?=<)")

Output:

       one                two
1 Answer 1     apples are red
2 Answer 2    apples are blue
3 Answer 3 apples are bananas
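
For comparison, roughly the same lookaround extraction can be sketched in base R with regexpr and regmatches, which need perl = TRUE for lookbehind and lookahead. The vector `x` below is a shortened, hypothetical stand-in for the question's columns. Two caveats apply: only the first match per string is returned (as with str_extract), and an answer containing punctuation would not match `[\w\s]+` in full.

```r
# Hypothetical, shortened cells modelled on the question's data
x <- c('<b style="font-weight: normal;"><span style="font-size: 12pt;">Answer 1</span></b>',
       '<b style="font-weight: normal;"><span style="font-size: 12pt;">apples are red</span></b>')

# (?<=>)   : the match must be preceded by ">"
# [\w\s]+  : one or more word or whitespace characters
# (?=<)    : the match must be followed by "<"
# Each regex backslash is doubled because it sits inside an R string literal.
pattern <- "(?<=>)[\\w\\s]+(?=<)"

# regexpr() finds the first match in each element; perl = TRUE enables lookarounds
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns the matched substrings (elements without a match are dropped)
extracted <- regmatches(x, m)
print(extracted)
```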


regex

r

gsub