Replacing repeated strings using regex in R

Question

I have a string as follows:

text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I want to eliminate all duplicated addresses, so my expected result is:

expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I tried (^[\w|.|:|\/]*),\1+ on regex101.com, and it removes the first repetition of the string (it fails on the second). However, when I port it to R's gsub, it does not work as expected:

gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)

I tried both perl = FALSE and perl = TRUE, to no avail.

What am I doing wrong?

Answer 1

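An alternative approach is to split the string on the commas, keep the unique elements, and then paste them back together into a single string: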

paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
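
If there are several comma-separated strings rather than one, the same split/unique/paste idea can be applied element-wise. A minimal sketch, assuming a hypothetical helper named `dedupe_csv` (it is not part of base R) and an illustrative input vector `urls`:

```r
# Hypothetical helper: remove duplicated comma-separated items from each
# element of a character vector, keeping the first occurrence of each item.
dedupe_csv <- function(x, sep = ",") {
  # fixed = TRUE treats sep as a literal string, not a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # Rebuild one string per input element from its unique items
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

urls <- c(
  "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
  "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png"
)
res <- dedupe_csv(urls)
res
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
# [2] "http://q.co/imag/qrs.png"
```

`vapply` with `character(1)` is used here so the result is always a character vector of the same length as the input.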

Answer 2

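If the duplicates are consecutive, you only need to modify your regex slightly.

Take out your BOS (beginning-of-string) anchor ^.
Add a cluster group around the comma and the backreference, then quantify it: (?:,\1)+
And lose the pipe symbol | in the class; there it is just a literal.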

([\w.:/]+)(?:,\1)+

https://regex101.com/r/FDzop9/1

 ( [\w.:/]+ )         # (1), The address
 (?:                  # Cluster
      , \1               # Comma followed by what was found in group 1
 )+                   # Cluster end, 1 to many times

Note - if you use split and unique then combine, you will lose the ordering of the items.
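
In R, a pattern like this needs its backslashes doubled inside the string literal. The following is a sketch rather than a definitive solution: it assumes `perl = TRUE` so that `\w` inside the character class is handled by PCRE, and the names `collapsed`, `mixed`, `run_collapsed`, and `all_unique` are illustrative.

```r
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Collapse each run of consecutive, identical items into a single item.
collapsed <- gsub("([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
collapsed
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Only consecutive repeats are collapsed; a repeat separated by another item stays.
mixed <- "a.png,a.png,b.png,a.png"
run_collapsed <- gsub("([\\w.:/]+)(?:,\\1)+", "\\1", mixed, perl = TRUE)
run_collapsed
# [1] "a.png,b.png,a.png"

# Split/unique/paste drops every later repeat instead.
all_unique <- paste(unique(strsplit(mixed, ",", fixed = TRUE)[[1]]), collapse = ",")
all_unique
# [1] "a.png,b.png"
```

Because the pattern is not anchored to item boundaries, it could in principle match the tail of one item against the whole of the next. Adding lookarounds such as `(?<![^,])` before the group and `(?![^,])` after the quantified cluster is one way to guard against that.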

Answer 3

text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
          "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)

If your strings are in a data frame, you can use functions from the tidyverse:

library(tidyverse)
separate_rows(df, text, sep = ",") %>% 
  distinct %>% 
  group_by(no) %>% 
  mutate(text = paste(text, collapse = ",")) %>% 
  slice(1)

The output is:

#     no                                              text
#   <int>                                             <chr>
# 1     1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2     2                          http://q.co/imag/qrs.png
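
The `mutate()` plus `slice(1)` step can also be written with `summarise()`, which returns one row per group directly. A sketch, assuming dplyr (1.0 or later, for the `.groups` argument) and tidyr are installed; `result` is an illustrative name:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  no = 1:2,
  text = c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
           "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png"),
  stringsAsFactors = FALSE
)

result <- df %>%
  separate_rows(text, sep = ",") %>%   # one row per URL
  distinct() %>%                       # drop repeated (no, text) pairs
  group_by(no) %>%
  summarise(text = paste(text, collapse = ","), .groups = "drop")
```

One difference to keep in mind: `summarise()` keeps only the grouping column and the summarised column, so any other columns in `df` would be dropped, whereas the `mutate()` and `slice(1)` version retains them.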
