How do I delete every link containing ".com" from a string?

How can I get the expected output for text like the example below?

x<-c("Commerce recommend erkanexample.com.tr. This site erkanexample.com. erkandeneme.com is widely. The company name is apple.commerce is coma. spread")
x<-gsub("(.com)\\S+", "",x)
x
[1] "Commerce r erkanexample This site erkanexample erkandeneme.com is widely. The name is apple is"
Expected:
[1] "Commerce recommend This site. is widely. The company name is apple.commerce is coma. spread"

The stringr package provides functions for basic string manipulation:

library(stringr)
library(dplyr)

x %>% 
  str_split(" ") %>% 
  unlist() %>% 
  str_subset("\\.com($|\\.)", negate = TRUE) %>% 
  str_c(collapse = " ")

This gives:

"Commerce recommend This site is widely. The company name is apple.commerce is coma. spread" 

After the edit to the question:

x %>% 
  str_split(" ") %>% 
  unlist() %>%
  str_subset("\\.com$", negate = TRUE) %>% 
  str_replace(".*\\.com.*\\.$", ".") %>%
  str_c(collapse = " ") %>%
  str_replace_all(" \\.", "\\.")

This gives:

"Commerce recommend. This site. is widely. The company name is apple.commerce is coma. spread"

Idea: split the string on spaces, detect which words contain ".com", keep only the words that do not, and join them back together.

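library(stringr)

x <- c("Commerce recommend erkanexample.com.tr. This site erkanexample.com. erkandeneme.com is widely. The company name is apple.commerce is coma. spread")
# split on single spaces and take the first (only) list element
split_str <- str_split(x, " ", simplify = FALSE)[[1]]
# keep the words without a literal ".com" and re-join them
paste(split_str[!grepl("[.]com", split_str)], collapse = " ")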

This gives:

[1] "Commerce recommend This site is widely. The company name is is coma. spread"
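
To see exactly which words this filter drops, here is a minimal base R sketch. The names `tokens`, `has_com` and `dropped` are illustrative only, and `strsplit()` stands in for `str_split()`. Note that `apple.commerce` is removed as well, because it also contains ".com".

```r
x <- "Commerce recommend erkanexample.com.tr. This site erkanexample.com. erkandeneme.com is widely. The company name is apple.commerce is coma. spread"

# Split on single spaces (base R stand-in for str_split on this input)
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]

# Flag every token that contains the literal text ".com"
has_com <- grepl(".com", tokens, fixed = TRUE)

# The tokens the filter removes
dropped <- tokens[has_com]
print(dropped)
```

If `apple.commerce` should be kept, the test would need to require that ".com" is followed by a period or the end of the word, rather than just appearing somewhere in it.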

Is this what you want?

gsub("\\s[a-z]+\\.com(\\.[a-z]+)?\\b", "", x)
[1] "Commerce recommend. This site. is widely. The company name is apple.commerce is coma. spread"

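Here, we replace the following pattern with nothing (in R source code each backslash is doubled):

  • \s: a whitespace character
  • [a-z]+: one or more lowercase letters
  • \.: a literal period
  • com: the string com
  • (\.[a-z]+)?: an optional period followed by one or more lowercase letters
  • \b: a word boundary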
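
One way to check this breakdown is to extract what the pattern matches instead of deleting it. The sketch below uses base R's `regmatches()` and `gregexpr()`. The variable names `pattern` and `matches` are illustrative.

```r
x <- "Commerce recommend erkanexample.com.tr. This site erkanexample.com. erkandeneme.com is widely. The company name is apple.commerce is coma. spread"

# Same pattern as in the gsub() call, with backslashes doubled for R
pattern <- "\\s[a-z]+\\.com(\\.[a-z]+)?\\b"

# gregexpr() locates every match; regmatches() pulls out the matched text
matches <- regmatches(x, gregexpr(pattern, x))[[1]]
print(matches)
# [1] " erkanexample.com.tr" " erkanexample.com"    " erkandeneme.com"
```

Each match starts with the preceding space and stops before the sentence-ending period, which is why the periods survive in the result. `apple.commerce` is not matched because no word boundary follows "com" there.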