Extract Exact Word in R

Question

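I want to extract some exact words from a variable (its values are actually URLs wrapped in HTML) and create a new variable containing only the extracted words. Checking the pattern, I found that the text I want sits between the characters \"> and <, as shown below: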

> dados$source[1:20]
 [1] "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
 [2] "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
 [3] "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"

How can I do this?

Answer 1

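I'm not sure I fully understand the pattern you want to extract, but a regular expression is the way to go. See the related question: Removing html tags from a string in R. The function below strips every HTML tag and leaves only the text between them.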

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
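
As a rough sketch of how this could be applied to the data in the question: the block below rebuilds the three sample strings, strips the tags with the same pattern as cleanFun, and also shows a capture-group variant that keeps only what lies between the first > and the next <. The data frame dados and its source column come from the question; the column name source_name is a hypothetical choice for illustration.

```r
# Sample strings copied from the question
source_html <- c(
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
)

# Same pattern as cleanFun: remove every <...> tag, matching non-greedily
link_text <- gsub("<.*?>", "", source_html)

# Alternative: capture only the text between the first ">" and the next "<"
captured <- sub("^[^>]*>([^<]*)<.*$", "\\1", source_html)

# Store the result in a new column (source_name is a hypothetical name)
dados <- data.frame(source = source_html, stringsAsFactors = FALSE)
dados$source_name <- gsub("<.*?>", "", dados$source)

print(dados$source_name)
```

Both variants assume each value holds a single, well-formed <a> element, as in the sample; with nested or malformed markup a regex like this may not behave as intended.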

Answer 2

If you have HTML, use an HTML parser such as rvest to parse the strings. Once you have plain (non-HTML) strings, you can use a regular expression on them.

library(purrr)    # use lapply and sapply if you prefer
library(rvest)

# representative data
links <- c("<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", 
    "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", 
    "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>")

links %>% map(read_html) %>% 
    map_chr(html_text) %>% 
    sub('Twitter (for )?', '', .)

## [1] "iPhone"     "Android"    "Web Client"
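
For readers who would rather not load purrr, the comment in the code above suggests the base R apply functions. A minimal sketch of that variant, assuming rvest is installed and using vapply in place of map_chr (the variable name link_text is made up for this example):

```r
library(rvest)

# Representative data, same as in the answer
links <- c(
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
)

# Parse each string as HTML and keep only its text content
link_text <- vapply(
  links,
  function(x) html_text(read_html(x)),
  character(1),
  USE.NAMES = FALSE
)

# Drop the leading "Twitter " or "Twitter for " prefix
client <- sub("Twitter (for )?", "", link_text)
print(client)
```

vapply is used here rather than sapply because it checks that every element yields exactly one string, which is the same guarantee map_chr gives.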
