Web scraping in R from a data frame

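Starting from the data frame below, I am trying to use the rvest package to scrape the part of speech and synonyms of each word from https://www.thesaurus.com/browse/research?s=t into a CSV file.

I am not sure how to make R look up each word in the data frame and extract its part of speech and synonyms.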

install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)

words <- data.frame("keywords" = c("research", "survey", "staff", "outpatient", "consent"))

html <- read_html("https://www.merriam-webster.com/thesaurus/research")

html %>%
  html_nodes(".mw-list") %>%
  html_text() %>%
  head(n = 1) # take the first record

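If you search thesaurus.com for [your term], you end up on the page "https://www.thesaurus.com/browse/[your term]". Knowing this, you can build the URL of every page you are interested in and read its HTML. After that, you should be able to iterate over the pages with the map() function from the purrr package to get the information you want: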


# It makes more sense to just keep "words" as a vector for now
library(rvest)
library(purrr)

words <- c("research", "survey", "staff", "outpatient", "consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)

# For each URL: read the page, take the first node matching the selector, and extract its text
info_list <- map(htmls, ~ .x %>%
                          read_html() %>%
                          html_node(".mw-list") %>%
                          html_text())
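
Since the question asks for a CSV, here is a rough sketch of how the scraped text could be collected into a data frame and written out. The helper names scrape_first and build_table and the file name synonyms.csv are made up for illustration. The ".mw-list" selector is carried over from the question's Merriam-Webster code, so it is an assumption and may well need to be replaced with a selector that matches the page you actually scrape.

```r
library(rvest)
library(purrr)

words <- c("research", "survey", "staff", "outpatient", "consent")

# Hypothetical helper: read one page and return the text of the first node
# matching `selector`. Returns NA if the request fails or nothing matches.
scrape_first <- function(page, selector) {
  tryCatch(
    page %>% read_html() %>% html_node(selector) %>% html_text(trim = TRUE),
    error = function(e) NA_character_
  )
}

# Hypothetical helper: one row per word, with the scraped text next to it.
build_table <- function(words, texts) {
  data.frame(keyword = words, synonyms = texts, stringsAsFactors = FALSE)
}

# Usage (needs network access; the selector is an assumption, see above):
# urls  <- paste0("https://www.thesaurus.com/browse/", words)
# texts <- map_chr(urls, scrape_first, selector = ".mw-list")
# write.csv(build_table(words, texts), "synonyms.csv", row.names = FALSE)
```

Wrapping the request in tryCatch() means a single missing page yields NA for that word instead of stopping the whole loop, and map_chr() returns a plain character vector that slots straight into a data frame column.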