R Scrape web page links

Question

I figure there must be a simple answer to this, but I can't seem to find it.

I am scraping various web pages and want to pull all the links from each page. I am using htmlParse to do this and am about 95% of the way there, but I need some help.

Here is my code for scraping the web page

library(XML) # provides htmlParse, xmlRoot, xpathSApply
MyURL <- "http://whosebug.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node

Once I have the root node, I can run this to get the link nodes

URL_Links <- xpathSApply(URLroot, "//a") # get all <a> nodes under the root

This gives me output like this

[[724]]
<a href="//area51.stackexchange.com" title="proposing new sites in the Stack Exchange network">Area 51</a> 

[[725]]
<a href="//careers.whosebug.com">Stack Overflow Careers</a> 

[[726]]
<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>

Alternatively, I can run this

URL_Links_values = xpathSApply(URLroot, "//a", xmlGetAttr, "href") # Get all href values

which gets only the HREF values, like this

[[721]]
[1] "http://creativecommons.org/licenses/by-sa/3.0/"

[[722]]
[1] "http://blog.whosebug.com/2009/06/attribution-required/"

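However, I am looking for an easy way to get both the HREF value and the name of the link, preferably loaded into a data frame or matrix, so that instead of getting back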

<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a> 
<a href="http://blog.whosebug.com/2009/06/attribution-required/" rel="license">attribution required</a>

I get

                  Name                                                        HREF
1         cc by-sa 3.0              http://creativecommons.org/licenses/by-sa/3.0/
2 attribution required http://blog.whosebug.com/2009/06/attribution-required/

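Now, I could take the output of URL_Links and use a regex or split the strings apart to get this data, but it seems there should be an easier way to do this with the XML package.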

Is there an easy way to do what I am trying to do?

EDIT:

I just found that I can do this to get the link names

URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get the text (name) of every link

However, when I run this

df <- data.frame(URL_Links_names, URL_Links_values)

I get this error

Error in data.frame("//whosebug.com", "http://chat.whosebug.com",  : arguments imply differing number of rows: 1, 0

My guess is that some of the links have no value for one of the two fields, so how can I return "" or NA for any link where it is missing?

Answer 1

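My goal was to look at all the link names and then decide which URLs I needed. I did not find a way to get everything I wanted into a data frame, but what I was able to do was get all the link names like this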

library(XML) # provides htmlParse, xmlRoot, xpathSApply, getHTMLLinks
MyURL <- "http://whosebug.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node
URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get the text (name) of every link

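That gives me the names of all the links. After searching the names and deciding whether I need some or all of them, I can pass a link name to this function to get the HREF value of the link with that name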

GetLinkURLByName <- function(LinkName, WebPageURL) {
  LinkURL <- getHTMLLinks(WebPageURL, xpQuery = sprintf("//a[text()='%s']/@href",LinkName))
  return(LinkURL)
}

LinkName = the name of the link, taken from URL_Links_names. WebPageURL = the web page you are scraping (in this example I would pass it MyURL).
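One caveat with building the XPath via sprintf is that a link name containing a single quote would break the query, and text()= only matches the exact string. As a rough sketch of an alternative (not the approach used above), the comparison can be done in R after extracting names and hrefs. The inline HTML and the helper name get_href_by_name are hypothetical, chosen only so the example runs offline; it relies on the default argument of xmlGetAttr to supply NA when href is missing.

```r
library(XML)

# Hypothetical inline page so the example runs without network access.
html <- '<html><body>
  <a href="http://example.com/a">First link</a>
  <a href="http://example.com/b">Bob\'s page</a>
  <a name="top">No href here</a>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: look up hrefs by link text, comparing in R
# instead of embedding the name in an XPath string.
get_href_by_name <- function(doc, link_name) {
  nodes <- getNodeSet(doc, "//a")
  link_names <- vapply(nodes, xmlValue, character(1))
  # NA_character_ is passed as xmlGetAttr's default for a missing href
  hrefs <- vapply(nodes, xmlGetAttr, character(1), "href", NA_character_)
  hrefs[trimws(link_names) == link_name]
}

bobs_url <- get_href_by_name(doc, "Bob's page")
print(bobs_url)
```

Because the match happens in R, the same helper could be switched to grepl() for partial matches without touching the XPath.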

Answer 2

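It appears that a few of the links in the HTML have no href attribute. Because xmlGetAttr() returns NULL when the requested attribute is absent, you can find them with is.null(). You can then put that into an if() condition to use an empty string for the ones that are missing and the href attribute otherwise. There is no need to subset the root node.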
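To see the NULL behaviour in isolation, here is a minimal sketch on a hypothetical two-link snippet (the HTML is made up for illustration). It also suggests why the question's xpathSApply call came back as a list rather than a character vector: results of differing lengths cannot be simplified.

```r
library(XML)

# Hypothetical page: the second <a> has no href attribute.
doc <- htmlParse(
  '<html><body><a href="/x">has href</a><a name="y">no href</a></body></html>',
  asText = TRUE
)
nodes <- getNodeSet(doc, "//a")

with_href <- xmlGetAttr(nodes[[1]], "href")
without_href <- xmlGetAttr(nodes[[2]], "href")

# The missing attribute comes back as NULL, which is.null() detects
print(is.null(without_href))

# With one NULL among the results, the values stay in a list,
# and data.frame() cannot line that list up with the link names
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")
print(is.list(hrefs))
```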

library(XML)
## parse the html document
doc <- htmlParse("http://whosebug.com/")
## use the `[` method on the parsed document to select every 'a' node via XPath, then apply our functions
getvals <- lapply(doc["//a"], function(x) {
    data.frame(
        ## get the xml value
        Name = xmlValue(x, trim = TRUE), 
        ## get the href link if it exists
        HREF = if(is.null(att <- xmlGetAttr(x, "href"))) "" else att,
        stringsAsFactors = FALSE
    )
})
## create the full data frame
df <- do.call(rbind, getvals)
## have a look
str(df)
# 'data.frame': 697 obs. of  2 variables:
#  $ Name: chr  "current community" "chat" "Stack Overflow" "Meta Stack Overflow" ...
#  $ HREF: chr  "//whosebug.com" "http://chat.whosebug.com" "//whosebug.com" "http://meta.whosebug.com" ...

tail(df)
#                       Name                                                        HREF
# 692             Stack Apps                                             //stackapps.com
# 693    Meta Stack Exchange                                    //meta.stackexchange.com
# 694                Area 51                                  //area51.stackexchange.com
# 695 Stack Overflow Careers                                 //careers.whosebug.com
# 696           cc by-sa 3.0              http://creativecommons.org/licenses/by-sa/3.0/
# 697   attribution required http://blog.whosebug.com/2009/06/attribution-required/
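A possible variation, sketched here under the assumption that NA is acceptable in place of "": xmlGetAttr() takes a default argument, so the per-node data.frame and rbind step can be replaced by two vapply() calls. The inline HTML is a hypothetical stand-in for the live page so that the example runs offline.

```r
library(XML)

# Hypothetical inline page; the middle link deliberately has no href.
html <- '<html><body>
  <a href="//stackapps.com">Stack Apps</a>
  <a name="anchor-only">no href</a>
  <a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

nodes <- getNodeSet(doc, "//a")
links <- data.frame(
  # link text, trimmed of surrounding whitespace
  Name = vapply(nodes, xmlValue, character(1), trim = TRUE),
  # href value, or NA when the attribute is absent
  HREF = vapply(nodes, xmlGetAttr, character(1), "href", NA_character_),
  stringsAsFactors = FALSE
)
print(links)
```

Building the two columns as vectors avoids creating one small data frame per link, which may matter on pages with many links.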


r

html-parsing

web-scraping