How can I extract the underlying links from a web page with my RSelenium code (error: subscript out of bounds)?

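I am new to web scraping, but I need data for my PhD project. For this, I am extracting data on the different activities of Members of the European Parliament (MEPs) from the European Parliament's website. Specifically, and this is where I have the problem, I want to extract the titles and, above all, the link underlying each speech title from an MEP's personal page. The code I am using has worked fine many times already, but here I do not manage to get the links, only the titles of the speeches. For the links I get the error message "subscript out of bounds". I am working with RSelenium because I have to click several "Load more" buttons on the individual pages before extracting the data (which, as far as I understand, makes rvest a complicated option).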

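I have been trying to solve this for days, and I really do not know how to get any further. My impression is that the CSS selector does not actually capture the underlying link (since it extracts the title without problems), but the class has a compound name ("ep-a_heading ep-layout_level2"), so going that way does not seem possible either. I also tried rvest (ignoring the problem I would then have with the "Load more" buttons), but I still did not get to those links.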

```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)

server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)

## this is one of the urls I will use, there are others, constructed all 
##the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'


browser$open() 
browser$navigate(url)

## now I identify the load more button and click on it as long as there 
##is a "load more" button on the page

more <- browser$findElement(using = "css", value = ".erpl-activities-loadmore-button .ep_name")

while (!is.null(more)){
more$clickElement()
Sys.sleep(1)}

## I get an error message doing this in the end but it is working anyway 
##(yes, I really am a beginner!)

##Now, what I want to extract are the title of the speech and most 
##importantly: the URL.

links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title") 
length(links) 


## there are 128 Speeches listed on the page

URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

## after having created vectors to store the results, I apply the loop 
##function that had worked fine already many times to extract the data I 
##want

 for (i in 1:length(links)){
     URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
     Title[i] <- links[[i]]$getElementText()[[1]] 
    }

speeches <- data.frame(Title, URL)
```

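For this example there are 128 speeches on the page, so in the end I need a table with 128 titles and links. The code works fine when I only request the titles, but for the URLs I get: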

    `"Error in links[[i]]$getElementAttribute("href")[[1]] :   subscript out of bounds"`

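Thank you very much for your help. I have already read many posts on this forum about "subscript out of bounds" errors, but unfortunately I still could not solve the problem.

Have a nice day!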

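I seem to have no problem getting that information with rvest; there is no need for the overhead of Selenium. You want to target the a tag that is a child of the class, i.e. .ep-layout_level2 a, so that you can access its href attribute. The same selector would work with Selenium.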

```{r}
library(rvest)
library(magrittr)

page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')

# Titles: text of the title elements, with line-break and tab padding removed
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\r\n\t+", "", .)
# Links: href attribute of the <a> elements inside each heading
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr("href")
results <- data.frame(titles, links)
```
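One caveat with collecting titles and links in two separate passes is that data.frame() fails if the two vectors end up with different lengths. A minimal sketch of a per-heading extraction is shown below. It runs on a small inline HTML snippet that only imitates the structure described in the question (the markup and the example.org URLs are assumptions, not the real page), and it also shows that a compound class attribute can be matched by chaining the class names in a CSS selector.

```r
library(rvest)

# Hypothetical markup imitating the structure described in the question;
# the real page may differ.
html <- '
<div>
  <div class="ep-a_heading ep-layout_level2">
    <a href="https://example.org/speech-1"><span class="ep_title">First speech</span></a>
  </div>
  <div class="ep-a_heading ep-layout_level2">
    <a href="https://example.org/speech-2"><span class="ep_title">Second speech</span></a>
  </div>
  <div class="ep-a_heading ep-layout_level2">
    <span class="ep_title">Heading without a link</span>
  </div>
</div>'

page <- read_html(html)

# A compound class attribute is matched by chaining the class names.
headings <- html_nodes(page, ".ep-a_heading.ep-layout_level2")

# html_node() (singular) yields exactly one result per heading, using a
# missing node where nothing matches, so titles and links stay aligned.
results <- data.frame(
  title = headings %>% html_node(".ep_title") %>% html_text(trim = TRUE),
  link  = headings %>% html_node("a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)

print(results)
```

A heading without a link gets NA in the link column instead of shifting the remaining rows.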

Here is a working solution based on the code you provided:

```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)

server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)

## this is one of the urls I will use, there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'

browser$open()
browser$navigate(url)

## now I identify the load more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")

while (grepl("erpl-activity-loadmore-button", browser$getPageSource()[[1]], fixed = TRUE)) {
  more$clickElement()
  Sys.sleep(1)
}

## I get an error message doing this in the end but it is working anyway
## (yes, I really am a beginner!)

## Now, what I want to extract are the title of the speech and most
## importantly: the URL.
links <- browser$findElements(using = "class", "ep-layout_level2")

## there are 128 speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

for (i in seq_along(links)) {
  ## first select the <a> element inside the selected div, then read its href
  l <- links[[i]]$findChildElement(using = "css", "a")

  URL[i] <- l$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}

speeches <- data.frame(Title, URL)

speeches
```

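The main differences are:

  • In the first findElement call I use value = "erpl-activity-loadmore-button". The documentation says that you cannot look up multiple class values at once.
  • The same applies when looking for the links.
  • In the final loop, you first need to select the link element inside the div you selected and then read its href attribute.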
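The error message in the question suggests that getElementAttribute('href') returned an empty list for the selected elements, which fits elements that are not a tags and so have no href. Indexing an empty list with [[1]] is what raises "subscript out of bounds". The sketch below reproduces this in plain R and adds a small guard; first_or_na() is a hypothetical helper introduced here for illustration and is not part of RSelenium.

```r
# first_or_na() is a hypothetical helper, not part of RSelenium.
# It returns the first element of x, or NA when x is empty.
first_or_na <- function(x) {
  if (length(x) >= 1L) x[[1]] else NA_character_
}

# Indexing an empty list raises the same kind of error as in the question.
msg <- tryCatch(list()[[1]], error = function(e) conditionMessage(e))
print(msg)

# With the guard, an empty result becomes NA instead of stopping the loop.
empty_result <- first_or_na(list())
found_result <- first_or_na(list("https://example.org/speech-1"))
print(empty_result)
print(found_result)
```

In the loop above this could be used as URL[i] <- first_or_na(l$getElementAttribute('href')), so that a single element without an href does not abort the whole extraction.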

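To answer your question about the error message mentioned in the comment after the while loop: once you have pressed the "Load more" button often enough, it becomes invisible but still exists. So when you check !is.null(more), the result is TRUE because the button is still there, but when you try to click it you get an error because it is invisible. You can fix this by checking whether the button is visible or not before clicking it.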
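A minimal sketch of that visibility check follows. It assumes the button object offers isElementDisplayed() and clickElement(), as an RSelenium webElement does. The function name click_until_hidden() and the max_clicks safety limit are assumptions added for illustration. So that the sketch can run without a browser, it is exercised here with a fake button that hides itself after three clicks.

```r
# click_until_hidden() is a hypothetical helper. It clicks `button` while the
# button reports itself as displayed, and stops after `max_clicks` clicks as a
# safety limit. It returns the number of clicks performed.
click_until_hidden <- function(button, max_clicks = 200L, pause = 1) {
  clicks <- 0L
  while (clicks < max_clicks && isTRUE(button$isElementDisplayed()[[1]])) {
    button$clickElement()
    clicks <- clicks + 1L
    Sys.sleep(pause)  # give the page time to load the next batch
  }
  clicks
}

# Fake button for demonstration only: it stays "visible" for three clicks.
state <- new.env()
state$remaining <- 3L
fake_button <- list(
  isElementDisplayed = function() list(state$remaining > 0L),
  clickElement = function() {
    assign("remaining", state$remaining - 1L, envir = state)
    invisible(NULL)
  }
)

n_clicks <- click_until_hidden(fake_button, pause = 0)
print(n_clicks)
```

With a live session, the call would be click_until_hidden(more), where more is the element returned by browser$findElement().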