How to extract the inherent links from the webpage with my code (error: subscript out of bounds)?
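I am a novice in web scraping, but I need data for my PhD project. For this, I am extracting data on different activities of MEPs from the website of the European Parliament. Specifically, and this is where I have problems, I would like to extract the title and especially the link underlying each speech title from an MEP's personal page. I use code that has already worked fine several times, but here I do not succeed in getting the link, only the title of the speech. For the links I get the error message "subscript out of bounds". I am working with RSelenium because there are several "Load more" buttons on the individual pages that I have to click before extracting the data (which, as far as I understand, makes rvest a complicated option).
I have basically been trying to solve this for days and I really do not know how to get any further. My impression is that the CSS selector does not actually capture the underlying link (since it extracts the title without problems), but the class has a compound name ("ep-a_heading ep-layout_level2"), so it is not possible to go that way either. I also tried rvest (ignoring the problem I would have with the load more button), but I still did not get to those links.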
```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use, there are others, constructed all
##the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
##is a "load more" button on the page
more <- browser$findElement(using = "css", value=".erpl-activities-loadmore-button .ep_name")
while (!is.null(more)){
more$clickElement()
Sys.sleep(1)}
## I get an error message doing this in the end but it is working anyway
##(yes, I really am a beginner!)
##Now, what I want to extract are the title of the speech and most
##importantly: the URL.
links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title")
length(links)
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
##function that had worked fine already many times to extract the data I
##want
for (i in 1:length(links)){
URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
```
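For this example there are 128 speeches on the page, so in the end I need a table with 128 titles and links. The code works fine when I only extract the titles, but for the URLs I get: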
`"Error in links[[i]]$getElementAttribute("href")[[1]] : subscript out of bounds"`
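Thank you very much for your help. I have already read many posts on subscript-out-of-bounds issues in this forum, but unfortunately I still could not solve the problem.
Have a great day!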
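I seem to have no problem getting that information with rvest. There is no need for the overhead of using Selenium. You want to target the `a` tag child of the class, i.e. `.ep-layout_level2 a`, in order to be able to access the `href` attribute. The same selector would work for Selenium.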
```{r}
library(rvest)
library(magrittr)
page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\r\n\t+", "", .)
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href")
results <- data.frame(titles,links)
```
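To see why the choice of selector matters, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption modelled on the class names mentioned in the question (`ep-a_heading ep-layout_level2`, `ep_title`); it is not a copy of the real page, and the `example.org` URLs are made up.

```r
library(rvest)

# Hypothetical markup: the real page may nest these elements differently.
html <- '
<div class="ep-a_heading ep-layout_level2">
  <a href="https://example.org/speech-1"><span class="ep_title">First speech</span></a>
</div>
<div class="ep-a_heading ep-layout_level2">
  <a href="https://example.org/speech-2"><span class="ep_title">Second speech</span></a>
</div>'
page <- read_html(html)

# The title element has no href of its own, so this yields NA for every node.
title_hrefs <- page %>% html_nodes(".ep-layout_level2 .ep_title") %>% html_attr("href")

# The a element is the one that carries the href.
link_hrefs <- page %>% html_nodes(".ep-layout_level2 a") %>% html_attr("href")

# A compound class can be matched in CSS by chaining the class names without a space.
compound_hrefs <- page %>% html_nodes(".ep-a_heading.ep-layout_level2 a") %>% html_attr("href")
```

Under these assumptions, `title_hrefs` contains only `NA`, while `link_hrefs` and `compound_hrefs` both hold the two URLs. So a compound class name is not an obstacle as long as a CSS selector is used.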
Here is a working solution based on the code you provided:
```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use, there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (grepl("erpl-activity-loadmore-button", browser$getPageSource()[[1]], fixed = TRUE)) {
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end but it is working anyway
## (yes, I really am a beginner!)
## Now, what I want to extract are the title of the speech and most
## importantly: the URL.
links <- browser$findElements(using = "class", value = "ep-layout_level2")
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
## function that had worked fine already many times to extract the data I
## want
for (i in seq_along(links)) {
  ## the href sits on the a element inside each selected element
  l <- links[[i]]$findChildElement(using = "css", value = "a")
  URL[i] <- l$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
speeches
```
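The main differences are:
- In the first `findElement` call I use `using = "class"` with `value = "erpl-activity-loadmore-button"`. Indeed, the documentation says that you cannot look up several class values at once.
- The same applies when looking up the links.
- In the final loop, you first need to select the link element (`a`) inside the `div` you selected, and then read its `href` attribute.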
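The original error follows from that last point. As far as I can tell, `getElementAttribute()` gives back an empty list when the element has no such attribute, and indexing an empty list with `[[1]]` is what raises "subscript out of bounds". A small defensive helper (the name `first_or_na` is hypothetical and not part of RSelenium) could turn that case into `NA` instead of an error:

```r
# Return the first element of x as a string, or NA when x is empty.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else as.character(x[[1]])
}

# Base R reproduces the error message without any browser involved.
err <- tryCatch(list()[[1]], error = function(e) conditionMessage(e))

# Stand-ins for what an attribute lookup might return (made-up URL).
with_href    <- first_or_na(list("https://example.org/speech-1"))
without_href <- first_or_na(list())
```

In the loop above this would be used as `URL[i] <- first_or_na(l$getElementAttribute("href"))`, so a single element without a link leaves an `NA` in the table rather than aborting the whole run.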
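To answer your question about the error message mentioned in the comment after the while loop: once you have pressed the "Load more" button often enough, it becomes invisible but still exists. So when you check `!is.null(more)` it is `TRUE`, because the button is still there, but when you try to click it you get an error because it is invisible. You can fix this by checking whether the button is visible before clicking.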
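A minimal sketch of such a visibility check is shown below. It assumes that `button` is an RSelenium web element, whose `isElementDisplayed()` and `clickElement()` methods are used; the function name `click_while_visible` and the `max_clicks` safety limit are my own additions for illustration.

```r
# Click `button` for as long as it is displayed.
# `max_clicks` is a safety net against an endless loop.
click_while_visible <- function(button, pause = 1, max_clicks = 500) {
  clicks <- 0
  while (clicks < max_clicks && isTRUE(button$isElementDisplayed()[[1]])) {
    button$clickElement()
    Sys.sleep(pause)  # give the page time to load the next batch
    clicks <- clicks + 1
  }
  clicks  # number of clicks actually performed
}
```

With the objects from the code above, the call would be `click_while_visible(more)`. The loop ends as soon as the button is hidden, so the final failing click should no longer happen.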