Web extraction from popups

I need to get the web links of all the followers listed on the following page.

https://www.researchgate.net/topic/biotechnology

The topic currently has 206770 followers. When I click the "View all" button, a popup window opens with a list that keeps growing as I scroll down.

https://www.researchgate.net/profile/Kestutis_Sasnauskas ...

The link above is that of the top follower. Is there a way to get the web links of all 206770 followers?

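This can be done with rvest and RSelenium. The latter is the one you really need; the former just makes your life easier. Install RSelenium from GitHub with devtools::install_github("ropensci/RSelenium"); rvest is available from CRAN.

Here is the code that does what you are looking for.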

siteUrl <- "http://www.researchgate.net/"
GateUrl <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset="

library(rvest)
library(RSelenium)

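# Start a local Selenium server and open a browser session.
# Note: newer RSelenium releases replace checkForServer()/startServer() with rsDriver().
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open(silent = FALSE)

i <- 0               # offset passed to the followers endpoint
profileUrls <- c()   # accumulated profile links

for (j in 1:3) {
  print(j)
  remDrv$navigate(paste0(GateUrl, i))
  # Parse the rendered page; read_html() replaces the deprecated html()
  l <- read_html(remDrv$getPageSource()[[1]])
  profileUrls <- c(profileUrls,
                   paste0(siteUrl, l %>% html_nodes(".display-name") %>% html_attr("href")))
  i <- length(profileUrls) + 1
}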

remDrv$close()
profileUrls 

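A few things to note here. First, you need to work out the bounds of the j loop. I think each URL fetches 38 profiles, so j should be something like for(j in 1:(followers/38)). The sample result further down shows 56 URLs from three requests, ten of them repeats, so check the actual page size before relying on that number.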
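As a rough sketch of sizing the loop: the follower count is the one quoted in the question, the page size of 38 is only the guess from the paragraph above, and `perPage` and `nRequests` are illustrative names. Rounding up keeps a final, partially filled page from being skipped.

```r
followers <- 206770   # follower count quoted in the question
perPage   <- 38       # assumed page size; verify against a real response

# Round up so a final, partially filled page is still requested
nRequests <- ceiling(followers / perPage)
nRequests
```

The loop header would then read for (j in seq_len(nRequests)), which also avoids a non-integer upper bound.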

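Second, the way the code stores the links is not very efficient, because it appends to the vector on every iteration. A better solution is to use lapply and unlist.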
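The lapply and unlist pattern might look roughly like this. `fetchPage` is a hypothetical stand-in for "navigate to an offset and extract the hrefs"; here it only returns placeholder paths so the pattern can be run without a browser.

```r
# Hypothetical stand-in for the navigate-and-extract step;
# it returns three placeholder paths per offset.
fetchPage <- function(offset) {
  paste0("profile/user_", offset + 0:2)
}

offsets <- c(0, 3, 6)

# Collect each page in a list, then flatten once at the end
pages <- lapply(offsets, fetchPage)
profileUrls <- unlist(pages)
profileUrls
```

Collecting into a list and flattening once avoids copying the growing vector on every iteration.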

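Finally, you need Mozilla Firefox installed on your computer, because it is the default browser RSelenium uses, but you can configure it to use any of the other popular browsers instead.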

Result for the first 56:

> profileUrls
[1] "http://www.researchgate.net/profile/Jose_Carbajo2"           
[2] "http://www.researchgate.net/profile/Daniele_Riccio"          
[3] "http://www.researchgate.net/profile/Fiona_Togneri2"          
[4] "http://www.researchgate.net/profile/Sukanya_Patel"           
[5] "http://www.researchgate.net/profile/Neri_Fattorini"          
[6] "http://www.researchgate.net/profile/Pham_Thi_Thuy_Van"       
[7] "http://www.researchgate.net/profile/Kestutis_Sasnauskas"     
[8] "http://www.researchgate.net/profile/Iris_Weintal"            
[9] "http://www.researchgate.net/profile/Godelieve_Verhaegen"     
[10] "http://www.researchgate.net/profile/Janani_Venkatraman2"     
[11] "http://www.researchgate.net/profile/Kai_Wang126"             
[12] "http://www.researchgate.net/profile/Irine_Ronin"             
[13] "http://www.researchgate.net/profile/Natasha_Ikhsan"          
[14] "http://www.researchgate.net/profile/Nadya_Hajar"             
[15] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[16] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[17] "http://www.researchgate.net/profile/Wei_Leiyan"              
[18] "http://www.researchgate.net/profile/Yosuke_Inada"            
[19] "http://www.researchgate.net/profile/Nadya_Hajar"             
[20] "http://www.researchgate.net/profile/Gayatr_Venkataraman2"    
[21] "http://www.researchgate.net/profile/Amsha_Viraragavan"       
[22] "http://www.researchgate.net/profile/Wei_Leiyan"              
[23] "http://www.researchgate.net/profile/Yosuke_Inada"            
[24] "http://www.researchgate.net/profile/Yongning_You"            
[25] "http://www.researchgate.net/profile/Susan_Hu6"               
[26] "http://www.researchgate.net/profile/Matt_Evans11"            
[27] "http://www.researchgate.net/profile/Nam_Kieu"                
[28] "http://www.researchgate.net/profile/Nur_Musa3"               
[29] "http://www.researchgate.net/profile/Varaporn_S"              
[30] "http://www.researchgate.net/profile/Askar_Begzat3"           
[31] "http://www.researchgate.net/profile/Bing_Wang63"             
[32] "http://www.researchgate.net/profile/Xuebin_Yan"              
[33] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[34] "http://www.researchgate.net/profile/Stephen_Heimann"         
[35] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[36] "http://www.researchgate.net/profile/Bo_Wang143"              
[37] "http://www.researchgate.net/profile/Xuebin_Yan"              
[38] "http://www.researchgate.net/profile/Roberto_Sibaja_Hernandez"
[39] "http://www.researchgate.net/profile/Stephen_Heimann"         
[40] "http://www.researchgate.net/profile/Hanina_Hanifa"           
[41] "http://www.researchgate.net/profile/Bo_Wang143"              
[42] "http://www.researchgate.net/profile/Huili_Li5"               
[43] "http://www.researchgate.net/profile/Giuseppe_Infusini"       
[44] "http://www.researchgate.net/profile/Carmen_Wacher"           
[45] "http://www.researchgate.net/profile/Linyn_Linyn"             
[46] "http://www.researchgate.net/profile/Dan_Youel"               
[47] "http://www.researchgate.net/profile/Catherine_Williams16"    
[48] "http://www.researchgate.net/profile/Nichole_Macaraeg"        
[49] "http://www.researchgate.net/profile/Peter_Oroszlan"          
[50] "http://www.researchgate.net/profile/Eduard_Karamov"          
[51] "http://www.researchgate.net/profile/Mauricio_Franco3"        
[52] "http://www.researchgate.net/profile/Patricia_Zancan"         
[53] "http://www.researchgate.net/profile/Rohana_Dassanayake"      
[54] "http://www.researchgate.net/profile/Khadija_Khataby"         
[55] "http://www.researchgate.net/profile/Imane_Moest"             
[56] "http://www.researchgate.net/profile/Rory_Adey"
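
The listing above contains repeats (entries 19 to 23 duplicate 14 to 18, and 37 to 41 duplicate 32 to 36), presumably because consecutive offsets overlap. A minimal sketch of dropping them with base R's unique(), using a few entries copied from the output:

```r
profileUrls <- c(
  "http://www.researchgate.net/profile/Nadya_Hajar",
  "http://www.researchgate.net/profile/Yosuke_Inada",
  "http://www.researchgate.net/profile/Nadya_Hajar",   # repeat of the first entry
  "http://www.researchgate.net/profile/Yongning_You"
)

# unique() keeps the first occurrence of each URL and preserves order
uniqueUrls <- unique(profileUrls)
length(uniqueUrls)
```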

As an alternative to RSelenium, you could try the following (using the first 56 followers as an example):

library(XML)
library(jsonlite)
offsets <- seq(from = 1, to = 50, 18)
urls <- sprintf("http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000&offset=%d", offsets)

df <- data.frame()
for (x in seq_along(urls)) {
  doc <- htmlParse(urls[x])
  # Take the <script> node matched by //script[5] as text
  script <- as(doc[['//script[5]']], "character")
  # Split the script into one chunk per createInitialWidget(...) call
  splits <- strsplit(script, '\\(function\\(\\)\\{Y\\.rg\\.createInitialWidget\\("[^"]+",')[[1]][-1]
  res <- lapply(splits, function(split) {
    split <- sub(");})();\n", "", split, fixed = TRUE)
    # Strip backslashes, parse the remaining JSON, and skip chunks that fail to parse
    res <- try(as.data.frame(t(unlist(fromJSON(gsub("\\", "", split, fixed = TRUE))))), silent = TRUE)
    if (!inherits(res, "try-error")) return(res) else return(NULL)
  })
  df <- rbind(df, do.call(rbind, res[1:(length(res)-2)]))
}
dplyr::glimpse(df)
# Observations: 56
# Variables:
# $ _isReact                                                         (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.displayName                                                 (fctr) Jose Maria Carbajo, Daniele Riccio, Fiona S Togneri, Sukanya Paramashivaiah Patel, Neri Fattorini, Pham thi thuy van, Kestutis Sasnauskas, Iris Weintal, Godelieve Verhaegen, Ja...
# $ data.profile.professionalInstitution.professionalInstitutionName (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.profile.professionalInstitution.professionalInstitutionUrl  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.professionalInstitutionName                                 (fctr) Instituto Nacional de Investigaciu00f3n y Tecnologu00eda Agraria y Alimentaria, University of Milan, Birmingham Women's NHS Foundation Trust, Himalya drug company, University o...
# $ data.professionalInstitutionUrl                                  (fctr) institution/Instituto_Nacional_de_Investigaciones_y_Experiencias_Agronomicas_y_Forestales, institution/University_of_Milan, institution/Birmingham_Womens_NHS_Foundation_Trust, ...
# $ data.url                                                         (fctr) profile/Jose_Carbajo2, profile/Daniele_Riccio, profile/Fiona_Togneri2, profile/Sukanya_Patel, profile/Neri_Fattorini, profile/Pham_Thi_Thuy_Van, profile/Kestutis_Sasnauskas, pr...
# $ data.imageUrl                                                    (fctr) http://c1.rgstatic.net/m/797670414832/images/template/default/profile/profile_default_m.jpg, http://i1.rgstatic.net/i/profile/54a1a5539f8e2f289f_m_25d91.jpg, http://i1.rgstatic...
# $ data.imageSize                                                   (fctr) m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m, m
# $ data.imageHeight                                                 (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.imageWidth                                                  (fctr) 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
# $ data.enableFollowButton                                          (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.enableHideButton                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.enableConnectionButton                                      (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.isClaimedAuthor                                             (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.hasExtraContainer                                           (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showStatsWidgets                                            (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.showHideButton                                              (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.accountKey                                                  (fctr) Jose_Carbajo2, Daniele_Riccio, Fiona_Togneri2, Sukanya_Patel, Neri_Fattorini, Pham_Thi_Thuy_Van, Kestutis_Sasnauskas, Iris_Weintal, Godelieve_Verhaegen, Janani_Venkatraman2, Ka...
# $ data.hasInfoPopup                                                (fctr) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# $ data.hasTeaserPopup                                              (fctr) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
# $ data.widgetId                                                    (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ id                                                               (fctr) rgw3_5539fc8299ef4, rgw4_5539fc8299ef4, rgw5_5539fc8299ef4, rgw6_5539fc8299ef4, rgw7_5539fc8299ef4, rgw8_5539fc8299ef4, rgw9_5539fc8299ef4, rgw10_5539fc8299ef4, rgw11_5539fc829...
# $ templateName                                                     (fctr) application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, application/stubs/PeopleItem.html, a...
# $ templateExtensions                                               (fctr) generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, generalHelpers, ...
# $ widgetUrl                                                        (fctr) http://www.researchgate.net/application.PeopleAccountItem.html?entityId=7508014&imageSize=m&enableFollowButton=1&showHideButton=0&showConnectionButton=0&event=tp_followers_xflw...
# $ viewClass                                                        (fctr) views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views.application.PeopleItemView, views....
# $ yuiModules                                                       (fctr) rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleItemView, rg.views.application.PeopleI...

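The server returns the data as JSON if you ask for it. Each subsequent call uses the offset parameter supplied by the previous JSON response. In the example below I only request the first 10 offsets, which is equivalent to scrolling down 10 times. The data contains much more than the profile web links: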

library(RCurl)
library(XML)
library(jsonlite)
# get initial page
initURL <- "http://www.researchgate.net/topic/biotechnology"
doc <- htmlParse(initURL)
noFollowers <- doc["//*/strong/*/a[@class='js-see-all']", fun = xmlValue][[1]]
noFollowers <- as.integer(gsub("[^0-9]", "", noFollowers))

appURL <- "http://www.researchgate.net/publictopics.KeywordFollowersPeopleList.html?view=dialog&showFollowButton=1&followEvent=tp_followers_xflw&keywordId=4f15497280e582373c000000"
# Ask for JSON instead of HTML via the Accept header
appData <- getURL(appURL,
                  httpheader = c(accept = "application/json"))
follData <- list(fromJSON(appData)$result$data$content$data$listItems)
for (i in 1:10) {
  # Each response carries the offset to use for the next request
  nextURL <- fromJSON(appData)$result$data$nextOffset
  appData <- getURL(paste0(appURL, "&offset=", nextURL),
                    httpheader = c(accept = "application/json"))
  follData[[i + 1]] <- fromJSON(appData)$result$data$content$data$listItems
}
# Collect the relative profile paths from every batch
followers <- na.omit(do.call(c, lapply(follData, function(x) { x$data$url })))

> head(followers)
[1] "profile/Subhashish_Dutta" "profile/Jerome_Wang3"     "profile/Jose_Carbajo2"   
[4] "profile/Daniele_Riccio"   "profile/Fiona_Togneri2"   "profile/Sukanya_Patel"
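
To go beyond a fixed number of requests, the same idea can be written as a loop that follows the offset until none is returned, and the relative paths can then be turned into full links. This is only a sketch: `fakeResponses` and `fetchBatch` are hypothetical in-memory stand-ins for the JSON calls, and the assumption that the last response carries no next offset has not been checked against the real endpoint.

```r
siteUrl <- "http://www.researchgate.net/"

# Hypothetical in-memory stand-in for the JSON responses, keyed by offset
fakeResponses <- list(
  "0" = list(urls = c("profile/A", "profile/B"), nextOffset = 2),
  "2" = list(urls = c("profile/C"), nextOffset = NULL)
)
fetchBatch <- function(offset) fakeResponses[[as.character(offset)]]

offset <- 0
batches <- list()
# Assumed stop condition: the final response has no nextOffset
while (!is.null(offset)) {
  batch <- fetchBatch(offset)
  batches[[length(batches) + 1]] <- batch$urls
  offset <- batch$nextOffset
}

# Prepend the site root to the relative profile paths
fullUrls <- paste0(siteUrl, unlist(batches))
fullUrls
```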