Web scraping with PhantomJS/Selenium (from R), setting element values
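I'm having trouble scraping table data from http://www.nasdaqomx.com/commodities/market-prices
I can retrieve the data, but I can't seem to change/set the parameters on the page, so I can't retrieve any other data.
These are the IDs I can find on the page:
'#marketSelectId, #typesSelectId, #productsSelectId, #dateId,#isTraded, #excelId'
The ones I need to change seem to be (found with the SelectorGadget in Chrome):
'#marketSelectId, #isTraded' (the relevant code from the web page is at the end)
Any help on how to change these would be appreciated.
My PhantomJS attempt is as follows: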
// phantomNasdaqOmx.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'NasdaqOmx.html';
page.open('http://www.nasdaqomx.com/commodities/market-prices/history/',
function (status) {
// no luck
// page.evaluate(function(){
// document.getElementById("#isTraded").value = false;
// });
// no luck
// $('.myCheckbox').removeAttr('checked');
// no luck
page.evaluate(function(){
document.getElementById('marketSelectId').value='EUK';
});
var content = page.content;
fs.write(path,content,'w');
phantom.exit();
});
My RSelenium attempt:
require('RSelenium')
library('XML')
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 32770L
, browserName = "firefox"
)
remDr$open()
site <- "http://www.nasdaqomx.com/commodities/market-prices" # create URL for each page to scrape
remDr$navigate(site) # navigates to webpage
## remDr$findElements(using = 'xpath', value = '//*@id')
remDr$executeScript("document.getElementById('marketSelectId').setAttribute('value', 'EUK')")
remDr$executeScript("document.getElementById('isTraded').setAttribute('value', '')");
##a <- remDr$executeScript("document.getElementById('isTraded').getAttribute('value')")
## remDR$ findElement(By.id("isTraded")).getAttribute("value");
##
## Throws error
## remDr$click(buttonId = 'isTraded')
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string
## elem$highlightElement() # just for interactive use in browser. not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)
head(master)
marketSelectId - values needed and script info: 'eno', 'ede', 'euk'
//*[(@id = "marketSelectId")]
webpage js code
<label>Market:</label> <select id="marketSelectId">
<!--optgroup label="Electricity"-->
<option selected="selected" value="ENO">Electricity Nordic</option>
<option value="EBE">Electricity Belgium</option>
<option value="EFR">Electricity France</option>
<option value="EDE">Electricity Germany</option>
<option value="EIT">Electricity Italy</option>
<option value="ENL">Electricity Netherlands</option>
<option value="EES">Electricity Spain</option>
<option value="EUK">Electricity UK</option>
<!--/optgroup-->
<option value="EUA">Carbon Market</option>
<option value="ZEE">Natural Gas Belgium</option>
<option value="PNO">Natural Gas France</option>
<option value="GPO">Natural Gas Germany</option>
<option value="TTF">Natural Gas Netherlands</option>
<option value="NGUK">Natural Gas UK</option>
<!--option value="ELEUR">Electricity Certificates</option-->
<option value="ELSEK">Swedish Electricity Certificate</option>
<option value="NCFO">Fuel Oil</option>
<option value="NCDF">Freight - Dry</option>
<option value="NCTC">Freight - Tankers Clean</option>
<option value="NCTD">Freight - Tankers Dirty</option>
<!--option value="COAL">Coal</option-->
<option value="NCSF">Seafood</option>
<option value="STEEL">Steel</option>
<option value="NCIO">Iron Ore</option>
<option value="RWEU">Renewables</option>
<option value="COKCOAL">Coking Coal</option>
</select>
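isTraded - script info; I want to change it from checked to unchecked (I don't know the correct value for this field; the page's code seems to test for 'checked' among other things, but setting it does not work)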
//*[(@id = "isTraded")]
webpage js code
// only those who have oi or volume
if ( $("#isTraded").is(":checked")) {
xpath += "[ph/hi/@rv!='' or ph/hi/@tv!='']"; //or ph/hi/@oi!=''
You need to use the clickElement method. You can also use the selectTag method to manipulate the select menu:
library(RSelenium)
library(XML)
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("http://www.nasdaqomx.com/commodities/market-prices")
isTraded <- remDr$findElement("id", "isTraded")
isTraded$clickElement()
waitforupdate(remDr)
marketSelect <- remDr$findElement("id", "marketSelectId")
msSelect <- marketSelect$selectTag()
# select seafood market
seafood <- msSelect$elements[msSelect$text == "Seafood"][[1]]
# switch to seafood market
seafood$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string
## elem$highlightElement() # just for interactive use in browser. not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)
# function to wait for update to appear (define this before running the code above)
waitforupdate <- function(remDr, maxwait = 30){
chk <- FALSE
count <- 0L
while(!chk && count < maxwait){
count <- count + 1L
res <- suppressMessages(
tryCatch({
remDr$findElement("css", "#derivatesNordicOutput span[title = 'Last update']")
},
error = function(e){e}
)
)
chk <- !inherits(res, "error")
Sys.sleep(1L)
}
if(!chk){
stop("table has not updated in allotted time")
}
}
# UPDATE get german electric prices
gerElec <- msSelect$elements[msSelect$text == "Electricity Germany"][[1]]
gerElec$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)
# close browser and stop server
remDr$close()
rD[["server"]]$stop()
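For the PhantomJS attempt in the question, the same idea (interact with the controls rather than only writing a value) could be sketched as below. This is a rough sketch, not a tested solution for this site: it assumes the page reacts to a "change" event on #marketSelectId and to a click on #isTraded, and the helper name applyFilters and the fixed 5-second delay are made up for illustration. Note that getElementById takes the bare id, without the leading '#'.

```javascript
// Runs inside the page (via page.evaluate), so it can only use its own
// arguments and the page's global `document`.
// Hypothetical helper: the name and signature are not part of any library.
function applyFilters(marketValue, tradedOnly) {
  function fire(el, type) {
    // createEvent/initEvent is used because older WebKit builds
    // (as bundled with PhantomJS) may not support `new Event(...)`.
    var evt = document.createEvent('HTMLEvents');
    evt.initEvent(type, true, false); // bubbles, not cancelable
    el.dispatchEvent(evt);
  }

  var market = document.getElementById('marketSelectId'); // no leading '#'
  var traded = document.getElementById('isTraded');
  if (!market || !traded) { return false; }

  // Clicking toggles the checkbox and lets the page's own handlers run.
  if (traded.checked !== tradedOnly) {
    traded.click();
  }

  // Setting .value alone fires no event, so dispatch "change" explicitly.
  market.value = marketValue;
  fire(market, 'change');
  return true;
}

// Only executed when the file is run with PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  page.open('http://www.nasdaqomx.com/commodities/market-prices', function (status) {
    if (status !== 'success') { phantom.exit(1); return; }
    page.evaluate(applyFilters, 'EUK', false);
    // Assumption: 5 seconds is enough for the table to reload.
    setTimeout(function () {
      fs.write('NasdaqOmx.html', page.content, 'w');
      phantom.exit();
    }, 5000);
  });
}
```

Writing page.content immediately after page.evaluate, as in the question, would most likely save the table before it has been refreshed, which is why the sketch waits before writing the file. Polling for the updated table, as waitforupdate does in the R code above, would be more robust than a fixed delay.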