Web scraping with PhantomJS/Selenium (from R): setting element values

Question

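I am having trouble scraping table data from http://www.nasdaqomx.com/commodities/market-prices

I can get the data, but I can't seem to change or set the parameters on the page, so I cannot retrieve any other data sets.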

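These are the IDs I can find on the page:

'#marketSelectId, #typesSelectId, #productsSelectId, #dateId, #isTraded, #excelId'

The ones I need to change seem to be (according to SelectorGadget in Chrome):

'#marketSelectId, #isTraded' (the relevant code from the web page is at the end)

Any help on how to change these would be appreciated.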

My PhantomJS attempt is as follows:

// phantomNasdaqOmx.js
var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'NasdaqOmx.html';

page.open('http://www.nasdaqomx.com/commodities/market-prices/history/',
  function (status) {

    // no luck
    //  page.evaluate(function(){
    //    document.getElementById("#isTraded").value = false;
    //  });

    // no luck
    //  $('.myCheckbox').removeAttr('checked');

    // no luck
    page.evaluate(function(){
      document.getElementById('marketSelectId').value='EUK';
    });

    var content = page.content;
    fs.write(path,content,'w');

    phantom.exit();
  });

My RSelenium attempt:

require('RSelenium')
library('XML')

remDr <- remoteDriver(remoteServerAddr = "localhost" 
                  , port = 32770L
                  , browserName = "firefox"
)

remDr$open()

site <- "http://www.nasdaqomx.com/commodities/market-prices" # create URL for each page to scrape
remDr$navigate(site) # navigates to webpage
## remDr$findElements(using = 'xpath', value = '//*@id')
remDr$executeScript("document.getElementById('marketSelectId').setAttribute('value', 'EUK')")

remDr$executeScript("document.getElementById('isTraded').setAttribute('value', '')");
##a <- remDr$executeScript("document.getElementById('isTraded').getAttribute('value')")
## remDR$ findElement(By.id("isTraded")).getAttribute("value");
##
##  Throws error
##  remDr$click(buttonId = 'isTraded')

elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string

## elem$highlightElement() # just for interactive use in browser.  not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)

head(master)

marketSelectId: the values I need are 'eno', 'ede' and 'euk'; the relevant page source follows.

//*[(@id = "marketSelectId")]
webpage js code
<label>Market:</label> <select id="marketSelectId">
    <!--optgroup label="Electricity"-->
    <option selected="selected" value="ENO">Electricity Nordic</option>
    <option value="EBE">Electricity Belgium</option>
    <option value="EFR">Electricity France</option>
    <option value="EDE">Electricity Germany</option>
    <option value="EIT">Electricity Italy</option>
    <option value="ENL">Electricity Netherlands</option>
    <option value="EES">Electricity Spain</option>
    <option value="EUK">Electricity UK</option>
    <!--/optgroup-->
    <option value="EUA">Carbon Market</option>
    <option value="ZEE">Natural Gas Belgium</option>        
    <option value="PNO">Natural Gas France</option>
    <option value="GPO">Natural Gas Germany</option>
    <option value="TTF">Natural Gas Netherlands</option>
    <option value="NGUK">Natural Gas UK</option>
    <!--option value="ELEUR">Electricity Certificates</option-->
    <option value="ELSEK">Swedish Electricity Certificate</option>
    <option value="NCFO">Fuel Oil</option>
    <option value="NCDF">Freight - Dry</option>
    <option value="NCTC">Freight - Tankers Clean</option>
    <option value="NCTD">Freight - Tankers Dirty</option>
    <!--option value="COAL">Coal</option-->
    <option value="NCSF">Seafood</option>
    <option value="STEEL">Steel</option>
    <option value="NCIO">Iron Ore</option>
    <option value="RWEU">Renewables</option>
    <option value="COKCOAL">Coking Coal</option>
</select>

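isTraded: the relevant page script follows. I want to change it from checked to unchecked. (I don't know the correct value for this field; the code seems to test for 'checked' among other things, but setting that does not work.)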

//*[(@id = "isTraded")]
webpage js code
        // only those who have oi or volume
    if ( $("#isTraded").is(":checked")) {
        xpath += "[ph/hi/@rv!='' or ph/hi/@tv!='']"; //or ph/hi/@oi!=''
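A note on the "correct value": the test above, `$("#isTraded").is(":checked")`, reads the checkbox's `checked` state rather than its `value`, and `getElementById` expects the bare id without a leading `#`. The following is a minimal sketch of unticking the box from injected JavaScript. `setCheckbox` is a hypothetical helper name, and it is an assumption that the page redraws the table in response to a `change` event on the checkbox.

```javascript
// Sketch only: `setCheckbox` is a hypothetical helper, not part of the page.
// `doc` is the page's document object.
function setCheckbox(doc, id, checked) {
  var box = doc.getElementById(id);       // bare id, no leading '#'
  if (!box) { return false; }             // element not found
  if (box.checked !== checked) {
    box.checked = checked;                // the state that :checked reports
    // Assumption: the page reacts to a 'change' event on the checkbox.
    var evt = doc.createEvent('HTMLEvents');
    evt.initEvent('change', true, false); // bubbles, not cancelable
    box.dispatchEvent(evt);
  }
  return true;
}
```

Inside the page it would be called as `setCheckbox(document, 'isTraded', false)`.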

Answer 1

You need to use the clickElement method. You can also use the selectTag method to work with the select menu.

library(RSelenium)
library(XML)

# function to wait for the update to appear (defined before its first use)
waitforupdate <- function(remDr, maxwait = 30){
  chk <- FALSE
  count <- 0L
  while(!chk && count < maxwait){
    count <- count + 1L
    res <- suppressMessages(
      tryCatch({
        remDr$findElement("css", "#derivatesNordicOutput span[title = 'Last update']")
      },
      error = function(e){e}
      )
    )
    chk <- !inherits(res, "error")
    Sys.sleep(1L)
  }
  # fail only if the element never appeared, not merely because the last try was used
  if(!chk){
    stop("table has not updated in allotted time")
  }
}

rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("http://www.nasdaqomx.com/commodities/market-prices")
# toggle the isTraded checkbox
isTraded <- remDr$findElement("id", "isTraded")
isTraded$clickElement()
waitforupdate(remDr)
marketSelect <- remDr$findElement("id", "marketSelectId")
msSelect <- marketSelect$selectTag()
# select seafood market
seafood <- msSelect$elements[msSelect$text == "Seafood"][[1]]
# switch to seafood market
seafood$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string

## elem$highlightElement() # just for interactive use in browser.  not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=TRUE) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)

# UPDATE: get German electricity prices
gerElec <- msSelect$elements[msSelect$text == "Electricity Germany"][[1]]
gerElec$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=TRUE) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)


# close browser and stop server
remDr$close()
rD[["server"]]$stop()
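
For the PhantomJS route tried in the question, one likely reason the attempt had no effect is that assigning `.value` to a select changes the selected option but does not fire a `change` event, so any handlers on the page never run. The script also reads `page.content` immediately, before a reload could finish. Below is a minimal sketch of the missing step. `setSelectValue` is a hypothetical helper name, and it is an assumption that the page reloads the table when the select receives a `change` event.

```javascript
// Sketch only: `setSelectValue` is a hypothetical helper, not a PhantomJS API.
// `doc` is the page's document object.
function setSelectValue(doc, id, value) {
  var el = doc.getElementById(id);        // bare id, no leading '#'
  if (!el) { return false; }              // element not found
  el.value = value;                       // selects the option, fires no event
  // Assumption: the page listens for 'change' on the select.
  var evt = doc.createEvent('HTMLEvents');
  evt.initEvent('change', true, false);   // bubbles, not cancelable
  el.dispatchEvent(evt);                  // lets the page's listeners run
  return true;
}
```

In PhantomJS the function passed to `page.evaluate` runs sandboxed inside the page, so the helper would have to be defined within that function and called as `setSelectValue(document, 'marketSelectId', 'EUK')`. Writing `page.content` to disk would then need to be delayed, for example with `setTimeout`, until the table has reloaded.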

Tags: selenium, automated-tests, web-scraping, phantomjs, rselenium