CasperJS: iterating over a list of links using casper.each

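I am trying to use CasperJS to get a list of links from a page, then open each link and add a particular type of data from those pages to an array object.

The problem I am having is with the loop that runs over each list item.

First I get a listOfLinks from the original page. This part works, and by checking its length I can confirm that the list is populated.

However, with the loop statement this.each as shown below, none of the console statements show up, and CasperJS seems to skip this block.

Replacing this.each with a standard for loop, execution only gets part of the way through the first link: the statement "Creating new array in object for x.html" appears once and then the code stops executing. Using an IIFE doesn't change this.

EDIT: In verbose debug mode the following happens: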

Creating new array object for https://example.com 
[debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true

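So for some reason the URL that is passed into the thenOpen function gets changed to about:blank...

I feel there is something about the asynchronous nature of CasperJS that I am not grasping here, and I would be grateful to be pointed towards a working example.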

casper.then(function () {

  var date = Date.now();
  console.log(date);

  var object = {};
  object[date] = {}; // new object for date

  var listOfLinks = this.evaluate(function(){
    console.log("getting links");
    return document.getElementsByClassName('importantLink');
  });

  console.log(listOfLinks.length);

  this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

    self.thenOpen(eachPageHref, function () {

      var listOfItems = this.evaluate(function() {
        var items = [];
        // Perform DOM manipulation to get items
        return items;
      });
    });

    object[date][eachPageHref] = items;

  });
  console.log(JSON.stringify(object));

});

You are returning DOM nodes from the evaluate() function, which is not allowed. You can return the actual URLs instead.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Reference: PhantomJS#evaluate
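
The rule of thumb quoted above can be tried out in plain JavaScript. The sketch below is only a model: crossBoundary, makeFakeLink and the example.com URLs are made up for illustration, and what PhantomJS actually hands back for a DOM node may differ in detail. It shows that an object whose useful parts are not JSON-serializable arrives without its href, while an array of strings arrives intact.

```javascript
// Stand-in for the page-to-script boundary: only JSON-serializable data survives.
// (A model for illustration; this is not how PhantomJS is implemented.)
function crossBoundary(value) {
    var json = JSON.stringify(value);
    return json === undefined ? undefined : JSON.parse(json);
}

// Hypothetical stand-ins for DOM nodes: objects whose useful parts are
// functions or non-enumerable properties, which JSON does not carry over.
function makeFakeLink(url) {
    var node = { click: function () {} };
    Object.defineProperty(node, 'href', { value: url, enumerable: false });
    return node;
}

var fakeLinks = [makeFakeLink('https://example.com/a.html'),
                 makeFakeLink('https://example.com/b.html')];

// Returning the "nodes" themselves: href is lost on the other side.
var returnedNodes = crossBoundary(fakeLinks);
console.log(returnedNodes[0].href);

// Extracting plain strings first: the URLs survive.
var returnedHrefs = crossBoundary(fakeLinks.map(function (link) {
    return link.href;
}));
console.log(returnedHrefs);
```

An undefined href on the receiving side would be consistent with the about:blank navigation seen in the debug output, though that link is an inference rather than something the logs prove.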

If I understand your question correctly, the fix is to give items[] a wider scope. In your code, I would do the following:

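var items = []; // declared in the outer (Casper) scope so every step can reach it
this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

    self.thenOpen(eachPageHref, function () {

        // evaluate() runs inside the page and cannot see the outer `items`,
        // so return the data from it and collect it out here instead
        var found = this.evaluate(function() {
            // Perform DOM manipulation to get items
            return whateverTheseItemsAre;
        });
        items = items.concat(found || []);
    });
});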

Hope this helps.

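I decided to use our own Stack Overflow as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise in getting comments from PhantomJS bounty questions.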

var casper = require('casper').create();

casper
.start()
.open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
.then(function () {

    var date = Date.now(), object = {};
    object[date] = {};

    var listOfLinks = this.evaluate(function(){

        // Getting links to other pages to scrape, this will be 
        // a primitive array that will be easily returned from page.evaluate
        var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
          return link.href;
        });    
        return links;
    });

    // Now to iterate over that array of links
    this.each(listOfLinks, function(self, eachPageHref) {

        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {

            // Getting comments from each page, also as an array
            var listOfItems = this.evaluate(function() {
                var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                    return comment.innerText;
                });    
                return items;
            });
            object[date][eachPageHref] = listOfItems;
        });
    });

    // After all links have been scraped, output the resulting object
    this.then(function(){
        console.log(JSON.stringify(object));
    });
})

casper.run();

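What changed: page.evaluate now returns simple arrays, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away inside page.evaluate. There is also this correction: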

 object[date][eachPageHref] = listOfItems; // previously assigned items which were undefined in this scope
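
One more difference in the script is that the final console.log is wrapped in this.then(...). The toy step queue below sketches why that matters. It is a simplified model written for this answer, not CasperJS internals, and the then/run helpers and example.com URLs are hypothetical. Callbacks handed to the queue only run after the current step returns, so code that reads the results synchronously still sees the empty arrays.

```javascript
// Toy step queue (an illustrative model, not CasperJS internals).
var steps = [];
var log = [];

// Schedule a step to run later.
function then(fn) { steps.push(fn); }

// Run steps in order, including ones scheduled while running.
function run() {
    while (steps.length) { steps.shift()(); }
}

var results = {};
var urls = ['https://example.com/a.html', 'https://example.com/b.html']; // hypothetical

then(function () {
    urls.forEach(function (url) {
        results[url] = [];
        // Like thenOpen(): the callback is only queued here, not executed.
        then(function () { results[url] = ['item from ' + url]; });
    });

    // Runs immediately, before any queued callback: arrays are still empty.
    log.push('sync: ' + JSON.stringify(results));

    // Queued behind the callbacks above, so it sees the filled-in arrays.
    then(function () { log.push('step: ' + JSON.stringify(results)); });
});

run();
console.log(log.join('\n'));
```

In this model the "sync" line shows empty arrays and the "step" line shows the collected items, which mirrors why the original script printed an object with nothing in it.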

The result of running the script is:

{"1478596579898":{"":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}