How do I fix this web scraper made using puppeteer, which does nothing after scraping half the data but gives no error?
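For my university project, I built a Wikipedia scraper using Node.js and puppeteer. It works for every link except one. On that page, after scraping almost half of the table's data (I am using console.log to see what has been scraped so far), it does nothing. It shows no error. It does not stop executing; it just does nothing afterwards. The puppeteer browser does not close either.
In the original scraper, I looped over a list of links to generate the data. Since that did not work, I wrote a separate scraper for this one link, but the same thing happens. Can anyone help me?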
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
    try {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 800 });
        link = "https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_June_2016";
        console.log("==============================");
        console.log("Travelling to link:", link);
        console.log("==============================");
        await page.goto(link, {waitUntil: 'networkidle0'});
        let rowArray = await page.$$("table[class='wikitable sortable jquery-tablesorter'] > tbody > tr");
        var dataA = [];
        for(let row of rowArray){
            let date = await row.$eval('td:nth-child(1)', element => element.textContent);
            date = date.substring(0, date.length - 1);
            let type = await row.$eval('td:nth-child(2)', element => element.textContent);
            type = type.substring(0, type.length - 1);
            let dead = await row.$eval('td:nth-child(3)', element => element.textContent);
            dead = dead.substring(0, dead.length - 1);
            let injured = await row.$eval('td:nth-child(4)', element => element.textContent);
            injured = injured.substring(0, injured.length - 1);
            let location = await row.$eval('td:nth-child(5)', element => element.textContent);
            location = location.substring(0, location.length - 1);
            let details = await row.$eval('td:nth-child(6)', element => element.textContent);
            details = details.substring(0, details.length - 1);
            let perpetrator = await row.$eval('td:nth-child(7)', element => element.textContent);
            perpetrator = perpetrator.substring(0, perpetrator.length - 1);
            let partOf = await row.$eval('td:nth-child(8)', element => element.textContent);
            partOf = partOf.substring(0, partOf.length - 1);
            console.log("==============================");
            console.log({date, type, dead, injured, location, details, perpetrator, partOf});
            console.log("==============================");
            dataA.push({date, type, dead, injured, location, details, perpetrator, partOf});
        }
        console.log("==============================");
        console.log("Started writing JSON file");
        fs.writeFileSync(`./june.json`, JSON.stringify(dataA), 'utf-8');
        console.log("Finished writing JSON file");
        console.log("==============================");
        await browser.close();
    } catch (error) {
        console.error();
    }
})();
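Just look at where it stops.
It seems the script cannot handle the next row, which has no "closing cell".
My guess is that it will work if you edit the page and close that cell (or update your script to handle that case).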
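One way to update the script for that case is to read all of a row's cells in a single call and fill in a default for any cell that is missing. The following is only a sketch under assumptions: the column order is taken from the selectors in the question, and `FIELDS`, `toRecord`, and `scrapeRow` are hypothetical names, not part of puppeteer.

```javascript
// Column order assumed from the question's selectors (td:nth-child(1) to (8)).
const FIELDS = ['date', 'type', 'dead', 'injured', 'location', 'details', 'perpetrator', 'partOf'];

// Hypothetical helper: turn an array of cell texts into a record,
// using '' for any trailing cell that the row does not have.
function toRecord(cells) {
  const record = {};
  FIELDS.forEach((name, i) => {
    // trim() also drops the trailing newline that the question
    // removed with substring(0, length - 1).
    record[name] = (cells[i] ?? '').trim();
  });
  return record;
}

// Hypothetical helper: `row` is expected to be a puppeteer ElementHandle for a <tr>.
// $$eval passes every matching <td> to the callback; when nothing matches,
// the callback receives an empty array instead of the call throwing.
async function scrapeRow(row) {
  const cells = await row.$$eval('td', tds => tds.map(td => td.textContent));
  return cells.length > 0 ? toRecord(cells) : null;
}
```

Inside the question's loop this would replace the eight `$eval` calls with something like `const record = await scrapeRow(row); if (record) dataA.push(record);`. A row that lacks the last cell then yields an empty `partOf` instead of failing.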
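Looking at the Wikipedia source, the "part of" cell is missing in that row, so your code stops at this `await`:
let partOf = await row.$eval('td:nth-child(8)', element => element.textContent);
`$eval` throws when nothing matches the selector, but your `catch` block calls `console.error()` without passing `error`, so you do not see any error message.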
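To make a failure like this visible and still close the browser, the launch and scrape steps can be wrapped in `try`/`catch`/`finally`. This is a minimal sketch rather than a definitive fix; `runScraper`, `launchBrowser`, and `scrape` are hypothetical names introduced here for illustration.

```javascript
// Hypothetical wrapper: `launchBrowser` returns a promise for an object with
// an async close() method (for example a puppeteer Browser), and `scrape`
// does the actual work with that browser.
async function runScraper(launchBrowser, scrape) {
  let browser;
  try {
    browser = await launchBrowser();
    return await scrape(browser);
  } catch (error) {
    // Pass the error object; a bare console.error() prints only a blank line.
    console.error('Scrape failed:', error);
    return undefined;
  } finally {
    // Runs on success and on failure, so the browser does not stay open.
    if (browser) await browser.close();
  }
}
```

With puppeteer, the first argument would be something like `() => puppeteer.launch({ headless: false })`, and the second an async function containing the body of the question's `try` block, without its own `browser.close()` call.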