Can't reuse newly populated links using puppeteer

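I've written a script in node.js, in combination with puppeteer, to parse the links to the titles of all the posts on a webpage and then use those links to reach their inner pages and scrape the titles.

I could have scraped the titles from the landing page itself, but my intention is to navigate using those newly populated links and parse the titles from the target pages. When I execute my script, it scrapes the first title and then throws an error. How can I make it work while keeping the logic I've tried to apply?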

Link to the site

Link to one of such target pages

This is my script so far:

const puppeteer = require("puppeteer");

(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    await page.goto("https://whosebug.com/questions/tagged/web-scraping?sort=newest&pageSize=50");
    page.waitForSelector(".summary");
    const sections = await page.$$(".summary");

    for (const section of sections) {
        const itemName = await section.$eval(".question-hyperlink", item => item.href);

        (async function main() {
            await page.goto(itemName);
            page.waitForSelector(".summary");
            const titles = await page.$$("#question-header");

            for (const title of titles) {
                const itmName = await title.$eval("#question-header .question-hyperlink", itm => itm.innerText);
                console.log(itmName);
            }
        })();
    }
    browser.close();
})();

What I see in the console:

(node:1992) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (c:\Users\WCS\node_modules\puppeteer\lib\ExecutionContext.js:144:15)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:1992) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1992) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

How to search content related to keyword in an website?

(node:1992) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector ".summary" failed: timeout 30000ms exceeded
    at new WaitTask (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:862:28)
    at Frame._waitForSelectorOrXPath (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:753:12)
    at Frame.waitForSelector (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:711:17)
    at Page.waitForSelector (c:\Users\WCS\node_modules\puppeteer\lib\Page.js:1043:29)
    at main (c:\Users\WCS\scrape.js:15:18)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:1992) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

As you can see, I got one result in between the errors.

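I haven't reproduced the scenario, but your two errors come from:

  • The missing await in front of both page.waitForSelector(".summary"); calls.
  • You leave your context by calling page.goto() inside the for loop, and then try to evaluate something on a section object that is no longer part of the DOM.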
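The effect of a missing await can be reproduced without a browser. The sketch below is only an illustration: fakeWaitForSelector is a hypothetical stand-in for page.waitForSelector(), built on a timer, and is not part of puppeteer. Without await the call merely returns a pending promise and the next line runs immediately; with await the function pauses until the promise settles.

```javascript
// Hypothetical stand-in for page.waitForSelector(): resolves after a short delay.
function fakeWaitForSelector(selector, log) {
    return new Promise(resolve => {
        setTimeout(() => {
            log.push(`found ${selector}`);
            resolve();
        }, 10);
    });
}

// Missing await: the promise is created, but the function moves on at once.
async function withoutAwait() {
    const log = [];
    fakeWaitForSelector(".summary", log);
    log.push("query elements");
    return [...log]; // snapshot taken before the timer has fired
}

// With await: execution pauses until the selector has been "found".
async function withAwait() {
    const log = [];
    await fakeWaitForSelector(".summary", log);
    log.push("query elements");
    return [...log];
}

(async () => {
    console.log(await withoutAwait()); // [ 'query elements' ]
    console.log(await withAwait());    // [ 'found .summary', 'query elements' ]
})();
```

In the first case the elements are queried before the wait has finished, which is the ordering problem the un-awaited calls introduce in the script.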

To fix the first issue, just add the two missing awaits.

To fix the second issue, open a new page with let newPage = await browser.newPage(); followed by await newPage.goto('whereveryouwanttogo.com'). That way you don't destroy the original page and can still work with section.
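A minimal sketch of that idea follows. The helper name readTitle and its parameters are assumptions made for illustration and are not part of puppeteer; the helper relies only on browser.newPage(), page.goto(), page.waitForSelector(), page.$eval() and page.close(). Opening a separate tab for each link leaves the listing page and its element handles intact, and closing the tab in finally keeps tabs from piling up if a navigation fails.

```javascript
// Hypothetical helper: open `url` in its own tab and return the text of `selector`.
// `browser` is expected to be a puppeteer Browser (or any object with the same methods).
async function readTitle(browser, url, selector) {
    const newPage = await browser.newPage();
    try {
        await newPage.goto(url);
        await newPage.waitForSelector(selector);
        const text = await newPage.$eval(selector, el => el.textContent);
        return text.trim();
    } finally {
        await newPage.close(); // always release the tab, even when a step throws
    }
}
```

Inside the original loop this could be called as console.log(await readTitle(browser, itemName, "#question-header a")); so that the original page never navigates away.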

There are two ways to solve your problem:

First: build an array of the URLs you want to visit, then reuse page to go through them.

const puppeteer = require("puppeteer");

(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    await page.goto("https://whosebug.com/questions/tagged/web-scraping?sort=newest&pageSize=50", {waitUntil: 'networkidle2'});
    await page.waitForSelector(".summary");
    const urls = await page.$$eval(".question-hyperlink", items => items.map(item => item.href));
    console.log(urls);

    for (const url of urls) {
        await page.goto(url);
        await page.waitForSelector("#question-header");
        let title = await page.$eval("#question-header a", item => item.textContent);
        console.log(title);
    }

    await browser.close();
})();

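Second: as Roman suggested, create another page and use it to visit the question pages.

Here is a copy of your script with approach 2 implemented and a few other issues corrected (missing await operators, incorrect selector on the question page):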

const puppeteer = require("puppeteer");

(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    const newPage = await browser.newPage();
    await page.goto("https://whosebug.com/questions/tagged/web-scraping?sort=newest&pageSize=50", {waitUntil: 'networkidle2'});
    await page.waitForSelector(".summary");
    const sections = await page.$$(".summary");

    for (const section of sections) {
        let itemURL = await section.$eval(".question-hyperlink", item => item.href);

        await newPage.goto(itemURL);
        await newPage.waitForSelector("#question-header"); // <-- was ".summary"
        let titles = await newPage.$$("#question-header");

        for (let title of titles) {
            let itmName = await title.$eval("#question-header .question-hyperlink", itm => itm.innerText);
            console.log(itmName);
        }
    }
    await browser.close();
})();
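Both scripts above still leave the browser open if any step throws, and the rejection would surface as the UnhandledPromiseRejectionWarning seen in the question. One possible way to tighten this, sketched here with a hypothetical wrapper named withBrowser (not a puppeteer API), is to close the browser in finally and attach a .catch() to the top-level call.

```javascript
// Hypothetical wrapper: run `task(browser)` and close the browser no matter what.
// `launch` is any function returning a promise of a browser,
// for example () => puppeteer.launch({headless: false}).
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        return await task(browser);
    } finally {
        await browser.close(); // runs on success and on failure
    }
}
```

A call might look like withBrowser(() => puppeteer.launch({headless:false}), async browser => { /* scraping steps */ }).catch(err => console.error(err)); so that any failure is logged once instead of being reported as an unhandled rejection.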