Puppeteer is unable to get the complete source code

Question

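I am building a simple scraping application with Node.js and Puppeteer. The page I am trying to scrape is this one (see the URL in the code below). Here is the code I am currently using.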

const url = `https://www.betrebels.gr/el/sports?catids=122,40,87,28,45,2&champids=423,274616,1496978,1484069,1484383,465990,465991,91,71,287,488038,488076,488075,1483480,201,2,367,38,1481454,18,226,440,441,442,443,444,445,446,447,448,449,451,452,453,456,457,458,459,460,278261&datefilter=TodayTomorrow&page=prelive`
await page.goto(url, {waitUntil: 'networkidle2'});
let content: string = await page.content();
await page.screenshot({path: 'page.png',fullPage: true});
await fs.writeFile("temp.html", content);
//...Analyze the html and other stuff.

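The screenshot I get is exactly what I expect.

The page content, on the other hand, is minimal and does not represent the data shown in the image.

Am I doing something wrong? Am I not waiting properly for the JavaScript to finish?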

Answer 1

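The page is using frames. You are only seeing the main content of the page, without the content of the frames. To also get the content of a frame, you first need to find the frame element (e.g. via page.$) and then get its frame handle via elementHandle.contentFrame. You can then call frame.content() to get the content of the frame.

Simple example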

const frameElementHandle = await page.$('#selector iframe');
const frame = await frameElementHandle.contentFrame();
const frameContent = await frame.content();
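
Both page.$ and elementHandle.contentFrame can resolve to null (no element matched, or the element has no content frame), in which case the three lines above would throw. A slightly more defensive sketch follows; the helper name getFrameContent is hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: returns the HTML of the frame matched by `selector`,
// or null when the element is missing or has no content frame.
async function getFrameContent(page, selector) {
  const frameElementHandle = await page.$(selector);
  if (!frameElementHandle) return null; // nothing matched the selector

  const frame = await frameElementHandle.contentFrame();
  if (!frame) return null; // element exists but is not backed by a frame

  return frame.content();
}
```

It would be called as `await getFrameContent(page, '#selector iframe')`, with a null result signalling that the selector needs adjusting.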

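Depending on how the page is structured, you may need to do this for multiple frames to get all of the content, or even for a frame inside another frame (which seems to be the case for the given page).

Example reading all frame contents

Below is an example that recursively reads the contents of all frames on the page.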
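
To find out how deeply the frames are nested before deciding which ones to read, it can help to print the frame tree. The sketch below relies on Puppeteer's frame.url() and frame.childFrames(); the function name describeFrameTree is an assumption made for this example.

```javascript
// Builds an indented outline of the frame tree, one line per frame.
// `frame` must expose url() and childFrames(), as Puppeteer's Frame does.
function describeFrameTree(frame, depth = 0) {
  const lines = [`${'  '.repeat(depth)}${frame.url()}`];
  for (const child of frame.childFrames()) {
    lines.push(...describeFrameTree(child, depth + 1));
  }
  return lines;
}

// Usage inside an async function, after page.goto(...):
//   console.log(describeFrameTree(page.mainFrame()).join('\n'));
```

Each level of indentation in the output corresponds to one level of frame nesting.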

const contents = [];
async function extractFrameContents(pageOrFrame) {
  const frames = await pageOrFrame.$$('iframe');
  for (const frameElement of frames) {
    const frame = await frameElement.contentFrame();
    if (!frame) continue; // contentFrame() can resolve to null

    const frameContent = await frame.content();

    // do something with the content, example:
    contents.push(frameContent);

    // recursively repeat for frames nested inside this one
    await extractFrameContents(frame);
  }
}
await extractFrameContents(page);
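
As an alternative to querying the DOM for iframe elements, Puppeteer also provides page.frames(), which returns all frames currently attached to the page, nested ones included. A minimal sketch under that approach is shown here; the function name collectAllFrameContents and the { url, html } result shape are illustrative choices, not part of the original answer.

```javascript
// Sketch: collect the HTML of every frame attached to the page.
// page.frames() includes the main frame as well as nested frames.
async function collectAllFrameContents(page) {
  const results = [];
  for (const frame of page.frames()) {
    try {
      results.push({ url: frame.url(), html: await frame.content() });
    } catch (err) {
      // A frame can detach or navigate while it is being read; skip it.
    }
  }
  return results;
}
```

Since the main frame is part of the list, the results also contain the top-level page content that page.content() returns.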

