Puppeteer

Question

Using Puppeteer (https://github.com/GoogleChrome/puppeteer), I have a page whose content type is application/pdf. With headless: false, the page is loaded through the Chromium PDF viewer, but I want to run in headless mode. How can I download the original .pdf file, or use it as a blob with another library such as pdf-parse (https://www.npmjs.com/package/pdf-parse)?

Answer 1

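Since Puppeteer does not currently support navigating to a PDF document in headless mode via page.goto() due to the upstream issue, you can use page.setRequestInterception() to enable request interception. You can then listen for the 'request' event and detect whether the resource is a PDF before using a request client to obtain the PDF buffer.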

After obtaining the PDF buffer, you can use request.abort() to abort the original Puppeteer request. If the request is not for a PDF, you can use request.continue() to let it proceed normally.
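The PDF check described above can be done in several ways. The following is a minimal sketch of one option, assuming the PDF can be recognized from its URL path alone. The helper name looksLikePdfUrl is hypothetical and not part of Puppeteer. It looks only at the path, so a URL with a query string or fragment after ".pdf" still matches.

```javascript
'use strict';

// Hypothetical helper: decide from the URL alone whether a request looks like a PDF.
// Only the path is inspected, so query strings and fragments are ignored.
function looksLikePdfUrl(rawUrl) {
  try {
    const { pathname } = new URL(rawUrl);
    return pathname.toLowerCase().endsWith('.pdf');
  } catch (error) {
    return false; // not a parseable absolute URL
  }
}

console.log(looksLikePdfUrl('https://example.com/hello-world.pdf'));      // true
console.log(looksLikePdfUrl('https://example.com/hello-world.pdf?dl=1')); // true
console.log(looksLikePdfUrl('https://example.com/index.html'));           // false
```

Inside a 'request' handler this could be called as looksLikePdfUrl(request.url()). A URL-based check cannot detect PDFs served from paths without a .pdf suffix.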

Here is a complete working example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);

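  page.on('request', request => {
    // Naive check: treat any URL ending in .pdf as a PDF document.
    if (request.url().endsWith('.pdf')) {
      // Download the file outside the browser; encoding: null yields a Buffer.
      request_client({
        uri: request.url(),
        encoding: null,
        headers: {
          'Content-type': 'application/pdf',
        },
      }).then(response => {
        console.log(response); // PDF Buffer
        request.abort();
      }).catch(error => {
        console.error(error);
        request.abort(); // Abort anyway so the intercepted request does not hang.
      });
    } else {
      request.continue();
    }
  });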

  // Aborting the intercepted request makes page.goto() reject, so the error is swallowed here.
  await page.goto('https://example.com/hello-world.pdf').catch(error => {});

  await browser.close();
})();
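The example above only logs the buffer. As a rough sketch of what could come next, the snippet below checks that a buffer starts with the "%PDF-" signature that PDF files normally begin with, and writes it to disk only if it does. The helper names isPdfBuffer and savePdf, the sample bytes, and the output path are all assumptions made for illustration.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// PDF files normally begin with the ASCII bytes "%PDF-".
const PDF_MAGIC = Buffer.from('%PDF-', 'ascii');

// Hypothetical helper: true if the buffer starts with the PDF signature.
function isPdfBuffer(buffer) {
  return Buffer.isBuffer(buffer) &&
    buffer.length >= PDF_MAGIC.length &&
    buffer.subarray(0, PDF_MAGIC.length).equals(PDF_MAGIC);
}

// Hypothetical helper: write the buffer to disk only if it looks like a PDF.
function savePdf(buffer, filename) {
  if (!isPdfBuffer(buffer)) {
    throw new Error('Response body does not look like a PDF');
  }
  fs.writeFileSync(filename, buffer);
  return filename;
}

// Stand-in for the Buffer that the request client resolves with.
const sample = Buffer.from('%PDF-1.4\n%%EOF\n', 'ascii');
const target = path.join(os.tmpdir(), 'hello-world.pdf');

console.log(isPdfBuffer(sample));                // true
console.log(isPdfBuffer(Buffer.from('<html>'))); // false
console.log(savePdf(sample, target) === target); // true
```

A signature check like this catches the common case where a server answers with an HTML error or login page instead of the document.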

Answer 2

Grant Miller's solution did not work for me because I was logged in to the website. If the PDF is public, however, that solution works well.

My solution was to add the cookies to the request:

const fs = require('fs');
const request_client = require('request-promise-native');

await page.setRequestInterception(true);

page.on('request', async request => {
  // This condition is true only for the PDF page (in my case, of course).
  if (request.url().indexOf('exibirFat.do') > 0) {
    // Use the public accessors rather than private fields such as request._method.
    const options = {
      encoding: null, // return the body as a Buffer
      method: request.method(),
      uri: request.url(),
      body: request.postData(),
      headers: request.headers(),
    };
    /* add the cookies */
    const cookies = await page.cookies();
    options.headers.Cookie = cookies.map(ck => ck.name + '=' + ck.value).join('; ');
    /* resend the request */
    const buffer = await request_client(options); // PDF Buffer
    const filename = 'file.pdf';
    fs.writeFileSync(filename, buffer); // save the file
    request.abort(); // the PDF is already saved, so drop the intercepted request
  } else {
    request.continue();
  }
});
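The cookie step above can be isolated into a small function. This is only a sketch: the helper name toCookieHeader and the sample cookies are made up, and it assumes cookie objects carrying at least name and value, as returned by page.cookies(). It joins the pairs with "; ", the separator that RFC 6265 defines for the Cookie header.

```javascript
'use strict';

// Hypothetical helper: turn cookie objects into a Cookie request header value.
// Only the name and value fields are used; other fields are ignored.
function toCookieHeader(cookies) {
  return cookies.map(ck => `${ck.name}=${ck.value}`).join('; ');
}

// Made-up cookies for illustration.
const cookies = [
  { name: 'JSESSIONID', value: 'abc123', domain: 'example.com' },
  { name: 'lang', value: 'pt-BR', domain: 'example.com' },
];

console.log(toCookieHeader(cookies)); // JSESSIONID=abc123; lang=pt-BR
```

In the handler above, the result would be assigned to options.headers.Cookie before the request is resent.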

Puppeteer - How can I get the current page (application/pdf) as a buffer or file?

javascript

pdf

buffer

node.js