Formatting text scraped with headless chrome crawler

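The code below scrapes text from multiple elements on a page, but the text needs formatting (adding spaces, etc.) before I can use it elsewhere.

I have some JavaScript (which works in the browser console) that loops over the elements, adds their text to an array, and then converts that array to a string, which is what I want. Can that code be reused here? I'm not sure where, or if, I can add it.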

const HCCrawler = require("headless-chrome-crawler");
const CSVExporter = require("headless-chrome-crawler/exporter/csv");

const FILE = "result.csv";

const exporter = new CSVExporter({
  file: FILE,
  fields: ["response.url", "response.status", "result.text"],
});

(async () => {
  const crawler = await HCCrawler.launch({
    maxDepth: 9999,
    exporter,
    allowedDomains: ["example.com"],
    // Function to be evaluated in browsers
    evaluatePage: () => ({
      text: $("h1, h2, p").text(),
    }),
    // Function to be called with evaluated results from browsers
    onSuccess: (result) => {
      // evaluatePage returns an object with a `text` property, not `h1`
      console.log(result.result.text);
    },
  });
  // Queue a request
  await crawler.queue("https://example.com");

  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();

Yes, I think you can add a post-crawl step inside the evaluatePage callback that applies your code:

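 ...
 // evaluatePage is evaluated in the browser, so the helper is defined
 // inside the callback rather than in the surrounding Node.js scope
 evaluatePage: () => {
   function cleanCrawledText(text) {
     // clean text here and return it
     return text;
   }
   return { text: cleanCrawledText($("h1, h2, p").text()) };
 },
 ...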
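
As a rough illustration of what the cleaning step might look like, here is a minimal sketch of the approach described in the question: collect each element's text into an array, then join the array into one string. The name `joinTexts` and the whitespace rules are assumptions made for this example and are not part of headless-chrome-crawler.

```javascript
// Hypothetical helper (not part of headless-chrome-crawler):
// collapses runs of whitespace in each piece of text, drops empty
// pieces, and joins the rest with a single space.
function joinTexts(texts) {
  return texts
    .map((t) => t.replace(/\s+/g, " ").trim())
    .filter((t) => t.length > 0)
    .join(" ");
}

// Example with the kind of strings a page might yield
console.log(joinTexts(["  Title ", "Sub\n  heading", "", "First paragraph."]));
// prints "Title Sub heading First paragraph."
```

Since the comment in the question says evaluatePage is evaluated in the browser, a helper like this would presumably need to be defined inside that callback to be available there. The array itself could be built with jQuery, for example `$("h1, h2, p").map((i, el) => $(el).text()).get()`, and then passed to the helper.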