Formatting text scraped with headless chrome crawler
The code below scrapes text from multiple elements on a page, but the text needs to be formatted (spaces added, etc.) before I can use it elsewhere.
I have some JavaScript (which works in the browser console) that loops through the elements, adds their text to an array, and then converts that to a string, which is what I want. Can that code be reused here? I'm not sure where, or whether, I can add it.
const HCCrawler = require("headless-chrome-crawler");
const CSVExporter = require("headless-chrome-crawler/exporter/csv");

const FILE = "result.csv";

const exporter = new CSVExporter({
  file: FILE,
  fields: ["response.url", "response.status", "result.text"],
});

(async () => {
  const crawler = await HCCrawler.launch({
    maxDepth: 9999,
    exporter,
    allowedDomains: ["example.com"],
    // Function to be evaluated in browsers
    evaluatePage: () => ({
      text: $("h1, h2, p").text(),
    }),
    // Function to be called with evaluated results from browsers
    onSuccess: (result) => {
      console.log(result.result.text);
    },
  });
  // Queue a request
  await crawler.queue("https://example.com");
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();
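For reference, the console snippet described in the question isn't shown, but a minimal sketch of that kind of logic (an illustration only, assuming jQuery is available on the page as the crawler code already assumes, with parts and combined as made-up names) could look like this:

// Sketch: collect each element's text into an array,
// then join with spaces so the pieces don't run together.
const parts = [];
$("h1, h2, p").each(function () {
  parts.push($(this).text().trim());
});
const combined = parts.join(" ");
console.log(combined);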
Yes, I think you can add a post-crawl cleanup step in the evaluatePage callback and apply your code there:
...
// evaluatePage is serialized and executed in the browser context,
// so define the helper inside the callback rather than in Node scope.
evaluatePage: () => {
  function cleanCrawledText(text) {
    // clean the text here (add spaces, trim, etc.) and return it
    return text;
  }
  return {
    text: cleanCrawledText($("h1, h2, p").text()),
  };
},
...
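The body of cleanCrawledText is left as a placeholder above. One possible implementation, assuming the formatting you need is collapsing runs of whitespace and newlines into single spaces, might be:

function cleanCrawledText(text) {
  // Collapse whitespace runs into single spaces and trim the ends.
  return text.replace(/\s+/g, " ").trim();
}

If $("h1, h2, p").text() still runs separate elements together with no space between them, the per-element join from your console snippet can be used inside evaluatePage instead of cleaning the concatenated string afterwards.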