如果找到某个 url，则使用 FS 写入新文件，如果找不到，则删除该文件

Question

我正在尝试编写一个脚本，当发现新的 url 时，它会将 url 转换为散列。检查文件是否已经写入它只是忽略它，如果之前不知道它应该添加。

needle.get(mainUrl, function(err, res) {
  if (err) throw err;

  if (res.statusCode == 200 && !err ) {
    var $ = cheerio.load(res.body)

    var href = $('div div a').each(function(index, element) {
      urlList.push($(element).attr("href"))

      var url =($(element).attr("href"))
      var hash = crypto.createHash('md5').update(url).digest('hex');
                
      fs.writeFile('./directory/otherdirectory' + `${hash}`, url, (err) => {
        if (err) throw err;
        console.log('Hash created: ' + url + ' saved as ' + hash
      });
    }
  )
}
})

这是我目前所做的，但这只会写入新文件。它不会检查文件是否已添加，也不会删除不再找到的文件。

所以我尝试做的是：

我编写了一个脚本，可以在 url 秒内获取网站。
哈希所有 urls。
让FS检查文件是否已经写入，如果已经写入则忽略
如果之前不知道，请将其添加为新文件。
如果在提取时找不到 url，请将其从列表中删除。

Answer 1

我认为这可能是 X/Y problem and for that I'm still awaiting the answer to 。

话虽如此，您可以使用 fs.existsSync, if that returns true just skip saving the current file, otherwise save it. And to remove files that are not available anymore, just get all the files in the directory using fs.readdir and remove files that you whose urls are not in the response using fs.unlink:

简单地忽略现有文件

needle.get(mainUrl, (err, res) => {
  if (err) throw err;

  if (res.statusCode == 200) {
    let $ = cheerio.load(res.body);

    let hashes = [];                                                      // list of hashes for this website (to be used later to keep only the items that are still available)
    $('div div a').each((index, element) => {
      let url = $(element).attr("href");
      let hash = crypto.createHash('md5').update(url).digest('hex');
      hashes.push(hash);                                                 // store the hash of the current url
      
      if (!fs.existsSync('./directory/otherdirectory/' + hash)) {        // if this file doesn't exist (notice the "not operator !" before fs.existsSync)
        fs.writeFile('./directory/otherdirectory/' + hash, url, err => { // save it
          if (err) throw err;
          console.log('Hash created: ' + url + ' saved as ' + hash);
        });
      }
    });

    fs.readdir('./directory/otherdirectory', (err, files) => {           // get a list of all the files in the directory
      if (err) throw err;
      files.forEach(file => {                                            // and for each file
        if(!hashes.includes(file)) {                                     // if it was not encountered above (meaning that it doesn't exist in the hashes array)
          fs.unlink('./directory/otherdirectory/' + file, err => {       // remove it
            if (err) throw err;
          });
        }
      });
    });
});

另一种方法：

由于您似乎只想存储 url，最好的方法是使用一个文件来存储它们，而不是将每个 url 存储在自己的文件中.这样的事情更有效率：

needle.get(mainUrl, (err, res) => {
  if (err) throw err;

  if (res.statusCode == 200) {
    let $ = cheerio.load(res.body);

    let urls = $('div div a')                                           // get the 'a' elements
      .map((index, element) => $(element).attr("href"))                 // map each one into its href attribute
      .get();                                                           // and get them as an array
        
    fs.writeFile('./directory/list-of-urls', urls.join('\n'), err => {  // then save all the urls encountered in the file 'list-of-urls' (each on its own line, hence the join('\n'))
      if (err) throw err;
      console.log('saved all the urls to the file "list-of-urls"');
    });
  }
});

这样旧的 urls 将在每次覆盖文件时自动删除，并自动添加新的 urls。无需检查是否已遇到 url，因为它无论如何都会得到 re-saved。

如果您想在其他地方获取 url 的列表，只需读取文件并将其拆分为 '\n'，如下所示：

 fs.readFile('./directory/list-of-urls', 'utf8', (err, data) => {
   if (err) throw err;
   let urls = data.split('\n');
   // use urls here
 });

如果找到某个 url，则使用 FS 写入新文件，如果找不到，则删除该文件

Using FS to write new files if a certain url is found and remove the file if it's not found anymore

javascript

fs

web-scraping

cheerio