How to detect file size/type mid-download using axios or another requester?

I have a scraper that searches for text on websites it finds through Google searches. However, the URLs it is given are sometimes large files with no extension (e.g. https://myfile.com/myfile/).

I do have a timeout mechanism, but by the time it triggers, the file has already overloaded the memory. Is there any way to detect the file size or file type while the file is still downloading?

This is my request function:

const axios = require('axios')

// `userAgent()` is a helper from elsewhere in the project that returns a user-agent string
const getHtml = async (url, { timeout = 10000, ...opts } = {}) => {
  const source = axios.CancelToken.source()
  // Cancel the request once the timeout elapses
  const timeoutId = setTimeout(() => source.cancel('Request cancelled due to timeout'), timeout)
  try {
    const site = await axios.get(url, {
      headers: {
        'user-agent': userAgent().toString(),
        connection: 'keep-alive', // self note: isn't this prohibited on HTTP/2?
      },
      cancelToken: source.token,
      ...opts,
    })
    return site.data
  } finally {
    // Clear the timer on success and on error alike (originally it was only cleared on success)
    clearTimeout(timeoutId)
  }
}

PS: I have seen similar questions, but none of them had an applicable answer.

OK, so this was not as easy to solve as one would expect. Ideally, the HTTP headers Content-Length and Content-Type would be present, so the consumer would know what to expect, but neither header is required, and in practice they are often missing or inaccurate.
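For completeness, here is a minimal sketch of that header-based check, assuming the server answers HEAD requests (`probeHeaders` is a hypothetical helper name); it only helps when those optional headers are actually sent and truthful:

const axios = require('axios')

// Ask for the headers alone before committing to the download.
// Both headers are optional, so either value may be undefined/NaN.
const probeHeaders = async (url) => {
  const res = await axios.head(url)
  return {
    type: res.headers['content-type'], // e.g. 'text/html; charset=utf-8'
    size: Number(res.headers['content-length']), // NaN when the header is missing
  }
}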

The solution I found for this problem, and which seems very reliable, involves two things:

  1. Making the request as a stream
  2. Reading the file signature that many file formats carry in their first few bytes; these are commonly known as Magic Numbers/Bytes (see the sketch after this list)
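As an illustration of point 2, here is a minimal magic-byte check, assuming `chunk` is a Buffer holding the start of the response (the signature table is a small, well-known sample, not an exhaustive list):

// A few well-known file signatures; extend the table with the formats you care about
const SIGNATURES = [
  { type: 'pdf', bytes: Buffer.from([0x25, 0x50, 0x44, 0x46]) }, // '%PDF'
  { type: 'png', bytes: Buffer.from([0x89, 0x50, 0x4e, 0x47]) },
  { type: 'gif', bytes: Buffer.from([0x47, 0x49, 0x46, 0x38]) }, // 'GIF8'
  { type: 'zip', bytes: Buffer.from([0x50, 0x4b, 0x03, 0x04]) }, // also docx/xlsx/jar
]

// Returns the matched type, or null when the first bytes match no known signature
const detectType = (chunk) =>
  SIGNATURES.find(({ bytes }) => chunk.subarray(0, bytes.length).equals(bytes))?.type ?? null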

A good way to combine these two is to stream the response and read its first bytes to check for a file signature; once you know whether the file is in a format you support/want, you can either handle it as usual or cancel the request before reading the next chunk of the stream, which should keep your system from being overloaded (you can also use this to measure the file size more accurately, as I show in the snippet below).

Here is how I implemented the solution described above:

const axios = require('axios')

// `sizeLimit` (in bytes) was used but never defined in the original snippet;
// here it is exposed as an option with an assumed default of 10 MB
const getHtml = async (url, { timeout = 10000, sizeLimit = 10 * 1024 * 1024, ...opts } = {}) => {
  const source = axios.CancelToken.source()
  const timeoutId = setTimeout(() => source.cancel('Request cancelled due to timeout'), timeout)
  try {
    const res = await axios.get(url, {
      headers: {
        connection: 'keep-alive',
      },
      cancelToken: source.token,
      // Use stream mode so we can read the first chunk before getting the rest
      // (~16 kB per chunk by default, the stream's highWaterMark)
      responseType: 'stream',
      ...opts,
    })
    const stream = res.data
    let firstChunk = true
    let size = 0
    // Not to be confused with ArrayBuffer (the object) ;)
    const bufferArray = []
    // Async-iterator syntax for consuming the stream. Iterating over a stream consumes it fully,
    // but returning from or breaking out of the loop destroys it, cancelling the rest of the download
    for await (const chunk of stream) {
      if (firstChunk) {
        firstChunk = false
        // Only check the first 100 relevant (whitespace-excluded) chars of the chunk for 'html'.
        // This could only misfire on a raw text file that happens to contain the word 'html'
        // right at the top (very unlikely, and even then it wouldn't break anything)
        const stringChunk = String(chunk).replace(/\s+/g, '').slice(0, 100).toLowerCase()
        if (!stringChunk.includes('html')) return { error: `Requested URL is detected as a file. URL: ${url}\nChunk's magic 100: ${stringChunk}` }
      }
      size += Buffer.byteLength(chunk)
      if (size > sizeLimit) return { error: `Requested URL is too large.\nURL: ${url}\nSize: ${size}` }
      bufferArray.push(Buffer.from(chunk)) // the original used `new Buffer.from(chunk)`; `new` is unnecessary here
    }
    // After the stream is fully consumed, concatenate everything into one big buffer,
    // convert it to a string, and return that
    return { html: Buffer.concat(bufferArray).toString() }
  } finally {
    // Clear the timer on success, early return, and error alike
    clearTimeout(timeoutId)
  }
}
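A hypothetical usage example (the URL and limits are placeholders): `getHtml` resolves to `{ html }` for HTML pages and `{ error }` for files or oversized responses, while network failures and timeouts still throw:

const main = async () => {
  try {
    const result = await getHtml('https://example.com/', { timeout: 5000, sizeLimit: 2 * 1024 * 1024 })
    if (result.error) console.warn(result.error)
    else console.log(result.html.slice(0, 200)) // first 200 chars of the page
  } catch (err) {
    console.error('Request failed or timed out:', err.message)
  }
}

main()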