BeautifulSoup gives garbage for html conversion

Question

I am trying to scrape this URL: url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code:

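    import requests
    from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

    url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup  # gives garbage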

But it gives weird symbols that I think are garbage. This is an HTML file, so it shouldn't be trying to parse it as a PDF file, should it?

I tried the following, from "How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?":

    import urllib2

    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')  # also tried with 'latin-1'
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))

And this, from "Python and BeautifulSoup encoding issues":

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup.prettify('utf-8')

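Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested that the actual encoding can differ even though the meta charset is 'utf8', so I tried the above with 'latin-1' as well, but nothing seems to work.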

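Any suggestions on how to scrape data from the given link? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!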

Answer 1

That's because the URL points to a document in PDF format, so interpreting it as HTML makes no sense at all.
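One way to catch this before handing the response to BeautifulSoup is to look at the `Content-Type` header and at the first bytes of the body, since PDF files normally start with the `%PDF-` signature. The following is only a minimal sketch in Python 3 (the question's code is Python 2), using the standard library instead of `requests`; the helper names `looks_like_pdf` and `fetch_and_classify` are made up for illustration and are not part of any library.

```python
from urllib.request import urlopen


def looks_like_pdf(content_type, body):
    """Return True if the header value or the leading bytes indicate a PDF."""
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to the file signature: PDF files normally begin with "%PDF-".
    return body.startswith(b"%PDF-")


def fetch_and_classify(url):
    """Download url and report whether the response is a PDF (hypothetical helper)."""
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type")
        body = response.read()
    return "pdf" if looks_like_pdf(content_type, body) else "not pdf"
```

Calling `fetch_and_classify(url)` on the URL from the question should report a PDF as long as the server still serves that file. In that case the body is binary data, and decoding it as text and parsing it as HTML produces exactly the kind of symbols described above.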
