Read table from website using pandas read_html

Question

I want to read the table on this website using pandas.read_html. The site lists the top 100 most-viewed news channels on YouTube.

I tried to get the table with pandas:

import pandas as pd
df = pd.read_html('https://socialblade.com/youtube/top/category/news/mostviewed')

However, it raised the following error:

HTTPError: HTTP Error 403: Forbidden

Following that, I pretended to be a browser, but the response text does not seem to contain a table:

import requests
import pandas as pd

url = 'https://socialblade.com/youtube/top/category/news/mostviewed'
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"}
df = pd.read_html(requests.get(url, headers=header).text)

ValueError: No tables found

What is the simplest way to get this table into a pandas.DataFrame object?

Answer 1

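It looks like you want to scrape data from the site. However, I would say you are using the wrong tool for this purpose: if you look closely at the HTML response you are getting from the website, it contains no HTML table tags. The pandas read_html() function searches for <table> elements, as stated in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html#:~:text=This%20function%20searches,into%20the%20header).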
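One quick way to check this diagnosis is to count the <table> elements in the HTML before handing it to read_html. The following is only a minimal sketch using the standard library; the helper name count_tables and both sample HTML strings are made up for illustration and are not taken from the actual site.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.table_count += 1


def count_tables(html_text):
    """Return the number of <table> elements found in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.table_count


# Hypothetical snippets: one real table and one div-based layout.
with_table = "<table><tr><td>Channel A</td><td>1</td></tr></table>"
div_layout = "<div class='row'><div>Channel A</div><div>1</div></div>"

print(count_tables(with_table))
print(count_tables(div_layout))
```

If the count for a response is zero, read_html has nothing to parse and raises "No tables found", which matches the ValueError in the question.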

I suggest using the right tool for the job and scraping the data with Beautiful Soup. It is a Python library for extracting data from web pages.
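As a rough sketch of that approach, the snippet below parses a div-based listing with Beautiful Soup and builds a DataFrame from it. The markup, the class names (row, rank, name, views), the channel names, and the helper rows_to_frame are all hypothetical stand-ins; the real page's structure has to be inspected in the browser and the selectors adjusted accordingly.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page that lays out its
# "table" with <div> elements instead of <table> tags.
sample_html = """
<div class="row">
  <div class="rank">1</div><div class="name">Channel A</div><div class="views">1000</div>
</div>
<div class="row">
  <div class="rank">2</div><div class="name">Channel B</div><div class="views">900</div>
</div>
"""


def rows_to_frame(html_text):
    """Collect one record per div.row and return them as a DataFrame."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for row in soup.find_all("div", class_="row"):
        records.append({
            "rank": int(row.find("div", class_="rank").get_text(strip=True)),
            "name": row.find("div", class_="name").get_text(strip=True),
            "views": int(row.find("div", class_="views").get_text(strip=True)),
        })
    return pd.DataFrame(records, columns=["rank", "name", "views"])


df = rows_to_frame(sample_html)
print(df)
```

For a live page, html_text would be the body of a response fetched with requests, as in the question. If the server still answers 403 or returns a page without the data, this parsing step cannot help on its own.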
