Web scraping with Python and BeautifulSoup

Question

I'm trying to extract data from a website. The data is in a table:

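import requests
from bs4 import BeautifulSoup

response = requests.get("xxxxx")
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find_all("table")[0]
rows = table.find_all("tr")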

I tried this code, but it only extracts 42 rows, while the table in the page source contains 220 rows. Can anyone tell me how to fix this?

Answer 1

Welcome.
There are two possibilities: JavaScript or website security.

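requests has nothing to do with JavaScript and does not execute any JavaScript code. You need a headless-browser solution that behaves more like a real browser (Selenium is a popular choice), especially when JavaScript is involved.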
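If the missing rows are added by JavaScript, something along these lines might help. This is only a minimal sketch, assuming Selenium and a local Chrome installation are available; the helper names `count_table_rows` and `fetch_rendered_html` are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup


def count_table_rows(html: str) -> int:
    """Count the <tr> elements inside the first <table> of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return len(table.find_all("tr")) if table is not None else 0


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chrome and return the DOM after scripts have run."""
    # Imported here so count_table_rows can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment; rows that load later
        # may need an explicit wait before reading it.
        return driver.page_source
    finally:
        driver.quit()
```

Comparing `count_table_rows(requests.get(url).text)` with `count_table_rows(fetch_rendered_html(url))` is one way to check whether JavaScript accounts for the difference between 42 and 220 rows.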

Many websites do not want to be scraped and use various methods to prevent it. The simplest forms are checking the client's User-Agent value (the client here being your Python script) or rate limiting (20k refreshes per second is not human). For example, if the User-Agent is not a known value, the site will behave differently (returning little or no data). Other forms of defense are more complex, such as trying to play audio in your "browser" or polling your "browser"'s resolution. For those you'll need to investigate the site's behavior, which can take time. You can start with either the Network tab of your browser's developer tools (F12 in Firefox) or Zap Proxy for finer-grained control.
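For the simplest case, the User-Agent check and rate limiting, a sketch like the following may be enough. It assumes the site only looks at the User-Agent header; the User-Agent string, the helper names `make_session` and `polite_get`, and the one-second delay are illustrative choices, not values taken from the site in question.

```python
import time

import requests

# Illustrative browser-like value; copy a current one from your own browser.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def make_session(user_agent: str = BROWSER_USER_AGENT) -> requests.Session:
    """Create a session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def polite_get(session: requests.Session, url: str, delay_seconds: float = 1.0) -> requests.Response:
    """Pause before each request so the script stays under simple rate limits."""
    time.sleep(delay_seconds)
    response = session.get(url, timeout=10)
    response.raise_for_status()  # surface 403/429 responses instead of parsing them
    return response
```

If the row count is still 42 with a browser-like User-Agent, the header is probably not the cause, and the Network tab is the next place to look.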
