Scraping a table from a web page with BeautifulSoup

Question

I need to build a DataFrame in Python with information on the 500 largest companies in Latin America:

https://www.americaeconomia.com/negocios-industrias/estas-son-las-500-mayores-empresas-de-america-latina-2021

I tried scraping the page, but when I print(tabla) it shows [] or None...

from bs4 import BeautifulSoup
import requests

url = 'https://www.americaeconomia.com/negocios-industrias/estas-son-las-500-mayores-empresas-de-america-latina-2021'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

tabla = soup.find('table', {"id":"awesomeTable"})
print(tabla)

Answer 1

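What happens?

Always look at your soup first; that is where the truth lies. Its content can differ, slightly or considerably, from what you see in the browser's dev tools.

You won't find the table in the soup because it lives inside an iframe, which is a separate document that requests.get() on the article URL does not fetch.

How to fix?

Use the URL of the iframe source for your request: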

https://rk.americaeconomia.com/display/embed/500-latam/2021
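
If you would rather find that URL from code than from the dev tools, something along these lines might work. This is only a sketch run against a made-up HTML snippet: the `outer_html` string and the `find_iframe_sources` helper are hypothetical, the real page's markup may differ, and it assumes the iframe tag is present in the static HTML rather than injected by JavaScript.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the outer article page; the real markup may differ.
outer_html = """
<html><body>
  <h1>Ranking</h1>
  <iframe src="https://rk.americaeconomia.com/display/embed/500-latam/2021"></iframe>
</body></html>
"""

def find_iframe_sources(html, base_url):
    """Return the absolute URLs of all iframes that carry a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin leaves absolute URLs untouched and resolves relative ones against base_url
    return [urljoin(base_url, frame["src"]) for frame in soup.find_all("iframe", src=True)]

soup = BeautifulSoup(outer_html, "html.parser")
# The table is not part of the outer document, so find() returns None
table = soup.find("table", {"id": "awesomeTable"})
iframe_urls = find_iframe_sources(outer_html, "https://www.americaeconomia.com/")

print(table)
print(iframe_urls)
```

Each URL in `iframe_urls` can then be passed to requests.get() in place of the article URL.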

Example

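import pandas as pd
import requests
from bs4 import BeautifulSoup

# Send a browser-like user agent with the request
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
# Request the iframe source directly instead of the outer article page
r = requests.get('https://rk.americaeconomia.com/display/embed/500-latam/2021', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Collect the cell texts of every data row, one list per row
data = []
for row in soup.select('#awesomeTable tbody tr.dataRow'):
    data.append(list(row.stripped_strings))

# The first row of the table supplies the column names
df = pd.DataFrame(data, columns=list(soup.select_one('#awesomeTable tr').stripped_strings))
print(df)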

Output

RK 2021	EMPRESA	PAÍS
1	PETROBRAS	BRA
2	JBS	BRA
3	AMÉRICA MÓVIL	MX
4	PEMEX	MX
5	VALE	BRA
...	...	...
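
To show what `stripped_strings` does in the example above, here is a small sketch on a hypothetical miniature of the table. The `sample_html` string and its third row are invented for illustration and are not taken from the real page. One thing to watch for: `stripped_strings` skips cells with no text, so a row with an empty cell comes out shorter than the header, which can make the DataFrame construction fail or misalign. Reading cell by cell with `get_text(strip=True)` is one possible way around that.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the embedded table; the real one has more rows and columns.
sample_html = """
<table id="awesomeTable">
  <thead><tr><th>RK 2021</th><th>EMPRESA</th><th>PAÍS</th></tr></thead>
  <tbody>
    <tr class="dataRow"><td>1</td><td> PETROBRAS </td><td>BRA</td></tr>
    <tr class="dataRow"><td>2</td><td>JBS</td><td>BRA</td></tr>
    <tr class="dataRow"><td>3</td><td>EXAMPLE CO</td><td></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
data_rows = soup.select("#awesomeTable tbody tr.dataRow")

# Column names come from the first <tr> of the table
header = list(soup.select_one("#awesomeTable tr").stripped_strings)

# Approach from the answer: all non-empty text fragments of the row, whitespace stripped
rows = [list(tr.stripped_strings) for tr in data_rows]

# Alternative: one entry per <td>, so empty cells are kept as empty strings
rows_by_cell = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in data_rows]

print(header)
print(rows)
print(rows_by_cell)
```

Either list of rows can be passed to pd.DataFrame(..., columns=header), provided every row has as many entries as the header.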
