Can't web scrape HTML table with Beautiful Soup

Question

I'm trying to scrape the IPO table data from here: https://www.iposcoop.com/last-12-months/

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.iposcoop.com/last-12-months/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find("table",id='DataTables_Table_0')
table1_data = table1.tbody.find_all("tr")
table1

However, table1 is NoneType. Why is that? Is there a solution? I've looked at related questions, and an iframe doesn't seem to be the answer.

Answer 1

You can use pandas to get the table data:

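from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.iposcoop.com/last-12-months'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Select the table by the classes that are present in the raw HTML
table = soup.select_one('.standard-table.ipolist')
# Wrap the markup in StringIO: newer pandas versions deprecate passing literal HTML
table_data = pd.read_html(StringIO(str(table)))[0]
print(table_data)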

Output:

                 Company  Symbol  ...   Return SCOOP Rating
0                                         Akanda Corp.    AKAN  ...   85.00%          S/O     
1    The Marygold Companies, Inc. (aka Concierge Te...    MGLD  ...    9.50%          S/O     
2                            Blue Water Vaccines, Inc.     BWV  ...  343.33%          S/O     
3            Meihua International Medical Technologies    MHUA  ...  -33.00%          S/O     
4                                        Vivakor, Inc.    VIVK  ...  -49.40%          S/O     
..                                                 ...     ...  ...      ...          ...     
355                Khosla Ventures Acquisition Co. III    KVSC  ...   -2.80%          S/O     
356           Dragoneer Growth Opportunities Corp. III    DGNU  ...   -2.40%          S/O     
357                                        Movano Inc.    MOVE  ...  -43.60%          S/O     
358         Supernova Partners Acquisition Company III  STRE.U  ...    0.10%          S/O     
359                           Universe Pharmaceuticals     UPC  ...  -74.00%          S/O     

[360 rows x 10 columns]

Answer 2

While F.Hoque's answer gives you a solution, it doesn't seem to explain why your code throws an error.

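You are trying to find the table with the ID DataTables_Table_0. If you open the page in a browser, you can see that an element with that id exists. But if you open the same page with JavaScript disabled, you will see that the id is no longer present on the table. The id is assigned by some JavaScript module, which is why soup.find returns None for it.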
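
The effect can be reproduced offline. The sketch below parses a hand-written HTML snippet, which is only an assumed stand-in for what the server sends before any script runs (the real page's markup is much larger). It shows that looking up the JavaScript-assigned id yields None, while the class selector matches.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML: the table has its classes,
# but no id, because the id is only added later by JavaScript.
RAW_HTML = """
<table class="standard-table ipolist">
  <thead><tr><th>Company</th><th>Symbol</th></tr></thead>
  <tbody><tr><td>Akanda Corp.</td><td>AKAN</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The id does not exist in the static markup, so find() returns None.
by_id = soup.find("table", id="DataTables_Table_0")
print(by_id)  # None

# The classes are in the static markup, so the CSS selector matches.
by_class = soup.select_one(".standard-table.ipolist")
rows = by_class.tbody.find_all("tr")
print(len(rows))  # 1

# Checking for None first avoids an AttributeError on by_id.tbody.
if by_id is None:
    print("table id not found in the raw HTML")
```

Calling table1.tbody on a None result is what raises the error in the question, so a None check like the last one is a cheap way to spot a selector that only exists after JavaScript has run.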

BeautifulSoup only sees the base HTML and does not run any JavaScript. So you have two options:

Use a selector that exists in the base HTML (in this case .standard-table.ipolist)
Use selenium to run the JavaScript and get the HTML as you see it in the browser
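
The second option could look roughly like this. It is a minimal sketch, not a tested scraper: it assumes Selenium 4 with a local Chrome install, and the names fetch_rendered_html and table_rows, the headless flag, and the 15-second timeout are illustrative choices that do not come from the answers above. The table id is the one from the question.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, table_id, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript ran."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script has assigned the id to the table.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, table_id))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def table_rows(html, table_id):
    """Return the <tr> elements of the table with the given id, or []."""
    table = BeautifulSoup(html, "html.parser").find("table", id=table_id)
    if table is None:
        return []
    return table.find_all("tr")


# Usage (needs network access and Chrome):
# html = fetch_rendered_html("https://www.iposcoop.com/last-12-months/",
#                            "DataTables_Table_0")
# print(len(table_rows(html, "DataTables_Table_0")))
```

Starting a browser is much slower than a plain requests call, so the first option is usually preferable when the data is already in the base HTML, as it is here.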


python

beautifulsoup

web-scraping