How do you scrape a table from a website that hosts the table data outside of the HTML?
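I am trying to scrape table data from this URL: https://covid19criticalcare.com/pharmacies/
In my previous scrapes I used the following Python packages:
from bs4 import BeautifulSoup
import requests
import mysql.connector
import pandas as pd
from sqlalchemy import create_engine
However, the HTML of this URL does not contain the table data; instead, the page appears to pull the data from an external database.
Can someone point me in the right direction for scraping table data from this kind of HTML setup with a Python script?
I tried scraping blindly, using the same method as in my previous scrapes.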
from bs4 import BeautifulSoup
import requests
import mysql.connector
import pandas as pd
from sqlalchemy import create_engine

url = "https://covid19criticalcare.com/pharmacies/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
doc = BeautifulSoup(result.text, "html.parser")
# Collect the text of every first-column cell.
name = doc.find_all("td", class_="column-1")
td_names = []
for td in name:
    names = td.text
    td_names.append(names)
print(td_names)
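The content you are trying to scrape becomes available only after the JavaScript on the site has run. The simplest approach is either to mimic the request using the same REST API call, or to use a library that helps render the content, for example Selenium, Scrapy, etc.
For more details on how to scrape JS-rendered content, see this thread: Web-scraping JavaScript page with Python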
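One way to confirm that the rows really are missing from the static HTML, before reaching for a browser, is to count the table cells in the raw response. The sketch below uses only the standard library. The two HTML snippets are made-up stand-ins for a page before and after its JavaScript has run, and the class name column-1 is taken from the question's code; CellCounter and count_cells are names chosen here for illustration.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts <td> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        # The class attribute may be absent or valueless, hence the fallback.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_cells(html, css_class="column-1"):
    """Return how many matching <td> cells the HTML string contains."""
    parser = CellCounter(css_class)
    parser.feed(html)
    return parser.count


# Hypothetical snippets: an empty table shell versus a populated one.
static_html = '<table><thead><tr><th>Pharmacy Name</th></tr></thead><tbody></tbody></table>'
rendered_html = (
    '<table><tbody>'
    '<tr><td class="column-1">A</td></tr>'
    '<tr><td class="column-1">B</td></tr>'
    '</tbody></table>'
)

print(count_cells(static_html))    # 0
print(count_cells(rendered_html))  # 2
```

If the count for requests.get(url).text is zero while the browser shows rows, the data is being loaded separately.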
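For basic troubleshooting on how to view requests and responses, open the Chrome developer tools: right-click on the HTML page > click "Inspect" > click the "Network" tab > click "Fetch/XHR" > press "Command + Shift + R" to reload the page once.
If you are not sure which request contains the data you are looking for, press Command + F to search, enter a keyword, and Chrome will list the requests that match your search.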
This image shows that the data is sent using AJAX and it also depicts the result of the steps above
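Once the Network tab has revealed the request that carries the data, that request can usually be replayed directly from Python. The following is a minimal sketch under stated assumptions: AJAX_URL is a placeholder (the real endpoint has to be read from DevTools), and the response is assumed to be JSON shaped like {"data": [[...], ...]}, which is a common layout but is not confirmed for this site. It uses the standard library so it runs as-is; requests.get(...).json() would do the same job.

```python
import json
from urllib.request import Request, urlopen

# Placeholder: copy the real request URL from the Fetch/XHR list in DevTools.
AJAX_URL = "https://example.com/hypothetical/table-endpoint"


def fetch_payload(url, user_agent="Mozilla/5.0"):
    """Replay the XHR request and decode its JSON body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def payload_to_records(payload, columns):
    """Turn an assumed {"data": [[cell, ...], ...]} payload into a list of dicts."""
    return [dict(zip(columns, row)) for row in payload.get("data", [])]


# Made-up payload standing in for a real response from fetch_payload(AJAX_URL).
sample = {"data": [["Pharmacy A", "a@example.com"], ["Pharmacy B", "b@example.com"]]}
records = payload_to_records(sample, ["Pharmacy Name", "Email"])
print(records[0]["Pharmacy Name"])  # Pharmacy A
```

The resulting list of dicts can be passed straight to pandas.DataFrame(records).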
Edit 1
If you want to use Selenium to avoid the hassle of mimicking the web request, your code should look like this.
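from io import StringIO
import time

import pandas
from selenium import webdriver

if __name__ == "__main__":
    driver = webdriver.Chrome()
    driver.get("https://covid19criticalcare.com/pharmacies/")
    # Give the page's JavaScript time to load the table data.
    time.sleep(7)
    # Parse the rendered HTML; StringIO avoids the pandas warning about literal HTML strings.
    df = pandas.read_html(StringIO(driver.page_source))[0]
    driver.quit()
    print(df)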
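A fixed time.sleep(7) either wastes time on a fast connection or is too short on a slow one. A common alternative is to poll until the table has appeared, which is the idea behind Selenium's WebDriverWait. The helper below is a generic sketch and not part of Selenium; wait_until and its parameters are names chosen here for illustration. With a real driver, the condition could be something like lambda: "<td" in driver.page_source.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a page whose table shows up on the third check.
checks = {"n": 0}


def table_is_present():
    checks["n"] += 1
    return checks["n"] >= 3


print(wait_until(table_is_present, timeout=2.0, interval=0.01))  # True
print(wait_until(lambda: False, timeout=0.05, interval=0.01))    # False
```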
As an alternative to @Naphat Theerawat's answer, and while I noticed that you started with a selenium-based solution: you can reach your goal much more easily by combining it with pandas.
Load the website and use pd.read_html() to extract the table from driver.page_source. To avoid iterating over each page, simply select the option that displays all entries.
Example
from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
import pandas as pd

url = 'https://covid19criticalcare.com/pharmacies/'
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
wait = WebDriverWait(driver, 5)
# Switch the page-length dropdown to value -1 (all entries) so every row is on one page.
select = Select(wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[name="DataTables_Table_0_length"]'))))
select.select_by_value('-1')
# Wait until the "next" pagination button is disabled, i.e. all rows are displayed.
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.paginate_button.next.disabled')))
df = pd.read_html(StringIO(driver.page_source), displayed_only=False)[1]
driver.quit()
print(df)
Output
Pharmacy Name | Email | Phone | Website | Requires prescription? | Pharmacy Address | Based in the United States? | Overnight shipping to the United States? | Overnight International shipping? | Ships to the following States/Provinces |
---|---|---|---|---|---|---|---|---|---|
0 Covid Pharmacy | sales@0covidpharmacy.com | (785) 672 9222 | 0covidpharmacy.com | NO | 245 Krishna Market Channi RoadNagpur, Maharashtra 440001India | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingGuamPuerto RicoVirgin IslandsArmed Forces AmericasArmed Forces EuropeArmed Forces Pacific |
1 Ivermectin Service | ask24@1ivermectin.com | (888) 290 0964 (US), +91 22509 72606 (IN) | 1ivermectin.com | NO | 1/16, First Floor, Tardeo Air Conditioned Market Building, TardeoMumbai, Tardeo 400034India | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingPuerto RicoVirgin Islands |
1 Life Pharmacy | sales@1lifepharmacy.net | (888) 560-0430 (US); +91 (807 ) 127-9990 (India) | 1lifepharmacy.net | NO | 302, Pride Plaza, Rajkot, 360002Rajkot, Gujarat 360002; 84118India | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyoming |
1-2-3 RX Global Pharmacy | doctor@123rx.net | (516) 758-2630 | 123rx.net | NO | 2967 Dundas St. W.Toronto, Ontario M6P 1Z2Canada | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyoming |
12 Angel Pharmacy Store | 12angel.store@gmail.com | (908) 866-4260 | 12angel.store | NO | 1050 Bharat Diamond BourseBandra Kurla ComplexMumbai, Maharashtra 400051India | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingGuamPuerto RicoVirgin IslandsArmed Forces AmericasArmed Forces EuropeArmed Forces Pacific |
24 x 7 Pharma | contact@24x7pharma.com | (851) 127-5721 | 24x7pharma.com | NO | Mahek IconSumul Diary Road, KatargamSurat, Gujarat 395003India | NO | YES | YES | AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingGuamPuerto RicoVirgin IslandsArmed Forces AmericasArmed Forces EuropeArmed Forces Pacific |
...
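In the output above, the last column arrives as one run-together string, presumably because the separators inside each cell are dropped when the text is extracted. If the individual names are needed, one option is to split the string against a known list of names. This is a sketch with a deliberately short, illustrative list; split_names is a helper name invented here, and a real run would need the full list of states and territories.

```python
import re


def split_names(joined, known_names):
    """Split a run-together string into the known names it contains, in order."""
    # Longer names go first so that, if one name is a prefix of another,
    # the longer one is preferred. Scanning is left to right, so "Arkansas"
    # is matched as a whole before "Kansas" can match inside it.
    ordered = sorted(known_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.findall(joined)


# Illustrative subset only, not the full list of states and territories.
KNOWN = ["Arkansas", "Kansas", "Virginia", "West Virginia", "Virgin Islands", "Washington"]

cell = "ArkansasKansasVirginiaWashingtonWest VirginiaVirgin Islands"
print(split_names(cell, KNOWN))
# ['Arkansas', 'Kansas', 'Virginia', 'Washington', 'West Virginia', 'Virgin Islands']
```

Text that matches none of the known names is silently skipped, so comparing the joined length of the result with the original string is a quick way to spot gaps in the list.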