Python - scrape dynamic website to csv in neat table(s) format
I can scrape a static website to CSV with the following code:
import pandas as pd
url = 'http://www.etnet.com.hk/www/tc/futures/index.php?subtype=HSI&month=201801&tab=interval'
for i, df in enumerate(pd.read_html(url)):
    filename = 'C:/Users/Lawrence/Desktop/PyTest/output%02d.csv' % i
    df.to_csv(filename, encoding='UTF-8')
However, I found that it does not work for dynamic websites. How can I achieve this?
P.S.: I am using Python 3.6
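You can use selenium's webdriver, which loads a site the way a regular web browser does, including running its JavaScript. For your example, the simplest way to apply Selenium with minimal changes to your code is as follows: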
from io import StringIO
import pandas as pd
from selenium import webdriver
url = 'http://www.etnet.com.hk/www/tc/futures/index.php?subtype=HSI&month=201801&tab=interval'
# The following lines make the browser headless, i.e. it doesn't open a window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1200,600')
wd = webdriver.Chrome(options=options)  # Open a browser using the options set (options= replaces the deprecated chrome_options=)
try:
    wd.get(url)  # Open the desired url in the browser
    # Feed the rendered page source to pd.read_html; StringIO avoids the literal-HTML deprecation warning in newer pandas
    for i, df in enumerate(pd.read_html(StringIO(wd.page_source))):
        filename = 'C:/Users/Lawrence/Desktop/PyTest/output%02d.csv' % i
        df.to_csv(filename, encoding='UTF-8')
finally:
    wd.quit()  # Close the browser and end the driver process, even if an error occurred
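One caveat with dynamic pages: the tables may be filled in by JavaScript some time after the page is opened, so the page source can still be missing them when it is read. Selenium ships its own explicit waits (WebDriverWait) for this. As a rough, library-independent sketch of the same idea, the helper below polls the page source until a table tag shows up. Note that `wait_for_tables` and its `get_source`, `timeout`, and `poll` parameters are hypothetical names made up for illustration, not part of selenium or pandas, and the check is a plain substring test rather than real HTML parsing.

```python
import time


def wait_for_tables(get_source, timeout=10.0, poll=0.5):
    """Poll get_source() until the returned HTML contains a <table> tag.

    get_source is any zero-argument callable that returns the current page
    HTML as a string (hypothetical helper, not a selenium or pandas API).
    Returns the HTML once a table tag is present, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_source()
        # Case-insensitive substring check; an assumption that is good enough
        # to detect that at least one table has been rendered.
        if "<table" in html.lower():
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("no <table> found within %.1f s" % timeout)
        time.sleep(poll)
```

With the driver from the answer, the call would look like `html = wait_for_tables(lambda: wd.page_source)`, and `html` is then what gets passed to `pd.read_html`. If the site loads its data in several steps, waiting for a specific element with Selenium's own waits is likely the more robust choice.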