Struggling to scrape a table using selenium
I am trying to scrape the table that appears at the link.
To do the scraping, I decided to use selenium.
My first attempt was:
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
html_source = driver.page_source
driver.quit()
soup = BeautifulSoup(html_source, "html5lib")
table = soup.find('table', {'class': 'heavy-table ncpulse-fav-table ncpulse-sortable compressed-table'})
df = pd.read_html(str(table), flavor='html5lib', header=0, thousands='.', decimal=',')
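However, it fails with the error
'no tables found'
Then I tried using the expected_conditions class because, when I looked into it, I read that possibly "the page source was pulled even before the child elements were fully rendered".
So I tried something like this: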
driver.get(route)
element_present = expected_conditions.presence_of_element_located(
(By.CLASS_NAME, 'heavy-table ncpulse-fav-table ncpulse-sortable compressed-table'))
WebDriverWait(driver, 20).until(element_present)
html_source = driver.page_source
driver.quit()
But this time it outputs:
selenium.common.exceptions.TimeoutException: Message
So my questions are: how can I get the desired output? What am I doing wrong with the expected_conditions
class? What issue or front-end technology makes scraping this table so hard?
Compound class names are not handled by the CLASS_NAME locator, but you can match them with a CSS selector or an XPath. CSS_SELECTOR is generally
more efficient than XPATH.
element_present = expected_conditions.presence_of_element_located(
(By.CSS_SELECTOR, "table[class='heavy-table ncpulse-fav-table ncpulse-sortable compressed-table']"))
# or by XPath
element_present = expected_conditions.presence_of_element_located(
(By.XPATH, "//table[@class='heavy-table ncpulse-fav-table ncpulse-sortable compressed-table']"))
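Both selectors above compare the whole class attribute as one string, so they stop matching if the site reorders the classes or adds one. A chained class selector avoids that. As a minimal sketch, the hypothetical helper `class_attr_to_css` below (not part of selenium) turns a space-separated class attribute into such a selector, which can then be passed to By.CSS_SELECTOR:

```python
def class_attr_to_css(tag, class_attr):
    """Build a chained class selector such as 'table.a.b' from 'a b'."""
    # split() with no argument also collapses repeated whitespace
    class_names = class_attr.split()
    return tag + "".join("." + name for name in class_names)


selector = class_attr_to_css(
    "table",
    "heavy-table ncpulse-fav-table ncpulse-sortable compressed-table",
)
print(selector)
```

The resulting string matches any table that carries all four classes, in any order.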
When an element has multiple class names, pass a single one to By.CLASS_NAME, for example the rightmost one.
element_present = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'compressed-table')))
print(element_present.text)
Output
AKTIE +/- +/-% SENESTE ÅTD% VOLUMEN OMSÆTNING MARKEDSVÆRDI
Abn Amro Bank N.V. -0,32 -4,08% 7,48 -53,90% 7,9 mio 59,0 mio -
Adyen 81,00 5,62% 1523,00 108% 954 082 1,5 mia -
Aegon -0,08 -3,49% 2,16 -45,47% 17,4 mio 37,5 mio -
Ahold Del 0,25 0,98% 25,65 19,74% 8,0 mio 204,1 mio -
Akzo Nobel 0,14 0,16% 85,86 -3,16% 1,1 mio 90,6 mio -
Arcelormittal Sa 0,08 0,66% 11,53 -26,26% 11,9 mio 137,3 mio -
To have Chrome translate the page into English:
from selenium.webdriver.chrome.options import Options
options = Options()
prefs = {
"translate_whitelists": {"da":"en"},
"translate":{"enabled":"true"}
}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
SHARES +/- + / -% MOST RECENT ÅTD% VOLUME TURNOVER MARKET VALUE
Abn Amro Bank NV -0.32 -4.08% 7.48 -53.90% 7.9 million 59.0 million -
Adyen 81.00 5.62% 1523.00 108% 954 082 1.5 billion -
Aegon -0.08 -3.49% 2.16 -45.47% 17.4 million 37.5 million -
Ahold Del 0.25 0.98% 25.65 19.74% 8.0 million 204.1 million -
Akzo Nobel 0.14 0.16% 85.86 -3.16% 1.1 million 90.6 million -
Imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
To get the table information, use WebDriverWait() with visibility_of_element_located() and the following CSS selector:
driver.get("https://borsen.dk/investor/kurser/eur-aktier/?filter=aex25")
WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".compressed-table")))
html_source =driver.page_source
driver.quit()
soup=BeautifulSoup(html_source, "html5lib")
table=soup.select_one(".compressed-table")
df = pd.read_html(str(table), flavor='html5lib', header=0, thousands='.', decimal=',')
print(df[0])
Output:
Unnamed: 0 Aktie +/- ... Markedsværdi Sektor Tid
0 NaN Abn Amro Bank N.V. -0.32 ... - NaN 16:39
1 NaN Adyen 81.00 ... - NaN 16:39
2 NaN Aegon -0.08 ... - NaN 16:35
3 NaN Ahold Del 0.25 ... - NaN 16:35
4 NaN Akzo Nobel 0.14 ... - NaN 16:36
5 NaN Arcelormittal Sa 0.08 ... - NaN 16:39
6 NaN Asm International 0.35 ... - NaN 16:37
7 NaN Asml Holding 1.50 ... - NaN 16:35
8 NaN Asr Nederland -0.22 ... - NaN 16:35
9 NaN Dsm Kon 2.25 ... - NaN 16:39
10 NaN Galapagos -1.45 ... - NaN 16:35
11 NaN Heineken 0.74 ... - NaN 16:35
12 NaN Imcd 1.85 ... - NaN 16:35
13 NaN Ing Groep N.V. -0.19 ... - NaN 16:38
14 NaN Just Eat Takeaway 0.08 ... - NaN 16:39
15 NaN Kpn Kon -0.03 ... - NaN 16:35
16 NaN Nn Group -0.35 ... - NaN 16:35
17 NaN Philips Kon -0.08 ... - NaN 16:35
18 NaN Prosus -1.52 ... - NaN 16:39
19 NaN Randstad Nv -0.98 ... - NaN 16:35
20 NaN Relx 0.00 ... - NaN 16:36
21 NaN Royal Dutch Shella -0.24 ... - NaN 16:37
22 NaN Unibail-Rodamco-We -3.79 ... - NaN 16:37
23 NaN Unilever -1.04 ... - NaN 16:38
24 NaN Wolters Kluwer -0.04 ... - NaN 16:35
[25 rows x 13 columns]
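The thousands='.' and decimal=',' arguments cover plain Danish-formatted numbers, but cells such as "7,9 mio", "1,5 mia", "954 082" or "-4,08%" would most likely remain text in the DataFrame. A minimal sketch of a hypothetical helper, `parse_danish_number`, that converts them afterwards; it assumes "mio" means million and "mia" means billion, as the translated output above suggests, and treats "-" as missing:

```python
from decimal import Decimal

# Assumed meaning of the abbreviations, based on the translated table above
SUFFIXES = {"mio": Decimal(10**6), "mia": Decimal(10**9)}


def parse_danish_number(text):
    """Convert a cell like '7,9 mio' or '-4,08%' to a Decimal; '-' gives None."""
    text = text.strip()
    if text in ("", "-"):
        return None
    multiplier = Decimal(1)
    parts = text.split()
    if parts[-1] in SUFFIXES:
        multiplier = SUFFIXES[parts[-1]]
        parts = parts[:-1]
    # Re-join space-separated digit groups ('954 082') and drop a percent sign
    digits = "".join(parts).rstrip("%")
    # '.' is the thousands separator, ',' the decimal mark
    digits = digits.replace(".", "").replace(",", ".")
    return Decimal(digits) * multiplier


for cell in ["7,9 mio", "1,5 mia", "954 082", "-4,08%", "-"]:
    print(cell, "->", parse_danish_number(cell))
```

It could be applied per column, for example with df[0]["Volumen"].map(parse_danish_number), assuming that column holds strings.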
You can also use find():
driver.get("https://borsen.dk/investor/kurser/eur-aktier/?filter=aex25")
WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".compressed-table")))
html_source =driver.page_source
driver.quit()
soup=BeautifulSoup(html_source, "html5lib")
df=pd.read_html(str(soup.find('table',class_='heavy-table ncpulse-fav-table ncpulse-sortable compressed-table')))[0]
print(df)
To extract the contents of the table: because the <table>
is an Angular-based element, with Selenium and Python you have to induce WebDriverWait for
visibility_of_element_located() instead of presence_of_element_located(),
and you can use either of the following:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.heavy-table.ncpulse-fav-table.ncpulse-sortable.compressed-table"))).text)
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='heavy-table ncpulse-fav-table ncpulse-sortable compressed-table']"))).text)
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console output:
AKTIE +/- +/-% SENESTE ÅTD% BUD UDBUD VOLUMEN OMSÆTNING MARKEDSVÆRDI TID
Abn Amro Bank N.V. -0,32 -4,08% 7,48 -53,90% - - 7,9 mio 59,0 mio - 21:09
Adyen 81,00 5,62% 1523,00 108% - - 954 082 1,5 mia - 21:09
Aegon -0,08 -3,49% 2,16 -45,47% - - 17,4 mio 37,5 mio - 21:05
Ahold Del 0,25 0,98% 25,65 19,74% - - 8,0 mio 204,1 mio - 21:05
Akzo Nobel 0,14 0,16% 85,86 -3,16% - - 1,1 mio 90,6 mio - 21:06
Arcelormittal Sa 0,08 0,66% 11,53 -26,26% - - 11,9 mio 137,3 mio - 21:09
Asm International 0,35 0,29% 119,10 21,23% - - 403 117 48,0 mio - 21:07
Asml Holding 1,50 0,49% 308,45 17,56% - - 2,3 mio 712,7 mio - 21:05
Asr Nederland -0,22 -0,73% 29,76 -4,97% - - 740 781 22,0 mio - 21:05
Dsm Kon 2,25 1,66% 138,20 21,52% - - 680 867 94,1 mio - 21:09
Galapagos -1,45 -1,22% 117,70 -36,89% - - 475 793 56,0 mio - 21:05
Heineken 0,74 0,94% 79,10 -15,50% - - 1,1 mio 88,0 mio - 21:05
Imcd 1,85 1,80% 104,85 36,23% - - 922 391 96,7 mio - 21:05
Ing Groep N.V. -0,19 -2,80% 6,60 -38,24% - - 43,4 mio 286,2 mio - 21:08
Just Eat Takeaway 0,08 0,09% 91,70 11,56% - - 1,1 mio 100,2 mio - 21:09
Kpn Kon -0,03 -1,54% 2,11 -15,04% - - 21,4 mio 45,1 mio - 21:05
Nn Group -0,35 -1,06% 32,80 3,82% - - 2,4 mio 79,6 mio - 21:05
Philips Kon -0,08 -0,20% 39,42 -9,42% - - 5,2 mio 205,9 mio - 21:05
Prosus -1,52 -1,89% 78,74 18,35% - - 15,0 mio 1,2 mia - 21:09
Randstad Nv -0,98 -2,09% 45,93 -15,63% - - 698 496 32,1 mio - 21:05
Relx 0,00 0,03% 19,64 -10,24% - - 1,9 mio 36,6 mio - 21:06
Royal Dutch Shella -0,24 -2,07% 11,45 -54,58% - - 21,1 mio 241,2 mio - 21:07
Unibail-Rodamco-We -3,79 -10,53% 32,20 -75,02% - - 6,6 mio 213,2 mio - 21:07
Unilever -1,04 -2,00% 50,98 2,00% - - 8,2 mio 417,5 mio - 21:08
Wolters Kluwer -0,04 -0,05% 72,88 14,19% - - 803 644 58,6 mio - 21:05
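For context on the TimeoutException in the question: WebDriverWait(driver, 20).until(condition) calls the condition repeatedly until it returns something truthy, and raises once the 20 seconds are used up. A locator that can never match, such as a compound name passed to By.CLASS_NAME, therefore always ends in a timeout. The following is a simplified, selenium-free sketch of that polling idea; `wait_until` and `table_rendered` are hypothetical names, and the real class additionally ignores certain exceptions while polling:

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page: the table only "appears" on the third check
attempts = {"count": 0}


def table_rendered():
    attempts["count"] += 1
    return "table element" if attempts["count"] >= 3 else None


found = wait_until(table_rendered, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```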