How to scroll a page and scrape the site?
Someone suggested in another thread that I post this as a separate question.
My question is about scraping a site that has to be scrolled down dynamically while the data is copied into my dataframe.
So far, with the code below, I only capture the first rows of the page, the ones that are currently visible, but I need the whole list down to the end of the page:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # driver set-up not shown in the original question
driver.maximize_window()
wait = WebDriverWait(driver, 30)
driver.get('https://www.livescore.com/en/')

# accept the cookie banner so it does not block the page
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()

games = driver.find_elements(By.CSS_SELECTOR, 'div[class="MatchRow_matchRowWrapper__1BtJ3"]')
data1 = []
for game in games:
    data1.append({
        'Home': game.find_element(By.XPATH, './/div[contains(@class,"MatchRow_home")]').text,
        'Away': game.find_element(By.XPATH, './/div[contains(@class,"MatchRow_away")]').text,
        'Time': game.find_element(By.XPATH, './/span[contains(@id,"match-row")]').text
    })

df = pd.DataFrame(data1)  # create dataframe
print(df)
Any suggestions?
Thanks.
My suggestion would be to get the data from the API instead. It is more efficient than using Selenium here:
import requests
import pandas as pd
import datetime

url = "https://prod-public-api.livescore.com/v1/api/react/date/soccer/20220309/0.00?MD=1"
jsonData = requests.get(url).json()

rows = []
for stage in jsonData['Stages']:
    events = stage['Events']
    for event in events:
        # 'Esd' is the scheduled start as YYYYMMDDHHMMSS
        gameDateTime = event['Esd']
        date_time_obj = datetime.datetime.strptime(str(gameDateTime), '%Y%m%d%H%M%S')
        gameTime = date_time_obj.strftime("%H:%M")

        # 'T1'/'T2' hold the home/away team data; 'Nm' is the team name
        homeTeam = event['T1'][0]['Nm']
        awayTeam = event['T2'][0]['Nm']

        row = {
            'Home': homeTeam,
            'Away': awayTeam,
            'Time': gameTime}
        rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
                Home                 Away   Time
0    Manchester City          Sporting CP  20:00
1        Real Madrid  Paris Saint-Germain  20:00
2           FC Porto                 Lyon  17:45
3         Real Betis  Eintracht Frankfurt  17:45
4          Dundee FC           St. Mirren  19:45
..               ...                  ...    ...
281       Modafen FK           Cankaya FK  11:00
282          UPDF FC         Arua Hill SC  11:00
283    Wakiso Giants         Mbarara City  13:00
284      Kokand 1912              Olympic  13:30
285     Nasaf Qarshi    Metallurg Bekobod  13:30

[286 rows x 3 columns]
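If you do want to stay with Selenium and scroll the page itself, the usual pattern is to keep scrolling to the bottom until the document height stops growing, and only collect the rows after that. Below is a minimal sketch of that idea; it assumes the page keeps appending match rows as you scroll, reuses the (possibly outdated) MatchRow class names from the question, and omits the cookie-banner handling for brevity:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assuming a Chrome driver
driver.get('https://www.livescore.com/en/')

# scroll until the page height stops changing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to render the next batch of rows
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# only now collect every match row that has been loaded
games = driver.find_elements(By.CSS_SELECTOR, 'div[class*="MatchRow_matchRowWrapper"]')
data1 = [{
    'Home': game.find_element(By.XPATH, './/div[contains(@class,"MatchRow_home")]').text,
    'Away': game.find_element(By.XPATH, './/div[contains(@class,"MatchRow_away")]').text,
    'Time': game.find_element(By.XPATH, './/span[contains(@id,"match-row")]').text
} for game in games]

print(pd.DataFrame(data1))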