How can I improve the performance (runtime) of my web scraping script (Python and Selenium)?

Question

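I wrote a script to scrape tables from a website: NFL rosters for all 32 teams over 4 years. The website, however, only shows one team and one year at a time. So my script opens a team's page, selects a year, scrapes the data, moves on to the next year, and so on until all four years of data are collected. It then repeats the process for each of the remaining teams.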

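I'm new to web scraping, so I'm not sure whether what I'm doing is the best approach computationally. Currently, scraping one year of data for one team takes about 40-50 seconds, which comes to roughly 4 minutes per team. Scraping all of the years for all of the teams takes just over two hours.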
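As a rough sanity check of that estimate, here is a minimal sketch that uses only the figures quoted above. The constant names are made up for illustration, and the per-team figure is the observed "roughly 4 minutes", which presumably also includes browser start-up and page loads.

```python
# Figures quoted in the question; names are illustrative only.
SECONDS_PER_TEAM_YEAR = (40, 50)  # observed range for one team and one year
YEARS = 4
TEAMS = 32
MINUTES_PER_TEAM = 4              # observed total per team

# Time spent purely on the scrape-and-switch-year steps, in hours.
scrape_hours = [s * YEARS * TEAMS / 3600 for s in SECONDS_PER_TEAM_YEAR]

# Total runtime implied by the observed per-team figure, in hours.
total_hours = MINUTES_PER_TEAM * TEAMS / 60

print("Scrape steps only: {:.1f} to {:.1f} hours".format(*scrape_hours))
print("Whole run: about {:.1f} hours".format(total_hours))
```

Almost all of the runtime is per-page waiting, so the total grows linearly with the number of team and year combinations.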

Is there a way to scrape the data and reduce the running time?

The code is below:

import time

import pandas as pd
from selenium import webdriver

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']

# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]

# Changes the year parameter on a given page
def next_year(driver, year_idx):
    
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)

# Create scraping function
def sel_scrape(driver, team, year):
    
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    
    # Scrape rows and header
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in main_table.find_elements_by_xpath(".//tr")][1:]
    header = [[th.text.strip() for th in row.find_elements_by_xpath(".//th")] for row in main_table.find_elements_by_xpath(".//tr")][0]
    
    # compile in dataframe
    df=pd.DataFrame(rows,columns=header)
    
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ',1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-',' ').title()
    df['Year'] = year
    
    return df

url='https://www.lineups.com/nfl/roster/'

df = pd.DataFrame()
years = [2020,2019,2018,2017]

start_time = time.time()

for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url+team)
    
    # For each of the four years
    for idx in range(0,4):
        print("Starting {} {}".format(team, years[idx]))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, years[idx])])
        
        # Change to next year
        next_year(driver, idx)
    driver.close()

print("--- %s seconds ---" % (time.time() - start_time))
    
df.head()

Answer 1

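You can improve this by not using Selenium. Selenium works, but it is inherently slower because it drives a full browser. The best way to get the data is through the API that the page itself loads it from: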

import pandas as pd
import requests
import time

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']


rows = []
start_time = time.time()
for team in team_ls:
    for season in range(2017,2021):
        print ('Season: %s\tTeam: %s' %(season, team))
        teamStr = '-'.join(team.split()).lower()
        url= 'https://api.lineups.com/nfl/fetch/roster/{season}/{teamStr}'.format(season=season, teamStr=teamStr)

        jsonData = requests.get(url).json()
        roster = jsonData['data']
        for item in roster:
            item.update( {'Year':season, 'Team':team})
        rows += roster
        
df = pd.DataFrame(rows)

print("--- %s seconds ---" % (time.time() - start_time))

print (df.head())
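If the sequential loop above is still too slow, the requests can also be issued concurrently. The following is a minimal sketch of that idea and not part of the original answer: the function names `roster_url`, `fetch_roster` and `fetch_all_rosters` are hypothetical, the endpoint and the `'data'` key are taken from the code above, and whether the API tolerates parallel requests is an assumption that should be checked before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Endpoint taken from the answer's code; not independently verified.
API_ROOT = 'https://api.lineups.com/nfl/fetch/roster'


def roster_url(team, season):
    """Build the roster URL for one team and season (same slug rule as above)."""
    team_slug = '-'.join(team.split()).lower()
    return '{}/{}/{}'.format(API_ROOT, season, team_slug)


def fetch_roster(fetch_json, team, season):
    """Fetch one roster and tag every player record with its team and season."""
    payload = fetch_json(roster_url(team, season))
    # Build new dicts so the fetched payload itself is not mutated.
    return [dict(item, Year=season, Team=team) for item in payload['data']]


def fetch_all_rosters(fetch_json, teams, seasons, max_workers=8):
    """Fetch every (team, season) pair on a thread pool and flatten the rows."""
    jobs = list(product(teams, seasons))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `jobs`, so output is deterministic.
        results = pool.map(lambda job: fetch_roster(fetch_json, *job), jobs)
        return [row for roster in results for row in roster]
```

Here `fetch_json` is any callable that takes a URL and returns the decoded JSON, for example `lambda url: requests.get(url, timeout=10).json()`. The resulting list can be passed to `pd.DataFrame(...)` exactly as in the answer. Keeping `max_workers` small is the more polite choice towards the server.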
