Multiple errors when scraping Premier League tables
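I am learning web scraping.
I used this as a reference and successfully scraped the top YouTubers ranking.
I am using the same logic to scrape the PL ranking, but I am running into two issues:
- It only collects up to 5th place.
- The result only contains the first place.
- Then I get an AttributeError: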
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.premierleague.com/tables'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
standings = soup.find('div', attrs={'data-ui-tab': 'First Team'}).find_all('tr')[1:]
print(standings)

file = open("pl_standings.csv", 'w')
writer = csv.writer(file)
writer.writerow(['position', 'club_name', 'points'])

for standing in standings:
    position = standing.find('span', attrs={'class': 'value'}).text.strip()
    club_name = standing.find('span', {'class': 'long'}).text
    points = standing.find('td', {'class': 'points'}).text
    print(position, club_name, points)
    writer.writerow([position, club_name, points])

file.close()
The problem is that html.parser does not parse the page correctly (try the lxml parser instead). Also, take only every second <tr> to get the correct results:
import requests
from bs4 import BeautifulSoup

url = "https://www.premierleague.com/tables"
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")  # <-- use lxml
standings = soup.find("div", attrs={"data-ui-tab": "First Team"}).find_all(
    "tr"
)[1::2]  # <-- get every second <tr>

for standing in standings:
    position = standing.find("span", attrs={"class": "value"}).text.strip()
    club_name = standing.find("span", {"class": "long"}).text
    points = standing.find("td", {"class": "points"}).text
    print(position, club_name, points)
Prints:
1 Manchester City 77
2 Liverpool 76
3 Chelsea 62
4 Tottenham Hotspur 57
5 Arsenal 57
6 Manchester United 54
7 West Ham United 52
8 Wolverhampton Wanderers 49
9 Leicester City 41
10 Brighton and Hove Albion 40
11 Newcastle United 40
12 Brentford 39
13 Southampton 39
14 Crystal Palace 37
15 Aston Villa 36
16 Leeds United 33
17 Everton 29
18 Burnley 28
19 Watford 22
20 Norwich City 21
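
The [1::2] slice relies on team rows and extra rows strictly alternating. As a rough sketch of a more defensive variant, the loop can skip any row where find() returns None, which is the likely source of the AttributeError in the question (calling .text on None). The SAMPLE_HTML markup, the "expandable" class, and the helper names parse_standings and write_csv are all made up for illustration; the real page structure may differ. The sample uses html.parser only so it runs without extra dependencies; against the live page the lxml parser from the answer would still be the one to pass in.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each team row is followed by an extra
# row that contains none of the elements the loop looks for.
SAMPLE_HTML = """
<div data-ui-tab="First Team">
  <table>
    <tr><th>Position</th><th>Club</th><th>Pts</th></tr>
    <tr>
      <td><span class="value">1</span></td>
      <td><span class="long">Manchester City</span></td>
      <td class="points">77</td>
    </tr>
    <tr class="expandable"><td colspan="3">extra details</td></tr>
    <tr>
      <td><span class="value">2</span></td>
      <td><span class="long">Liverpool</span></td>
      <td class="points">76</td>
    </tr>
    <tr class="expandable"><td colspan="3">extra details</td></tr>
  </table>
</div>
"""


def parse_standings(html, parser="html.parser"):
    """Return (position, club_name, points) tuples, skipping non-team rows."""
    soup = BeautifulSoup(html, parser)
    container = soup.find("div", attrs={"data-ui-tab": "First Team"})
    rows = []
    for tr in container.find_all("tr")[1:]:  # [1:] skips the header row
        position = tr.find("span", attrs={"class": "value"})
        club_name = tr.find("span", attrs={"class": "long"})
        points = tr.find("td", attrs={"class": "points"})
        # find() returns None when nothing matches; skip such rows
        # instead of calling .text on None.
        if position is None or club_name is None or points is None:
            continue
        rows.append(
            (position.text.strip(), club_name.text.strip(), points.text.strip())
        )
    return rows


def write_csv(rows, file):
    """Write the header and the standings rows to an open text file."""
    writer = csv.writer(file)
    writer.writerow(["position", "club_name", "points"])
    writer.writerows(rows)


rows = parse_standings(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a real file in this sketch
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, something like `with open("pl_standings.csv", "w", newline="") as f: write_csv(rows, f)` should work; the with block closes the file even if parsing fails partway, and newline="" is what the csv module documentation recommends for files passed to csv.writer.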