Python web scraping for a specific class returns None
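I'm very new to programming, and I've been trying to teach myself some web scraping principles using baseball data. In the example below, I'm trying to scrape CBS Sports for the team matchups, game times, and probable pitchers for baseball games. I can get the team matchups and game times to show up easily, but the probable pitchers return "None".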
from bs4 import BeautifulSoup as Soup
import requests
import pandas as pd
from pandas import DataFrame

# Fetch the schedule page. The parser name ("lxml") belongs to BeautifulSoup,
# not to requests.get, where it would be sent as a query parameter.
matchups_response = requests.get('https://www.cbssports.com/mlb/schedule/')
matchups_soup = Soup(matchups_response.text, 'lxml')

# Take the first table on the page and collect its rows
matchups_tables = matchups_soup.find_all('table')
# len(matchups_tables)
matchups_tables = matchups_tables[0]
rows = matchups_tables.find_all('tr')

# Inspect the first data row (rows[0] is presumably the header row)
first_data_row = rows[1]
first_data_row.find_all(True, {'class': ['CellPlayerName--short']})
[str(x.string) for x in first_data_row.find_all(True, {'class': ['CellPlayerName--short']})]

# Same extraction applied to every row; .string is what yields None here
def parse_row(row):
    return [str(x.string) for x in row.find_all(True, {'class': ['CellPlayerName--short']})]

list_of_parsed_rows = [parse_row(row) for row in rows[1:31]]
dfPitchers = DataFrame(list_of_parsed_rows)
print(dfPitchers)
This is what it returns:
0 1
0 None None
1 None None
2 None None
3 None None
4 None None
5 None None
6 None None
7 None None
8 None None
9 None None
10 None None
11 None None
When I use similar code and reference {'class':['TeamName']} or {'class':['CellGame']}, I get correct output:
0 1
0 Washington Houston
1 Boston Pittsburgh
2 Minnesota Tampa Bay
3 Philadelphia N.Y. Yankees
4 Milwaukee Cleveland
5 Cincinnati Texas
6 Arizona Chi. Cubs
7 San Diego San Francisco
8 Kansas City Seattle
9 L.A. Angels Colorado
10 N.Y. Mets Miami
11 Oakland L.A. Dodgers
0 WAS 0, HOU 0 - 1st
1 BOS 0, PIT 0 - 1st
2 1:05 pm
3 1:05 pm
4 4:05 pm
5 4:05 pm
6 4:05 pm
7 4:05 pm
8 4:10 pm
9 4:10 pm
10 6:40 pm
11 9:05 pm
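But for {'class':['CellPlayerName--short']} it always returns None. Any help would be greatly appreciated. Apologies in advance, I'm a newbie, but I've searched and searched on this and can't find a solution that works for me. Thanks!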
From the docs: "If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."
Instead of .string, use .text / .get_text() to get your result:
def parse_row(row): return [x.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]
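To see the difference in isolation, here is a minimal sketch on made-up HTML. The markup is an assumption for illustration only, not the actual CBS Sports page structure, and it uses the built-in "html.parser" so it runs without lxml. The team cell has a single chain of children, so .string works; the pitcher cell has a tag with several children, so .string is None while .get_text() still returns the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT copied from the real page
html = (
    '<td><span class="TeamName"><a href="#">Houston</a></span></td>'
    '<td><span class="CellPlayerName--short">'
    '<span><a href="#">J. Verlander</a> <span class="note">(0-0)</span></span>'
    '</span></td>'
)
soup = BeautifulSoup(html, "html.parser")

team = soup.find(class_="TeamName")
pitcher = soup.find(class_="CellPlayerName--short")

# One child at every level, so .string resolves down to the text
team_string = team.string

# The inner <span> holds a link, a space, and another <span>: .string is None
pitcher_string = pitcher.string

# get_text() concatenates all nested strings; " " is the separator
pitcher_text = pitcher.get_text(" ", strip=True)

print(team_string, pitcher_string, pitcher_text)
```

This would also explain why the TeamName lookup in the question worked with .string while the pitcher lookup did not, assuming the real cells are nested in a similar way.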
And to be more specific, if you only want the value from the <a>:
def parse_row(row): return [x.a.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]
Output
0 | 1 |
---|---|
J. Verlander | C. Edwards |
M. Keller | N. Pivetta |
D. Rasmussen | B. Ober |
C. Schmidt | A. Nola |
C. Quantrill | B. Woodruff |
S. Howard | R. Sanmartin |
J. Steele | Z. Davies |
C. Rodon | M. Clevinger |
L. Gilbert | D. Lynch |
A. Senzatela | J. Suarez |
P. Lopez | C. Bassitt |
T. Gonsolin | S. Manaea |
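One caveat with the x.a.text variant above: if a matching cell contains no <a> at all, x.a is None and the attribute access raises AttributeError. Whether the schedule page ever has such cells (for example, a pitcher not yet announced) is an assumption here. A hedged sketch of a more defensive parse_row, tested on made-up HTML, could look like this:

```python
from bs4 import BeautifulSoup

def parse_row(row, css_class="CellPlayerName--short"):
    """Return the text of each matching cell, preferring its <a> if present."""
    names = []
    for cell in row.find_all(class_=css_class):
        link = cell.find("a")
        # Fall back to the whole cell when there is no link inside it
        target = link if link is not None else cell
        names.append(target.get_text(" ", strip=True))
    return names

# Hypothetical row: one linked pitcher, one placeholder without a link
sample = BeautifulSoup(
    '<table><tr>'
    '<td><span class="CellPlayerName--short">'
    '<span><a href="#">J. Verlander</a> (R)</span></span></td>'
    '<td><span class="CellPlayerName--short">TBD</span></td>'
    '</tr></table>',
    "html.parser",
)
parsed = parse_row(sample.find("tr"))
print(parsed)
```

The css_class parameter is added only for illustration, so the same helper can be reused for the TeamName or CellGame cells from the question.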