Python web scraping for a specific class returns None

Question

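I'm very new to programming and have been teaching myself the basics of web scraping using baseball data. In the example below, I'm trying to scrape the team matchups, game times, and probable pitchers for baseball games from CBS Sports. I can get the team matchups and game times easily, but the probable pitchers come back as "None".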

from bs4 import BeautifulSoup as Soup
import requests
from pandas import DataFrame

# Fetch the schedule page; the parser name belongs to BeautifulSoup, not requests.get
matchups_response = requests.get('https://www.cbssports.com/mlb/schedule/')
matchups_soup = Soup(matchups_response.text, 'lxml')

# The first table on the page holds the matchups
matchups_tables = matchups_soup.find_all('table')
matchups_table = matchups_tables[0]
rows = matchups_table.find_all('tr')

# Inspect the first data row (rows[0] is the header row)
first_data_row = rows[1]
[str(x.string) for x in first_data_row.find_all(True, {'class': ['CellPlayerName--short']})]

# Collect the pitcher cells of one row
def parse_row(row):
    return [str(x.string) for x in row.find_all(True, {'class': ['CellPlayerName--short']})]

list_of_parsed_rows = [parse_row(row) for row in rows[1:31]]
dfPitchers = DataFrame(list_of_parsed_rows)
print(dfPitchers)

This is what it returns:

       0     1
0   None  None
1   None  None
2   None  None
3   None  None
4   None  None
5   None  None
6   None  None
7   None  None
8   None  None
9   None  None
10  None  None
11  None  None

When I use similar code with {'class':['TeamName']} or {'class':['CellGame']}, I get the correct output:

               0              1
0     Washington        Houston
1         Boston     Pittsburgh
2      Minnesota      Tampa Bay
3   Philadelphia   N.Y. Yankees
4      Milwaukee      Cleveland
5     Cincinnati          Texas
6        Arizona      Chi. Cubs
7      San Diego  San Francisco
8    Kansas City        Seattle
9    L.A. Angels       Colorado
10     N.Y. Mets          Miami
11       Oakland   L.A. Dodgers

0   WAS 0, HOU 0 - 1st
1   BOS 0, PIT 0 - 1st
2              1:05 pm
3              1:05 pm
4              4:05 pm
5              4:05 pm
6              4:05 pm
7              4:05 pm
8              4:10 pm
9              4:10 pm
10             6:40 pm
11             9:05 pm

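But with {'class':['CellPlayerName--short']} it always returns None. Any help would be greatly appreciated. Apologies in advance, I'm a beginner, but I have searched and searched and couldn't find a solution I could get to work. Thanks!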

Answer 1

From the docs: "If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."
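To see this behaviour in isolation, here is a minimal sketch on an inline snippet. The markup is hypothetical (the real CBS Sports cells may be structured differently); it only assumes that the pitcher element has more than one child while the team element has a single one. Note that `str(None)` is the text `'None'`, which is what ends up in the DataFrame.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the live page may differ.
html = """
<table><tr>
<td><span class="TeamName"><a href="/t/hou">Houston</a></span></td>
<td><span class="CellPlayerName--short"><a href="/p/1">J. Verlander</a> <span class="pos">RHP</span></span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
team = soup.find(class_="TeamName")
pitcher = soup.find(class_="CellPlayerName--short")

# One child all the way down, so .string resolves to the inner text
team_string = team.string

# Several children (link, whitespace, nested span), so .string is None
pitcher_string = pitcher.string

# get_text() joins every string inside the tag instead
pitcher_text = pitcher.get_text(" ", strip=True)

print(team_string, pitcher_string, pitcher_text)
```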

Instead of .string, use .text / .get_text() to get your result:

def parse_row(row): return [x.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]

Or select more specifically, if you only want the value from the <a> inside each cell:

def parse_row(row): return [x.a.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]

Output

0	1
J. Verlander	C. Edwards
M. Keller	N. Pivetta
D. Rasmussen	B. Ober
C. Schmidt	A. Nola
C. Quantrill	B. Woodruff
S. Howard	R. Sanmartin
J. Steele	Z. Davies
C. Rodon	M. Clevinger
L. Gilbert	D. Lynch
A. Senzatela	J. Suarez
P. Lopez	C. Bassitt
T. Gonsolin	S. Manaea
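
One caveat with `x.a.text`: if a cell has no `<a>` (for example, a pitcher that has not been announced yet, which is an assumption about the page rather than something observed here), `x.a` is None and the attribute access raises an error. A more defensive variant of `parse_row` might look like the sketch below; the inline rows and the "TBD" cell are hypothetical test data.

```python
from bs4 import BeautifulSoup

# Hypothetical rows for illustration; the live page's markup is an assumption.
html = """
<table>
<tr><th>Away</th><th>Home</th></tr>
<tr>
  <td><span class="CellPlayerName--short"><a href="#">J. Verlander</a> <span>RHP</span></span></td>
  <td><span class="CellPlayerName--short"><a href="#">C. Edwards</a> <span>RHP</span></span></td>
</tr>
<tr>
  <td><span class="CellPlayerName--short"><a href="#">M. Keller</a> <span>RHP</span></span></td>
  <td><span class="CellPlayerName--short">TBD</span></td>
</tr>
</table>
"""

def parse_row(row):
    """Return the pitcher names in one row, preferring the link text."""
    names = []
    for cell in row.find_all(class_="CellPlayerName--short"):
        link = cell.find("a")
        # Fall back to the whole cell when there is no link inside it
        source = link if link is not None else cell
        names.append(source.get_text(strip=True))
    return names

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Skip the header row, as in the question
parsed = [parse_row(row) for row in rows[1:]]
print(parsed)
```

The resulting list of lists can be passed to DataFrame exactly as in the question.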

