Webscraping a table from VegasInsider

I'd like to scrape this table from Vegas Insider.

I'm a complete beginner at web scraping. I've tried several different approaches from Stack Overflow, but I haven't been able to pin it down.

This is as far as I've gotten:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.vegasinsider.com/college-basketball/odds/las-vegas/money/').text

soup = BeautifulSoup(source, "html.parser")
tbl = soup.find('table', class_='frodds-data-tbl')
for matchups in tbl.find_all('td', {'class': ['viCellBg1', 'oddsGameCell', 'cellTextNorm']}):
    if matchups.span is not None:
        gameDate = matchups.span.text
        print(gameDate)

    for b_ in matchups.find_all('b'):
        print(b_.a.text)

Eventually I'll send these results to a CSV and change the column headers to match the sportsbook names on the table. Any help is appreciated.

You can use this example to load the data into a DataFrame:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/money/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# clean up the cells: replace <br> tags with newlines so the stacked odds stay separated
for br in soup.select("br"):
    br.replace_with("\n")

df = pd.read_html(str(soup.select_one(".frodds-data-tbl")))[0]

# set column names:
# df.columns = ['col1', 'col2', ...]

df.to_csv("data.csv", index=False)
print(df)

Prints:

                                                              0           1          2          3          4          5          6          7          8          9
0            02/20 1:00 PM  819 Wright State  820 Detroit Mercy   -120 +100  -125 +105  -125 +105  -120 +100  -114 -105  -115 -105  -120 +100  -120 +100  -125 +105
1                    02/20 1:00 PM  821 Michigan  822 Wisconsin   +110 -130  +135 -155  +125 -150  +135 -155  +130 -156  +130 -150  +135 -160  +120 -145  +135 -155
2                     02/20 1:00 PM  823 Providence  824 Butler   -160 +130  -155 +135  -170 +140  -160 +140  -170 +140  -160 +140  -160 +135  -155 +127  -155 +135
3                        02/20 1:00 PM  825 Fairfield  826 Iona  +650 -1000  +525 -750  +550 -800  +525 -750  +520 -780  +500 -720  +530 -750  +600 -900  +500 -700

...

and saves data.csv (screenshot from LibreOffice not shown):
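Since the goal is to match the column headers to the sportsbook names, you can assign them after loading the DataFrame, right where the `df.columns` comment sits. A minimal sketch, using a small stand-in DataFrame and placeholder book names (the real names would come from the table's header row on the page):

```python
import pandas as pd

# stand-in for the scraped DataFrame; the real one comes from pd.read_html(...)
df = pd.DataFrame([
    ["02/20 1:00 PM  819 Wright State  820 Detroit Mercy", "-120 +100", "-125 +105"],
    ["02/20 1:00 PM  821 Michigan  822 Wisconsin", "+110 -130", "+135 -155"],
])

# hypothetical sportsbook names -- replace with the actual books shown on the site
df.columns = ["Matchup", "Book A", "Book B"]

df.to_csv("data.csv", index=False)
print(df.columns.tolist())
```

With the named columns in place, the CSV header row carries the book names directly, so there is no need to edit the file afterwards.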

If you don't care about the formatting, you can use pd.read_html directly:

import pandas as pd
url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/money/"
pd.read_html(url)[7]
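Note that the hard-coded index 7 depends on how many tables happen to precede the odds table in the page source, so it breaks if the layout changes. A more robust sketch selects the table by the frodds-data-tbl class the page used at the time, shown here on a tiny inline HTML stand-in for the fetched page:

```python
from bs4 import BeautifulSoup

# stand-in for requests.get(url).text
html = """
<table class="other"><tr><td>ignore me</td></tr></table>
<table class="frodds-data-tbl">
  <tr><td>02/20 1:00 PM</td><td>-120 +100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# pick the odds table by its class instead of a positional index
tbl = soup.select_one("table.frodds-data-tbl")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")]
print(rows)  # [['02/20 1:00 PM', '-120 +100']]
```

The same idea applies to the one-liner: passing `str(tbl)` to `pd.read_html` keeps the class-based selection while still getting a DataFrame back.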