Web Scrape table from nested divs
I'm trying to use BeautifulSoup and requests to scrape the table here into a dataframe. I was previously able to do it with this:
url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/money/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for br in soup.select("br"):
br.replace_with("\n")
base = pd.read_html(str(soup.select_one(".frodds-data-tbl")))[0]
Sadly, the site's layout completely changed overnight. I'm now getting this ValueError:
ValueError: No tables found
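A quick sanity check confirms the table element really is gone (assuming the page still renders its content server-side):

import requests
from bs4 import BeautifulSoup

# Sanity check: count <table> elements on the new layout. read_html
# raises "No tables found" because this now comes back as 0.
url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(len(soup.find_all("table")))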
That's because I was looking for a table before, and the data is now stored in a series of nested divs. I've made some progress with this code:
url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
divList = soup.findAll('div', attrs={"class" : "bc-odds-table bc-table"})
print(divList)
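If the raw print is too dense to read, prettify() shows the nesting, which makes the class names for rows and cells easier to spot:

# Pretty-print the first match (truncated) to see the div nesting
# and the class names that mark rows and cells.
if divList:
    print(divList[0].prettify()[:1000])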
I've also been able to inspect the page and find where I want to pull the data from, and I get something back by doing this:
data = [[x.text for x in y.findAll('div')] for y in divList]
df = pd.DataFrame(data)
print(df)
[1 rows x 5282 columns]
How can I loop through these divs and return the data in a pandas dataframe?
Calling div.text returns one long string of the data I want. I could split that string into pieces and paste each piece into the df where I want it, but that's a hack at best.
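For illustration, the splitting I have in mind is roughly this (the chunk size is a made-up placeholder, and it breaks as soon as the layout shifts):

# Rough sketch of the string-splitting hack: flatten all the text,
# slice it into fixed-size chunks, and hope they line up with columns.
# CHUNK = 5 is a made-up placeholder, not derived from the page.
CHUNK = 5
parts = [t.strip() for t in divList[0].get_text("\n").split("\n") if t.strip()]
rows = [parts[i:i + CHUNK] for i in range(0, len(parts), CHUNK)]
df = pd.DataFrame(rows)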
You basically need to iterate over all the divs by picking out the unique identifiers in their class names. Try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_data_from_div(div):
    # the left-hand column contains the names of the teams
    left_side_div = div.find('div', class_='d-flex flex-column odds-comparison-border position-relative')
    name_data = []
    for name in left_side_div.find_all('div', class_='team-stats-box'):
        name_data.append(name.text.strip())
    # to save all the extracted odds
    odds = []
    # now isolate the divs with the odds
    for row in div.find_all('div', class_='px-1'):
        # all the divs for each bookmaker
        odds_boxes = row.find_all('div', class_='odds-box')
        odds_box_data = []
        for odds_box in odds_boxes:
            # sometimes a box is just 'N/A', so guard against a missing div
            try:
                pt_2 = odds_box.find('div', class_='pt-2').text.strip()
            except AttributeError:
                pt_2 = ''
            try:
                pt_1 = odds_box.find('div', class_='pt-1').text.strip()
            except AttributeError:
                pt_1 = ''
            odds_box_data.append((pt_2, pt_1))
        # append to the odds list
        odds.append(odds_box_data)
    # put the names and the odds together
    extracted_data = dict(zip(name_data, odds))
    return extracted_data
url = "https://www.vegasinsider.com/college-basketball/odds/las-vegas/"
resp = requests.get(url)
soup = BeautifulSoup(str(resp.text), "html.parser")
# this will give you a list of each set of match odds
div_list = soup.find_all('div', class_='d-flex flex-row hide-scrollbar odds-slider-all syncscroll tracks')
data = {}
for div in div_list:
extracted = extract_data_from_div(div)
data = {**data, **extracted}
# finally convert to a dataframe
df = pd.DataFrame.from_dict(data, orient='index').reset_index()
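Each cell of df holds a (value, price) tuple, one per bookmaker. If flat columns are easier to work with, something like this expands them, assuming every row has the same number of bookmakers (the book0_value style names are placeholders, not from the site):

# Expand each (value, price) tuple into two flat columns per bookmaker.
# Names like 'book0_value' are placeholders, not taken from the site.
flat = df.set_index('index')
for col in list(flat.columns):
    pair = pd.DataFrame(flat[col].tolist(), index=flat.index,
                        columns=[f'book{col}_value', f'book{col}_price'])
    flat = flat.drop(columns=[col]).join(pair)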