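Grab table from football recruiting website
I want to create a table exactly like the one shown on this page: https://247sports.com/college/penn-state/Season/2022-Football/Commits/
I am currently using Selenium and Beautiful Soup in a Google Colab notebook, because I get a Forbidden error when I run the `read_html` command. I have just started getting some output, but I only want the text, not the surrounding markup.
Here is my code so far...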
from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)
soup = BeautifulSoup(wd.page_source)
school=soup.find_all('span', class_='meta')
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')
status
...and this is my output...
[<p class="commit-date withDate"> Commit 7/25/2020 </p>,
<p class="commit-date withDate"> Commit 9/4/2020 </p>,
<p class="commit-date withDate"> Commit 1/1/2021 </p>,
<p class="commit-date withDate"> Commit 3/8/2021 </p>,
<p class="commit-date withDate"> Commit 10/29/2020 </p>,
<p class="commit-date withDate"> Commit 7/28/2020 </p>,
<p class="commit-date withDate"> Commit 9/8/2020 </p>,
<p class="commit-date withDate"> Commit 8/3/2020 </p>,
<p class="commit-date withDate"> Commit 5/1/2021 </p>]
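Any help with this is much appreciated.
There is no need to use Selenium. To get a response from the website you need to specify an HTTP User-Agent header; otherwise the site assumes you are a bot and blocks the request.
To create the DataFrame, see this example: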
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"

# Add the `user-agent` header; otherwise the request gets blocked
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")

data = []
# `[1:]` because the first list item is the table header
for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:
    school = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    status = tag.find_next("p", class_="commit-date withDate").text

    # One dictionary per recruit becomes one row of the DataFrame
    data.append(
        {
            "school": school,
            "name": name,
            "position": position,
            "height_weight": height_weight,
            "rating": rating,
            "nat_rank": nat_rank,
            "state_rank": state_rank,
            "pos_rank": pos_rank,
            "status": status,
        }
    )

df = pd.DataFrame(data)
print(df.to_string())
Output:
                                               school            name  position  height_weight  rating  nat_rank  state_rank  pos_rank             status
0                 Westerville South (Westerville, OH)  Kaden Saunders        WR     5-10 / 172  0.9509       116           5        16   Commit 7/25/2020
1                         IMG Academy (Bradenton, FL)    Drew Shelton        OT      6-5 / 290  0.9468       130          17        14    Commit 9/4/2020
2               Central Dauphin East (Harrisburg, PA)   Mehki Flowers        WR      6-1 / 190  0.9461       131           4        18    Commit 1/1/2021
3                                 Medina (Medina, OH)      Drew Allar       PRO      6-5 / 220  0.9435       138           6         8    Commit 3/8/2021
4                    Manheim Township (Lancaster, PA)    Anthony Ivey        WR      6-0 / 190  0.9249       190           6        26  Commit 10/29/2020
5                                King (Milwaukee, WI)     Jerry Cross        TE      6-6 / 218  0.9153       218           4         8   Commit 7/28/2020
6                        Northeast (Philadelphia, PA)      Ken Talley       WDE      6-3 / 230  0.9069       253           9        13    Commit 9/8/2020
7                             Central York (York, PA)    Beau Pribula      DUAL      6-2 / 215  0.8891       370          12         9    Commit 8/3/2020
8  The Williston Northampton School (Easthampton, MA)   Maleek McNeil        OT      6-8 / 340  0.8593       705           8        64    Commit 5/1/2021
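As for getting only the text rather than the surrounding tags: `find_all` returns `Tag` objects, so printing the list shows the full markup. A minimal sketch of pulling out just the inner text, assuming markup like the output posted in the question (the `html` string below is a hand-copied sample, not a live response from the site):

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output in the question (assumption: the
# live page uses the same structure for every commit entry).
html = """
<p class="commit-date withDate"> Commit 7/25/2020 </p>
<p class="commit-date withDate"> Commit 9/4/2020 </p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
statuses = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="commit-date withDate")
]
print(statuses)
```

The same list comprehension should work on the other `find_all` results in the question (`school`, `position`, and so on), since each of those is also a list of tags.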