Trying to scrape text from a site with BeautifulSoup4, but nothing happens at all
I want to scrape data from this website: https://playvalorant.com/en-us/news/game-updates/
from bs4 import BeautifulSoup
import requests

site_text = requests.get('https://playvalorant.com/en-us/news/game-updates/').text
soup = BeautifulSoup(site_text, 'lxml')
posts = soup.find_all('li', class_="ContentListing-module--contentListingItem--3GAoa")
for post in posts:
    post_title = post.find(
        'h3', class_="heading-05 bold ContentListingCard-module--title--1vIFy").text
    post_title = post_title.lower()
    if "patch notes" in post_title:
        patch_ver = post_title.replace('valorant patch notes ', '')
        print(f'Patch version: {patch_ver}')
        print("")
But when I run it, nothing happens.
What I want to do is check whether the h3 contains the text "Patch Notes" and, if it does, work out which version it is and go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-(patch-number)-(patch-number)/ (for example, if the text was "VALORANT Patch Notes 3213.07", then I want to go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-3213-07, and so on).
I've drifted off topic a bit. The point is: how can I get the text from this website and print it?
The data you see is loaded via JavaScript, so BeautifulSoup doesn't see it. You can use the requests module to fetch the same JSON the page loads:
import json
import requests

url = (
    "https://playvalorant.com/page-data/en-us/news/game-updates/page-data.json"
)

data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for a in data["result"]["pageContext"]["data"]["articles"]:
    if "Patch Notes" in a["title"]:
        patch_notes_url = "https://playvalorant.com" + a["url"]["url"]
        print("{:<30} {}".format(a["title"], patch_notes_url))
Prints:
VALORANT Patch Notes 4.04 https://playvalorant.com/news/game-updates/valorant-patch-notes-4-04/
VALORANT Patch Notes 4.03 https://playvalorant.com/news/game-updates/valorant-patch-notes-4-03/
VALORANT Patch Notes 4.02 https://playvalorant.com/news/game-updates/valorant-patch-notes-4-02/
VALORANT Patch Notes 4.01 https://playvalorant.com/news/game-updates/valorant-patch-notes-4-01/
VALORANT Patch Notes 4.0 https://playvalorant.com/news/game-updates/valorant-patch-notes-4-0/
VALORANT Patch Notes 3.12 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-12/
VALORANT Patch Notes 3.10 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-10/
VALORANT Patch Notes 3.09 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-09/
VALORANT Patch Notes 3.08 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-08/
VALORANT Patch Notes 3.07 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-07/
VALORANT Patch Notes 3.06 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-06/
VALORANT Patch Notes 3.05 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-05/
VALORANT Patch Notes 3.04 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-04/
VALORANT Patch Notes 3.03 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-03/
VALORANT Patch Notes 3.02 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-02/
VALORANT Patch Notes 3.01 https://playvalorant.com/news/game-updates/valorant-patch-notes-3-01/
...and so on.
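If you would rather build the link from the title, as the question describes, a minimal sketch could look like the following. It assumes titles always read "... Patch Notes <major>.<minor>" and that the URL follows the pattern quoted in the question; `BASE_URL`, `TITLE_RE`, and `patch_notes_url` are hypothetical names introduced only for this illustration.

```python
import re

# Assumption: the URL prefix quoted in the question.
BASE_URL = "https://playvalorant.com/en-us/news/game-updates/"

# Matches titles such as "VALORANT Patch Notes 3.07" (case-insensitive).
TITLE_RE = re.compile(r"patch notes\s+(\d+)\.(\d+)", re.IGNORECASE)


def patch_notes_url(title):
    """Return the guessed patch-notes URL for a title, or None if it does not match."""
    match = TITLE_RE.search(title)
    if match is None:
        return None
    major, minor = match.groups()
    # The dot in the version becomes a hyphen in the URL slug.
    return f"{BASE_URL}valorant-patch-notes-{major}-{minor}/"


print(patch_notes_url("VALORANT Patch Notes 3.07"))
# https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-3-07/
```

Since the JSON already carries each article's URL, reading it from the data is likely more robust than rebuilding it; the sketch only shows the title-to-slug step.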
Try using lxml with XPath to easily access the HTML nodes you need.
from lxml import html
import requests
url = "https://playvalorant.com/en-us/news/game-updates/"
response = requests.get(url, stream=True)
tree = html.fromstring(response.content)
posts = tree.xpath('//section[contains(@class, "section light")]/div/ul')
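Whichever selector you use, it only helps if the markup is present in the HTML the server returns. A rough way to check is to count the tags whose class contains the fragment you are looking for. The sketch below uses only the standard library; `ClassCounter` and `count_class` are hypothetical helpers, and the two HTML strings are made-up stand-ins for a raw response and a browser-rendered page, not real output from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        class_value = dict(attrs).get("class") or ""
        if self.fragment in class_value:
            self.count += 1


def count_class(html_text, fragment):
    """Return how many tags in html_text have a class containing fragment."""
    counter = ClassCounter(fragment)
    counter.feed(html_text)
    return counter.count


# Hypothetical samples: an empty shell versus markup filled in by JavaScript.
raw_html = '<html><body><div id="app"></div></body></html>'
rendered_html = (
    '<ul><li class="ContentListing-module--contentListingItem--3GAoa">'
    "<h3>VALORANT Patch Notes 4.04</h3></li></ul>"
)

print(count_class(raw_html, "contentListingItem"))       # 0
print(count_class(rendered_html, "contentListingItem"))  # 1
```

In practice you would pass `requests.get(url).text` as `html_text`; a count of 0 suggests the list is filled in by JavaScript after the page loads, in which case neither BeautifulSoup nor XPath will find it in the raw response.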