BeautifulSoup .text returns a blank string in VSCode, but works fine in Google Colab
I am trying to scrape this website: https://understat.com/league/EPL.
After parsing the page:
import json
from bs4 import BeautifulSoup
from urllib.request import urlopen
scrape_urlEPL="https://understat.com/league/EPL"
page_connect=urlopen(scrape_urlEPL)
page_html=BeautifulSoup(page_connect, "html.parser")
I then search the HTML for script tags:
page_html.findAll(name="script")
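This gives me a list of every script tag in the page. Say I want to extract the text from the element at index 3. Simply printing that element shows valid output: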
page_html.findAll(name="script")[3]
Output:
<script>
var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221389\x22,\x22player_name\x22\x3A\x22Jorginho\x22,\x22games\x22\x3A\x2228\x22,\x22time\x22\x3A\x222022\x22,\x22goals\x22\x3A\x227\x22,\x22xG\x22\x3A\x226.972690678201616\x22,\x22assists\x22\x3A\x221\x22,\x22xA\x22\x3A\x221.954869382083416\x22,\x22shots\x22\x3A\x2214\x22,\x22key_passes\x22\x3A\x2224\x22,\x22yellow_cards\x22\x3A\x222\x22,\x22red_cards\x22\x3A\x220\x22,\x22position\x22\x3A\x22M\x20S\x22,\x2....
Now, if I try to extract the text from it:
page_html.findAll(name="script")[3].text
This gives an empty string ''.
However, the same code works fine in Google Colab and returns:
'\n\tvar playersData\t= JSON.parse('\x5B\x7B\x22id\x22\x3A\x22647\x22,\x22player_name\x22\x3A\x22Harry\x20Kane\x22,\x22games\x22\x3A\x2235\x22,\x22time\x22\x3A\x223097\x22,\x22goals\x22\x3A\x2223\x22,\x22xG\x22\x3A\x2222.174858909100294\x22,\x22assists\x22\x3A\x2214\x22,\x22xA\x22\x3A\x227.577093588188291\x22,\x22shots\x22\x3A\x22138\x22,\x22key_passes\x22\x3A\x2249...'
as expected. I don't understand why this happens in VSCode.
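Note that a script tag holds only a string, not text. Beautiful Soup 4.9.0 and later generally do not count the contents of script, style, and template tags as text when html.parser or lxml is the parser, so .text can come back empty.
JSON.parse is a JavaScript function that parses a string, and its argument here is part of the script's string.
You must use .string instead of .text. The two environments probably have different Beautiful Soup versions installed, which would explain why Colab still returns the content; comparing bs4.__version__ in both would confirm it.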
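The difference can be checked offline. The following is a minimal sketch, assuming Beautiful Soup 4.9.0 or later; the inline HTML is a hypothetical stand-in for the real page, with a shortened payload.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page, with a shortened payload.
html = r"""
<html><body>
<p>Visible paragraph</p>
<script>var playersData = JSON.parse('\x5B\x5D');</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script")

# .string returns the tag's single child string, so the JavaScript
# source is available regardless of how "text" is defined.
js_source = script.string
print(repr(js_source))

# get_text() on the whole document collects the human-visible text;
# in Beautiful Soup 4.9.0+ that generally leaves out script contents.
page_text = soup.get_text()
print(repr(page_text))
```

Because .string does not depend on what the library counts as text, it is the more predictable accessor for script tags across versions.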
import httpx
import trio
from bs4 import BeautifulSoup

async def main():
    # timeout=None disables httpx's default request timeout
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get('https://understat.com/league/EPL')
        soup = BeautifulSoup(r.text, 'lxml')
        # .string returns the script's contents; .text may be empty here
        goal = soup.select('script')[3].string
        print(goal)

if __name__ == "__main__":
    trio.run(main)
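The value returned by .string is still JavaScript source wrapping a hex-escaped JSON payload. Below is a sketch of one way to turn it into Python objects with only the standard library. decode_json_parse is a hypothetical helper name, and the sample is a shortened, closed-off version of the output shown in the question, not real page data.

```python
import json
import re


def decode_json_parse(js_source, var_name):
    """Return the Python object assigned to var_name via JSON.parse('...')."""
    # Capture the single-quoted argument of JSON.parse for the given variable.
    pattern = re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, js_source, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON.parse assignment found for {var_name!r}")
    escaped = match.group(1)
    # Turn each \xNN escape back into the character it encodes.
    decoded = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )
    return json.loads(decoded)


# Shortened sample in the same shape as the script contents (assumption).
sample = (
    r"var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22647\x22,"
    r"\x22player_name\x22\x3A\x22Harry\x20Kane\x22\x7D\x5D');"
)
players = decode_json_parse(sample, "playersData")
print(players)
```

Replacing only the \xNN escapes and leaving everything else to json.loads keeps any JSON-level escapes in the payload intact.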