BeautifulSoup.text returns blank string in VSCode, but works fine in Google Colab

Question

I am trying to scrape this website: https://understat.com/league/EPL.

First I parse the page:

import json
from bs4 import BeautifulSoup
from urllib.request import urlopen
scrape_urlEPL="https://understat.com/league/EPL"
page_connect=urlopen(scrape_urlEPL)
page_html=BeautifulSoup(page_connect, "html.parser")

Then I search the HTML for "script" tags:

page_html.findAll(name="script")

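This gives me a list of every "script" tag on the page. Suppose I want to extract the text from the element at index 3. Simply printing the HTML of that element shows valid output: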

page_html.findAll(name="script")[3]

Output:

<script>
    var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221389\x22,\x22player_name\x22\x3A\x22Jorginho\x22,\x22games\x22\x3A\x2228\x22,\x22time\x22\x3A\x222022\x22,\x22goals\x22\x3A\x227\x22,\x22xG\x22\x3A\x226.972690678201616\x22,\x22assists\x22\x3A\x221\x22,\x22xA\x22\x3A\x221.954869382083416\x22,\x22shots\x22\x3A\x2214\x22,\x22key_passes\x22\x3A\x2224\x22,\x22yellow_cards\x22\x3A\x222\x22,\x22red_cards\x22\x3A\x220\x22,\x22position\x22\x3A\x22M\x20S\x22,\x2....

Now, if I try to extract the text from it:

page_html.findAll(name="script")[3].text

This gives an empty string ''.

However, the same code works fine in Google Colab and returns:

'\n\tvar playersData\t= JSON.parse('\x5B\x7B\x22id\x22\x3A\x22647\x22,\x22player_name\x22\x3A\x22Harry\x20Kane\x22,\x22games\x22\x3A\x2235\x22,\x22time\x22\x3A\x223097\x22,\x22goals\x22\x3A\x2223\x22,\x22xG\x22\x3A\x2222.174858909100294\x22,\x22assists\x22\x3A\x2214\x22,\x22xA\x22\x3A\x227.577093588188291\x22,\x22shots\x22\x3A\x22138\x22,\x22key_passes\x22\x3A\x2249...'

as expected. I don't understand why this happens in VSCode.

Answer 1

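Note that a script tag holds only a string, not text. Beautiful Soup treats the contents of a script tag as code rather than page text, so depending on the installed bs4 version, .text may skip it and return an empty string. If VSCode and Colab are running different bs4 versions, that would account for the different results.

JSON.parse is a JavaScript function that parses a string.

You have to use .string instead of .text: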

import httpx
import trio
from bs4 import BeautifulSoup


async def main():
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get('https://understat.com/league/EPL')
        soup = BeautifulSoup(r.text, 'lxml')
        goal = soup.select('script')[3].string
        print(goal)


if __name__ == "__main__":
    trio.run(main)
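
To see which behavior a given environment has without hitting the network, a minimal sketch like the one below may help. The HTML snippet is a made-up stand-in for the real page, and whether `.text` comes back empty depends on the installed bs4 version, so treat the second print as something to observe rather than a guaranteed result.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real script body is far longer.
html = r"<html><body><script>var playersData = JSON.parse('\x5B\x5D');</script></body></html>"

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script")

# .string returns the tag's single child string, whatever its type.
print(repr(script.string))

# .text (get_text()) may print '' here on some bs4 versions.
print(repr(script.text))

# Comparing this value between VSCode and Colab shows whether the
# two environments are actually running the same library version.
print(bs4.__version__)
```

Running this in both environments and comparing the three printed lines should show whether a version difference is behind the empty string.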

Ref: Bs4 difference between string and text
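
If the goal is the data inside the JSON.parse('...') call, the string obtained from .string still has to be decoded. The following is a sketch under assumptions, not a definitive implementation: `script_text` is a shortened, hypothetical stand-in for the real script contents, and the code assumes the payload is a single-quoted argument that uses only `\xNN` escapes, as in the output shown in the question.

```python
import json
import re

# Hypothetical, shortened stand-in for the value of script.string.
script_text = (
    r"var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22647\x22,"
    r"\x22player_name\x22\x3A\x22Harry\x20Kane\x22\x7D\x5D');"
)

# Grab the single-quoted argument passed to JSON.parse(...).
match = re.search(r"JSON\.parse\('(.*?)'\)", script_text, re.DOTALL)
if match is None:
    raise ValueError("no JSON.parse('...') call found")

# Turn each \xNN escape into the character it stands for.
decoded = re.sub(
    r"\\x([0-9A-Fa-f]{2})",
    lambda m: chr(int(m.group(1), 16)),
    match.group(1),
)

# The decoded text is ordinary JSON, so json.loads can parse it.
players = json.loads(decoded)
print(players[0]["player_name"])
```

With the real page, `script_text` would be replaced by the .string value of the relevant script tag.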

beautifulsoup

web-scraping

visual-studio-code

google-colaboratory