Weird characters when scraping with Requests and Beautiful Soup

Question

I have the following code, but it produces rows with strange characters, for example Luka DonÄiÄ‡ instead of Luka Dončić.

import pandas as pd
from requests import get
from bs4 import BeautifulSoup

scrapTable = get('https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats')
scrapTable.encoding = 'utf-8-sig'
soup_a = BeautifulSoup(scrapTable.content, 'html.parser')
table = soup_a.find('table')
df_nba_PerGame = pd.read_html(str(table), encoding='utf8')[0]

Any idea what is going wrong?

Answer 1

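The document contains UTF-8 characters that have been encoded as HTML special characters (?), apparently byte by byte as numeric references of the form &#NNN;. To decode the document, you can use: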

import re
import html
import pandas as pd
from requests import get
from bs4 import BeautifulSoup


scrapTable = get(
    "https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats"
)


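# Turn each numeric character reference (&#NNN;) back into the single
# raw byte it stands for.
s = re.sub(
    rb"&#(\d+);",
    lambda g: b"%c" % int(g.group(1)),
    scrapTable.content,
)

# Decode as latin1 (one byte per character) so html.unescape can resolve any
# remaining entities, then go back to bytes and decode them as UTF-8.
s = (
    html.unescape(s.decode("latin1"))
    .encode("latin1", "ignore")
    .decode("utf-8", "ignore")
)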

soup = BeautifulSoup(s, "html.parser")
table = soup.find("table")
df_nba_PerGame = pd.read_html(str(table), encoding="utf8")[0]
print(df_nba_PerGame)

Prints:

...

176  129          Donte DiVincenzo     SG   24  MIL  66  66  27.5   3.8   9.1   .420  2.0   5.2   .379   1.8   3.9   .475   .528  0.8   1.1   .718  1.2   4.5   5.8   3.1  1.1  0.2  1.4  1.7  10.4
177  130               Luka Dončić     PG   21  DAL  66  66  34.3   9.8  20.5   .479  2.9   8.3   .350   6.9  12.2   .567   .550  5.2   7.1   .730  0.8   7.2   8.0   8.6  1.0  0.5  4.3  2.3  27.7
178  131             Luguentz Dort     SG   21  OKC  52  52  29.7   4.8  12.3   .387  2.2   6.3   .343   2.6   6.0   .432   .475  2.3   3.2   .744  0.7   2.9   3.6   1.7  0.9  0.4  1.5  2.6  14.0

...
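
To see why the byte substitution matters, here is a minimal offline sketch. It assumes the page carries the name as `Luka Don&#196;&#141;i&#196;&#135;` (an illustrative sample inferred from the garbled output in the question, not copied from the live page), and `decode_byte_entities` is a hypothetical helper name. Unescaping the references directly treats each number as a code point, while mapping each reference to one byte first lets the bytes be read as UTF-8.

```python
import html
import re

# Assumed sample: the UTF-8 bytes of "č" (C4 8D) and "ć" (C4 87),
# each written as a numeric character reference.
raw = b"Luka Don&#196;&#141;i&#196;&#135;"


def decode_byte_entities(data: bytes) -> str:
    """Hypothetical helper: rebuild UTF-8 text from per-byte references."""

    def to_byte(match):
        value = int(match.group(1))
        # Only values 0-255 can be a single byte; leave anything else alone.
        return bytes([value]) if value < 256 else match.group(0)

    as_bytes = re.sub(rb"&#(\d+);", to_byte, data)
    # latin1 maps bytes 0-255 one-to-one onto code points, so the
    # decode/encode round trip preserves the raw bytes.
    text = html.unescape(as_bytes.decode("latin1"))
    return text.encode("latin1", "ignore").decode("utf-8", "ignore")


naive = html.unescape(raw.decode("ascii"))  # each reference becomes a code point
fixed = decode_byte_entities(raw)           # each reference becomes a byte

print(repr(naive))
print(fixed)
```

As in the answer's code, the `"ignore"` error handlers silently drop anything that cannot be represented, so genuine non-latin1 characters elsewhere in a page could be lost; that trade-off is inherited from the answer rather than recommended in general.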

python

beautifulsoup

utf-8

python-requests