Getting weird characters when scraping with Requests and Beautiful Soup
I have the following code, but it produces rows with weird characters, for example Luka DonÄić instead of Luka Dončić.
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
scrapTable = get('https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats')
scrapTable.encoding = 'utf-8-sig'
soup_a = BeautifulSoup(scrapTable.content, 'html.parser')
table = soup_a.find('table')
df_nba_PerGame = pd.read_html(str(table), encoding='utf8')[0]
Any idea what is going wrong?
The document contains UTF-8 characters encoded as HTML special characters (?). To decode the document, you can use:
import re
import html
import pandas as pd
from requests import get
from bs4 import BeautifulSoup

scrapTable = get(
    "https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats"
)

# Turn every numeric character reference (&#NNN;) back into the raw byte it stands for.
s = re.sub(
    rb"&#(\d+);",
    lambda g: b"%c" % int(g.group(1)),
    scrapTable.content,
)

# Resolve the remaining HTML entities, then reinterpret the bytes as UTF-8.
# latin1 maps bytes 0-255 one-to-one to code points, so the round trip keeps the raw bytes intact.
s = (
    html.unescape(s.decode("latin1"))
    .encode("latin1", "ignore")
    .decode("utf-8", "ignore")
)

soup = BeautifulSoup(s, "html.parser")
table = soup.find("table")
df_nba_PerGame = pd.read_html(str(table), encoding="utf8")[0]
print(df_nba_PerGame)
Prints:
...
176 129 Donte DiVincenzo SG 24 MIL 66 66 27.5 3.8 9.1 .420 2.0 5.2 .379 1.8 3.9 .475 .528 0.8 1.1 .718 1.2 4.5 5.8 3.1 1.1 0.2 1.4 1.7 10.4
177 130 Luka Dončić PG 21 DAL 66 66 34.3 9.8 20.5 .479 2.9 8.3 .350 6.9 12.2 .567 .550 5.2 7.1 .730 0.8 7.2 8.0 8.6 1.0 0.5 4.3 2.3 27.7
178 131 Luguentz Dort SG 21 OKC 52 52 29.7 4.8 12.3 .387 2.2 6.3 .343 2.6 6.0 .432 .475 2.3 3.2 .744 0.7 2.9 3.6 1.7 0.9 0.4 1.5 2.6 14.0
...
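To see the decoding step in isolation, here is a minimal offline sketch. It assumes, as the regex in the code above suggests, that the page writes each UTF-8 byte of a non-ASCII character as its own numeric reference. The helper name `decode_byte_entities` and the sample bytes are made up for illustration and are not taken from the real page.

```python
import html
import re


def decode_byte_entities(raw):
    """Turn references such as &#196;&#141; back into UTF-8 text (hypothetical helper)."""

    def to_byte(match):
        value = int(match.group(1))
        # Only values 128-255 can be single bytes of a multi-byte UTF-8 sequence.
        # Everything else is left alone for html.unescape to resolve later.
        if 128 <= value <= 255:
            return bytes([value])
        return match.group(0)

    as_bytes = re.sub(rb"&#(\d+);", to_byte, raw)
    # The byte string is now ordinary UTF-8: decode it, then resolve the
    # remaining entities (for example &amp;).
    return html.unescape(as_bytes.decode("utf-8", "ignore"))


# Assumed sample: the two UTF-8 bytes of "č" (0xC4 0x8D) and of "ć" (0xC4 0x87),
# each written as a separate numeric reference.
sample = b"<td>Luka Don&#196;&#141;i&#196;&#135;</td>"
print(decode_byte_entities(sample))
```

Calling `html.unescape` directly on the raw references would not be enough here: Python follows the HTML5 rule that maps references in the 128-159 range to Windows-1252 characters, so `&#135;` would become `‡` instead of the byte 0x87.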