HTML from a webpage does not display foreign language characters correctly
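Apologies if the title is misleading.
I am trying to find out the language of a given song by querying a lyrics website and then using CLD2 to check the language of the lyrics. However, for certain songs (such as the example given below), the foreign-language characters are not decoded correctly, which means CLD2 throws this error: input contains invalid UTF-8 around byte 2121 (of 32761)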
    import re

    import cld2
    import requests
    from bs4 import BeautifulSoup


    def checklang(lyrics):
        try:
            isReliable, textBytesFound, details = cld2.detect(lyrics)
            language = re.search("ENGLISH", str(details))
            if language is None:
                print("foreign lang")
            if len(re.findall("Unknown", str(details))) < 2:
                print("foreign lang")
            if language is not None:
                print("english")
        except Exception as err:
            # Surface detection failures such as the invalid UTF-8 error
            print(err)


    response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")
    soup = BeautifulSoup(response.text, "html.parser")

    # The lyrics sit in the 21st <div> on the page
    counter = 0
    for item in soup.select("div"):
        counter += 1
        if counter == 21:
            lyrics = item.get_text()
            checklang(lyrics)
            print("Lyrics found!")
            break
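It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes or other punctuation as well.
Can anyone shed some light on why this happens, or what I can do to work around it?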
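Requests is supposed to make an educated guess about the encoding of the response based on the HTTP headers. Unfortunately, in the given example, response.encoding shows ISO-8859-1 while response.content shows <meta charset="utf-8">, so response.text decodes the UTF-8 bytes with the wrong charset.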
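
To illustrate the mismatch described above, here is a minimal sketch using only the standard library. The helper names `misdecode` and `repair` are hypothetical and exist only for this illustration; the sample strings are made up and are not taken from the page. It shows what happens to a curly apostrophe and to Korean text when UTF-8 bytes are decoded as ISO-8859-1, which would explain why punctuation is affected too.

```python
def misdecode(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1 (the wrong guess)."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the wrong decoding by reversing both steps."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    for sample in ["don’t", "뚜두뚜두"]:
        garbled = misdecode(sample)
        # Each multi-byte UTF-8 character turns into several Latin-1 characters
        print(repr(sample), "->", repr(garbled), "->", repr(repair(garbled)))
```

Every character outside ASCII becomes two or three unrelated characters, some of them control characters, so the damage is not specific to any one script.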
Here is my solution, based on the Response Content paragraph in the requests documentation:
    import requests
    import re
    from bs4 import BeautifulSoup
    # import cld2
    import pycld2 as cld2


    def checklang(lyrics):
        # try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        # language = re.search("ENGLISH", str(details))
        # Print every (name, code, percent, score) tuple that CLD2 reports
        for detail in details:
            print(detail)


    response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')
    print(response.encoding)     # the encoding guessed from the HTTP headers
    response.encoding = 'utf-8'  ### key change ###
    soup = BeautifulSoup(response.text, 'html.parser')

    counter = 0
    for item in soup.select("div"):
        counter += 1
        if counter == 21:
            lyrics = item.get_text()
            checklang(lyrics)
            print("Lyrics found!")
            break
Output: \SO630066.py
ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
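
If hard-coding 'utf-8' feels too specific to this one site, a possible variation is to read the charset that the page declares in its own <meta> tag and fall back to the header-based guess otherwise. The following is only a sketch under that assumption: `declared_charset` and `decode_html` are hypothetical helper names (not part of requests or BeautifulSoup), and the regular expression is deliberately simplistic rather than a full HTML parser.

```python
import re
from typing import Optional

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag near the top of the page, if any."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw: bytes, header_encoding: Optional[str] = None) -> str:
    """Decode raw HTML, preferring the page's own declaration over the header guess."""
    encoding = declared_charset(raw) or header_encoding or "utf-8"
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8
        return raw.decode("utf-8", errors="replace")
```

With requests, this could be used as `html = decode_html(response.content, response.encoding)` and the resulting string passed to BeautifulSoup. Passing the raw bytes in `response.content` straight to BeautifulSoup may also work, since it performs its own encoding detection.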